Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
1'1"(.'nl icc
Hall
.. ~
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Bickel, Peter J.
Mathematical statistics: basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-850363-X (v. 1)
1. Mathematical statistics. I. Doksum, Kjell A. II. Title.

QA276.B47 2001
519.5 dc21
00-031377
Acquisition Editor: Kathleen Boothby Sestak
Editor in Chief: Sally Yagan
Assistant Vice President of Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Production Editor: Bob Walters
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Marketing Assistant: Vince Jansen
Director of Marketing: John Tweeddale
Editorial Assistant: Joanne Wendelken
Art Director: Jayne Conte
Cover Design: Jayne Conte
©2001, 1977 by Prentice-Hall, Inc.
Upper Saddle River, New Jersey 07458

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
." i
~I
"I
~ !
,'
,
I,
,.~~
_.

..
CONTENTS

PREFACE TO THE SECOND EDITION: VOLUME I
PREFACE TO THE FIRST EDITION

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
  1.1 Data, Models, Parameters, and Statistics
    1.1.1 Data and Models
    1.1.2 Parametrizations and Parameters
    1.1.3 Statistics as Functions on the Sample Space
    1.1.4 Examples, Regression Models
  1.2 Bayesian Models
  1.3 The Decision Theoretic Framework
    1.3.1 Components of the Decision Theory Framework
    1.3.2 Comparison of Decision Procedures
    1.3.3 Bayes and Minimax Criteria
  1.4 Prediction
  1.5 Sufficiency
  1.6 Exponential Families
    1.6.1 The One-Parameter Case
    1.6.2 The Multiparameter Case
    1.6.3 Building Exponential Families
    1.6.4 Properties of Exponential Families
    1.6.5 Conjugate Families of Prior Distributions
  1.7 Problems and Complements
  1.8 Notes
  1.9 References

2 METHODS OF ESTIMATION
  2.1 Basic Heuristics of Estimation
    2.1.1 Minimum Contrast Estimates; Estimating Equations
    2.1.2 The Plug-In and Extension Principles
  2.2 Minimum Contrast Estimates and Estimating Equations
    2.2.1 Least Squares and Weighted Least Squares
    2.2.2 Maximum Likelihood
  2.3 Maximum Likelihood in Multiparameter Exponential Families
  *2.4 Algorithmic Issues
    2.4.1 The Method of Bisection
    2.4.2 Coordinate Ascent
    2.4.3 The Newton-Raphson Algorithm
    2.4.4 The EM (Expectation/Maximization) Algorithm
  2.5 Problems and Complements
  2.6 Notes
  2.7 References

3 MEASURES OF PERFORMANCE
  3.1 Introduction
  3.2 Bayes Procedures
  3.3 Minimax Procedures
  *3.4 Unbiased Estimation and Risk Inequalities
    3.4.1 Unbiased Estimation, Survey Sampling
    3.4.2 The Information Inequality
  *3.5 Nondecision Theoretic Criteria
    3.5.1 Computation
    3.5.2 Interpretability
    3.5.3 Robustness
  3.6 Problems and Complements
  3.7 Notes
  3.8 References

4 TESTING AND CONFIDENCE REGIONS
  4.1 Introduction
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models
  4.4 Confidence Bounds, Intervals, and Regions
  4.5 The Duality Between Confidence Regions and Tests
  *4.6 Uniformly Most Accurate Confidence Bounds
  *4.7 Frequentist and Bayesian Formulations
  4.8 Prediction Intervals
  4.9 Likelihood Ratio Procedures
    4.9.1 Introduction
    4.9.2 Tests for the Mean of a Normal Distribution, Matched Pair Experiments
    4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations
    4.9.4 The Two-Sample Problem with Unequal Variances
    4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions
  4.10 Problems and Complements
  4.11 Notes
  4.12 References

5 ASYMPTOTIC APPROXIMATIONS
  5.1 Introduction: The Meaning and Uses of Asymptotics
  5.2 Consistency
    5.2.1 Plug-In Estimates and MLEs in Exponential Family Models
    5.2.2 Consistency of Minimum Contrast Estimates
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications
    5.3.1 The Delta Method for Moments
    5.3.2 The Delta Method for In Law Approximations
    5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
  5.4 Asymptotic Theory in One Dimension
    5.4.1 Estimation: The Multinomial Case
    *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates
    *5.4.3 Asymptotic Normality and Efficiency of the MLE
    *5.4.4 Testing
    *5.4.5 Confidence Bounds
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution
  5.6 Problems and Complements
  5.7 Notes
  5.8 References

6 INFERENCE IN THE MULTIPARAMETER CASE
  6.1 Inference for Gaussian Linear Models
    6.1.1 The Classical Gaussian Linear Model
    6.1.2 Estimation
    6.1.3 Tests and Confidence Intervals
  *6.2 Asymptotic Estimation Theory in p Dimensions
    6.2.1 Estimating Equations
    6.2.2 Asymptotic Normality and Efficiency of the MLE
    6.2.3 The Posterior Distribution in the Multiparameter Case
  *6.3 Large Sample Tests and Confidence Regions
    6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
    6.3.2 Wald's and Rao's Large Sample Tests
  *6.4 Large Sample Methods for Discrete Data
    6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's χ² Test
    6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables
    6.4.3 Logistic Regression for Binary Responses
  *6.5 Generalized Linear Models
  *6.6 Robustness Properties and Semiparametric Models
  6.7 Problems and Complements
  6.8 Notes
  6.9 References

A A REVIEW OF BASIC PROBABILITY THEORY
  A.1 The Basic Model
  A.2 Elementary Properties of Probability Models
  A.3 Discrete Probability Models
  A.4 Conditional Probability and Independence
  A.5 Compound Experiments
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement
  A.7 Probabilities on Euclidean Space
  A.8 Random Variables and Vectors: Transformations
  A.9 Independence of Random Variables and Vectors
  A.10 The Expectation of a Random Variable
  A.11 Moments
  A.12 Moment and Cumulant Generating Functions
  A.13 Some Classical Discrete and Continuous Distributions
  A.14 Modes of Convergence of Random Variables and Limit Theorems
  A.15 Further Limit Theorems and Inequalities
  A.16 Poisson Process
  A.17 Notes
  A.18 References

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS
  B.1 Conditioning by a Random Variable or Vector
    B.1.1 The Discrete Case
    B.1.2 Conditional Expectation for Discrete Variables
    B.1.3 Properties of Conditional Expected Values
    B.1.4 Continuous Variables
    B.1.5 Comments on the General Case
  B.2 Distribution Theory for Transformations of Random Vectors
    B.2.1 The Basic Framework
    B.2.2 The Gamma and Beta Distributions
  B.3 Distribution Theory for Samples from a Normal Population
    B.3.1 The χ², F, and t Distributions
    B.3.2 Orthogonal Transformations
  B.4 The Bivariate Normal Distribution
  B.5 Moments of Random Vectors and Matrices
    B.5.1 Basic Properties of Expectations
    B.5.2 Properties of Variance
  B.6 The Multivariate Normal Distribution
    B.6.1 Definition and Density
    B.6.2 Basic Properties. Conditional Distributions
  B.7 Convergence for Random Vectors: O_p and o_p Notation
  B.8 Multivariate Calculus
  B.9 Convexity and Inequalities
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory
    B.10.1 Symmetric Matrices
    B.10.2 Order on Symmetric Matrices
    B.10.3 Elementary Hilbert Space Theory
  B.11 Problems and Complements
  B.12 Notes
  B.13 References

C TABLES
  Table I The Standard Normal Distribution
  Table I' Auxiliary Table of the Standard Normal Distribution
  Table II t Distribution Critical Values
  Table III χ² Distribution Critical Values
  Table IV F Distribution Critical Values

INDEX
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared statistics has changed enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.
(2) The generation of enormous amounts of data: terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once been unthinkable.

The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data. As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with an "infinite" number of parameters.

(2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.

(3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo.
(4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious.

(5) The study of the interplay between numerical and statistical considerations. Despite advances in computing speed, some methods run quickly in real time. Others do not, and some, though theoretically attractive, cannot be implemented in a human lifetime.

(6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories.

There have, of course, been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. These will not be dealt with in our work.

As a consequence our second edition, reflecting what we now teach our graduate students, is much changed from the first. Our one long book has grown to two volumes, each to be only a little shorter than the first edition.

Volume I, which we present in 2000, covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing conclusions. In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. However, our focus and order of presentation have changed.

Volume I covers the material of Chapters 1-6 and Chapter 10 of the first edition with pieces of Chapters 7-10 and includes Appendix A on basic probability theory. However, Chapter 1 has now become part of a larger Appendix B, which includes more advanced topics from probability theory such as the multivariate Gaussian distribution, weak convergence in Euclidean spaces, and probability inequalities, as well as more advanced topics in matrix theory and analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on R^d as well as an elementary introduction to Hilbert space theory. Appendix B is as self-contained as possible, with proofs of most statements, problems, and references to the literature for proofs of the deepest results such as the spectral theorem. The reason for these additions is the changes in subject matter necessitated by the current areas of importance in the field.

As in the first edition, we do not require measure theory but assume from the start that our models are what we call "regular." That is, we assume either a discrete probability whose support does not depend on the parameter set, or the absolutely continuous case with a density. Hilbert space theory is not needed, but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis.

Specifically, instead of beginning with parametrized models we include from the start non- and semiparametric models, then go to parameters and parametric models stressing the role of identifiability. From the beginning we stress function-valued parameters, such as the density, and function-valued statistics, such as the empirical distribution function. We also, from the start, include examples that are important in applications, such as regression experiments. There is more material on Bayesian models and analysis. Save for these changes of emphasis, the other major new elements of Chapter 1, which parallels Chapter 2 of the first edition, are an extended discussion of prediction and an expanded introduction to k-parameter exponential families. These objects, which are the building blocks of most modern models, require concepts involving moments of random vectors and convexity that are given in Appendix B.

Chapter 2 of this edition parallels Chapter 3 of the first and deals with estimation. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter exponential families. Other novel features of this chapter include a detailed analysis, including proofs of convergence, of a standard but slow algorithm for computing MLEs in multiparameter exponential families and an introduction to the EM algorithm, one of the main ingredients of most modern algorithms for inference.

Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions, including some optimality theory for estimation as well and elementary robustness considerations. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage.

Chapter 5 of the new edition is devoted to asymptotic approximations. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Also new is a section relating Bayesian and frequentist inference via the Bernstein-von Mises theorem.

Chapter 6 is devoted to inference in multivariate (multiparameter) models. Included are asymptotic normality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. Generalized linear models are introduced as examples. Robustness from an asymptotic theory point of view appears also. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the more advanced topics of Volume II.

As in the first edition, problems play a critical role by elucidating and often substantially expanding the text. Almost all the previous ones have been kept, with an approximately equal number of new ones added to correspond to our new topics and point of view. The conventions established on footnotes and notation in the first edition remain, if somewhat augmented.

Chapters 1-4 develop the basic principles and examples of statistics. Nevertheless, we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. Although we believe the material of Chapters 5 and 6 has now become fundamental, there is clearly much that could be omitted at a first reading that we also star. There are clear dependencies between starred sections that follow:

5.4.2 => 5.4.3 => 6.2 => 6.3 => 6.4 => 6.6

Volume II is expected to be forthcoming in 2003. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance. Examples of application such as the Cox model in survival analysis, other transformation models, and the classical nonparametric k sample and independence problems will be included. Semiparametric estimation and testing will be considered more generally, greatly extending the material in Chapter 8 of the first edition. The topic presently in Chapter 8, density estimation, will be studied in the context of nonparametric function estimation. We also expect to discuss classification and model selection using the elementary theory of empirical processes. The basic asymptotic tools that will be developed or presented, in part in the text and in part in appendices, are weak convergence for random processes, elementary empirical process theory, and the functional delta method. A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo. With the tools and concepts developed in this second volume students will be ready for advanced research in modern statistics.

For the first volume of the second edition we would like to add thanks to new colleagues, particularly Jianqing Fan, Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill, and the many students who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for proofreading that found not only typos but serious errors, and Prentice Hall for generous production support. Last and most important we would like to thank our wives, Nancy Kramer Bickel and Joan H. Fujimura, and our families for support, encouragement, and active participation in an enterprise that at times seemed endless, appeared gratifyingly ended in 1976, but has, with the field, taken on a new life.

Peter J. Bickel
bickel@stat.berkeley.edu

Kjell Doksum
doksum@stat.berkeley.edu
PREFACE TO THE FIRST EDITION

This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). Because the book is an introduction to statistics, we need probability theory and expect readers to have had a course at the level of, for instance, Hoel, Port, and Stone's Introduction to Probability Theory. Our appendix does give all the probability that is needed. However, the treatment is abridged with few proofs and no examples or problems.

We feel such an introduction should at least do the following:

(1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice.

(2) Give careful proofs of the major "elementary" results such as the Neyman-Pearson lemma, the Lehmann-Scheffe theorem, the information inequality, and the Gauss-Markoff theorem.

(3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates, and the structure of both Bayes and admissible solutions in decision theory. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated.

(4) Show how the ideas and results apply in a variety of important subfields such as Gaussian linear models, multinomial models, and nonparametric models.

Although there are several good books available for this purpose, we feel that none has quite the mix of coverage and depth desirable at this level. The work of Rao, Linear Statistical Inference and Its Applications, 2nd ed., covers most of the material we do and much more but at a more abstract level employing measure theory. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig, Introduction to Mathematical Statistics, 3rd ed. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior.

Our book contains more material than can be covered in two quarters. In the two-quarter courses for graduate students in mathematics, statistics, the physical sciences, and engineering that we have taught we cover the core Chapters 2 to 7, which go from modeling through estimation and testing to linear models. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Finally, we select topics from Chapter 8 on discrete data and Chapter 9 on nonparametric models.

Chapter 1 covers probability theory rather than statistics. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. It may be integrated with the material of Chapters 2-7 as the course proceeds rather than being given at the start, or it may be included at the end of an introductory probability course that precedes the statistics course.

A special feature of the book is its many problems. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text.

Conventions: (i) In order to minimize the number of footnotes we have added a section of comments at the end of each chapter preceding the problem section. These comments are ordered by the section to which they pertain. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers, 1 for the first, 2 for the second, and so on. The comments contain digressions, reservations, and additional references. They need to be read only as the reader's curiosity is piqued.

(ii) Various notational conventions and abbreviations are used in the text. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text.

(iii) Basic notation for probabilistic objects such as random variables and vectors, densities, distribution functions, and moments is established in the appendix.

We would like to acknowledge our indebtedness to colleagues, students, and friends who helped us during the various stages (notes, preliminary edition, final draft) through which this book passed. E. L. Lehmann's wise advice has played a decisive role at many points. R. Pyke's careful reading of a next-to-final version caught a number of infelicities of style and content. Many careless mistakes and typographical errors in an earlier version were caught by D. Minassian who sent us an exhaustive and helpful listing. W. Carmichael, in proofreading the final version, caught more mistakes than both authors together. A serious error in Problem 2.2.5 was discovered by F. Scholz. Among many others who helped in the same way we would like to mention C. Chen, S. J. Chou, G. Drew, C. Gray, U. Gupta, P. X. Quang, and A. Samulon. Without Winston Chow's lovely plots Section 9.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day.

We would also like to thank the colleagues and friends who inspired and helped us to enter the field of statistics. The foundation of our statistical knowledge was obtained in the lucid, enthusiastic, and stimulating lectures of Joe Hodges and Chuck Bell, respectively. Later we were both very much influenced by Erich Lehmann whose ideas are strongly reflected in this book.

Peter J. Bickel
Kjell Doksum
Berkeley 1976
Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Second Edition
Chapter 1

STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA

1.1 DATA, MODELS, PARAMETERS AND STATISTICS

1.1.1 Data and Models

Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor. Data can consist of:

(1) Vectors of scalars, measurements, and/or characters, for example, a single time series of measurements.

(2) Matrices of scalars and/or characters, for example, digitized pictures or more routinely measurements of covariates and response on a set of n individuals (see Example 1.1.4 and Sections 2.2.1 and 6.1).

(3) Arrays of scalars and/or characters as in contingency tables (see Chapter 6) or more generally multifactor multiresponse data on a number of individuals.

(4) All of the above and more, in particular, functions as in signal processing, trees as in evolutionary phylogenies, and so on.

The goals of science and society, which statisticians share, are to draw useful information from data using everything that we know. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically.

A detailed discussion of the appropriateness of the models we shall discuss in particular situations is beyond the scope of this book, but we will introduce general model diagnostic tools in Volume 2, Chapter 1. Moreover, we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. A generic source of trouble often called gross errors is discussed in greater detail in the section on robustness (Section 3.5.3). In any case all our models are generic and, as usual, "The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. Subject matter specialists usually have to be principal guides in model formulation. A priori, in the words of George Box (1979), "Models of course, are never true but fortunately it is only necessary that they be useful."

In this book we will study how, starting with tentative models:

(1) We can conceptualize the data structure and our goals more precisely. We begin this in the simple examples that follow and continue in Sections 1.2-1.5 and throughout the book.

(2) We can derive methods of extracting useful information from data and, in particular, give methods that assess the generalizability of experimental results. For instance, if we observe an effect in our data, to what extent can we expect the same effect more generally? Estimation, testing, confidence regions, and more general procedures will be discussed in Chapters 2-4.

(3) We can assess the effectiveness of the methods we propose. We begin this discussion with decision theory in Section 1.3 and continue with optimality principles in Chapters 3 and 4.

(4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. Goodness of fit tests, robustness, and diagnostics are discussed in Volume 2.

(5) We can be guided to alternative or more general descriptions that might fit better. Hierarchies of models are discussed throughout.

Here are some examples:

(a) We are faced with a population of N elements, for instance, a shipment of manufactured items. An unknown number Nθ of these elements are defective. It is too expensive to examine all of the items. So to get information about θ, a sample of n is drawn without replacement and inspected. The data gathered are the number of defectives found in the sample.

(b) We want to study how a physical or economic feature, for example, height or income, is distributed in a large population. An exhaustive census is impossible so the study is based on measurements and a sample of n individuals drawn at random from the population. The population is so large that, for modeling purposes, we approximate the actual process of sampling without replacement by sampling with replacement.

(c) An experimenter makes n independent determinations of the value of a physical constant μ. His or her measurements are subject to random fluctuations (error) and the data can be thought of as μ plus some random errors.

(d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee, reducing pollution, treating a disease, producing energy, learning a maze, and so on. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. For instance, we can assign two drugs, A to m, and B to n, randomly selected patients and then measure temperature and blood pressure, have the patients rated qualitatively for improvement by physicians, and so on. In this manner, we obtain one or more quantitative or qualitative measures of efficacy from each experiment. Random variability here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.
"" €n are independent.1. completely specifies the joint distribution of Xl... . which we refer to as: Example 1. On this space we can define a random variable X given by X(k) ~ k. ..d.. n) distribution. €I. The main difference that our model exhibits from the usual probability model is that NO is unknown and. . That is. k ~ 0.. anyone of which could have generated the data actually observed. Sample from a Population. n corresponding to the number of defective items found. . X n ? Of course. F.. Fonnally. .1.. so that sampling with replacement replaces sampling without. The sample space consists of the numbers 0. and also write that Xl.1.. First consider situation (a).. X n as a random sample from F.. X has an hypergeometric.2. The mathematical model suggested by the description is well defined. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not. we cannot specify the probability structure completely but rather only give a family {H(N8. Given the description in (c).. we postulate (l) The value of the error committed on one determination does not affect the value of the error at other times. .8).0) < k < min(N8. then by (AI3. in principle." The model is fully described by the set :F of distributions that we specify.Section 1. 1 €n) T is the vector of random errors.. A random experiment has been perfonned. which together with J1..1.. Here we can write the n determinations of p. . Sampling Inspection.. If N8 is the number of defective items in the population sampled. where . It can also be thought of as a limiting case in which N = 00. OneSample Models. H(N8. N. 1 <i < n (1. 1. which are modeled as realizations of X I.2) where € = (€ I .. as X with X .d..6) (J.i.l. Models./ ' stands for "is distributed as.. We shall use these examples to arrive at out formulation of statistical models and to indicate some of the difficulties of constructing such models. N.N(I .xn . . .Xn independent. as Xi = I.. can take on any value between and N. n)} of probability distributions for X. if the measurements are scalar.l) if max(n ..) random variables with common unknown distribution function F. . although the sample space is well defined.1 Data. Thus. What should we assume about the distribution of €. So. 0 ° Example 1. .. n).t + (i. We often refer to such X I. Parameters.. . .". X n are ij. and Statistics 3 here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs. we observe XI. .. identically distributed (i.. that depends on how the experiment is carried out. n. I. . . The same model also arises naturally in situation (c)..
1 X Tn . G) pairs. G) : J1 E R. By convention.2) distribution and G is the N(J. 0 ! I = . Example 1. YI. then F(x) = G(x  1") (1.1. or alternatively all distributions with expectation O. . then G(·) F(' . . (3) The control responses are normally distributed.. 0 This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations. Thus. E}. or by {(I"..4 Statistical Models. Then if treatment B had been administered to the same subject instead of treatment A. G E Q} where 9 is the set of all allowable error distributions that we postulate. Natural initial assumptions here are: (1) The x's and y's are realizations of Xl. Often the final simplification is made.. if drug A is a standard or placebo. that is. Then if F is the N(I".1.. where 0'2 is unknown. Commonly considered 9's are all distributions with center of symmetry 0. .Yn. There are absolute bounds on most quantitiesIOO ft high men are impossible. be the responses of m subjects having a given disease given drug A and n other similarly diseased subjects given drug B. 1 X Tn a sample from F. The classical def~ult model is: (4) The common distribution of the errors is N(o. . and Performance Criteria Chapter 1 (2) The distribution of the error at one determination is the same as that at another. •. 0'2).. . tbe set of F's we postulate. £n are identically distributed.t + ~. Heights are always nonnegative. This implies that if F is the distribution of a control. '. heights of individuals or log incomes. We call the y's treatment observations.. and Y1. will have none of this. We let the y's denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo. patients improve even if they only think they are being treated. . we refer to the x's as control observations.. cr 2 ) distribution. response y = X + ~ would be obtained where ~ does not depend on x. respectively. That is. for instance. 1 Yn a sample from G. (72) population or equivalently F = {tP (':Ji:) : Jl E R. Let Xl. we have specified the Gaussian two sample model with equal variances. It is important to remember that these are assumptions at best only approximately valid. To specify this set more closely the critical constant treatment effect assumption is often made. The Gaussian distribution. . if we let G be the distribution function of f 1 and F that of Xl. .t. TwoSample Models.. A placebo is a substance such as water tpat is expected to have no effect on the disease and is used to correct for the welldocumented placebo effect. We call this the shift model with parameter ~. the Xi are a sample from a N(J. (3) The distribution of f is independent of J1. All actual measurements are discrete rather than continuous.Xn are a random sample and. so that the model is specified by the set of possible (F. whatever be J...~). Equivalently Xl.3) and the model is alternatively specified by F... (2) Suppose that if treatment A had been administered to a subject response X would have been obtained.." Now consider situation (d). . cr > O} where tP is the standard normal distribution. Goals.3.t and cr.
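To make the hypergeometric family of Example 1.1.1 concrete, here is a minimal Python sketch (not from the text; the values of N, n, and Nθ are illustrative) that tabulates a few members of the family {H(Nθ, N, n)} and simulates one inspection:

```python
import numpy as np
from scipy.stats import hypergeom

N, n = 100, 10                 # population size and sample size (illustrative)
k = np.arange(n + 1)           # possible numbers of defectives in the sample

# One member of the family {H(N*theta, N, n)} for each candidate value of N*theta.
for n_defective in (5, 20, 50):
    # scipy's argument order is (k, population size, defectives in population, draws)
    pmf = hypergeom.pmf(k, N, n_defective, n)
    print(n_defective, np.round(pmf[:5], 3))

# Simulate one inspection when the shipment actually has 20 defectives.
rng = np.random.default_rng(0)
x = rng.hypergeometric(ngood=20, nbad=80, nsample=n)
print("observed defectives:", x)
```

The point of the family view is visible in the output: each candidate value of Nθ gives a different distribution for the same observable X.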
How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. The advantage of piling on assumptions such as (1)-(4) of Example 1.1.2 is that, if they are true, we know how to combine our measurements to estimate μ in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4.4.1). The danger is that, if they are false, our analyses, though correct for the model written down, may be quite irrelevant to the experiment that was actually performed. As our examples suggest, there is tremendous variation in the degree of knowledge and control we have concerning experiments. In some applications we often have a tested theoretical model and the danger is small. The number of defectives in the first example clearly has a hypergeometric distribution; similarly, the number of α particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed. In others, we can be reasonably secure about some aspects, but not others. For instance, we can ensure independence and identical distribution of the observations by using different, equally trained observers with no knowledge of each other's findings. However, we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated. This will be done in Sections 3.5.3 and 6.6.

Experiments in medicine and the social sciences often pose particular difficulties. For instance, in comparative experiments such as those of Example 1.1.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease. In this situation (and generally) it is important to randomize. That is, we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. All the severely ill patients might, for instance, have been assigned to B. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise. Fortunately, the methods needed for its analysis are much the same as those appropriate for the situation of Example 1.1.3 when F, G are assumed arbitrary. Statistical methods for models of this kind are given in Volume 2.
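The randomization device just described is simple to implement. The following sketch (group sizes and patient labels are hypothetical) picks the m patients who receive drug A as a sample without replacement from the m + n available patients:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 12, 12                       # group sizes for drugs A and B (illustrative)
patients = np.arange(m + n)         # labels for the m + n available patients

# Randomization: the m patients given drug A are a sample without
# replacement from the m + n available patients; the rest get drug B.
perm = rng.permutation(patients)
group_A, group_B = np.sort(perm[:m]), np.sort(perm[m:])
print("drug A:", group_A)
print("drug B:", group_B)
```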
Using our first three examples for illustrative purposes, we now define the elements of a statistical model. We are given a random experiment with sample space Ω. On this sample space we have defined a random vector X = (X₁, ..., Xₙ). When ω is the outcome of the experiment, X(ω) is referred to as the observations or data. It is often convenient to identify the random vector X with its realization, the data X(ω). Since it is only X that we observe, we need only consider its probability distribution. This distribution is assumed to be a member of a family P of probability distributions on Rⁿ. P is referred to as the model. For instance, in Example 1.1.1, we observe X and the family P is that of all hypergeometric distributions with sample size n and population size N. In Example 1.1.2, if (1)-(4) hold, P is the family of all distributions according to which X₁, ..., Xₙ are independent and identically distributed with a common N(μ, σ²) distribution. A review of necessary concepts and notation from probability theory is given in the appendices.

1.1.2 Parametrizations and Parameters

To describe P we use a parametrization, that is, a map, θ → P_θ, from a space of labels, the parameter space Θ, to P; that is, we can write P = {P_θ : θ ∈ Θ}. Thus, in Example 1.1.1 we take θ to be the fraction of defectives, Θ = {0, 1/N, ..., 1}, and P_θ the H(Nθ, N, n) distribution. In Example 1.1.2 with assumptions (1)-(4) we have Θ = R × R⁺ and, if θ = (μ, σ²), P_θ is the distribution on Rⁿ with density ∏ᵢ₌₁ⁿ σ⁻¹φ((xᵢ - μ)/σ), where φ is the standard normal density. If, moreover, we know we are measuring a positive quantity in this model, we have Θ = R⁺ × R⁺. If, on the other hand, we only wish to make assumptions (1)-(3) with ε having expectation 0, we can take Θ = {(μ, G) : μ ∈ R, G with density g such that ∫ xg(x)dx = 0} and p_(μ,G) has density ∏ᵢ₌₁ⁿ g(xᵢ - μ).

When we can take Θ to be a nice subset of Euclidean space and the maps θ → P_θ are smooth, in senses to be made precise later, models P are called parametric. Models such as that of Example 1.1.2 with assumptions (1)-(3) are called semiparametric. Finally, models such as that of Example 1.1.3 with only (1) holding and F, G taken to be arbitrary are called nonparametric. It's important to note that even nonparametric models make substantial assumptions: in Example 1.1.3, that X₁, ..., Xₘ are independent of each other and of Y₁, ..., Yₙ, and, moreover, that X₁, ..., Xₘ are identically distributed, as are Y₁, ..., Yₙ. The only truly nonparametric but useless model for X ∈ Rⁿ is to assume that its (joint) distribution can be anything.

Note that there are many ways of choosing a parametrization in these and all other problems. We may take any one-to-one function of θ as a new parameter. For instance, in Example 1.1.1 we can use the number of defectives in the population, Nθ, as a parameter, and in Example 1.1.2, under assumptions (1)-(4), we may parametrize the model by the first and second moments of the normal distribution of the observations (i.e., by (μ, μ² + σ²)).

What parametrization we choose is usually suggested by the phenomenon we are modeling; θ is the fraction of defectives, μ is the unknown constant being measured. However, as we shall see later, the first parametrization we arrive at is not necessarily the one leading to the simplest analysis. Of even greater concern is the possibility that the parametrization is not one-to-one, that is, such that we can have θ₁ ≠ θ₂ and yet P_θ₁ = P_θ₂. Such parametrizations are called unidentifiable. The critical problem with such parametrizations is that even with "infinite amounts of data," that is, knowledge of the true P_θ, parts of θ remain unknowable. Thus, we will need to ensure that our parametrizations are identifiable, that is, θ₁ ≠ θ₂ implies P_θ₁ ≠ P_θ₂.

For instance, in (1.1.2) suppose that we permit G to be arbitrary. Then the map sending θ = (μ, G) into the distribution of (X₁, ..., Xₙ) remains the same, but Θ = {(μ, G) : μ ∈ R, G has (arbitrary) density g}. Now the parametrization is unidentifiable because, for example, μ = 0 and N(0, 1) errors lead to the same distribution of the observations as μ = 1 and N(-1, 1) errors.

Dual to the notion of a parametrization, a map from some Θ to P, is that of a parameter, formally a map, ν, from P to another space N. A parameter is a feature ν(P) of the distribution of X. For instance, in Example 1.1.1, the fraction of defectives θ can be thought of as the mean of X/n. In Example 1.1.3 with assumptions (1)-(2) we are interested in Δ, which can be thought of as the difference in the means of the two populations of responses. Formally, a function q : Θ → N can be identified with a parameter ν(P) iff P_θ₁ = P_θ₂ implies q(θ₁) = q(θ₂), and then ν(P_θ) ≡ q(θ).

Implicit in this description is the assumption that θ is a parameter in the sense we have just defined; that is, given a parametrization θ → P_θ, θ is a parameter if and only if the parametrization is identifiable. In that case we can define θ : P → Θ as the inverse of the map θ → P_θ from Θ to its range P, iff the latter map is 1-1, that is, if P_θ₁ = P_θ₂ implies θ₁ = θ₂.

Here are two points to note:

(1) A parameter can have many representations. For instance, in Example 1.1.2 with assumptions (1)-(4) the parameter of interest μ ≡ μ(P) can be characterized as the mean of P, or the median of P, or the midpoint of the interquantile range of P, or more generally as the center of symmetry of P, as long as P is the set of all Gaussian distributions.

(2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable). For instance, consider Example 1.1.2 again in which we assume the error ε to be Gaussian but with arbitrary mean Δ. Then P is parametrized by θ = (μ, Δ, σ²), where σ² is the variance of ε. As we have seen this parametrization is unidentifiable and neither μ nor Δ are parameters in the sense we've defined. But σ² = Var(X₁) evidently is, and so is μ + Δ.

Sometimes the choice of P starts by the consideration of a particular parameter. For instance, in Example 1.1.3, instead of postulating a constant treatment effect Δ, we can start by making the difference of the means, Δ = μ_Y - μ_X, the focus of the study. Then Δ is identifiable whenever μ_X and μ_Y exist. Similarly, consider Example 1.1.2 where μ denotes the mean income; our interest in studying a population of incomes may precisely be in μ. When we sample, say with replacement, and observe X₁, ..., Xₙ independent with common distribution, it is natural to write

X_i = μ + ε_i,  E(ε_i) = 0,  1 ≤ i ≤ n,

and parametrize the model by θ = (μ, G) with G = {G : ∫ x dG(x) = 0}.

In addition to the parameters of interest, there are also usually nuisance parameters, which correspond to other unknown features of the distribution of X. For instance, in Example 1.1.2, if the errors are normally distributed with unknown variance σ², then σ² is a nuisance parameter. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter θ, which indexes the family P, that is, make θ → P_θ into a parametrization of P.
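The unidentifiability of the arbitrary-G location model above can be checked numerically. In this sketch (a hedged illustration, not from the text) the two parameter values (μ = 0 with N(0, 1) errors) and (μ = 1 with N(-1, 1) errors) produce identical densities for X₁:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-4.0, 4.0, 9)

# mu = 0 with N(0, 1) errors: X_1 = 0 + eps has density phi(x).
density_a = norm.pdf(x, loc=0.0, scale=1.0)
# mu = 1 with N(-1, 1) errors: X_1 = 1 + eps with eps ~ N(-1, 1), so X_1 ~ N(0, 1) again.
density_b = norm.pdf(x - 1.0, loc=-1.0, scale=1.0)

print(np.allclose(density_a, density_b))   # True: (mu, G) is not identifiable here
```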
1.1.3 Statistics as Functions on the Sample Space

Models and parametrizations are creations of the statistician, but the true values of parameters are secrets of nature. Our aim is to use the data inductively, to narrow down in useful ways our ideas of what the "true" P is. The link for us are things we can compute, statistics. Formally, a statistic T is a map from the sample space X to some space of values T, usually a Euclidean space. Informally, T(x) is what we can compute if we observe X = x. Thus, in Example 1.1.1, a common estimate of θ is the statistic T(x) = x/n, the fraction defective in the sample. In Example 1.1.2 a common estimate of μ is the statistic

T(X₁, ..., Xₙ) = X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ

and a common estimate of σ² is the statistic

s² = (1/n) Σᵢ₌₁ⁿ (Xᵢ - X̄)²;

X̄ and s² are called the sample mean and sample variance. For future reference we note that a statistic, just as a parameter, need not be real or Euclidean valued. For instance, a statistic we shall study extensively in Chapter 2 is the function-valued statistic F̂, called the empirical distribution function, which evaluated at x ∈ R is

F̂(X₁, ..., Xₙ)(x) = (1/n) Σᵢ₌₁ⁿ 1(Xᵢ ≤ x)

where (X₁, ..., Xₙ) are a sample from a probability P on R and 1(A) is the indicator of the event A. This statistic takes values in the set of all distribution functions on R. It estimates the function-valued parameter F defined by its evaluation at x ∈ R,

F(P)(x) = P[X₁ ≤ x].

Deciding which statistics are important is closely connected to deciding which parameters are important and, hence, can be related to model formulation as we saw earlier. For instance, consider situation (d) listed at the beginning of this section. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient, then our attention naturally focuses on estimating this constant. If, however, this difference depends on the patient in a complex manner (the effect of each drug is complex), we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure.

Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. Next this model, which now depends on the data, is used to decide what estimate of the measure of difference should be employed (cf., for example, Mandel, 1964). Data-based model selection can make it difficult to ascertain or even assign a meaning to the accuracy of estimates or the probability of reaching correct conclusions. These issues will be discussed further in Volume 2. In this volume we assume that the model has been selected prior to the current experiment.
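As a quick illustration of these statistics, the following sketch (the sample is simulated, and the n⁻¹ convention for s² follows the definition above) computes X̄, s², and the empirical distribution function on a small grid:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=50)    # simulated sample; mu and sigma illustrative

X_bar = X.mean()                               # sample mean, the usual estimate of mu
s2 = np.mean((X - X_bar) ** 2)                 # sample variance (n^{-1} convention above)

def edf(sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    return np.mean(sample[:, None] <= np.asarray(x), axis=0)

print(X_bar, s2, edf(X, np.array([3.0, 5.0, 7.0])))
```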
This selection is based on experience with previous similar experiments (cf. Lehmann, 1990). There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. For instance, in situation (d) again, patients may be considered one at a time, sequentially, and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients. The experimenter may, for example, assign the drugs alternatively to every other patient in the beginning and then, after a while, assign the drug that seems to be working better to a higher proportion of patients. Moreover, the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other. Thus, the number of patients in the study (the sample size) is random. Problems such as these lie in the fields of sequential analysis and experimental design. They are not covered under our general model and will not be treated in this book. We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more information.

Notation. Regular models. When dependence on θ has to be observed, we shall denote the distribution corresponding to any particular parameter value θ by P_θ. Expectations calculated under the assumption that X ~ P_θ will be written E_θ. Distribution functions will be denoted by F(·, θ), density and frequency functions by p(·, θ). However, these and other subscripts and arguments will be omitted where no confusion can arise.

It will be convenient to assume(1) from now on that in any parametric model we consider either:

(1) All of the P_θ are continuous with densities p(x, θ);

(2) All of the P_θ are discrete with frequency functions p(x, θ), and there exists a set {x₁, x₂, ...} that is independent of θ such that Σᵢ₌₁^∞ p(xᵢ, θ) = 1 for all θ.

Such models will be called regular parametric models. In the discrete case we will use both the terms frequency function and density for p(x, θ). See A.10.

1.1.4 Examples, Regression Models

We end this section with two further important examples indicating the wide scope of the notions we have introduced. In most studies we are interested in studying relations between responses and several other variables not just treatment or control as in Example 1.1.3. This is the stage for the following.

Example 1.1.4. Regression Models. We observe (z₁, Y₁), ..., (zₙ, Yₙ) where Y₁, ..., Yₙ are independent. The distribution of the response Yᵢ for the ith subject or case in the study is postulated to depend on certain characteristics zᵢ of the ith subject. Thus, zᵢ is a d-dimensional vector that gives characteristics such as sex, age, height, weight, and so on of the ith subject in a study. For instance, in Example 1.1.3 we could take z to be the treatment label and write our observations as (A, x₁), ..., (A, xₘ), (B, y₁), ..., (B, yₙ). This is obviously overkill, but suppose that, in the study, drugs A and B are given at several dose levels. Then d = 2 and zᵢ can denote the pair (Treatment Label, Treatment Dose Level) for patient i.

In general, zᵢ is a nonrandom vector of values called a covariate vector or a vector of explanatory variables, whereas Yᵢ is random and referred to as the response variable or dependent variable in the sense that its distribution depends on zᵢ. If we let f(yᵢ | zᵢ) denote the density of Yᵢ for a subject with covariate vector zᵢ, then the model is

(a) p(y₁, ..., yₙ) = ∏ᵢ₌₁ⁿ f(yᵢ | zᵢ).

If we let μ(z) denote the expected value of a response with given covariate vector z, then we can write

(b) Yᵢ = μ(zᵢ) + εᵢ,  i = 1, ..., n,

where εᵢ = Yᵢ - E(Yᵢ). Here μ(z) is an unknown function from R^d to R that we are interested in. For instance, in Example 1.1.3 with the Gaussian two-sample model, μ(A) = μ, μ(B) = μ + Δ. We usually need to postulate more. A common (but often violated) assumption is:

(1) The εᵢ are identically distributed with distribution F. That is, the effect of z on Y is through μ(z) only. In the two-sample models this is implied by the constant treatment effect assumption.

On the basis of subject matter knowledge and/or convenience it is usually postulated that

(2) μ(z) = g(β, z) where g is known except for a vector β = (β₁, ..., β_d)ᵀ of unknowns. The most common choice of g is the linear form,

(3) g(β, z) = Σⱼ₌₁^d βⱼzⱼ = zᵀβ, so that (b) becomes

(b') Yᵢ = zᵢᵀβ + εᵢ,  i = 1, ..., n.

This is the linear model. Often the following final assumption is made:

(4) The distribution F of (1) is N(0, σ²) with σ² unknown. Then we have the classical Gaussian linear model, which we can write in vector matrix form,

(c) Y = Zβ + ε,  ε ~ N(0, σ²J),

where Z_{n×d} = (z₁, ..., zₙ)ᵀ and J is the n × n identity. Clearly, Example 1.1.3(3) is a special case of this model. So is Example 1.1.2 with assumptions (1)-(4). By varying the assumptions we obtain parametric models as with (1), (2), (3), and (4) above, semiparametric as with (1) and (2) with F arbitrary, and nonparametric if we drop (1) and simply treat the zᵢ as a label of the completely unknown distributions of Yᵢ. Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems. In fact, by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. See Problem 1.1.8. □
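Here is a minimal simulation sketch of the classical Gaussian linear model (c); the design matrix, β, and σ are illustrative, and least squares (taken up in Section 2.2.1) is used only as the natural estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])  # intercept and a dose covariate
beta, sigma = np.array([1.5, 0.7]), 1.0                        # "true" values, illustrative

# Model (c): Y = Z beta + eps with eps ~ N(0, sigma^2 J).
Y = Z @ beta + rng.normal(0.0, sigma, size=n)

beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)               # least squares estimate of beta
print(beta_hat)                                                # close to (1.5, 0.7)
```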
Finally, we give an example in which the responses are dependent.

Example 1.1.5. Measurement Model with Autoregressive Errors. Let X₁, ..., Xₙ be the n determinations of a physical constant μ. Consider the model

(a) Xᵢ = μ + eᵢ,  i = 1, ..., n,
    eᵢ = βeᵢ₋₁ + εᵢ,  i = 1, ..., n,  e₀ = 0,

where the εᵢ are independent identically distributed with density f. Here the errors e₁, ..., eₙ are dependent, as are the X's. An example would be, say, the elapsed times X₁, ..., Xₙ spent above a fixed high level for a series of n consecutive wave records at a point on the seashore. Let μ = E(Xᵢ) be the average time for an infinite series of records. It is plausible that eᵢ depends on eᵢ₋₁ because long waves tend to be followed by long waves. A second example is consecutive measurements Xᵢ of a constant μ made by the same observer who seeks to compensate for apparent errors. Of course, model (a) assumes much more, but it may be a reasonable first approximation in these situations.

To find the density p(x₁, ..., xₙ), we start by finding the density of e₁, ..., eₙ. Using conditional probability theory and eᵢ = βeᵢ₋₁ + εᵢ, we have

p(e₁, ..., eₙ) = p(e₁)p(e₂ | e₁)p(e₃ | e₁, e₂) ··· p(eₙ | e₁, ..., eₙ₋₁)
             = p(e₁)p(e₂ | e₁)p(e₃ | e₂) ··· p(eₙ | eₙ₋₁)
             = f(e₁)f(e₂ - βe₁) ··· f(eₙ - βeₙ₋₁).

Because eᵢ = Xᵢ - μ, we can write X₁ = μ + ε₁ and Xᵢ = μ(1 - β) + βXᵢ₋₁ + εᵢ, i = 2, ..., n, so that the model for X₁, ..., Xₙ is

p(x₁, ..., xₙ) = f(x₁ - μ) ∏ⱼ₌₂ⁿ f(xⱼ - βxⱼ₋₁ - (1 - β)μ).   (1.1.4)

The default assumption, at best an approximation for the wave example, is that f is the N(0, σ²) density. Then we have what is called the AR(1) Gaussian model. We include this example to illustrate that we need not be limited by independence. However, save for a brief discussion in Volume 2, the conceptual issues of stationarity, ergodicity, and the associated probability theory models and inference for dependent data are beyond the scope of this book. □
However. I(O. In this section we shall define and discuss the basic clements of Bayesian models. X) is appropriately continuous or discrete with density Or frequency function. (1.2. (1) Our own point of view is that SUbjective elements including the views of subject matter experts arc an essential element in all model building. (8. whose range is contained in 8. X) is that of the outcome of a random experiment in which we first select (J = (J according to 7r and then. There is an even greater range of viewpoints in the statistical community from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of (J has an objective interpretation in tenus of frequencies. . The most important feature of a Bayesian model is the conditional distribution of 8 given X = x. (1.9)'00'.3). (J) as a conditional density or frequency function given 8 = we will denote it by p(x I 0) for the remainder of this section. let us turn again to Example 1. Eqnation (1. with density Or frequency function 7r. then by (B. given (J = (J. Lindley (1965). 1.3) Because we now think of p(x. After the value x has been obtained for X. 1. Suppose that we have a regular parametric model {Pe : (J E 8}. In the "mixed" cases such as (J continuous X discrete.l.O).3). which is called the posterior distribution of 8. suppose that N = 100 and that from past experience we believe that each item has probability . x) = ?T(O)p(x. by giving (J a distribution purely as a theoretical tool to which no subjective significance is attached. we can obtain important and useful results and insights. Raiffa and Schlaiffer (1961). and Berger (1985). select X according to Pe. the resulting statistical inference becomes subjective.2. the joint distribution is neither continuous nor discrete. To get a Bayesian model we introduce a random vector (J.2. The joint distribution of (8. This would lead to the prior distribution e. For instance. We shall return to the Bayesian framework repeatedly in our discussion. insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). the information about (J is described by the posterior distribution.2 Bayesian Models 13 Thus.1 of being defective independently of the other members of the shipment. The function 7r represents our belief or information about the parameter (J before the experiment and is called the prior density or frequency function. If both X and (J are continuous or both are discrete.Section 1. Before the experiment is performed. We now think of Pe as the conditional distribution of X given (J = (J.1)'(0. the information or belief about the true value of the parameter is described by the prior distribution. Before sampling any items the chance that a given shipment contains . = ( I~O ) (0. (1962).4) for i = 0.1. However. ?T. The theory of this school is expounded by L. For a concrete illustration.1. .2. De Groot (1969).. An interesting discussion of a variety of points of view on these questions may be found in Savage et al. 100. Savage (1954)..2) is an example of (1.
8) as posterior density of 0. This leads to P[1000 > 20 I X = lOJ '" 0. . Suppose that Xl. I 20 or more bad items is by the normal approximation with continuity correction.6) we argue loosely as follows: If be fore the drawing each item was defective with probability .. 0.2.30.8) roo 1r(9)p(x 19) 1r(t)p(x I t)dt if 8 is continuous. Specifically. Thus.1. Bernoulli Trials..1100(0. is iudependeut of X and has a B(81.6) To calculate the posterior probability given iu (1.<I> 1000 .9) I .2. . ( 1.xn )= Jo'1r(t)t k (lt)n. ..4) can be used..9) I .<1>(0.10). (i) The posterior distribution is discrete or continuous according as the prior distri bution is discrete or continuous.1) (1. 1.2.J C3 (1.2. 1008 .1)(0. . X) given by (1. this will continue to be the case for the items left in the lot after the 19 sample items have been drawn. P[IOOO > 201 = 10] P [ . .9)(0.52) 0.2.181(0. Now suppose that a sample of 19 has been drawn in which 10 defective items are found.. (ii) If we denote the corresponding (posterior) frequency function or density by 1r(9 I x). In the cases where 8 and X are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of (0.181(0.7) In general.:.1. (A 15. some variant of Bayes' rule (B.j . 1r(t)p(x I t) if 8 is discrete.9 ] .Xn are indicators of n Bernoulli trials with probability of success () where 0 < 8 < 1.1) distribution. 1r(9)9k (1 _ 9)nk 1r(Blx" ..1) '" I .10 '" .. (1.30.2.8.3). P[1000 > 20 I X = 10] P[IOOO .X.X) . 14 Statistkal Models..1 and good with probability .2.9)(0. we obtain by (1. to calculate the posterior. If we assume that 8 has a priori distribution with deusity 1r.1 > . theu 1r(91 x) 1r(9)p(x I 9) 2.9 independently of the other items. the number of defectives left after the drawing. Goals. .II ~. and Performance Criteria Chapter 1 I ..2.k dt (1.2.1100(0.. Therefore.X > 10 I X = lOJ P [(1000 . Here is an example..001.5) 5 ) = 0.9) > .1)(0. Example 1.
s) distribution. We may want to assume that B has a density with maximum value at o such as that drawn with a dotted line in Figure B.Section 1. Another bigger conjugate family is that of finite mixtures of beta distributionssee Problem 1.ll) in (1. we might take B to be uniformly distributed on (0.. so that the mean is r/(r + s) = 0. we are interested in the proportion () of "geniuses" (IQ > 160) in a particular city. 0 A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family. for instance. the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions though by no means all.13) leads us to assume that the number X of geniuses observed has approximately a 8(n.2. say 0.2 Bayesian Models 15 for 0 < () < 1.2. .9). Such families are called conjugaU? Evidently the beta family is conjugate to the binomial. L~ 1 Xi' We also obtain the same posterior density if B has prior density 1r and 1 Xi. and the posterior distribution of B givenl:X i =kis{3(k+r. Summary. nonVshaped bimodal distributions are not pennitted. 1). 'i = 1. This class of distributions has the remarkable property that the resulting posterior distributions arc again beta distributions. which corresponds to using the beta distribution with r = .2.2.n. which has a B( n..2. We return to conjugate families in Section 1. For instance. we only observe We can thus write 1[(8 I k) fo".2. . upon substituting the (3(r.2. The result might be a density such as the one marked with a solid line in Figure B. s) density (B. n .9) we obtain 1 L: L7 (}k+rl (1 _ 8)nk+s1 c (1.. We present an elementary discussion of Bayesian models. Then we can choose r and s in the (3(r.11» be B(k + T.1).2. Suppose.8) distribution. Or else we may think that 1[(8) concentrates its mass near a small number. and s only. Xi = 0 or 1. Specifically.16. As Figure B.15. r.<.2. Now we may either have some information about the proportion of geniuses in similar cities of the country or we may merely have prejudices that we are willing to express in the fonn of a prior distribution on B. 8) distribution given B = () (Problem 1. x n ).. k = I Xi· Note that the posterior density depends on the data only through the total number of successes.05. One such class is the twoparameter beta family. must (see (B. To get infonnation we take a sample of n individuals from the city. . If we were interested in some proportion about which we have no information or belief.2.(8 I X" . To choose a prior 11'. If n is small compared to the size of the city. where k ~ l:~ I Xj.05 and its variance is very small.·) is the beta function. introduce the notions of prior and posterior distributions and give Bayes rule.k + s) where B(·. = 1. . we need a class of distributions that concentrate on the interval (0. We also by example introduce the notion of a conjugate family of distributions. (A.nk+s).2.2 indicates.6. which depends on k.10) The proportionality constant c.
1 do we use the observed fraction of defectives .2. if we believe I'(z) = g((3. We may wish to produce "best guesses" of the values of important parameters. sex. Thus.1 contractual agreement between shipper and receiver may penalize the return of "good" shipments.3. A very important class of situations arises when. "hypothesis" or "nonspecialness" (or alternative).lo means the universe is expanding forever and J. one of which will be announced as more consistent with the data than others. say a 50yearold male patient's response to the level of a drug. in Example 1. A consumer organization preparing (say) a report on air conditioners tests samples of several brands.1. with (J < (Jo. if there are k different brands.4. Zi) and then plug our estimate of (3 into g. at a first cut. For instance. drug dose) T that can be used for prediction of a variable of interest Y. if we have observations (Zil Yi). there are many problems of this type in which it's unclear which oftwo disjoint sets of P's.2. there are k! possible rankings or actions.1.l > J. However. the expected value of Y given z.l < Jio or those corresponding to J." (} > Bo.L in Example 1.1. Ranking. 0 In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function. whereas the receiver does not wish to keep "bad. Given a statistical model. i.3. the information we want to draw from data can be put in various forms depending on the purposes of our analysis. 1 < i < n. and Performance Criteria Chapter 1 1.e. We may have other goals as illustrated by the next two examples. say. There are many possible choices of estimates. (age. Note that we really want to estimate the function Ji(')~ our results will guide the selection of doses of drug for future patients. z) we can estimate (3 from our observations Y. As the second example suggests. and as we shall see fonnally later.1 or the physical constant J. then depending on one's philosophy one could take either P's corresponding to J. In other situations certain P are "special" and we may primarily wish to know whether the data support "specialness" or not.l < J. such as. If J. say. In Example 1.3 THE DECISION THEORETIC FRAMEWORK I . For instance.1. as it's usually called. the receiver wants to discriminate and may be able to attach monetary Costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment. for instance.l > JkO correspond to an eternal alternation of Big Bangs and expansions. we have a vector z. Po or Pg is special and the general testing problem is really one of discriminating between Po and Po.lo is the critical matter density in the universe so that J.1. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not pennitted).3. in Example 1.1. shipments. 0 Example 1. P's that correspond to no treatment effect (i. For instance.1.. of g((3.16 Statistical Models. Making detenninations of "specialness" corresponds to testing significance.lo as special. placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to pennit the marketing of drugs that do no good. Goals. Example 1. the fraction defective B in Example 1. as in Example 1. Prediction. state which is supported by the data: "specialness" or. Unfortunately {t(z) is unknown. Thus. These are estimation problems. Intuitively." In testing problems we. 
i we can try to estimate the function 1'0. a reasonable prediction rule for an unseen Y (response of a new patient) is the function {t(z).
.. .1.1. 3.. or p. in Example 1. or the median. (2) point to what the different possible actions are. we begin with a statistical model with an observation vector X whose distribution P ranges over a set P. Here only two actions are contemplated: accepting or rejecting the "specialness" of P (or in more usual language the hypothesis H : P E Po in which we identify Po with the set of "special" P's). A = {( 1. accuracy.1.1. taking action 1 would mean deciding that D. A = {a. if we have three air conditioners.3 The Decision Theoretic Framework 17 X/n as our estimate or ignore the data and use hislOrical infonnation on past shipments. In designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter. Here are action spaces for our examples.3 even with the simplest Gaussian model it is intuitively clear and will be made precise later that. 1. For instance. most significantly. 3). Thus. for instance. in Example 1. defined as any value such that half the Xi are at least as large and half no bigger? The same type of question arises in all examples. . A new component is an action space A of actions or decisions or claims that we can contemplate making. # 0. there are 3! = 6 possible rankings. on what criteria of performance we use. I} with 1 corresponding to rejection of H. 3).1.Section 1. 3. X = ~ L~l Xi. Here quite naturally A = {Permntations (i I . k}}. Action space. We usnally take P to be parametrized. ~ 1 • • • l I} in Example 1. 1.1 Components of the Decision Theory Framework As in Section 1. That is. in Example 1. we would want a posteriori estimates of perfomlance. . and reliability of statistical procedures. 2).1.2. a large a 2 will force a large m l n to give us a good chance of correctly deciding that the treatment effect is there. (1. (3. If we are estimating a real parameter such as the fraction () of defectives. ik) of {I. P = {P. (3) provide assessments of risk. in estimation we care how far off we are. or combine them in some way? In Example 1..1. is large.3. whatever our choice of procedure we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we're doing. Intuitively. once a study is carried out we would probably want not only to estimate ~ but also know how reliable our estimate is. These examples motivate the decision theoretic framework: We need to (I) clarify the objectives of a study. in testing whether we are right or wrong. in ranking what mistakes we've made. 2. (3. 1.1. On the other hand. The answer will depend on the model and. (2. 2). A = {O. even if. Thus. in Example 1. I)}. In any case. Thus.2 lO estimate J1 do we use the mean of the measurements. it is natural to take A = R though smaller spaces may serve equally well. we need a priori estimates of how well even the best procedure can do. 1).6. (4) provide guidance in the choice of procedures for analyzing outcomes of experiments. .1. and so on. By convention. .3. (2. Estimation. : 0 E 8). Testing. Ranking. 2.
a) 1(0. Evidently Y could itself range over an arbitrary space Y and then R would be replaced by Y in the definition of a(·). [v(P) . Other choices that are. = max{laj  vjl. a is a function from Z to R} with a(z) representing the prediction we would make if the new unobserved Y had covariate value z. a) = 0.. a) = 1 otherwise. the probability distribution producing the data. A = {a . In estimating a real valued parameter v(P) or q(6') if P is parametrized the most commonly used loss function is. This loss expresses the notion that all errors within the limits ±d are tolerable and outside these limits equally intolerable. a) if P is parametrized.a)2.3. is the nonnegative loss incurred by the statistician if he or she takes action a and the true "state of Nature. a) = l(v < a). ' . as the name suggests. .. is P. r " I' 2:)a. Loss function.. I(P. If. [f Y is real. they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient. If v ~ (V1.Vd) = (q1(0).)2 = Vj [ squared Euclidean distance/d absolute distance/d 1. "does not respond" and "responds. less computationally convenient but perhaps more realistically penalize large errors less are Absolute Value Loss: l(P. 1 d} = supremum distance." respectively. As we shall see.18 Statistical Models. Far more important than the choice of action space is the choice of loss function defined as a function I : P X A ~ R+.ai.a)2).3. if Y = 0 or 1 corresponds to. a) = min {(v( P) .. For instance.a(z))2dQ(z). which penalizes only overestimation and by the same amount arises naturally with lower confidence bounds as discussed in Example 1. Estimation. say. Closely related to the latter is what we shall call confidence interval loss.2. Although estimation loss functions are typically symmetric in v and a. although loss functions. l(P. a). a) = (v(P) .. a) 1(0. Sex)T. or 1(0. a) I d . .: la.al < d. Q is the empirical distribution of the Zj in .a) ~ (q(O) . ~ 2. the expected squared error if a is used.1). 1'() is the parameter of interest. and z = (Treatment. a) = J (I'(z) . examples of loss functions are 1(0. l( P. . and truncated quadraticloss: I(P. sometimes can genuinely be quantified in economic terms. Quadratic Loss: I(P. d'}.V. The interpretation of I(P. r I I(P.a)2 (or I(O. a) = [v( P) . . Goals. M) would be our prediction of response or no response for a male given treatment B. and z E Z. If we use a(·) as a predictor and the new z has marginal distribution Q then it is natural to consider. and Performance Criteria Chapter 1 Prediction. .3. For instance..qd(lJ)) and a ~ (a""" ad) are vectors." that is. as we shall see (Section 5.i = We can also consider function valued parameters. For instance. asymmetric loss functions can also be of importance. then a(B. in the prediction example 1. Here A is much larger. say.
and to decide Ll '# a if our estimate is not close to zero.2) I if Ix ~ 111 (J >c . this leads to the commonly considered I(P. y) = o if Ix::: 111 (J <c (1. relative to the standard deviation a. For the problem of estimating the constant IJ. a(zn) jT and the vector parameter (I'( z.)..1 suppose returning a shipment with °< 1(8. . I) 1(8.). the statistician takes action o(x). Otherwise. .2) and N(I'. a Estimation.3..2). In Section 4. (1. respectively..))2.. in the measurement model. ]=1 which is just n. if we are asking whether the treatment effect p'!1'ameter 6.. where {So.1) Decision procedures.. the decision is wrong and the loss is taken to equal one.1'(zn))T Testing. Y n).Section 1. . In Example 1. that is. The data is a point X = x in the outcome or sample space X. a) = 1 otherwise (The decision is wrong).I loss: /(8.3 The Decision Theoretic Framework 19 the training set (Z 1. We next give a representation of the process Whereby the statistician uses the data to arrive at a decision.9. in Example 00 defectives results in a penalty of s dollars whereas every defective item sold results in an r dollar replacement cost. For instance. we implicitly discussed two estimates or decision rules: 61 (x) = sample mean x and 02(X) = X = sample median.1 loss function can be written as ed. Testing. e ea.3 with X and Y distributed as N(I' + ~. is a or not. We define a decision rule or procedure to be any function from the sample space taking its values in A.1..3. then a reasonable rule is to decide Ll = 0 if our estimate x . Of course. If we take action a when the parameter is in we have made the correct decision and the loss is zero.. L(I'(Zj) _0(Z. The decision rule can now be written o(x.y is close to zero. Using 0 means that if X = x is observed.3 we will show how to obtain an estimate (j of a from the data..0) sif8<80 Oif8 > 80 rN8. Y). 0. . This 0 .). is a partition of (or equivalently if P E Po or P E P. Here we mean close to zero relative to the variability in the experiment. (Zn.a) = I n n..1.1 times the squared Euclidean distance between the prediction vector (a(z. We ask whether the parameter () is in the subset 6 0 or subset 8 1 of e. . Then the appropriate loss function is 1.. other economic loss functions may be appropriate. a) ~ 0 if 8 E e a (The decision is correct) l(0. . I) /(8.
the risk or riskfunction: The risk function. If we expand the square of the righthand side keeping the brackets intact and take the expected value.i. The other two terms are (Bias fi)' and Var(v). the cross term will he zero because E[fi .. we regard I(P. (If one side is infinite.) = 2L n ~ n . If d is the procedure used. R('ld) is our a priori measure of the performance of d. then Bias(X) 1 n Var(X) .3. Proof. Thus.3) where for simplicity dependence on P is suppressed in MSE. i=l . but for a range of plausible x·s.1']. and X = x is the outcome of the experiment. Write the error as 1 = (Bias fi)' + E(fi)] Var(v). How do we choose c? We need the next concept of the decision theoretic framework. Example 1. () is the true value of the parameter. A useful result is Bias(fi) Proposition 1.1. 1 is the loss function. lfwe use the mean X as our estimate of IJ. . we typically want procedures to have good properties not at just one particular x. If we use quadratic loss. our risk function is called the mean squared error (MSE) of. 6(X)] as the measure of the perfonnance of the decision rule o(x). Suppose Xl.E(fi)] = O.. fJ(x)).6) = Ep[I(P." Var(X.20 Statistical Models. Suppose v _ v(P) is the real parameter we wish to estimate and fJ(X) is our estimator (our decision rule). That is.d.v) = [V  + [E(v) .3.3. e Estimation. for each 8. we turn to the average or mean loss over the sample space. MSE(fi) I. so is the other and the result is trivially true. and assume quadratic loss. (Continued). . 6 (x)) as a random variable and introduce the riskfunction R(P..3. R maps P or to R+. Estimation of IJ. Moreover. and Performance Criteria Chapter 1 where c is a positive constant called the critical value. withN(O. (]"2) errors. We do not know the value of the loss because P is unknown. We illustrate computation of R and its a priori use in some examples. fi) = Ep(fi(X) . (fi .. and is given by v= MSE{fl) = R(P.. The MSE depends on the variance of fJ and on what is called the bias ofv where = E{fl) ..v(P))' (1. then the loss is l(P. Thus.) 0 We next illustrate the computation and the a priori and a posteriori use of the risk function. measurements of IJ. X n are i. Goals.v can be thought of as the "longrun average error" of v.
if X rnedian(X 1 .a 2 . If we have no idea of the value of 0.2 .1). . If we want to be guaranteed MSE(X) < £2 we can do it by taking at least no = <:TolE measurements.. itself subject to random error.Section 1. Suppose that instead of quadratic loss we used the more natural(l) absolute value loss. X) = a' (P) / n still. an estimate we can justify later. .fii V.Xn ) (and we.3.3. then for quadratic loss.6).. or numerical and/or Monte Carlo computation. (.1".. We shall derive them in Section 1.t. If we have no data for area A.4.X) = ". 00 (1. ( N(o.8)X.3. as we discussed in Example 1. the £. Next suppose we are interested in the mean fl of the same measurement for a certain area of the United States. with mean 0 and variance a 2 (P).1"1 ~ Eill 2 ) where £.3 The Decision Theoretic Framework 21 and.3. a natural guess for fl would be flo. The a posteriori estimate of risk (j2/n is. 0 ~ = a We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1.5) This harder calculation already suggests why quadratic loss is really favored.fii 1 af2 00 Itl<p(t)dt = . Then (1. age or income. E(i . 1 X n from area A..8 can only be made on the basis of additional knowledge about demography or the economy.4) which doesn't depend on J. or na 21(n .2)1"0 + (O. n (1.d.3. as we assumed. for instance by 8 2 = ~ L~ l(X i . . in general. 1) and R(I". an}).X) =.3. or approximated asymptotically.. = X.6 through a . R( P. of course. .3.a 2 then by (A 13. . Suppose that the precision of the measuring instrument cr2 is known and equal to crJ or where realistically it is known to be < crJ. for instance. Example 1.2 and 0. Ii ~ (0. computational difficulties arise even with quadratic loss as soon as we think of estimates other than X.1 is useful for evaluating the performance of competing estimators. X) = EIX .4) can be used for an a priori estimate of the risk of X.. areN(O. The choice of the weights 0. whereas if we have a random sample of measurements X" X 2.fii/a)[ ~ a . but for absolute value loss only approximate.1 2 MSE(X) = R(I". census.S. For instance.X)2. Let flo denote the mean of a certain measurement included in the U. say. If. .i. is possible. Then R(I". In fact.1")' ~ E(~) can only be evaluated numerically (see Problem 1. planning is not possible but having taken n measurements we can then estimate 0. by Proposition 1. analytic. If we only assume.1.2. write for a median of {aI. that the E:i are i. a'. we may wantto combine tLo and X = n 1 L~ 1 Xi into an estimator.2..23).
MSE(ii) MSE(X) i .1')' + (. 22 Statistical Models. We easily find Bias(ii) Yar(ii) 0. and Performance Criteria Chapter 1 i formal Bayesian analysis using a nonnal prior to illustrate a way of bringing in additional knowledge.3. Y) = I] if i'> = 0 P[6(X.64 when I' = 1'0. The two MSE curves cross at I' ~ 1'0 ± 30'1 . I. The mean squared errdrs of X and ji.1 gives the graphs of MSE(ii) and MSE(X) as functions of 1'.64)0" In MSE(ii) = .81' . thus. Figure 1. The test rule (1.3. Y) = 01 if i'> i O. the risk R(I'.21'0 R(I'. D Testing. = 0.3.n.1') (0. Y) = which in the case of 0 .4).04(1'0 . Here we compare the performances of Ii and X as estimators of Jl using MSE.3. then X is optimal (Example 3. respectively. However. and we are to decide whether E 80 or E 8 where 8 = 8 0 U 8 8 0 n 8 . Goals. If I' is close to 1'0..64)0" In. the risk is = 0 and i'> i 0 can only take on the R(i'>. if we use as our criteria the maximum (over p. X) = 0" In of X with the minimum relative risk inf{MSE(ii)IMSE(X).0)P[6(X. Ii) of ii is smaller than the risk R(I'. using MSE.6) P[6(X.2) for deciding between i'> two values 0 and 1. R(i'>. Figure 1.) ofthe MSE (called the minimax criteria). A test " " e e e .6) = 1(i'>.2(1'0 . Because we do not know the value of fL.1 loss is 01 + lei'>. neither estimator can be proclaimed as being better than the other. iiJ + 0. Y) = IJ.I.1. I' E R} being 0.I' = 0. In the general case X and denote the outcome and parameter space.8)'Yar(X) = (. I)P[6(X.
a > v(P) I." in Example 1.6) P(6(X) = 0) if 8 E 8 1 Probability of Type II error.01 or less.6) Probability of Type [error R(8.05. If <5(X) = 1 and we decide 8 E 8 1 when in fact E 80. where 1 denotes the indicator function.[X < k]. the risk of <5(X) is e R(8. 8 < 80 rN8P. confidence bounds and intervals (and more generally regions).7) Confidence Bounds and Intervals Decision theory enables us to think clearly about an important hybrid of testing and estimation. it is natural to seek v(X) such that P[v(X) > vi > 1. "Reject the shipment if and only if X > k. usually . on the probability of Type I error.Section 1.1 lead to (Prohlem 1. Such a v is called a (1 .8) for all possible distrihutions P of X. Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter v.1) and tests 15k of the fonn. This corresponds to an a priori bound on the risk of a on v(X) viewed as a decision procedure with action space R and loss function.[X < k[.3. For instance. [n the NeymanPearson framework of statistical hypothesis testing. Thus.1. the focus is on first providing a small bound.3 The Decision Theoretic Framework 23 function is a decision rule <5(X) that equals 1 On a set C C X called the critical region and equals 0 on the complement of C. R(8. we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding ~ =1= 0 when ~ = 0).3.IX > k] + rN8P.8) sP. Here a is small. For instance.3. in the treatments A and B example.o) 0. If (say) X represents the amount owed in the sample and v is the unknown total amount owed. (1. an accounting finn examining accounts receivable for a finn on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed.a) upper coofidence houod on v.6) E(6(X)) ~ P(6(X) ~ I) if8 E 8 0 (1. a < v(P) . whereas if <5(X) = 0 and we decide () E 80 when in fact 8 E 8 b we call the error a Type II error. the loss function (1.18). Finding good test functions corresponds to finding critical regions with smaIl probabilities of error. and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding Do = 0 when Do ¥ 0). and then trying to minimize the probability of a Type II error.3. we call the error committed a Type I error. I(P.a (1. For instance.3. say . This is not the only approach to testing. 8 > 80 . 6(X) = IIX E C]. that is.05 or .
3. The 0. and a3. We next tum to the final topic of this section.2 . a component in a piece of equipment either works or does not work.. I I (Oil) (No oil) Bj B2 0 12 10 I 5 6 \ . or wait and see. though. What is missing is the fact that. A has three points. or sell partial rights. for some constant c > O. and Performance Criteria Chapter 1 1 . and the risk of all possible decision procedures can be computed and plotted.3. Goals. the connection is close. a2.a<v(P). as we shall see in Chapter 4.a ·1 for all PEP.: Comparison of Decision Procedures In this section we introduce a variety of concepts used in the comparison of decision procedures.5.a>v(P) . we could operate. Ii) = E(Ii(X) v(P))+. For instance = I(P. replace it. .a)  a~v(P) .1 nature makes it resemble a testing loss function and. We shall go into this further in Chapter 4. aJ. The decision theoretic framework accommodates by adding a component reflecting this. we could leave the component in. and so on. a patient either has a certain disease or does not. an asymmetric estimation type loss function.24 Statistical Models. in fact it is important to get close to the truthknowing that at most 00 doJlars are owed is of no use. sell the location. a 2 certain location either contains oil or does not. . The same issue arises when we are interested in a confidence interval ~(X) l v(X) 1for v defined by the requirement that • P[v(X) < v(P) < Ii(X)] > 1 . are available. rather than this Lagrangian form. e Example 1. it is customary to first fix a in (1.8) and then see what one can do to control (say) R( P. where x+ = xl(x > 0). a) (Drill) aj (Sell) a2 (Partial rights) a3 . general criteria for selecting "optimal" procedures. Suppose that three possible actions. In the context of the foregoing examples. or repair it. Suppose we have two possible states of nature. administer drugs. thai this fonnulation is inadequate because by taking D 00 we can achieve risk O. though upper bounding is the primary goal. we could drill for oil.1. We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model. which we represent by Bl and B . c ~1 . Typically.3. We shall illustrate some of the relationships between these ideas using the following simple example in which has two members. 1. For instance. Suppose the following loss function is decided on TABLE 1.3. The loss function I(B. It is clear.
1.5 3 7 1.3. 8 a3 a2 9 a3 a3 Here. a.7) = 7 12(0.)) 1 1 R(B . 0 .. and frequency function p(x.2. .7 0.3) For instance. if there is oil and we drill.5.6 0.3.5 9. R(B k . e TABLE 1..a2)P[5(X) ~ a2] + I(B.6 4 5 1 " 3 5. 5)) and if k = 2 we can plot the set of all such points obtained by varying 5. .4 = 1. TABLE 1.5) E[I(B. a3 7 a3 a." and so on.0 9 5 6 It remains to pick out the rules that are "good" or "best. 6 a. a.4 and graphed in Figure 1.). a3 4 a2 a. an experiment is conducted to obtain information about B resulting in the random variable X with possible values coded as 0. R(025. .3. The risk points (R(B" 5. X may represent a certain geological formation. i Rock formation x o I 0.5.6 and 0. B) given by the following table TABLE 1. formations 0 and 1 occur with frequencies 0. The frequency function p(x. 5.4.) R(B. 1 9.6. R(B" 52) R(B2 .)P[5(X) = ad +1(B.5 8.).Section 1.52 ) + 10(0. . 5 a. (R(O" 5)..3.. 01 represents "Take action Ul regardless of the value of X. a. e. Possible decision rules 5i (x) . it is known that formation 0 occurs with frequency 0. we can represent the whole risk function of a procedure 0 by a point in kdimensional Euclidean space.3. I x=O x=1 a.)) are given in Table 1.5 4. R(B2.4.) 0 12 2 7 7. the loss is zero.3. if X = 1.6) + 1(0.7.3 0. The risk of 5 at B is R(B. We list all possible decision rules in the following table." 02 corresponds to ''Take action Ul> if X = 0.5(X))] = I(B. whereas if there is no oil. a. 3 a. take action U2.4) ~ 7. If is finite and has k members." Criteria for doing this will be introduced in the next subsection. Next.4 10 6 6.3 The Decision Theoretic Framework 25 Thus.2 for i = 1. Risk points (R(B 5. the loss is 12..4 8 8.). B2 Thus. whereas if there is no oil and we drill.a3)P[5(X) = a3]' 0(0.6 3 3. and so on. and when there is oil.3 and formation 1 with frequency 0. 2 a.2 (Oil) (No oil) e.
2. i = 1.. Researchers have then sought procedures that improve all others within the class.3. we obtain M SE(O) = (}2. and only if. i (1) Narrow classes of procedures have been proposed using criteria such as con siderations of symmetry. Extensions of unbiasedness ideas may be found in Lehmann (1997. a5).2 . S. in estimating () E R when X '"'' N(().   . unbiasedness (for estimates and tests).S) < R(8.S. R(8. The absurd rule "S'(X) = 0" cannot be improved on at the value 8 ~ 0 because Eo(S'(X)) = 0 ifand only if O(X) = O.) > R( 8" S6)' The problem of selecting good decision " procedures has been attacked in a variety of ways. R( 8" Si)). It is easy to see that there is typically no rule e5 that improves all others.) I • 3 10 . if we ignore the data and use the estimate () = 0.7 . Consider.4 .5 0 0 5 10 R(8 j . We shall pursue this approach further in Chapter 3. for instance. neither improves the other. Usually..5).Si) Figure 1. The risk points (R(8 j . We say that a procedure <5 improves a procedure Of if. and Performance Criteria Chapter 1 R(8 2 .6 . Si).3 Bayes and Minimax Criteria The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing.9. Section 1. or level of significance (for tests). For instance. and S6 in our example. Here R(8 S.3.) < R( 8" S6) but R( 8" S. . Goals. Symmetry (or invariance) restrictions are discussed in Ferguson (1967).8 5 . .9 . (2) A second major approach has been to compare risk functions by global crite .S') for all () with strict inequality for some ().26 Statistical Models. if J and 8' are two rules. 1.
R(0" Oi)) I 9.3. . O(X)) I () = ()]. and reo) = J R(O. If there is a rule <5*. the expected loss.38 9. To illustrate. given by reo) = E[R(O.o(x))].9).3. (1. r( 09) specified by (1.o)] = E[I(O.5 gives r( 0.8R(0. Bayes: The Bayesian point of view leads to a natural global criterion.3 The Decision Theoretic Framework 27 ria rather than on a pointwise basis. ()2 and frequency function 1I(OIl The Bayes risk of <5 is. Recall that in the Bayesian model () is the realization of a random variable or vector () and that Pe is the conditional distribution of X given 0 ~ O.).) ~ 0. .5 7 7.7 6.0) Table 1. 0) isjust E[I(O.3. In this framework R(O.6 3 8.2. we need not stop at this point. r(o.10) r( 0) ~ 0. From Table 1.0).2.2. such that • see that rule 05 is the unique Bayes rule then it is called a Bayes rule.5 9 5.92 5.4 5 2.3. (1.9 8.02 8. If we adopt the Bayesian point of view.3.5.48 7.2R(0.8. Bayes and maximum risks oftbe procedures of Table 1. to Section 3. suppose that in the oil drilling example an expert thinks the chance of finding oil is . ... .) maxi R(0" Oil. therefore.3. . but can proceed to calculate what we expect to lose on the average as () varies. Note that the Bayes approach leads us to compare procedures on the basis of. if we use <5 and () = ().3. We shall discuss the Bayes and minimax criteria. it has smaller Bayes risk. the only reasonable computational method.20) in Appendix B. 11(0. which attains the minimum Bayes risk l that is.6 12 2 7.o)1I(O)dO. TABLE 1.8 10 6 3. The method of computing Bayes procedures by listing all available <5 and their Bayes risk is impracticable in general.Section 1.8 6 In the Bayesian framework 0 is preferable to <5' if.6 4 4.5 we for our prior. This quantity which we shall call the Bayes risk of <5 and denote r( <5) is then.3. + 0. r(o") = minr(o) reo) = EoR(O. and only if. = 0.4 8 4. Then we treat the parameter as a random variable () with possible values ()1. We postpone the consideration of posterior analysis.9) The second preceding identity is a consequence of the double expectation theorem (B. if () is discrete with frequency function 'Tr(e).I.. 0)11(0).
4.28 Statistical Models.3. a randomized decision procedure can be thought of as a random experiment whose outcomes are members of V. Nature's choosing a 8. R(O. but only ac. if V is the class of all decision procedures (nonrandomized).J') . which has . Nature's intentions and degree of foreknowledge are not that clear and most statisticiaqs find the minimax principle too conservative to employ as a general rule. we prefer 0" to 0"'." Nature (Player I) picks a independently of the statistician (Player II). Of course. J) A procedure 0*. sup R(O. It aims to give maximum protection against the worst that can happen. To illustrate computation of the minimax rule we tum to Table 1. Our expected risk would be.3.20 if 0 = O . = infsupR(O. . I Randomized decision rules: In general.(2) We briefly indicate "the game of decision theory. .J). who picks a decision procedure point 8 E J from V. From the listing ofmax(R(O" J).3.5 we might feel that both values ofthe risk were equally important. . we toss a fair coin and use 04 if the coin lands heads and 06 otherwise. suppose that.4. J)) we see that J 4 is minimax with a maximum risk of 5. if and only if. For simplicity we shall discuss only randomized . J'). For instance. This is. 8)]. if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used. Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk.75 is strictly less than that of 04. in Example 1. Goals. is called minimax (minimizes the maximum risk). It is then natural to compare procedures using the simple average ~ [R( fh 10) + R( fh. a weight function for averaging the values of the function R( B. The principle would be compelling. For instance. This criterion of optimality is very conservative. The criterion comes from the general theory of twoperson zero sum games of von Neumann. 2 The maximum risk 4.5. Nevertheless l in many cases the principle can lead to very reasonable procedures. the sel of all decision procedures.75 if 0 = 0. Player II then pays Player I. supR(O. e 4. R(02. which makes the risk as large as possible. and Performance Criteria Chapter 1 if (J is continuous with density rr(8). in Example 1. But this is just Bayes comparison where 1f places equal probability on f}r and ()2. 4. 6). The maximum risk of 0* is the upper pure value of the game. J). Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to emplOy. Such comparisons make sense even if we do not interpret 1r as a prior density or frequency. < sup R(O.
. Jj ).11) Similarly we can define.. S is the convex hull of the risk points (R(0" Ji ).. R(O .'Y) that is tangent to S al the lower boundary of S. J).i/ (1 . As in Example 1.5.3. 1 q. i=l (1.3. By (1. minimizes r(d) anwng all randomized procedures.. All points of S that are on the tangent are Bayes. S= {(rI. That is...3). TWo cases arise: (I) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule. (2) The tangent is the line connecting two unonrandomized" risk points Ji .3). ~Ai=1}.13) defines a family of parallel lines with slope i/(1 .J): J E V'} where V* is the set of all procedures.R(02.A)R(O"Jj ). T2) on this line can be written AR(O"Ji ) + (1. = ~A.r).3.J.3.J. Ji )).5. (1. AR(02. which is the risk point of the Bayes rule Js (see Figure 1. For instance.i)r2 = c.Ji )] i=l A randomized Bayes procedure d. Ei==l Al = 1.3. i = 1 .R(O. 0 < i < 1.13) As c varies. R(O . J)) and consider the risk sel 2 S = {(RrO"J). If the randomized procedure l5 selects l5i with probability Ai. We will then indicate how much of what we learn carries over to the general case.R(OI. If .).3.3 The Decision Theoretic Framework 29 procedures that select among a finite set (h 1 • • • . Ji ) + (1 .3.3.2. given a prior 7r on q e. .. the Bayes risk of l5 (1. We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1. l5q of nOn randomized procedures.J) ~ L.10). A point (Tl. this point is (10.(02 ). (1.Section 1.3. Finding the Bayes rule corresponds to finding the smallest c for which the line (1. Ai >0.3. . Jj . i = 1.(0 d = i = 1 . (1.3.Ji)' r2 = tAiR(02.1). A randomized minimax procedure minimizes maxa R(8. we represent the risk of any procedure J by the vector (R( 01 ..). when ~ = 0.\. .d) anwng all randomized procedures. 9 (Figure 2 1.13) intersects S.r2):r. including randomiZed ones.A)R(02. we then define q R(O.12) r(J) = LAiEIR(IJ.14) . . This is thai line with slope . then all rules having Bayes risk c correspond to points in S that lie on the line irl + (1 .
In our example. and Performance Criteria Chapter 1 r2 I 10 5 Q(c') o o 5 10 Figure 1.3. OJ with probability (1 . and.. 2 The point where the square Q( c') defined by (1.3.• all points on the lowet boundary of S that have as tangents the y axis or lines with nonpositive slopes). the first square that touches S). We can choose two nonrandomized Bayes rules from this class. See Figure 1.3. Goals. 0 < >.3.16) touches S is the risk point of the minimax rule.11) corresponds to the values Oi with probability>. as >./(1 . = r2.9.30 Statistical Models. thus.3. by (1.3. It is the set of risk points of minimax rules because any point with smaller maximum risk would belong to Q(c) n S with c < c* contradicting the choice of c*. is Bayes against 7I". (1. the first point of contact between the squares and S is the . . < 1. < 1. To locate the risk point of the nilhimax rule consider the family of squares.13). (\)). oil. Let c' be the srr!~llest c for which Q(c) n S i 0 (i.16) whose diagonal is the line r. where 0 < >. (1. namely Oi (take'\ = 1) and OJ (take>.15) Each one of these rules. ranges from 0 to 1.3.. . = 0).e. i = 1. the set B of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of S (Le. Because changing the prior 1r corresponds to changing the slope .3.3. Then Q( c*) n S is either a point or a horizontal or vertical line segment.. The convex hull S of the risk poiots (R(B!.)') of the line given by (1.>'). R(B .
y) in S such that x < Tl and y < T2. . 1967.\ the solution of From Table 1. Using Table 1. agrees with the set of risk points of Bayes procedures. . If e is finite.0).)). R( 81 .\). if there is a randomized one. (a) For any prior there is always a nonrandomized Bayes procedure. (3) Randomized Bayes procedures are mixtures of nonrandomized ones in the Sense of (1. for instance. thus. {(x.3. 1 . for instance).\) = 5. However. then any Bayes procedure corresponding If e is not finite there are typically admissible procedures that are not Bayes. From the figure it is clear that such points must be on the lower left boundary... R(8" 0)) : 0 E V'} where V" is the set of all randomized decision procedures. A rule <5 with risk point (TIl T2) is admissible.\ ~ 0. Y < T2} has only (TI 1 T2) in common with S. l Ok}. or equivalently. j = 6 and . (e) If a Bayes prior has 1f(Oi) to 1f is admissible.5(1 .\ + 6. There is another important concept that we want to discuss in the context of the risk set. e = {fh.V) : x < TI. that <5 2 is inadmissible because 64 improves it (i.4.3. the set of all lower left boundary points of S corresponds to the class of admissible rules and.3. they are Bayes procedures. this equation becomes 3. 04) = 3 < 7 = R( 81 .4 we can see.3. Thus..3 The Decision Theoretic Framework 31 intersection between TI = T2 and the line connecting the two points corresponding to <54 and <56 .4'\ + 3(1 .6 ~ R( 8" J..Section 1. if and only if. Naturally. there is no (x. To gain some insight into the class of all admissible procedures (randomized and nOnrandomized) we again use the risk set.14). The following features exhibited by the risk set by Example 1.5 can be shown to hold generally (see Ferguson. > a for all i.14) with i = 4. . (b) The set B of risk points of Bayes procedures consists of risk points on the lower boundary of S whose tangent hyperplanes have normals pointing into the positive quadrant.59. all admissible procedures are either Bayes procedures or limits of . (d) All admissible procedures are Bayes procedures. the minimax rule is given by (1.4 < 7. all rules that are not inadmissible are called admissible. (c) If e is finite(4) and minimax procedures exist. which yields . if and only if. A decision rule <5 is said to be inadmissible if there exists another rule <5' such that <5' improves <5.. In fact.) and R( 8" 04) = 5. 0.3. we can define the risk set in general as s ~ {( R(8 .e. under some conditions.
These remarkable results. The joint distribution of Z and Y can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal. at least in their original fonn. decision rule.32 . and prediction. A meteorologist wants to estimate the amount of rainfall in the coming spring. • 1. we referto Blackwell and Girshick (1954) and Ferguson (1967). In terms of our preceding discussion. and risk through various examples including estimation. The frame we shaH fit them into is the following..Yj'. loss function. For more information on these topics.g(Z») = E[g(Z) ~ YI 2 or its square root yE(g(Z) . Using this information. Other theorems are available characterizing larger but more manageable classes of procedures.3. . ranking.4). he wants to predict the firstyear grade point averages of entering freshmen on the basis of their College Board scores. We introduce the decision theoretic foundation of statistics inclUding the notions of action space. We assume that we know the joint probability distribution of a random vector (or variable) Z and a random variable Y. Next we must specify what close means. testing. Since Y is not known. Z is the information that we have and Y the quantity to be predicted. although it usually turns out that all admissible procedures of interest are indeed nonrandomized. A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio.4 PREDICTION . Z would be the College Board score of an entering freshman and Y his or her firstyear grade point average. Section 3. Statistical Models. We stress that looking at randomized procedures is essential for these conclusions. I II . in the college admissions situation. I The prediction Example 1. The MSPE is the measure traditionally used in the . For example. at least when procedures with the same risk function are identified. One reasonable measure of "distance" is (g(Z) . ]967. Here are some further examples of the kind of situation that prompts our study in this section. The basic global comparison criteria Bayes and minimax are presented as well as a discussion of optimality by restriction and notions of admissibility. A college admissions officer has available the College Board scores at entrance and firstyear grade point averages of freshman classes for a period of several years. Bayes procedures (in various senses). Summary. we tum to the mean squared prediction error (MSPE) t>2(Y. which is the squared prediction error when g(Z) is used to predict Y.I 'jlI . Similar problems abound in every field. We want to find a function 9 defined on the range of Z such that g(Z) (the predictor) is "close" to Y. The basic biasvariance decomposition of mean square error is presented. Goals. which include the admissible rules. and Performance Criteria Chapter 1 '. A government expert wants to predict the amount of heating oil needed next winter. An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson. confidence bounds.y)2. are due essentially to Waldo They are useful because the property of being Bayes is ea·der to analyze than admissibility.2 presented important situations in which a vector z of 00variates can be used to predict an unseen response Y.
C)2 has a unique minimum at c = J.4.711). given a vector Z. EY' < 00 Y . for example.1.4.g(Z))' I Z = z] = E[(Y . that is. See Remark 1. (1. Proof.4.6.4. Lemma 1.c)' ~ Var Y + (c _ 1')'. Grenander and Rosenblatt.1 assures us that E[(Y . consider QNP and the class QL of linear predictors of the form a + We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information. In this situation all predictors are constant and the best one is that number Co that minimizes B(Y . in which Z is a constant. 1957) presuppose it. (1. Let (1. We see that E(Y . we can find the 9 that minimizes E(Y .g(Z))2.I'(Z))' < E(Y .4) . E(Y .1.4.1) < 00 if and only if E(Y . Just how widely applicable the notions of this section are will become apparent in Remark 1.3) If we now take expectations of both sides and employ the double expectation theorem (B. EY' = J.2) I'(z) = E(Y I Z = z). g(Z)) such as the mean absolute error E(lg( Z) .c)' < 00 for all c.1) follows because E(Y p.c = (Y 1') + (I' .25.L = E(Y). exists. 0 Now we can solve the problem of finding the best MSPE predictor of Y.g(z))' IZ = z] = E[(Y I'(z))' I Z = z] + [g(z) I'(z)]'. In this section we bjZj . The method that we employ to prove our elementary theorems does generalize to other measures of distance than 6. then either E(Y g(Z))' = 00 for every function g or E(Y .Section 1.20).5 and Section 3.c)2 as a function of c.l.c)2 is either oofor all c or is minimized uniquely by c In fact.4.2 where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss.4. ljZ is any random vector and Y any random variable.4.g(Z))' (1.4.) = 0 makes the cross product term vanish. we have E[(Y . and by expanding (1. 1.c) implies that p.(Y.YI) (Problems 1.4. see Example 1.16).4.4 Prediction 33 mathematical theory of prediction whose deeper results (see.4.3.L and the lemma follows. The class Q of possible predictors 9 may be the nonparametric class QN P of all 9 : R d Jo R or it may be to some subset of this class.4. see Problem 1.g(z))' I Z = z]. By the substitution theorem for conditional expectations (B. or equivalently. E(Y . we can conclude that Theorem 1. L1=1 Lemma 1. when EY' < 00. Because g(z) is a constant.
4.6).20). That is. Suppose that Var Y < 00. that is.2.4. Property (1. we can derive the following theorem. If E(IYI) < 00 but Z and Y are otherwise arbitrary. E{h(Z)<1 E{ E[h(Z)< I Z]} E{h(Z)E[Y I'(Z) I Z]} = 0 because E[Y I'(Z) I Z] = I'(Z) I'(Z) = O. and Performance Criteria Chapter 1 for every 9 with strict inequality holding unless g(Z) best MSPE predictor.4.4. (1. = I'(Z). so is the other. (1. Proof.E(V)J Let € = Y . then (a) f is uncorrelated with every function ofZ (b) I'(Z) and < are uncorrelated (c) Var(Y) = Var I'(Z) + Var <. when E(Y') < 00.4.4. Goals. Write Var(Y I z) for the variance of the condition distribution of Y given Z = z. which will prove of importance in estimation theory. Theorem 1.4. I'(Z) is the unique E(Y .1.6) is linked to a notion that we now define: Two random variables U and V with EIUVI < 00 are said to be uncorrelated if EIV .4. Var(Y I z) = E([Y .Il(Z) denote the random prediction error. and recall (B.E(Y I z)]' I z).6) and that (1.1(c) is equivalent to (1. = O. 1.7) IJVar Y • < 00 strict inequality ooids unless 1 Y = E(Y I Z) (1.5) is obtained by taking g(z) ~ E(Y) for all z. Equivalently U and V are uncorrelated if either EV[U E(U)] = 0 or EUIV . Properties (b) and (c) follow from (a).g(Z)' = E(Y .I'(Z))' + E(g(Z) 1'(Z»2 ( 1. then Var(E(Y I Z)) < Var Y.5) becomes. To show (a).8) or equivalently unless Y is a function oJZ. .6) which is generally valid because if one side is infinite. Vat Y = E(Var(Y I Z» + Var(E(Y I Z).5) An important special case of (1.4. then (1.4. Infact. then we can write Proposition 1. 0 Note that Proposition 1. As a consequence of (1.E(v)IIU .4.E(U)] = O.5) follows from (a) because (a) implies that the cross product term in the expansionof E ([Y 1'(z)J + [I'(z) g(z)IP vanishes. then by the iterated expectation theorem.4.34 Statistical Models.4. let h(Z) be any function of Z.
if we are trying to guess.6).025 0.7) follows immediately from (1. I. These fractional figures are not too meaningful as predictors of the natural number values of Y. But this predictor is also the right one. I z) = E(Y I Z).15 3 0.45 ~ 1 The MSPE of the best predictor can be calculated in two ways. We want to predict the number of failures for a given day knowing the state of the assembly line for the month. . also.15 I 0. as we reasonably might. the average number of failures per day in a given month.4.1.4.45. and only if.E(Y I Z))' ~ 0 By (A.4.50 z\y 1 0 0. or 3 failures among all days.10 I 0. p(z.'" 1 Y.:.25 1 0. An assembly line operates either at full.y) pz(z) 0. and only if.30 I 0.25 0. y) = 0.885. E(Var(Y I Z)) ~ E(Y .Section 1.10 0. whereas the column sums py (y) yield the frequency of 0.E(Y I Z))' =L x 3 ~)y y=o . Equality in (1.25 2 0. y) = p (Z = Z 1 Y = y) of the number of shutdowns Y and the capacity state Z of the line for a randomly chosen day. 1.8) is true. The following table gives the frequency function p( z. Within any given month the capacity status does not change.025 0. The first is direct. o Example 1.1 2.10 0. Each day there can be 0.4. 2. i=l 3 E (Y I Z = ~) ~ 2.30 py(y) I 0. E(Y .lI.25 0.9) this can hold if. The assertion (1.05 0.10 0. E (Y I Z = ±) = 1.05 0.20.E(Y I Z = z))2 p (z. We find E(Y I Z = 1) = L iF[Y = i I Z = 1] ~ 2.10.05 0. or 3 shutdowns due to mechanical failure. half. 2.4. In this case if Yi represents the number of failures on day i and Z the state of the assembly line.4 Prediction 35 Proof.7) can hold if. or quarter capacity. (1. The row sums of the entries pz (z) (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state. the best predictor is E (30.
the best predictor of Y using Z is the linear function Example 1. whereas its magnitude measures the degree of such dependence.11) I The line y = /lY + p(uy juz)(z ~ /lz).2.36 Statistical Models. Goals.6) writing. p) distribution.4. Theorem B. If p = 0. In the bivariate normal case. O"~. The Bivariate Normal Distribution. = Z))2 I Z = z) = u~(l _ = u~(1 _ p'). 2 I . E((Y . Similarly.4. Suppose Y and Z are bivariate normal random variables with the same mean .p')). Because of (1.y as we would expect in the case of independence. this quantity is just rl'. the best predictor is just the constant f. " " is independent of z. the predictor is a monotone increasing function of Z indicating that large (small) values of Y tend to be associated with large (small) values of Z. If (Z.LY. o tribution of Y given Z = z is N City + p(Uy j UZ ) (z . can reasonably be thought of as a measure of dependence.885 as before.10) I I " The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution.4.4. E(Y . . If p > O.6) we can also write Var /lO(Z) p = VarY . The larger this quantity the more dependent Z and Yare. which corresponds to the best predictor of Y given Z in the bivariate normal model. E(Y ~ E(Y I Z))' VarY ~ Var(E(Y I Z)) E(Y') ~ E[(E(Y I Z))'] I>'PY(Y) .2 tells us that the conditional dis /lO(Z) Because . a'k. (1.9) I .E(Y I Z p') (1.E(Y I Z))' (1. which is the MSPE of the best constant predictor.4. p < 0 indicates that large values of Z tend to go with small values of Y and we have negative dependence. . Y) has a N(Jlz l f. (1 . u~. = z)]'pz(z) 0. One minus the ratio of the MSPE of the best predictor of Y given Z to Var Y./lz).4. The term regression was coined by Francis Galton and is based on the following observation.L[E(Y I Z y .4. Therefore.J. = /lY + p(uyjuz)(Z ~ /lZ). Thus. and Performance Criteria Chapter 1 The second way is to use (1. for this family of distributions the sign of the correlation coefficient gives the type of dependence between Z and Y. Regression toward the mean. is usually called the regression (line) of Y on Z. the MSPE of our predictor is given by.
6) in which I' (I"~. Yn ) from the population. I"Y = E(Y). . We shall see how to do this in Chapter 2. . (Zn. (1 . variance cr2. Let Z = (Zl.. The variability of the predicted value about 11. .p)1" + pZ) = p'(T'. be less than that of the actual heights and indeed Var((! . I"Y) T. Y) is unavailable and the regression line is estimated on the basis of a sample (Zl. The Multivariate Normal Distribution. In Galton's case. Thus. tall fathers tend to have shorter sons. YI ). y)T has a (d + !) multivariate normal.Section 1. The quadratic fonn EyzEZ"iEzy is positive except when the joint nonnal distribution is degenerate.. uYYlz) where. Y» T T ~ E yZ and Uyy = Var(Y). coefficient ofdetennination or population Rsquared. there is "regression toward the mean. where the last identity follows from (1. ttd)T and suppose that (ZT. should. 0 Example 1. these were the heights of a randomly selected father (Z) and his son (Y) from a large human population.12) with MSPE ElY l"o(Z)]' ~ E{EIY l"o(zlI' I Z} = E(uYYlz) ~ Uyy  EyzEziEzy.COV(Zd.. We write Mee = ' ~! _ PZy ElY /lO(Z))' ~ Var l"o(Z) Var Y Var Y . Then the predicted height of the son.5 states that the conditional distribution of Y given Z = z is N(I"Y +(zl'zf. " Zd)T be a d x 1 covariate vector with mean JLz = (ttl. the distribution of (Z. the MCC equals the square of the .4. the best predictor E(Y I Z) ofY is the linear function (1. By (1. so the MSPE of !lo(Z) is smaller than the MSPE of the constant predictor J1y.3. Ezy O"yy E zz is the d x d variancecovariance matrix Var(Z) . E zy = (COV(Z" Y).4 Prediction 37 tt.. Thus." This is compensated for by "progression" toward the mean among the sons of shorter fathers and there is no paradox.. or the average height of sbns whose fathers are the height Z. One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with Y. Note that in practice. in particular in Galton's studies.4. usual correlation coefficient p = UZy / crfyCT lz when d = 1. and positive correlation p.. E). .4. This quantity is called the multiple correlation coefficient (MCC). is closer to the population mean of heights tt than is the height of the father. Theorem B.6.4.. . .8 = EziEzy anduYYlz ~ uyyEyzEziEzy. consequently.p)tt + pZ. distribution (Section B. N d +1 (I'.6). .8. .11).
2. y)T is unknown. and Performance Criteria Chapter 1 For example.3%. the hnear predictor l'a(Z) and its MSPE will be estimated using a 0 sample (Zi.. Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of Z and Y in order to calcujate E(Y I Z) and that the best predictor may be a complicated function of Z.)T. p~y = . Suppose that E(Z') and E(Y') are finite and Z and Y are not constant.zz ~ (.39 l. The percentage reductions knowing the father's and both parent's heights are 20..335.4.bZ)' = E(Y') + E(Z')(b  Therefore. and parents. Then the unique best zero intercept linear predictor i~ obtained by taking E(ZY) b=ba = E(Z') .~ ~:~~). Y. The best linear predictor. knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33.l Proof. E(Y . are P~. We first do the onedimensional case.4. If we are willing to sacrifice absolute excellence.393.bZ)' is uniquely minintized by b = ba. In practice. and E(Y _ boZ)' = E(Y') _ [E(ZY)]' E(Z') (1.. yn)T See Sections 2. Z2 = father's height). . respectively.. let Y and Z = (Zl. Z2)T be the heights in inches of a IOyearold girl and her parents (Zl = mother's height.3. E(Y . 29W· Then the strength of association between a girl's height and those of her mother and father.zy ~ (407. The problem of finding the best MSPE predictor is solved by Theorem 1. In words. The natural class to begin with is that of linear combinations of components of Z. Goals.:.E(Z')b~. Let us call any random variable of the form a + bZ a linear predictor and any such variable with a = 0 a zero intercept linear predictor.bZ}' = E(Y)  b E(Z) 1· = (Y  [Z(b .1 and 2.y = . P~. Suppose(l) that (ZT l Y) T is trivariate nonnal with Var(Y) = 6.38 Statistical Models. What is the best (zero intercept) linear predictor of Y in the sense of minimizing MSPE? The answer is given by: Theorem 1.y = .5%. we can avoid both objections by looking for a predictor that is best within a class of simple predictors. . l.4. respectively. We expand {Y . when the distribution of (ZT.13) .209.ba) + Zba]}' to get ba )' .1. Y) a I VarZ. (Z~.9% and 39. whereas the unique best linear predictor is ILL (Z) = al + bl Z where b = Cov(Z.
l). 0 Note that if E(Y I Z) is of the form a + bZ. E(Y . If EY' and (E([Z . by (1.a .boZ)' = 0.4.b. Theorem 1.a.4.bZ)2 is uniquely minimized by taking a ~ E(Y) . (1.5). In that example nothing is lost by using linear prediction.bZ) + (E(Y) .4 Prediction 39 To prove the second assertion of the theorem note that by (lA.bE(Z) . Note that R(a. We can now apply the result on zero intercept linear predictors to the variables Z .4.13) we obtain tlte proof of the CauchySchwarz inequality (A. Therefore.I'l(Z)1' depends on the joint distribution P of only through the expectation J. E(Y .E(Z))]'. Best Multivariate Linear Predictor. and b = b1 • because.E(ZWW ' E([Z . This is because E(Y ..2.4.bE(Z). Let Po denote = . This is in accordance with our evaluation of E(Y I Z) in Example 1. E[Y .b(Z . I 1.4.I1.Section 1.Z)'. whatever be b. in Example 1. . is the unique minimizing value.E(Z)IIZ .E(Y)) .t and covariance E of X.4. it must coincide with the best linear predictor.1).I'd Z )]' = 1 05 ElY I'(Z)j' .14) Yl T Ep[Y . From (1.E(Z)J))1 exist. and only if. Substituting this value of a in E(Y .16) directly by calculating E(Y .1.a .l7) in the appendix. then the Unique best MSPE predictor is I'dZ) Proof.a)'. = I'Y + (Z  J. whiclt corresponds to Y = boZo We could similarly obtain (A. That is. E(Y .4.bZ)2 we see that the b we seek minimizes E[(Y .1 the best linear predictor and best predictor differ (see Figure 1. b) X ~ (ZT. OUf linear predictor is of the fonn d I'I(Z) = a + LbjZj j=l = a+ ZTb [3 = (E([Z . On the other hand.E(Z) and Y .tzf [3.E(Z)f[Z .E(Y) to conclude that b.4.4. then a = a. A loss of about 5% is incurred by using the best linear predictor. if the best predictor is linear.E(Y)]) = EziEzy.bZ)' ~ Var(Y .boZ)' > 0 is equivalent to the CauchySchwarz inequality with equality holding if.E(Z)J[Y .a . 0 Remark 1.
I'LCZ))..4. I"(Z) = E(Y I Z) = a + ZT 13 for unknown Q: E R and f3 E Rd.. (1.4.4. .45z. .2.1. that is. j = 0.15) . Y).4. 0 Remark 1.00 z Figure 1. Suppose the mooel for I'(Z) is linear. E).(a + ZT 13)]) = 0. Thus. In the general. Goals. E(Zj[Y . Zd are uncorrelated. . b) is miuimized by (1. our new proof shows how secondmoment results sometimes can be established by "connecting" them to the nannal distribution.I'I(Z)]'. that is. 0 Remark 1. We could also have established Theorem 1.25 0..4. .4. and R(a.4. the multivariate uormal. By Example 1.05 + 1. b) = Epo IY .40 Statistical Models.17 for an overall measure of the strength of this relationship. and Performance Criteria Chapter 1 y .4.3. . distribution and let Ro(a. 0 Remark 1.4. N(/L. The line represents the best linear predictor y = 1.3 to d > 1.I'(Z) and each of Zo. By Proposition 1..14). A third approach using calculus is given in Problem 1. 2 • • 1 o o 0.d. .19. p~y = Corr2 (y. .4. case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the correlation between Y and the best linear predictor of Y.75 1. However. .. . Because P and Po have the same /L and E. Set Zo = 1. We want to express Q: and f3 in terms of moments of (Z.3. not necessarily nonnal. The three dots give the best predictor.4. ~ Y . Ro(a. See Problem 1.50 0. b). R(a. thus.4.4. b) = Ro(a. the MCC gives the strength of the linear relationship between Z and Y.14). b) is minimized by (1.4 by extending the proof of Theorem 1.4.1(a).
4. o Summary.5 SUFFICIENCY Once we have postulated a statistical model. we see that r(5) ~ MSPE for squared error loss 1(9.(Y I 9). Thus.I'(Z) is orthogonal to I'(Z) and to I'dZ). Remark 1.(Y 1ge) ~ "(I'(Z) 19d (1.' (l'e(Z).4. We return to this in Section 3.6.I 0 and there is a 90 E 9 such that 0 < oc form a Hilbert go = arg inf{".Section 1.5. then 90(Z) is called the projection of Y on the space 9 of functions of Z and we write go(Z) = . but as we have seen in Sec~ tion 1.2 and the Bayes risk (1.1. and projection 1r notation.4.2. Note that (1.16) is the Pythagorean identity. Moreover. Remark 1. It is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear. The notion of mean squared prediction error (MSPE) is introduced... I'(Z» Y . we can conclude that I'(Z) = . Using the distance D.(Y.23).3.(Y 19NP). With these concepts the results of this section are linked to the general Hilbert space results of Section 8.4.15) for a and f3 gives (1. When the class 9 of possible predictors 9 with Elg(Z)1 space as defined in Section B.3. The range of T is any space of objects T.4..16) (1..Thus. I'(Z» + ""'(Y.14) (Problem 1.8) defined by r(5) = E[I(II. I'dZ» = ". and it is shown that if we want to predict Y on the basis of information contained in a random vector Z.4. the optimal MSPE predictor is the conditional expected value of Y given Z. The optimal MSPE predictor in the multivariate normal distribution is presented. g(Z) and h(Z) are said to be orthogonal if at least one has expected value zero and E[g(Z)h(Z)] ~ O.10. Because the multivariate normal model is a linear model. can also be a set of functions. 1.5) = (9 . usually R or Rk.17) ""'(Y. If T assigns the same value to different sample points. Consider the Bayesian model of Section 1.5)'. I'dZ) = .5 Sufficiency 41 Solving (1. we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation.4. T(X) = X loses information about the Xi as soon as n > 1. then by recording or taking into account only the value of T(X) we have a reduction of the data..12). .4.. We consider situations in which the goal is to predict the (perhaps in the future) value of a random variable Y. the optimal MSPE predictor E(6 I X) is the Bayes procedure for squared error loss. We begin by fonnalizing what we mean by "a reduction of the data" X EX. this gives a new derivation of (1. Recall that a statistic is any function of the observations generically denoted by T(X) or T.g(Z»): 9 E g).4.5(X»]. If we identify II with Y and X with Z.
Xl and X 2 are independent and identically distributed exponential random variables with parameter O. when Xl + X.2.4). X 2 the time between the arrival of the first and second customers. Thus. Thus. Therefore. .'" . However. . Xl has aU(O. Goals. .42 Statistical Models.X. it is intuitively clear that if we are interested in the proportion 0 of defective items nothing is lost in this situation by recording and using only T. It follows that..2.1) where Xi is 0 or 1 and t = L:~ I Xi. We prove that T = X I + X 2 is sufficient for O. A statistic T(X) is called sufficient for PEP or the parameter if the conditional distribution of X given T(X) = t does not involve O. Instead of keeping track of several numbers. is a statistic that maps many different values of (Xl. A machine produces n items in succession. and Performance Criteria Chapter 1 Even T(X J. given that P is valid. " . We give a decision theory interpretation that follows. are independent and the first of these statistics has a uniform distribution on (0. X.5. whatever be 8.) is conditionally distributed as (X.Xn ) does not contain any further infonnation about 0 or equivalently P.' • ... T = L~~l Xi.l.l.'" . For instance..l we see that given Xl + X2 = t.9. Thus. + X. T is a sufficient statistic for O.) are the same and we can conclude that given Xl + X. The most trivial example of a sufficient statistic is T(X) = X because by any interpretation the conditional distribution of X given T(X) = X is point mass at X. where 0 is unknown.. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information.Xn = xnl = 8'(1. Then X = (Xl.5.. Y) where X is uniform on (0.) and that of Xlt/(X l + X.) and Xl +X.. 16. the sample X = (Xl.l. X.8)n' (1. 0 In both of the foregoing examples considerable reduction has been achieved. X(n))' loses information about the labels of the Xi. . the conditional distribution of Xl = [XI/(X l + X.3. By (A. Begin by noting that according to TheoremB. 1). Although the sufficient statistics we have obtained are "natural. Each item produced is good with probability 0 and defective with probability 1. The total number of defective items observed. .)](X I + X. (Xl. . once the value of a sufficient statistic T is known. We could then represent the data by a vector X = (Xl. One way of making the notion "a statistic whose use involves no loss of infonnation" precise is the following. Example 1. P[X I = XI. t) and Y = t . Suppose there is no dependence between the quality of the items produced and let Xi = 1 if the ith item is good and 0 otherwise. in the context of a model P = {Pe : (J E e}. t) distribution.Xn ) is the record of n Bernoulli trials with probability 8. • X n ) (X(I) . By Example B. Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) Let Xl be the time of arrival of the first customer. By (A. recording at each stage whether the examined item was defective or not. .5." it is important to notice that there are many others e.1 we had sampled the manufactured items in order. suppose that in Example 1.) given Xl + X.0. the conditional distribution of XI/(X l + X. Using our discussion in Section B. whatever be 8. we need only record one. the conditional distribution of X 0 given T = L:~ I Xi = t does not involve O. .5). . = t.1. 1) whatever be t.'" .1. XI/(X l +X.Xn ) where Xi = 1 if the ith item sampled is defective and Xi = 0 otherwise. o Example 1. ~ t. . X n ) into the same number. is sufficient. = tis U(O.
if (152) holds.O) I: {x:T(x)=t.] o if T(xj) oF ti if T(xj) = Ii. then T 1 and T2 provide the same information and achieve the same reduction of the data.j is independent ofe on cach of the sets Si = {O: porT ~ til> OJ. checking sufficiency directly is difficult because we need to compute the conditional distribution.I. The complete result is established for instance by Lehmann (1997. a simple necessary and sufficient criterion for a statistic to be sufficient is available. i = 1. .5) .4) if T(xj) ~ Ii 41 o if T(xj) oF t i . we need only show that Pl/[X = xjlT = til is independent of for every i and j. It is often referred to as the factorization theorem for sufficient statistics. To prove the sufficiency of (152).") be the set of possible realizations of X and let t i = T(Xi)' Then T is discrete and 2::~ 1 porT = Ii] = 1 for every e.]/porT = Ii] p(x..O) Theorem I.In a regular model. 0) porT ~ Ii] ~~CC+ g(li.e) = g(ti. Such statistics are called equivalent. and e and a/unction h defined (152) = g(T(x)..5 Sufficiency 43 that will do the same job.} p(x. a statistic T(X) with range T is sufficient for e if. In general.. Now. We shall give the proof in the discrete case.} h(x). Section 2. Applying (153) we arrive at. (153) By (B. By our definition of conditional probability in the discrete case.6). Let (Xl. • only if. Neyman. it is enough to show that Po[X = XjlT ~ l. forO E Si.. O)h(x) for all X E X. e porT = til = I: {x:T(x)=t. there exists afunction g(t. if Tl and T2 are any two statistics such that 7 1 (x) = T 1 (y) if and only if T2(x) = T 2(y). Po [X ~ XjlT = t. Fortunately.S. More generally.] Po[X = Xj. e) definedfor tin T and e in on X such that p(X.Section 1. 0 E 8.Ll) and (152).2. T = t. X2. This result was proved in various forms by Fisher. Proof. h(xj) (1..5. and Halmos and Savage. Po[X = XjlT = t. e)h(Xj) Po[T (15. Being told that the numbers of successeS in five trials is three is the same as knowing that the difference between the numbers of successes and the number of failures is one.
T = T(x)] = g(T(x). . ...8) if all the Xi are > 0.3).. 0 Example 1.5. O} = 0 otherwise.~.X n JI) = 0 otherwise. T is sufficient... I x n ) = 1 if all the Xi are > 0.5.[27f.. ' . 10 fact.. Common sense indicates that to get information about B. and P(XI' . 0 o Example 1. if T is sufficient. Conversely..').. and Performance Criteria Chapter 1 Therefore.2.1. . . Consider a population with () members labeled consecutively from I to B.Xn ) is given by (see (A.5. X n ). = max(xl. l X n are the interarrival times for n customers.4)). .4. By Theorem 1.5.9) can be rewritten as " 1 Xn . n P(Xl..xn }.5.8) = eneStift > 0. . 1 X n be independent and identically distributed random variables each having a normal distribution with mean {l and variance (j2.5.. Let Xl.1 to conclude that T(X 1 . which admits simple sufficient statistics and to which this example belongs.O)=onexP[OLXil i=l (1.. X(n) is a sufficient statistic for O.X. . 1=1 (!.Il) . then the joint density of (X" .5. .S.5. A whole class of distributions.10) P(Xl. are introduced in the next section.. Then the density of (X" .5.'t n /'[exp{ .16. Estimating the Size of a Population.Xn ) = L~ IXi is sufficient. let g(ti' 0) = porT = tiL h(x) = P[X = xIT(X) = til Then (1. both of which are unknown. ..' (L x~ 1=1 1 n n 2JL LXi))].n /' exp{ .7) o Example 1.3. We may apply Theorem 1. > 0.9) if every Xi is an integer between 1 and B and p( X 11 • (1. [27".Xn are recorded. and h(xl •. 0) by (B. Takeg(t..Xn ) is given by I. The probability distribution of X "is given by (1.'I. }][exp{ . Let 0 = (JL. •.' L(Xi ..5. 1.2. .. . and both functions = 0 otherwise.5.. The population is sampled with replacement and n members of the population are observed and their labels Xl.. O)h(x) (1.JL)'} ~=l I n I! 2 . .44 Statistical Models.. Goals. If X 1. .Xn. . we need only keeep track of X(n) = max(X . we can show that X(n) is sufficient.O) where x(n) = onl{x(n) < O}.'" .6) p(x. _ _ _1 .2 (continued). Expression (1. ~ Po[X = x.
··· . we can. if T(X) is sufficient.5.. EYi 2 .X)'].xnJ)) is itself a function of (L~ upon applying Theorem 1. Specifically.2a I3.l) distributed. Then 6 = {{31. X is sufficient.~ I: xn By the factorization theorem. [1/(n i=l n n 1)1 I:(Xi . X n are independent identically N(I). Then pix. Ezi 1'i) is sufficient for 6.( 2?Ta ')~ exp p {E(fJ.1.9) and _ (y.5. where we assume diat the given constants {Zi} are not all identical. that Y I . Suppose X" . in Example 1. find a randomized decision rule o'(T(X)) depending only on T(X) that does as well as O(X) in the sense of having the same risk function..~0)}(2"r~n exp{ . that is. I)) = exp{nI)(x . X n ) = [(lin) where I: Xi.1. respectively.6. T = (EYi . for any decision procedure o(x)..• Yn are independent. Here is an example. By randomized we mean that o'(T(X») can be generated from the value random mechanism not depending on B. 1 2 2 .' + 213 2a + 213. .5.4 with d = 2. R(O.5. I)) . Thus. a<. . L~l x~) and () only and T(X 1 . Suppose.o) = R(O. X n ) = (I: Xi.} + EY.12) t ofT(X) and a Example 1.. (1.) i=l i=l is sufficient for B.Section 1. a 2 ). we construct a rule O'(X) with the sarne risk = mean squared error as o(X) as follows: Conditionally. .1 we can conclude that n n Xi. Yi N(Jli. i= 1 = (lin) L~ 1 Xi. Let o(X) = Xl' Using only X.5 Sufficiency 45 I Evidently P{Xl 1 • •• . o Sufficiency and decision theory Sufficiency can be given a clear operational interpretation in the decision theoretic setting. a 2)T is identifiable (Problem 1.5. with JLi following the linear regresssion model ("'J .EZiY. {32. The first and second components of this vector are called the sample mean and the sample variance. 0 X Example 1. An equivalent sufficient statistic in this situation that is frequently used IS S(X" . .Zi)'} exp {Er. . 2:>. 0') for allO.
Thus. . it is Bayes sufficient for every 11.5.1 (continued).1 (Bernoulli trials) we saw that the posterior distribution given X = x is 1 Xi. by the double expectation theorem. Example 1.14. L~ I xl) and S(X) = (X" . 0 and X are independent given T(X). we can find a transformation r such that T(X) = r(S(X)).6') ~ E{E[R(O. 0 The proof of (1. Goals. ) Var(T') = E[Var(T'IX)] + VarIE(T'IX)]  ~ nl n + 1 n ~ 1 = Var(X . . (Kolmogorov). in that.. n~l) distribution. O)h(x) E:  . ). Using Section B.Xn is a N(p.. In this Bernoulli trials case.6). This result and a partial converse is the subject of Problem 1. T(X) is Bayes sufficient for IT if the posterior distribution of 8 given X = x is the same as the posterior (conditional) distribution of () given T(X) = T(x) for all x.4. = 2:7 Definition. Theorem }. the same as the posterior distribution given T(X) = L~l Xi = k. if Xl.12) follows along the lines of the preceding example: Given T(X) = t. In Example 1.O) = g(S(x).6(X»)IT]} = R(O. where k In this situation we call T Bayes sufficient.6'(T))IT]} ~ E{E[R(O.l and (1.5.2. o Sufficiency aud Bayes models There is a natural notion of sufficiency of a statistic T in the Bayesian context where in addition to the model P = {Po : 0 E e} we postulate a prior distribution 11 for e. we find E(T') ~ E[E(T'IX)] ~ E(X) ~ I" ~ E(X . T = I Xi was shown to be sufficient. (T2) sample n > 2...6). We define the statistic T(X) to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic S(X). Then by the factorization theorem we can write p(x. This 6' (T(X)) will have the same risk as 6' (X) because.5. X n ) are hoth sufficient. then T(X) = (L~ I Xi. Equivalently. Minimal sufficiency For any model there are many sufficient statistics: Thus. 6'(X) and 6(X) have the same mean squared error. 0) as p(x. Now draw 6' randomly from this conditional distribution. and Performance Criteria Chapter 1 given X = t.2. R(O. But T(X) provides a greater reduction of the data.5. choose T" = 15"' (X) from the normal N(t. the distribution of 6(X) does not depend on O. IfT(X) is sufficient for 0. .46 Statistical Models.. Let SeX) be any other sufficient statistic.
then Lx(O) = (27rO. For any two fixed (h and fh.5 Sufficiency 47 Combining this with (1. the statistic L takes on the value Lx.5. L x (()) gives the probability of observing the point x. when we think of Lx(B) as a function of ().)}.o)nT ~g(S(x).4 (continued). for example. (T'). Example 1. L Xf) i=l i=l n n = (0 10 0. we find OT(1 . = 2 log Lx(O. ~ 2/3 and 0.) ~ (L Xi.11) is determined by the twodimensional sufficient statistic T Set 0 = (TI.1). It is a statistic whose values are functions. the likelihood function (1. The likelihood function o The preceding example shows how we can use p(x.(t. The formula (1. 1/3)]}/21og 2. Now. However. 2/3)/g(S(x). if we set 0. for a given observed x. the ratio of both sides of the foregoing gives In particular.T. }exp ( I .O E e. ()) x E X}. we find T = r(S(x)) ~ (log[2ng(s(x). 20.5. t.Section 1. ()) for different values of () and the factorization theorem to establish that a sufficient statistic is minimally sufficient. In this N(/". as a function of (). Thus. for a given 8. 20' 20 I t. In the discrete case. Thus. (T') example.) n/' exp {nO.O). the "likelihood" or "plausibility" of various 8. take the log of both sides of this equation and solve for T. t2) because..O)h(x) forall O. Lx is a map from the sample space X to the class T of functions {() t p(x. if X = x.5. T is minimally sufficient.) = (/".8) for the posterior distribution can then be remembered as Posterior ex: (Prior) x (Likelihood) where the sign ex denotes proportionality as functions of 8.1) nlog27r . L x (') determines (iI. ~ 1/3. We define the likelihood function L for a given observed data vector x as Lx(O) = p(x. it gives.5. In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around x.
then for any decision procedure J(X). the ranks..1. If T(X) is sufficient for 0. Summary. or if T(X) ~ (X~l)' . 1 X n ).48 Statistical Models. Goals. but if in fact a 2 I 1 all information about a 2 is contained in the residuals.Rn ).5. The factorization theorem states that T(X) is sufficient for () if and only if there exist functions g( t. Then Ax is minimal sufficient. if the conditional distribution of X given T(X) = t does not involve O.. as in the Example 1. if a 2 = 1 is postulated.. B ul the ranks are needed if we want to look for possible dependencies in the observations as in Example 1. For instance.X Cn )' the order statistics. Consider an experiment with observation vector X = (Xl •. S(X) = (R" . We say that a statistic T (X) is sufficient for PEP. hence.5. . 0) and heX) such that p(X. 1.. and Scheffe. 1) (Problem 1. .17). We show the following result: If T(X) is sufficient for 0. Thus. we can find a randomized decision rule J'(T(X» depending only on the value of t = T(X) and not on 0 such that J and J' have identical risk functions. The likelihood function is defined for a given data vector of .. 0 the likelihood ratio of () to Bo. O)h(X).1 (continued) we can show that T and.. a statistic closely related to L solves the minimal sufficiency problem in general. and Performance Criteria Chapter 1 with a similar expression for t 1 in terms of Lx (O. The "irrelevant" part of the data We can always rewrite the original X as (T(X).X).).5. Let Ax OJ c (x: p(x. A sufficient statistic T(X) is minimally sufficient for () if for any other sufficient statistic SeX) we can find a ttansformation r such that T(X) = r(S(X).X)2) is sufficient. Lehmann. . OJ denote the frequency function or density of X. hence. SeX) becomes irrelevant (ancillary) for inference if T(X) is known but only if P is valid. but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding S(X)see Problem 1. L is minimal sufficient.4.X. 1) and Lx (1. where R i ~ Lj~t I(X.5. We define a statistic T(X) to be Bayes sufficient for a prior 7r if the posterior distribution of f:} given X = x is the same as the posterior distribution of 0 given T(X) = T(x) for all X. (X.. SeX)~ where SeX) is a statistic needed to uniquely detennine x once we know the sufficient statistic T(x). 1 X n are a random sample. • X( n» is sufficient. . 0) = g(T(X).5.O) > for all B.12 for a proof of this theorem of Dynkin.. < X. . . See Problem ((x:). X is sufficient.13.5.Oo) > OJ = Lx~6o)' Thus. a 2 is assumed unknown. . it is Bayes sufficient for O. If P specifies that X 11 •. Ax is the function valued statistic that at (J takes on the value p x. Suppose there exists 00 such that (x: p(x.5.. If.Xn . l:~ I (Xi . ifT(X) = X we can take SeX) ~ (Xl . in Example 1. (X( I)' . 0 In fact. itself sufficient. or for the parameter O. Suppose that X has distribution in the class P ~ {Po : 0 E e}. L is a statistic that is equivalent to ((1' i 2) and. the residuals. _' . Let p(X. By arguing as in Example 1. Thus.5.
6 Exponential Families 49 observations X to be the function of B defined by Lx(B) = p(X. these families form the basis for an important class of models called generalized linear models. and Dannois through investigations of this property(l). We shall refer to T as a natural sufficient statistic of the family.B) ~ h(x)exp{'7(B)T(x) . This is clear because we need only identify exp{'7(B)T(x) .6 EXPONENTIAL FAMILIES The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. B>O. B). B. In a oneparameter exponential family the random variable T(X) is sufficient for B. 1. They will reappear in several connections in this book.exp{xlogBB}. Subsequently. by the factorization theorem.6. then. such that the density (frequency) functions p(x.B) and h(x) with itself in the factorization theorem. binomial. x. B) > O} c {x: p(x. Poisson. p(x. More generally. We return to these models in Chapter 2. Then. Example 1. = 1 .6.o I x. Probability models with these common features include normal.2"" }. The Poisson Distribution. gamma. if there exist realvalued functions fJ(B). Ax(B) is a minimally sufficient statistic.B(B)} (1. Note that the functions 1}.1 The OneParameter Case The family of distributions of a model {Pe : B E 8}. for x E {O.2) .B(B)} with g(T(x). B( 0) on e. B) of the Pe may be written p(x. realvalued functions T and h on Rq.B) = eXe.B o) > O}. is said to be a oneparameter exponentialfami/y. 1. Pitman. and T are not unique.6. and multinomial regression models used to relate a response variable Y to a set of predictor variables. (1. many other common features of these families were discovered and they have become important in much of the modern theory of statistics. Here are some examples. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman.1) where x E X c Rq. B E' sufficient for B. 1. beta. BEe.Section 1. and if there is a value B E such that o e e.6. IfT(X) is {x: p(x. the likelihood ratio Ax(B) = Lx(B) Lx(Bo) depends on X through T(X) only.1. Let Po be the Poisson distribution with unknown mean IJ.
T(x)=x.O) f(z. This is a oneparameter exponential family distribution with q = 2. n) < 0 < 1. Here is an example where q (1. Z and W are f(x.B(O)=nlog(I0). B(O) = logO. 1 X m ) considered as a random vector in Rmq and p(x.~O'. B(O) = 0.1](0) = . . Then > 0. 1. suppose Xl. the Pe form a oneparameter exponential family with ~~I . T(x) ~ x.4) Therefore. o The families of distributions obtained by sampling from oneparameter exponential families are themselves oneparameter exponential families. q = 1. Specifically..1](0) = logO. .h(X)=( : ) .6.Xm are independent and identically distributed with common distribution Pe.. and Performance Criteria Chapter 1 Therefore.6.~z. I I! . .' x.0)]. is the family of distributions of X = (Xl' . 1).T(x) = (y  z)'. . + OW. Example 1.6. If {pJ=».6.O) = ( : ) 0'(1_ Or" ( : ) ex p [xlog(1 ~ 0) + nlog(l.O) = f(z)f. for x E {O. p(x. (1.~O'(y  z)' logO}.6) .1](O)=log(I~0). B) are the corresponding density (frequency) functions. fonn a oneparameter exponential family as in (1.(y I z) = <p(zW1<p((y . Statistical Models.) i=1 m B(8)] ( 1.6. Suppose X has a B(n.5) o = 2.) exp[~(O)T(x.y.Z)OI) Z)'O'J} (2rrO)1 exp { (2rr)1 exp {  ~ [z' + (y  ~z' } exp { . the family of distributions of X is a oneparameter exponential family with q=I.O) IIh(x. h(x) = (2rr)1 exp { .2.h(x) = . .3) I' .50 .6. 0 Example 1. .} . The Binomial Family..1).. Suppose X = (Z. 0 E e. 1 (1.6. 0 Then.3. y)T where Y = Z independent N(O. where the p. .. . we have p(x. 0) distribution. Goals. .
This family of Poisson distIibutions is oneparameter exponential whatever be m. h(m)(x) ~ i=l II h(xi). if X = (Xl.6.2) r(p. 1 X m). pi pi I:::n TABLE 1. .6. .' fixed I' fixed p fixed . . and pJ LT(Xi).6 Exponential Families 51 where x = (x 1 .\ 1"/'" (p (s 1) 1) 1) (r T(x) x (x . X m ) corresponding to the oneparameter exponential fam1 T(Xd... B.Xm ) = I Xi is distributed as P(mO). fl. the m) fonn a oneparameter exponential family. then the family of distributions of the statistic T(X) is a oneparameter exponential family of discrete distributions whose frequency functions may be written h'Wexp{1)(O)t .Il)" . If we use the superscript m to denote the corresponding T. . and h.x log x 10g(l x) logx The statistic T(m) (X I.6. ily of distributions of a sample from any of the foregoin~ is just In our first Example 1. Let {Pe } be a oneparameter exponential family of discrete distributions with corresponding functions T...1 Family of distributions .8) . . then the m ) fonn a I Xi.. 1]. For example.\ fixed r fixed s fixed 1/2. . B. In the discrete case we can establish the following general result. oneparameter exponential family with natural sufficient statistic T(m)(x) = Some other important examples are summarized in the following table.. .7) Note that the natural sufficient statistic T(m) is onedimensional whatever be m.. .B(O)} for suitable h * . 1)(0) N(I'. and h. · . •. then qCm) = mq. We leave the proof of these assertions to the reader..Section 1.· . B(m)(o) ~ mB(O). Therefore.\) (3(r. I:::n I::n Theorem 1. i=l (1.6.. 1 X m ) is a vector of independent and identically distributed P(O) random variables and m ) is the family of distributions of x.1.1 the sufficient statistic T :m)(x1.
x E X c Rq (1.1}) = h(x)exp[1}T(x) . The exponential family then has the form q(x. and Performance Criteria Chapter 1 Proof.1. Goals.. By definition. }. The model given by (1..6. Jh(x)exp[1}T(x)]dx in the continuous case and the integral is replaced by a sum in the discrete case. then A(1}) must be finite.8) h(x) exp[~(8)T(x) .9) with 1} ranging over E is called the canonical oneparameter exponential family generated by T and h. If X is distributed according to (1. .9) with 1} E E contains the class of models with 8 E 8. 00 00 exp{A(1})} andE = R.6.1}) = (1(x!)exp{1}xexp[1}]}. E is either an interval or all of R and the class of models (1. the result follows.6. 0 A similar theorem holds in the continuous case if the distributions ofT(X) are them Canonical exponential families. P9[T(x) = tl L {x:T(x)=t} p(x. We obtain an important and useful reparametrization of the exponential family (1..2.6. The Poisson family in canonical form is q(x. the momentgenerating function o[T(X) exists and is given by M(s) ~ exp[A(s + 1}) .52 Statistical Models. Let E be the collection of all 1} such that A(~) is finite.2. Example 1.6. £ is called the natural parameter space and T is called the natural sufficient statistic. = L(e"' (x!) = L(e")X (x! = exp(e").9) and ~ is an Ihlerior point of E.9) where A(1}) = logJ .B(8)]{ Ifwe let h'(t) ~ L:{"T(x)~t} h(x). x=o x=o o Here is a useful result.6. if q is definable.1) by letting the model be indexed by 1} rather than 8. . L {x:T{x)=t} h(x)}. x E {O.6. Then as we show in Section 1. If 8 E e.8) exp[~(8)t . (continued).2.6.B(8)] L {x:T(x)=t} ( 1.A(1})1 for s in some neighborhood 0[0.. where 1} = log 8. Theorem 1.A(~)I.1.6. selves continuous.
The rest of the theorem follows from the momentenerating property of M(s) (see Section A. is said to be a kparameter exponential family.12). 1.. B(O) the natural sufficient statistic E~ 2n(j2 and variance nlt}2 4n04 . E(T(X)) ~ A'(1]).6. and Dannois were led in their investigations to the following family of distributions. being the integral of a density.6 Exponential Families 53 Moreover. Proof.2 The Multiparameter Case Our discussion of the "natural form" suggests that oneparameter exponential families are naturally indexed by a onedimensional real parameter fJ and admit a onedimensional sufficient statistic T(x). This is known as the Rayleigh distribution.O) = h(x) exp[L 1]j(O)Tj (x) . . nlog0 J. Example 1.<·or .6. It is used to model the density of "time until failure" for certain types of equipment. ' e k p(x.8(0)].10) .8) ~ (x/8 2)exp(_x 2/28 2). Pitman. A family of distrib~tioos {PO: 0 E 8}. if there exist realvalued functions 171.6. Direct computation of these moments is more complicated.j8 2))exp(i=l n n I>U282) ~=1 = (il xi)exp[202 LX. _ . x E X j=1 c Rq. which is naturally indexed by a kdimensional parameter and admit a kdimensional sufficient statistic..8) = (il(x.8 > O.17k and B of 6.4 Suppose X 1> ••• . Var(T(X)) ~ A"(1]). We give the proof in the continuous case.. . Therefore. 0 Here is a typical application of this result. X n is a sample from a population with density p(x. (1. and realvalued functions T 1. 02 ~ 1/21]. 1 Tk. 2 i=1 i=l n 1 n Here 1] = 1/20 2. 0 = n log 82 and A(1]) 1 xl has mean nit} = = nlog(21]).Section 1.A(1])]dx h(x)exp[(s + 1])T(x)  A(s + 1])Jdx = . Koopman. We compute M(s) = = E(exp(sT(X))) {exp[A(s ~ + 1])  A(1])]} J'" J J'" J h(x)exp[(s + 1])T(x) . h on Rq such that the density (frequency) functions of the Po may be written as.A(1])] because the last factor. exp[A(s + 1]) . x> 0. More generally. is one. c R k . Now p(x.
.').(X) •. .6..TdX))T is sufficient.(x) = x'.Ot = fl.6. +log(27r" »)].1.x E X c Rq where T(x) = (T1 (x).6. the canonical kparameter exponential indexed by 71 = (fJI. i=l . . + log(27r"'».. q(x. . we define the natural parameter space as t: = ('l E R k : 00 < A('l) < oo}. . It will be referred to as a natural sufficient statistic of the family.2. then the preceding discussion leads us to the natural sufficient statistic m m (LXi. 0 Again it will be convenient to consider the "biggest" families.5.fJk)T rather than family generated by T and h is e. Suppose lhat P..11) (]"2. h(x) = I. The Normal Family.. A(71) is defined in the same way except integrals over Rq are replaced by sums.') population. J1 < 00. .. (]"2 > O}..LX.. .Tk(x)f and.Xm ) where the Xi are independent and identically distributed and their common distribution ranges over a k~parameter exponential family given by (1. suppose X = (Xl. in the conlinuous case.. 2 (".LTk(Xi )) t=1 Example 1. letting the model be Thus.. .4).O) =exp[".3.' . i=l i=1 which we obtained in the previous section (Example 1.'l) = h(x)exp{TT(x)'l.x . . Then the distributions of X form a kparameter exponential family with natural sufficient statistic m m TCm)(x) = (LTl(Xi). X m ) from a N(I". Goals.'" .10).5..I' 54 Statistical Models. ..') : 00 < f1 x2 1 J12 2 p(x. If we observe a sample X = (X" . and Performance Criteria Chapter 1 By Theorem 1. . the vector T(X) = (T. (1. . = N(I". Again. The density of Po may be written as e ~ {(I".2(". In either case. which corresponds to a twOparameter exponential family with q = I.').A('l)}... 1)2(0) = 2 " B(O) " 1"' 1 " T. 8 2 = and I" 1 2' Tl(x) = x.. I ! In the discrete case. .
~3 and t: = {(~1. 1/) =exp{T'f.~3 < OJ. . 2. E R. ..~3 ~ 1/2(7'. ~. .k. Then p(x. However 0.id. A(1/) ~ ~[(~U2~..1.~2)]."k. From Exam~ pIe 1.(x).~·(x) . Suppose a<.5 that Y I .'I2.(x) = L:~ I l[X i = j]. . < OJ. . and AJ ~ P(X i ~ j).(x»). L71 Aj = I}. . we can achieve this by the reparametrization k Aj = eO] /2:~ e Oj . Example 1.T.. In this example. ~l ~ /.'12 ~ (3..) + log(rr/. Multinomial Trials. as X and the sample space of each Xi is the k categories {I.kj. the density of Y = (Yb ..= {(~1. Yi rv N(Jli. . j=1 k This is a kparameter canonical exponential family generated by TIl"" Tk and h(x) = II~ I l[Xi E {I. 0) ~ exp{L .. = n'Ezr Example 1..4 and 1.~./(7'. Linear Regression. Now we can write the likelihood as k qo(x.~3): ~l E R. I yn)T can be put in canonical form with k = 3.'. . A E A.. A(1/) = 4n[~:+m'~5+z~1~'+2Iog(rr/~3)]. and rewriting kl q(x. . . = 1/2(7'. k}] with canonical parameter 0. Y n are independent. Example 1. In this example..5._l)(x)1/nlog(l where + Le"')) j=1 .. . with J. We observe the outcomes of n independent trials where each trial can end up in one of k possible categories. k ~ 2.6.5. This can be identifiable because qo(x... . h(x) = I andt: ~ R x R./(7'. 0) for 1 = (1. It will often be more convenient to work with unrestricted parameters.j = 1. .. I <j < k 1./Ak) = ".5.)T. . is not and all c.n log j=l L exp(. and t: = Rk . where m..6 Exponential Families 55 (x... .. A) = I1:~1 AJ'(X). TT(x) = T(Y) ~ (EY"EY. . E R k . remedied by considering IV ~j = log(A.Section 1.ti = f31 + (32Zi. < l..6. i = 1. Let T. x') = (T.6. 0.~. a 2 ).6.k.. 0 + el) ~ qo(x.5. .. (N(/" (7') continued}.~2): ~l E R. in Examples 1.j j=l = 1. where A is the simplex {A E R k : 0 < A. We write the outcome vector as X = (Xl.)j.Xn)'r where the Xi are i.EziY. .. n.7./(7'.~l= (3.
). 0 1. and Performance Criteria Chapter 1 1 parameter canonical exponential family generated by T (k 1) {I. . .d. 1]) is a k and h(x) = rr~ 1 l[xi E over. < X n be specified levels and (1. " Example 1.6. Ai). .6.. as X. 1 <j < k . then all models for X are exponential families because they are submodels of the multinomial trials model. let Xt =integers from 0 to ni generated Vi < n.6. Here 'Ii = log 1\.Y n ) ~ Y...13) ..1. See Problem 1. this. ) 1(0 < However. k}] with canonical parameter TJ and £ = Rk~ 1. 0) ~ q(x. and 1] is a map from e to a subset of R k • Thus.6. 1/(0») (1. X n ) T where the Xi are i. if X is discrete Affine transformations from Rk to RI defined by UP is the canonical family generated by T kx 1 and hand M is the affine transformation I' .6. is unchanged..6. Goals. Similarly. is an npararneter canonical exponential family with Yi by T(Yt . I < k. from Example 1. are identifiable.8.17 e for details. More10g(P'I[X = jI!P'IIX ~ k]). .i. " .12) taking on k values as in Example 1.56 Note that q(x. 1 < i < n. h(y) = 07 I ( ~.3 Building Exponential Families Submodels A submodel of a kparameter canonical exponential family {q(x. If the Ai are unrestricted.6. < . M(T) sponding to and ~ MexkT + hex" it is easy to see that the family generated by M(T(X» and h is the subfamily of P corre 1/(0) = MTO. be independent binomial. the parameters 'Ii = Note that the model for X Statistical Models. Logistic Regression. . where (J E e c R1.' A(1/) = L:7 1 ni 10g(l + e'·). 11 E £ C R k } is an exponential family defined by p(x. Let Y. 1 < i < n. 17). 0 < Ai < 1. then the resulting submodel of P above is a submodel of the exponential family generated by BTT(X) and h. if c Re and 1/( 0) = BkxeO C R k ..2. B(n.·.7 and X = (X t. Here is an example of affine transformations of 6 and T.
.00).xn)T.x and (1.'.. In the nonnal case with Xl. ~ (1.. p(x.13) holds.... I ii. Assume also: (a) No interaction between animals (independence) in relation to drug effects (b) The distribution of X in the animal population is logistic.1. log(P[X < xl!(1 . N(Jl. o Curved exponential families Exponential families (1. which is called the coefficient of variation or signaltonoise ratio. + 8. with B = p" we can write where T 1 = ~~ I Xi. '. so it is not a curved family.) = L ni log(1 + exp(8. .Section 1. . Then (and only then). ranges over (0.8. + 8. n. .x = (Xl.6.8. and l1n+i = 1/2a. . Suppose that YI ~ .6. .i. . Y n are independent.8. = '\501 and 'fJ2(0) = ~'\5B2. . Example 1. E R. generated by ..1)T.12) with the range of 1J( 8) restricted to a subset of dimension l with I < k . x). 'fJ1(B) curved exponential family with l = 1. i=I This model is sometimes applied in experiments to determine the toxicity of a substance.. is a known constant '\0 > O. This is a 0 In Example 1.9. a.5 a 2nparamctercanonical exponential family model with fJi = P. L:~ 1 Xl yi)T and h with A(8. 8) in the 8 parametrization is a canonical exponential family.PIX < x])) 8. . }Ii N(J1.x)W'.6. Yn.6. SetM = B T . The Yi represent the number of animals dying out of ni when exposed to level Xi of the substance. Then. Gaussian with Fixed SignaltoNoise Ratio.'V T(Y) = (Y" . the 8 parametrization has dimension 2. which is less than k = n when n > 3. that is. + 8.. .. are called curved exponential families provided they do not form a canonical exponential family in the 8 parametrization. this is by Example 1. > O.10.6. .14) 8..)T .d.. LocationScale Regression.6. T 2 = L:~ 1 X. However. i = 1..Xn i..Xi)).6.i. If each Iii ranges over R and each a.a 2 )..!t is assumed that each animal has a random toxicity threshold X such that death results if and only if a substance level on or above X is applied. Y. where 1 is (1.i/a. then this is the twoparameter canonical exponential family generated by lYIY = (L~l. y.). . PIX < xl ~ [1 + exp{ (8.. suppose the ratio IJlI/a.. Example 1.6 Exponential Families 57 This is a linear transformation TJ(8) = B nxz 8 corresponding to B nx2 = (1.
the map 1/(0) is Because L:~ 11]i(6)Yi + L:~ i17n+i(6)Y? cannot be written in the fonn ~.3. We return to curved exponential family models in Section 2. and Performance Criteria Chapter 1 and h(Y) = 1. say. and Snedecor and Cochran.6. Next suppose that (JLi. as being distributed according to a twoparameter family generated by Tj(Y.5 that for any random vector Tkxl.8.) depend on the value Zi of some covariate.6.8 note that (1. We extend the statement of Theorem 1.6. 8. . n. we define I I . 1 Bj(O). but a curved exponential family model with 0 1=3.1. ~.d.6.13) exhibits Y. Goals. for unknown parameters 8 1 E R. 1978. and = Ee sTT .2. sampling._I11. . with an exponential family density Then Y = (Y1. Sections 2.).6.10 and 1..58 Statistical Models. Let Yj . 1.g.10). a. For 8 = (8 1 . then p(y. Even more is true. respectively. 0) ~ q(y.xjYj ) and we can apply the supennode! approach to reach the same conclusion as before.i.. 1988. Examples 1. Bickel.6.. Thus. Ii. Models in which the variance Var(}'i) depends on i are called heteroscedastic whereas models in which Var(Yi ) does not depend on i are called homoscedastic.8.5.6. Supermode1s We have already noted that the exponential family structure is preserved under i.4 Properties of Exponential Families Theorem 1. M(s) as the momentgenerating function.(0). 83 > 0 (e. 1989.) and 1 h'(YJ)' with pararneter1/(O). Carroll and Ruppert. TJ'(Y).) = (Yj. 1 Tj(Y.6 are heteroscedastic and homoscedastic models.12) is not an exponential family model. 1 < j < n.E Yj c Rq. yn)T is modeled by the exponential family generated by T(Y) ~. E R.(O)Tj'(Y) for some 11.. In Example 1.12.1 generalizes directly to kparameter families as does its continuous analogue. and B(O) = ~. Section 15.1/(O)) as defined in (6. Recall from Section B. be independent.
6.6.a)A('72) Because (1. S > 0 with ~ + ~ = 1. ('70) II· The corollary follows immediately from Theorem B.1.a. By the Holder inequality (B.3 V. for any u(x). u(x) ~ exp(a'7.A('7o)} vaLidfor aLL s such that 110 + 5 E E. t: and (a) follows.15) is finite.a)'72 E is proved in exactly the same way as Theorem 1. A(U'71 Which is (b).6. Let P be a canonicaL kparameter exponential famiLy generated by (T.Section 1. Proof of Theorem 1.6.a)'72) < aA('7l) + (1 . ~ = 1 .9.3.3(c). Corollary 1. h) with corresponding natural parameter space E and function A(11). .6.1 and Theorem 1.1 give a classical result in Example 1.6. 8A 8A T" = A('7o) f/J. vex) = exp«1 .15) that a'7. . Since 110 is an interior point this set of5 includes a baLL about O. Under the conditions ofTheorem 1.6 Expbnential Families 59 V. T. J exp('7 TT (x))h(x)dx > 0 for all '7 we conclude from (1.8". v(x).6..6. We prove (b) first.15) t: the righthand side of (1.3.T(x)). ('70).6. Fiually (c) 0 The formulae of Corollary 1.2. . + (1 . then T(X) has under 110 a moment M(s) ~ exp{A('7o + s) . Then (a) E is convex (b) A: E ) R is convex (c) If E has nonempty interior in R k generating function M given by and'TJo E E. Suppose '70' '71 E t: and 0 < a < 1.5.4).6.r'7o T(X) . h(x) > 0. 1O)llkxk. A where A( '70) = (8". If '7" '72 E + (1 . 8".. Theorem 1. J u(x)v(x)h(x)dx < (J ur(x)h(x)dx)~(Jv'(x)h(x)dx)~.6. (with 00 pennitted on either side).a)'7rT(x)) and take logs of both sides to obtain.A('70) = II 8".6. ('70)) .r(T) = IICov(7~. Substitute ~ ~ a.
60 Statistical Models. Xl Y1 ).8.(9) = 9. p:x. (continued). Goads. Suppose P = (q(x. But the rank of the family is 1 and 8 1 and 82 are not identifiable. An exponential family is of rank k iff the generating statistic T is kdimensional and 1. (i) P is a/rank k. However. T. + 9. We establish the connection and other fundamental relationships in Theorem 1.(Tj(X» = P>.lx ~ j] = AJ ~ e"'ILe a . there is a minimal dimension.6. if we consider Y with n'> 2 and Xl < X n the family as we have seen remains of rank < 2 and is in fact of rank 2. P7)[L.4. Sintilarly.6.6.7). k A(a) ~ nlog(Le"') j=l and k E>. However. (ii) 7) is a parameter (identifiable).7."I 11 J Evidently every kparameter exponential family is also k'dimensional with k' > k. Here. Our discussion suggests a link between rank and identifiability of the 'TJ parameterization. f=l . Theorem 1. . II ! I Ii . and Performance Criteria Chapter 1 Example 1.7 we can see that the multinomial family is of rank at most k .4.6. Then the following are equivalent.6. and ry. (iii) Var7)(T) is positive definite.~. using the a: parametrization. . Formally.1. .4 that follows. 7) E f} is a canonical exponential/amily generated by (TkXI ' h) with natural parameter space e such that E is open. 2' Going back to Example 1. 9" 9. It is intuitively clear that k .6. ajTj(X) = ak+d < 1 unless all aj are O. such that h(x) > O. (X)".1 is in fact its rank and this is seen in Theorem 1. ifn ~ 1. in Example 1. for all 9 because 0 < p«X. Tk(X) are linearly independent with positive probability. Note that PO(A) = 0 or Po (A) < 1 for some 0 iff the corresponding statement holds i. .x. we are writing the oneparameter binomial family corresponding to Yl as a twoparameter family with generating statistic (Y1 .. o The rank of an exponential family I .~~) < 00 for all x.
with probability 1. Suppose tlult the conditions of Theorem 1. by Theorem 1.. The proof for k details left to a problem. Conversely. ~ P1)o some 1). . have (i) . . 1)0) E n· Q is the exponential family (oneparameter) generated by (1)1 .. ". :. for all T}. F!~ . Now (iii) =} A"(TJ) > 0 by Theorem 1. all 1) = (~i) II. A'(TJ) is strictly monotone increasing and 11.. Thus.TJ2)T(X) = A(TJ2) .) with probability 1 =~(i).. hence. .2.A(TJI)}h(x) ~ exp{TJ2T(x) ..4 hold and P is of rank k.1)o)TT. '" '. Then (a) P may be uniquely parametrized by "'(1)) E1)T(X) where". ~ (ii) ~ _~ (i) (ii) = P1). hence. all 1) ~ (iii) = a T Var1)(T)a = Var1)(aT T) = 0 for some a of 0.6.. Let ~ () denote "(. (b) logq(x. ranges over A(I').6 Exponential Families 61 (iv) 1) ~ A(1)) is 11 onto (v) A is strictly convex on E. This is just a restatement of (iv) and (v) of the theorem.2 and. Note that. We. 0 Corollary 1. Proof. We give a detailed proof for k = 1.A(TJ. A is defined on all ofE. because E is open. by our remarks in the discussion of rank.) is false.'t·· . 1)) is a strictly concavefunction of1) On 1'.' .A(TJ2))h(x).1)o) : 1)0 + c(1).T ~ a2] = 1 for al of O.6. which that T implies that A"(TJ) = 0 for all TJ and. thus. Apply the case k = 1 to Q to get ~ (ii) =~ (i).Section 1. This is equivalent to Var"(T) ~ 0 '*~ (iii) {=} f"V(ii) There exist T}I =I= 1J2 such that F Tll = Pm' Equivalently exp{r/.6. = Proof. ~ (i) =~ (iii) ~ (i) = P1)[aT T = cJ ~ 1 for some a of 0. 0 .6.3. of 1)0· Let Q = (P1)o+o(1). (iv) '" (v) = (tii) Properties (iv) and (v) are equivalent to the statements holding for every Q defined as previously for arbitrary 1)0' 1).~> . (iii) = (iv) and the same discussion shows that (iii) . Taking logs we obtain (TJI .. A"(1]O) = 0 for some 1]0 implies c." Then ~(i) > 1 is then sketched with '* P"[a.(v).(ii) = (iii). A' is constant. III. = Proof ofthe general case sketched I.T(x) ..
Yn)T follows the k = p(p + 3)/2 parameter exponential family with T = (EiYi . E) = Idet(E) 1 1 2 / . Jl.2 we considered beta prior distributions for the probability of success in n Bernoulli trials.6.I. ... which is obviously a 11 function of (J1. .p/2 I exp{ .. The p Van'ate Gaussian Family. write p(x 19) for p(x.6.. It may be shown (Problem 1.Jl) .E).6. B(9) = (log Idet(E) I+JlTEl Jl)... which is a p x p symmetric matrix. E. Then p(xI9) = III h(x. T (and h generalizing Example 1. However. Thus. Suppose Xl. 0 ! = 1. ...~p). (1._. .11.18) _.Xn is a sample from the kparameter exponential family (1. so that Theorem 1. E(X.5. and.(Xi) i=l j=l i=l n • n nB(9)}. E).(9) LT. Goals.. We close the present discussion of exponential families with the following example.. ~iYi Vi).6.11. This is a special case of conjugate families of priors.62 Statistical Models. .6.. Y n are iid Np(Jl.6. 0).ljh<i<. E) _~yTEIy + (E1JllY 2 I 2 (log Idet(E)1 + JlT I P log". 9 ~ (Jl..10).. . Example 1.6.6. Recall that Y px 1 has a p variate Gaussian distribution.Jl.)] eXP{L 1/.6. is open.17) The first two terms on the right in (1.6. N p (1J. .5 Conjugate Families of Prior Distributions In Section 1.17) can be rewritten ( 2: 1 <i<j~p aijYiYj + ~ 2: aii r:2 ) + L(2: aij /lj)Yi i=l i=l j=l p where E. h(Y) .a 2 ).{Y. By our supermodel discussion.Yp.2 p p (1.X') = (JL.. The corollary will prove very important in estimation theory. . E).3.29) that I) generate this family and that the rank of the family is indeed p(p + 3)/2.2 (Y . and that (. with mean J.t parametrization is close to the initial parametrization of classical P.tpx 1 and positive definite variance covariance matrix L::pxp' iff its density is f(Y. revealing that this is a k = p(p + 3)/2 parameter exponential family with statistics (Yi.1 '" 11"'..(Y).... . where X is the Bernoulli trial.4 applies.1 (Y ._. the relation in (a) may be far from obvious (see Problem 1.. where we identify the second element ofT.Jl)}. ifY 1. with its distinct p(p + I) /2 entries.6. as we always do in the Bayesian context..Jl)TE.. the 8(11) 0) family is parametrized by E(X).21). the N(/l. (1. For {N(/" "')}."5) family by E(X). An important exponential family is based on the multivariate Gaussian distributions of Section 8.a 2 + J12). families to which the posterior after sampling also belongs.6. See Section 2. then X . and Performance Criteria Chapter 1 The relation in (a) is sometimes evident and the j.16) Rewriting the exponent we obtain 10gf(Y.
6. . We assume that rl is nonempty (see Problem 1.6. be "parameters" and treating () as the variable of interest.1. n n )T and ex indicates that the two sides are proportional functions of (). .2)' Uo 2uo Ox . Suppose Xl. Forn = I e. .36). Example 1. That is.6.(0) ~ exp{L ~j(O)fj .. . 1r(6jx) is the member of the exponential family (1.. The (k + I)parameter exponential/amity given by k ".fk+IB(O) logw(t)) j=1 (1.) .21) is an updating formula in the sense that as data Xl. .21) where S = (SI. n)T D It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way. where u6 is known and (j is unknown.6 Exponential Families 63 where () E e.20).6. . .O<W(tJ.. X n become available.18). A conjugate exponential family is obtained from (1..u5) sample. Because two probability densities that are proportional must be equal...6. tk+l) E 0. .fk+Jl..(O) ex exp{L ~J(O)(LT.20) where t = (fJ. where a = (I:~ 1 T1 (Xi). .6."" Sk+l)T = ( t} + ~TdxiL .. . If p(xIO) is given by (1.'''' I:~ 1 Tk(Xi)..21) and our assertion follows. tk+ I) T and w(t) = 1= eXP{hiJ~J(O)' _='" _= 1 k ik+JB(II)}dO J···dOk (1.6.'" .. which is k~dimensional. let t = (tl.. 0 Remark 1. Proof.1.. j = 1.19) o {(iJ..12.6.18) by letting 11.Section 1. the parameter t of the prior distribution is updated to s = (t + a).20) given by the last expression in (1. .6.(Olx) ex p(xlO)". ik + ~Tk(Xi)' tk+l + 11. To choose a prior distribution for model defined by (1. is a conjugate prior to p(xIO) given by (1.6.(0).18) and" by (1. then Proposition 1.6. Note that (1.(Xi)+ f.. . k.fkIJ)<oo) with integrals replaced by sums in the discrete case. (1. . we consider the conjugate family of the (1. then k n ..6.20).6.6.6.6.22) 02 p(xIO) ex exp{ 2 . and t j = L~ 1 Tj(xd.(lk+1 j=1 i=1 + n)B(O)) ex ".Xn is a N(O.
23) with (70 TO .6. 0 These formulae can be generalized to the case X. we obtain 1r. Using (1. = 1 . we must have in the (t l •t2) parametrization "g (1. E RP and f symmetric positive definite is a conjugate family f'V .) } 2a5 h t2 t1 2 (1.37).20) has density (1.d. if we observe EX. 75) prior density. rJ) distributions where Tlo varies freely and is positive.6.6.6. i.) density. = nTg(n)I(1~. W.21). t. (0) is defined only for t. we find that 1r(Olx) is a normal density with mean Jl(s.6.i.6.n) and variance = t.6. = rg(n)lrg so that w.26) intuitively as ( 1.6. . 761) where '10 varies over RP. tl(S) = 1]0(10 2 TO 2 + S.W.64 Statistical Models.(S) = t.27) Note that we can rewrite (1...(n) (g +n)I[s+ 1]0~~1 TO TO (1.6.30) that the Np()o" f) family with )0. (J Np(TJo. If we start with aN(f/o.6. Eo). Tg is scalar with TO > 0 and I is the p x p identity matrix (Problem 1. Our conjugate family. consists of all N(TJo.23) Upon completing the square. > 0 and all t I and is the N (tI!t" (1~ It.6. Np(O. 1r. + n.25) By (1. Eo known.(n) = . it can be shown (Problem 1. Goals. Moreover. and Performance Criteria Chapter 1 This is a oneparameter exponential family with The conjugate twoparameter exponential family given by (1.6. therefore.6.28) where W.24) Thus.(O) <X exp{ .( 0 .24). the posterior has a density (1.26) (1. = s. 1 < i < n.
for all s such that '10 + s is in Moreover E 7Jo [T(X)] = A(7Jo) and Var7Jo[T(X)] A( '10) where A and A denote the gradient and Hessian of A. E 1 R is convex. . a theory has been built up that indicates that under suitable regularity conditions families of distributions. x j=1 E X c Rq.29) (Tl (X). . If bas a nonempty interior in R k and '10 E then T(X) has for X ~ P7Jo the momentgenerating function .3 is not covered by this theory.. Discussion Note that the uniform U( (I.. 2. O}) model of Example 1. Eo).. must be kparameter exponential families.Section 1. Summary.6. and Darmois. It is easy to see that one can consUUct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1. the family of distributions in this example and the family U(O. and realvalued functions TI.. " Tk..6. .A(7Jo)} e e. Pitman.6 Exponential Families 65 but a richer one than we've defined in (1.5.J: h(x)exp{TT(x)7J}dx in the continuous case. . The set E is convex. e. is not of the form L~ 1 T(Xi)..6. is a kparameter exponential/amity of distributions if there are realvalued functions 'T}I . . . the map A .32. h on Rq such that the density (frequency) function of Pe can be written as k p(x. 8 C R k . In the onedimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. The set is called the natural parameter space. to Np (8 .6.1 are often too restrictive. Problem 1. The canonical kparameter exponentialfamily generated by T and h is T(X) = where A(7J) ~ log J:.. which is onedimensional whatever be the sample size..20). See Problems 1..(s) = exp{A(7Jo + s) . Despite the existence of classes of examples such as these. In fact.Xn ).31 and 1. {PO: 0 E 8}. In fact.10 is a special result of this type.20) except for p = 1 because Np(A.6. B) are not exponential.B(O)]. starting with Koopman.6. T k (X)) is called the natural sufficient statistic of the family..6. The natural sufficient statistic max( X 1 ~ . which admit kdimensional sufficient statistics for all sample sizes. with integrals replaced by sums in the discrete case. Some interesting results and a survey of the literature may be found in Brown (1986).. r) is a p(p + 3)/2 rather than a p + I parameter family. (1. the conditions of Proposition 1. ••• .'T}k and B on e. . 0) = h(x) expl2:>j(I1)Ti (x) .
.B(O)tk+l logw} j=l where w = 1:" 1: E exp{l:. tk+..6.1)) (O)t j .L and a 2 but has in advance no knowledge of the magnitudes of the two parameters.29). and t = (t" .L. (v) A is strictlY convex on f.. (iii) Var'7(T) is positive definite.. He wishes to use his observations to obtain some infonnation about J. (a) A geologist measures the diameters of a large number n of pebbles in an old stream bed.) E R k+l : 0 < w < oo}. Assume that the errors are otherwise identically distributed nonnal random variables with known variance.1 1. (iv) the map '7 ~ A('7) is I ' Ion 1'. Goals.L and variance a 2 . State whether the model in question is parametric or nonparametric. Give a formal statement of the following models identifying the probability laws of the data and the parameter space.(0) = exp{L1)j(O)t) . l T k are linearly independent with positive Po probability for some 8 E e. T" .. and Performance Criteria Chapter 1 An exponential family is said to be of rank k if T is kdimensional and 1. is conjugate to the exponential family p(xl9) defined in (1. (b) A measuring instrument is being used to obtain n independent determinations of a physical constant J.1 units.B(O)}dO. then the following are equivalent: (i) P is of rank k.parameter exponential family k ". . Ii II I . .7 PROBLEMS AND COMPLEMENTS Problems for Section 1. Suppose that the measuring instrument is known to be biased to the positive side by 0. The (k + 1). (ii) '7 is identifiable.66 Statistical Models. If P is a canonical exponential family with E open. A family F of priOf distributions for a parameter vector () is called a conjugate family of priors to p(x I 8) if the posterior distribution of () given x is a member of F. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean J. 1.. tk+1) 0 ~ {(t" .
. Q p ) {(al. 2.. 1 Ab) restricted to the sets where l:f I (Xi = 0 and l:~~1 A. Show that Fu+v(t) < Fu(t) for every t. are sampled at random from a very large population.Xpb .··. and Po is the distribution of X (b) Same as (a) with Q = (X ll . . . . Ab. j = 1.. .1 describe formaHy the foHowing model.1. then X is (b) As in Problem 1. (If F x and F y are distribution functions such that Fx(t) said to be stochastically larger than Y. ~ N(l'ij.. Each .Xp ). At. 0.1(c). .7 Problems and Complements 67 (c) In part (b) suppose that the amount of bias is positive but unknown. v. Q p ) and (A I .. 3..for this model? (d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean A.l(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case.2). . (e) The parametrization of Problem l.1. . 4. Once laid. .. e (e) Same as (d) with (Qt.. (b) The parametrization of Problem 1. An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest.. . bare independeht with Xi. each egg has an unknown chance p of hatching and the hatChing of one egg is independent of the hatching of the others.. . Two groups of nl and n2 individuals.) < Fy(t) for every t. (d) Xi. restricted to p = (all' . i = 1. = (al.... respectively. .. .Section 1. . = o.2) where fLij = v + ai + Aj.) (a) Xl. .Xp are independent with X~ ruN(ai + v."" a p ..ap ) : Lai = O}.2 ). (a) Let U be any random variable and V be any other nonnegative random variable.. Which of the following parametrizations are identifiable? (Prove or disprove.t.. .1(d).2 ) and Po is the distribution of Xll.1.) (a) The parametrization of Problem 1. 1'2) and we observe Y X.p. B = (1'1. . 0.. Can you perceive any difficulties in making statements about f1.2) and N (1'2. . . i=l (c) X and Y are independent N (I' I ... Are the following parametrizations identifiable? (Prove or disprove...
i = 1. is the distribution of X e = (0... (e) Suppose X ~ N(I'.:". 0'2).. then P(Y < t + c) = P( Y < t .··· . e = = 1 if X < 1 and Y {J.. Mk. the density or frequency function p of Y satisfies p( c + t) ~ p( c . I ! fl.. and Performance Criteria Chapter 1 ":. VI. 1 < i < k and (J = (J.1.. 68 Statistical Models. . . Vk) is unknown.p(0. Let Y and Po is the distribution of Y. 2. The number n of graduate students entering a certain department is recorded. Let Po be the distribution of a treatment response..1. o < J1 < 1.. 7. .. (0)..Lk VI n". Which of the following models are regular? (Prove or disprove.LI ni + . }... 0 < V < 1 are unknown. Hint: IfY . .t). • . . =X if X > 1. Consider the two sample models of Examples 1. 0 = (1'.t) fj>r all t.0'2) (d) Suppose the possible control responses in an experiment are 0....9). I.2.Li < + .. Vi = (1 11")(1 ~ v)iI V for i = 1.. Let N i be the number dropping out and M i the number graduating during year i. What assumptions underlie the simplification? 6.c bas the same distribution as Y + c. . 5. but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown. The following model is proposed.0.9 and they oecur with frequencies p(0. 8. .. Goals.· .2. k.L. 0 < Vi < 1.... + Vk + P = 1..LI. It is known that the drug either has no effect or lowers blood pressure...k + VI + . 0 < J. mk·r.1.. • "j :1 member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour. if and only if.c) = PrY > c . I . . nk·mI .Nk = nk. Both Y and p are said to be symmetric about c. + fl. .. In each of k subsequent years the number of students graduating and of students dropping out is recorded. . . .' I .  . . V k mk P r where J. The simplification J.Li = 71"(1 M)iIJ.3(2) and 1.1).0). whenXis unifonnon {O. ml .O}. .0. Show that Y . .t) = 1 . + mk + r = n 1. J... (b)P. P8[N I .k is proposed where 0 < 71" < I..4(1). + nk + ml + . Suppose the effect of a treatment is to increase the control response by a fixed amount (J. I nl. = n11MI = mI.. (a) What are the assumptions underlying this model'? (b) (J is very difficult to estimate here if k is large. M k = mk] I! n.....) (a) p.p(0.c has the same distribution as Y + c. Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour.P(Y < c ..I nl . is the distribution of X when X is unifonn on (0.2)..
. Let c > 0 be a constant.2x yield the same distribution for the data (Xl.2x and X ~ N(J1.. . The Lelunann 1\voSample Model. Show that if we assume that o(x) + x is strictly increasing. 10. e) : p(j) > 0. j = 0. .. The Scale Model..7 Problems and Complements 69 (a) Show that if Y ~ X + o(X). then satisfy a scale model with parameter e.3 let Xl. Therefore. . C are independent. Yn denote the survival times of two groups of patients receiving treatments A and B. Suppose X I... That is.1.Section 1. . e) vary freely over F ~ {(I'. suppose X has a distribution F that is not necessarily normal. Does X' Y' = yc satisfy a scale model? Does log X'.Xn are observed i. the two cases o(x) . Sbow that {p(j) : j = 0. Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci. e(j) > 0. log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl. . {r(j) : j ~ 0. or equivalently. N}. 12. Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0. . fJp) are not identifiable if n pardffieters is larger than the number of observations. C).. then C(·) ~ F(. C(t} = F(tjo)."" Yn ).6. = 9. .tl). if the number of = (min(T. (b) Deduce that (th.{3p) are identifiable iff Zl. t > O.. .. Y. . 0 < j < N. 11. ci '" N(O. 0 'Lf 'Lf op(j) = I."') and o(x) is continuous.. N} are identifiable.. according to the distribution of X.tl) implies that o(x) tl. eY and (c) Suppose a scale model holds for X. that is. (Yl. . (b) In part (a).i. o(x) ~ 21' + tl .zp.. Sx(t) = . . = (Zlj. log y' satisfy a shift model? = Xc. .tl) does not imply the constant treatment effect assumption. I} and N is known.. . Hint: Consider "hazard rates" for Y min(T...Znj?' (a) Show that ({31. then C(·) = F(. (]"2) independent. In Example 1. e(j). are not collinear (linearly independent). .. . I(T < C)) = j] PIC = j] PIT where T. r(j) = = PlY = j. . . N.N = 0. C(·) = F(· ..'" . o (a) Show that in this case. eX o. For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1.tl and o(x) ~ 21' + tl. i . C).Xn ). Y = T I Y > j]. p(j). Let X < 1'. .. and (I'. > 0.. ."'). 1 < i < n. X m and Yi. . ..d.
h(t I Zi) = f(t I Zi)jSy(t I Zi).I ~I . then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll. Show that if So is continuous.(t) if and only if Sy(t) = Sg.). (b) By extending (bjn) from the rationals to 5 E (0.) Show that Sy(t) = S'.ll) with scale parameter 5. as T with survival function So. log So(T) has an exponential distribution. t > 0. t > 0.. then Yi* = g({3.{a(t). I (b) Assume (1... z)) (1. ..ho(t) is called the Cox proportional hazard model. thus. I (3p)T of unknowns.. Zj) + cj for some appropriate €j. Also note that P(X > t) ~ Sort). we have the Lehmann model Sy(t) = si.h.. (c) Under the assumptions of (b) above. So (T) has a U(O. Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t). k = a and b. 12. hy(t) = c. " T k are unobservable and i. 13. The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(. The most common choice of 9 is the linear form g({3. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t). Show that hy(t) = c.7.l3. Sy(t) = Sg.. and Performance CriteriCl Chapter 1 t Ii . where TI". Tk > t all occur..d.G(t).1. Moreover. . I' 70 StCitistical Models.l3. = exp{g(. Hint: See Problem 1. . A proportional hazard model.. Hint: By Problem 8. .(t) with C.(t).2) is equivalent to Sy (t I z) = sf" (t). = = (. z) zT. Goals. (1. are called the survival functions. 7.z)}. Survival beyond time t is modeled to occur if the events T} > t.(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y. F(t) and Sy(t) ~ P(Y > t) = 1 . 1 P(X > t) =1 (. show that there is an increasing function Q'(t) such that ifYt = Q'(Y. I) distribution.00).2) and that Fo(t) = P(T < t) is known and strictly increasing.12. Set C. Then h.2.  .1) Equivalently. . .(t)..l3. (c) Suppose that T and Y have densities fort) andg(t).7. Specify the distribution of €i.. ~ a5. t > 0. Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy. For treatments A and E.2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 .) Show that (1.i.7. respectively.
p and C. G. still identifiable? If not. Here is an example in which the mean is extreme and the median is not.(x/c)e. x> c x<c 0.12.11 and 1. Merging Opinions. . Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x.l (0.000 is the minimum monthly salary for state workers. Let Xl. and for this reason. Show how to choose () to make J. Show that both .2 with assumptions (t){4). and Zl and Z{ are independent random variables.i. 1f(O2 ) = ~.4 1 0.. Yl . Consider a parameter space consisting of two points fh and ()2. '1/1' > 0. X n are independent with frequency fnnction p(x posterior1r(() I Xl. F. Find the .. In Example 1. the parameter of interest can be characterl ized as the median v = F.O) 1 . .2) distribution with .6 Let 1f be the prior frequency function of 0 defined by 1f(I1. where () > 0 and c = 2. Observe that it depends only on l:~ I Xi· I 0). .7 Problems and Complements 71 Hint: See Problems 1.." . 15.d. Examples are the distribution of income and the distribution of wealth. what parameters are? Hint: Ca) PIXI < tl = if>(. Ca) Find the posterior frequency function 1f(0 I x)....2 unknown. the median is preferred in this case. and suppose that for given ().Section 1. When F is not symmetric.2 0. . J1 and v are regarded as centers of the distribution F.8 0.G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R.1.p and ~ are identifi· able.p(t)). Yn be i.d. have a N(O. . .v arbitrarily large.2 1.X n ).. (b) Suppose X" . Problems for Section 1. J1 may be very much pulled in the direction of the longer tail of the density.i.. an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0. ) = ~.L . 'I/1(±oo) = ±oo. (a) Suppose Zl' Z. wbere the model {(F.Xm be i. Are. have a N(O. Generally.1. Find the median v and the mean J1 for the values of () where the mean exists. (b) Suppose Zl and Z. 1) distribution..5) or mean I' = xdF(x) = Fl(u)du.1. 14.
O)kO. . 0 < x < O. . (3) Suppose 8 has prior frequency. 3. Suppose that for given 8 = 0.m+I.. 1. . c(a). does it matter which prior. " J = [2:. is used in the fonnula for p(x)? 2. the outcome X has density p(x OJ ~ (2x/O'). 4. Let X I. Find the posteriordensityof8 given Xl = Xl. " ••. IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree. where a> 1 and c(a) . 0 (c) Find E(81 x) for the two priors in (a) and (b). . .. Show that the probability of this set tends to zero as n t 00.. and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 .75. k = 0.2. This is called the geometric distribution (9(0)). X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is . "X n ) = c(n 'n+a m) ' J=m.2. (b) Find the posterior density of 8 when 71"( OJ = 302 . (3) Find the posterior density of 8 when 71"( 0) ~ 1.0 < 0 < 1. < 0 < 1. 71"1 (0 2 ) = . . 3.Xn are natural numbers between 1 and Band e= {I.0 < B < 1. (J. . Consider an experiment in which. . what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta. Then P.5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. l3(r. I XI . . ':.Xn = X n when 1r(B) = 1.. . For this convergence. X n be distributed as where XI. Unllormon {'13} . 1r(J)= 'u . for given (J = B. 2. (d) Suppose Xl. Compare these B's for n = 2 and 100.. ~ = (h I L~l Xi ~ = ... . Assume X I'V p(x) = l:.. 7r (d) Give the values of P(B n = 2 and 100. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O. X n are independent with the same distribution as X..8). ) ~ . Goals..'" . .. 7l" or 11"1. }.)=1.[X ~ k] ~ (1.. 4'2'4 (b) Relative to (a). 'IT .72 Statistical Models. .'" 1rajI Show that J + a. Let 71" denote a prior density for 8..25.=l1r(B i )p(x I 8i )..
Xn is a sample with Xi '"'' p(x I (}). a regular model and integrable as a function of O.J:n). where VI.1= n+r+s _ ."'tF·t'. . X n+ k ) be a sample from a population with density f(x I 0). Xn) = Xl = m for all 1 as n + 00 whatever be a. . . 6..0 E Let 9 have prior density 1r.. Show that the conditional distribution of (6. 9.2. is the standard normal distribution function and n _ T 2 x + n+r+s' a J. 7. Suppose Xl. Va' W t ..c(b." .. In Example 1.2.. (a) Show that the family of priors E.(1. } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is . Next use the central limit theorem and Slutsky's theorem. . 10.··."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl. W. Show that a conjugate family of distributions for the Poisson family is the gamma family.1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •. X n+ Il . XI = XI. then the posterior distribution of D given X = k is that of k + Z where Z has a B(N ..1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta.1..and Complements 73 wherern = max(xl. b) is the distribution of (aV loW)[1 + (aV IbW)]t. .) = n+r+s Hint: Let!3( a.. Let (XI. .• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e. n. . then !3( a.. If a and b are integers. b) denote the posterior distribution. Show rigorously using (1. S. 11'0) distribution.'" . Show in Example 1.• X n = Xn.. .1... .8) that if in Example 1.b> 1... where E~ 1 Xi = k.f) = [~....7 Problems .. are independent standard exponential. ZI.. Assume that A = {x : p( x I 8) > O} does not involve O.2. I Xn+k) given .Section 1. X n = Xn is that of (Y. f3(r. s). . . f(x I f).2. ... where {i E A and N E {l. . Interpret this result. Show that ?T( m I Xl.n. Justify the following approximation to the posterior distribution where q. 11'0) distribution. (b) Suppose that max(xl. D = NO has a B(N. Xn) + 5.
X n are i. .) be multinomial M(n. (c) Find the posterior distribution of 0". i).. O(t + v) has a X~+n distribution.2) and we formally put 7t(I". . (N.74 where N' ~ N Statistical Models. Q'j > 0./If (Ito. .6 ""' 1r. (12). . given x.d. then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I.O the posterior density 7t( 0 Ix). Suppose p(x I 0) is the density of i. (b) Discuss the behavior of the two predictive distributions as 15. . .\ > 0. (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8. LO. and Performance Criteria Chapter 1 + nand ({. The Dirichlet distribution is a conjugate prior for the multinomial. 52) of J1. •• • .Xn + l arei.O. . Here HinJ: Given I" and ". The Dirichlet distribution. Find the posterior distribution 7f(() I x) and show that if>' is an integer. ..i.. = 1.J u.. T6) densities..i. Q'r)T.. 01 )=1 j=l Uj < 1.. vB has a XX distribution. X and 5' are independent with X ~ N(I". L. v > 0. the predictive distribution is the marginal distribution of X n + l . Note that.. X 1.. unconditionally. 0 > O.. Next use Bayes rule. .)T. X n . 1 X n ."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x. where Xi known. a = (a1.. I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}. . "5) and N(Oo.d.. . po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { .. < 1.. Show that if Xl.. I(x I (J). 13. . 82 I fL. x> 0.. j=1 '\' • = 1. given (x. N. N(I". Let N = (N l .. j=1 • .1 < j < T. In a Bayesian model whereX l . . 0<0. and () = a. 14. 0 > O. .J t • I X..9). 0> 0 a otherwise.~X) '" tnI. Find I 12. <0<x and let7t(O) = 2exp{20}. posterior predictive distribution...~vO}. has density r("" a) IT • fa(u) ~ n' r(.. 8 2 ) is such that vn(p.Xn. Goals. x n ).) pix I 0) ~ ({" ..i.") ~ .~l . (a) If f and 7t are the N(O. .. . 9= (OJ.) u/' 0 < cx) L".2 is (called) the precision of the distribution of Xi. The posterior predictive distribution is the conditional distribution of X n +l given Xl. . compute the predictive and n + 00. (. . . Letp(x I OJ =exp{(xO)}. 11.. D(a). Xl .d. . 'V . .
(b) Find the minimax rule among {b 1 . and the loss function l((J.1.q) and let when at. a) is given by 0. = 11'. a2.159 be the decision rules of Table 1.5 and (ii) l' = 0. . The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O.. O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x. n r ).. or 0 > a be penoted by I.1. (c) Find the minimax rule among bt.5. (b)p=lq=.3.3. respectively and suppose .5. (J2.. 1C(O. Compute and plot the risk points (a)p=q= . 159 for the preceding case (a)... . = 0. (d) Suppose 0 has prior 1C(01) ~ 1'. .1.p) (I .3. I b"9 }. a3. 159 of Table 1. 1957) °< °= a 0. Let the actions corresponding to deciding whether the loss function is given by (from Lehmann.3...3.Section 1.7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a). Suppose that in Example 1. Suppose the possible states of nature are (Jl.1.3. Find the Bayes rule for case 2. the possible actions are al. 0. Problems for Section 1. .3 IN ~ n) is V( a + n). See Example 1. 1C(02) l' = 0.) = 0. . a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St..5. (d) Suppose that 0 has prior 1C(OIl (a). . 1. 8 = 0 or 8 > 0 for some parameter 8. . I. Find the Bayes rule when (i) 3. (J) given by O\x a (I . (c) Find the minimax rule among the randomized rules. then the posteriof7r(0 where n = (nt l · · · .
. geographic locations or age groups). < J' < s.7. 1 (l)r=s=l. 1 < j < S. I . are known. (. How should nj. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e. 1 < j < S. Show that the strata sample sizes that minimize MSE(!i2) are given by (1. c<I>(y'n(sO))+b<I>(y'n(rO)).=l iii = N. 0 be chosen to make iiI unbiased? I (b) Neyman allocation..(X) 0 b b+c 0 c b+c b 0 where b and c are positive.. Suppose X is aN(B. random variables Xlj"". b<f>(y'ns) + b<I>(y'nr). Within the jth stratum we have a sample of i. Assume that 0 < a.8. I I I.j = 1.i. (b) Plot the risk function when b = c = 1.) c<f>( y'n(r . Stratified sampling. Weassurne that the s samples from different strata are independent.andastratumsamplemeanXj. are known (estimates will be used in a latet chapter).d. . and <I> is the N(O.I L LXij. .76 StatistiCCII Models. 11 . n = 1 and . where <f> = 1 <I>. We want to estimate the mean J1. J". iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj.s. Goals. 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s. < 00. Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0. i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4.Xnjj. . . and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J".j = I.3) .g.. E... (a) Compute the biases.0)) +b<f>( y'n(s . variances. and MSEs of iiI and 'j1z.0)). 1) distribution function.) r=2s=1.. 0<0 0=0 0 >0 R(O.
and = 1.3. X n be a sample from a population with values 0. U(O..45. show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift). 1).. . and that n is odd. .. ~ ~  ~ 6.= 2:.2.. Hint: By Problem 1. ~ 7.. ..5(0 + I) and S ~ B(n.25..Section 1.'..4.20.3) is N. ~ a) = P(X = c) = P.. X n are i. that X is the median of the sample. (c) Same as (b) except when n ~ 1.5. ~ ~ ~ ..5.. ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c. n ~ 1.40.) with nk given by (1.7 Problems and Complements 77 Hint: You may use a Lagrange multiplier. .i. (iii) F is normal. 5. 75.9. (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X).2. Hint: Use Problem 1. a = . p = .0.3. ali X '"'' F.13. (d) Find EIX .2'. Also find EIX .0 . (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0. Hint: See Problem B..b. b = '. . I).bl.5.5. P(X ~ b) = 1 .0 ~ + 2'. < ° P < 1.=1 Pj(O'j where a. (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i.b. Each value has probability . set () = 0 without loss of generality. and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii).15.. .1. Use a numerical integration package. We want to estimate "the" median lJ of F. [f the parameters of interest are the population mean and median of Xi . Next note that the distribution of X involves Bernoulli and multinomial trials. . . . ° = 15. Let X and X denote the sample mean and median. plot RR for p = .p).2. (b) Evaluate RR when n = 1.=1 pp7J" aV.3. N(o. respectively. > 0.0 + '. . Let Xb and X b denote the sample mean and the sample median of the sample XI b. Hint: See Problem B.7..2.bl for the situation in (i). Suppose that X I.3. > k) (ii) F is uniform. .I 2:.bl when n = compare it to EIX ..2p. Suppose that n is odd. The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = . where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~. l X n .5.d. Let XI.
3. A decision rule 15 is said to be unbiased if E. 10% have the characteristic.1'] ..1'])2.'). then this definition coincides with the definition of an unbiased estimate of e. suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i. and only if.X)' = ([Xi . 9. the powerfunction. 0 < a 2 < 00. Let Xl. II. Show that the value of c that minimizes M S E(c/. defined by (J(6. .2)(.3(a) with b = c = 1 and n = I.a)2. ' . It is known that in the state where the company is located. . .Xn be a sample from a population with variance a 2 . . = c L~ 1 (Xi .3. (a) Show that s:? = (n .(o(X)).  ..L)4 = 3(12. ~ 2(n _ 1)1.8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company.2 ~~ I (Xi . being lefthanded).(1(6'. e 8 ~ (. 6' E e.) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B. Let () denote the proportion of people working in a company who have a certain characteristic (e.X)2 has a X~I distribution. (i) Show that MSE(S2) (ii) Let 0'6 c~ . .0(X))) for all 6.. 8. (a)r = 8 =1 (b)r= ~s =1. Which one of the rules is the better one from the Bayes point of view? 11. then expand (Xi .X)2 keeping the square brackets intact.X)2 is an unbiased estimator of u 2 • Hint: Write (Xi . satisfies > sup{{J(6. (a) Show that if 6 is real and 1(6. 0) (J(6'. Find MSE(6) and MSE(P). Goals.10) + (. I . I . I (b) Show that if we use the 0 1 loss function in testing. .X)2. for all 6' E 8 1. You may use the fact that E(X i .0): 6 E eo}..0(X))) < E. i1 .for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100.. and Performance Criteria Chapter 1 :! .3) that . (b) Suppose Xi _ N(I'.[X . .0) = E. then a test function is unbiased in this sense if. If the true 6 is 60 ..(1(6.4.J. In Problem 1. 10.g. I I 78 Statistical Models. A person in charge of ordering equipment needs to estimate and uses I .1)1 E~ 1 (Xi .. a) = (6 .
consider the estimator < MSE(X).3.7 Problems and Complements 79 12.3. 18.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known. 14. Po[o(X) ~ 11 ~ 1 . 15. there is a randomized procedure 03 such that R(B. show that if c ::. Show that if J 1 and J 2 are two randomized.. b. 0 < u < 1. A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1. use the test J u . In Example 1.) + (1 ..1. (a) find the value of Wo that minimizes M SE(Mw).1) and let Ok be the decision rule "reject the shipment iff X > k.ao) ~ O. Compare 02 and 03' 19.3.1. . 03) = aR(B. Your answer should Mw If n." (a) Show that the risk is given by (1. o Suppose that U .) for all B. Define the nonrandomized test Ju .7). Consider the following randomized test J: Observe U. In Example 1. and possible actions at and a2.1'0 I· 17. (h) If N ~ 10. Suppose the loss function feB. wc dccide El.Section 1. but if 0 < <p(x) < 1. (b) find the minimum relative risk of P. Show that J agrees with rp in the sense that.wo to X... if <p(x) = 1.a )R(B.' and 0 = II' . by 1 if<p(X) if <p(X) >u < u.4.1. (B) = 0 for some event B implies that Po (B) = 0 for all B E El.3. B = . 16. (c) Same as (b) except k = 2. 1) and is independent of X. a) is . 0. and k o ~ 3. z > 0. The interpretation of <p is the following. Furthersuppose that l(Bo. Show that the procedure O(X) au is admissible. Convexity ofthe risk set.UfO. . find the set of I' where MSE(ji) depend on n. s = r = I. and then JT. we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise. If U = u. consider the loss function (1. Ok) as a function of B. Suppose that Po. For Example 1.3. If X ~ x and <p(x) = 0 we decide 90. procedures. 0.4.~ is unbiased 13. given 0 < a < 1. Suppose that the set of decision procedures is finite. Consider a decision problem with the possible states of nature 81 and 82 .3. plot R(O.Polo(X) ~ 01 = Eo(<p(X)). In Problem 1. then.
What is 1. In Example 1. U2 be independent standard normal random variables and set Z Y = U I. Let U I . and Performance Criteria Chapter 1 B\a 0.7 find the best predictors ofY given X and of X given Y and calculate their MSPEs.4 I 0. and the best zero intercept linear predictor. 4. . Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error.Is Z of any value in predicting Y? = Ur + U?..2 0.9. al a2 0 3 2 I 0. 0 0. In Problem B. Let X be a random variable with probability function p(x ! B) O\x 0. 6. 3. !.l. The midpoint of the interval of such c is called the conventionally defined median or simply just the median.. Give an example in which Z can be used to predict Y perfectly. An urn contains four red and four black balls. its MSPE.) = 0. (b) Give and plot the risk set S.8 0. (c) Suppose 0 has the prior distribution defined by . (b) Compute the MSPEs of the predictors in (a). !.(0.(0. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y.4.. Give the minimax rule among the nonrandomized decision rules. and the ratio of its MSPE to that of the best and best linear predictors. (a) Find the best predictor of Y given Z. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z). Give the minimax rule among the randomized decision rules. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly. the best linear predictor. 5. 0.6 (a) Compute and plot the risk points of the nonrandomized decision rules. Four balls are drawn at random without replacement.1. 7.80 Statistical Models.1 calculate explicitly the best zero intercept linear predictor. 2. Goals.) the Bayes decision rule? Problems for Section 1.4 = 0.
Z".L. Y = Y' + Y". Let Zl and Zz be independent and have exponential distributions with density Ae\z. p) distribution and Z". z > 0. (b) Show directly that I" minimizes E(IY . y l ). p( c all z. (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13. ICor(Z. ° 14. (a) Show that the relation between Z and Y is weaker than that between Z' and y J . Yare measurements on such a variable for a randomly selected father and son.Section 1. 9. Y" are N(v.t. 7 2 ) variables independent of each other and of (ZI.z) for 11. Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» . Show that c is a median of Z.2 ) distrihution. Y'. that is. p( z) is nonincreasing for z > c. Show that if (Z. and otherwise. Let ZI.YI > s. 0.col < eo. Suppose that Z has a density p. Y)I < Ipl. Y) has a bivariate normal distribution. Y" be the corresponding genetic and environmental components Z = ZJ + Z". 10. (a) Show that P[lZ  tl < s] is maximized as a function oftfor each s > ° by t = c.PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8. Define Z = Z2 and Y = Zl + Z l Z2. 12. exhibit a best predictor of Y given Z for mean absolute prediction error. Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) . where (ZI . Y') have a N(p" J. + z) = p(c . which is symmetric about c. If Y and Z are any two random variables.  = ElY cl + (c  co){P[Y > co] . which is symmetric about c and which is unimodal.cl) = oQ[lc . (b) Suppose (Z.el) as a function of c. Let Y have a N(I". Suppose that Z has a density p.7 Problems and Complements 81 Hint: If c ElY . Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor.1"1/0] where Q(t) = 2['I'(t) + t<I>(t)]. a 2. that is. a 2. Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables. (a) Show that E(IY . Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error. Suppose that Z.
/L£(Z)) ~ maxgEL Corr'(Y./L)/<J and Y ~ /3Yo/<J. > 0. then'fJh(Z)Y =1JZY.I • ! . 2 'T ' . €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z.4. and Var( Z. hat1s. . At the same time t a diagnostic indicator Zo of the severity of the disease (e.4. IS./L(Z)) = max Corr'(Y. It will be convenient to rescale the problem by introducing Z = (Zo . Hint: Recall that E( Z.. Predicting the past from the present. (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y. Let S be the unknown date in the past when the sUbject was infected. Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo. where piy is the population multiple correlation coefficient of Remark 1. (See Pearson. at time t. >. Let /L(z) = E(Y I Z ~ z). j3 > 0. Yo > O./L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y).4. and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease. IS.1] .g(Z)) where £ is the set of 17. Show that p~y linear predictors. We are interested in the time Yo = t . (c) Let '£ = Y . and Doksurn and Samarov.S from infection until detection.) ~ 1/>''. = Corr'(Y.3.) ~ 1/>. Goals. . 82 Statisflcal Models. on estimation of 1]~y. 1905.y}. 16.) (a) Show that 1]~y > piy. Hint: See Problem 1.) = E( Z. Show that Var(/L(Z))/Var(Y) = Corr'(Y./LdZ) be the linear prediction error.4. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1.15.(y) = >. in the linear model of Remark 1.ElY . I. . . 1995. .: (0 The best linear MSPE predictor of Y based on Z = z. . 1). One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y. Show that. and Performance Criteria Chapter 1 . Consider a subject who walks into a clinic today. that is. y> O.g. . ~~y = 1 .exp{ >. g(Z)) 9 where g(Z) stands for any predictor. and is diagnosed with a certain disease. mvanant un d ersuchh. a blood cell or viral load measurement) is obtained..) = Var( Z.
(In practice. Let Y be a vector and let r(Y) and s(Y) be real valued.. s(Y» given Z = z. 1). (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo . (a) Show that ifCov[r(Y).4. s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y). Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2. b) equal to zero.).g(Zo)l· Hint: See Problems 1. b). Y2) using (a). and checking convexity.. Hint: Use Bayes rule.4. Establish 1. 7r(Y I z) ~ (27r) . .~ [y  (z .7 and 1. This density is called the truncated (at zero) normal. then = E{Cov[r(Y). . N(z .z) .14 by setting the derivatives of R(a. + b2 Z 2 + W.z).>. Let Y be the number of heads showing when X fair coins are tossed. need to be estimated from cohort studies. 20.I exp { .. density. where Z}.>. (b) Show that (a) is equivalent to (1. s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}. Write Cov[r(Y).4.>')1 2 } Y> ° where c = 4>(z .(>' . 21.7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density . (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>. (c) Show that if Z is real. solving for (a.6) when r = s. (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo.Section 1.9. 2000). c. Cov[r(Y).s(Y)] Covlr(Y). Find (a) The mean and variance of Y. (c) The optimal predictor of Y given X = x.. (b) The MSPE of the optimal predictor of Y based on X. including the "prior" 71". s(Y)] < 00. . Z2 and W are independent with finite variances. wbere Y j . all the unknowns. Find Corr(Y}. where X is the number of spots showing when a fair die is rolled.4. x = 1. I Z]' Z}. 6. 1990. and Nonnand and Doksum.. 19. see Berman.
Zz). where Po(Y.~(1 . Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad. 1 X n be a sample from a Poisson.x > 0. .z) = 6w (y. 0 < z < I. p(e).z) be a positive realvalued function.4.6(r. This is thebeta. . show that the MSPE of the optimal predictor is .. 0) = Ox'. a > O. and Performance Criteria Chapter 1 (e) In the preceding model (d). . Oax"l exp( Ox").. Y ~ B(n. In this case what is Corr(Y. Let Xl.. Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem. . a > O.z)lw(y. 1.3.density. 23.p~y). (a) p(x. and suppose that Z has the beta. .. 1 2 y' . 0 > 0.15) yields (1. Verify that solving (1. are N(p. (c) p(x. (a) Show directly that 1 z:::: 1 Xi is sufficient for 8. Y z )? (f) In model (d).14). 0 > 0. Problems for Section 1. z) ~ ep(y. Suppose Xl. In Example 1. (a) Let w(y. Goals. s).84 Statistical Models.z).z). Let Xi = 1 if the ith item drawn is bad. .. \ (b) Establish the same result using the factorization theorem. z "5). Then [y . This is known as the Weibull density. 3.e)' Hint: Whatever be Y and c. < 00 for all e.. we say that there is a 50% overlap between Y 1 and Yz.6(O. if b1 = b2 and Zl.e)' = Y' . . and (ii) w(y.Xn is a sample from a population with one of the following densities.4.2eY + e' < 2(Y' + e'). g(Z)) < for some 9 and that Po is a density. > a. Find the 22. 0) = < X < 1. z) and c is the constant that makes Po a density.e' < (Y . Show that EY' < 00 if and only if E(Y .') and W ~ N(po.g(z») is called weighted squared prediction error. 25. . x  . z) = z(1 . Find Po(Z) when (i) w(y. density. 2. 1). n > 2.4. optimal predictor of Yz given (Yi. suppose that Zl and Z. and = 0 otherwise. 24. 0) = Oa' Ix('H). Zz and ~V have the same variance (T2. 00 (b) Suppose that given Z = z. 0 (b) p(x. z) = I.g(z)]'lw(y.5 1. Assume that Eow (Y.0> O.l . population where e > O. Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z).
Section 1. 7
Problems and Complements
85
This is known as the Pareto density. In each case. find a realvalued sufficient statistic for (), a fixed. 4. (a) Show that T 1 and T z are equivalent statistics if, and only if, we can write T z = H (T1 ) for some 11 transformation H of the range of T 1 into the range of T z . Which of the following statistics are equivalent? (Prove or disprove.) (b) n~ (c) I:~
1
x t and I:~
and I:~
I:~
1
log Xi, log Xi.
Xi Xi
I Xi
1
>0 >0
l(Xi X
1 (Xi 
(d)(I:~ lXi,I:~ lxnand(I:~ lXi,I:~
(e) (I:~
I Xi, 1
?)
xl)
and (I:~
1 Xi,
I:~
X)3).
5. Let e = (e lo e,) be a bivariate parameter. Suppose that T l (X) is sufficient for e, whenever 82 is fixed and known, whereas T2 (X) is sufficient for (h whenever 81 is fixed and known. Assume that eh ()2 vary independently, lh E 8 1 , 8 2 E 8 2 and that the set S = {x: pix, e) > O} does not depend on e. (a) Show that ifT, and T, do not depend one2 and e, respectively, then (Tl (X), T2 (X)) is sufficient for e. (b) Exhibit an example in which (T, (X), T2 (X)) is sufficient for T l (X) is sufficient for 8 1 whenever 8 2 is fixed and known, but Tz(X) is not sufficient for 82 , when el is fixed and known. 6. Let X take on the specified values VI, .•. 1 Vk with probabilities 8 1 , .•• ,8k, respectively. Suppose that Xl, ... ,Xn are independently and identically distributed as X. Suppose that IJ = (e" ... , e>l is unknown and may range over the set e = {(e" ... ,ek) : e, > 0, 1 < i < k, E~ 18i = I}, Let Nj be the number of Xi which equal Vj' (a) What is the distribution of (N" ... , N k )? (b) Sbow that N = (N" ... , N k _,) is sufficient for 7. Let Xl,'"
1
e,
e.
X n be a sample from a population with density p(x, 8) given by
pix, e)
o otherwise.
Here
e = (/1, <r) with 00 < /1 < 00, <r > 0.
(a) Show that min (Xl, ... 1 X n ) is sufficient for fl when a is fixed. (b) Find a onedimensional sufficient statistic for a when J1. is fixed. (c) Exhibit a twodimensional sufficient statistic for 8. 8. Let Xl,. " ,Xn be a sample from some continuous distribution Fwith density f, which is unknown. Treating f as a parameter, show that the order statistics X(l),"" X(n) (cf. Problem B.2.8) are sufficient for f.
,
"
I
!
86
Statistical Models, Goals, and Performance Criteria
Chapter 1
9. Let Xl, ... ,Xn be a sample from a population with density
j,(x)
a(O)h(x) if 0, < x < 0,
o othetwise
where h(x)
> 0,0= (0,,0,)
with
00
< 0, < 0, <
00,
and a(O)
=
[J:"
h(X)dXr'
is assumed to exist. Find a twodimensional sufficient statistic for this problem and apply your result to the U[()l, ()2] family of distributions. 10. Suppose Xl" .. , X n are U.d. with density I(x, 8) = ~elx61. Show that (X{I),"" X(n», the order statistics, are minimal sufficient. Hint: t,Lx(O) =  E~ ,sgn(Xi  0), 0 't {X"" . , X n }, which determines X(I),
. " , X(n)'
11. Let X 1 ,X2, ... ,Xn be a sample from the unifonn, U(O,B). distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for O.
12. Dynkin, Lehmann, Scheffe's Theorem. Let P = {Po : () E e} where Po is discrete concentrated on X = {x" x," .. }. Let p(x, 0) p.[X = xl Lx(O) > on Show that f:xx(~~) is minimial sufficient. Hint: Apply the factorization theorem.
=
=
°
x,
13. Suppose that X = (XlI" _, X n ) is a sample from a population with continuous distribution function F(x). If F(x) is N(j1., ,,'), T(X) = (X, ,,'). where,,2 = n l E(Xi 1')2, is sufficient, and S(X) ~ (XCI)"" ,Xin»' where XCi) = (X(i)  1')/'" is "irrelevant" (ancillary) for (IL, a 2 ). However, S(X) is exactly what is needed to estimate the "shape" of F(x) when F(x) is unknown. The shape of F is represented hy the equivalence class F = {F((·  a)/b) : b > 0, a E R}. Thus a distribution G has the same shape as F iff G E F. For instance, one "estimator" of this shape is the scaled empirical distribution function F,(x) jln, x(j) < x < x(i+1)' j = 1, . .. ,nl
~
0, x
< XCI)
> x(n)
1, x
~
Show that for fixed x, F,((x  x)/,,) converges in prohahility to F(x). Here we are using F to represent :F because every member of:F can be obtained from F.
I
I ,
'I i,
14. Kolmogorov's Theorem. We are given a regular model with e finite.
(a) Suppose that a statistic T(X) has the property that for any prior distribution on 9, the posterior distrihution of 9 depends on x only through T(x). Show that T(X) is sufficient.
(b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the posterior distribution depends on x only through T(x).
Section 1.7
Problems and Complements
87
Hint: Apply the factorization theorem.
15. Let X h .··, X n be a sample from f(x  0), () E R. Show that the order statistics arc minimal sufficient when / is the density Cauchy Itt) ~ I/Jr(1 + t 2 ). 16. Let Xl,"" X rn ; Y1 ,· . " ~l be independently distributed according to N(p, (72) and N(TI, 7 2 ), respectively. Find minimal sufficient statistics for the following three cases:
(i) p, TI,
0", T
are arbitrary:
00
< p, TI < 00, a <
(J,
T.
(ii)
(J
=T
= TJ
and p, TI, (7 are arbitrary. and p,
0", T
(iii) p
are arbitrary.
17. In Example 1.5.4. express tl as a function of Lx(O, 1) and Lx(l, I). Problems to Sectinn 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose X I, ... , X n is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of X fonns a oneparameter exponential family. Identify 'TI, B, T, and h. 3. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability nf success O. Then P, IX = k] = (I  0)'0, k ~ 0 0 1,2, ... This is called thc geometric distribution (9 (0». (a) Show that the family of geometric distributions is a oneparameter exponential family with T(x) ~ x. (b) Deduce from Theorem 1.6.1 that if X lo '" oXn is a sample from 9(0), then the distributions of L~ 1 Xi fonn a oneparameter exponential family. (c) Show that E~
1
Xi in part (b) has a negative binomial distribution with parameters
(noO)definedbyP,[L:71Xi = kJ =
(n+~I
)
(10)'on,k~0,1,2o'"
(The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success 0.) Hint: By Theorem 1.6.1, P,[L:7 1 Xi = kJ = c.(1  o)'on. 0 < 0 < 1. If
=
' " CkW'
I
= c(;'',)::, 0 lw n
k=O
L..J
< W < I, then
4. Which of the following families of distributions are exponential families? (Prove or
disprove.) (a) The U(O, 0) fumily
88
(b) p(.", 0)
Statistical Models, Goals, and Performance Criteria
Chapter 1
(c)p(x,O)
= {exp[2Iog0+log(2x)]}I[XE = ~,xE {O.I +0, ... ,0.9+0j = 2(x +
0)/(1 + 20), 0
(0,0)1
(d) The N(O, 02 ) family, 0 > 0
(e)p(x,O)
< x < 1,0> 0
(f) p(x,9) is the conditional frequency function of a binomial, B(n,O), variable X, given that X > O.
5. Show that the following families of distributions are twoparameter exponential families and identify the functions 1], B, T, and h. (a) The beta family. (b) The gamma family. 6. Let X have the Dirichlet distribution, D( a), of Problem 1.2.15. Show the distribution of X form an rparameter exponential family and identify fJl B, T, and h.
7. Let X = ((XI, Y I ), ... , (X no Y n » be a sample from a bIvariate nonnal population.
Show that the distributions of X form a fiveparameter exponential family and identify 'TJ, B, T, and h.
8. Show that the family of distributions of Example 1.5.3 is not a one parameter eX(Xloential family. Hint: If it were. there would be a set A such that p(x, 0) > on A for all O.
°
9. Prove the analogue of Theorem 1.6.1 for discrete kparameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (XI, X 2) is a sample of size 2 from f(·, 0), then XI + X2 is sufficient for B. Show that f(·, B) corresponds to a onearameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x" X2) such that log f(x" 0) + log f(X2, 0) = g(xI + X2, 0) + h(XI, X2). Fix 00 and let r(x, 0) = log f(x, 0)  log f(x, 00), q(x, 0) = g(x,O)  g(x,Oo). Then, q(xI + X2,0) = r(xI,O) +r(x2,0), and hence, [r(x" 0) r(O, 0)1 + [r(x2, 0)  r(O, 0») = r(xi + X2, 0)  r(O, 0). 11. Use Theorems 1.6.2 and 1.6.3 to obtain momentgenerating functions for the sufficient statistics when sampling from the following distributions. (a) normal, () ~ (ll,a 2 )
(b) gamma. r(p, >.), 0
= >., p fixed
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3)
(0 gamma. r(p, >'). ()
= (p, >.).
 
,

Section 1. 7
Problems and Complements
89
12. Show directly using the definition of the rank of an ex}X)nential family that the multinomialdistribution,M(n;OI, ... ,Ok),O < OJ < 1,1 <j < k,I:~oIOj = 1, is of rank k1. 13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any (k ~ I)dimensional hyperplane. 14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of &. Show that if k = 1 and &0 oJ 0 and A, A are defined on all of &, then Theorem 1.6.3 continues to hold. 15. Let P = {P. : 0 E e} where p. is discrete and concentrated on X = {x" X2, ... }, and let p( x, 0) = p. IX = x I. Show that if P is a (discrete) canonical ex ponential family generated bi, (T, h) and &0 oJ 0, then T is minimal sufficient. Hint: ~;j'Lx('l) = Tj(X)  E'lTj(X). Use Problem 1.5.12.
16. Life testing. Let Xl,.'" X n be independently distributed with exponential density (20)l e x/2. for x > 0, and let the ordered X's be denoted by Y, < Y2 < '" < YnIt is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r aparticles have been emitted. Show that
(i) The joint distribution of Y1 , •.. , Yr is an exponential family with density
n! [ (20), (n _ r)! exp (ii) The distribution of II:: I Y;
(iii) Let
1
I::l Yi + (n 20
r)Yr]
' 0  Y,  ...  Yr·
<
<
<
+ (n 
r)Yrl/O is X2 with 2r degrees of freedom.
denote the time required until the first, second,... event occurs in a Poisson process with parameter 1/20' (see A.I6). Then Z, = YI/O', Z2 = (Y2 Yr)/O', Z3 = (Y3  Y 2)/0', ... are independently distributed as X2 with 2 degrees of freedom, and the joint density of Y1 , ••. , Yr is an exponential family with density
Yi, Yz , ...
The distribution of Yr/B' is again XZ with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burnedout tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
I ,
90
Statistical Models, Goals, and Performance Criteria Chapter 1
1)(Y; l~~l)/e (I = 1", .. ,') are independently distributed as X2 with 2 degrees of freedom, and [L~ 1 Yi + (n  7")Yr]/B = [(ii): The random variables Zi ~ (n  i
+
L::~l Z,.l
17. Suppose that (TkXl' h) generate a canonical exponential family P with parameter k 1Jkxl and E = R . Let
(a) Show that Q is the exponential family generated by IlL T and h exp{ cTT}. where IlL is the projection matrix of Tonto L = {'I : 'I = BO + c). (b) Show that ifP has full rank k and B is of rank I, then Q has full rank l. Hint: If B is of rank I, you may assume
18. Suppose Y1, ... 1 Y n are independent with Yi '" N(131 + {32Zi, (12), where Zl,'" , Zn are covariate values not all equaL (See Example 1.6.6.) Show that the family has rank 3.
Give the mean vector and the variance matrix of T.
19. Logistic Regression. We observe (Zll Y1 ), ... , (zn, Y n ) where the Y1 , .. _ , Y n are independent, Yi "' B(TIi, Ad The success probability Ai depends on the characteristics Zi of the ith subject, for example, on the covariate vector Zi = (age, height, blood pressure)T. The function I(u) ~ log[u/(l  u)] is called the logil function. In the logistic linear re(3 where (3 = ((31, ... ,/3d ) T and Zi is d x 1. gression model it is assumed that I (Ai) = Show that Y = (Y1 , ... , yn)T follow an exponential model with rank d iff Zl, ... , Zd are
zT
not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). 20. (a) In part IT of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by ('11 'Io)TT and that ~(ii) =~(i). (b) Fill in the details of part III of the proof of Theorem 1.6.4. 21. Find JJ.('I) ~ EryT(X) for the gamma,
qa, A), distribution, where e = (a, A).
I
22. Let X I, . _ . ,Xn be a sample from the k·parameter exponential family distribution (1.6.10). Let T = (L:~ 1 1 (Xi ), ... , L:~ 1Tk(X,») and let T
I
S
~
((ryl(O), ... ,ryk(O»): e E 8).
Show that if S contains a subset of k + 1 vectors Vo, .. _, Vk+l so that Vi  Vo, 1 < i are not collinear (linearly independent), then T is minimally sufficient for 8.
< k.
I .' jl,
"
23. Using (1.6.20). find a conjugate family of distributions for the gamma and beta families. (a) With one parameter fixed. (b) With both parameters free.
:
I
Section 1.7
Problems and Complements
91
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter 0 = (O!, O ) where O! = E,(X), 0, ~ l/(Var oX) (cf. Problem 1.2.12). 2 25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with (72 known. Find a conjugate family of prior distributions for (131,132) T. 26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15. 27. Let P denote the canonical exponential family genrated by T and h. For any TJo E £, set ho(x) = q(x, '10) where q is given by (1.6.9). Show that P is also the canonical exponential family generated by T and h o.
28. Exponential/amities are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by h(f)
~ E(logf(X)) =
l:IIOgf(X)I!(X)dx.
This quantity arises naturally in infonnation in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S ~ {x: f(x) > OJ. (a) Show that the canonical kparameter exponential family density
f(x, 'I)
= exp
• ryjrj(x) 1/0 + I:
j:=1
A('I)
, XES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is
f(x)dx
~ 1,
Is
f(x)rj(x)
~ aj,
1 < j < k,
where '17o, .•.• '17k are chosen so that f satisfies the constraints. Hint: You may usc Lagrange multipliers. Maximize the integrand. (b) Find the maximum entropy densities when rj(x) = x j and (i) S ~ (0,00), k = 1, at > 0; (ii) S = R, k = 2, at E R, a, > 0; (iii) S = R, k = 3, a) E R, a, > 0, a3 E R. 29. As in Example 1.6.11, suppose that Y 1, ...• Y n are Li.d. Np(f.L. E) where f.L varies freely in RP and E ranges freely over the class of all p x p symmetric positive definite matrices. Show that the distribution of Y = (Y ... , Yn ) is the p(p + 3)/2 canonical " exponential family generated by h = 1 and the p(p + 3)/2 statistics
n n
Tj
=
LYii>
i=l
1 <j <Pi
Tjl =
LJ'ijJ'iI.
i=l
1 <j< l<p
where Y i = (Yi!, ... , Yip). Show that <: is open and that this family is of rank pcp + 3)/2. Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = pcp + 3)/2 statistics Tj(Y) ~ Yj, 1 < j < p, and Tj,(Y) = YjYi, 1 <j < I < p,
92
Statistical Models, Goals, and Performance Criteria
Chapter 1
generate Np(J.l, E). As E ranges over all p x p symmetric positive definite matrices, so does E 1 • Next establish that for symmetric matrices M,
J
M
exp{ _uT Mu}du
< 00 iff M
is positive definite
by using the spectral decomposition (see B.I0.1.2)
=L
j=1
p
AjejeJ for el, ... , e p orthogonal. Aj E R.
To show that the family has full rank m, use induction on p to show that if Zt, ... , Zp are i.i.d. N(O, 1) and if B pxp = (b jl ) is symmetric, then
p
P
LajZj
j"" 1
+ Lbj,ZjZ,
j,l
~c
= P(aTZ + ZTBZ = c) ~ 0
N p(l', E), then
unless a ~ 0, B = 0, c = 0. Next recall (Appendix B.6) that since Y ~ y = SZ for some nonsingular p x p matrix S.
I
30. Show that if Xl,'" ,Xn are d.d. N p (8,E o) given (J where ~o is known, then the Np(A, f) family is conjugate to N p(8, Eo), where A varies freely in RP and f ranges over
all p x p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(I'j, Tj) : 1 < j < k} be a given collection of pairs with I'j E R, Tj > 0. Let (tt, tT) be a random pair with Aj = P«(I', tT) = (I'j, Tj)), 0 < Aj < 1, L:~~l Aj = 1. Let 8 be a random variable whose conditional distribution given (IL, IT) = (p,j 1 Tj) is nonnal, N(p,j, rJ). Consider the model X = 8 + f, where 8 and € are independent and € rv N(O, a3), a~ known. Note that 8 has the prior density
11'(0)
=L
j=l
k
Aj'l'rj (0  I'j)
(1.7.4)
where 'I'r denotes the N(O, T 2 ) density. Also note that (X tion. (a) Find the posterior
i!
I 0) has the N(O, (75) distribu
" "
,;
k
11'(0 I x)
and write it in the fonn
= LP((tt,tT) ~
j=1
(l'j,Tj) I X)1I'(O I (l'j,Tj),X)
" • ,
k
L
j=1
Aj (x)'I'rj(x) (0 l'j(X»
Section 1.7
Problems ;3nd Complements
93
for appropriate A) (x), Tj (x) and ILJ (x). This shows that (1. 7.4) defines a conjugate prior for the N(O, (76), distribution. (b) Let Xi = + Ei, I < i < n, where is as previously and EI," ., En are ij.d. N(O, (76)' Find the posterior 7r( 0 I Xl, ... , x n ), and show that it belongs to class (1.7 A). Hint: Consider the sufficient statistic for p(x I B).
e
e
32. A Hierarchical BinomialBeta Model. Let {(rj, Sj) : 1 <j < k} be a given collection of pair.; with rj > 0, sJ > 0, let (R, S) be a random pair with P(R = cJ' S = 8j) = Aj, D < Aj < 1, E7=1 Aj = 1, and let e be a random variable whose conditional density ".(0, c, s) given R = r, S = S is beta, (3(c, s). Consider the model in which (X I 0) has the binomial, B( n, fJ), distribution. Note that e has the prior density
".(0)
Find the posterior
k
=L
j=1
k
Aj"'(O, cJ ' sJ)'
(J .7.5)
".(0 I x) = LP(R= cj,S =
j=l
8j
I x)7r(O I (rj,sj),x)
and show that it can be written in the form J (x)7r(O,rj(x),sj(x)) for appropriate Aj(X), Cj(x) and 8j(X). This shows that (1.7.5) defines a class of conjugate prior.; for the B( n, 0) distribution.
L:A
33. Let p(x,TJ) be a one parameter canonical exponential family generated by T(x) = x and h(x), X E X C R, and let 1jJ(x) be a nonconstant, nondecreasing function. Show that E,1jJ(X) is strictly increasing in ry. Hint:
Cov,(1jJ(X), X)
~E{(X 
X')i1jJ(X) 1jJ(X')]}
where X and X' are independent identically distributed as X (see A.Il.12).
34. Let (Xl, ... , X n ) be a stationary Markov chain with two states D and 1. That is.
P[Xi
where
= Ei I Xl = EI,·· .,Xi  = Eid = P[Xi = Ci I X i  l = Eid =
l
PEi_1Ei
(POO PIO
pal) is the matrix of transition probabilities. Suppose further that Pn '
= 1  p.
(i) poo
(ii)
= PII = p, so that, PlO = Pal PIX, = OJ = PIX, = IJ = !.
94
Statistical Models, Goals, and Performance Criteria
Chapter 1
(a) Show that if 0 < p < 1 is unknown this is a full rank, oneparameter exponential family with T = NOD + N ll where Nt) the number of transitions from i to j. For example, 01011 has N Ol = 2, Nil = 1, N oo = 0, N IO ~ 1.
(b) Show that E(T)
= (n 
l)p (by the method of indicators or otherwise).
35. A Conjugate Priorfor the Two~Sample Problem. Suppose that Xl, ... , X n and Y1 , ... , Yn are independent N(fLI' (12) and N(1l2' ( 2 ) samples, respectively. Consider the prior 7r for which for some r > 0, k > 0, ro 2 has a X~ distribution and given 0 2 , /11 and fL2 are independent with N(~I, <7 2/ kt} and N(6, <7 2/ k2) distributions, respectively, where ~j E R, k j > 0, j = 1,2. Show that Jr is a conjugate prior.
I
36. The inverse Gaussian density. IG(j..t, .\), is
f(x,J1.,>')
= [>./21Tjl/2 x 3/2 exp { >.(x 
J1.)2/ 2J1.2 X}, x> 0, J1. > 0, >. > O.
(a) Show thatthis is an exponentialfamily generated hy T( X) h(x) = (21T)1/2 X 3/'. (b) Show that the canonical parameters TJl, TJ2 are given by TJI that A( 'II, '12) =  [! [Og('I2) + v''I1'I2]'£ = [0,00) x (0,00).
= ! (X, XI) T and = fL 2A, 1]2 =
= '\, and
(e) Fwd the momentgenerating function ofT and show that E(X) J1. 3>., E(XI) = J1. 1 + >.1, Var(X I ) = (>'J1.)1 + 2>'2. (d) Suppose J1. pnor.
~
J1., Var(X)
J1.o is known. Show that the gamma family, qa,,6), is a conjugate
(e) Suppose that>' = >'0 is known. Show that the conjngate prior formula (1.6.20) produces a function that is not integrable with respect to fl. That is, defined in (1.6.19) is empty.
n
(I) Suppose that J1. and>. are both unknown. Show that (1.6.20) produces a function
that is not integrable; that is, f! defined in (1.6,19) is empty. 37, Let XI, ... , X n be i.i.d. as X ~ Np(O, ~o) where ~o is known. Show that the conjugate prior generated by (1.6.20) is the N p ( 1]0,761) family, where 1]0 varies freely in RP, 76 > 0 and I is the p x p identity matrix.
,
•
38. Let Xi
"
(Zi, Yi)T be jj,d. as X = (Z, Y)T, 1 < i < n, where X has the density of Example 1.6.3. Write the density of XI, ... ,Xn as a canonical exponential family and identify T, h, A, and E. Find the expected value and variance of the sufficient statistic.
=
,!,i
i!: "
39. Suppose that Y1 , •.. 1 Y n are independent, Yi
'"'' N(fLi, a 2 ), n > 4.
"'
(a) Write the distribution of Y1 , ..• ,Yn in canonical exponential family fonn. Identify T, h, 1), A, and E. (b) Next suppose that fLi depends on the value Zi of some covariate and consider the submodel defined by the map 1) : (0 1, O , 03)T ~ (1'7, <72 jT where 1) is detennined by 2
fLi
I i'i
I ,
I
= exp{OI
+ 02Zi},
Zl
< Z2 < .. , <
Zn;
(72 =
03


Section 1.8
Notes
95
where 8 r E R, O E R, 03 > O. This model is sometimes used when IIi is restricted to be 2 positive. Show that p(y, 0) as given by (1.6.12) is a curved exponential family model with
1=3.
40. Suppose Y 1 , • •. , y;'l are independent exponentially, E' (Ai), distributed survival times, n > 3. (a) Write the distribution of Y1 , ... 1 Yn in canonical exponential family form. Identify T, h, '1, A, and E. (b) Recall that J.1i = E (Y'i) = Ai I. Suppose lJi depends on the value Zi of a covariate. Because Iti > O. fLi is sometimes modeled as
fLi
= CXp{ 0 1 + (hZi},
i
=
1, ... , n
where not all the z's are equal. Show that p(y, fi) as given by (1.6.12) is a curved exponential family model with 1 = 2.
1.8
NOTES
Note for Section 1.1
(1) For the measure theoretically minded we can assume more generally that the Po are
all dominated by a derivative.
(J
finite measure It and that p(x, 8) denotes
dJ;lI, the Radon Nikodym
Notes for Section 1,3
~
(I) More natural in the sense of measuring the Euclidean distance between the estimate f} and the "truth" Squared error gives much more weight to those that are far away from f} than those close to f}.
e.
e
~
(2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r.
Note for Section 1,4
(I) Source; Hodges, Jr., J. L., D. Keetch, and R. S. Crutchfield. Statlab: An Empirical Introduction to Statistics. New York: McGrawHill, 1975.
Notes for Section 1,6
(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particlessee Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in infonnation theorysee Cover and Thomas (199]). (2) The restriction that's x E Rq and that these families be discrete or continuous is artificial. In general if fL is a (J finite measure on the sample space X. p( x, e) as given by (1.6.1)
J. S. E.733741 (1990)..547572 (1957). U. ley. "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion).266291 (1978). A 143. 1966. D. BROWN. 1954. E. Statistical Analysis of Stationary Time Series New York: Wi • . DE GROOT. A.I' L6HMANN. J. L. B.. "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression. positions. Math Statist. BLACKWELL.. .. Elements of Information Theory New York: Wiley. RUPPERT. 1985. Vols. 1997. 125. r HorXJEs. I. CRUTCHFIELD. R. . R.g. P. .9 REFERENCES BERGER.. I: I" Ii L6HMANN. THOMAS. BERMAN. Statist. and Performance Criteria Chapter 1 can be taken to be the density of X with respect to fLsee Lehmann (1997). I and II. 14431473 (1995). KRETcH AND R. FERGUSON. 1991. 6. Testing Statistical Hypotheses. "Using Residuals Robustly I: Tests for Heteroscedasticity. R. P.. • • . E. BOX. GIRSHICK. L. Science 5. MA: AddisonWesley." Ann. Leighton. 1961. 0 . Statist... G. AND COVER.. Royal Statist. Ii . ROSENBLATT.7 (I) u T Mu > 0 for all p x 1 vectors u l' o. 1957. Note for Section 1. P. AND M. Ch. Nonlinearity.. SAMAROV. T. M. BICKEL. Soc. Goals." J. 1969.. K A AND A. "A Theory of Some Multiple Decision Problems. and spheres (e.. Statlab: An Empirical Introduction to Statistics New York: McGrawHili. . and so on. m New York: Hafner Publishing Co. v.. L. STUART. T. and Later Developments. New York: Springer. The Advanced Theory of Statistics. AND M. The Feynmtln Lectures on Physics." Ann. Optimal Statistical Decisions New York: McGrawHili. E. 383430 (1979). 1988. M. Mathematical Statistics New York: Academic Press.. Feynman. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. . L6HMANN. 23. and M." Biomelika. Iii I' i M. L. Eels. 1. 22. IMS Lecture NotesMonograph Series. Hayward. for instance. G.. "A Stochastic Model for the Distribution ofHIV Latency Time Based on T4 Counts. M. Theory ofGames and Statistical Decisions New York: Wiley. the Earth). 160168 (1990). DoKSUM.. Transformation and Weighting in Regression New York: Chapman and Hall. 2nd ed. 40 Statistical Mechanics ofPhysics Reading.. J. 77. CARROLL. A. "Model Specification: The Views of Fisher and Neyman." Ann. Sands. 1963. FEYNMAN." Statist.96 Statistical Models. 1986.. 1975. 1967. P. JR. S. 5. AND A.t AND D. H. D.. KENDALL. This permits consideration of data such as images. II.. GRENANDER. J. R.• Statistical Decision Theory and Bayesian Analysis New York: Springer. L. .
The Statistical Analysis of Experimental Data New York: J. "On the General Theory of Skew Correlation and NOnlinear Regression.. E. Soc. J. . K. Graduate School of Business Administration. H. WETHERILL. Statistical Methods. B.. AND K. 2000. Harvard University.. 303 (1905). The Foundation ofStattstical Inference London: Methuen & Co." Empirical Bayes and Likelihood Inference. 8th Ed." Proc. 1965. J. L. 6779. "Empirical Bayes Procedures for a Change Point Problem with Application to HIVJAIDS Data. Roy. J. SNEDECOR. Lecture Notes in Statistics. SL. MANDEL. SAVAGE.. Ahmed and N. 1962. AND W. 1. 1954. Wiley & Sons. W. Reid..Section 1. New York. Ames. G. SAVAGE. PEARSON. AND K. Sequential Methods in Statistics New York: Chapman and Hall. 1986. V.) RAIFFA. A. Part II: Inference London: Cambridge University Press. Division of Research. COCHRAN. Editors: S.. IA: Iowa State University Press. Introduction to Probability and Statistics from a Bayesian Point of View. Part I: Probability. ET AL. 9 References 97 LINDLEY. 1961. DOKSUM.. SCHLAIFFER. NORMAND. Wiley & Sons. D. Applied Statistical Decision Theory. L. G. 1989. London 71. New York: Springer. (Draper's Research Memoirs. The Foundations ofStatistics. Boston.. D. 1964. Biometrics Series II. G. AND R. Dulan & Co. GLAZEBROOK.
I 'I i .·.! ~J . .: '".'. I:' o. :i. I II 1'1 . I .
we don't know the truth so this is inoperable. In this parametric case. The equations (2.1) Arguing heuristically again we are led to estimates B that solve \lOp(X. X E X.8) is smooth. we could obtain 8 0 as the minimizer. p(X. 8).O) EOoP(X.Chapter 2 METHODS OF ESTIMATION 2.O).1 2.1. Of course. if PO o were true and we knew DC 8 0 . That is. and 8 Jo D( 8 0 .O) where V denotes the gradient. Now suppose e is Euclidean C R d .1.2) define a special form of estimating equations.2) 99 . So it is natural to consider 8(X) minimizing p(X. Then we expect  \lOD(Oo. 6) ~ o. (2.8). how do we find a function 8(X) of the vector observation X that in some sense "is close" to the unknown 81 The fundamental heuristic is typically the following. but in a very weak sense (unbiasedness).1.8) is an estimate of D(8 0 . =0 (2. As a function of 8. In order for p to be a contrast function we require that DC 8 0l 9) is uniquely minimized for 8 = (Jo. DC8 0l 9) measures the (population) discrepancy between 8 and the true value 8 0 of the parameter. This is the most general fonn of the minimum contrast estimate we shall consider in the next section. how do we select reasonable estimates for 8 itself? That is. We consider a function that we shall call a contrast function  p:Xx8>R and define D(Oo. Estimating Equations Our basic framework is as before. the true 8 0 is an interior point of e.8) as a function of 8.1 BASIC HEURISTICS OF ESTIMATION Minimum Contrast Estimates.1. X """ PEP. usually parametrized as P = {PO: 8 E e}.
1.10)..2) or equivalently the system of estimating eq!1ations.i. z.8/3 ({3. I D({3o. z).) . where the function 9 js known. z) is differentiable in {3. (1) Suppose V (8 0 . {3 E R d .z.8) ~ E I1 .5) Strictly speaking P is not fully defined here and this is a point we shall explore later. g({3. further. . IL( z) = (g({3. Wd)T V(l1 o. (3) E{3.). Evidently.1.1.16). g({3. A naiural(l) function p(X. .' . W(X. ZI).Zid)T 'j • r . (3) = !Y .1L1 2 = L[Yi .)g(l3.z. Here is an example to be pursued later. If.p(X. W . z.4 R d.)J i=1 ~ 2 (2. there is a substantial overlap between the two classes of estimates.1.»)'. ua). i=l (2.g({3. The estimate (3 is called the least squares estimate.1. . ~ Then we say 8 solving (2.g({3. i=l 3 I (2.1. Example 2. ~ ~8g~ L. z.i + L[g({3o. {3) to consider is the squared Euclidean distance between the vector Y of observed Yi and the vector expectation of Y. Then {3 parametrizes the model and we can compute (see Problem 2.4) w(X. (1). (3) n n<T.100 More generally.6) . N(O. (3) exists if g({3. Zn»T That is. l'i) : 1 < i < n} where Yi. .4 are i. z) is continuous and ! I lim{lg(l3. suppose we postulate that the Ei of Example 1.('l/Jl. we take n p(X.1.d. . But.1.1. d g({3. Consider the parametric version of the regression model of Example 1.. . Least Squares.4 with I'(z) = g({3..)Y. for convenience.8/3 ({3. An estimate (3 that minimizes p(X.7) i=l J .3) = 0 has 8 0 as its unique solution for all 8 0 ~ E e.!I [" '1 L • .zd = LZij!3j j=1 andzi = (Zil.. then {3 satisfies the equation (2.1. I In the important linear case. "'.1. Yn are independent..I1) =0 is an estimating equation estimate.1. J which is indeed minimized at {3 = {3o and uniquely so if and only if the parametrization is identifiable. (2. . :i = ~8g~ ~ L. Here the data are X = {(Zi.z)l: 1{31 ~ oo} 'I = 00 I (Problem 2. suppose we are given a function W : X and define Methods of Estimation Chapter 2 X R d . z.
once defined we have a method of computing a statistic fj from the data X = {(Zi 1 Xi). ~ .1 Basic Heuristics of Estimation 101 the system becomes (2. These equations are commonly written in matrix fonn (2. u5). = 1'. . ..8) the normal equations. Method a/Moments (MOM). say q(8) = h(I'I..i. Thus. Here is another basic estimating equation example. . . ...Xi t=l . The method of moments prescribes that we estimate 9 by the solution of p..Xn are i. which can be judged on its merits whatever the true P governing X is.2 and Chapter = 6.d.. Thus.i. .2.Section 2. if we want to estimate a Rkvalued function q(8) of 9. This very important example is pursued further in Section 2.1. I'd( 8) are the first d moments of the population we are sampling from.. Suppose Xl. In fact.1. N(O. I'd). . lid) as the estimate of q( 8). suppose is 1 .. 0 Example 2. Least squares. 1 1 < j < d. thus.  n n... as X ~ P8. we obtain a MOM estimate of q( 8) by expressing q( 8) as a function of any of the first d moments 1'1. we assume the existence of Define the jth sample moment f1j by. provides a first example of both minimum contrast and estimating equation methods. we need to be able to express 9 as a continuous function 9 of the first d moments.1 <j < d if it exists.(8). To apply the method of moments to the problem of estimating 9. . . d > k.1 from R d to Rd. The motivation of this simplest estimating equation example is the law of large numbers: For X "' Po. We return to the remark that this estimating method is well defined even if the Ci are not i. /1j converges in probability to flj(fJ).d. . and then using h(p!. 1 < i < n}. . L. More generally.1. I'd of X.9) where Zv IIZijllnxd is the design matrix. 8 E R d and 8 is identifiable. Suppose that 1'1 (8). 1 1j ~_l".
An algorithm for estimating equations frequently used when computationofM(X.~ (I'l/a)'.2 The PlugIn and Extension Principles We can view the method of moments as an example of what we call the plugin (or substitution) and extension principles. f(u.i.d. There are many algorithms for optimization and root finding that can be employed. ~ iii ~ (X/a)'. X n be Ltd. Here is some job category data (Mosteller. x>O. neither minimum contrast estimates nor estimating equation solutions can be obtained in closed fonn. As an illustration consider a population of men whose occupations fall in one of five different job categories. ... Vi = i.X 2 • In this example. Frequency Plugin(2) and Extension.11).)]xO1exp{Ax}. . as X and Ni = number of indices j such that X j = Vi. . but their respective probabilities PI.102 Methods of Estimation Chapter 2 For instance.1. the method of moment estimator is not unique. . the proportion of sample values equal to Vi. . 4. '\). Pi is the proportion of men in the population in the ith job category and Njn is the sample proportion in this category. 2. In this case f) ~ ('" A).•. A = X/a' where a 2 = fl. 3. two other basic heuristics particularly applicable in the i. 2. for instance. 0 Algorithmic issues We note that. then setting (2. Example 2. This algorithm and others will be discussed more extensively in Section 2.3. We can.2 and 0=2 = n 1 EX1.. with density [A"/r(.5.·) fli =D'I'(X.>0.10. i = 1..(1 + ")/A 2 Solving . consider a study in which the survival time X is modeled to have a gamma distribution. l Vk of the population being sampled are known. 1968). It is defined by initializing with eo.4 and in Chapter 6..6.Pk are completely unknown. . I. .1 0) . and 1" ~ E(X') = . . I..1. express () as a function of IJ" and fl3 = E(X 3 ) and obtain a method of moment estimator based on /11 and fi3 (Problem 2. Here k = 5. . We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments. Suppose we observe multinomial trials in which the values VI. A = Jl1/a 2 . If we let Xl. . in general.. or 5. case.. then the natural estimate of Pi = P[X = Vi] suggested by the law of large numbers is Njn. A>O.(X")1 dxd J isquickandM is nonsingular with high probability is the NewtonRaphson algorithm. in particular Problem 6. 1'1 for () gives = E(X) ~ "/ A.·)  l~t. .1... .
. . the difference in the proportions of bluecollar and whitecollar workers. together with the estimates Pi = NiJn. . Pk) of the population proportions..12).Pk) with Pi ~ PIX = 'Vi].0)2. .13 n2::.Pk by the observable sample frequencies Nt/n. the estimate is which in our case is 0.31 5 95 0. v.. whereas categories 2 and 3 correspond to whitecollar jobs. (2. . HardyWeinberg Equilibrium. Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles. .44 . P3) given by (2.03 2 84 0. Example 2.0. can be identified with a parameter v : P ~ R. categories 4 and 5 correspond to bluecollar jobs. If we use the frequency substitution principle. Equivalently. v(P) = (P4 + Ps) and the frequency plugin principle simply says to replace P = (PI.." . If we assume the three different genotypes are identifiable. . then (NIl N 2 .11) to estimate q(Pll··.Section 2. P3 PI = 0 = (1.Ps) = (P4 + Ps) .. . N 3 ) has a multinomial distribution with parameters (n. and Then q(p) (p.}}. Suppose .. 1 Nk/n. . 1 ()d) and that we want to estimate a component of 8 or more generally a function q(8). in v(P) by 0 P ).. suppose that in the previous job category table. that is.1. The frequency plugin principle simply proposes to replace the unknown population frequencies PI. .pl . + P3).41 4 217 0. i < k.1.1. Pk do not vary freely but are continuous functions of some ddimensional parameter 8 = ((h. 1~!. we are led to suppose that there are three types of individuals whose frequencies are given by the socalled HardyWeinberg proportions 2 .0 < 0 < 1. For instance.Pk) = (~l .1. P2 ~ 20(1.Ni =708 2:::o. let P dennte p ~ (P". Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type. . . lPk).. use (2. . .. Now suppose that the proportions PI' .(P2 + P3)..53 = 0. Next consider the marc general problem of estimating a continuous function q(Pl' .0). 1 < think of this model as P = {all probability distributions Pan {VI. .P2.1 Basic Heuristics of Estimation 103 Job Category l I Ni Pi ~ 23 0...12 3 289 0. the multinomial empirical distribution of Xl. ..»i=1 for Danish men whose fathers were in category 3..Xn ...4.09.12) If N i is the number of individuals of type i in the sample of size n. . We would be interested in estimating q(P"". That is..
. .). Now q(lJ) can be identified if IJ is identifiable by a parameter v .Pk..h(p) and v(P) = v(P) for P E Po. . .14) As we saw in the HardyWeinberg case. suppose that we want to estimate a continuous Rlvalued function q of e. "·1 is.(IJ). If w.Pk(IJ»). P > R where v(P) . Let be a submodel of P. if X is real and F(x) = P(X < x) is the distribution function (dJ. 1  V In general. . We can think of the extension principle alternatively as follows.104 Methods of Estimation Chapter 2 we want to estimate fJ. thus. T(Xl.Pk are continuous functions of 8. Fu'(a) = sup{x. . F(x) < a}. Because f) = ~.1.. . in the Li. In particular. E A) n.'" . The plugin and extension principles can be abstractly stated as follows: Plugin principle. .1.. . . .. however. a natural estimate of P and v(P) is a plugin estimate of v(P) in this nonparametric context For instance. case if P is the space of all distributions of X and Xl.4) and 5 how to choose among such estimates. .14) are not unique. we can use the principle we have introduced and estimate by J N l In./P3 and. I) and ! F'(a) " = inf{x. . by the law of large numbers.4.d.15) ~ ".. F(x) > a}. We shall consider in Chapters 3 (Example 3.Xn ) =h N. where a E (0..1. . If PI. ( . ..13) and estimate (2.1. the frequency of one of the alleles. I: t=l (2. 0 write 0 = 1 . that we can also Nsfn is also a plausible estimate of O. as X p.. the representation (2. X n are Ll.1. (2. the empirical distribution P of X given by    f'V 1 n P[X E Al = I(X. Nk) .1.1. Po ~ R given by v(PO) = q(O).13) defines an extension of v from Po to P via v.16) .E have an estimate P of PEP such that PEP and v : P 4 T is a parameter. Note. Then (2. we can usually express q(8) as a continuous function of PI. q(lJ) with h defined and continuous on = h(p.d.13) Given h we can apply the extension principle to estimate q( 8) as. then v(P) is the plugin estimate of v.. consider va(P) = [Fl (a) + Fu 1(a )1. (2. (2. that is..
v(P8 ) = q(8) = h(p(8)) is a continuous map from to Rand D( P) = h(p) is a continuous map from P to R. The plugin and extension principles are used when Pe. case (Problem 2. ~ For a second example. The plugin and extension principles must be calibrated with the target parameter. (P) is called the population (2. but only VI(P) = X is a sensible estimate of v(P). P 'Ie Po.12) and to more general method of moment estimates (Problem 2. If v: P ~ T is an extension of v in the sense that v(P) = viP) on Po. i=1 k /1j ~ = ~ LXI 1=1 I n . context is the jth sample moment v(P) = xjdF(x) ~ 0. then the plugin estimate of the jth moment v(P) = f. In this case both VI(P) = Ep(X) and V2(P) = "median of P" satisfy v(P) = v(P).Section 2. then Va. (P) is the ath population quantile Xa. let Po be the class of distributions of X = B + E where B E R and the distribution of E ranges over the class of symmetric distributions with mean zero.17) where F is the empirical dJ. Suppose Po is a submodel of P and P is an element of P but not necessarily Po and suppose v: Po ~ T is a p~eter.) h(p) ~ L vip.i. For instance.1. For instance.d. With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plugin estimates for multinomial trials because I'j(8) where =L i=l k vfPi(8) = h(p(8» = viP.2.1. Remark 2. 1/. Pe as given by the HardyWeinberg p(O).1.1.1 Basic Heuristics of Estimation 105 V 1.d. ~ I ~ ~ Extension principle. is called the sample median.3 and 2. in the multinomial examples 2.4.1.1 I:~ 1 Xl.' Nl = h N () ~ ~ = v(P) and P is the empirical distribution.1. Here x!. and v are continuous. = v(P). if X is real and P is the class of distributions with EIXlj < 00. e e Remark 2. then v(P) is an extension (and plugin) estimate of viP). is a continuous map from = [0.i.14.1. This reasoning extends to the general i. As stated. Here x ~ = median. However.. .1.13).1. Let viP) be the mean of X and let P be the class of distributions of X = 0 + < where B E R and the distribution of E ranges over the class of distributions with mean zero. ~ ~ .. A natural estimate is the ath sample quantile . P E Po. the sample median V2(P) does not converge in probability to Ep(X). casebut see Problem 2. because when P is not symmetric. these principles are general.tj = E(Xj) in this nonparametric . = Lvi n i= 1 k . they are mainly applied in the i. II to P.
as we shall see in Chapter 3. therefore. .3 where X" .2 with assumptions (1)(4) holding.LI (8) = (J the method of moments leads to the natural estimate of 8. 0 Example 2. " . Unfortunately. C' X n are ij. It does turn out that there are "best" frequency plugin estimates. . To estimate the population variance B( 1 . Algorithms for their computation will be introduced in Section 2. Suppose X I. O}.1.106 Methods of Estimation Chapter 2 Here are three further simple examples illustrating reasonable and unreasonable MOM estimates. . (b) If the sample size is large. . a frequency plugin estimate of 0 is Iogpo. we find I' = E. •i . .4.. Remark 2.1. 0 = 21' . If the model fits. .5. Example 2. In Example 1.3. ". See Section 2.. For example. they are often difficult to compute. those obtained by the method of maximum likelihood.1 is a method of moments estimate of B.. and there are best extensions. because Po = P(X = 0) = exp{ O}. U {I. X. a special type of minimum contrast and estimating equation method. 'I I[ i['l i . Estimating the Size of a Population (continued). then B is both the population mean and the population variance.) = ~ (0 + I).l [iX.4. there are often several method of moments estimates for the same q(9). Plugin is not the optimal way to go for the Bayes. 2.1. X n are the indicators of a set of Bernoulli trials with probability of success fJ. where Po is n. if we are sampling from a Poisson population with parameter B. Thus. • . However. for large amounts of data. these estimates are likely to be close to the value estimated (consistency). Example 2. X(l . The method of moments estimates of f1 and a 2 are X and 2 0 a. . [ [ . We will make a selection among such procedures in Chapter 3. The method of moments can lead to either the sample mean or the sample variance.4.5.. This is clearly a foolish estimate if X (n) = max Xi> 2X . the plugin principle is justified.1 because in this model B is always at least as large as X (n)' 0 As we have seen. Moreover.I and 2X .2. X)..Ll). as we shall see in Section 2. (X. X n is a N(f1. = 0]. minimax..7. Because we are dealing with (unrestricted) Bernoulli trials. or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. Suppose that Xl.X). This minimal property is discussed in Section 5. estimation of B real with quadratic loss and Bayes priors lead to procedures that are data weighted averages of (J values rather than minimizers of functions p( (J.d. )" I I • . a saving grace becomes apparent in Chapters 5 and 6. we may arrive at different types of estimates than those discussed in this section. I .6. Discussion. the frequency of successes. Because f. a 2 ) sample as in Example 1. .8) we are led by the first moment to the estimate. these are the frequency plugin (substitution) estimates (see Problem 2. valuable as preliminary estimates in algorithms that search for more efficient estimates. When we consider optimality principles.1.1. . For instance. optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions. What are the good points of the method of moments and frequency plugin? (a) They generally lead to procedures that are easy to compute and are.
P E Po.Section 2. . In the multinomial case the frequency plugin estimators are empirical PIEs based on v(P) = (Ph .lj = E( XJ\ 1 < j < d.. Zi).. For this contrast. I < i < n. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters. li) : I < i < n} with li independent and E(Yi ) = g((3. The general principles are shown to be related to each other.. 0).(3) ~ L[li . Method of moment estimates are empirical PIEs based on v(P) = (f.. When P is the empirical probability distribution P E defined by Pe(A) ~ n. D(P) is called the extensionplug·in estimate of v(P).. and the contrast estimating equations are V Op(X. An extension ii of v from Po to P is a parameter satisfying v(P) = v(P).. P. when g({3. It is of great importance in many areas of statistics such as the analysis of variance and regression theory.lJ) ~ E'o"p(X. A minimum contrast estimator is a minimizer of p(X.ll.g((3. ~ ~ ~ ~ 2.Zi)j2.2. f. Let Po and P be two statistical models for X with Po c P.1 L:~ 1 I[X i E AI. . the associated estimating equations are called the normal equations and are given by Z1 Y = ZtZD{3. 0 E e c Rd is uniquely minimized at the true value 8 = 8 0 of the parameter. where 9 is a known function and {3 E Rd is a vector of unknown regression coefficients. . ~ ~ ~ ~ ~ v we find a parameter v such that v(p.1 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS Least Squares and Weighted Least Squares Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement. If P = PO. where Pj is the probability of the jth category..) = q(O) and call v(P) a plugin estimator of q(6). a least squares estimate of {3 is a minimizer of p(X.ld)T where f. 1 < j < k..2 2.O) = O.. For the model {Pe : () E e} a contrast p is a function from X x H to R such that the discrepancy D(lJo.Pk). 0 E e.2 Minimum Contrast Estimates and Estimating Equations 107 Summary. If P is an estimate of P with PEP. where ZD = IIziJ·llnxd is called the design matrix. then is called the empirical PIE. 6). z) = ZT{3. . For data {(Zi. In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6. The plugin estimate (PIE) for a vector parameter v = v(P) is obtained by setting fJ = v( P) where P is an estimate of P. Suppose X . is parametric and a vector q( 8) is to be estimated.
.z).g(l3. Z could be educationallevel and Y income. 1_' . This is frequently the case for studies in the social and biological sciences.2) Are least squares estimates still reasonable? For P in the semiparametric model P.) i. The contrast p(X. i=l (2. . If we consider this model. (2. aJ) and (3 ranges over R d or an open subset. z. E«. . Yi).2.4) (2.(z.1.1. . I Zj = Zj).2.).d.::.z.f3d and is often applied in situations in which specification of the model beyond (2. and 13 ~ (g(l3.»)2.) + <" I < i < n n (2..:0::d=':Of. Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ.6) is difficult Sometimes z can be viewed as the realization of a population variable Z. is modeled as a sample from a joint distribution.E(Y. which satisfies (but is not fully specified by) (2. the model is semiparametric with {3. = 0. " .. (Z" Y.i.108_~ ~_ _~ ~ ~cMc.2. z.::'hc.g(l3.2.13) n n i=1  L Varp«.:. 1 < j < n. . For instance.zt}.)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3... . 1 < i < n..2. < i < n.g(l3.Y) such thatE(Y I Z = z) = g(j3. 13 = I3(P) is the miniml2er of E(Y . .5) (2.4}{2.d.4 with <j simply defined as Y...) = g(l3. The estimates continue to be reasonable under the GaussMarkov assumptions.::2 : In Example 2. • . that is. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj. 1 < i < j < n 1 > 0\ .1. z.2.=..2. .i.En) is any distribution satisfying the GaussMarkov assumptions.. we can compute still Dp(l3o.) = E(Y. .:1. " (2.z. 13 E R d }.. 13) = LIY. The least squares method of estimation applies only to the parameters {31' .) . j.z.1 we considered the nonlinear (and linear) Gaussian model Po given by Y. That is.) + L[g(l3 o. I) are i. I<i<n.g(l3.) =0. Z)f.2. as in (a) of Example 1.E:::'::'. j. Note that the joint distribution H of (C}.3) continues to be valid.g(l3. where Ci = g(l3. as (Z.o=nC. E«. This follows "I .). (3)  E p p(X.2).3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3. a 2 and H unknown. fj) because (2. .6) Var( €i) = u 2 COV(fi.) = 0. . then 13 has an interpretation as a parameter on p. (Zi.2. Y) ~ PEP = {Alljnintdistributions of(Z. I' .2. NCO. . .zn)f is 11. Zn»)T is 1.m::a. .C=ha:::p::.. that is.
1 the most commonly used 9 in these models is g({3.2. is called the linear (multiple) regression modeL For the data {(Zi. we can approximate p(z) by p(z) ~ p(zo) +L j=1 d a: a (zo)(z . Sections 6.. j=O This type of approximation is the basis for nonlinear regression analysis based on local polynomials.1. Yi). The linear model is often the default model for a number of reasons: (I) If the range of the z's is relatively small and p(z) is smooth. as we have seen in Section lA. i = 1.. In that case. n} we write this model in matrix fonn as Y = ZD(3 + €    where Zv = IIZij 11 is the design matrix. z) = zT (3. We can then treat p(zo)  nnknown (30 and identify (zo) with (3j to give an approximate (d 1 and Zj as before and linear model with Zo = t!. see Rnppert and Wand (1994). Yi) are a sample from a (d + 1)dimensional distribution and the covariates that are the coordinates of Z are continuous.5. E(Y I Z = z) can be written as d p(z) where = (30 + L(3jZj j=1 (2..4.1).2.2. .7) (2. (2. and Seber and Wild (1989). where P is the empirical distribution assigning mass n. we are in a situation in which it is plausible to assume that (Zi. (2) If as we discussed earlier. J for Zo an interior point of the domain. which.Section 2.. 2:). y)T has a nondegenerate multivariate Gaussian distribution Nd+1(Il.5) and (2.. Zd. {3d).1.2. and Volnme n. .4).4. E1=1 g~ (zo)zo as an + 1) dimensional d p(z) = L(3jZj.1.I to each of the n pairs (Zil Yi).2. As we noted in Example 2. In this case we recognize the LSE {3 as simply being the usual plugin estimate (3(P).2.7). .L{3jPj. Fan and Gijbels (1996).6).8) d f30 = («(3" .2.2 Minimum Contrast Estimates and Estimating Equations 109 from Theorem 1. For nonlinear cases we can use numerical methods to solve the estimating equations (2. I)/. j=1 (2.2. .41. in conjnnction with (2. (2. = py . We continue our discussion for this important special case for which explicit fonnulae and theory have been derived. See also Problem 2.9) . a further modeling step is often taken and it is assumed that {Zll .zo)..3 and 6.
1967. {3. Example 2.(3zz.) = il" a~. il.) = 1 and the normal equation is L:~ 1(Yi . given Zi = 1 < i < n.) = O.8). if the parametrization {3 ) ZD{3 is identifiable. Y is random with a distribution P(y I z).il.1 '. Zi 1'. g(z. ~ = (lin) L:~ 1Yi = ii. . we assume that for a given z. Following are the results of an experiment to which a regression model can be applied (Sned. Example 2. il. i=l (2. Nine samples of soil were treated with different amounts Z of phosphorus.25. is independent of Z and has a N(O.2. 1 64 4 71 5 54 9 81 11 76 13 23 77 93 23 95 28 109 The points (Zi' Yi) and an estimate of the line 131 + 132z are plotted in Figure 2. d = 1.12) . g(z. In the measurement model in which Yi is the detennination of a constant Ih.il.2.2. Zi. The parametrization is identifiable if and only ifZ D is of rank d or equivalently if Z'bZD is affulI rank d. p. the relationship between z and y can be approximated well by a linear equation y = 131 + (32Z provided z is restricted to a reasonably small interval. 139). I I. 1 < i < n. see Problem 2. we will find that the values of y will not be the same. The nonnal equations are n i=l n ~(Yi .2. the least squares estimate. . L 1 that. We want to estimate f31 and {32.) ~ 0. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil. il. Y is the amount of phosphorus found in com plants grown for 38 days in the different samples of soil.2. In that case.2. < Methods of Estimation Chapter 2 =y  I'(Z) Eyz:Ez~Ezy. i I .1. LZi(Yi .il2Zi) = 0.ecor and Cochran. For this reason. For certain chemicals and plants. exists and is unique and satisfies the Donnal equations (2.no Furthennore.2. If we run several experiments with the same z using plants and soils that are as nearly identical as possible. . necessarily.11) When the Zi'S are not all equal.ill .1. 0 Ii. We have already argued in Examp}e 2.I 'i.1. Estimation of {3 in the linear regression model. we have a Gaussian linear regression model for Yi. whose solution is .2. ( (J2 2 ) distribution where = ayy Therefore.10) Here are some examples. we get the solutions (2. the solution of the normal equations can be given "explicitly" by (2.
2. . and fi = (lin) I:~ 1 Yi· The line y = /31 + f32z is known as the sample regression line or line of best fit of Yl.Section 2.. . i = 1.. 0 ~ ~ ~ ~ ~ Remark 2. Yn). 91.9.58 and 132 = 1..132 z ~ ~ (2. suppose we select p realvalued functions of z.2 Minimum Contrast Estimates and Estimating Equations 111 y 50 • o 10 20 x 30 Figure 22. then the regression line minimizes the sum of the squared distances to the n points (ZI' Yl)...Yn on ZI.. €6 is the residual for (Z6. .13) where Z ~ (lin) I:~ 1 Zi. .2. if we measure the distance between a point (Zi. 9p( z). . and 131 = fi . Here {31 = 61.. . . .3. n. Y6). i = 1.1.. &atter plot {(Zi..(a + bzi)l. .(/31 + (32Zi)] are called the residuals of the fit.4..Yi) and a line Y ~ a + bz vertically by di = Iy. . This connection to prediction explains the use of vertical distance! in regression..(Z).42. .1. The linear regression model is considerably more general than appears at first sight.. P > d and postulate that 1'( z) is a linear combination of 91 (z). . n} and sample regression line for the phosphorus data. j=1 Then we are still dealing with a linear regression model because we can define WpXI . that is p I'(z) ~ 2:). Yi). The regression line for the phosphorus data is given in Figure 2. Geometrically. For instance. . Zn.2.9p. The vertical distances €i = fYi . The line Y = /31 + (32Z is an estimate of the best linear MSPE predictor Ul + b1Z of Theorem 1.1... . . (zn.
5) fails. That is.17) ..z. Note that the variables .= y. ..zd + ti.wi ..wi) g((3..2. In Example 2.)]2 i=l i=l ~ n (2.. which for given Yi = yd..2 and many similar situations it may not be reasonable to assume that the variances of the errors Ci are the same for all levels Zi of the covariate variable..wi.2. . we can write (2. if we setg«(3.2. Weighted Linear Regression. We return to this in Volume II. Weighted least squares.= g({3.. 1<i <n Wij = 9j(Zi). .2. iI L Vi[Yi i=l n ({3. We need to find the values {3l and /32 of (3.Zi)]2 = L ~.. . .2.5).15) . 1 n. + /3zzi)J2 (2. Thus. but the Wi are known weights. .wi.2.. 0 < j < 2. if d = 1 and we take gj (z) = zj.2.2. The weighted least squares estimate of {3 is now the value f3. .. gp( z)) T as our covariate and consider the linear model Yi = where L j=l p ()jWij + Ci.filii minimizes ~  I i i L[1Ii .3.2. .'l are sufficient for the Yi and that Var(Ed.. Example 2..g(.. Zil = 1.. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic). However. Zi2 = Zi. < n. [Yi . and g({3.8.g«(3. and (32 that minimize ~ ~ "I.wi 1 < . Yi = 80 + 8 1Zi + 82 + Cisee Problem 2. _ Yi .wi  _ g((3.". 1 <i < n Yi n 1 . then = WW2/Wi = . we may be able to characterize the dependence of Var( ci) on Zi at least up to a multiplicative constant. fi = <d. Zi) = (2. Consider the case in which d = 2. we arrive at quadratic regression. Zi)/. Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter.. • I' "0 r . The method of least squares may not be appropriate because (2.II 112 Methods of Estimation Chapter 2 (91 (z).. I . For instance.16) as a function of {3... I and the Y i satisfy the assumption (2.. i = 1.2.24 for more on polynomial regression. Zi) = {3l + fhZi.14) where (J2 is unknown as before.. z. Zi) + Ei .
.(L:~1 UlYi)(L:~l uizd Li~] Ui Zi .2.2 Minimum Contrast Estimates and Estimating Equations 113 where Vi = l/Wi_ This problem may be solved by setting up analogues to the normal equations (2. (Zn. .1 7) is equivalent to finding the best linear MSPE predictor of y . a__ Cov(Z"'.2.6). . the . we may allow for correlation between the errors {Ei}. 0 Next consider finding the.. (3 and for general d. .2.)] ~Ui. By following the steps of Exarnple 2. That is.~ UiYi .~ UiZi· n n i"~l I" n n F"l This computation suggests..27) that f3 satisfy the weighted least squares normal equations ~ ~ where W = diag(wI.Y") ~(Zi. .Y.18) ~ ~ ~I" (31 = E(Y') ... Y*) V. Let (Z*.2.. .2.2..3. suppose Var(€) = a 2W for some invertible matrix W nxn..4 as follows. z) = zT.28) that the model Y = ZD.(32E(Z') ~ .1) and (2. ~ . Y*) denote a pair of discrete random variables with possible values (z" Yl).2.20). .2. When ZD has rank d and Wi > 0..B. we find (Problem 2. we can write Remark 2. . Var(Z") and  _ L:~l UiZiYi .B that minimizes (2. We can also use the results on prediction in Section 1. when g(. More generally. using Theorem 1. that weighted least squares estimates are also plugin estimates. 1 < i < n. .2.4..Section 2. Then it can be shown (Problem 2.26.B.2.13 minimizing the least squares contrast in this transformed model is given by (2.7).B+€ can be transformed to one satisfying (2. i ~ I.19) and (2. .4H2.8). as we make precise in Problem 2. then its MSPE is ElY" ~ 1l1(Z")f ~ :L UdYi i=l n «(3] + (32 Zi)1 2 It follows that the problem of minimizing (2. i i=l = 1.n where Ui n = vi/Lvi. Yn) and probability distribution given by PI(Z". wn ) and ZD = IIZijllnxd is the design matrix.1.2.16) for g((3. Zi) = z. If Itl(Z*) = {31 given by + {32Z* denotes a linear predictor of y* based on Z . Moreover.2.1 leading to (2.1.(3. Thus..n.(Li~] Ui Z i)2 n 2 n (2.2. .2.
.
.
.
.
2. respectively. there may be solutions of (2. and n3 denote the number of {Xl. Nevertheless.1. the dual point of view of (2. x. Because . 0)p(2. Example 2. which is maximized by 0 = 0. 2 and 3. . OJ = (1  0)' where 0 < () < 1 (see Example 2.28) If 2n.. Then the same calculation shows that if 2nl + nz and n2 + 2n3 are both positive. then X has a Poisson distribution with parameter nA. Similarly. .7. then I • Lx(O) ~ p(l. situations with f) well defined but (2.27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section. } with probabilities.2. In general.27) doesn't make sense. Here are two simple examples with (j real. 8' 80.. . which has the unique solution B = ~.lx(0) = 5 1 0' . n2. and as we have seen in Example 2.2.27) that are not maxima or only local maxima. 0) ~ 20(1 .>. 0 e Example 2.29) '(j .2.5. 2.ny l.. A is an unknown positive constant and we wish to estimate A using X.) = ·j1 e'"p. is zero. I). X2 = 2.O) = 0'.l.0). .118 Methods of Estimation Chapter 2 which again enables us to analyze the behavior of B using known properties of sums of independent random variables. the maximum likelihood estimate exists and is given by 8(x) = 2n.4). Evidently. the MLE does not exist if 112 + 2n3 = O. X3 = 1. .2. p(X. 0) = 20'(1...2. + n. If we make the usual simplifying assumption that the arrivals form a Poisson process. x n } equal to 1. 1). . 2n (2. Here X takes on values {O. O)p(l. represents the expected number of arrivals in an hour or.. (2.(1. If we observe a sample of three individuals and obtain Xl = 1.6.2. 1.22) and (2. where ). so the MLE does not exist because = (0.2.x=O. Consider a popUlation with three kinds of individuals labeled 1. equivalently.2. p(2.0) The likelihood equation is 8 80 Ix (0) ~ = 5 1 0 10 =0.. the rate of arrival.0)' < ° for all B E (0. the likelihood is {1 . and 3 and occurring in the HardyWeinberg proportions p(I.OJ'".2. let nJ. Let X denote the number of customers arriving at a service counter during n hours. p(3. ~ maximizes Lx(B). ]n practice. + n.
. and let N j = L:~ 1 l[Xi = j] be the number of observations in the jth category.2.30) for all This is the condition we applied in Example 2. (see (2.8B " 1=13 k k =0. ~ ~ .2 Minimum Contrast Estimates and Estimating Equations 119 The likelihood equation is which has the unique solution)..7.. . J = I. . . familiar from calculus. and must satisfy the likelihood equa~ons ~ 8B1x(fJ) = 8B 3 8 8 " n. for an experiment in which we observe nj ~ 2:7 1 I[X. e.30)). and k j=1 k Ix(fJ) = LnjlogB j .2. consider an experiment with n Li. j=d (2.2. this estimate is the MLE of ).. Multinomial Trials.6.\ I 0..kl. the MLE must have all OJ > 0.32) We first consider the case with all the nj positive. . k. A similar condition applies for vector parameters. ~ n . = x/no If x is positive.ekl with kl Ok ~ 1.2.6. this is well known to be equivalent to ~ e.logB.n.2. 8' (2. We assume that n > k ~ 1. j = 1... 8) = 0 if any of the 8j are zero. fJ) = TI7~1 B7'. thus.2. . 31=1 By (2. is that l be concave in If l is twice differentiable. A sufficient condition.lx(B) <0. As in Example 1.. 80. Then.. Let Xi = j if the ith trial produces a result in the jth category. fJE6= {fJ:Bj >O'LOj ~ I}.. 0 To apply the likelihood equation successfully we need to know when a solution is an MLE. However.1. trials in which each trial can produce a result in one of k categories.d.Section 2. and the equation becomes (BkIB j ) n· OJ = . 8Bk/8Bj (2..32).8. let B = P(X. L. p(x. Example 2. 8B. = L.2. = jl. . If x = 0 the MLE does not exist.2. the maximum is approached as . Then p(x.o.32) to find I.31 ) To obtain the MLE (J we consider l as a function of 8 1 . = j) be the probability J of the jth category.LBj j=1 • (2.
.IV..2. are known functions and {3 is a parameter to be estimated from the independent observations YI . Next suppose that nj = 0 for some j. .2. Using the concavity argument. (2.120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8). .11(a))... Zi). we check the concavity of lx(O): let 1 1 <j < k .2. g«(3.. More generally. As we have IIV.. See Problem 2. n. .' .1 holds and X = (Y" .2. z..1.2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood. Thus. zi)1 . This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • . where E(V. .. is still the unique MLE of fJ. i = 1..1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV. YnJT... gi. . . version of this example will be considered in the exponential family case in Section 2. we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n.Zi)]2.202 L.). ~ g«(3.1. j = 1. ~ g((3.33) It follows that in this nj > 0.  seen and shall see more in Section 6. Zi)J[l:} . we can consider minimizing L:i. 1 < i < n.2 both unknown.) = gi«(3).d. . ((g((3.X)2 (Problem 2. 0. and 0. zn) 03W.2. Summary.2. N(lll 0. Example 2. In Section 2. Zi)) 00 ao n log 2 I"" 2 "2 (21Too) .(Xi .3. . Then Ix «(3) log IT ~'P (V.gi«(3) 1 . as maximum likelihood estimates for f3 when Y is distributed as lV. . where Var(Yi) does not depend on i. Suppose the model Po of Example 2.34) g«(3.1 L:~ . X n are Li. i=l g«(3. k.2 ) with J1. then <r <k I.. wi(5). 1 < j < k.9.2. 'ffi f. Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O). Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix.6.30. The 0 < OJ < 1..g«(3. o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2.1). these estimates viewed as an algorithm applied to the set of data X make sense much more generally. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3.lV. Then (} with OJ = njjn. see Problem 2. least squares estimates are maximum likelihood for the particular model Po. OJ > ~ 0 case..28. Suppose that Xl. 1 Yn ..
6. in the N(O" O ) case.1.1 ).00. e e I"V e De= {(a. These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence.bE {a.a= ±oo.2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. b).a < b< oo} U {(a. (a.1 and 1. In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d. (h). (a.4 and Corollaries 1. oo]P. for a sequence {8 m } of points from open. Properties that derive solely fTOm concavity are given in Propositon 2. Suppose 8 c RP is an open set. In Section 2.3. as k .00. Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}.3 and 1. b). 8).3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil.2. Formally. 2 e Lemma 2.I ) all tend to De as m ~ 00. For instance. For instance. if X N(B I . .Section 2. (m. Let &e = be the boundary of where denotes the closure of in [00.. = RxR+ and ee e. or 8 1nk diverges with 181nk I . a8 is the set of points outside of 8 that can be obtained as limits of points in e.zid.3. That is.6.b) .ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e. z=1=1 ~ 2. Extensions to weighted least squares. . are given.3.2 and other exponential family properties also playa role.3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly. it is shown that the MLEs coincide with the LSEs. where II denotes the Euclidean norm.oo}}. Concavity also plays a crucial role in the analysis of algorithms in the next section. (}"2) distribution.. including all points with ±oo as a coordinate. m.1) lim{l(li) : Ii ~ ~ De} = 00. which are appropriate when Var(Yj) depends on i or the Y's are correlated.1 only. . (m. In the case of independent response variables Yi that are modeled to have a N(9i({3). b) : aER. This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI. Suppose we are given a function 1: continuous. We start with a useful general framework and lemma. In general.. m. Suppose also that e t R where e c RP is open and 1 is (2.6. Proof. though the results of Theorems 1. we define 8 1n .5. See Problem 2.3. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2. (m.6.3.1. m).
' L n .8 and 2.3.3. then Ix lx(O. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data. a contradiction...3.3. is unique.3) has i 1 !. Existence and Uniqueness o/the MLE ij. By Lemma 2.)) > ~ (fx(Od +lx (0. if to doesn't satisfy (2.1 hold.9) we know that II lx(lI) is continuous on e. We can now prove the following. (ii) The family is of rank k.3. are distinct maximizers. which implies existence of 17 by Lemma 2. Proofof Theorem 2.1.1.). '10) for some reference '10 E [ (see Problem 1.122 Methods of Estimation Chapter 2 Proposition 2.. no solution.to.2). ! " . lI(x) =  exists. is open. Suppose X ~ (PII .3.1.3. Furthennore. h) and that (i) The natural parameter space. . i . ~ Proof. U m = Jl~::rr' Am = . I I. (a) If to E R k satisfies!!) Ifc ! 'I" 0 (2.3.2) then the MLE Tj exists. Corollary 2. II E j. Suppose the cOlulitions o/Theorem 2.3. " We. We show that if {11 m } has no subsequence converging to a point in E. open C RP. Let x be the observed data vector and set to = T(x). E.'1 . If 0.12. /fCT is the convex suppon ofthe distribution ofT (X). we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) . = . then lx(11 m ) .00.27).1.'1) with T(x) = 0. Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1.  (hO! + 0. with corresponding logp(x. then the MLE doesn't exist and (2.3. Then. then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT. Applications of this theorem are given in Problems 2. Suppose P is the canonical exponential/amity generated by (T. and 0.1.3) I I' (b) Conversely. [ffunher {x (II) e = e e t 88.6. II).3. . From (B. We give the proof for the continuous case. Without loss of generality we can suppose h(x) = pix. " . . and is a solution to the equation (2. Write 11 m = Am U m .(11) ~ 00 as densities p(x.)) 0 Themem 2.3. II) is strictly concave and 1. thus.1. then the MLE8(x) exists and is unique. if lx('1) logp(x.3.
Then AU ¢ £ by assumption. If ij exists then E'I T ~ 0 E'I(cTT) ~ 0. um~.3. .1 follow.2.3. if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows. Nonexistence: if (2. 0 ° '* '* PrOD/a/Corollary 2. iff.5.3) by TheoTem 1. t u.3.1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2. T(X) has a density and. I Xl) and 1.4. The Gaussian Model. N(p" ( 2 ). 0 (L:7 L:7 CJ.Xn are i..3.1. u mk t u. So In either case limm.* P'IlcTT = 0] = I. CT = R )( R+ FOT n > 2. Case 1: Amk t 111]=11. that is. both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets. 0 Example 2.2) fails.3.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it. So we have Case 2: Amk t A. contradicting the assumption that the family is of rank: k. Then.2) and COTollary 2. thus. Theorem 2. for every d i= 0.3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1. Suppose X"". This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0. Suppose the conditions ofTheorem 2. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0.k Lx( 11m. .Section 2. Po[uTT(X) > 61 > O. I' E R. It is unique and satisfies x (2.3. By (B.d. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist.1.9.3. existence of MLEs when T has a continuous ca. Evidently. In fact.6.i.3). Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists. which is impossible. Write Eo for E110 and Po for P11o . lIu m ll 00. fOT all 1].3. So.6. Por n = 1. CT = C~ and the MLE always exists.se density is a general phenomenon. As we observed in Example 1. (1"2 > O.3.J = 00. Then because for some 6 > 0. this is the exponential family generated by T(X) = 1 Xi. The equivalence of (2.
I <j < k.. with density 9p.3... It is easy to see that ifn > 2. but only Os is a MLE.Xn are i.4. In Example 3.2) holds. Example 2.i..4) (2. Thus. . I < j < k. P > 0. This is a rank 2 canonical exponential family generated by T   = (L log Xi.I which generates the family is T(k_l) = (Tl>"" T.3).6.4) and (2.4 we will see that 83 is.d.3. Example 2.2(a). the best estimate of 8.. by Problem 2. where 0 < Aj PIX = j] < I.3.T(X) = A(.3. Because the resulting value of t is possible if 0 < tjO < n.9).1. and one of the two sums is nonempty because c oF 0.1. we see that (2.>. . exist iff all Tj > O.3.13 and the next example). I < j < k iff 0 < Tj < n._t)T.5) have a unique solution with probability 1.1. = = L L i i bJ _ I .3.3. The likelihood equations are equivalent to (problem 2.1. using (2.2.3.. 0 = If T is discrete MLEs need not exist. We conclude from Theorem 2. > O.3.4. o Remark 2.I and verify using Theorem 2.3 we know that E. Here is an example. in a certain sense.(X) = r~.2(b» r' log .1.. For instance. The statistic of rank k . How to find such nonexplicit solutions is discussed in Section 2.I. The boundary of a convex set necessarily has volume 0 (Problem 2.2. We assume n > k .ln.1). The TwoParameter Gamma Family. where Tj(X) L~ I I(Xi = j). the maximum likelihood principle in many cases selects the «best" estimate among them.5) ~=X A where log X ~ L:~ t1ogXi. They are determined by Aj = Tj In.).3. From Theorem 1. if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO . Thasadensity. x > 0. We follow the notation of Example 1. lit ~ .3. thus.124 Methods of Estimation Chapter 2 Proof. in the HardyWeinberg examples 2.6. then and the result follows from Corollary 2. I <t< k ... I < j < k.6..)e>"XxPI.n. LXi).3.3.3. A nontrivial application of Theorem 2. with i'.. if T has a continuous case density PT(t). liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2.4 and 2. h(x) = XI.= r(jJ) A log X (2. >.2 follows. the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2.1 in the second.1 that in this caseMLEs of"'j = 10g(AjIAk). Suppose Xl. To see this note that Tj > 0. I < j < k. When method of moments and frequency substitution estimates are not unique.2 that (2.3.3. Multinomial Trials.7. Thus.3.
1] it does exist ~nd is unique.3.3.3. e open C R=..3.13).6) on c( II) = ~~.7) Note that c(lI) E c(8) and is in general not ij.3. x E X. Corollary 2. (B).1.3) does not have a closedform solution as in Example 1.3. our parameter set is open.exist and are unique. Then Theorem 2. Unfortunately strict concavity of Ix is not inherited by curved exponential families.1.B(B) j=l .. L.3. Here [ is the natural paramerer space of the exponential fantily P generated by (T. the MLEs ofA"j = 1.3. Consider the exponential family k p(x.8see Problem 2. n > kl. The remaining case T k = a gives a contradiction if c = (1.2.1 directly D (Problem 2...10). the MLE does not exist if 8 ~ (0. 1)T.k.lI) ~ exp{cT (II)T(x) .3.3. whereas if B = [0. (II) . .3 can be applied to determine existence in cases for which (2.2.3.A(c(ii)) ~ = O. . In Example 2. Remark 2. 0 < j < k .Section 2.1. the bivariate normal case (Problem 2.I. BEe.2) so that the MLE ij in P exists. the following corollary to Theorem 2. lfP above satisfies the condition of Theorem 2. if any TJ = 0 or n.2) by taking Cj = 1(i = j). However.3. and unicity can be losttake c not onetoone for instance. be a cUlVed exponential family p(x.~l Aj = I}. when we put the multinomial in canonical exponential family form. Alternatively we can appeal to Corollary 2. Ck (B)) T and let x be the observed data..3. In some applications. if 201 + 0. then it is the unique MLE ofB.1 we can obtain a contradiction to (2.1. If the equations ~ have a solution B(x) E Co.1 is useful.8 we saw that in the multinomial case with the clQsed parameter set {Aj : Aj > 0. . .A(c(II))}h(x) Suppose c : 8 ~ [ C R k has a differential (2. h). 1 < i < k .. note that in the HardyWeinberg Example 2. m < k . c(8) is closed in [ and T(x) = to satisfies (2.6.1). 0 The argument of Example 2.B) ~ h(x)exp LCj(B)Tj(x) .3.3 Maximum likelihood in Multiparameter Exponential Families 125 On the other hand. Let CO denote the interior of the range of (c. mxk e.3. When P is not an exponential family both existence and unicity of MLEs become more problematic.3. .6. for example. . Let Q = {PII : II E e). = 0. then so does the MLE II in Q and it satisfies the likelihood equation ~ cT(ii) (to . (2. The following result can be useful.1 and Haberman (1974). Similarly. .2.
9) is a curved exponential family of the form (2. l}m. Next suppose. . < Zn where Zt. generated by h(Y) = 1 and "'> I T(Y) '. . which implies i"i+ > 0.~Yn.it.3. . . . with I.6. Gaussian with Fixed Signal to Noise. E R.9.. 11.6. '() A 1/ Thus. Jl > 0.n(it' + )"5it'))T which with Ji2 = n.3.ryl > O.10. Evidently c(8) = {(lh.3.i. i = 1.1/'1') T .\()'.126 The proof is sketched in Problem 2. j = 1.5. . which is closed in E = {( ryt> ry. . = ( ~Yll'''' . l = 1.2. = 1 " = 2 n( ryd'1' . .6.10.. LocationScale Regression. .. N'(fl.d. suppose Xl.5 and 1. Because It > 0. Equation (2. Using Examples 1. ry. .3.g( it' . tLi = (}1 + 82 z i .. '1./2'1' . L: Xi and I. Zl < ..6) with " • .. . 1 n. we see that the distribution of {01 : j = 1.7) if n > 2. m m m f?..5X' +41i2]' < 0. where 0l N(tLj 1 0'1). i Example 2. Now p(y. Methods of Estimation Chapter 2 family with (It) = C2 (It) = .7) becomes = 0. We find CI Example 2. = L: xl.ry.Zn are given constants.?' m)T " ".2 and 2.11. Suppose that }jI •.0'2) with Jl/ a = AO > a known..A6iL2 = 0 Ii. m} is a 2nparameter canonical exponential family with 'TJi = /tila. C(O) and from Example 1. . ... < O}.3)T.!0b''. As in Example 1..6..oJ).gx±)..'' . corresponding to 1]1 = //x' 1]2 = ...) : ry.I LX.3. I . simplifies to Jl2 + A6XIl.2~~2 . Ii± ~ ~[).4.it. the o q.. that ...3.nit.. .Xn are i. < O}. As a consequence of Theorems 2. t. 'TJn+i = 1/2a. we can conclude that an MLE Ii always exists and satisfies (2. af = f)3(8 1 + 82 z i )2.".~Y'1.3..'..3.3 )(t. are n independent random samples.ry.6. . n.) : ry.3.n.5 = ..< O.. = ~ryf. This is a curved exponential ¥.2 . as in Example 1.: Note that 11+11_ = '6M2 solution we seek is ii+. ).
is the bisection algOrithm to find x*. there exists unique x*€(a. . even in the context of canonical multiparameter exponential families. Given f continuous on (a. such as the twoparameter gamma. entrust ourselves. b) such that f(x*) = O.1).x*l: Find Xo < x" f(xo) < 0 < f(x') by taking Ixol. Finally. > 2.7). x o1d = xo. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with £ open (Theorem 2. then. if implemented as usual. Ixd large enough. Let £ be the canonical parameter set for this full model and let e ~ (II: II. d.3. then the full 2nparameter model satisfies the conditions of Theorem 2.O. L 10).1. We begin with the bisection and coordinate ascent methods. strict concavity. 2.). f i strictly. L 10) for {3 is easy to write down symbolically but not easy to evaluate if d is at all large because inversion of Z'bZD requires on the order of nd 2 operations to evaluate each of d(d + 1)/2 tenns with n operations to get Z'bZD and then.3. an MLE (J of (J exists and () 0 ~ ~ Summary.3.1.3. However.Section 2. in this section. even in the classical regression model with design matrix ZD of full rank. order d3 operations to invert. we will discuss three algorithms of a type used in different statistical contexts both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all. The packages that produce least squares estimates do not in fact use fonnula (2. at various times. In fact.3. These results lead to a necessary condition for existence of the MLE in curved exponential families but without a guarantee of unicity or sufficiency. Then c(8) is closed in £ and we can conclude that for m satisfies (2. is isolated and shown to apply to a broader class of models. f( a+) < 0 < f (b. the basic property making Theorem 2. MLEs may not be given explicitly by fonnulae but only implicitly as the solutions of systems of nonlinear equations. in pseudocode. It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. the fonnula (2.3. Here. ~ 2. Given tolerance € > a for IXfinal . which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2. > OJ.1 The Method of Bisection The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in kparameter exponential families. b). E R.4 ALGORITHMIC ISSUES As we have seen. by the intermediate value theorem. Initialize x61d ~ XI.02 E R.4.1 work.4 Algorithmic Issues 127 If m > 2.1 and Corollary 2.
xoldl Methods of Estimation Chapter 2 < 2E..IXm+l ..1. If(xfinal)1 < E. for m = log2(lxt .128 (I) If IX~ld .i. h). o ! i If desired one could evidently also arrange it so that. xfinal = Xnew· (4) If f(xnew) < 0. Let X" .4.1. The bisection algorithm stops at a solution xfinal such that i Proot If X m is the mth iterate of Xnew i (I) Moreover. Let p(x I 1]) be a oneparameter canonical exponentialfamily generated by (T. by the intermediate value theorem. may befound (to tolerance €) by the method afbisection applied to f(. x~ld = Xnew· Go to (I). . xfinal = !(x~ld + x old ) and return xfinal' (2) Else.1) . x~ld (5) If f(xnew) > 0.4. (2. in addition. . !'(. By Theorem 1. I. Xnew = !(x~ld + x~ld)' ~ Xnew· (3) If f(xnew) = 0.. Xm < x" < X m +l for all m. From this lemma we can deduce the following. b) of the convex support a/PT. [(8.4.) = VarryT(X) > 0 for all .1. X n be i. Then.to· I' Proaf. 0 Example 2..x*1 < €. 00.. which exists and is unique by Theorem 2.. lx. so that f is strictly increasing and continuous and necessarily because i. End Lemma 2. f(a+) < 0 < f(b).4.xol/€)... Moreover. Theorem 2.) = EryT(X) .. . The Shape Parameter Gamma Family..3. the interior (a. satisfying the conditions of Theorem 2. i I i. I (3) and X m + x* i as m j.1). • .6.1 and T = to E C~.d.. the MLE Tj.xol· 1 (2) Therefore.3. exists.1.4.
Notes: (1) in practice. bisection itself is a defined function in some packages. it is in fact available to high precision in standard packages such as NAG or MATLAB. say c.. getting 1j 1r). but as we shall see.· .4.4. Here is the algorithm.'TJ3. 1 k.1]2.1. E1/(T(X» = A(1/) = to when the MLE Tj = Tj(to) exists. for a canonical kparameter exponential family.'fJk' ~1) an soon. This example points to another hidden difficulty.2 Coordinate Ascent The problem we consider is to solve numerically. However.1]2. in cycle j and stop possibly in midcycle as soon as <I< .4 Algorithmic Issues 129 Because T(X) = L:~' equation I log Xi has a density for all n the MLE always exists. r > 1. we would again set a tolerenceto be. 0 J:: 2. always converges to Ti.···. The case k = 1: see Theorem 2. 1] ~Ok = (~1 ~1 {} "") 1]I.Section 2. The general case: Initialize ~o 1] = ("" 'TJll···' "") • 'TJk Solve ~1 f or1]k: Set 1] 8'T]k [) A(~l ~1 1]ll1'J2.oJ _ = (~1 "" {}) ~02 _ 1]1.4. d and finally 1] _ ~Il) _ =1] (~1 1]1"". It solves the r'(B) r(B) T(X) n which by Theorem 2.'TJk =tk' ) . which is slow. for each of the if'.'TJk.'fJk· Repeat. In fact. .1 can be evaluated by bisection. The function r(B) = x'Iexdx needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method.···. eventually.
0 ~ (4) :~i (77') = 0 because :~.4.2. ilL" iii. A(71') ~ to. We give a series of steps. (2) The sequence (iji!. The case we have C¥.. 1 < j < k. the algorithm may be viewed as successive fitting of oneparameter families. The TwoParameter Gamma Family (continued). = 7Jk. .j l(i/i) = A (say) exists and is > 00. Proof.··.4. Therefore.1J k) . they result in substantial savings of time...2. ij = (V'I) . the MLE 1j doesn't exist.2.A(71) + log hex). This twodimensional problem is essentially no harder than the onedimensional problem of Example 2.2.I) solving r0'. Suppose that we can write fiT = (fir. (I) l(ij'j) Tin j for i fixed and in.. _A (1»). (fij(e) are as above. ff 1 < j < k.\(0) = ~.5). Then each step of the iteration both within cycles and from cycle to cycle is quick. (Wn') ~ O. . Thus.. Continuing in this way we can get arbitrarily close to 1j.4. 0 Example 2. to ¢ Fortunately in these cases the algorithm. (6) By (4) and (5).pO) = We now use bisection ' r' . Suppose that this is true for each l.2. j ¥ I) can be solved in closed form. refuses to converge (in fI space!)see Problem 2. We use the notation of Example 2. as it should.. To complete the proof notice that if 1j(rk ) is any subsequence of 1j(r) that converges to ij' (say) then. Hence. (i). We pursue this discussion next. Let 1(71) = tif '1. 1J ! But 71j E [. Whenever we can obtain such steps in algorithms. . TIl = .. ij as r t 00.l(ij') = A. (ii) a/Theorem 2... is the expectation of TI(X) in the oneparameter exponential family model with all parameters save T1I assumed known.. . is computationally explicit and simple.? Continuing. (3) I (71j) = A for all j because the sequence oflikelihoods is monotone. We can initialize with the method 2 of moments estimate from Example 2. limi.1.=1 d j = k and the problem of obtaining ijl(tO. 'II. the log likelihood.2: For some coordinates l. 71 1 is the unique MLE. by (I).4..130 Methods of Estimation Chapter 2 (2) Notice that (iii. that is. .3.. Else lim. We note some important generalizations. I • . 71J.1 hold and to E 1j(r) t g:.. ijik) has a convergent subsequence in t x . x ink) ( '~ini . I(W ) = 00 for some j. . 'tn..4. fI.3. For n > 2 we know the MLE exists. in fact.1j{ can be explicit. .. Consider a point we noted in Example 2. Because l(ijI) ~ Aand the MLE is unique. Bya standard argument it follows that.2'  It is natural to ask what happens if.. .I ~(~ (1) 1 to get V. ij'j and ij'(j+I) differ in only one coordinate for which iji(j+1) maximizes l. j (5) Because 1]1. (3) and (4) => 1]1 .'11 11 t t (I ..3. .1 because the equa~ tion leading to Anew given bold' (2. .'. Theorem 2. Here we use the strict concavity of t... W = iji = ij. 1j(r) ~ 1j. rp differ only in the second coordinate.···) C!J.I)) = log X + log A 0) and then A(1) = pY.) where flj has dimension d j and 2::.
.4 Algorithmic Issues 131 just discussed has d 1 = .1 in which Ix(O).2 has a generalization with cycles of length r.B2 )T where the log likelihood is constant. values of (B 1 .B~) = 0 by the method of j bisection in B to get OJ for j = 1.10.4. . A special case of this is the famous DemingStephan proportional fitting of contingency tables algorithmsee Bishop...Section 2. At each stage with one coordinate fixed. 1' B . Change other coordinates accordingly. Figure 2. and Problems 2. See also Problem 2. The coordinate ascent algorithm can be slow if the contours in Figure 2.p. Next consider the setting of Proposition 2. .1 illustrates the j process. If 8(x) exists and Ix is differentiable.. each of whose members can be evaluated easily.4.3. and Holland (1975). The coordinate ascent algorithm. the log likelihood for 8 E open C RP. find that member of the family of contours to which the vertical (or horizontal) line is tangent.4. which we now sketch.4.92. . the method extends straightforwardly.. 1 B. e ~ 3 2 I o o 1 2 3 Figure 2.. BJ+l' . iterate and proceed. for instance.1 are not close to sphericaL It can be speeded up at the cost of further computation by Newton's method. r = k. . is strictly concave. that is. Solve g~: (Bt. = dr = 1. Feinberg.4..4..1. The graph shows log likelihood contours.4.7. Then it is easy to see that Theorem 2. .
coordinate ascent. iinew after only one step behaves approximately like the MLE.2) The rationale here is simple. methods such as bisection. if l(B) denotes the log likelihood. B)}F(Xi.4.2) gives (2. Let Xl. If 110 ld is close enough to fj. If 7J o ld is close to the root 'ij of A(ij) expanding A(ij) around 11old' we obtain = to. _. which may counterbalance its advantage in speed of convergence when it does converge. This method requires computation of the inverse of the Hessian.B)} [i+exp{(xB)}j2' n 1 l(B) l(B) n2Lexp{(X. F(x.3) Example 2. and NewtonRaphson's are still employed.4.3. the argument that led to (2.10. (2.B) We find exp{ (x . in general. A hybrid of the two methods that always converges and shares the increased speed of the NewtonRaphson method is given in Problem 2. When likelihoods are noncave.7..f. when it converges. 1 l(x. then by f1new is the solution for 1] to the approximation equation given by the right.B) i=l 1 1 2 L i=l n I(X" B) < O. can be shown to be faster than coordinate ascent.3. I 1 The NewtonRaphson method can be implemented by taking Bold X.. In this case.4.4.4. then ii new = iiold . Newton's method also extends to the framework of Proposition 2.3 The NewtonRaphson Algorithm An algorithm that. is the NewtonRaphson method.132 Methods of Estimation Chapter 2 2. B) The density is = [I + exp{ (x = B) WI.4. though there is a distinct possibility of nonconvergence or convergence to a local rather than global maximum. A onedimensional problem 7 . Here is the method: If 110 ld is the current value of the algorithm.  I o The NewtonRaphson algorithm has the property that for large n. We return to this property in Problem 6.to).6.k I (iiold)(A(iiold) . X n be a sample from the logistic distribution with d.and lefthand sides. and Anderson (1974).1. Bjork. this method is known to converge to ij at a faster rate than coordinate ascentsee Dahlquist.
I)] (1 . and Weiss (1970). Po[X = (0. We give a few examples of situations of the foregoing type in which it is used.(0) })2EiI 10gB + Ei210g2B(1 .. . m+ 1 <i< n. Po[X (0. . and so on. and Rubin (977). X ~ Po with density p(x. but the computation is clearly not as simple as in the Original HardyWeinberg canonical exponential family example. Unfortunately.4) Evidently. . and Anderson (1974). however. A prototypical example folIows.4.€i3). Bjork. 0) is difficultto maximize. is not X but S where = 5i 5i Xi. €i2 + €i3). Petrie. For detailed discussion we refer to Little and Rubin (1987) and MacLachlan and Krishnan (1997).0)2. 1 8 m are not Xi but (€il. Say there is a closedfonn MLE or at least Ip. There are ideal observations. with an appropriate starting point. This could happen if. in Chapter 6 of Dahlquist.4. Many examples and important issues and methods are discussed. . the function is not concave.. a < 0 < I.Section 2. 0. where Po[X = (1.x(B) is "easy" to maximize.6. Soules. the homozygotes of one type (€il = 1) could not be distinguished from the heterozygotes (€i2 = 1). Xi = (EiI.4. difficult to compute. It does tum out that in this simplest case an explicit maximum likelihood solution is still possible. S = S(X) where S(X) is given by (2. for some individuals. though an earlier general form goes back to Baum. Laird. (2.. As in Example 2.x(B) is concave in B. 0). (0) = log q(s.4. be a sample from a population in HardyWeinberg equilibrium for a twoallele locus.0 E c Rd Their log likelihood Ip. 0 leads us to an MLE if it exists in both cases. 0) where I".4.n.2.OJ] i=l n (2.4 Algorithmic Issues 133 in which such difficulties arise is given in Problem 2. A fruitful way of thinking of such problems is in terms of 8 as representing part of X. and its main properties. for instance.0)] ~ 28(1 . The algorithm was fonnalized with many examples in Dempster. What is observed. 2. we observe 5 5(X) ~ Qo with density q(s. Here is another important example.5) + ~ [(EiI+Ei2)log(1(10)')+2Ei310g(IB)] i=m+l a function that is of curved exponential family fonn.4 The EM (Expectation/Maximization) Algorithm There are many models that have the following structure. the rest of X is "missing" and its "reconstruction" is part of the process of estimating (} by maximum likelihood.4. Yet the EM algorithm. i = 1. a)] = B2.B).4. then explicit solution is in general not lXJssible. let Xi. Lumped HardyWeinberg Data. The log likelihood of S now is 1"..13. e = Example 2.4).B) + 2E'310g(1 . 1 <i< m (€i1 +€i2.0. Ei3). E'2. If we suppose (say) that observations 81. 1.
Note that (2. we can think of S as S(X) where X is given by (2.<T.)<T~).). that under fJ..i = 1.4. ~i tells us whether to sample from N(Jil. J. the M step is easy and the E step doable.8).6) where A.4.2)andO <.7) ii. S has the marginal distribution given previously. which we give for 8 real and which can be justified easily in the case that X is finite (Problem 2. .p() [A.B) =E. If this is difficul~ the EM algorithm is probably not suitable.(sl'd +. Although MLEs do not exist in these models.\.(f~). Then we set Bnew = arg max J(B I Bold). O't..B) J(B I Bo) ~ E.B) 0=00 = E. o (:B 10gp(X. It is not obvious that this falls under our scheme but let (2. The second (M) step is to maximize J(B I Bold) as a function of B. Initialize with Bold = Bo· Tbe first (E) step of the algorithm is to compute J(B I Bold) for as many values of Bas needed. • • i ! I • . Bo) _ I S(X) = s ) (2.4.4.9) I for all B (under suitable regularity conditions).\"'u. . = 11 = .4. EM is not particularly appropriate.. we have given. j. . i. + (1  A. the Si are independent with L()(Si I .. reset Bold = Bnew and repeatthe process.Ai)I'Z.ll. .\ < 1. The EM algorithm can lead to such a local maximum. :B 10gq(s. (P(X.134 Methods of Estimation Chapter 2 Example 2. Suppose that given ~ = (~11"" ~n). It is easy to see (Problem 2.0'2 > 0.I. Suppose 8 1 . I:' .(I'. p(s. . Thus. Mixture of Gaussians. A. I . including the examples.B) IS(X)=s) q(s.8) by o taking logs in (2.4) = L()(Si I Ai) = N(Ail'l + (1. . As we shall see in important situations.12) q(s.4. The EM Algorithm. I' Hi where we suppress dependence on s.4. differentiating and exchanging E oo and differentiation with respect i.tz E Rand 'Pu (5) = .\ = 1.5.9) follows from (2. = 01.<T. a local maximum close to the true 8 0 turns out to be a good "proxy" for the 0 nonexistent MLE.o log p(X. The log likelihood similarly can have a number of local maxima and can tend to 00 as e tends to the boundary of the parameter space (Problem 2. Here is the algorithm. Let .Bo) and (2.8) .4.p (~)..6). .\)"'u. This fiveparameter model is very rich pennitting up to two modes and scales.af) or N(J12.Bo) 0 p(X. are independent identically distributed with p() [A.4.4.4.(sI'z)where() = (. ..B) I S(X) = s) 0=00 (2. Again. B) = (1. I.4.11).Sn is a sample from a population P whose density is modeled as a mixture of two Gaussian densities. I That is. if this step is difficult. The rationale behind the algorithm lies in the following formulas.12). (P(X.
I S(X) ~ s > 0 } (2. (2. then . Let SeX) be any statistic. hence. Theorem 2.4.. 0old)' Equality holds in (2.o log . 0 0 = 0old' = Onew.1." ) I S(X) ~ 8 . Fat x EX.1.4.4. 0) I S(X) = 8 ) DO (2. Suppose {Po : () E e} is a canonical exponential family generated by (T. O n e w ) } log ( " ) ~ J(Onew I 0old) . J(Onew I 0old) > J(Oold I 0old) = 0 by definition of Onew.lfOnew.4.0) is the conditional frequency function of X given S(X) J(O I 00 ) If 00 = s. Because.4. (2. 0 ) +E.4 Algorithmic Issues 135 to 0 at 00 .0) = O.3.O) { " ( X I 8. uold 0 r s.2.1. Lemma 2.O)r(x I 8.3.0 ) I S(X) ~ 8 . S (x) ~ 8 p(x. 00ld are as defined earlier and S(X) (2.4. We give the proof in the discrete case. However.0) (2.00Id) by Shannon's ineqnality. DO The main reason the algorithm behaves well follows. r(Xj8. Then (2.4.4.4.13) q(8. q S. formally. the result holds whenever the quantities in J(O I (0 ) can be defined in a reasonable fashion. ~ E '0 (~I ogp(X . uold Now.10) [)J(O I 00 ) DO it follows that a fixed point 0 of the algorithm satisfies the likelihood equation.16) ° q(s. Onew) > q(s. DJ(O I 00 ) [)O and.h) satisfying the conditions a/Theorem 2.(X 18.Onew) {r(X I 8 . Lemma 2. Id log (X I .Onew) E'Old { log r(X I 8.12) = s.Section 2.0) } ~ log q(s. On the other hand.E.15) q(s.13) iff the conditional distribution of X given S(X) forOnew asfor 00ld and Onew maximizes J(O I 0old)' =8 is the same Proof.14) whete r(· j ·.4.11)  D IOgq(8. In the discrete case we appeal to the product rule.17) o The most important and revealing special case of this lemma follows.4. (2. 0) ~ q(s.
4.I<j<2 I r r: I I r! Pol<it 11 <it + <i' = 11 0' B' B' + 20(1 . i i. (2. that.24) " ! I . Now. 1'I.A('1)}h(x) (2.21) Part (a) follows.B) 1 . • I E e (2Ntn + N'n I S) = 2Nt= + N. 1 <j < 3.136 (a) The EM algorithm consists of the alternation Methods of Estimation Chapter 2 A(Bnew ) = Eoold(T(X) I S(X) Bold ~ = s) (2. X is distributed according to the exponential family I where p(x.18) (2.25) I" . after some simplification. I • Under the assumption that the process that causes lumping is independent of the values of the €ij. we see.(A(O) .4 (continued)..B) 1) = log (1 = exp(1)(2Ntn (x) + N'n(x)) . I .A(Bo)) (2.4.4. = Ee(T(X) I S(X) = s) ~ (2.B). Thus.4.4.19) = Bnew · If a solution of(2.A(Bo)) I S(X) = s} (B .Bof Eoo(T(X) I S(X) = y) . A'(1)) = 2nB (2. .20) has a unique solution.4.+l (2.(A(O) . (b) lfthe sequence of iterates {Bm} so obtained is bounded and the equation A(B) o/q(s.(1 .4. Part (b) is more difficult.16.4.23) I . 0 Example 2. . = 11 <it +<" = IJ.4.= +Eo ( t (2<" + <i') I <it + <i'. In this case. which is necessarily a local maximum J(B I Bo) I I = Eoo{(B . i=m. m + 1 < i < n) .(x). A(1)) = 2nlog(1 + e") and N jn = L:~ t <ij(Xi).0)' 1.BO)TT(X) .Pol<i.4.4. h(x) = 2 N i . A proof due to Wu (I 983) is sketched in Problem 2.22) • ~ 0) . Proof.18) exists it is necessarily unique. then it converges to a limit 0·. 1 I • • Po I<ij = 1 I <it + <i' = OJ =  0.
T3 = The observed data are n l L zl.T = (1"1. to conclude ar.T 2 .(Y.T{ (2. (Zn.i. we oberve only Zi. and for n2 + 1 < i < n. we note that for the cases with Zi andlor Yi observed. converges to the unique root of ~ + N.(y. I). Y).d.) E.) E. a~ + ~.. ii2 .6. where (Z. compute (Problem 2.new ii~.12) that if2Nl = Bm.). we observe only Yi .: nl + 1 <i< n. Suppose that some of the Zi and some of the Yi are missing as follows: For 1 < i < nl we observe both Zi and Yi. I Zi) 1"2 + P'Y2(Z.l L ZiYi.26) It may be shown directly (Problem 2.4. Y) ~ N (I"" 1"" 1 O'~.8) of B based on the observed data.new ~ + I'i. b. r) (Problem 2. then B' .4). To compute Eo (T I S = s). at Example 2. ai ~ We take Bo~ = B MOM ' where BMOM is the method of moment estimates (11" 112.p2)a~ [1'2 + P'Y2(Z. This completes the Estep. where B = (111. l"l)/al]Zi with the corresponding Z on Y regression equations when conditioning on Yi (Problem 2. unew  _ 2N1m + N 2m n + 2 . I Z.4.(ZiY.4.1.Section 2. for nl + 1 < i < n2. 2 Mn .4.4.1).1) that the Mstep produces I1l..Bold n ~ (2. T 5 = n.(2N3= + N 2=)B + 2 (N1= + (I n n =0 0 in (0. 1"1)/ad 2 + (1 . i=I i=l n n n i=l s = {(Z" Y. For the Mstep. In this case a set of sufficient statistics is T 1 = Z.new T4 (Bo1d ) .4. ~ T l (801 d).4.Tl IlT4 (8ol d) .} U {Y. new = T 3 (8ol d) . al a2P + 1"1/L2). . Jl2. 112.Yn) be i. the EM iteration is L i=tn+l n ('iI + <. T 2 = Y.T2 ]}' .= > N 3 =)) 0 and M n > 0. a~..27) hew i'tT2Ji{[T3 (8ol d) . /L2.4 Algoritnmic Issues 137 where Mn ~ Thus. . the conditional expected values equal their observed values. Let (Zl'yl). + 1 <i< n}.l LY/. p).. For other cases we use the properties of the bivariate nonna] distribution (Appendix BA and Section 1. B). and find (Problem 2.. [T5 (8ald) 2 = T2 (8old)' iii. which is indeed the MLE when S is observed. T4 = n.1) A( B) = E. iii. as (Z.: n.2 I Z. I'l)/al [1'2 + P'Y2(Z.): 1 <i< nt} U {Z.
the EM algorithm is often called multiple imputation.2..3 and the problems. then J(B i Bo) is log[p(X. . including the NewtonRaphson method.). . involves imputing missing values. based On the first two moments. ! I 2. X n have a beta.6. (b) Find the method of moments estimate of>.4 we derive and discuss the important EM algorithm and its basic properties..4. where 0 < B < 1.1 1. distributions.0. what is a frequency substitution estimate of the odds ratio B/(l.4.. Summary. show that T 3 is a method of moment estimate of ().4. (b) Using the estimate of (a).2. based on the first moment. (a) Show that T 3 = N1/n + N 2 /2n is a frequency substitution estimate of e. (d) Find the method of moments estimate of the probability P(X1 will last at least a month. We then.4.P3 given by the HardyWeinberg proportions. 2B(1 .. X n assumed to be independent and identically distributed with exponential. (a) Find the method of moments estimate of >.1. in general.. in Section 2. which as a function of () is maximized where the contrast logp(X. .B)2. in the context of Example 2. (3( 0'1.2. j : > I) that one system 3.. j.d. use this algorithm as a building block for the general coordinate ascent algorithm. . Note that if S(X) = X. which yields with certainty the MLEs in kparameter canonical exponential families with E open when it exists. E"IJ(O I 00)] is the KullbackLeiblerdivergence (2.P2.0) and (I .. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all oneparameter canonical exponential families with E open (when it exists). Consider n systems with failure times X I . B)/p(X.4. (c) Combine your answers to (a) and (b) to get a method of moment estimate of >. 0) is minimized. Find the method of moments estimates of a = (0'1. Remark 2. Also note that.5 PROBLEMS AND COMPLEMENTS Problems fnr Sectinn 2. &(>. Finally in Section 2. respectively.B o)]. are discussed and introduced in Section 2. ~' ". . Important variants of and alternatives to this algorithm.1 with respective probabilities PI. (c) Suppose X takes the values 1. X I. based on the second moment. 2.0'2) distribution.i' I' I i.0'2) based on the first two moments. 0 Because the Estep. Now the process is repeated with B MOM replaced by ~ ~ ~ ~ 0new.138 Methods of Estimation Chapter 2 where T j (B) denotes T j with missing values replaced by the values computed in the Estep and T j = Tj(B o1d )' j = 1..23).B)? . Consider a population made up of three different types of individuals occurring in the HardyWeinberg proportions 02.. Suppose that Li. By considering the first moment of X. .
. (See Problem B. (d) FortI tk.. X 1l be the indicators of n Bernoulli trials with probability of success 8. YI ).F(tk)... ... Give the details of this correspondence.. < .) There is a Onetoone correspondence between the empirical distribution function ~ F and the order statistics in the sense that. given the order statistics we may construct F and given P. 4. The jth cumulant '0' of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant Cj' Give the first three sample cumulants.12. ~ ~ ~ (a) Show that in the finite discrete case. 8. . . . empirical substitution estimates coincides with frequency substitution estimates. Nk+I) where N I ~ nF(t l ). < ~ ~ Nk+1 = n(l  F(tk)). The empirical distribution function F is defined by F(x) = [No.2.Section 2. . (Z2...2. Let X(l) < .Xn be a sample from a population with distribution function F and frequency function or density p. (c) Argue that in this case all frequency substitution estimates of q(8) must agree with q(X). See A.. 6. . Yi) such that Zi < sand Yi < t n . t) is the bivariate empirical distribution function FCs. (b) Exhibit method of moments estimates for VaroX = 8(1 . 5. Let (ZI. we know the order statistics. (b) Show that in the continuous case X ~ rv F means that X (c) Show that the empirical substitution estimate of the jth moment JLj is the jth sample moment JLj' Hinr: Write mj ~ f== xjdF(x) ormj = Ep(Xj) where XF. of Xi < xl/n... N 2 = n(F(t2J . = Xi with probability lin..5 Problems and Complements 139 Hint: See Problem 8.)). ~  7. . .. (Zn.8)ln first using only the first moment and then using only the second moment of the population. Yn ) be a set of independent and identically distributed random vectors with common distribution function F. . < X(n) be the order statistics of a sample Xl. . Show that these estimates coincide. X n . . which we define by ~ F ~( s. (a) Show that X is a method of moments estimate of 8. The natural estimate of F(s.findthejointfrequencyfunctionofF(tl). Y2 ). ... t). ~ Hint: Express F in terms of p and F in terms of P ~() X = No. Let X I..F(t. . Hint: Consi~r (N I . of Xi n ~ =X. .. .8. Let Xl. If q(8) can be written in the fonn q(8) ~ s(F) for sOme function s of F we define the empirical substitution principle estimate of q( 8) to be s( F).t ) = Number of vectors (Zi.5.
the result follows.' where Z. Show that the least squares estimate exists.1. Vk and that q(9) can be written as ec q«(J) = h(!'I«(J)"" .2 with X iiI and /is. (J E R d. Vi). z) I tends to 00 as \. f(a.. 9.). >..81 tends to 00. . and so on.. . L I. .".. . Since p(X. I) = ".j2. z) is continuous in {3 and that Ig({3.. . as X ~ P(J.. There exists a compact set K such that for f3 in the complement of K. I .) is the distribution function of a probability P on R2 assigning mass lin to each point (Zi. X n be LLd.) Note that it follows from (A. 10.1. . (b) Define the sample product moment of order (i.• . Suppose X = (X"". Y are the sample means of the sample correlation coefficient is given by Z11 •. 0).j) is given by ~ The sample covariance is given by n l n L (Zk k=l .2). with (J identifiable. ~ i .4. Hint: See Problem B.140 Methods of Estimation Chapter 2 (a) Show that F(.!'r«(J)) I .."ZkYk . the sampIe correlation. (See Problem 2. respectively. are independent N"(O. the sample covariance. (a) Find an estimate of a 2 based on the second mOment. {3) is continuous on K. " . Suppose X has possible values VI. Show that the sample product moment of order (i.Y) = Z)(Yk . (b) Construct an estimate of a using the estimate of part (a) and the equation a .. ..2.Yn . 11. .L. Let X". Hint: Set c = p(X. = #. as the corresponding characteristics of the distribution F. . In Example 2. . : J • . 1".IU9) that 1 < T < L .17. p(X. suppose that g({3. (c) Use the empirical substitution principle to COnstruct an estimate of cr using the relation E(IX. j)..ZY.Xn ) where the X. In Example 2. find the method of moments estimate based on i 12. k=l n  n . {3) > c. . . The All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates.. . Zn and Y11 .
.9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X)). 1'(0.1. Let g.l and (72 are fixed. IG(p. find the method of moments estimates (i) Beta. where U.. General method of moment estimates(!). in estimate.. (c) If J. Use E(U[). 0).. . Show that the method of moments estimate q = h(j1!. = h(PI(O). . j i=l = 1.O) ~ (x/O')exp(x'/20').6..d. See Problem 1.6. A).Section 2.d. p fixed (v) Inverse Gaussian. 1'(1. .1) (iii) Raleigh. with (J E R d and (J identifiable.(X) ~ Tj(X).p(x.0 > 0 14.5 Problems and Com". to give a method of moments estimate of (72.... can you give a method of moments estimate of {3? . 0 ~ (p.x (iv) Gamma. . (b) Suppose P ~ po and I' = b are fixed. . . .. X n are i..) to give a method of moments estimate of p.10). fir) can be written as a frequency plugin estimate.i. When the data are not i. rep. . . (a) Use E(X. A). r. > 0.5. Let 911 .:::nt:::' ~__ 141 for some R k valued function h..l Lgj(Xi ).:::m:::. In the fOllowing cases.• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments.. Consider the Gaussian AR(I) model of Example 1. as X rv P(J. Iii = n. (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1.1.. Suppose X 1.ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill .p:::'.6.0) (ii) Beta.  1/' Po) / .36. 13.Pr(O)) .. . Suppose that X has possible values VI. 1 < j < k. . ~ (X. Vk and that q(O) for some Rkvalued function h. Hint: Use Corollary 1.i.
. ' l X q ). i = 1... let the moments be mjkrs = E(XtX~). .z. Show that method of moments estimators of the parameters b1 and al in the best linear predictor are estimate ~ 8 of B is obtained L:Z. the method of moments l by replacing mjkrs by mjkrs... Q.g(I3. Hint: [y. respectively. Y) and (ai.. 1 . and IF. Establish (2.. where (Z.. + '2P4 + 2PS ."" X iq ). we define the empirical or sample moment to be j ~ . .•• Om) can be expressed as a function of the moments. Let X = (Z.J is' _ n n... Y) and B = (ai. ).. k > 0.. . t n on the position of the object. S = 1. HardyWeinberg with six genotypes. . For a vector X = (Xl..z..z.. . 1 q. 16. I • ... b.4..1 " Xir Xk J.ZY ~ . I .. >.. j > 0.Y. Multivariate method a/moments.. FF..6).) .. = Y . and (J3.... _ (Z)' ' a.. n..)] = [Y. . The reading Yi ... . Readings Y1 . bI) are as in Theorem 1.. of observations. 5 SF 28.. . and F. II.  1.. 17.)] + [g(l3o. 1 6 and let Pl ~ N j / n.g(I3. .... P3 + "2PS + '2P6 PI are frequency plugin estimates of OJ. = n 1 L:Z.83 8t Let N j be the number of plants of genotype j in a sample of n independent plants.. k> Q mjkrs L.8..bIZ.z...142 Methods of Estimation Chapter 2 15..S 1 . 1 I .1". .. Let 8" 8" and 83 denote the probabilities of S.)].2 1. I.. Show that <j < i ! .. SF.. The HardyWeinberg model specifies that the six genotypes have probabilities L7=1 I Genotype Genotype Probability SS 2 II 8~ 3 FF 8j 4 SI 28...g(l3o.3. ... 83 6 IF 28... For independent identically distributed Xi = (XiI.1.. 1"... I. I I .  t=1 liB = (0 1 . In a large natural population of plants (Mimulus guttatus) there are three possible alleles S. SI. 11P2 + "2P4 + '2P6 . I ._ .. n 1  Problems for Sectinn 2. ()2. where OJ = 1.. An Object of unit mass is placed in a force field of unknown constant intensity 8. .• l Yn are taken at times t 1.! ... .q. and F at one locus resulting in six genotypes labeled SS.~b......
. (zn.2 may be derived from Theorem 1. . 13). Zn and that the linear regression model holds.. i = 1.. Find the least squares estimate of 0:. (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1. Find the MLE of 8.. B) = Be'x.<) ~ .BJ ..4. Suppose that observations YI . .z... _. A new observation Ynl.2. Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants. The regression line minimizes the sum of the squared vertical distances from the points (Zl. all lie on aline.. . Yl). .. .6) under the restrictions BJ > 0. . .2. if we consider the distribution assigning mass l/n to each of the points (zJ.. (J2) variables. .. and 13 ranges over R d 8. B > O. . (Pareto density) . g(zn. .Yi).1.. i + 1. B > O.t of the best (MSPE) predictor of Yn + I ? 4. Yn). . nl + n2.5) provided that 9 is differentiable with respect to (3" 1 < i < d. Let X I.n. Find the least squares estimates of ()l and B . 9. ~p ~(yj}) ~ . X n denote a sample from a population with one of the following densities Or frequency functions. . " 7 5. 2 10.B. x > 0.). 13). Show that the least squares estimate is always defined and satisfies the equations (2. Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi.3.4.Yn). B) = Bc'x('+JJ.4)(2... Find the LSE of 8.)' 1+ B5 6.. . (exponential density) (b) f(x. .5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l . .I is to be taken at time Zn+l.... 3. . in fact. where €nlln2 are independent N(O. Hint: Write the lines in the fonn (z.Section 2. Find the line that minimizes the sum of the squared perpendicular distance to the same points. B. What is the least squares estimate based on YI .y. to have mean 2.2. Hint: The quantity to be minimized is I:7J (y.." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi.. x > c. Show that the fonnulae of Example 2. Suppose Yi ICl".13 E R d } is closed. Yn have been taken at times Zl. . Find the least squares estimates for the model Yi = 8 1 (2. < O. c cOnstant> 0.. i = 1. (a) f(x. nl and Yi = 82 + ICi. . ..(zn. . . . the range {g(zr. 7. (a) Let Y I .We suppose the oand be uncorrelated with constant variance. Y.
MLEs are unaffected by reparametrization.. x > 0. (b) Find the maximum likelihood estimate of PelX l > tl for t > 1".:r > 0. Show that the MLE of ry is h(O) (i. 0 <x < I. Show that any T such that X(n) < T < X(1) + is a maximum likelihood estimate of O. Hint: Use the factorization theorem (Theorem 1. Let X" . (We write U[a. be a family of models for X E X C Rd. Because q is onto n. (a) Let X . 0 > O. > O. X n be a sample from a U[0 0 + I distribution. Define ry = h(8) and let f(x.5 show that no maximum likelihood estimate of e = (1".r(c+<).. I < k < p.: I I \ . I). .. .a)l rather than 0.2 are unknown. Show that depends on X through T(X) only provided that 0 is unique. . ry) denote the density or frequency function of X in terms of T} (Le.t and a 2 . (Rayleigh density) (f) f(x. density) (e) fix. 0) = Ocx c . then {e(w) : wEn} is a partition of e. then i  WMLE = arg sup sup{Lx(O) : 0 E e(w)}. .O) where 0 ~ (1". iJ( VB. ~ I in Example 2. for each wEn there is 0 E El such that w ~ q( 0). 0 > O. Let X I.l exp{ OX C } .. q(9) is an MLE of w = q(O).X)" > 0. Hint: Let e(w) = {O E e : q(O) = w}. be independently and identically distributed with density f(x.0"2). 0) = cO". I I I (b) LetP = {PO: 0 E e}.II ". Hint: You may use Problem 2.  under onetoone transformations).e.. c constant> 0. they are equivariant 16.I")/O") . 0 > O. say e(w). WEn . = X and 8 2 ~ n. 00 = I 0" exp{ (x . Pi od maximum likelihood estimates of J1 and a 2 . is a sample from a N(/l' ( 2 ) distribution. Suppose that Xl. (Wei bull density) 11. and o belongs to only one member of this partition. If n exists.. n > 2. < I" < 00.l L:~ 1 (Xi . I • :: I Suppose that h is a onetoone function from e onto h(e). reparametrize the model using 1]). bl to make pia) ~ p(b) = (b . . (beta. Thus. 0 E e and let 0 denote the MLE of O. 0> O. 12.. then the unique MLEs are (b) Suppose J. I i' 13.16(b).5.0"2) 15. J1 E R. Suppose that T(X) is sufficient for 0 and ~at O(X) is an MLE of O. (72 Ii (a) Show that if J1 and 0. Show that if 0 is a MLE of 0. X n • n > 2.0"2 (a) Find maximum likelihood estimates of J.144 Methods of Estimation Chapter 2 (c) f(".Pe. 0) = VBXv'O'. .  e .1.1)..2. e c Let q be a map from e onto n. the MLE of w is by definition ~ W. ncR'. x> 0. c constant> 0.) ! ! !' ! 14. . 0) = (x/0 2) exp{ _x 2/20 2 }. . . x > 1".t and a 2 are both known to be nonnegative but otherwise unspecified.P > I. X n . (Pareto density) (d) fix.
2.P. 20. a model that is often used for the time X to failure of an item is P. Show that the maximum likelihood estimate of 0 based on Y1 . We want to estimate B and Var. . (b) Solve the problem of part (a) for n = 2 when it is known that B1 < B. (b) The observations are Xl = the number of failures before the first success. If time is measured in discrete periods. .Xn be independently distributed with Xi having a N( Oi.." Y: . and have commOn frequency function. but e . Derive maximum likelihood estimates in the following models.B). identically distributed. Y n IS _ B(Y) = L. k=I. . . (a) Find maximum likelihood estimates of the fJ i under the assumption that these quantities vary freely.) Let M = number of indices i such that Y i = r + 1. f(k.. . . Bartholomew.n .. 19.~1 ' Li~1 Y. < 00.. where 0 < 0 < 1. ..B) =Bk . B) = 100' cp 9 (x I') + 0' 10 cp(x 1') 1 where cp is the standard normal density and B = (1'. and Brunk (1972). 1) distribution.[X ~ k] ~ Bk .5 Problems and Complements 145 Now show that WMLE ~ W ~ q(O).  .B).. (a) The observations are indicators of Bernoulli trials with probability of success 8. k ~ 1...1 (1 .1 (I_B). Bremner.6. A general solution of this and related problems may be found in the book by Barlow.. •• .X I = B(I. . We want to estimate O. and so on.. Thus. .. 21.Section 2. M 18. (We denote by "r + 1" survival for at least (r + 1) periods.r f(r + I. Let Xl. find the MLE of B. 0 < a 2 < oo}. . we observe YI . in a sequence of binomial trials with probability of success fJ.B) = 1.. Yn which are independent. Censored Geometric Waiting Times.O") : 00 < fl..[X < r] ~ 1 LB k=l k  1 (1_ B) = B'. 17. X n ) is a sample from a population with density f(x. X 2 = the number of failures between the first and second successes. Show that maximum likelihood estimates do not exist.. In the "life testing" problem 1. if failure occurs on or before time r and otherwise just note that the item has lived at least (r + 1) periods. Suppose that we only record the time of failure. 1 <i < n. 16(i). (KieferWolfowitz) Suppose (Xl.0") E = {(I'.
' wherej E J andJisasubsetof {UI . respectively. x) as a function of b.6. Ii equals one of the numbers 22..9 3.5 66..6).2. Tool life data Peed 1 Speed 1 1 Life 54. II I I :' 24.2 4. .jp): 0 <j.ji.2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3. (1') is If = (X.0 2. Weisberg. where'i satisfy (2. .0"2) if.x n .Y)' i=1 j=1 /(rn + n).2 3. Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer. where [t] is the largest integer that is < t.z!.2 4. Hint: Coosider the mtio L(b + 1. Suppose Y. (1') populations. x)/ L(b.I' fl. 1 1 1 o o . .8 3.0 5.0 I . Let Xl. 1i(b.2.(Zi) + 'i. the following data were obtained (from S. n).up(X.4)(2. and assume that zt'·· .. 7 . = fl.Xm and YI .tl.a 2 ) = Sllp~..8 2. J2 J2 o o o I 3.'. Polynomial Regression. " Yn be two independent samples from N(J.5 0.. .5 86..4 o o o o o o o o o o o o o Life 20..8 14.8 0.0 0.p(x. 1 <k< p}.< J. Y.t.1 2. TABLE 2..0 11.J.1.a 2 ) and NU". N. Set zl ~ . and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise.5 I' I I . Assume that Xi =I=. . 1985).X)' + L(Y. (1') where e n ii' = L(Xi .146 Methods of Estimation Chapter 2 that snp.Xj for i =F j and that n > 2. distribution. . and only if. . i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution). Show that the MLE of = (fl. . Xl. 23. Suppose X has a hypergeometric.
y)dzdy.6) with g({3.6.y)!(z. ZD = Wz ZD and € = WZ€. (a) The parameterization f3 (b) ZD is of rank d. Use these estimated coefficients to compute the values of the contrast function (2. ~. weighted least squares estimates are plugin estimates.Y') have density v(z.1). Y)Z') and E(v(Z.5) for (a) and (b). (b) Let P be the empirical probability .. in Problem 2. + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +.y)!(z.y) defined .4)(2.Section 2. (a) Show tha!. Show that p.2..(P) and p. The best linear weighted mean squared prediction error predictor PI (P) + p. of Example 2. However. Let W. z) ~ Z D{3. Two models are contemplated (a) Y ~ 130 + pIZI + p.1.z. l/Var(Y I Z = z). = (cutting speed . (a) Let (Z'.2. Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix.4)(2.8 and let v(z. Show that the follow ZDf3 is identifiable.900)/300. Both of these models are approximations to the true mechanism generating the data.2. Set Y = Wly. this has to be balanced against greater variability in the estimated coefficients.19).0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models. z) ing are equivalent.2.1). .3.1. . Y)Y') are finite. This will be discussed in Volume II.6)).2. .1 (see (B.ZD is of rank d. (12 unknown.(P) = Cov(Z'. Consider the model (2. Show that (3. That is.l be a square root matrix of W.2..(P)Z of Y is defined as the minimizer of t = zT {3. (c) Zj.Y = Z D{3+€ satisfy the linear regression mooel (2. Consider the model Y = Z Df3 + € where € has covariance matrix (12 W. Y')/Var Z' and pI(P) = E(Y*) 11. let v(z. Being larger. 28. Y)[Y .Z)]'). ZI ~ (feed rate .2.(b l + b.(P) coincide with PI and 13.13)/6. Y) have joint probability P with joint density ! (z. 25.(P)E(Z·). (2. 26. (2. Derive the weighted least squares nonnal equations (2. Let (Z. y) > 0 be a weight funciton such that E(v(Z. z. the second model provides a better approximation.5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life.2.. y). E{v(Z. .y)/c where c = I Iv(z. .  27..6) with g({3.
nk > 0....196 2.071 0.455 9.788 2. \ (e) Find the weighted least squares estimate of p.860 5.) ~ ~ (I' + 1'..j}. Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore.393 3.392 4.20).229 4. .. <J > 0. Yn are independent with Yi unifonnly distributed on [/1i . which vanishes if 8 j = 0 for any j = q + 1. (See Problem 2. Then = n2 = . . n.093 1. .476 3.921 2. 31.. 0) = II j=q+l k 0.564 1.716 7. different from Y? I I TABLE 2.2.2.1. 1'.916 6.870 30..958 10. .511 3. . k. in this model the optimal " MSPE predictor of the future Yi+ 1 given the past YI .397 4.723 1.391 0.274 5..457 2.~=l Z.6) W. Chon) shonld be read row by row. Suppose YI . J.17. = nq = 0. Consider the model Yi = 11 + ei.'. . /1i + a}.. .666 3. (a) Show that E(1'. i = 1. ~ ~ ~Show that the MLE of OJ is 0 with OJ = nj In. Use a weighted least squares c~mputer routine to compute the weighted least squares estimate Ii of /1.. I Yi is (/1 + Vi). i = 1.301 2.097 1.155 4.114 1. Show that the MLE of . In the multinomial Example 2..611 4.968 9.379 2.+l I Y .689 4.. Hint: Suppose without loss of generality that ni 0. Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay. 4 (b) Show that Y is a multivariate method of moments estimate of p.6)T(y .856 2. . Let ei = (€i + {HI )/2.182 0.5.). The data (courtesy S. where El" .480 5.599 0.300 6.a.020 8. = L. _'.(Y .081 2.703 4.249 1...360 1.1. where 1'. . . That is. ' I (f) The following data give the elapsed times YI .968 2..971 0. 2.075 4..858 3. then the  f3 that minimizes ZD.jf3j for given covariate values {z. k. Is ji.053 4.ZD..054 1. = (Y  29..918 2.) (c) Find a matrix A such that enxl = Anx(n+l)ECn+l)xl' (d) Find the covariance matrix W of e.039 9.669 7.676 5.€n+l are i.046 2.156 5.834 3..058 3.019 6. suppose some of the nj are zero.148 ~ Methods of Estimation Chapter 2 (b) Show that if Z  D has rank d.582 2.665 2.093 5. .038 3.6) T 1 (Y .908 1. .131 5.100 3. with mean zero and variance a 2 • The ei are called moving average errors.i.8.091 3..064 5.ZD. nq+1 > p(x.d. .6) is given by (2. . .ZD. n.453 1. j = 1.
= L Ix. Yn are independent with Vi having the Laplace density 1 2"exp {[Y. as (Z. ...... 34.. .i. Hint: Use Problem 1. ~ f3I. be the observations and set II = ~ (Xl .) Suppose 1'. . .l1. then the MLE exists and is unique./Jr ~ ~ and iii.. .5 Problems and Complements 149 (f3I' .{:i. A > 0.Section 2.. The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj). be i. 1). a)T is obtained by finding 131. (a) Show that the MLE of ((31.1. the sample median yis defined as ~ [Y(. Z and W are independent N(O. F)(x) *F denotes convolution. . y)T where Y = Z + v"XW. let Xl and X.. the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l)... .Jitl.I/a}. .:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x.. .. Show that the sample median ii is the minimizer of L~ 1 IYi . Hint: See Example 1..J. Give the MLE when . + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x .O) Hint: See Problem 2. (a) Show Ihat if Ill[ < 1.(3p that minimizes the least absolute deviation contrast function L~ I IYi . . Let x. i < j. 8 E R.(1 + x'j.tLil and then setting a = max t IYi . where Jii = L:J=l Zij{3j.& at each Xi.>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}.6.i.. XHL i& at each point x'. Yn ordered from smallest to largest.). x E R.) + Y(r+l)] where r = ~n.{3p. ~ 33.17). Let g(x) ~ 1/". Let B = arg max Lx (8) be "the" MLE.d... Suppose YI . Find the MLE of A and give its mean and variance. . and x. . with density g(x .3. If n is even.2. = I' for each i. . be the Cauchy density..32(b). " . These least absolute deviation estimates (LADEs).7 with Y having the empirical distribution F. .fin are called (b) If n is odd.x.. ./3p that minimizes the maximum absolute value conrrast function maxi IYi .I'.d.28Id(F. where Iii = ~ L:j:=1 ZtjPj· 32..d.  Illi < 1.. (3p. be i. Let X. . (See (2.Ili I and then setting (j = n~I L~ I IYi .4. . . a) is obtained by finding (31.8). Show that XHL is a plugin estimate of BHL. 35. Y( n) denotes YI . An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass .
51.x2)/2.) be a random samplefrom the distribution with density f(x.b).h(y) . Ii I. N(B. Let Xi denote the number of hits at a certain Web site on day i.. (a.i. then the MLE is not unique.B). that 8 1 E. ryo) and p' (x.150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. BI ) (p' (x. 1985). Also assume that S. Assume :1.B)g(i . Sl. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . Bo) is tn( B. Let Xl and X2 be the observed values of Xl and X z and write j. B) denote their joint density.. Assume that 5 = L:~ I Xi has a Poisson.h(y . 9 is twice continuously differentiable everywhere except perhaps at O. Suppose h is a II function from 8 Onto = h(8). Define ry = h(B) and let p' (x. Let X ~ P" B E 8.B)g(x. n. 38. where Al +.\2 = A. such that for every y E (a.ryt» denote the KullbackLeibler divergence between p( x.Xn are i. = I ! I!: Ii. Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x . B ) and p( X. Let 9 be a probability density on R satisfying the following three conditions: I.. Show that the entropy of p(x. B) is and that the KullbackLiebler divergence between p(x. 3.+~l Vi and 8 2 = E. 9 is continuous. i = 1. . then B is not unique.0). (b) Either () = x or () is not unique. j = n + I. (7') and let p(x.f)) in the likelihood equation.t> .B) g(x + t> . ryl Show that o n ». . .+~l Wj have P(rnAl) and P(mA2) distributions. > h(y) . and positive everywhere.)/2 and t> = (XI . and 8 2 are independent. . b). Let Vj and W j denote the number of hits of type 1 and 2 on day j. . Suppose X I. .B )'/ (7'. X. 'I i 39. and 5.. On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). . Find the MLEs of Al and A2 hased on 5... ~ (d) Use (c) to show that if t> E (a. . If we write h = logg. P(nA).d. ry) = p(x. Show that (a) The likelihood is symmetric about ~ x.. . 36. The likelihood function is given by Lx(B) g(xI . . h I (ry») denote the density or frequency function of X for the ry parametrization. symmetric about 0. B) and pix.B I ) (K'(ryo.j'. Let (XI. o Ii !n 'I . Let K(Bo.. B) = g(xB). Problem 35 can be generalized as follows (Dhannadhikari and JoagDev. 2. 37. 1 n + m. distribution. then h"(y) > 0 for some nonzero y. where x E Rand (} E R. ~ (XI + x... . ~ Let B ~ arg max Lx (B) be "the" MLE. a < b. ..
.)/o]}" (b) Suppose we have observations (t I . j 1 1 cxp{ x/Oj}. An example of a neural net model is Vi = L j=l p h(Zij.. ..1. . .1'.. 0). [3. > OJ and T. on a population of a large number of organisms. Yn). . 0 > O.) Show that statistics. ..7) for a. For the ca. where OJ > O. 1 En are uncorrelated with mean 0 and variance estimating equations (2. 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0.1. = LX...O) ~ a {l+exp[[3(tJ1. n. and /1. 42. exp{x/O. = L X. Give the least squares where El" •. [3. /3.olog{1 + exp[[3(t. Suppose Xl. yd 1 ••• . and fj. [3. Aj) + Ei. Variation in the population is modeled on the log scale by using the model logY.2. The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards. (tn. [3 > 0. (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism. . a.)/o)} (72.1. h(z. n where A = (a. Seber and Wild. 1'. n > 4. 41. 1').. . 1). j = 1. where 0 (a) Show that a solution to this equation is of the form y (a.c n are uncorrelated with mean zero and variance (72.. give the lea. 6. a > 0. [3.j ~ x <0 1. (.5 Problems and Complements 151 40.~t square estimating equations (2.I[X.I[X. Let Xl. and 1'. 1 X n satisfy the autoregressive model of Example 1. X n be a sample from the generalized Laplace distribution with density 1 O + 0.Section 2.}. . .7) for estimating a. T.~e p = 1.  J1.0). + C. A) = g(z. . i = 1.p. and g(t. i = 1. < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case.. 1959.. o + 0. . = log a . I' E R. x > 0. . .5. a get.
This set K will have a point where the max is attained.. t·. X n be i.1 > _£l. In a sample of n independent plants. (X. 3. .. 1 Xl <i< n.5).3. 0 for Xi < _£I. = 1] = p(x"a.y...'" £n)T of autoregression errors. Consider the HardyWeinberg model with the six genotypes given in Problem 2. Yn) is not a sequence of 1's followed by all O's or the reverse.Xi)Y' < L(C1 + c. .::..d. .i... . Prove Lenama 2. fi exists iff (Yl ..3.02): 01 > 0.152 Methods of Estimation Chapter 2 (aJ If /' is known. = OJ...0 . (b) Show that the likelihood equations are equivalent to (2.). S. Hint: Let C = 1(0).1.0.. = If C2 > 0. '12 = A and where r denotes the gamma function. < Show that the MLE of a.. I . Suppose Y I .15.fi) = 1log p pry. Give details of the proof or Corollary 2.p)..x.. I < j < 6.. . .)I(c.J. find the covariance matrix W of the vector € = (€1. . Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p. . .3. (One way to do this is to find a matrix A such that enxl = Anxn€nx 1. Xn· Ip (x. > 0. show that the MLE of fi is jj =  2. Let Xl. .. • II ~. . There exists a compact set K c e such that 1(8) < c for all () not in K.. Under what conditions on (Xl 1 . n > 2.. Hint: I . write Xi = j if the ith plant has genotype j. + 8.l. a.3 1.) 1 II(Z'i /1)2 (b) If j3 is known. + 0.) Then find the weighted least square estimate of f. + C1 > 0). fi) = a + fix. r(A. Is this also the MLE of /1? Problems for Section 2. C2' y' 1 1 for (a) Show that the density of X = (Xl. .4) and (2. n i=l n i=l n i=1 n i=l Cl LYi + C. the bound is sharp and is attained only if Yi x. " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4. . < I} and let 03 = 1.l  L: ft)(Xi /. Yn are independent PlY. = L(CI + C.(8 . _ C2' 2. gamma. L x.3. Let = {(01.x.l. < .
if n > 2.0 < ZI < . < Show that the MLE of (a. X n be Ll. Then {'Ij} has a subsequence that converges to a point '10 E f. .1 <j < ac k1.'~ve a unique solution (. and Zi is the income of the person whose duration time is ti.0. . < Zn Zn.1 (a) ~ ~ c(a) exp{ Ix . .Xn E RP be i.3. convex set {(Ij.' . a(i) + a..40.fl. . with density.ill T exists and is unique. ~fo (X u I. distribution where Iti = E(Y.l E R. < Zn. . C! > 0. e E RP... 8.1)...Ld.1 to show that in the multinomial Example 2. . MLEs of ryj exist iffallTj > 0. show 7. Prove Theorem 2.. • It) w' ( Xi It) .6.0). w( ±oo) = 00. .10 with that the MLE exists and is unique. then it must contain a sphere and the center of the sphere is an interior point by (B. ji(i) + ji.) ~ Ail = exp{a + ilz.. 9. Yn denote the duration times of n independent visits to a Web site.. . _.n.A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00.3. 0 < Zl < . Suppose Y has an exponential. . (0. 10. Use Corollary 2. and assume forw wll > 0 so that w is strictly convex. But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations. " (a) Show that if Cl: > ~ 1. [(Ai). the MLE 8 exists and is unique. Hint: The kpoints (0.z=7 11.. J. . the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i .3.1} 0 OJ = (b) Give an algorithm such that starting at iP = 0. > 3.6..0). Zj < .. .n) are the vertices of the :Ij <n}. (b) Show that if a = 1.lkIl :Ij >0. Let Xl. 12..d.5 Problems and Complements 153 n 6. .3. See also Problem 1.Section 2. Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 . n > 2.}. . the MLE 8 exists but is not unique if n is even.. Show that the boundary of a convex C set in Rk has volume 0. (O. In the heterogenous regression Example 1. .9. Let XI. 0:0 = 1.en. Hint: If BC has positive volume. (3) Show that. u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm. 1 <j< k1.. fe(x) wherec.. Let Y I .O...8) .t). ...3..
O'~ + J1~. .b)_(ao.2.) has a density you may assume that > 0.' ! (a) In the bivariate nonnal Example 2. O'~. for example. 3. ar 1 a~.9).bo) D(a. 2.b) .d.'2). .4. See. (b) If n > 5 and J11 and /12 are unknown.1. respectively. Y. {t2. . . EeT = (J11.4. Golnb and Van Loan. (I't') IfEPD 8a2 a2 D > 0 an d > 0.4.) Hint: (a) Thefunction D( a.. (Check that you are describing the GaussSeidel iterative method for solving a system of linear equations. (b) Reparametrize by a = . b) and lim(a.J1. O'~ + J1i..n log a is strictly convex in (a. .. provided that n > 3. > 0. EM for bivariate data. (a) Show that the MLEs of a'f. O'?. E" . P coincide with the method of moments estimates of Problem 2. it is unique. show that the estimates of J11. /12...2)2. .. 0'1 Problems for Section 2. (Xn .) I .4 1. Chapter 10.. b = : and consider varying a.)(Yi .6.J1. = (lin) L:.6.8. Show that if T is minimal and t: is open and the MLE doesn't exist. and p when J1.3. p) population.3.l and CT. En i. (i) If a strictly convex function has a minimum. complete the Estep by finding E(Zl I l'i) and E(ZiYi I Yd· (b) In Example 2. then the coordinate ascent algorithm doesn't converge to a member of t:. (See Example 2. b) ~ I w( aXi . ag. J12. "i . Hint: (b) Because (XI. W is strictly convex and give the likelihood equations for f. . Let (X10 Yd. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model i: y ~ ZDJ3 + <. rank(ZD) = k. 1985.2).. ~J ' . b successively. il' i.:. Note: You may use without proof (see Appendix B. verify the Mstep by showing that E(Zi I Yi). and p = [~(Xi .(Yi . b) = x if either ao = 0 or 00 or bo = ±oo. I (Xi ..1)". N(O.1 and J1.:.J1. L:.i. 8aob :1 13.. 7if)F 8 08 D 802 vb2 2 2 > (a'D)2 ' then D'IS strictIy convex.2)/M'''2] iI I'. Yn ) be a sample from a N(/11.2 are assumed to be known are = (lin) L:. Apply Corollary 2. .154 Methods of Estimation Chapter 2 (c) Show that for the logistic distribution Fo(x) [1 + exp{ X}]I. IPl < 1.J1. pal0'2 + J11J.
P. X is the number of members who have the trait.5 Problems and Complements 155 4. be independent and identically distributed according to P6.x = 1. 1 < i < n. ~2 = log C\) + (b) Deduce that T is minimal sufficient. 1 and (a) Show that X .. B) variable.O"~).Section 2. j = 0.. I n ) 8"(10)"" (a) Show that P(X = x MLE exists and is unique. For families in which one member has the disease. (c) Give explicitly the maximum likelihood estimates of Jt and 5.n. Y'i). also the trait). given X>l.B)n} _ nB'(I. the model often used for X is that it has the conditional distribution of a B( n. 1') E (0. Suppose the Ii in Problem 4 are not observed. 8new . 1]. thus. Because it is known that X > 1.{(Ii.. and that the (b) Use (2. Hint: Use Bayes rule. Suppose that in a family of n members in which one has the disease (and. 6. 2 (uJ L }iIi + k 2: Y. . Y1 '" N(fLl a}). Y (~ A 2:7 I(Yi  Y)'  O"~)/ (0". and given II = j. (a) Justify the following crude estimates of Jt and A.and Msteps of the EM algorithm for this problem. Do you see any problems with >. when they exist._x(1.B)"]'[(1 .[lt ~ 1] = A ~ 1 ..(1 _ B)n]{x nB . Let (Ii.(I.[lt ~ OJ.Ii). .I ~ + (1  B)n] . is distributed according to an exponential ryl < n} = i£(1 I) uzuy. i ~ 1'. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it. 2:Ji) ... IX > 1) = \ X ) 1 (1 6)" .1) x Rwhere P.? (b) Give as explicitly as possible the E.4.[1 . where 8 = 80l d and 8 1 estimate of 8. as the first approximation to the maximum likelihood .3) to show that the NewtonRaphson algorithm gives ~ _ 81 ~ 8 ~ _ ~ _ B(I. B= (A.B)n[n . it is desired to estimate the proportion 8 that has the genetic trait. B E [0.2B)x + nB'] _. I' >.(1 .B)[I. Yd : 1 < i family with T afi I a? known.
Na+c. Show that the sequence defined by this algorithm converges to the MLE if it exists. 1 < b < B. Let Xl. b1 c and then are . + Vbc where 00 < /l. .c)} and "+" indicates summation over the ind~x. Let Xl.9) = ".. . X n be i.2. W). Define TjD as before. ..A(iJ(A)). . iff U and V are independent given W.1 (1 + (x e. .v ~ b I W = c] = P[U = a I W = c]P[V = b I W ~ i.c Pabc = l. N +bc given by < N ++c for all a. PIU = a. N . Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between p and a. X 3 = a. (1) 10gPabc = /lac = Pabe.i. (h) Show that the family of distributions obtained by letting 1'.156 (c) [f n = 5.X3 be independent observations from the Cauchy distribution about f(x.9)')1 Suppose X. v < 00.b. (a) Deduce that depending on where bisection is started the sequence of iterates may converge to one or the other of the local maxima (b) Make a similar study of the NewtonRaphson method in this case. Let and iJnew where ).riJ(A) . ~ 0. 9. 1 <c < C and La. N++ c N a +c N+ bc Pabc = . X. the sequence (11ml ijm+l) has a convergent subse quence. c]' Show that this holds iff PIU ~ a.4.e. W = c] 1 < a < A.d. where X = (U. = 1.1 generated by N++c. 7. Hint: Apply the argument of the proof of Theorem 2. (a) Suppose for aU a. hence.N+bc where N abc = #{i : Xi = (a. (c) Show that the MLEs exist iff 0 < N a+c . Consider the following algorithm under the conditions of Theorem 2. V.1) + C(A + B2) = C(A + B1) . v vary freely is an exponential family of rank (C . b1 C. V = b. 8.b.X 2 .2 noting that the sequence of iterates {rymJ is bounded and.4. X Methods of Estimation Chapter 2 = 2. * maximizes = iJ(A') t. find (}l of (b) above using (} =   x/n as a preliminary estimate. n N++ c ++c  .
4.and Msteps of the EM algorithm in this case. Nl+co (c) The model implies Pabc = P+bcPa±c/P++c and use the likelihood equations.3 + (A . Hint: Note that because {p~~~} belongs to the model so do all subsequent iterates and that ~~~ is the MLE for the exponential family Pabc = ellauplO) _=.I)(C .1)(B .N++c/A. = fo(x  9) where fo(x) 3'P(x) + 3'P(x .. Let f. (b) Give explicitly the E.(A + B + C).Section 2.c' obtained by fixing the "b. (a) Show that S in Example 2.I)(C . c" and "a.=_ L ellalblp~~~'CI a'.8).N++ c / B. 11.. c" parameters. v.[X = x I SeX) = s] = 13.I) ~ AB + AC + BC . Suppose X is as in Problem 9.b'. 10. f vary freely. (b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model.c.a>!b".5 Problems and Complements 157 Hint: (b) Consider N a+ c . 12. Hint: P. (a) Show that this is an exponential family of rank A + B + C .:.I) + (B . Show that the algorithm converges to the MLE if it exists and di· verges otheIWise. Initialize: pd: O) = N a ++ N_jo/>+ N++ c abc nnn Pabc d2) dl) Nab+ Pabc dO) n n Pab+ d1) dl) dO) Pabc d3) N a + c Pabc Pa+c N+bc Pabc d 2) Pabc n P+bc d2)· Reinitialize with ~t~.1) + (A . but now (2) logPabc = /lac + Vbc + 'Yab where J1.(x) :1:::: 1(S(x) ~ = s). N±bc .4.5 has the specified mixtnre of Gaussian distribution. Justify formula (2.a) 1 2 .
d. 17. the "missingness" of Vi is independent of Yi.n}." .1 (1) "Natural" now was not so natural in the eighteenth century when the least squares principle was introduced by Legendre and Gauss. Bm+d} has a subsequence converging to (0" JJ*) and. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component X j of the data vector X is missing given the rest of the data vector is not a function of X j .. I Z. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986). this assumption may not be satisfied. A. If Vi represents the seriousness of a disease. Verify the fonnula given in Example 2. the process determining whether X j is missing is independent of X j .5."" En. En a an 2. {32).4. For instance.{Xj }. NOTES . but not on Yi.5. suppose all subjects with Yi > 2 drop out of the study. . In Example 2. For example. (J*) and necessarily g. . al = a2 = 1 and p = 0. R.and Msteps of the EM algorithm for estimating (ftl. if a is sufficiently large.4. Establish the last claim in part (2) of the proof of Theorem 2. N(ftl. EM and Regression. • . in Example 2. . find the probability that E(Y. where E) . (2) The frequency plugin estimates are sometimes called Fisher consistent.. the probability that Vi is missing may depend on Zi. 14. given X .3. . That is. necessarily 0* is the global maximizer. That is.) underpredicts Y. ~ . This condition is called missing at random. N(O.158 Methods of Estimation Chapter 2 and r. = {(Zil Yi) Yi :i = 1. we observe only}li. ! . Zl... If /12 = 1. . 15. I I' . Suppose that for 1 < i < m we observe both Zi and Yi and for m + 1 < i < n. Hint: Use the canonical nature of the family and openness of E. Establish part (b) of Theorem 2.~ go. Noles for Section 2. Hint: Show that {(19 m.p is the N(O} '1) density.d..3 for the actual MLE in that example. For X .6.. and independent of El. a 2 . given Zi. 18.4. ~ 16. 2 ). These considerations lead essentially to maximum likelihood estimates.i. suppose Yi is missing iff Vi < 2. Limitations a/the EM Algorithm.6.2. Then using the Estep to impute values for the missing Y's would greatly unclerpredict the actual V's because all the V's in the imputation would have Y < 2. consider the model I I: ! I I = J31 + J32 Z i + Ei I are i.4.. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected. Bm + I)} has a subsequence converging to (B* . Ii . Zn are ii. {31 1 a~. Complete the E.4. Show for 11 = 1 that bisection may lead to a local maximum of the likelihood.6 . . Hint: Show that {( 8m . . thus.
GIlBELS. G. FEINBERG. M. P[T(X) E Al o for all or for no PEP. LAIRD. SOULES. T. Fisher 1950) New York: J. "On the Mathematical Foundations of Theoretical Statistics. J.3 (I) Recall that in an exponential family. M. ANOH. J.. La. 138 (1977). Roy. Soc. see Cover and Thomas (1991). 2. AND P.. 1922. Numerical Analysis New York: Prentice Hall. S.. 199200 (1985). G. 164171 (1970). Local Polynomial Modelling and Its Applications London: Chapman and Hall. Campbell.• AND I.. La. t 996. "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm. "Y.5 (1) In the econometrics literature (e. Note for Section 2. F. FISHER. M. W. MA: MIT Press. 1997.g.. R.Section 2. 0. HABERMAN. Statistical Inference Under Order Restrictions New York: Wiley. BRUNK. "Y. Note for Section 2. EiSENHART. J. B..loAGDEV.2. Elements oflnfonnation Theory New York: Wiley. MACKINLAY. A." Journal Wash. C.7 REFERENCES BARLOW. RUBIN. . AND A. (2) For further properties of KullbackLeibler divergence. 1985. AND N. NJ: Princeton University Press.. E.1. G. AND K. 2433 (1964). GoLUB." reprinted in Contributions to MatheltUltical Statistics (by R. and MacKinlay. Matrix Computations Baltimore: John Hopkins University Press. WEISS." J.1974. DAHLQUIST." The American Statistician. "Examples of Nonunique Maximum Likelihood Estimators. BJORK.. PETRIE. Wiley and Sons. B. R. A. 1974. A. Appendix A. "The Meaning of Least in Least Squares. 1991. 54. ANDERSON. 1. BARTHOLOMEW. The Analysis ofFrequency Data Chicago: University of Chicago Press. 1997). HOLLAND. The Econometrics ofFinancial Markets Princeton. Statist. CAMPBELL.. A. Acad. 39. for any A. H. Discrete Multivariate Analysis: Theory and Practice Cambridge.. AND N. DHARMADHlKARI..7 References 159 Notes for Section 2. L. "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains:' Am!.2 (I) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964). 41. 1972. Statist. c. S. M. D.. T. 1975. W. M. THOMAS. BREMNER. a multivariate version of minimum contrasts estimates are often called generalized method of moment estimates. E.. VAN LOAN. DEMPSTER. FAN. Math. AND J. AND D. BAUM. AND C. S. BISHOP. A. M. A. E. COVER. Sciences.
G. 1997. I • I .379243. AND D. 128 (1968). AND c. F. "Multivariate Locally Weighted Least Squares Regression. . E. IA: Iowa State University Press.. G. 2nd ed. 1%7. STIGLER. "A Flexible Growth Function for Empirical Use. N." Bell System Tech. "On the Convergence Properties of the EM Algorithm. Statist. 6th ed. Wiley. The EM Algorithm and Extensions New York: Wiley..1. G. Statistical Anal. E... WAND. Statist. Exp. 290300 ( 1959). SHANNON. 27. J. Amer. Assoc. SNEDECOR. 1985. S. J. KR[SHNAN. Botany. J. Statisr. LIlTLE. RUBIN. "A Mathematical Theory of Communication. Theory..·sis with Missing Data New York: J. W." Ann. J. 102108 (1956)." J. i I... WEISBERG.623656 (1948). 63. 11. E. WILD. MACLACHLAN. RUPPERT. 1986." 1. Statistical Methods. A. R. 1989. AND W. RICHARDS. AND M. D. Nonlinear Regression New York: Wiley." Ann. C. MOSTELLER. MA: Harvard University Press. 22. A.160 Methods of Estimation Chapter 2 KOLMOGOROV. I 1 ."/RE Trans! Inform. "On the Shannon Theory of Information Transmission in the Case of Continuous Signals. AND T. B. A. C. 13461370 (1994).. 10. The History ofStatistics Cambridge.. In. S. WU. P. Journal.. • . 1987.. "Association and Estimation in Contingency Tables. Applied linear Regression.. Ames... New York: Wiley. COCHRAN. SEBER. F..95103 (1983) .
action space A. We also discuss other desiderata that strongly compete with decision theoretic optimality. after similarly discussing testing and confidence bounds.6) = E o{(O. R(·.6) =ER(O.3.a). (3. We return to these themes in Chapter 6. However. 6) as measuring a priori the performance of 6 for this model. which is how to appraise and select among decision procedures. in particular computational simplicity and robustness. In Sections 3. o r(1T.1) 161 .3 that if we specify a parametric model P = {Po: E 8}. and 62 on the basis of the risks alone is not well defined unless R(O. ) < R(B.2.4. 6 2 ) for all 0 or vice versa.2 BAYES PROCEDURES Recall from Section 1. we study. Strict comparison of 6.+ R+ by ° R(O.6(X). We think of R(. actual implementation is limited.6) = EI(6. However. In Section 3. loss function l(O. in the context of estimation.6(X)). NOTIONS OF OPTIMALITY.3 we show how the important Bayes and minimax criteria can in principle be implemented.2 and 3. OUf examples are primarily estimation of a real parameter. AND OPTIMAL PROCEDURES 3. in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case. then for data X '" Po and any decision procedure J randomized or not we can define its risk function. the relation of the two major decision theoretic principles to the 000decision theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness.1 INTRODUCTION Here we develop the theme of Section 1.6 . by introducing a Bayes prior density (say) 1r for comparison becomes unambiguous by considering the scalar Bayes risk. 6) : e .Chapter 3 MEASURES OF PERFORMANCE. 3.
3).2. Thus. 0) = (q(O) using a nonrandomized decision rule 6.(O)dfI p(x I O). In view of formulae (1. (0. Suppose () is a random variable (or vector) with (prior) frequency function or density 11"(0).. e "(). In the continuous case with real valued and prior density 1T. . 1 1  ..4) the parametrization plays a crucial role here. we find that either r(1I".8) for the posterior density and frequency functions.6) = J R(0. This exercise is interesting and important even if we do not view 1r as reflecting an implicitly believed in prior distribution on (). we can give the Bayes estimate a more explicit form.162 Measures of Performance Chapter 3 where (0.2. B denotes mean height of people in meters.2) the Bayes risk of the problem.(O)dfI (3.6): 6 E VJ (3.5). such that (3. and 1r may express that we care more about the values of the risk in some rather than other regions of e.4) !. j I 1': II .. 'f " . This is just the prohlem of finding the best mean squared prediction error (MSPE) predictor of q(O) given X (see Remark 1. oj = ". 6) ~ E(q(O) . For testing problems the hypothesis is often treated as more important than the alternative.3 we showed how in an example. and that in Section 1. Here is an example. Clearly. 0) 2 and 1T2(0) = 1. for instance.2.2. if 1[' is a density and e c R r(".5) This procedure is called the Bayes estimate for squared error loss...5. (3.4. 6) Bayes rule 6* is given by = a? 6'(X) = E[q(O) I XJ.2.2.6(X)j2. x _ !oo= q(O)p(x I 0).6) In the discrete case. it is plausible that 6. 00 for all 6 or the Using our results on MSPE prediction. 1(0.(0)1 . if computable will c plays behave reasonably even if our knowledge is only roughly right. 0) and "I (0) is equivalent to considering 1 (0. considering I. as usual. Recall also that we can define R(.(O)dO (3.6). Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). We don't pursue them further except in Problem 3.) ~ inf{r(".(0) a special role ("equal weight") though (Problem 3.2. (0.r= . we just need to replace the integrals by sums. we could identify the Bayes rules 0. After all.2.2. Our problem is to find the function 6 of X that minimizes r(". If 1r is then thought of as a weight function roughly reflecting our knowledge. It is in fact clear that prior and loss function cannot be separated out clearly either. We may have vague prior notions such as "IBI > 5 is physically implausible" if. We first consider the problem of estimating q(O) with quadratic loss. and instead tum to construction of Bayes procedure. X) is given the joint distribution specified by (1.3) In this section we shall show systematically how to construct Bayes rules..
7) Its Bayes risk (the MSPE of the predictor) is just r(7r.Section 3. we obtain the posterior distribution The Bayes estimate is just the mean of the posterior distribution J'(X) ~ 1]0 l/T' ] _ [ n/(J2 ] [n/(J' + l/T' + X n/(J' + l/T' (3. In fact. If we look at the proof of Theorem 1. Thus. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior. J') ~ 0 as n + 00. However. Applying the same idea in the general Bayes decision problem.w)X of the estimate to be used when there are no observations. This quantity r(a I x) is what we expect to lose.2.5. r) . J') E(e .1. X is approximately a Bayes estimate for anyone of these prior distributions in the sense that Ir( 7r. For more on this. X is the estimate that (3.12.6.6) yields. This action need not exist nor be unique if it does exist. if X = :x and we use action a. Such priors with f 7r(e) ~ 00 or 2: 7r(0) ~ 00 are called impropa The resulting Bayes procedures are also called improper. But X is the limit of such estimates as prior knowledge becomes "vague" (7 . Fonnula (3. 1]0 fixed). Intuitively. " Xn.2. for each x. we fonn the posterior risk r(a I x) = E(l(e.2.E(e I X))' = E[E«e . . Because the Bayes risk of X. 0 We now tum to the problem of finding Bayes rules for gelleral action spaces A and loss functions l. To begin with we consider only nonrandomized rules.E(e I X))' I X)] E[(J'/(l+~)]=nja 2 + 1/72 ' 1 2 n n/ No finite choice of '1]0 and 7 2 will lead to X as a Bayes estimate.) minimizes the conditional MSPE E«(Y . take that action a = J*(x) that makes r(a I x) as small as possible. In fact. .a)' I X ~ x) as a function of the action a. that is. tends to a as n _ 00. and X with weights inversely proportional to the Bayes risks of these two estimates. Suppose that we want to estimate the mean B of a nonnal distribution with known variance a 2 on the basis of a sample Xl.r( 7r. see Section 5.7) reveals the Bayes estimate in the proper case to be a weighted average MID + (1 .2 Bayes Procedures 163 Example 3.1). E(Y I X) is the best predictor because E(Y I X ~ .2. '1]0. if we substitute the prior "density" 7r( 0) = 1 (Prohlem 3. If we choose the conjugate prior N( '1]0. we should. we see that the key idea is to consider what we should do given X = x. a) I X = x). J' )]/r( 7r. the Bayes estimate corresponding to the prior density N(1]O' 7 2 ) differs little from X for n large.2. /2) as in Example 1. a 2 / n.4.00 with. .1.
8). " .·.2. ..1. Example 3. r(a.2.. let Wij > 0 be given constants.· .o(X)) I X = x] = r(o(x) Therefore.70 and we conclude that 0'(1) = a. I x) > r(o'(x) I x) = E[I(O.8) Then 0* is a Bayes rule. 11) = 5. Let = {Oo. az. !i n j.2. Therefore.67 5.2. (3.) = 0.o) = E[I(O. E[I(O. Suppose we observe x = O.o(X)) I X] > E[I(O. r(al 11) = 8.9).) = 0. I 0)  + 91(0" ad r(a.! " j.. More generally consider the following class of situations.2.1. 0'(0) Similarly.o'(X)) IX = x].89.2. consider the oildrilling example (Example 1. E[I(O. As in the proof of Theorem 1. Then the posterior distribution of o is by (1. Suppose that there exists a/unction 8*(x) such that r(o'(x) [x) = inf{r(a I x) : a E A}. .8) Thus.Op}.35. az has the smallest posterior risk and.o(X)) I X)]. and the resnlt follows from (3. we obtain for any 0 r(?f. 10) 8 10. the posterior risks of the actions aI.2. Bayes Procedures Whfn and A Are Finite. :i " " Proof. o As a first illustration. 6* = 05 as we found previously. by (3.8.4. [I) = 3.o'(X)) I X]. (3. Therefore. a q }.2. and aa are r(at 10) r(a. and let the loss incurred when (}i is true and action aj is taken be given by j e e . A = {ao. if 0" is the Bayes rule.o(X))] = E[E(I(O.5) with priOf ?f(0. .74.9) " "I " • But. ?f(O.. ~ a. r(a.2. The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all corrpeting procedures.164 Measures of Performance Chapter 3 Proposition 3.3.
Section 3.2
Bayes Procedures
165
Let 1r(0) be a prior distribution assigning mass 1ri to Oi, so that 1ri > 0, i = 0, ... ,p, and Ef 0 1ri = 1. Suppose, moreover, that X has density or frequency function p(x I 8) for each O. Then, by (1.2.8), the posterior probabilities are
and, thus,
raj (
x I) =
EiWipTiP(X I Oi) . Ei 1fiP(X I Oi)
(3.2.10)
The optimal action 6* (x) has
r(o'(x) I x)
=
O<rSq
min r(oj
I x).
Here are two interesting specializations. (a) Classification: Suppose that p
= q, we identify aj with OJ, j
1, i f j
=
0, ... ,p, and let
Wii
O.
This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,
r(Oi
Ix) = Pl8 f ei I X = xl
and minimizing r(Oi I x) is equivalent to the reasonable procedure of maximizing the posterior probability,
PI8 = e I X = i
xl
=
1fiP(X lei) Ej1fjp(x I OJ)
(b) Testing: Suppose p = q = 1, 1ro = 1r, 1rl = 1  'Jr, 0 < 1r < 1, ao corresponds to deciding 0 = 00 and al to deciding 0 = 01 • This is a special case of the testing fonnulation of Section 1.3 with 8 0 = {eo} and 8 1 = {ed. The Bayes rule is then to
decide e decide e
= 01 if (1 1f)p(x I 0, ) > 1fp(x I eo) = eo if (1  1f)p(x IeIl < 1fp(x Ieo)
and decide either ao or al if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing .between ao and al if equality occurs. As we let 'Jr vary between zero and one. we obtain what is called the class of NeymanPearson tests, which provides the solution to the problem of minimizing P (type II error) given P (type I error) < Ct. This is treated further in Chapter 4. D
166
Measures of Performance
Chapter 3
To complete our illustration of the utility of Proposition 3.2. L we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation ofthe Probability ofSuccess in n Bernoulli Trials. Suppose that we wish to estimate () using X I, ... , X n , the indicators of n Bernoulli trials with
probability of success
e, We shall consider the loss function I given by
(0  a)2 1(0, a) = 0(1 _ 0)' 0 < 0 < I, a real.
(3.2.11)
"
" " ,

d
This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for () close to zero, this l((), a) is close to the relative squared error (fJ  a)2 JO, lt makes X have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5. By sufficiency we need only consider the number of successes, S. Suppose now that we have a prior distribution. Then, if aU terms on the righthand side are finite,
rea I k)
(3.2.12)
;,
I
, " " I
,

I
Minimizing this parabola in a, we find our Bayes procedure is given by
J'(k) = E(1/(1  0) I S = k) E(I/O(I  0) IS  k)
;
(3.2.13)

i! , ,.
provided the denominator is not zero. For convenience let us now take as prior density the density br " (0) of the bela distribution i3( r, s). In Example 1. 2.1 we showed that this leads to a i3(k +r,n +s  k) posteriordistributionforOif S = k. Ifl < k < nI and n > 2, then all quantities in (3.2.12) and (3.2.13) are finite, and
.
,
";1 
J'(k)
Jo'(I/(1  O))bk+r,n_k+,(O)dO
J~(1/0(1 O))bk+r,n_k+,(O)dO
B(k+r,nk+sI) B(k + r  I, n  k + s  I) k+rl n+s+r2'
(3.2.14)
where we are using the notation B.2.11 of Appendix B. If k = 0, it is easy to see that a = is the only a that makes r(a I k) < 00. Thus, J'(O) = O. Similarly, we get J'(n) = 1. If we assume a un!form prior density, (r = s = 1), we see that the Bayes procedure is the usual estimate, X. This is not the case for quadratic loss (see Problem 3.2.2). 0
°
"Real" computation of Bayes procedures
The closed fonos of (3.2.6) and (3.2.10) make the compulation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that 0 ~ (0" . .. , Op) has a hierarchically defined prior density,
"(0,, O ,... ,Op) = 2
"1 (Od"2(02
I( 1 ) ... "p(Op I Op_ d.
(3.2.15)
I
Section 3.2
Bayes Procedures
167
Here is an example. Example 3.2.4. The random effects model we shall study in Volume II has
(3.2.16)
where the €ij are i.i.d. N(O, (J:) and Jl, and the vector ~ = (AI, .. ' ,AI) is independent of {€i; ; 1 < i < I, 1 < j < J} with LI." ... ,LI.[ i.i.d. N(O,cri), 1 < j < J, Jl, '"'" N(Jl,O l (J~). Here the Xi)" can be thought of as measurements on individual i and Ai is an "individual" effect. If we now put a prior distribution on (Jl,,(J~,(J~) making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by () = (J.L, (J~, (J~, AI,. '. ,AI) with the Xii I () independent N(f.l + Ll. i . cr~). Then p(x I 0) = lli,; 'Pu.(Xi;  f.l Ll. i ) and
11'(0)
=
11'}(f.l)11'2(cr~)11'3(cri)
II 'PU" (LI.,)
i=l
I
(3.2.17)
where <Pu denotes the N(O, (J2) density. ]n such a context a loss function frequently will single out some single coordinate Os (e.g., LI.} in 3.2.17) and to compute r(a I x) we will need the posterior distribution of ill I X. But this is obtainable from the posterior distribution of (J given X = x only by integrating out OJ, j t= s, and if p is large this is intractable. ]n recent years socalled Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable 0 and the use of Bayesian methods has spread. We return to the topic in Volume II.
Linear Bayes estimates
When the problem of computing r( 1r, 6) and 611" is daunting, an alternative is to consider  of procedures for which r( 'R', 6) is easy to compute and then to look for 011" E 'D  a class 'D that minimizes r( 1f 16) for 6 E 'D. An example is linear Bayes estimates where, in the case of squared error loss [q( 0)  a]2, the problem is equivalent to minimizing the mean squared prediction error among functions of the form a + L;=l bjXj _ If in (1.4.14) we identify q(8) with Y and X with Z, the solution is

8(X)
= Eq(8) + IX 
E(X)]T{3
where f3 is as defined in Section 1.4. For example, if in the 11l0del (3.2.16), (3.2.17) we set q(fJ) = Ll 1 , we can find the linear Bayes estimate of Ll 1 hy using 1.4.6 and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of LI.} is
(3.2.18)
where E(Ll.I) given model
= 0,
X = (XIJ, ... ,XIJ)T, P.
= E(X)
and/l
=
ExlcExA,. For the
168
Var(X I ;)
Measures of Performance
Chapter 3
=E
Var(X,)
I B) +
Var E(XIj
I BJ = E«T~) +
(T~ + (T~,
E Cov(X,j , Xlk I B)
+ Cov(E(X'j I B), E(Xlk I BJ)
0+ Cov(1' + II'!, I' + tl.,)
= (T~ +
(T~,
Cov(tl.
Xlj)
=E
Cov(X,j , tl.,
!: ,,
I',
" From these calculations we find and Norberg (1986).
I B) + Cov(E(XIj I B), E(tl. l I 0» = 0 + (T~, =
(Tt·
f3
and OL(X). We leave the details to Problem 3.2.10.
'I
Linear Bayes procedures are useful in actuarial science, for example, Biihlmann (1970)
if
ii "
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier. the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is (the usually improper) prior 71"(0) ;,::; c. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than
modes and that again accounts in part for the popularity of maximum likelihood. An important property of the MLE is equivariance: An estimating method M producing the estimate (j M is said to be equivariant with respect to reparametrization if for every onetoDne function h from e to fl = h (e). the estimate of w '" h(B) is W = h(OM); that is, M
~
I
= h(BM ). In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss. then the Bayes procedure (}B = E(8 I X) is not equivariant
(h(O»
M
~

~
~
for nonlinear transfonnations because
E(h(O) I X)
op h(E(O I X))
1 ,
,
,,
for nonlinear h (e.g., Problem 3.2.3). The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is
i
re(a I x)
= LIB  a]'7f(O I x).
'Ee
(3.2.19)
i
If we set w = h(O) for h onetoone onto fl = heel, then w has prior >.(w) and in the w parametrization, the posterior Bayes risk is
= 7f(h
1
(w))
I
,
r,,(alx)
= L[waj2>,(wlx)
wE"
L[h(B)  a)2 7f (B I x).
(3.2.20)
'Ee
Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, r,,(a I x) op Te(h 1(a) Ix).
Section 3.2
Bayes Procedures
169
Loss functions of the form l((}, a) = Q(Po, Pa) are necessarily equivariant. The KullbackLeibler divergence K«(},a), (J,a E e, is an example of such a loss function. It satisfies Ko(w, a) = Ke(O, h 1 (a)), thus, with this loss function,
ro(a I x)
~
re(h1(a) I x).
See Problem 2.2.38. In the discrete case using K means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.l 0». In the N(O, case the K L (KullbackLeibler) loss K( a) is ~n(a  0)' (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families
"J)
e,
K(1], a) = L[ryj  ajJE1]Tj
j=1
•
+ A(1]) 
A(a).
Moreover, if we can find the KL loss Bayes estimate 1]BKL of the canonical parameter 7] and if 1] = c( 9) :  t E is onetoone, then the K L loss Bayes estimate of 9 in the
e
general exponential family is 9 BKL = C 1(1]BKd. For instance, in Example 3.2.1 where J.l is the mean of a nonnal distribution and the prior is nonnal, we found the squared error Bayes estimate jiB = wf/o + (lw)X, where 1]0 is the prior mean and w is a weight. Because the K L loss is equivalent to squared error for the canonical parameter p, then if w = h(p), WBKL = h('iiUKL), where 'iiBKL =
~
wTJo + (1  w)X.
Bayes procedures based on the KullbackLeibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume 11.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1%9; Lindley, 1965; Savage, 1954) who argue on nonnative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior 1r. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally as we consider all possible priors that the class 'Do of Bayes procedures and their limits is complete in the sense that for any 6 E V there is a 60 E V o such that R(0, 60 ) < R(0, 6) for all O. Summary. We show how Bayes procedures can be obtained for certain problems by COmputing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures require sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
170
Measures of Performance
Chapter 3
3.3
MINIMAX PROCEDURES
In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worstcase analysis; the true () is one that is as "hard" as possible. That is, 6 1 is better than 82 from a minimax point of view if sUPe R(0,6 1 ) < sUPe R(B, 82 ) and is said to be minimax if
o·
supR(O, 0')
.
= infsupR(O,o).
,
.
,
I·
Here (J and 8 are taken to range over 8 and V = {all possible decision procedures (possibly randomized)} while P = {p. : 0 E e}. It is fruitful to consider proper subclasses of V and subsets of P. but we postpone this discussion. The nature of this criterion and its relation to Bayesian optimality is clarified by considering a socalled zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set V of all randomized decision procedures whereas Nature has at her disposal all prior distributions 1r on 8. For the basic game, 5 picks 0 without N's knowledge, N picks 1f without 5's knowledge and then all is revealed and S pays N
r(Jr,o)
, ,
,
=
J
R(O,o)dJr(O)
where the notation f R(O, o)dJr(0) stands for f R(0, o)Jr(O)dII in the continuous case and L, R(Oj, o)Jr(O j) in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice 0 of 5 before picking 1r and 5 knows the rules of the game. Then
Ii ,
N naturally picks 1f,; such that
r(Jr"o)
that is, 1f,; is leastfavorable against such that
~supr(Jr,o),
o. Knowing the rules of the game S naturally picks 0*
•
(3.3.1)
r( Jr,., 0·) = sup r(Jr, 0') = inf sup r(Jr, 0).
We claim that 0* is minimax. To see this we note first that,
.
,
.
(3.3.2)
for allJr, o. On the other hand, if R(0" 0) = suP. R(0, 0), then if Jr, is point mass at 0" r( Jr" 0) = R(O" 0) and we conclude that supr(Jr,o) = sup R(O, 0)
,
•
•
(3.3.3)
I
,
•
•
Section 3.3
Minimax Procedures
171
and our claim follows. II: S is told the choke 7r of N before picking 6 and N knows the rules of the game. Then S naturally picks 6 1r such that
That is, r5 rr is a Bayes procedure for
11".
Then N should pick
7r*
such that (3.3.4)
For obvious reasons, 1f* is called a least favorable (to S) prior distribution. As we shall see by example, altbough the rigbthand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be umque. The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both
e and D are finite,
?5
11"
then:
(a)
v=supinfr(1I",6), v=infsupr(1I",6)
rr
?5
are both assumed by (say)
7r*
(least favorable), 6* minimax, respectively. Further,
") =v v=r1f,u (
.
(3.3.5)
v and v are called the lower and upper values of the basic game. When v (saY), v is called the value of the game.
=
v
=
v
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification ~ {eo} and = {eIl (Example 3.2.2) but is too reSlrictive in ilS and testing when assumption for the great majority of inference problems. A generalization due to Wald and Karlinsee Karlin (1 959)states that the conclusions of the theorem remain valid if and D are compact subsets of Euclidean spaces. There are more farreaching generalizations but, as we shall see later, without some form of compactness of and/or D, although equality of v and v holds quite generally, existence of least favorable priors and/or minimax procedures may fail.
eo
e,
e
e
The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on and D and are easy to prove.
e
Proposition 3.3.1. Suppose 6**. 7r** can be found such that
U
£** =
r ()1r.',
1f **
= 1rJ••
(3.3.6)
172
Measures of Performance
Chapter 3
that is, 0** is Bayes against 11"** and 11"** is least favorable against 0**. Then v R( 11"** ,0**). That is, 11"** is least favorable and J*'" is minimax.
To utilize this result we need a characterization of 11"8. This is given by
v
=
Proposition 3.3.2.11"8 is leastfavorable against
°iff
1r,jO: R(O,b) = supR(O',b)) = 1.
.'
(3,3,7)
That is, n a assigns probability only to points () at which the function R(·, 0) is maximal.
Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that 11"8 may not be unique. In particular, if R(O, 0) = constant. the rule has constant risk, then all 11" are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
Proof of Proposition 3.3.1. Note first that we always have
v<v
because, trivially,
i~fr(1r,b)
(3.3.8)
< r(1r,b')
(3.3,9)
for aIln, 5'. Hence,
v = sup in,fr(1r,b) < supr(1r,b')
•
(3,3.10)
•
for all 0' and v
<
infa, sUP1r 1'(11", (/) =
v. On the other hand. by hypothesis,
sup1'(1r,6**) > v.
v> inf1'(11"*'",6)
,
= 1'(1I"*",0*'") =
.
(3.3.11)
Combining (3.3.8) and (3.3.11) we conclude that
:r
v
as advertised.
= i~f 1'(11"**,0) = 1'(11"**,6**) = s~p1'(1I",0*"')
= V
(3.3.12)
" 'I
,.;
~l
o
Proofof Proposition 3.3.2. 1r is least favorable for b iff E.R(8,6) =
,.,
f r(O,b)d1r(O) =s~pr( ..,6).
•
(3.3.13)
But by (3.3.3),
supr(..,b) = supR(O, 6),
(3.3,14)
•
i ,
Because E.R(8, 6)
= sUP. R(O, b), (3.3,13) is possible iff (3.3.7) holds.
o
Putting the two propositions together we have the following.
•
Section 3.3
Minimax Procedures
173
11"*
Theorem 3.3.2. Suppose 0* has sUPo R((}, 0*) = 1" < 00. If there exists a prior that 0* is Bayes for 11"* and tr" {(} : R( (}, 0") = r} = 1, then 0'" is minimax.
such
Example 3.3.1. Minimax estimation in the Binomial Case. Suppose S has a B(n,B) distribution and X = Sjn,as in Example 3.2.3. Let I(B, a) ~ (Ba)'jB(IB),O < B < 1. For this loss function,
R(B X) = E(X _B)' , B(IB)

=
B(IB) = ~ nB(IB) n'

and X does have constant risk. Moreover, we have seen in Example 3.2.3 that X is Bayes, when 8 is U(Ol 1). By Theorem 3.3.2 we conclude that X is minimax and, by Proposition 3.3.2, the uniform distribution least favorable. For the usual quadratic loss neither of these assertions holds. The minimax estimate is
o's=S+hln = .,fii X+ I 1 () n+.,fii .,fii+l .,fii+1 2
This estimate does have constant risk and is Bayes against a (J( y'ri/2, vn/2) prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as n t 00 of the ratio of the risks of 0* and X is > 1 for every () =f ~. At B = the ratio tends to 1. Details are left to Problem 3.3.4. 0
!
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity v > 0 for n seconds or, if it is not working, does not answer. Because of the general "noise" level in space the signals received on Earth vary randomly whether the satellite is sending ~r not. The mean voltage per second of the signal for each of the n seconds is recorded. Denote the mean voltage of the signal received through the ith second less expected mean voltage due to noise by Xi. We assume that the Xi are independently and identically distributed as N(p" 0'2) where p, = v, if the satellite functions, and otherwise. The variance 0'2 of the "noise" is assumed known. Our problem is to decide whether "J.l = 0" or"p, = v." We view this as a decision problem with 1 loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk? A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability 1r to and 1  1r to v, use 0  1 loss, and set L(x, 0, v) = p( x Iv) j p(x I 0), then the Bayes test decides I' = v if
°
°°
L(x,O,v)
and decides p,
=
=
exp
°
" {
2"EXi   , a 2a
nv'} >
17l'
if L(x,O,v) <:17l' 7l'
174
Measures of Performance
Chapter 3
This test is equivalent to deciding f.t
= v (Problem 3.3.1) if. and only if,
"yn
1 ;;;:EXi>t,
T=
, , ,
where,
t
If we call this test d,r.
=
" [log " + 2 nv'] ;;;: vyn a
17r
2
'
R(O,J. )
1 ~ <I>(t)
<I>
= <1>( t)
R(v,o,)
(t  vf)
To get a minimax test we must have R(O, 61r ) = R( V, 6ft), which is equivalent to
v.,fii t = t  "'
h •
or
"
','
I ,~
~.
, ,
•
v.,fii t= . 20
•
Because this value of t corresponds to 7r = the intuitive test, which decides JJ only ifT > ~[Eo(T) + Ev(T)J, is indeed minimax.
!.
= v if and
0
'I
•
If is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem
e
3.3.2.
Theorem 3,3,3, Let 0' be a rule such that sup.R(O,o') = r < 00, let {"d denote a sequence of pn'or distributions such that 'lrk;{8 : R(B,o*) = r} = 1, and let Tk = inffJ r( ?rie, J), where r( 7fkl 0) denotes the Bayes risk wrt 'lrk;. If
Tk Task 00,
(3.3.i5)
then J* is minimax. Proof Because r( "k, 0') = r
supR(B, 0') = rk + 0(1)
•
where 0(1)
~
0 as k
~ 00.
But hy (3.3.13) for any competitor 0
supR(O,o) > E.,(R(B,o)) > rk ~supR(O,o') 0(1).
•
•
(3.3.16)
,
'.
If we let k _ suP. R(B, 0').
00
the lefthand side of (3.3.16) is unchanged, whereas the right tends to
0
j
1
•
Section 3.3
M'mimax Procedures
175
Example 3.3.3. Normal Mean. We now show that X is minimax in Example 3.2.1. Identify 1fk with theN(1Jo, 7 2 ) prior where k = 7 2 . Then
whereas the Bayes risk of the Bayes rule of Example 3.2.1 is
i~frk(J) ~ (,,'/n) +7' n
Because
(0
7
2
0
2
=
00,
n  (,,'/n) +7'
a2
1
0
2
n·
2
In)1 « (T2 In) + 7 2 )
>
0 as T 2

we can conclude that
X is minimax.
0
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose XI,'" ,Xn arei.i.d, FE:F
Then X is minimax for estimating B(F) EF(Xt} with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let 1fk be a prior distribution on :F constructed as foUows:(J)
(i)
=
".{F: VarF(XIl # M}
= O.
(ii) ".{ F : F
# N(I', M) for some I'} = O.
(iii) F is chosen by first choosing I' = 6(F) from a N(O, k) distribution and then taking F = N(6(F),M).
Evidently, the Bayes risk is now the same as in Example 3.3.3 with 0"2 evidently,
= M.
Because.
) VarF(X, ) max R(F, X = max :F :F n
Theorem 3.3.3 applies and the result follows. Minimax procedures and symmetry
M
n
o
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" 0, There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and CaseUa (1998), for instance. We shall discuss this approach somewhat, by example. in Chapters 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading. Summary. We introduce the minimax principle in the contex.t of the theory of games. Using this framework we connect minimaxity and Bayes metbods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
.4 UNBIASED ESTIMATION AND RISK INEQUALITIES 3.6.L and (72 when XI. for instance. Survey Sampling In the previous two sections we have considered two decision theoretic optimality principles. D. This notion has intuitive appeal..2. Von Neumann's Theorem states that if e and D are both finite. The lower (upper) value v(v) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk. which can't be beat for 8 = 80 but can obviously be arbitrarily terrible." R(0. or more generally with constant risk over the support of some prior. symmetry. in many cases. . An alternative approach is to specify a proper subclass of procedures. (72) = . for example. This approach has early on been applied to parametric families Va.1 . 176 Measures of Performance Chapter 3 . This approach coupled with the principle of unbiasedness we now introduce leads to the famous GaussMarkov theorem proved in Section 6. A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called leas! favorable. if Y is postulated as following a linear regression model with E(Y) = zT (3 as in Section 2. then in estimating a linear function of the /3J. II . in Section 1. v equals the Bayes risk of the Bayes rule 8* for the prior 1r*... When Do is the class of linear procedures and I is quadratic Joss. Moreover. The most famous unbiased estimates are the familiar estimates of f. In the nonBayesian framework. according to these criteria.I .2. on other grounds. 5) > R(0. 3. we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule 8 and N selects a prior 1['.3. then the game of S versus N has a value 11.. E Do that minimizes the Bayes risk with respect to a prior 1r among all J E Do. for which it is possible to characterize and. We show that Bayes rules with constant risk. looking for the procedure 0. there is a least favorable prior 1[" and a minimax rule 8* such that J* is the Bayes rule for n* and 1r* maximizes the Bayes risk of J* over all priors.• •••. N (Il. estimates that ignore the data. the solution is given in Section 3. all 5 E Va.1. 5') for aile. and so on. When v = ii. the game is said to have a value v. 0'5) model. I X n are d. computational ease. An estimate such that Biase (8) 0 is called unbiased. ruling out. it is natural to consider the computationally simple class of linear estimates. are minimax. the notion of bias of an estimate O(X) of a parameter q(O) in a model P {Po: 0 E e} as = Biaso(5) = Eo5(X)  q(O). Do C D. We introduced. . Bayes and minimaxity. S(Y) = L:~ 1 diYi .d. This result is extended to rules that are limits of Bayes rules with constant risk and we use it to show that x is a minimax rule for squared error loss in the N (0 . compute procedures (in particular estimates) that are best in the class of all procedures. we can also take this point of view with humbler aims. such as 5(X) = q(Oo).4. I I .1 Unbiased Estimation. Obviously. and then see if within the Do we can find 8* E Do that is best according to the "gold standard. Ii I More specifically.
.8) Jt =X .4. to determine the average value of a variable (say) monthly family income during a time between two censuses and suppose that we have available a list of families in the unit with family incomes at the last census.Section 3. ~ N L....4 Unbiased Estimation and Risk Inequalities 177 given by (see Example 1. (3. . If . This leads to the model with x = (Xl..14) and has = iv L:fl (3.. a census unit. We want to estimate the parameter X = Xj. (3.il) Clearly for each b.3 and Problem 1.3. . .. . . Unbiased Estimates in Survey Sampling. .6) where b is a prespecified positive constant.3) = 0 otherwise.2. UMVU (uniformly minimum variance unbiased).. Write Xl. UN for the known last census incomes.4... . X N ).. and u =X  b(U . is to estimate by a regression estimate ~ XR Xi. As we shall see shortly for X and in Volume 2 for .4.4. Suppose we wish to sample from a finite population. . Ui is the last census income corresponding to = it E~ 1 'Ui.. We let Xl.2) 1 Because for unbiased estimates mean square error and variance coincide we call an unbiased estimate O*(X) of q(O) that has minimum MSE among all unbiased estimat~s for all 0. Example 3. . ...(Xi 1=1 I" N ~2 x) .4) where 2 CT.4.···. [J = ~ I:~ 1 Ui· XR is also unbiased.< 2 ~ ~ (3. UN) and (Xl.. . Unbiased estimates playa particularly important role in survey sampling.1.. X N for the unknown current family incomes and correspondingly UI. . .4. It is easy to see that the natural estimate X ~ L:~ 1 Xi is unbiased (Problem 3. reflecting the probable correlation between ('ttl.Xn denote the incomes of a sample of n families drawn at random without replacement. (3. these are both UMVU. UN' One way to do this.5) This method of sampling does not use the information contained in 'ttl. .1 ) ~ 1 nl L (Xi ~ ooc n ~ X) 2 .an}C{XI.4....XN} 1 (3.3. (~) If{aj. We ignore difficulties such as families moving. X N ) as parameter .4.' . for instance.
estimate than X and the best choice of b is bopt The value of bopt is unknown but can be estimated by = The resulting estimate is no longer unbiased but behaves well for large samplessee Problem 5.11. 178 Measures of Performance Chapter 3 i .3.... The result is a sample S = {Xl. X M } of random size M such that E(M) = n (Problem 3.4. . Ifthc 'lrj are not all equal. .4. :I .3.19). Xl!Var(U).I. . Specifically let 0 < 7fl.18. To see this write ~ 1 ' " Xj XHT ~ N LJ 7'" l(Xj E S). The HorvitzThompson estimate then stays unbiased. the unbiasedness principle has largely fallen out of favor for a number of reasons. It is possible to avoid the undesirable random sample size of these schemes and yet have specified 1rj.4). II it >: . They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later..4.'I .i . This makes it more likely for big incomes to be induded and is intuitively desirable. .. An alternative approach to using the Uj is to not sample all units with the same prob ability. for instance. 0 Discussion. ~ .bey the attractive equivariance property. . N toss a coin with probability 7fj of landing heads and select Xj if the coin lands heads. X is not unbiased but the following estimate known as the HorvitzThompson estimate is: Ef . . j=1 3 N ! I: .15).j .1 the correlation of Ui and Xi is positive and b < 2Cov(U. . q(e) is biased for q(e) unless q is linear. . outside of sampling.7) II ! where Ji is defined by Xi = xJ.. ' ..4. If B is unbiased for e. this will be a beller cov(U. I j . M (3. . (iii) Unbiased_estimates do not <:.4. (i) Typically unbiased estimates do not existsee Bickel and Lehmann (1969) and Problem 3. A natural choice of 1rj is ~n.Xi XHT=LJN i=l 7fJ.. However. For each unit 1. . . I . . 1.X)!Var(U) (Problem 3.. ".. (ii) Bayes estimates are necessarily biasedsee Problem 3. Unbiasedness is also used in stratified sampling theory (see Problem 1.. 111" N < 1 with 1 1fj = n. I II.2Gand minimax estimates often are.~'! I . Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems.j' I ': Because 1rj = P[Xj E Sj by construction unbiasedness follows. " 1 ".
2 The Information Inequality The oneparameter case We will develop a lower bound for the variance of a statistic. What is needed are simple sufficient conditions on p(x.8) is finite. OJ exists and is finite.   . Some classical conditiolls may be found in Apostol (1974). as we shall see in Chapters 5 and 6. (II) 1fT is any statistic such that E 8 (ITf) < 00 for all 0 E then the operations of integration and differentiation by B can be interchanged in J T( x )p(x.OJdx (3.. For instance. has some decision theoretic applications. unbiased estimates are still in favor when it comes to estimating residual variances. B)dx. This preference of 8 2 over the MLE 52 = iT e/n is in accord with optimal behavior when both the number of observations and number of parameters are large. in the linear regression model Y = ZDf3 + e of Section 2. Assumption II is practically useless as written. for integration over Rq. The arguments will be based on asymptotic versions of the important inequalities in the next subsection. Finally.4 Unbiased Estimation and Risk Inequalities 179 Nevertheless.8) whenever the righthand side of (3. 3.8) is assumed to hold if T(xJ = I for all x. We expect that jBiase(Bn)I/Vart (B n ) . p. and appears in the asymptotic optimality theory of Section 5. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous~case theorems given later. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. We suppose throughout that we have a regular parametric model and further that is an open subset of the line.O) > O} docs not depend on O. B) is a density.4. and we can interchange differentiation and integration in p(x. B)dx.Section 3.4.ZDf3). 0 or equivalently Vare(Bn)/MSEe(Bn) t 1 as n t 00. That is. Note that in particular (3.9.OJdx] = J T(xJ%oP(x. e.p) where i = (Y . %0 [j T(X)P(x.4. Simpler assumptions can be fonnulated using Lebesgue integration theory.( lTD < 00 J .4. (I) The set A ~ {x : p(x. From this point on we will suppose p(x. f3 is the least squares estimate. B) for II to hold. 167. For all x E A. which can be used to show that an estimate is UMVU.4. The lower bound is interesting in its own right. suppose I holds. Then 11 holdS provided that for all T such that E. good estimates in large samples ~ 1 ~ are approximately unbiased.4. We make two regularity assumptions on the family {Pe : BEe}. e e. the variance a 2 = Var(ei) is estimated by the unbiased estimate 8 2 = iTe/(n .2. and p is the number of coefficients in f3. For instance.O E 81iJO log p( x. See Problem 3.
B(O)} is an exponential family and TJ(B) has a nonvanishing continuous derivative on 8.0 Xi) 1 = nO =  0' n O· o . 00.6. O)dx ~ :0 J p(x.4.O)] dx " I'  are continuous functions(3) of O. (3. Then Xi . /fp(x. (3.i J T(x) [:OP(X.11 ) . Ii ! I h(x) exp{ ~(O)T(x) . .10) and. 0) ~ 1(0) = E.&0 '0 {[:oP(X.1.O)). 1 X n is a sample from a Poisson PCB) population. thus.4. It is not hard to check (using Laplace transform theory) that a oneparameter exponential family quite generally satisfies Assumptions I and II. (J2) population. suppose XI. Then (see Table 1.4. O)dx :Op(x. Suppose Xl. 0))' = J(:Ologp(x. the integrals .I [I ·1 . (~ logp(X. O)dx. which is denoted by I(B) and given by Proposition 3.O)} p(x. then I and II hold. O)dx = O.9) Note that 0 < 1(0) < Lemma 3.4. For instance.180 for all Measures of Performance Chapter 3 e. . 0))' p(x.4. 1 and 11 are satisfied for samples from gamma and beta distributions with one parameter fixed..O)] dxand J T(x) [:oP(X.. where a 2 is known.  Proof.1) ~(O) = 0la' and 1 and 11 are satisfied. Then (3.O)]j P(x. I' ..4. Suppose that 1 and II hold and that E &Ologp(X.2. the Fisher information number. o Example 3.n and 1(0) = Var (~n_l . I 1(0) 1 = Var (~ !Ogp(X. 1 X n is a sample from a N(B. If I holds it is possible to define an important characteristic of the family {Po}.O) & < 00.. t. I J J & logp(x 0 ) = ~n 1 . Similarly. .1.
l4) and Lemma 3. 0 E e. 1/. Suppose that I and II hold and 0 < 1(0) < 00.O))P(x.4. e)) = I(e). by Lemma 3. e) and T(X).4.' (e) = Cov (:e log p(X. CoroUary 3.4. then I(e) = nI.(T(X) > I(O) 1 (3. > [1/. Var (t.13) By (A. Here's another important special case. Suppose that X = (XI.(T(X)) by 1/.(0) is differentiable and Var (T(X)) . 0 (3.17) . (Information Inequality).4. Suppose the conditions of Theorem 3.4.4 Unbiased Estimation and Risk Inequalities 181 Here is the main result of this section.O)dX.1 hold. nI. ej. 1/.4.4.4. (3.16) The number 1/I(B) is often referred to as the information or CramerRoo lower bound for the variance of an unbiased estimate of 'tjJ(B).(T(X)) < 00 for all O.Xu) is a sample from a population with density f(x.1.4.O).O)dX= J T(x) UOIOgp(x.I1.'(0)]'  I(e) .I 1.'(0) = J T(X):Op(x.(0).4.15) The theorem follows because.1. (3. (3. 0 The lower bound given in the information inequality depends on T(X) through 1/. .Section 3. We get (3. Then Var. Using I and II we obtain. Let [.(T(X)) > [1/. 10gp(X.'(~)I)'.4. .4. Proposition 3.1. If we consider the class of unbiased estimates of q(B) = B.1 hold and T is an unbiased estimate ofB. (e) and Var. Then for all 0.12) Proof.(0) = E(feiogf(X"e»)'.14) Now let ns apply the correlation (CauchySchwarz) inequality (A. Let T(X) be all)' statistic such that Var.16) to the random variables 8/8010gp(X. 1/.2.(0). Denote E..4. we obtain a universal lower bound given by the following.1. and that the conditions of Theorem 3. T(X)) .. Theorem 3.
18) follows.2.2..19) Ii i . the MLE is B ~ X..6. This is no accident. By Corollary 3.4. Suppose that the family {p.[T'(X)] = [. (Br Now Var(X) ~ a'ln.(B) such that Var. whereas if 'P denotes the N(O.4. Then {P9} is a oneparameter exponential family with density or frequency function af the fonn II p(x..182 Measures of Performance Chapter 3 Proof. .18) 1 a ' . Because X is unbiased and Var( X) ~ Bin.. we have in fact proved that X is UMVU even if a 2 is unknown.1 and I(B) = Var [:B !ogp(X.. o II (B) is often referred to as the information contained in one observation.4. If the family {Pe} satisfies I and II and if there exists an unbiased estimate T* of 1/.4. then T' is UMVU as an estimate of 1.Xn are the indicators of n Bernoulli trials with probability of success B. Example 3. which achieves the lower bound ofTheorem 3.B)] Var " i) ~ 8B iogj(Xi.B)] = nh(B).(8) has a continuous nonvanishing derivative on e. . Note that because X is UMVU whatever may be a 2 . These are situations in which X follows a oneparameter exponential family. Suppose Xl. Next we note how we can apply the information inequality to the problem of unbiased estimation. X n is a sample from a nonnal distribution with unknown mean B and known variance a 2 . (3. I I "I and (3.p'(B)]'ll(B) for all BEe.1) with natural sufficient statistic T( X) and . Example 3.j.3. This is a consequence of Lemma 3. (Continued).4.B(B)J.4..4.j.4. .(B). the conditions of the information inequality are satisfied. 0 We can similarly show (Problem 3. Theorem 3. then T(X) achieves the infonnation inequality bound and is a UMVU estimate of E. if {P9} is a oneparameter exponentialfamily of the form (1. then  1 (3.. We have just shown that the information [(B) in a sample of size n is nh (8). Conversely.1) that if Xl. 1) density.4.1 for every B. As we previously remarked.B) = h(x)exp[ry(B)T'(x) .B) ] [ ~var [%0 ]Ogj(Xi.(T(X». : BEe} satisfies assumptions I and II and there exists an unbiased estimate T* of1.1 we see that the conclusion that X is UMVU follows if ~  Var(X) ~ nI.4. . then X is a UMVU estimate of B. i g . For a sample from a P(B) distribution. then X is UMVU. .
4. Suppose without loss of generality that T(xI) of T(X2) for Xl. " and both sides are continuous in B.4.(B)T' (x) +a2(B) for all BEe}. the information bound is [A"(B))2/A"(B) A"(B) Vare(T(X») so that T(X) achieves the information bound as an estimate of EeT(X). + n2) 10gB + (2n3 + n3) Iog(1 . However.19).16) we know that T'" achieves the lower bound for all (J if. .3) that we have the canonical case with ~(B) = Band B( Bj = A(B) = logf h(x)exp{BT(x))dx.Section 3. If A. we see that all a2 are linear combinations of {} log p( Xj. then (3.22) for j = 1.6).4.ll.. By (3.23) for all B1 .20) guarantees Pe(Ae) ~ 1 and assumption 1 guarantees P. j hence. B) = 2"' exp{(2nj 2"' exp{(2n.25) But .6.10g(1 . Thus. continuous in B.14) and the conditions for equality in the correlation inequality (A. (3. there exist functions al ((J) and a2 (B) such that :B 10gp(X.1) we assume without loss of generality (Problem 3. and only if.4.4. B) = aj(B)T'(x) + a2(B) (3.20) with respect to B we get (3.4. In the HardyWeinberg model of Examples 2. Here is the argument. 0 Example 3. be a denumerable dense subset of 8. denotes the set of x for which (3.2. From this equality of random variables we shall show that Pe[X E A'I = 1 for all B where A' = {x: :Blogp(x. B) IdB. thus.20) with Pg probability 1 for each B.4. it is necessary. 2 A ** = A'" and the result follows.(Ae) = 1 for all B' (Problem 3. p(x.B) so that = T(X) . The passage from (3.4.4.24) [(B) = Vare(T(X) . J 2 Pe. X2 E A"''''. B .20) to (3. Note that if AU = nmAem .(A") = 1 for all B'. :e Iogp(x.4.1.4.4.4. By solving for a" a2 in (3.4.4.2..6.4.2 and. Our argument is essentially that of Wijsman (1973).20) hold.A (B) .4 Unbiased Estimation and Risk Inequalities 183 Proof We start with the first assertion.p(B) = A'(B) and.4 and 2. But now if x is such that = 1. (B)T'(X) + a2(B) (3.B)) + nz)[log B . B .. Conversely in the exponential family case (1.B) = a.4.A'(B» = VareT(X) = A"(B). Then ~ ( 80 logp X. then (3.21 ) Upon integrating both sides of (3. (3.19) is highly technical. (3.B) I + 2nlog(1  BJ} .23) must hold for all B.. B) = a.4. Let B .
. B) aB' p(x. Because this is an exponential family. See Volume II. although UMVU estimates exist.B) is twice differentiable and interchange between integration and differentiation is pennitted. I I' f . =e'E(~Xi) = .p' (B) I'( I (B). in many situations. Suppose p('. for instance. '.2 suggests.ed)' In particular. that I and II fail to hold. . () (ell" .2. • The multiparameter case We will extend the information lower bound to the case of several parameters.4.25).26) Proposition 3.6.4. A third metho:! would be to use Var(B) = I(I(B) and formula (3.4. B) 2 1 a = p(x. The variance of B can be computed directly using the moments of the multinomial distribution of (NI • N z . Na). B)  2 (a log p(x. " 'oi . or by transforming p(x.B) ~ A "( B).2 implies that T = (2N 1 + N z )/2n is UMVU for estimating E(T) ~ (2n)1[2nB' + 2nB(1 . Extensions to models in which B is multidimensional are considered next.25) we obtain a' 10gp(X.B)(2n. Proof We need only check that I a aB' log p(x. ~ ~ This T coincides with the MLE () of Example 2. By (3. B) example. in the U(O. we have iJ' aB' logp (X.27) and integrate both sides with respect to p(x.184 Measures of Performance Chapter 3 where we have used the identity (2nl + HZ) + (2n3 + nz) = 211.4. Sharpenings of the information inequality are available but don't help in general. B) satisfies in addition to I and II: p(·. 0 ~ Note that by differentiating (3.4. Theorem 3.4. Then (3.2. but the variance of the best estimate is not equal to the bound [.B)] and then using Theorem 1.6.. I(B) = Eo aB' It turns out that this identity also holds outside exponential families. as Theorem 3.2.4. B) )' aB (3. . assumptions I and II are satisfied and UMVU estimates of 1/.7) Var(B) ~ B(I .4. . We find (Problem 3.. . It often happens.4.(e) exist. (Continued).24)..B)) which equals I(B).4.B)] ~ B.26) holds.3.'" .. For a sample from a P( B) distribution o EB(::. we will find a lower bound on the variance of an estimator .4. Even worse. 0 0 . e). DiscRSsion. (3. B) to canonical form by setting t) = 10g[B((1 . Example 3. B). .IOgp(X.
0 = (I".Xnf' has information matrix 1(0) = EO (aO:.. iJO logp(X. j = 1. 0) j k That is.. in addition. . Under the conditions in the opening paragraph. (0) iJiJ lt2(0) ~ E [aer 2 iJl" logp(x. then X = (Xl.? ologp(X. 0).4.O») .30) iJ Ijd O) = Covo ( eO logp(X. . We assume that e is an open subset of Rd and that {p(x.Xn are U.2 ] ~ er..4. 0). .4. B) : 0 E 8} is a regular parametric model with conditions I and II satisfied when differentiation is with respect OJ.4 /2.0) = er.O) ] iJ2 ] l22(0) = E [ (iJer2)21ogp(x. Then 1 = log(211') 2 1 2 1 2 loger . . = Var('. . 0)] = E[er.29) Proposition 3. .4.4 E(x 1") = 0 ~ I. The (Fisher) information matrix is defined as (3. er 2 ).( x 1") 2 2a 2 logp(x 0) . Let p( x..d.4. 1 <k< d. d.. 1 <j< d. (c) If..4. .4 Unbiased Estimation and Risk Inequalities 185 of fh when the parameters 82 . 6) denote the density or frequency function of X where X E X C Rq..Bd are unknown. (a) B=T 1 (3.4.2 = er. er 2). III (0) = E [~:2 logp(x.Section 3. The arguments follow the d Example 3.4. .28) where (3.Ok IOgp(X. .31) and 1(0) (b) If Xl.5.. as X. Suppose X = 1 case and are left to the problems. a ).. (3. pC B) is twice differentiable and double integration and differentiation under the integral sign can be interchanged. nh (O) where h is the information matrix ofX..32) Proof. (3. ~ N(I".
.6 hold. A(O). Here are some consequences of this result.36) Proof.(0) exists and (3..4. By (3. . UMVU Estimates in Canonical Exponential Families.O) = T(X) . II are easily checked and because VOlogp(x.A(O)}h(x) (3.4.4/2 .2 ( 0 0 ) . that is. Then VarO(T(X» > Lz~ [1(0) Lzy (3. = .(0) = EOT(X) and let ". [(0) = VarOT(X) = (3. 0) . (3. Z = V 0 logp(x. ! p(x. 1/. Let 1/.4.'(0) = V1/.O») = VOEO(T(X» and the last equality follows 0 from the argument in (3..38) where Lzy = EO(TVOlogp(X.6.3.4.30) and Corollary 1.33) o Example 3.34) () E e open. We will use the prediction inequality Var(Y) > Var(I'L(Z». Suppose k . We claim that each of Tj(X) is a UMVU . .(O). The conditions I. in this case [(0) ! . Suppose the conditions of Example 3.6.A.4.13). J)d assumed unknown.4..4. Canonical kParameter Exponential Family.!pening paragraph hold and suppose that the matrix [(0) is nonsingular Then/or all 0.( 0) be the d x 1 vector of partial derivatives.. (continued). . Then Theorem 3.186 Measures of Performance Chapter 3 Thus.' = explLTj(x)9j j=1 .37) = T(X).1. Assume the conditions of the .4.6.4. 0).4. where I'L(Z) denotes the optimal MSPE linear predictor of Y.4.. then [(0) = VarOT(X). I'L(Z) = I'Y + (Z l'z)TLz~ Lzy· Now set Y (3.4. Example 3.35) o ~ Next suppose 8 1 = T is an estimate of (Jl with 82.
n T.6. are known. kI A(O) ~ 1£ log Note that 1+ Le j=l Oj 80 A (0) J 8 = 1 "k 11 ne~ I 0 +LJl=le l = 1£>'.)/1£.A.. We have already computed in Proposition 3. without loss of generality.A(O) is just VarOT! (X). the lower bound on the variance of an unbiased estimator of 1/Jj(O) = E(n1Tj(X) = >. (1 + E7 / eel) = n>'j(1.p(0)11(0) ~ (1. .. AI. j. Thus. then Nj/n is UMVU for >'j. . Ak) to the canonical form p(x. .0.A(O)} whereTT(x) = (T!(x). This is a different claim than TJ(X) is UMVU for EOTj(X) if Oi.. we let j = 1.Tk_I(X))..(1.p(0) is the first row of 1(0) and. a'i X and Aj = P(X = j). k.O) = exp{TT(x)O . . .>'. .4 8'A rl(O)= ( 8080 t )1 kx k .A(0) j ne ej = (1 + E7 eel _ eOj ) . .Xn)T. . In the multinomial Example 1. But f..3. .d.40) J We claim that in this case . j = 1.41) because ..4...4. .i. .) = Var(Tj(X».4 Unbiased Estimation and Risk Inequalities 187 estimate of EOTj(X).4.6 with Xl.39) where. (3...(X)) 82 80.. hence. Example 3.j. by Theorem 3. . " Ok_il T.. = nE(T. is >.4. 0 = (0 1 . .T(O) = ~:~ I (3. i =j:..4. .(X) = I)IXi = j].0). we transformed the multinomial model M(n. .7. Multinomial Trials. To see our claim note that in our case (3... X n i..p(OJrI(O).>'j)/n. j=1 X = (XI. .. But because Nj/n is unbiased and has 0 variance >'j(1 .4.Section 3.
. . The Normal Case. • Theorem 3.8. and robustness to model departures.42) where A > B means aT(A .21. We establish analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal. 3." . features other than the risk function are also of importance in selection of a procedure. X n are i. Asymptotic analogues oftbese inequalities are sharp and lead to the notion and construction of efficient estimates. 188 MeasureS of Performance Chapter 3 j Example 3.3 whose proof is left to Problem 3..4.3 are 0 explored in the problems. T(X) is the UMVU estimate of its expectation."xl' Note that both sides of (3. 1 j Here is an important extension of Theorem 3. The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure. reasonable estimates are asymptotically unbiased. Summary. . If Xl.2 + (J2. These and other examples and the implications of Theorem 3..1 Computation Speed of computation and numerical stability issues have been discussed briefly in Section 2. .4.4.4..i.42) are d x d matrices. (72) then X is UMVU for J1 and ~ is UMVU for p.3 hold and • is a ddimensional statistic.. We study the important application of the unbiasedness principle in survey sampling.d.4.4. Suppose that the conditions afTheorem 3.5. " il 'I. We derive the information inequa!ity in oneparameter models and show how it can be used to establish that in a canonical exponential family. Also note that ~ o unbiased '* VarOO > r'(O). .4. even if the loss function and model are well specified. LX.B)a > Ofor all . . Let 1/J(O) ~EO(T(X))dXl and "'(0) = (1/Jl(O). Then J (3. .5 NON DECISION THEORETIC CRITERIA In practice. Using inequalities from prediction theory.4. ~ In Chapters 5 and 6 we show that in smoothly parametrized models. interpretability of the procedure. N(IL. 3.X)2 is UMVU for 0 2 .4. we show how the infomlation inequality can be extended to the multiparameter case. But it does not follow that n~1 L:(Xi .pd(OW = (~(O)) dxd . They are dealt with extensively 'in books on numerical analysis such as Dahlquist.4..
3 and 3.3. variance and computation As we have seen in special cases in Examples 3.2 is given by where 0. say. The improvement in speed may however be spurious since AI is costly to compute if d is largethough the same trick as in computing least squares estimates can be used.AI (flU 1) (T(X) .2). It is in fact faster and better to Y. in the algOrithm we discuss in Section 2.2 Interpretability Suppose that iu the normal N(Il.5 Nondecision Theoretic Criteria 189 Bjork. Then least squares estimates are given in closed fonn by equ<\tion (2. the signaltonoise ratio. the NewtonRaphson method in ~ ~(J) which the jth iterate.1 / 2 . for this population of measurements has a clear interpretation. with ever faster computers a difference at this level is irrelevant. < 1 then J is of the order of log ~ (Problem 3.2. Gaussian elimination for the particular z'b Faster versus slower algorithms Consider estimation of the MLE 8 in a general canonical exponential family as in Section 2. solve equation (2.10). It may be shown that. But it reappears when the data sets are big and the number of parameters large.5. For instance. The interplay between estimated. It follows that striving for numerical accuracy of ord~r smaller than n. estimates of parameters based on samples of size n have standard deviations of order n. consider the Gaussian linear model of Example 2.1.A(BU ~l»)).2.4.1. fI(j) = iPl) . Its maximum likelihood estimate X/a continues to have the same intuitive interpretation as an estimate of /1.5. a method of moments estimate of (A.2 is the empirical variance (Problem 2.5 we are interested in the parameter III (7 • • This parameter.1. Of course.3.11).Section 3.4.1 / 2 is wasteful.5. (72) Example 2. We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory. if we seek to take enough steps J so that III III < . Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the COnstants involved. at least if started close enough to 8.P) in Example 2.1. takes on the order of log log ~ steps (Problem 3. It is clearly easier to compute than the MLE.4. The closed fonn here is deceptive because inversion of a d x d matrix takes on the order of d 3 operations when done in the usual way and can be numerically unstable./ a even if the data are a sample from a distribution with .9) by. Closed form versus iteratively computed estimates At one level closed form is clearly preferable. On the other hand. On the other hand.1). and Anderson (1974).4. 3.
would be an adequate approximation to the distribution of X* (i. We consider three situations (a) The problem dictates the parameter. they may be interested in total consumption of a commodity such as coffee. amoog others. but there are a few • • = .e.1. E p. the HardyWeinberg parameter () has a clear biological interpretation and is the parameter for the experiment described in Example 2. Then E(X)/lVar(X) ~ (p/>. We return to this in Section 5. but there are several parameters that satisfy this qualitative notion.')1/2 = l. 9(Pl. 1976) and Doksum (1975). However.3 Robustness Finally. E P*). v is any value such that P(X < v) > ~. say () = N p" where N is the population size and tt is the expected consumption of a randomly drawn individuaL (b) We imagine that the random variable X* produced by the random experiment we are interested in has a distribution that follows a ''true'' parametric model with an interpretable parameter B. For instance..5. To be a bit formal. Alternatively. . economists often work with median housing prices. This is an issue easy to point to in practice but remarkably difficult to formalize appropriately.)(p/>. and both the mean tt and median v qualify. X n ) where most of the Xi = X:. Similarly. On the other hand.5. For instance. that is. P(X > v) > ~).. (c) We have a qualitative idea of what the parameter is.I1'. See Problem 3. we observe not X· but X = (XI •. However. We can now use the MLE iF2. the parameter v that has half of the population prices on either side (fonnally. The actual observation X is X· contaminated with "gross errOfs"see the following discussion.4) is for n large a more precise estimate than X/a if this model is correct. However..190 Measures of Performance Chapter 3 mean Jl and variance 0'2 other than the normal. 3.4. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied..13. 1975b. if gross errors occur.. we may be interested in the center of a population. suppose we initially postulate a model in which the data are a sample from a gamma. the form of this estimate is complex and if the model is incorrect it no longer is an appropriate estimate of E( X) / [Var( X)] 1/2. 1 X~) could be taken without gross errors then p.\)' distribution. suppose that if n measurements X· (Xi. But () is still the target in which we are interested. which as we shall see later (Section 5.5. we could suppose X* rv p. but we do not necessarily observe X*. We will consider situations (b) and (c). what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). Gross error models Most measurement and recording processes are subject to gross errors. anomalous values that arise because of human error (often in recording) or instrument malfunction. we turn to robustness. . This idea has been developed by Bickel and Lehmann (l975a.
Again informally we shall call such procedures robust. h = ~(7 <p (:(7) where K ::» 1 or more generally that h is an unknown density symmetric about O.j) E P. in situation (c).f.. Example 1. .. If the error distribution is normal. it is possible to have PC""f.Xn are i. Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of X· . and symmetric about 0 with common density f and d. where f satisfies (3. Xi Xt with probability 1 . That is. However. Now suppose we want to estimate B(P*) and use B(X 1 . so it is unclear what we are estimating. two notions.i. The advantage of this formulation is that jJ. X n ) will continue to be a good or at least reasonable estimate if its value is not greatly affected by the Xi I Xt.1') and (Xt.d. we encounter one of the basic difficulties in formulating robustness in situation (b).j = PC""f. . We return to these issues in Chapter 6.5..remains identifiable. make sense for fixed n. X n ) knowing that B(Xi. Then the gross error model issemiparametric. p("..5. the gross errors.2). However.\ is the probability of making a gross error.. F.l. = {f (. where Y.5.2) Here h is the density of the gross errors and .h(x). . Informally B(X1 . Further assumptions that are commonly made are that h has a particular fonn. X~) is a good estimate. (3. with common density f(x . Consider the onesample symmetric location model P defined by ~ ~ i=l""l n . Formal definitions require model specification.5.5 Nondecision Theoretic Criteria ~ 191 wild values.1'). .A Y.5.) for 1'1 of 1'2 (Problem 3.)~ 'P C) + .) are ij.l. and then more generally. with probability . the sensitivity curve and the breakdown point.is not a parameter. we do not need the symmetry assumption. but with common distribution function F and density f of the form f(x) ~ (1 .1).d. Unfortunately. .2. and definitions of insensitivity to gross errors.Section 3. A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the Ci still i. .1') : f satisfies (3. specification of the gross error mechanism. We next define and examine the sensitivity curve in the context of the Gaussian location model. The breakdown point will be discussed in Volume II. Y. it is the center of symmetry of p(Po. (3.J) for all such P. ISi'1 or 1'2 our goal? On the other hand.l. P. In our new formulation it is the xt that obey (3. has density h(y . identically distributed. This corresponds to. •.. the assumption that h is itself symmetric about 0 seems patently untenable for gross errors. Without h symmetric the quantity jJ.2) for some h such that h(x) = h(x) for all x}. That is.1. iff X" . X is the best estimate in a variety of senses. for example.5..d.18).1 ) where the errors are independent. if we drop the symmetry assumption.i. . Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6. .
~ ) SC( X. (2.l. 1') = 0. We start by defining the sensitivity curve for general plugin estimates. . . See (2..5.1." . the sample mean is arbitrarily sensitive to gross errOfa large gross error can throw the mean off entirely. .L = E(X).1 for all p(/l.32.X. . (i) It is the empirical plugin estimate of the population median v (Problem 3.1. (}(PU. Xnl as an "ideal" sample of size n .L.od Problem 2. in particular. that is... x) .Xn ? An interesting way of studying this due to Tukey (1972) and Hampel (1974) is the sensitivity curve defined as follows for plugin estimates (which are well defined for all sample sizes n).. X n ordered from smallest to largest. . This is equivalent to shifting the Be vertically to make its value at x = 0 equal to zero.. .. X n ) . We are interested in the shape of the sensitivity curve. X n ) = B(F).Xnl so that their mean has the ideal value zero..2. has the plugin property.1 for which the estimator () gives us the right value of the parameter and then we see what the introduction of a potentially deviant nth observation X does to the value of ~ We return to the location problem with e equal to the mean J. . . . Now fix Xl!'" ...16). (}(X 1 . . Suppose that X ~ P and that 0 ~ O(P) is a parameter.O(Xl' . X" 1'). not its location.1. SC(x.··. The sample median can be motivated as an estimate of location on various grounds. is appropriate for the symmetric location model. .L = (}(X l 1'. . At this point we ask: Suppose that an estimate T(X 1 . Are there estimates that are less sensitive? A classical estimate of location based on the order statistics is the sample median X defined by ~ ~ X X(k+l) !(X(k) ifn~2k+1 + X(k+l)) ifn = 2k where X (I)' .. P.j)) = J.. .. X"_l )J. The empirical plugin estimate of 0 is 0 = O(P) where P is the empirical probability distribution.17). therefore. . Often this is done by fixing Xl. 0) = n[O(xl. shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas..2.192 The sensitivity curve Measures of Performance Chapter 3 . See Problem 3. X(n) are the order statistics.. . I . Because the estimators we consider are location invariant.. In our examples we shall. 1 Xnl represents an observed sample of size nl from P and X represents an observation that (potentially) comes from a distribution different from P.f) E P.J. . The sensitivity curoe of () is defined as ~ ~ ~ ~ ~ ~ ~ ~ ~ . X =n (Xl+ . . Thus. and it splits the sample into two equal halves. Then ~ ~ O. where F is the empirical d. where Xl.5.od because E(X.I.4). . we take I' ~ 0 without loss of generality.14. ... How sensitive is it to the presence of gross errors among XI.. +X"_I+X) n = x. See Section 2. that is. Xl. . . • . ..
SC(x) SC(x) x x Figure 3.5.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when X is near Ji. its perfonnance at the nonnal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of X.5. SC(x. x is an empirical (iii) The sample median is the MLE when we assume the common density f(x) of the errors {cd in (3. n = 2k + 1 is odd and the median of Xl.. say. v coincides with fL and plugin estimate of fL. A class of estimates providing such intermediate behavior and including both the mean and . < X(nl) are the ordered XI. .Xn_l = (X(k) + X(ktl)/2 = 0. we obtain .5.32 and 3.5.1)..5. . The sensitivity curve in Figure 3. The sensitivity curves of the mean and median. 27 a density having substantially heavier tails than the normaL See Problems 2. Although the median behaves well when gross errors are expected. .. x) nx(k) nx = _nx(k+l) for for < x(k) for x(k) < x x <x(k+I) nx(k+l) x> x(k+l) where xCI) < ..9.5 Nondecision Theoretic Criteria 193 (ii) In the symmetric location model (3. XnI.1) is the Laplace (double exponential) density f(x) = 1 exp{l x l/7}.Section 3..1. The sensitivity curve of the median is as follows: If.2..
The range 0. There has also been some research into procedures for which a is chosen using the observations." see Jaeckel (1971). then the trimmed means for a > 0 and even the median can be much better than the mean.3) Xu = X([no]+l) + . i I . If [na] = [(n . Xa  X. The estimates can be justified on plugin grounds (see Problem 3. the Laplace density..5.2[nnl where [na] is the largest integer < nO' and X(1) < . infinitely better in the case of the Cauchysee Problem 5.) SC(x) X x(n[naJ) ..l)Q] and the trimmed mean of Xl. We define the a (3. Intuitively we expect that if there are no gross errors. l Xnl is zero. 4e'x'. 4. whereas as Q i ~.4.1. Haber. Figure 3. . If f is symmetric about 0 but has "heavier tails" (see Problem 3.5). For instance. the sensitivity curve calculation points to an equally intuitive conclusion. For more sophisticated arguments see Huber (1981). That is.5. + X(n[nuj) n . See Andrews.5. . Note that if Q = 0.2[nn]/n)I. f = <p. For a discussion of these and other forms of "adaptation.2. that is.20 seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the nonnal distribution. suppose we take as our data the differences in Table 3.5. Xa. Let 0 trimmed mean.4. which corresponds approximately to a = This can be verified in tenns of asymptotic variances (MSEs}see Problem 5. < X (n) are the ordered observations.8) than the Gaussian density.. Bickel.5. Which a should we choose in the trimmed mean? There seems to be no simple answer. I I ! .5. by <Q < 4.194 Measures of Performance Chapter 3 the median has ~en known since the eighteenth century. The sensitivity curve of the trimmed mean. f(x) = or even more strikingly the Cauchy. Xu = X. and Tukey (1972).... . f(x) = 1/11"(1 + x 2 ).10 < a < 0.. the sensitivity Jo ~ CUIve of an Q trimmed mean is sketched in Figure 3. and Hogg (1974).1 again. . However.2. we throw out the "outer" [na] observations on either side and take the average of the rest. for example. (The middle portion is the line y = x(1 . the mean is better than any trimmed mean with a > 0 including the median.1. . Hampel. Huber (1972). Rogers. .
where Xu has 100Ct percent of the values in the population on its left (fonnally. 0 < 0: < 1.742(x. Write xn = 11.16).Section 3. say k. .75 .25 are called the upper and lower quartiles. other examples will be given in the problems.5. then SC(X. Similarly.I L:~ I (Xi .is typically used. If no: is an integer.25). If we are interested in the spread of the values in a population.. then the variance 0"2 or standard deviation 0. Let B(P) = Var(X) = 0"2 denote the variance in a population and let XI. . .2 . Example 3. n + 00 (Problem 3. the scale measure used is 0. Because T = 2 x (.2. Xu is called a Ctth quantile and X. Let B(P) = X o deaote a ath quantile of the distribution of X.4) where the approximation is valid for X fixed. P( X > xO') > 1 . = ~ [X(k) + X(k+l)]' and at sample size n  1.1 L~ 1 Xi = 11.1.Ct)..X.5.. .5 Nondecision Theoretic Criteria 195 Gross errors or outlying data points affect estimates in a variety of situations.674)0".1x. &) (3.25. To simplify our expression we shift the horizontal axis so that L~/ Xi = O.. SC(X. We next consider two estimates of the spread in the population as well as estimates of quantiles. Xo: = x(k). . < x(nI) are the ordered Xl.XnI. A fairly common quick and simple alternative is the IQR (interquartile range) defined as T = X. the nth sample quantile is Xo. Then a~ = 11.75 and X.. The IQR is often calibrated so that it equals 0" in the N(/l. .1. X n is any value such that P( X < x n ) > Ct. 0"2) model. 0 Example 3.75 .X. where x(t) < . X n denote a sample from that population. Spread.5. and let Xo.5. &2) It is clear that a:~ is very sensitive to large outlying Ixi values. Quantiles and the lQR.10). denote the o:th sample quantile (see 2..X)2 is the empirical plugin estimate of 0.
5. Rousseuw. have been studied extensively and a number of procedures proposed and implemented.5. We will return to the influence function in Volume II. . o Remark 3.5) 1 [xlk+11 _ xlkl] x> xlk+1) ' Clearly.75 X. Other aspects of robustness. Summary. xa: is not sensitive to outlying x's. 0.F) where = limIF. . although this difficulty appears to be being overcome lately.6. Ronchetti. median.. ~ ~ ~ Next consider the sample lQR T=X.25) and the sample IQR is robust with respect to outlying gross errors x.O(F)] = .. We discuss briefly nondecision theoretic considerations for selecting procedures including interpretability.25· Then we can write SC(x. . discussing the difficult issues of identifiability.1 denotes the empirical distribution based on Xl.• . and other procedures. ~' is the distribution function of point mass at x (.: : where F n . F) and~:z. r: ~i II i .1. The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean..) .196 thus. which is defined by IF(x.15) I[t < xl).SC(x.Xnl. Most of the section focuses on robustness.O. 1'75) . It is easy to see ~ ! '. x (t) that (Problem 3. for 2 Measures of Performance Chapter 3 < k < n . ~ . trimmed mean. Unfortunately these procedures tend to be extremely demanding computationally.5. Discussion. in particular the breakdown point. 1'0) ~[Xlkl) _ xlk)] x < XlkI) 2 '  2 2 1 Ix _ xlkl] XlkI) < x < xlk+11 '  (3... Ii I~ . The sensitivity of the parameter B(F) to x can be measured by the influence function.(x. SC(x. It plays an important role in functional expansions of estimates. = <1[0((1  <)F + <t.O.F) dO I. i) = SC(x.2. and computability. and Stabel (1983). 1'. An exposition of this point of view and some of the earlier procedures proposed is in Hampel.(x.
In Problem 3.24. lf we put a (improper) uniform prior on A. where R( 71') is the Bayes risk. 0) = (0 .. 6. (X I 0 = 0) ~ p(x I 0). respectively. ~ ~ 4.2 with unifonn prior on the probabilility of success O.2.a)' jBQ(1.2. Find the Bayes risk r(7f. 71') = R(7f) /r( 71'. 3.Section 3.0). a(O) > 0.4.O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite.Xn be the indicators of n Bernoulli trials with success probability O.2 preceeding.O)P. 1 X n is a N(B.0) and give the Bayes estimate of q(O). (a) Show that the joint density of X and 0 is f(x. which is called the odds ratio (for success).r 7f(0)p(x I O)dO. That is.3). under what condition on S does the Bayes rule exist and what is the Bayes rule? 5. the parameter A = 01(1 . E = R.. 0) is the quadratic loss (0 .0 E e. 71') as ..2. (72) sample and 71" is the improper prior 71"(0) = 1. density.O) = p(x I 0)71'(0) = c(x)7f(0 . Show that (h) Let 1(0. if 71' and 1 are changed to 0(0)71'(0) and 1(0. Suppose IJ ~ 71'(0). In some studies (see Section 6. J). = (0 o)'lw(O) for some weight function w(O) > 0. 0) the Bayes rule is where fo(x. Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule.2 o e 1. the Bayes rule does not change. Check whether q(OB) = E(q(IJ) I x). Suppose 1(0. 2.0)/0(0). where OB is the Bayes estimate of O. is preferred to 0. Show that if Xl. then the improper Bayes rule for squared error loss is 6"* (x) = X. Hint: See Problem 1.4. Let X I. x) where c(x) =. (c) In Example 3.6 PROBLEMS AND COMPLEMENTS Problems for Section 3. In the Bernoulli Problem 3. J) ofJ(x) = X in Example 3. give the MLE of the Bernoulli variance q(0) = 0(1 .2.3.. Consider the relative risk e( J. we found that (S + 1)/(n + 2) is the Bayes rule. .0)' and that the prinf7r(O) is the beta. s).6 Problems and Complements 197 3. . .1. Compute the limit of e( J. Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin. (3(r. Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior. change the loss function to 1(0.
where ao = L:J=l Qj. B. E) and positive when f) 1998) is 1.) (b) When the loss function is I(B.2(d)(i) and (ii). . (Bj aj)2. Suppose tbat N lo .19(c)..9j ) = aiO:'j/a5(aa + 1). find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule. 7. where CI. (a) If I(B.2. where is known. B . . On the basis of X = (X I. Let q( 0) = L:. find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x). (E. (.Xu are Li..O'5).exp {  2~' B' } . compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O. Find the Bayes decision rule. . . equivalent to a namebrand drug. Assume that given f). • 9. E). 8. A regulatory " ! I i· . I ~' .. (c) We want to estimate the vector (B" .3.\(B) = r .3....• N. and Cov{8j . One such function (Lindley.15. 0) and I( B. a) = [q(B)a]'. 1). .O)  I(B. 76) distribution.a)' / nj~. " X n of differences in the effect of generic and namebrand effects fora certain drug. a) = (q(B) . and that 9 is random with aN(rlO. . There are two possible actions: 0'5 a a with losses I( B. I) = difference in loss of acceptance and rejection of bioequivalence. by definition. I j l . and that 0 has the Dirichlet distribution D( a).  I .Xn ) we want to decide whether or not 0 E (e.. do not derive them. Note that ). N{O. Measures of Performance Chapter 3 (b) n . Bioequivalence trials are used to test whether a generic drug is. E). .) with loss function I( B. a = (a" . 00. Suppose we have a sample Xl. c2 > 0 I.. Hint: If 0 ~ D(a). .(0) should be negative when 0 E (E. Cr are given constants. then E(O)) = aj/ao.. For the following problems.3. then the generic and brandname drugs are. (c) Problem 1.B.ft B be the difference in mean effect of the generic and namebrand drugs. i: agency specifies a number f > a such that if f) E (E.I(d).198 (a) T + 00.. bioequivalent. a) = Lj~. ..)T. B).aj)/a5(ao + 1).. B = (B" . a r )T. (a) Problem 1. Var(Oj) = aj(ao . to a close approximation. . defined in Problem 1. Set a{::} Bioequivalent 1 {::} Not Bioequivalent . (e) a 2 + 00.=l CjUj . given 0 = B are multinomial M(n. (b) Problem 1. where E(X) = O. (Use these results. Let 0 = ftc . .d.. Xl..E). .\(B) = I(B.
3. ." (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b). (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3. . (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00").) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l(()...Yp). Note that . Discuss the preceding decision rule for this "prior. S x T ~ R.Yo) Xi {}g {}g Yj = 0.6.17). Yo) = sup g(x. Any two functions with difference "\(8) are possible loss functions at a = 0 and I. + = 0 and it 00").2 show that L(x. Yo) = inf g(xo. 0.. .2. (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3. representing x = (Xl. (a) Show that a necessary condition for (xo.2.1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3. In Example 3.1.y = (YI. S T Suppose S and T are subsets of Rm. y). For the model defined by (3. find (a) the linear Bayes estimate of ~l. A point (xo. and 9 is twice differentiable.3 1. . Hint: See Example 3. (b) the linear Bayes estimate of fl.. 0) and l(().16) and (3. 1) are not constant.6.1) < (T6(n) + c'){log(rg(~)+. {} (Xo...6 Problems and Complements 199 where 0 < r < 1.2. v) > ?r/(l ?r) is equivalent to T > t. respectively..Yo) = {} (Xo. (Xv. RP.) + ~}" . Suppose 9 .\(±€. Con 10. yo) is a saddle point of 9 if g(xo.Xm). Yo) is in the interior of S x T.Section 3. 2. . Yo) to be a saddle point is that..
Let Lx (Bo. . Hint: See Problem 3. I j 3... the conclusion of von Neumann's theorem holds.n)/(n + . 1 <j._ I)'.thesimplex.i.2. 5. I(Bi. n: I>l is minimax. Bd. Yo .andg(x.= 1 ..=1 CijXiYj with XE 8 m .1 < i < m. .(X) = 1 if Lx (B o. fJ( . and show that this limit ( . Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g. . N(I'. (b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is. &* is minimax. 0») equals 1 when B = ~. B1 ) Ip(X.y) ~ L~l 12. I}.)w".d.c. Yc Yd foraHI < i. I • > 1 for 0 i ~.. o')IR(O. 4.n). X n be i. and that the mndel is regnlar.. Let S ~ 8(n. Let X" . B1 ) >  (l.Xn ) . and o'(S) = (S + 1 2 . a) ~ (0 .. prior. BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1. i.1(0.2.Yo) > 0 8 8 8 X a 8 Xb 0. • .200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo.l.:.b < m. o(S) = X = Sin. B ) and suppose that Lx (B o. 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = .0)..o•• ) =R(B1 . y ESp. Suppose i I(Bi. Thus. 2 J'(X1 .n12).PIB = Bo]. f.i) =0. and 1 .j)=Wij>O.<7') and 1(<7'. • " = '!. .n12. A = {O. the test rule 1511" given by o.j=O. iij.. d) = (a) Show that if I' is known to be 0 (!.a)'. Suppose e = {Bo.." 1 Xi ~ l}.d< p. Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo. (b) Show that limn_ooIR(O. L:. .a. B ) ~ p(X. f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta..0•• ). (b) Suppose Sm = (x: Xi > 0..
Hint: (a) Consider a gamma prior on 0 = 1/<7'. o'(X) is also minimax and R(I'.a? 1... . . Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l . See Volume II. LetA = {(iI. Jlk. . where (Ill. . that is..~~n X < Pi < < R(I'.l . N k ) has a multinomial.j.. respectively. ik) is smaller than that of (i'l"'" i~). 9. . . (jl. Show that if = (I'I''''.. .» = Show that the minimax rule is to take L l.k} I«i).?. prior. " a. i=l then o(X) = X is minimax. . J. 1 <i< k. f(k.Pi)' Pjqj <j < k. .j.\. . h were t·.. X k ) = (R l . Then X is nurnmax. . Show that X has a Poisson (. . See Problem 1.. .. X k be independent with means f.I)..2.tj. 1). Let Xl. .\ . d = (d)...)..k. Remark: Stein (1956) has shown that if k > 3. . . X is no longer unique minimax. .). 00. 6.. show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible.. ftk) = (Il?l'"'' J1.. ttl . Write X  • "i)' 1(1'. Xl· (e) Show that if I' is unknown. 1 < j < k.X)' are inadmissible.1)1 L(Xi ...) . . jl~ < . I' (X"". b. (c) Use (B.o) for alII'. distribution. ik). .P. = tj..a < b" = 'lb.\. < im.. . a) = (. then is minimax for the loss function r: l(p.. hence. . . Show that if (N). . . M (n.I'kf.X)' is best among all rules of the form oc(X) = c L(Xi . 0 1. dk)T. 1 = t j=l (di . Let k .. ik is an arbitrary unknown permutation of 1. 8.. o(X) ~ n~l L(Xi . .. d) = L(d. an d R a < R b. .j. Let Xi beindependentN(I'i..d) where qj ~ 1 .\) distribution and 1(. that both the MLE and the estimate 8' = (n . . and iI. .3. . . . PI..X)' and.. < /12 is a known set of values. . .1..tn I(i.29). j 7. / ... b = ttl.12. For instance... o(X 1. Xk)T. 0') = (1. . Hint: Consider the gamma.Section 3_6 Problems and Complements 201 (b) If I' ~ 0.Pi. Rk) where Rj is the rank of Xi.. > jm). Permutations of I. ..
distribution. For a given x we want to estimate the proportion F(x) of the population to the left of x.1r) = Show that the marginal density of X.2.i f X> . B j > 0... Define OiflXI v'n < d < v'n d v'n d d X .d with density defined in Problem 1.d._.BO)T. LB= j 01 1. 1). (b) How does the risk of these estimates compare to that of X? 12. See also Problem 3. I: • . Pk. . See Problem 3.=1 Show that the Bayes estimate is (Xi ... M(n. X = (X"".. ! p(x) = J po(x)1r(B)dB... Suppose that given 6 = B = (B .... 1 .8. B(n.. with unknown distribution F.I)!. of Xi < X • 1 + 1 o. . I: I ir'z .i. X has a binomial. .Xo)T has a multinomial. Let X" . Let the loss " function be the KullbackLeibler divergence lp(B. 14.2.4.15.. ~ I I . . + l)j(n + k).. v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) ...... d X+ifX 11. Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B.. ... Hint: Consider the risk function of o~ No.. Let K(po. n) be i. 10. J K(po. . . Suppose that given 6 ~ B... 13. q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q..q)1r(B)dB.. a) is the posterior mean E(6 I X). distribution.a) and let the prior be the uniform prior 1r(B" . B). .. . . X n be independent N(I'.202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI. BO l) ~ (k ..3. Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss.... .1'»2 of these estimates is bounded for all nand 1'.B). Let Xi(i = 1..
that is.d. .aao + (1 . S. R. Let A = 3. N(I". . We shall say a loss function is convex..a )I( 0. Problems for Section 3. X n are Ll..k(p.4. . the Fisher information lower bound is equivariant. Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable.1 .4. then R(O. It is often improper. and O~ (1 . if I(O. then E(g(X) > g(E(X». 2.). Jeffrey's "Prior. 15. . Reparametrize the model by setting ry = h(O) and let q(x.alai) < al(O. X n be the indicators of n Bernoulli trials with success probability B. suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8). a < a < 1.4. p(X) IO. ry) = p(x.12) for the two parametrizations p(x. then That is. a. .1 E: 1 (Xi  J.x =.1"0 known. J). for any ao. . Show that if 1(0. 0) cases. a" (J. . . Suppose X I. K) .X is called the mutual information between 0 and X. 0. h1(ry)) denote the model in the new parametrization. Equivariance. ry).Section 3... p( x. J') < R( (J. Show that B q(ry) = Bp(h. (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations.2) with I' . respectively.'? /1 as in (3..6 Problems and Complements 203 minimizes k( q. 7f) and that the minimum is I(J. Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient. ao) + (1 . N(I"o. Let X .." A density proportional to VJp(O) is called Jeffrey's prior. Show that in theN(O. Fisher infonnation is not equivariant under increasing transformations of the parameter. 0) and q(x. . Give the Bayes rules for squared error in these three cases.tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible. .4 1. Hint: k(q. (ry»).~). {log PB(X)}] K(O)d(J. 0) with 0 E e c R. Show that (a) (b) cr5 = n. Show that X is an UMVU estimate of O.. a) is convex and J'(X) = E( J(X) I I(X».f [E. Prove Proposition 3.O)!. (b) Equivariance of the Fisher Information Bound. . Let X I. K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality. Jeffrey's priors are proportional to 1. 4. 0) and B(n. Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij.
00. . a ~ (c) Suppose that Zi = log[i/(n + 1)1.4.1. =f. ~ ~ ~ (a) Write the model for Yl give the sufficient statistic.4.. 2 ~ ~ 10. Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1).p) is an unbiased estimate of (72 in the linear regression model of Section 2. then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· . .{x : p(x.\ = a + bB is UMVU for estimating .... Show that if (Xl. Show that a density that minimizes the Fisher infonnation over F is f(x. Suppose (J is UMVU for estimating fJ. X n ) is a sample drawn without replacement from an unknown finite population {Xl. Hint: Use the integral approximation to sums. then = 1 forallO. Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0. P.Ii . .3. distribution. In Example 3. Let X" . . compute Var(O) using each ofthe three methods indicated.4.. . X n be a sample from the beta. Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered.o.ZDf3)T(y .8. . . Y n in twoparameter canonical exponential form and (b) Let 0 = (a. I I .4.2.Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate. . 0) for any set E.\ = a + bOo 11. Show that 8 = (Y . (a) Find the MLE of 1/0... . 204 Hint: See Problem 3.11(9) as n ~ give the limit of n times the lower bound on the variances of and (J.. 14. Establish the claims of Example 3. Show that . 0) = Oc"I(x > 0). Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji. Ct. . Show that assumption I implies that if A .' .. . Suppose Yl . 8. . find the bias o f ao' 6. Does X achieve the infonnation . Let a and b be constants. . {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J. B(O. Find lim n. n. For instance. . Let F denote the class of densities with mean 0.1 and variance 0.P. and . ( 2).XN }.. . Hint: Consider T(X) = X in Theorem 3. 9. 1).13 E R. • 13.ZDf3)/(n . 7.5(b). i = 1..2 (0 > 0) that satisfy the conditions of the information inequality. a ~ i 12.
.Section 3.4).. 16.4. = N(O. (c) Explain how it is possible if Po is binomial. More generally only polynomials of degree n in p are unbiasedly estimable..3. X is not II Bayes estimate for any prior 1r. :/i.6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U. 20.1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o. XK..4.• . Show that X k given by (3.4. UN are as in Example 3... (b) Deduce that if p. 17.1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n.511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ . . Let X have a binomial. Define ~ {Xkl.6 Problems and Complements 205 (b) The variance of X is given by (3. .Xkl.) Suppose the Uj can be relabeled into strata {xkd.". 7. _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk.} and X =K  1". Suppose UI. G 19. 15. 8). k = 1. Let 7fk = ~ and suppose 7fk = 1 < k < K.. Suppose X is distributed accordihg to {p. ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss. (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl.X)/Var(U). . K.l. P[o(X) = 91 = 1. . if and only if.p). Show that if M is the expected sample size. (See also Problem 1. k=l K ~1rkXk. 'L~ 1 h = N.~). . : 0 E for (J such that E((J2) < 00. even for sampling without replacement in each stratum. = 1.~ for all k. Show that is not unbiasedly estimable. . then E( M) ~ L 1r j=1 N J ~ n. B(n. B(n.. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~.4. Stratified Sampling.. distribution. 18.. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population. that ~ is a Bayes estimate for O. . 1 < i < h.
Let X ~ U(O.. with probability I for each B. 5. Prove Theorem 3. the upper quartile X.206 Hint: Given E(ii(X) 19) ~ 9. is an empirical plugin estimate of ~ Here case. give and plot the sensitivity curve of the median. Hint: It is equivalent to show that. Regularity Conditions are Needed for the Information Inequality. however. If n IQR. .25 and (n . If a = 0. • Problems for Section 35 I is an integer.4. Show that the a trimmed mean XCII.4. 1 2. 1 X n1 C. . and we can thus define moments of 8/8B log p(x.75' and the IQR. give and plot the sensitivity curves of the lower quartile X.l)a is an integer. for all X!.5) to plot the sensitivity curve of the 1. (iii) 2X is unbiased for ."" F (ij) Vat (:0 logp(X. J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6. Note that logp(x. that is. 22. = 2k is even... B)./(9)a [¢ T (9)a]T II (9)[¢ T (9)a].25 and net =k 3.4. Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3. B). If a = 0. E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) .5. for all adx 1. Show that. 4.25. T . Yet show eand has finite variance. Var(a T 6) > aT (¢(9)II(9). use (3. An estimate J(X) is said to be shift or translation equivariant if.9)' 21. B) be the uniform distribution on (0. B) is differentiable fat anB > x. B») ~ °and the information bound is infinite.3. Show that the sample median X is an empirical plugin estimate of the population median v.
F(x .5.. median. (b) Suppose Xl. . plot the sensitivity curves of the mean. (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1. .30. then (i. .. (7= moo 1 IX. It has the advantage that there is no trimming proportion Q that needs to be subjectively specified. and xiflxl < k kifx > k kifx<k.6.5 and for (j is.67.. .). X a arc translation equivariant and antisymmetric.. Deduce that X.) ~ 8. and the HodgesLehmann estimate.03 (these are expected values of four N(O. Xu.. I)order statistics)..03.6 Problems and Complements 207 It is antisymmetric if for all Xl.Section 3. X are unbiased estimates of the center of symmetry of a symmetric distribution.(a) Show that X. xH L is translation equivariant and antisymmetric. For x > . J is an unbiased estimate of 11. In . Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite. 7.1. X n is a sample from a population with dJ.  XI/0. i < j. <i< n Show that (a) k = 00 corresponds to X.30. (b) Show that   . One reasonable choice for k is k = 1. (See Problem 3..e. . .JL) where JL is unknown and Xi . X. . Its properties are similar to those of the trimmed mean. The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00. is symmetrically distributed about O. .. k t 0 to the median. '& is an estimate of scale.1. trimmed mean with a = 1/4. The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj).3.
(e) If k < 00. 00. we will use the IQR scale parameter T = X.) has heavier tails than f() if g(x) is above f(x) for Ixllarge.~ !~ t . (d) Xk is translation equivariant and antisymmetric (see Problem 3.7(a). . with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O. .• 13.75 .) are two densities with medians v zero and identical scale parameters 7.i. This problem may be done on the computer. n is fixed. Suppose L:~i Xi ~ . iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00. (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10. If f(·) and g(.5.X. 9.11. Let JJo be a hypothesized mean for a certain population. Show that SC(x.Xn arei. .1)1 L:~ 1(Xi .208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0. 00. where S2 = (n . The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \. •• ." . thenXk is the MLEofB when Xl. Let 1'0 = 0 and choose the ideal sample Xl.25. X 12. [. . ... and "" "'" (b) the Iratio of Problem 3.d. then limlxj>00 SC(x.O)/ao) where fo(x) for Ixl for Ixl <k > k. the standard deviation does not exist.. In the case of the Cauchy density. and is fixed.. Laplace. Location Parameters.5. (b) Find thetml probabilities P(IXI > 2). The (student) tratio is defined as 1= v'n(x I'o)/s. thus. In what follows adjust f and 9 to have v = 0 and 'T = 1. . Let X be a random variable with continuous distribution function F. 11. plot the sensitivity curve of (a) iTn . Xk) is a finite constant. we say that g(..x}2. and Cauchy distributions. = O. For the ideal sample of Problem 3.6). P(IXI > 3) and P(IXI > 4) for the nonnal. (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr.5. 1 Xnl to have sample mean zero. ii. Use a fixed known 0"0 in place of Ci. with density fo((. Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00.
.). 8) and ~ IF(x. let v" = v.  vl/O. the ~ = bdSC(x. and note thatH(xi/(F» < F(x) < H(xv(F».t. and trimmed population mean JiOl (see Problem 3. Show that if () is shift and location equivariant. 8. c + dO. In this case we write X " Y. i/(F)] is the value of some location parameter..5._. then v(F) and i/( F) are location parameters. .) That is..(F) = !(xa + XI").6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :.Section 3.1: (a) Show that SC(x.. (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :. 0 < " < 1. and sample trimmed mean are shift and scale equivariant. (b) Show that the mean Ji. 8) = IF~ (x. any point in [v(F).5. Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F). a + bx. (a) Show that if F is symmetric about c and () is a location parameter.5. · ._. median v. If 0 is scale and shift equivariant. . i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F). . 8..xn ). (e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval.xn . In Remark 3. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y..) 14. and ordcr preserving.5) are location parameters..Xn_l) to show its dependence on Xnl (Xl. F(t) ~ P(X < t) > P(Y < t) antisymmctric. n (b) In the following cases. if F is also strictly increasing.O...F). X is said to be stochastically smaller than Y if = G(t) for all t E R.t )) = 0.8. SC(a + bx. Let Y denote a random variable with continuous distribution function G.. compare SC(x. F) lim n _ oo SC(x. b > 0. .. ([v(F). i/(F)] and. An estimate ()n is said to be shift and scale equivariant if for all xl. x n _.67 and tPk is defined in Problem 3. ~ ~ 8n (a +bxI. (jx. Fn _ I ). Hint: For the second part. where T is the median of the distributioo of IX location parameter.]. ~ ~ 15.(k) is a (d) For 0 < " < 1. ~ se is shift invariant and scale equivariant. . b > 0. sample median...8.:. c E R. . Show that J1. xnd. then ()(F) = c.(F): 0 <" < 1/2}. ~ (a) Show that the sample mean. ~ = d> 0. then for a E R. it is called a location parameter. . let H(x) be the distribution function whose inverse is H'(a) = ![xax.a +bxn ) ~ a + b8n (XI. ~ (b) Write the Be as SC(x. Also note that H(x) is symmetric about zero. a. 8. v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v..
also true of the method of coordinate ascent.B ~ {F E :F : Pp(A) E B}. {BU)} do not converge. (b) ~ ~ . (ii) O(F) = (iii) e(F) "7.1 is commonly known as the Cramer~Rao inequality.5. ) (a) Show by example that for suitable t/J and 1(1(0) . Frechet.4 I . show that (a) If h is a density that is symmetric about zero. we shall • I' .0 > 0 (depending on t/J) such that if lOCO) . and (iii) preceding? ~ 16.rdF(l:). Because priority of discovery is now given to the French mathematician M. ~ I(x  = X n . IlFfdF(x). C < 00. where B is the class of Borel sets. in order to be certain that the Jth iterate (j J is within e of the desired () such that 'IjJ( fJ) = 0. Show that in the bisection method. 81" < 0..2). B E B. .Bilarge enough. (1) The result of Theorem 3. .. (b) If no assumptions are made about h. . • .BI < C1(1(i.1) Hint: (a) Try t/J(x) = Alogx with A> 1.3 (1) A technical problem is to give the class S of subsets of:F for which we can assign probability (the measurable sets). is not identifiable.8) .7 NOTES Note for Section 3. (ii). 0. · . we in general must take on the order of log ~ steps. consequently. 18. and we seek the unique solution (j of 'IjJ(8) = O. In the gross error model (3. ~ ~ 17. • • . then fJ. Let d = 1 and suppose that 'I/J is twice continuously differentiable.' > 0.field generated by SA.t is identifiable. . We define S as the .OJ then l(I(i) . Assume that F is strictly increasing. then J. .210 (i) Measures of Performance Chapter 3 O(F) ~ 1'1' ~ J .1/. (e) Does n j [SC(x. The NewtonRaphson method in this case is .I F(x.4. A. This is. 3. I : . F)! ~ 0 in the cases (i). (b) Show that there exists. I I Notes for Section 3.
follow the lead of Lehmann and call the inequality the information inequality, after the Fisher information number that appears in the statement.

(2) Note that this inequality is true but uninteresting if I(θ) = ∞ (and ψ'(θ) is finite) or if Var_θ(T(X)) = ∞.

(3) The continuity of the first integral ensures that

(∂/∂θ) ∫ T(x) p(x, θ) dx = ∫ T(x) (∂/∂θ) p(x, θ) dx

for all θ, whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in ∫∫ T(x) (∂/∂λ) p(x, λ) dx dλ.

(4) The finiteness of Var_θ(T(X)) and I(θ) imply that ψ'(θ) is finite by the covariance interpretation given in (3.4.8).

3.8 REFERENCES

ANDREWS, D. F., BICKEL, P. J., HAMPEL, F. R., HUBER, P. J., ROGERS, W. H., AND J. W. TUKEY, Robust Estimates of Location: Survey and Advances. Princeton, NJ: Princeton University Press, 1972.

APOSTOL, T. M., Mathematical Analysis, 2nd ed. Reading, MA: Addison-Wesley, 1974.

BAXTER, M., DOWE, D., OLIVER, J., AND C. WALLACE, "Point Estimation Using the Kullback-Leibler Loss Function and MML," in Proceedings of the Second Pacific Asian Conference on Knowledge Discovery and Data Mining. Melbourne: Springer-Verlag, 1998.

BERGER, J., Statistical Decision Theory and Bayesian Analysis, 2nd ed. New York: Springer, 1985.

BERNARDO, J. M., AND A. F. M. SMITH, Bayesian Theory. New York: Wiley, 1994.

BICKEL, P. J., AND K. A. DOKSUM, "Measures of Location and Asymmetry," Scand. J. of Statist., 2, 11-22 (1975).

BICKEL, P. J., AND E. L. LEHMANN, "Unbiased Estimation in Convex Families," Ann. Math. Statist., 40, 1523-1535 (1969).

BICKEL, P. J., AND E. L. LEHMANN, "Descriptive Statistics for Nonparametric Models. I. Introduction," Ann. Statist., 3, 1038-1044 (1975a).

BICKEL, P. J., AND E. L. LEHMANN, "Descriptive Statistics for Nonparametric Models. II. Location," Ann. Statist., 3, 1045-1069 (1975b).

BICKEL, P. J., AND E. L. LEHMANN, "Descriptive Statistics for Nonparametric Models. III. Dispersion," Ann. Statist., 4, 1139-1158 (1976).

BÜHLMANN, H., Mathematical Methods in Risk Theory. Heidelberg: Springer-Verlag, 1970.

DAHLQUIST, G., BJÖRK, Å., AND N. ANDERSON, Numerical Analysis. New York: Prentice-Hall, 1974.

DE GROOT, M. H., Optimal Statistical Decisions. New York: McGraw-Hill, 1969.
HAMPEL, F. R., "The Influence Curve and Its Role in Robust Estimation," J. Amer. Statist. Assoc., 69, 383-393 (1974).

HAMPEL, F. R., RONCHETTI, E., ROUSSEEUW, P. J., AND W. STAHEL, Robust Statistics: The Approach Based on Influence Functions. New York: J. Wiley & Sons, 1986.

HANSEN, M. H., AND B. YU, "Model Selection and the Principle of Minimum Description Length," J. Amer. Statist. Assoc. (2000).

HOGG, R. V., "Adaptive Robust Procedures," J. Amer. Statist. Assoc., 69, 909-927 (1974).

HUBER, P. J., "Robust Statistics: A Review," Ann. Math. Statist., 43, 1041-1067 (1972).

HUBER, P. J., Robust Statistics. New York: Wiley, 1981.

JAECKEL, L. A., "Robust Estimates of Location," Ann. Math. Statist., 42, 1020-1034 (1971).

JEFFREYS, H., Theory of Probability. London: Oxford University Press, 1948.

KARLIN, S., Mathematical Methods and Theory in Games, Programming, and Economics. Reading, MA: Addison-Wesley, 1959.

LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

LEHMANN, E. L., AND G. CASELLA, Theory of Point Estimation, 2nd ed. New York: Springer, 1998.

LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference. Cambridge: Cambridge University Press, 1965.

LINDLEY, D. V., "Decision Analysis and Bioequivalence Trials," Statistical Science, 13, 136-141 (1998).

NORBERG, R., "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification," Scand. Actuarial J., 204-222 (1986).

RISSANEN, J., "Stochastic Complexity (With Discussions)," J. Royal Statist. Soc. B, 49, 223-239 (1987).

SAVAGE, L. J., The Foundations of Statistics. New York: J. Wiley & Sons, 1954.

SHIBATA, R., "Bootstrap Estimate of Kullback-Leibler Information for Model Selection," Statistica Sinica, 7, 375-394 (1997).

STEIN, C., "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution," Proc. Third Berkeley Symposium on Math. Statist. and Probability, 1, 197-206. Berkeley: University of California Press, 1956.

TUKEY, J. W., Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1972.

WALLACE, C. S., AND P. R. FREEMAN, "Estimation and Inference by Compact Coding (With Discussions)," J. Royal Statist. Soc. B, 49, 240-251 (1987).

WIJSMAN, R. A., "On the Attainment of the Cramér-Rao Lower Bound," Ann. Statist., 1, 538-542 (1973).
Chapter 4

TESTING AND CONFIDENCE REGIONS: BASIC THEORY

4.1 INTRODUCTION

In Sections 1.3 and 3.1, 3.2, and 3.3 we defined the testing problem abstractly, treating it as a decision theory problem in which we are to decide whether P ∈ P₀ or P₁ or, parametrically, whether θ ∈ Θ₀ or Θ₁ if P_j = {P_θ : θ ∈ Θ_j}, where P₀, P₁ are a partition of the model P or, respectively, Θ₀ and Θ₁ are a partition of the parameter space Θ. This framework is natural if, as is often the case, we are trying to get a yes or no answer to important questions in science, medicine, public policy, and indeed most human activities, and we have data providing some evidence one way or the other.

Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? As we have seen, in examples such as those of Chapters 1 and 3, the questions are sometimes simple and the type of data to be gathered is under our control. We can design a clinical trial, perform a survey, or more generally construct an experiment that yields data X in X ⊂ R^q, modeled by us as having distribution P_θ, θ ∈ Θ, corresponding to answering "no" or "yes" to the preceding questions. Usually, the situation is less simple. The design of the experiment may not be under our control, what is an appropriate stochastic model for the data may be questionable, and what Θ₀ and Θ₁ correspond to in terms of the stochastic model may be unclear. Here are two examples that illustrate these issues.

Example 4.1.1. Sex Bias in Graduate Admissions at Berkeley. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. They initially tabulated N_{m1}, N_{f1}, the numbers of admitted male and female applicants, and the corresponding numbers N_{m0}, N_{f0} of denied applicants. If n is the total number of applicants, it might be tempting to model (N_{m1}, N_{m0}, N_{f1}, N_{f0}) by a multinomial, M(n, p_{m1}, p_{m0}, p_{f1}, p_{f0}), distribution. But this model is suspect because in fact we are looking at the population of all applicants here, not a sample. Accepting this model provisionally, what does the
hypothesis of no sex bias correspond to? It is natural to translate this into

P[Admit | Male] = p_{m1}/(p_{m1} + p_{m0}) = P[Admit | Female] = p_{f1}/(p_{f1} + p_{f0}).

But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability (p_{m1} + p_{f1})/(p_{m1} + p_{f1} + p_{m0} + p_{f0}). In fact, as is discussed in a paper by Bickel, Hammel, and O'Connell (1975), admissions are performed at the departmental level and rates of admission differ significantly from department to department. If departments "use different coins," then the data are naturally decomposed into N = (N_{m1d}, N_{m0d}, N_{f1d}, N_{f0d}, d = 1, ..., D), where N_{m1d} is the number of male admits to department d, and so on. Our multinomial assumption now becomes N ~ M(p_{m1d}, p_{m0d}, p_{f1d}, p_{f0d}, d = 1, ..., D). In these terms the hypothesis of "no bias" can now be translated into:

H: p_{m1d}/(p_{m1d} + p_{m0d}) = p_{f1d}/(p_{f1d} + p_{f0d}), for d = 1, ..., D.

This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate. In fact, the same data can lead to opposite conclusions regarding these hypotheses, a phenomenon called Simpson's paradox. The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis. □

Example 4.1.2. Mendel's Peas. In one of his famous experiments laying the foundation of the quantitative theory of genetics, Mendel crossed peas heterozygous for a trait with two alleles, one of which was dominant. The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). In a modern formulation, if there were n dominant offspring (seeds), the natural model is to assume, if the inheritance ratio can be arbitrary, that N_{AA}, the number of homozygous dominants, has a binomial (n, p) distribution. The hypothesis of dominant inheritance corresponds to H : p = 1/3 with the alternative K : p ≠ 1/3. It was noted by Fisher, as reported in Jeffreys (1961), that in this experiment the observed fraction m/n was much closer to 1/3 than might be expected under the hypothesis that N_{AA} has a binomial, B(n, 1/3), distribution:

P[ |N_{AA}/n − 1/3| ≤ |m/n − 1/3| ] = 7 × 10^{−5}.

Fisher conjectured that rather than believing that such a very extraordinary event occurred, it is more likely that the numbers were made to "agree with theory" by an overzealous assistant. That is, either N_{AA} cannot really be thought of as stochastic or any stochastic model needs to permit distributions other than B(n, 1/3). □
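For illustration, probabilities of the kind Fisher computed can be reproduced with a short calculation. The sketch below assumes Python with scipy; the counts n and m are placeholders chosen by us, since the text does not report Mendel's actual data.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical counts (placeholders; not Mendel's actual data).
n, m = 6000, 2001          # trials and observed homozygous dominants
p0 = 1.0 / 3.0             # hypothesized success probability

# P(|N/n - 1/3| <= |m/n - 1/3|) under N ~ B(n, 1/3): sum the binomial pmf
# over all k whose frequency k/n is at least as close to 1/3 as m/n is.
eps = abs(m / n - p0)
k = np.arange(n + 1)
close = np.abs(k / n - p0) <= eps
print(binom.pmf(k, n, p0)[close].sum())
```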
What the second of these examples suggests is often the case: The set of distributions corresponding to one answer, say Θ₀, is better defined than the alternative answer Θ₁. That a treatment has no effect is easier to specify than what its effect is. In science generally a theory typically closely specifies the type of distribution P of the data X as, for instance, P = P₀ when H is well defined. If the theory is false, it's not clear what P should be, as in the preceding Mendel example.

These considerations lead to the asymmetric formulation that saying P ∈ P₀ (θ ∈ Θ₀) corresponds to acceptance of the hypothesis H : P ∈ P₀ and P ∈ P₁ corresponds to rejection, sometimes written as K : P ∈ P₁.(1) As we have stated earlier, acceptance and rejection can be thought of as actions a = 0 or 1, and we are then led to the natural 0-1 loss l(θ, a) = 0 if θ ∈ Θ_a and 1 otherwise. Recall that a decision procedure in the case of a test is described by a test function δ : X → {0, 1} or critical region C = {x : δ(x) = 1}, the set of points for which we reject.

It is convenient to distinguish between two structural possibilities for Θ₀ and Θ₁: If Θ₀ consists of only one point, we call Θ₀ and H simple. When Θ₀ contains more than one point, Θ₀ and H are called composite. The same conventions apply to Θ₁ and K.

We illustrate these ideas in the following example.

Example 4.1.3. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. To investigate this question we would have to perform a random experiment. Most simply we would sample n patients, administer the new drug, and then base our decision on the observed sample X = (X₁, ..., Xₙ), where X_i is 1 if the ith patient recovers and 0 otherwise. Thus, suppose we observe S = ΣX_i, the number of recoveries among the n randomly selected patients who have been administered the new drug.(2) Suppose that we know from past experience that a fixed proportion θ₀ = 0.3 recover from the disease with the old drug, where θ₀ is the probability of recovery using the old drug. If θ is the probability that a patient to whom the new drug is administered recovers, then S has a B(n, θ) distribution. Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug. What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug. If we suppose the new drug is at least as effective as the old, then Θ₀ = {θ₀} and H is simple; Θ₁ is the interval (θ₀, 1] and K is composite. If we allow for the possibility that the new drug is less effective than the old, then Θ₀ = [0, θ₀] and Θ₀ is composite. In situations such as this one we shall simplify notation and write H : θ = θ₀, K : θ > θ₀. It will turn out that in most cases the solution to testing problems with Θ₀ simple also solves the composite Θ₀ problem. See Remark 4.1.1.

In this example with Θ₀ = {θ₀} it is reasonable to reject H if S is "much" larger than what would be expected by chance if H is true and the value of θ is θ₀. Thus, we reject H if S exceeds or equals some integer, say k, and accept H otherwise. □
(2) Here we let θ be the probability that a patient to whom the new drug is administered recovers, and the population of (present and future) patients is thought of as infinite.
The Neyman-Pearson Framework

The Neyman-Pearson approach rests on the idea that, of the two errors, one can be thought of as more important. By convention this is chosen to be the type I error, and that in turn determines what we call H and what we call K. Given this position, how reasonable is this point of view? In the medical setting of Example 4.1.3 this asymmetry appears reasonable. It has also been argued that, generally in science, announcing that a new phenomenon has been observed when in fact nothing has happened (the so-called null hypothesis) is more serious than missing something new that has in fact occurred. We do not find this persuasive, but if this view is accepted, it again reasonably leads to a Neyman-Pearson formulation. There is an important class of situations in which no one really believes that H is true and possible types of alternatives are vaguely known at best. The Neyman-Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then, as we shall see in Sections 4.2 and 4.3, suggesting what test statistics it is best to use.

For instance, testing techniques are used in searching for regions of the genome that resemble other regions that are known to have significant biological activity. One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. To determine significant values of these statistics a (more complicated) version of the following is done. Thresholds (critical values) are set so that if the matches occur at random (i.e., matches at one position are independent of matches at other positions) and the probability of a match is 1/2, then the probability of exceeding the threshold (type I error) is smaller than α. Here asymmetry is also imposed because one of Θ₀, Θ₁ is much better defined than its complement and/or the distribution of statistics T under Θ₀ is easy to compute: H is well defined, but K is not, so computation under H is easy.

In most problems it turns out that the tests that arise naturally have the kind of structure we have just described. There is a statistic T that "tends" to be small, if H is true, and large, if H is false. We call T a test statistic. (Other authors consider test statistics T that tend to be small, when H is false. Then −T would be a test statistic in our sense.) We select a number c and our test is to calculate T(x) and then reject H if T(x) ≥ c and accept H otherwise. The value c that completes our specification is referred to as the critical value of the test. Note that a test statistic generates a family of possible tests as c varies. We will discuss the fundamental issue of how to choose T in Sections 4.2 and 4.3, and in later chapters.

We now turn to the prevalent point of view on how to choose c. In Example 4.1.3, our critical region C is {X : S ≥ k} and the test function or rule is δ_k(X) = 1{S ≥ k}, with

P_I = probability of type I error = P_{θ₀}(S ≥ k),
P_II = probability of type II error = P_θ(S < k), θ > θ₀.

The constant k that determines the critical region is called the critical value.
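The genome-matching thresholds described above can be sketched numerically. The following is a minimal illustration assuming Python with scipy; the alignment length n = 100 and the level α are our choices, not values from the text.

```python
from scipy.stats import binom

def match_threshold(n, alpha, p=0.5):
    """Smallest c with P(S >= c) <= alpha when S ~ B(n, p),
    i.e. the critical value for the count of random matches."""
    k = int(binom.ppf(1 - alpha, n, p))
    while binom.sf(k - 1, n, p) > alpha:   # binom.sf(k-1) = P(S >= k)
        k += 1
    return k

print(match_threshold(n=100, alpha=0.01))
```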
There is an important class of situations in which the Neyman-Pearson framework is inappropriate, such as the quality control Example 1.1.1, in which, even though there are just two actions, we can attach, even nominally, numbers to the two losses that are not equal and/or depend on θ, for instance, in the Bayesian framework with a prior distribution on the parameter. In that case, the approach of Example 3.2.2(b) is the one to take in all cases with Θ₀ and Θ₁ simple.

Here are the elements of the Neyman-Pearson story. Begin by specifying a small number α > 0 such that probabilities of type I error greater than α are undesirable. Then restrict attention to tests that in fact have the probability of rejection less than or equal to α for all θ ∈ Θ₀. Such tests are said to have level (of significance) α, and we speak of rejecting H at level α. The values α = 0.01 and 0.05 are commonly used in practice. Because a test of level α is also of level α' > α, it is convenient to give a name to the smallest level of significance of a test. This quantity is called the size of the test and is the maximum probability of type I error. That is, if we have a test statistic T and use critical value c, our test has size α(c) given by

α(c) = sup{P_θ[T(X) ≥ c] : θ ∈ Θ₀}.    (4.1.1)

Now α(c) is nonincreasing in c and typically α(c) ↑ 1 as c ↓ −∞ and α(c) ↓ 0 as c ↑ ∞. In that case, if 0 < α < 1, there exists a unique smallest c for which α(c) ≤ α. This is the critical value we shall use. It is referred to as the level α critical value.

Once the level or critical value is fixed, the probabilities of type II error as θ ranges over Θ₁ are determined. By convention 1 − P[type II error] is usually considered: The power of a test against the alternative θ is the probability of rejecting H when θ is true. Thus, the power is 1 minus the probability of type II error; it can be thought of as the probability that the test will "detect" that the alternative θ holds. Both the power and the probability of type I error are contained in the power function.

Definition 4.1.1. The power function of a test δ is defined for all θ ∈ Θ by

β(θ) = β(θ, δ) = P_θ[Rejection] = P_θ[δ(X) = 1] = P_θ[T(X) ≥ c].

If θ ∈ Θ₀, β(θ, δ) is just the probability of type I error, whereas if θ ∈ Θ₁, β(θ, δ) is the power against θ.

Example 4.1.3 (continued). Here

β(θ, δ_k) = P(S ≥ k) = Σ_{j=k}^{n} (n choose j) θ^j (1 − θ)^{n−j}.

If Θ₀ is composite as well, then the probability of type I error is also a function of θ. In Example 4.1.3 with δ(X) = 1{S ≥ k}, θ₀ = 0.3, and n = 10, we find from binomial tables the level 0.05 critical value 6, and the test has size

α(6) = P_{θ₀}(S ≥ 6) = 0.0473.

It follows that the level and size of the test are unchanged if instead of Θ₀ = {θ₀} we used Θ₀ = [0, θ₀]:

α(k) = sup{P_θ[T(X) ≥ k] : θ ∈ Θ₀} = P_{θ₀}[T(X) ≥ k].

A plot of this power function for n = 10, θ₀ = 0.3, k = 6 is given in Figure 4.1.1.
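The critical value and size quoted for n = 10, θ₀ = 0.3 can be checked directly. Here is a minimal computational sketch, assuming Python with scipy.

```python
from scipy.stats import binom

n, theta0, alpha = 10, 0.3, 0.05

# Smallest k with P(S >= k) <= alpha when S ~ B(n, theta0).
k = next(k for k in range(n + 2) if binom.sf(k - 1, n, theta0) <= alpha)
print(k, binom.sf(k - 1, n, theta0))        # 6, 0.0473...

# Power beta(theta, delta_k) = P_theta(S >= k) at a few alternatives.
for theta in (0.4, 0.5, 0.6):
    print(theta, binom.sf(k - 1, n, theta))  # at 0.5 this is 0.3770
```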
[Figure 4.1.1. Power function of the level 0.05 one-sided test δ_k of H : θ = 0.3 versus K : θ > 0.3 for the B(10, θ) family of distributions; k = 6 and the size is 0.0473. The power β(θ, δ_k) is plotted as a function of θ, 0 ≤ θ ≤ 1.]

Note that in this example the power at θ = θ₁ > 0.3 is the probability that the level 0.05 test will detect an improvement of the recovery rate from 0.3 to θ₁ > 0.3. When θ₁ is 0.5, a 67% improvement, this probability is only .3770. What is needed to improve on this situation is a larger sample size n. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. We return to this question in Section 4.3. From Figure 4.1.1 it appears that the power function is increasing (a proof will be given in Section 4.3). □

Example 4.1.4. One-Sided Tests for the Mean of a Normal Distribution with Known Variance. Suppose that X = (X₁, ..., Xₙ) is a sample from a N(μ, σ²) population with σ² known. (The σ² unknown case is treated in Section 4.5.) We want to test H : μ ≤ 0 versus K : μ > 0. This problem arises when we want to compare two treatments or a treatment and control (nothing) and both treatments are administered to the same subject. For instance, suppose we want to see if a drug induces sleep. We might, for each of a group of n randomly selected patients, record sleeping time without the drug (or after the administration of a placebo) and then after some time administer the drug and record sleeping time again. Let X_i be the difference between the time slept after administration of the drug and the time slept without administration of the drug by the ith patient. If we assume X₁, ..., Xₙ are normally distributed with mean μ and variance σ², then the drug effect is measured by μ, and H is the hypothesis that the drug has no effect or is detrimental, whereas K is the alternative that it has some positive effect.
Because X̄ tends to be larger under K than under H, it is natural to reject H for large values of X̄. It is convenient to replace X̄ by the test statistic T(X) = √n X̄/σ, which generates the same family of critical regions. The power function of the test with critical value c is

β(μ) = P_μ[√n(X̄ − μ)/σ ≥ c − √n μ/σ] = 1 − Φ(c − √n μ/σ) = Φ(−c + √n μ/σ)    (4.1.2)

because Φ(z) = 1 − Φ(−z). Because β(μ) is increasing,

α(c) = sup{β(μ) : μ ≤ 0} = β(0) = 1 − Φ(c).

The smallest c for which 1 − Φ(c) ≤ α is obtained by setting 1 − Φ(c) = α, or c = z(1 − α), where z(1 − α) is the (1 − α) quantile of the N(0, 1) distribution. □

The Heuristics of Test Construction

When hypotheses are expressed in terms of an estimable parameter, H : θ ∈ Θ₀ ⊂ R^p, and we have available a good estimate θ̂ of θ, it is clear that a reasonable test statistic is d(θ̂, Θ₀), where d is the Euclidean (or some equivalent) distance and d(x, S) = inf{d(x, y) : y ∈ S}. This minimum distance principle is essentially what underlies Examples 4.1.2 and 4.1.3. In Example 4.1.2, p = P[AA], N_AA/n is the MLE of p, and d(N_AA/n, Θ₀) = |N_AA/n − 1/3|. In Example 4.1.3, θ̂ = X̄ estimates θ and d(θ̂, Θ₀) = (θ̂ − θ₀)⁺, where y⁺ = y1(y > 0). Rejecting H for large values of this statistic is equivalent to rejecting for large values of X̄.

Given a test statistic T(X) we need to determine critical values and eventually the power of the resulting tests. The task of finding a critical value is greatly simplified if L_θ(T(X)) doesn't depend on θ for θ ∈ Θ₀. This occurs if Θ₀ is simple as in Example 4.1.3. But it occurs also in more interesting situations such as testing μ = μ₀ versus μ ≠ μ₀ if we have N(μ, σ²) observations with both parameters unknown (the t tests of Section 4.5). The key feature of situations in which L_θ(T(X)) = L₀ for all θ ∈ Θ₀ is usually invariance under the action of a group of transformations. See Lehmann (1997) and Volume II for discussions of this property.

In all of these cases, critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods. That is, if we generate i.i.d. T(X^(1)), ..., T(X^(B)) from L₀, then the test that rejects iff T(X) > T^((B+1)(1−α)), where T^(1) < ... < T^(B+1) are the ordered T(X), T(X^(1)), ..., T(X^(B)), has level α if L₀ is continuous and (B + 1)(1 − α) is an integer (Problem 4.1.9).

Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward.
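The Monte Carlo recipe just described can be sketched in a few lines. The following assumes Python with numpy; the example statistic and null sampler are our choices for illustration (the resulting critical value should be near z(0.95)).

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_critical_value(stat, sample_null, B=999, alpha=0.05):
    """Order-statistic critical value from B simulations under H:
    with B = 999 and alpha = 0.05, (B+1)(1-alpha) = 950 is an integer."""
    sims = np.sort([stat(sample_null()) for _ in range(B)])
    j = int((B + 1) * (1 - alpha))
    return sims[j - 1]                     # j-th order statistic (1-indexed)

n = 25
c = mc_critical_value(lambda x: np.sqrt(n) * x.mean(),
                      lambda: rng.standard_normal(n))
print(c)
```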
Example 4.1.5. Goodness of Fit Tests. Let X₁, ..., Xₙ be i.i.d. as X ~ F, where F is continuous. Consider the problem of testing H : F = F₀ versus K : F ≠ F₀. Let F̂ denote the empirical distribution function and consider as a test statistic the sup distance between the hypothesis F₀ and the plug-in estimate of F, the empirical distribution function F̂:

D_n = sup_x |F̂(x) − F₀(x)|,

which is called the Kolmogorov statistic. It can be shown (Problem 4.1.7) that D_n can be written as

D_n = max_{1≤i≤n} max{ i/n − F₀(x_(i)), F₀(x_(i)) − (i−1)/n }

where x_(1) < ... < x_(n) is the ordered observed sample, that is, the order statistics. This statistic has the following distribution-free property:

Proposition 4.1.1. The distribution of D_n under H is the same for all continuous F₀; that is,

P_{F₀}(D_n ≤ d) = P_U(D_n ≤ d),

where U denotes the U(0, 1) distribution.

Proof. Set U_i = F₀(X_i). Then, by the probability integral transformation, U_i ~ U(0, 1), 1 ≤ i ≤ n. Also

F̂(x) = n^{−1} Σ 1{X_i ≤ x} = n^{−1} Σ 1{F₀(X_i) ≤ F₀(x)} = Û(F₀(x)),

where Û denotes the empirical distribution function of U₁, ..., Uₙ. As x ranges over R, u = F₀(x) ranges over (0, 1), and the result follows:

D_n = sup_{0<u<1} |Û(u) − u|. □

What is remarkable is that the null distribution of D_n is independent of which F₀ we consider. Note that the hypothesis here is simple, so that for any one of these hypotheses F = F₀ the distribution can be simulated (or exhibited in closed form). The distribution of D_n has been thoroughly studied for finite and large n. In particular, for n ≥ 80, with h_n(t) = t/(√n + 0.12 + 0.11/√n), close approximations to the size α critical values k_α are h_n(1.628), h_n(1.358), and h_n(1.224) for α = .01, .05, and .10, respectively. □
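The order-statistic formula for D_n and the h_n approximation are easy to implement. The following is a minimal sketch, assuming Python with numpy and scipy; the simulated data are ours.

```python
import numpy as np
from scipy.stats import norm

def kolmogorov_D(x, F0):
    """D_n = max_i max{ i/n - F0(x_(i)), F0(x_(i)) - (i-1)/n }."""
    n = len(x)
    u = F0(np.sort(x))
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
D = kolmogorov_D(x, norm.cdf)            # test H: F = N(0,1)
k_alpha = 1.358 / (np.sqrt(100) + 0.12 + 0.11 / np.sqrt(100))  # size .05
print(D, k_alpha, D > k_alpha)
```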
Example 4.1.6. Goodness of Fit to the Gaussian Family. Suppose X₁, ..., Xₙ are i.i.d. as X ~ F and the hypothesis is H : F = Φ((· − μ)/σ) for some μ, σ², which is evidently composite. We can proceed as in Example 4.1.5 by rewriting H : F(μ + σx) = Φ(x) for all x, where μ = E_F(X₁), σ² = Var_F(X₁). The natural estimate of the parameter F(μ + σx) is F̂(X̄ + σ̂x), where X̄ and σ̂² are the MLEs of μ and σ². Applying the sup distance again, we obtain the statistic

T_n = sup_x |F̂(X̄ + σ̂x) − Φ(x)| = sup_x |Ĝ(x) − Φ(x)|,

where Ĝ is the empirical distribution of (Δ₁, ..., Δₙ) with Δ_i = (X_i − X̄)/σ̂, 1 ≤ i ≤ n. The joint distribution of (Δ₁, ..., Δₙ) doesn't depend on μ and σ² and, under H, is that of (Z_i − Z̄)/(n^{−1} Σ_j (Z_j − Z̄)²)^{1/2}, 1 ≤ i ≤ n, where Z₁, ..., Zₙ are i.i.d. from N(0, 1). Thus, T_n has the same distribution L₀ under H, whatever be μ and σ², and the critical value may be obtained by simulating i.i.d. observations Z_i, 1 ≤ i ≤ n, from N(0, 1), then computing the T_n corresponding to those Z_i. We do this B times independently, thereby obtaining T_{n1}, ..., T_{nB}. The Monte Carlo critical value is the (B + 1)(1 − α)th order statistic among T_{n1}, ..., T_{nB}.(3) □
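Here is a minimal self-contained sketch of the simulation just described, assuming Python with numpy and scipy; the sample size n and number of replications B are our choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def T_n(x):
    """sup_x |Ghat(x) - Phi(x)| for the studentized sample."""
    d = np.sort((x - x.mean()) / x.std())   # np.std divides by n: MLE scale
    i = np.arange(1, len(d) + 1)
    u = norm.cdf(d)
    return max(np.max(i / len(d) - u), np.max(u - (i - 1) / len(d)))

# Under H the law of T_n is the same for all (mu, sigma^2): use N(0,1).
n, B, alpha = 50, 999, 0.05
sims = np.sort([T_n(rng.standard_normal(n)) for _ in range(B)])
print(sims[int((B + 1) * (1 - alpha)) - 1])   # Monte Carlo critical value
```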
The p-Value: The Test Statistic as Evidence

Different individuals faced with the same testing problem may have different criteria of size. Experimenter I may be satisfied to reject the hypothesis H using a test with size α = 0.05, whereas experimenter II insists on using α = 0.01. It is then possible that experimenter I rejects the hypothesis H, whereas experimenter II accepts H on the basis of the same outcome x of an experiment. If the two experimenters can agree on a common test statistic T, this difficulty may be overcome by reporting the outcome of the experiment in terms of the observed size or p-value or significance probability of the test. This quantity is a statistic that is defined as the smallest level of significance α at which an experimenter using T would reject on the basis of the observed outcome x. That is, if the experimenter's critical value corresponds to a test of size less than the p-value, H is not rejected; otherwise, H is rejected.

Consider, for instance, Example 4.1.4. If we observe X = x = (x₁, ..., xₙ), we would reject H if, and only if, √n x̄/σ ≥ z(1 − α) or, upon applying Φ to both sides, if, and only if, α ≥ Φ(−T(x)). Thus, if X = x, the p-value is

Φ(−T(x)) = Φ(−√n x̄/σ).    (4.1.4)

Considered as a statistic, the p-value is Φ(−√n X̄/σ).
or equivalently.Oo)]n. (4.0. (I . and S tends to be large when K : () = 01 > 00 is . OIl > 0. The statistic L is reasonable for testing H versuS K with large values of L favoring K over H. is measured in the NeymanPearson theory.)1 5 [(1. then. p(x. 1) distribution. test statistics. For instance.1. o:(Td has a U(O. The statistic L takes on the value 00 when p(x. This is an instance of testing goodness offit. we try to maximize the probability (power) of rejecting H when K is true. that is. under H. a given test statistic T. In this section we will consider the problem of finding the level a test that ha<. power function. 00 . and pvalue. 0) is the density or frequency function of the random vector X.00)/00 (1 . and.2 CHOOSING A TEST STATISTIC: THE NEYMANPEARSON LEMMA We have seen how a hypothesistesting problem is defined and how perfonnance of a given test b. Typically a test statistic is not given but must be chosen on the basis of its perfonnance. OIl = p (x.2 and 3. where eo and e 1 are disjoint subsets of the parameter space 6. in the binomial example (4.Otl/(1 . size. critical regions. we specify a small number 0: and conStruct tests that have at most probability (significance level) 0: of rejecting H (deciding K) when H is true. In Sections 3.Section 4./00)5[(1. (null) hypothesis H and alternative (hypothesis) K. L(x.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. 4.3). 0.2.00)ln~5 [0. significance level.2 Choosing a Test Statistic The NeymanPearson lemma 223 The preceding paragraph gives an example in which the hypothesis specifies a distribution completely. type I error. we test whether the distribution of X is different from a specified Fo. We start with the problem of testing a simple hypothesis H : () = ()o versus a simple alternative K : 0 = 01. type II error. (0. the highest possible power. OIl where p(x. equals 0 when both numerator and denominator vanish. We introduce the basic concepts and terminology of testing statistical hypotheses and give the NeymanPearson framework. In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by L(x. that is. In particular. Summary.) which is large when S true. I) = EXi is large. 00 ) = 0. subject to this restriction. power. test functions.01 )/(1. by convention. We introduce the basic concepts of simple and composite hypotheses. 00. 0) 0 p(x. Such a test and the corresponding test statistic are called most poweiful (MP). In the NeymanPearson framework. we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter () belongs to 80 or e 1.
we choose 'P(x) = 0 if S < 5. To this end consider (4. we consider randomized tests '1'. and suppose r.Bo. which are tests that may take values in (0. B .05 in Example 4. the interpretation is that we toss a coin with probability of heads cp(x) and reject H iff the coin shows heads. where 7l' denotes the prior probability of {Bo}. If 0 < 'P(x) < 1 for the observation vector x. Because we want results valid for all possible test sizes Cl' in [0. Such randomized tests are not used in practice. thns.'P(X)[L(X.2.4) 1 j 7 . 'Pk(X) = 1 if p(x.1]. Note that a > 0 implies k < 00 and.'P(X)] a..p is a level a test.'P(X)] [:~~:::l.05 . there exists k such that (4.kJ}. (a) If a > 0 and I{Jk is a size a likelihood ratio test.1).7l'). then ~ . (4.2.'P(X)] > kEo['PdX) . and because 0 < 'P(x) < 1. that is. Bo) = O. Bo.'P(X)] > O.1. 1. then it must be a level a likelihood ratio test. We show that in addition to being Bayes optimal. then <{Jk is MP in the class oflevel a tests. 'P(x) = 1 if S > 5.) For instance.3 with n = 10 and Bo = 0. using (4. BIl o . Theorem 4. if want size a ~ . likelihood ratio tests are unbeatable no matter what the size a is.0262 if S = 5. (NeymanPearson Lemma). 0 < <p(x) < 1.['Pk(X) .. . They are only used to show that with randomization.2. then I > O. Bo) = O} = I + II (say).2.224 Testing and Confidence Regions Chapter 4 We call iflk a likelihood ratio or NeymanPearsoll (NP) test ifunction) if for some 0 k < OJ we can write the test function Yk as 'Pk () = X < 1 o if L( x. Proof. (See also Section 1. B. BI )  . i' ~ E.P(S > 5)I!P(S = 5) = .2. and 'P(x) = [0. B .k is < 0 or > 0 according as 'Pk(X) is 0 or 1. (b) For each 0 < a < 1 there exists an MP size 0' likelihood ratio test provided that randomization is permitted.. 'Pk is MP for level E'c'PdX).k] + E'['Pk(X) .1) if equality occurs.2) forB = BoandB = B. we have shown that I j I i E. If L(x.3). (a) Let E i denote E8. (c) If <p is an MP level 0: test. k] . Note (Section 3.2) that 'Pk is a Bayes rule with k ~ 7l' /(1 .'P(X)] = Eo['Pk{X) .[CPk(X) . Eo'P(X) < a.'P(X)] 1{P(X. where 1= EO{CPk(X)[L(X. o Because L(x.) . i = 0.3) 1 : > O.3.Bo.3.('Pk(X) .kEo['Pk(X) .) . It follows that II > O.'P(X)] . EO'PdX) We want to show E.B. Finally. lOY some x.Bd >k <k with 'Pdx) any value in (0.1.
4) we need tohave'P(x) ~ 'Pk(X) = I when L(x. we conc1udefrom (4.I.1T)p(x.2. If not. 0. 00 . 00 . 0.1. 'Pk(X) = 1 and !.2. OJ! = ooJ ~ 0.pk is MP size a. for 0 < a < 1.OI)' Proof.2. 'P(X) > a with equality iffp(" 60 ) = P(·.. denotes the Bayes procedure of Exarnple 3.v) + nv ] 2(72 x . Next consider 0 < Q: < 1. Consider Example 3. 2 (7 T(X) ~. 0.. decides 0. Therefore. Let Pi denote Po" i = 0.5) If 0. Now 'Pk is MP size a.2) holds forO = 0.) . OJ! > k] > If Po[L(X.) (1 .Section 4. 'Pk(X) =a .. k = 0 makes E.k] on the set {x : L(x. that is.2.fit. k = 00 makes 'Pk MP size a.0. Then the posterior probability of ().2.fit [ logL(X. OJ! > k] < a and Po[T. Because Po[L(X.O.) < k. 0.1.3.2. 0. It follows that (4.v)=exp 2LX'2 ./f 'P is an MP level a test. is 1T(O. then 'Pk is MP size a. See Problem 4.2. 00 ) (11T)L(x.Po[L(X. 0. I x) = (1 1T)p(X. (c) Let x E {x: p(x.1T)L(x.) > k] Po[L(X. We found nv2} V n L(X. . Here is an example illustrating calculation of the most powerful level a test <{Jk. { (7 i=l 2(7 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. 0. . 00 .) (1 .2(b). 00 .Oo.L. tben.) > k and have 'P(x) = 'Pk(X) ~ 0 when L(x. = 'Pk· Moreover.2. 0. also be easily argued from this Bayes property of 'Pk (Problem 4.2.. 0.(X.Oo. The same argument works for x E {x : p(x. Let 1T denote the prior probability of 00 so that (I 1T) is the prior probability of (it.5) thatthis 0. or 00 according as 1T(01 Ix) is larger than or smaller than 1/2. OJ..10). 0. define a.2. where v is a known signal. then E9.u 2 ) random variables with (72 known and we test H : 11 = a versus K : 11 = v.) = k] = 0.) + 1Tp(X. 00 . 60 . = v. Part (a) of the lemma can. when 1T = k/(k + 1).2 Choosing a Test Statistic: The NeymanPearson lemma " 225 =:cc (b) If a ~ 0. OJ Corollary 4.2 where X = (XI.O. Example 4. 00 ) > and 0 = 00 .1. Remark 4. 00 .7. If a = 1..) + 1T' (4. 00 .) = k j. 0 It follows from the NeymanPearson lemma that an MP test has power at least as large as its level. X n ) is a sample of n N(j. then there exists k < 00 such that Po[L(X.O. 00 .) > then to have equality in (4.
they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population CQvanances. the test that rejects if. (JI correspond to two known populations and we desire to classify a new observation X as belonging to one or the other. . O)T and Eo = I. The following important example illustrates.. l. ). if we want the probability of detecting a signal v to be at least a preassigned value j3 (say. • • . The likelihood ratio test for H : (J = 6 0 versus K : (J = (h is based on . say. We will discuss the phenomenon further in the next section.90 or . In this example we bave assnmed that (Jo and (J. for the two popnlations are known. This is the smallest possible n for any size Ct test. By the NeymanPearsoo lemma this is the largest power available with a level Ct test.2. Example 4. among other things./ii/(J)) = 13 for n and find that we need to take n = ((J Iv j2[z(la) + z(I3)]'. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Flmction.2. itl' However if.9).4.Jto)E01X large..2.a)[~6Eo' ~oJ! (Problem 4. . I I ." The function F is known as the Fisher discriminant function. however. if ito. But T is the test statistic we proposed in Example 4.7) where c ~ z(1 . Such a test is called uniformly most powerful (UMP). 0. <I>(z(a) + (v. > 0 and E l = Eo. T>z(la) (4. large. E j ). . (Jj = (Pj.' Rejecting H for L large is equivalent to rejecting for I 1 . Particularly important is the case Eo = E 1 when "Q large" is equivalent to "F = (ttl . We return to this in Volume II.. Eo are known..1.2. this is no longer the case (Problem 4. . Thus. that the UMP test phenomenon is largely a feature of onedimensional parameter problems. then. by (4.2. It is used in a classification context in which 9 0 . Note that in general the test statistic L depends intrinsically on ito. and only if.226 Testing and Confidence Regions Chapter 4 is also optimal for this problem. If ~o = (1. Suppose X ~ N(Pj.2. then this test rule is to reject H if Xl is large.) test exists and is given by: Reject if (4. a UMP (for all ). If this is not the case. then we solve <1>( z(a) + (v./iil (J)).0.2). The power of this test is..I.. if Eo #. itl = ito + )"6. j = 0. From our discussion there we know that for any specified Ct.2.95). 6. 0.8).6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > O.1. . 0 An interesting feature of the preceding example is that the test defined by (4. E j ).6) has probability of type I error 0:.
Ok)' The simple hypothesis would correspond to the theory that the expected proportion of offspring of types 1. 0) P n! nn. then (N" . here is the general definition of UMP: Definition 4. N k ) has a multinomial M(n. . 1 (}k) distribution with frequency function. A level a test 'P' is wtiformly most powerful (UMP) for H : 0 E versus K : 0 E 8 1 if eo (3(0.. For instance. .2 for deciding between 00 and Ol.)N' i=l to Here is an interesting special case: Suppose OjO integer I with 1 < I < k > 0 for all}.p. . . We note the connection of the MP test to the Bayes procedure of Section 3.. if a match in a genetic breeding experiment can result in k types. Nk) .3. .nk. .···..3 UNIFORMLY MOST POWERFUL TESTS ANO MONOTONE LIKELIHOOD RATIO MODELS We saw in the two Gaussian examples of Section 4.. Two examples in which the MP test does not depend on 0 1 are given. 0" . Before we give the example.u k nn.. is established. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : (j = ()o versus the simple alternative K : 0 = ()1.1.'P) for all 0 E 8" for any other level 0:' (4.. nk are integers summing to n.3.3. Example 4. .. 0 < € < 1 and for some fixed (4. Suppose that (N1 . . This phenomenon is not restricted to the Gaussian case as the next example illustrates. which states that the size 0:' SLR test is uniquely most powerful (MP) in the class of level a tests.3.. 'P') > (3(0. However.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 227 Summary. .. Ok = OkO. With such data we often want to test a simple hypothesis H : (}I = Ow. The NeymanPearson lemma. . (}l. . . r lUI nt···· nk· ". ...M( n.OkO' Usually the alternative to H is composite. and N i is the number of offspring of type i.2) where . k are given by Ow. .Section 4. Testing for a Multinomial Vector.1) test !. . the likelihood ratio L 's L~ rr(:."" (}k = Oklo In this case. there is sometimes a simple alternative theory K : (}l = (}ll.2 that UMP tests for onedimensional parameter problems exist.. _ (nl.1. where nl. n offspring are observed... Such tests are said to be UMP (unifonnly most powerful). 4..
X ifT(x) < t ° (4. p(x. B( n.3.nu)!". Example 4. 0 form with T(x) .3 continned).o)n[O/(1 .(X) = '" > 0. .1. for". Critical values for level a are easily determined because Nl . < 1 implies that p > E. J Definition 4. Then L = pnN1£N1 = pTl(E/p)N1.. = P( N 1 < c). Consider the oneparameter exponential family mode! o p(x.. equals the likelihood ratio test 'Ph(t) and is MP.228 Testmg and Confidence Regions Chapter 4 That is. we have seen three models where. we get radically different best tests depending on which Oi we assume to be (ho under H. If {P..6. 00 . type I is less frequent than under H and the conditional probabilities of the other types given that type I has not occurred are the same under K as they are under H. Moreover. it is UMP at level a = EOodt(X) for testing H : e = eo versus K : B > Bo in fact.3. Example 4. Consider the problem of testing H : 0 = 00 versus K: 0 ~ 01 with 00 < 81. there is a statistic T such that the test with critical region {x : T(x) > c} is UMP.. Theorem 4. we conclude that the MP lest rejects H. 0 .2) with 0 < f < I}. Oil = h(T(x)) for some increasing function h.2 (Example 4.1) MLR in s. Ow) under H. 0 Typically the MP test of H : () = eo versus K : () = (}1 depends on (h and the test is not UMP.nx/u and ry(!") Define the NeymanPearson (NP) test function 6 ( ) _ 1 ifT(x) > t . In this i.3. N 1 < c. this test is UMP fortesting H versus K : 0 E 8 1 = {/:/ : /:/ . 6. .3. = h(x)exp{ry(O)T(x) ~ B(O)}. Note that because l can be any of the integers 1 .. . The family of models {P. = (I .O) = 0'(1 . . Example 4. where u is known.3) with 6. = . If 1J(O) is strictly increasing in () E e. :.. then this family is MLR. is of the form (4. for testing H : 8 < /:/0 versus K:O>/:/I' .(X) is increasing in 0. is an MLRfamily in T(x). This is part of a general phenomena we now describe. Because f. then 6.00).d. Bernoulli case.1) if T(x) = t.i.1 is of this (. Oil is an increasing function ofT(x). e c R. . the power function (3(/:/) = E. : 0 E e} with e c R is said to be a monotone likelihood ratio (MLR) family if for (it < O the distributions POl and P0 2 are distinct and 2 the ratio p(x.o)n. in the case of a real parameter. under the alternative. Suppose {P.2.(x) any value in (0. . .OW and the model is by (4. Because dt does not depend on ()l.2. set s then = l:~ 1 Xi. . if and only if.. 02)/p(X. 0) ~.3. e c R. (1) For each t E (0.3.3.1. is UMP level"..: 0 E e}. : 8 E e}. then L(x.. However.2. is an MLRfamily in T(x). Thus. k. . (2) If E'o6.
N. is UMP level a. If a is a value taken on by the distribution of X.3. Suppose tha~ as in Example l. Because the class of tests with level 0' for H : (J < (Jo is contained in the class of tests with level 0' for H : (J = (Jo. (8 < s(a)) = a.. If the inspector making the test considers lots with bo = N(Jo defectives or more unsatisfactory. N .o. Testing Precision.U 2 ) population.3. X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives. and because dt maximizes the power over this larger class.Section 4. where b = N(J.Xn is a sample from a N(I1. Eoo.l. d t is UMP for H : (J < (Jo versus K : (J > (Jo.<» is lfMP level afor testing H: (J < (Jo versus K : (J > (Jo.. distribution.(X). For simplicity suppose that bo > n.1. Suppose {Po: 0 E Example 4. Let S = l:~ I (X. then e}.<>.Xn . 0. and we are interested in the precision u. is an MLRfamily in T(r). Quality Control. the alternative K as (J < (Jo.l. Corollary 4. . if N0 1 = b1 < bo and 0 < x < b1 .( NOo. 0) = exp {~8 ~ IOg(27r0'2)} . To show (2).. 20' 2  This is a oneparameter exponential family and is MLR in T = So The UMP level 0' test rejects H if and only if 8 < s(a) where s(a) is such that Pa . Suppose X!. distribution.5. where U O represents the minimum tolerable precision.4.. 0 Proof (I) follows from b l = iPh(t) The following useful result follows immediately. 0 Example 4.(bl1) x. she formulates the hypothesis H as > (Jo. X < h(a). If 0 < 00 . . _1')2. and specifies an 0' such that the probability of rejecting H (keeping a bad lot) is at most 0'. the critical constant 8(0') is u5xn(a).l.(X) for testing H: 0 = 01 versus J( : 0 = 0.l of the measurements Xl. Ifwe write ~= Uo t i''''l (Xi _1')2 000 we see that Sju5 has a X~ distribution. Then. (X) < <> and b t is of level 0: for H : (J :S (Jo. e c p(x. we now show that the test O· with reject H if. . If the distributionfunction Fo ofT(X) under X"" POo is continuous and ift(1a) isasolution of Fo(t) = 1 . n). recall that we have seen that e5 t maximizes the power for testing II : (J = (Jo versus K : (J > (Jo among the class of tests with level <> ~ Eo. where 11 is a known standard. .l) yields e L( 0 0 )=b.o. For instance. where xn(a) is the ath quantile of the X. Because the most serious error is to judge the precision adequate when it is not. we test H : u > Uo l versus K : u < uo.2. 1 bo(bol) (blX+l)(Nbl) (box+1)(Nbo) (Nbln+x+l) (Nbon+x+l)' . then the test/hat rejects H if and only ifT(r) > t(1 . and only if.bo > n.3 Uniformly Most Po~rful Tests and Monotone likelihood Ratio Models 229 and Corollary 4.. H. where h(a) is the ath quantile of the hypergeometric. then by (1). (l. we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard.3. R.1 by noting that for any (it < (J2' e5 t is MP at level Eo. Thus.
The following useful result follows immediately.

Corollary 4.3.1. Suppose {P_θ : θ ∈ Θ}, Θ ⊂ R, is an MLR family in T(x). If the distribution function F₀ of T(X) under X ~ P_{θ₀} is continuous and if t(1 − α) is a solution of F₀(t) = 1 − α, then the test that rejects H if, and only if, T(x) ≥ t(1 − α) is UMP level α for testing H : θ ≤ θ₀ versus K : θ > θ₀.

Example 4.3.4. Testing Precision. Suppose X₁, ..., Xₙ is a sample from a N(μ, σ²) population, where μ is a known standard, and we are interested in the precision σ⁻¹ of the measurements X₁, ..., Xₙ. For instance, we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard. Because the most serious error is to judge the precision adequate when it is not, we test H : σ ≥ σ₀ versus K : σ < σ₀, where σ₀ represents the minimum tolerable precision. Let S = Σ_{i=1}^n (X_i − μ)². Then

p(x, θ) = exp{ −S/(2σ²) − (n/2) log(2πσ²) }.

This is a one-parameter exponential family and is MLR in T = −S. The UMP level α test rejects H if, and only if, S ≤ s(α), where s(α) is such that P_{σ₀}(S ≤ s(α)) = α. If we write S/σ₀² = Σ_{i=1}^n (X_i − μ)²/σ₀², we see that S/σ₀² has a χ²_n distribution under σ = σ₀. Thus, the critical constant is s(α) = σ₀² x_n(α), where x_n(α) is the αth quantile of the χ²_n distribution. □

Example 4.3.5. Quality Control. Suppose that, as in Example 1.1.1, X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives, where b = Nθ. If the inspector making the test considers lots with b₀ = Nθ₀ defectives or more unsatisfactory, she formulates the hypothesis H as θ ≥ θ₀, the alternative K as θ < θ₀, and specifies an α such that the probability of rejecting H (keeping a bad lot) is at most α. For simplicity suppose that b₀ ≥ n, N − b₀ ≥ n. Then, if Nθ₁ = b₁ < b₀ and 0 ≤ x ≤ b₁,

L(x, θ₀, θ₁) = [b₁(b₁ − 1) ⋯ (b₁ − x + 1) (N − b₁)(N − b₁ − 1) ⋯ (N − b₁ − n + x + 1)] / [b₀(b₀ − 1) ⋯ (b₀ − x + 1) (N − b₀)(N − b₀ − 1) ⋯ (N − b₀ − n + x + 1)].

Note that L(x, θ₀, θ₁) = 0 for b₁ < x ≤ n, and, for 0 ≤ x < b₁,

L(x + 1, θ₀, θ₁) / L(x, θ₀, θ₁) = (b₁ − x)(N − b₀ − n + x + 1) / [(b₀ − x)(N − b₁ − n + x + 1)] < 1.

Thus, L is decreasing in x, and the hypergeometric model is an MLR family in T(x) = −x. It follows that the test δ* that rejects H if, and only if, X ≤ h(α) is UMP level α, where h(α) is the αth quantile of the hypergeometric, H(b₀, N, n), distribution. The critical values for the hypergeometric distribution are available on statistical calculators and software. □
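For illustration, the cutoff h(α) can be found with scipy's hypergeometric distribution. The lot and sample sizes below are made-up values, and scipy's parameterization (population size, number of defectives, draws) is noted in the comments.

```python
from scipy.stats import hypergeom

def acceptance_cutoff(N, b0, n, alpha=0.05):
    """Largest h with P(X <= h) <= alpha when X ~ Hypergeometric;
    the UMP test rejects H (judges the lot acceptable) iff X <= h.
    scipy's hypergeom(M, K, N) takes (population, successes, draws)."""
    rv = hypergeom(N, b0, n)
    h = -1                       # h = -1 means "never reject"
    while rv.cdf(h + 1) <= alpha:
        h += 1
    return h

print(acceptance_cutoff(N=1000, b0=100, n=50))
```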
we find (3(0) = PotS > so) = <I> ( [nO(l ~ 0)11/ 2 Now consider the indifference region (80 . Suppose 8 is a vector. we next show how to find the sample size that will "approximately" achieve desired power {3 for the size 0: test in the binomial example. Bd.5).3. and 0." The reason is that n is so large that unimportant small discrepancies are picked up. 0 ° Our discussion can be generalized.1.86. and only if. using the SPLUS package) for the level .. > O.90 of detecting the 17% increase in 8 from 0. 00 = 0. we can have very great power for alternatives very close to O. It is natural to associate statistical significance with practical significance so that a very low pvalue is interpreted as evidence that the alternative that holds is physically significant.1.05. In Example 4. to achieve approximate size a.3 to 0.35. Dt.3 continued). ifn is very large and/or a is small. Such hypotheses are often rejected even though for practical purposes "the fit is good enough.05)2{1.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 231 Dual to the problem of not having enough power is that of having too much. We solve For instance. if Oi = . Our discussion uses the classical normal approximation to the binomial distribution.4.0.Section 4. Thus.3.2) shows that. Often there is a function q(B) such that H and K can be formulated as H : q(O) <: qo and K : q(O) > qo. far from the hypothesis.90.282 x 0. There are various ways of dealing with this problem. test for = .645 x 0. Again using the nonnal approximation. that is. (h = 00 + Dt.55)}2 = 162. 1. when we test the hypothesis that a very large sample comes from a particular distribution.3(0. Example 4.Oi) [nOo(1 . This problem arises particularly in goodnessoffit tests (see Example 4.) = (3 for n and find the approximate solution no+ 1 . Formula (4. They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is 0:.4. where (3(0.80 ) .1.3 requires approximately 163 observations to have probability . we solve (3(00 ) = Po" (S > s) for s using (4. Now let . the size .1.6 (Example 4. As a further example and precursor to Section 5.4. = 0.4) and find the approximate critical value So ~ nOo 1 + 2 + z(l.7) + 1.00 )] 1/2 .35.05 binomial test of H : 8 = 0.4 this would mean rejecting H if.35(0.35 and n = 163 is 0. we fleed n ~ (0. (3 = . The power achievable (exactly. First.
1) 1(0. Suppose that j3(O) depends on () only through q( (J) and is a continuous increasing function of q( 0). In general. (4. The theory we have developed demonstrates that if C.. . 0 = Complete Families of Tests The NeymanPearson framework is based on using the 01 loss function. to the F test of the linear model in Section 6. is such that q( Oil = q. Although H : 0. for instance.4) j j 1 . the critical value for testing H : 0.3.(0) = o} aild 9. when testing H : 0 < 00 versus K : B > Bo.4. a reasonable class of loss functions are those that satisfy : 1 i j 1 1(0.. and also increases to 1 for fixed () E 6 1 as n .7.+ 00. We have seen in Example 4. Suppose that in the Gaussian model of Example 4.9. 00).2 is (}'2 = ~ E~ 1 (Xi . . Implicit in this calculation is the assumption that POl [T > col is an increasing function ofn. To achieve level a and power at least (3. is the a percentile of X~l' It is evident from the argument of Example 4. Thus.< (10 and rejecting H if Tn is small.(0) > O} and Co (Tn) = C'(') (Tn) for all O..Bo)..£ is unknown.3.. I}. first let Co be the smallest number c such that Then let n be the smallest integer such that P" IT > col > fJ where 00 is such that q( 00 ) = qo and 0. For instance. a E A = {O. Then the MLE of 0. However. J. It may also happen that the distribution of Tn as () ranges over 9.(0) so that 9 0 = {O : ). I .232 Testing and Confidence Regions Chapter 4 q.< (10 among all tests depending on <. This procedure can be applied. > qo be a value such that we want to have power fJ(O) at least fJ when q(O) > q.3.0) > ° I(O. = {O : ). the distribution of Tn ni:T 2/ (15 is X.O)<O forB < 00 forB>B o.. Testing Precision Continued. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(O. Reducing the problem to choosing among such tests comes from invariance consideration that we do not enter into until Volume II.3 that this test is UMP for H : (1 > (10 versus K : 0.1 by taking q( 0) equal to the noncentrality parameter governing the distribution of the statistic under the alternative.(Tn ) is an MLR family. Example 4. we illustrate what can happen with a simple example. 0 E 9. then rejecting for large values of Tn is UMP among all tests based on Tn. 0 E 9. a).Xf as in Example 2.3.2 only.= (10 versus K : 0. = (tm. for 9.5 that a particular test statistic can have a fixed distribution £0 under the hypothesis. The set {O : qo < q(O) < ql} is our indjfference region. . is detennined by a onedimensional parameter ). we may consider l(O.l)l(O.1. 0) = (B . For each n suppose we have a level a test for H versus I< based on a suitable test statistic T..= 00 is now composite. that are not 01.l' independent of IJ.2.
if the model is correct and loss function is appropriate. is UMP for H : 0 < 00 versus K: 8 > 8 0 by Theorem 4. Theorem 4.1 and.5).O) + [1(0. Proof. Now 0. hence.I"(X) 0 for allO then O=(X) clearly satisfies (4. INTERVALS.o. then any procedure not in the complete class can be matched or improved at all () by one in the complete class.5) That is.(X)) = 1.12) and. 1") for all 0 E e.3. a) satisfies (4. it isn't worthwhile to look outside of complete classes.(I"(X))) < 0 for 8 > 8 0 . Thus. O))(E.3. e 4.o) < R(O.E.3. 1) 1(0. E. 0 Summary. The risk function of any test rule 'P is R(O. If E. we show that for MLR models.3.(2) a v R(O.O)]I"(X)} Let o.4 Confidence Bounds.E. locally most powerful (LMP) tests in some cases can be found. hence. Suppose {Po: 0 E e}. for some 00 • E"o.3) with Eo. is an MLR family in T( x) and suppose the loss function 1(0.0.(x)) .(X) = E"I"(X) > O. AND REGIONS We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter {J i" a .(o.E. We consider models {Po : () E e} for which there exist tests that are most powerful for every () in a composite alternative t (UMP tests). e c R. a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing ()o versus 8 1 is an increasing function of a statistic T( x) for every ()o < ()1. In the following the decision procedures arc test functions. For MLR models. the risk of any procedure can be matched or improved by an NP test.(X) = ".(X) be such that. = R(O.(X) > 1. We also show how.I"(X) for 0 < 00 .I") ~ = EO{I"(X)I(O.O)} E.4). 1") = (1(0.6) But 1 .3.3. is complete. 1) 1(O.o.{1(8. then the class of tests of the form (4.3. Finally. (4. (4. For () real. 0 < " < 1. and Regions 233 if for any decision rule 'P The class D of decision procedures is said to be there exists E such that complete(I).3.5) holds for all 8. 1) + [1 1"(X)]I(O. (4.2.Section 4.R(O. Thus. the class of NP tests is complete in the sense that for loss functions other than the 01 loss function.3. Intervals. is similarly UMP for H : 0 > 8 0 versus K : 0 < 00 (Problem 4. the test that rejects H : 8 < 8 0 for large values of T(x) is UMP for K : 8 > 8 0 . In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from H. when UMP tests do not exist.4 CONFIDENCE BOUNDS.(Io.) .
member of a specified set Θ0. Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability 1 − α. As an illustration, consider the case in which X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known, and suppose that μ represents the mean increase in sleep among patients administered a drug. Then we can use the experimental outcome X = (X1, ..., Xn) to establish a lower bound μ(X) for μ with a prescribed probability (1 − α) of being correct. In our example this is achieved by writing

  P(√n(X̄ − μ)/σ ≤ z(1 − α)) = 1 − α.

By solving the inequality inside the probability for μ, we find

  P(X̄ − σz(1 − α)/√n ≤ μ) = 1 − α,

so μ(X) = X̄ − σz(1 − α)/√n is a lower bound with P(μ(X) ≤ μ) = 1 − α. We say that μ(X) is a lower confidence bound with confidence level 1 − α. Similarly, we may be interested in an upper bound on a parameter; that is, we look for a statistic μ̄(X) such that P(μ ≤ μ̄(X)) = 1 − α, and a solution is μ̄(X) = X̄ + σz(1 − α)/√n. Here μ̄(X) is called an upper level (1 − α) confidence bound for μ. Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find a such that the probability that the interval [X̄ − a, X̄ + a] contains μ is 1 − α. This is achieved by taking

  μ±(X) = X̄ ± σz(1 − α/2)/√n,

and we say that [μ−(X), μ+(X)] is a level (1 − α) confidence interval for μ.

In general, if ν = ν(P), P ∈ P, is a parameter, and X ~ P, X ∈ R^q, it may not be possible for a bound or interval to achieve exactly probability (1 − α) for a prescribed (1 − α) such as .95. In this case, we settle for a probability at least (1 − α) of being correct.
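As a computational aside (ours, not part of the original text), here is a minimal Python sketch of these known-σ bounds and interval; numpy/scipy are assumed and the function name is illustrative:

```python
import numpy as np
from scipy.stats import norm

def normal_mean_bounds(x, sigma, alpha=0.05):
    """Level (1 - alpha) lower/upper confidence bounds and two-sided interval
    for mu when X_1, ..., X_n are i.i.d. N(mu, sigma^2) with sigma known."""
    n, xbar = len(x), np.mean(x)
    se = sigma / np.sqrt(n)
    lower = xbar - norm.ppf(1 - alpha) * se      # P(lower <= mu) = 1 - alpha
    upper = xbar + norm.ppf(1 - alpha) * se      # P(mu <= upper) = 1 - alpha
    half = norm.ppf(1 - alpha / 2) * se          # two-sided half-width
    return lower, upper, (xbar - half, xbar + half)
```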
Definition 4.4.1. A statistic ν(X) is called a level (1 − α) lower confidence bound for ν if for every P ∈ P,

  P[ν(X) ≤ ν] ≥ 1 − α.

Similarly, ν̄(X) is called a level (1 − α) upper confidence bound for ν if for every P ∈ P,

  P[ν̄(X) ≥ ν] ≥ 1 − α.

The random interval [ν(X), ν̄(X)] formed by a pair of statistics ν(X), ν̄(X) is a level (1 − α) or a 100(1 − α)% confidence interval for ν if, for all P ∈ P,

  P[ν(X) ≤ ν ≤ ν̄(X)] ≥ 1 − α.

The quantities on the left are called the probabilities of coverage and (1 − α) is called a confidence level. For a given bound or interval, the confidence level is clearly not unique because any number (1 − α′) ≤ (1 − α) will be a confidence level if (1 − α) is. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level. Note that in the case of intervals this is just inf{P[ν(X) ≤ ν ≤ ν̄(X)] : P ∈ P} (i.e., the minimum probability of coverage). For the normal measurement problem we have just discussed, the probability of coverage is independent of P and equals the confidence coefficient.

In the preceding discussion we used the fact that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution to obtain a confidence interval for μ by solving −z(1 − α/2) ≤ Z(μ) ≤ z(1 − α/2) for μ. In this process Z(μ) is called a pivot. In general, finding confidence intervals (or bounds) often involves finding appropriate pivots.

Example 4.4.1. The (Student) t Interval and Bounds. Let X1, ..., Xn be a sample from a N(μ, σ²) population. We now turn to the σ² unknown case and propose the pivot T(μ) obtained by replacing σ in Z(μ) by its estimate s, where

  s² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².

In order to derive the distribution of T(μ), note that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution and, by Theorem B.3.3, is independent of V = (n − 1)s²/σ², which has a χ²_{n−1} distribution. We conclude from the definition of the (Student) t distribution in Section B.3.1 that

  T(μ) = √n(X̄ − μ)/s
has the t distribution T_{n−1}. Let t_k(p) denote the pth quantile of the T_k distribution. Then

  P(−t_{n−1}(1 − α/2) ≤ T(μ) ≤ t_{n−1}(1 − α/2)) = 1 − α,

and, solving the inequality inside the probability for μ, we find

  P[X̄ − s t_{n−1}(1 − α/2)/√n ≤ μ ≤ X̄ + s t_{n−1}(1 − α/2)/√n] = 1 − α

whatever be μ and σ². The shortest level (1 − α) confidence interval of the type [X̄ − sc/√n, X̄ + sc/√n] is, thus,

  X̄ ± s t_{n−1}(1 − α/2)/√n.    (4.4.1)

Similarly, X̄ − s t_{n−1}(1 − α)/√n and X̄ + s t_{n−1}(1 − α)/√n are natural lower and upper confidence bounds with confidence coefficients (1 − α).

To calculate the coefficients t_{n−1}(1 − α/2) and t_{n−1}(1 − α), we use a calculator, computer software, or Tables I and II. From the results of Section B.7, we see that as n → ∞ the T_{n−1} distribution converges in law to the standard normal distribution; for the usual values of α, we can reasonably replace t_{n−1}(p) by the standard normal quantile z(p) for n > 120. For instance, if n = 9 and α = 0.01, we enter Table II to find that the probability that a T_8 variable exceeds 3.355 is .005. Thus, t_8(1 − α/2) = 3.355 and [X̄ − 3.355 s/3, X̄ + 3.355 s/3] is the desired level .99 confidence interval.

Up to this point, we have assumed that X1, ..., Xn are i.i.d. N(μ, σ²). It turns out that the distribution of the pivot T(μ) is fairly close to the T_{n−1} distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal. In this case the interval (4.4.1) has confidence coefficient close to 1 − α. On the other hand, for very skew distributions such as the χ² with few degrees of freedom, or very heavy-tailed distributions such as the Cauchy, the confidence coefficient of (4.4.1) can be much larger than 1 − α. The properties of confidence intervals such as (4.4.1) in non-Gaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5. □

Example 4.4.2. Confidence Intervals and Bounds for the Variance of a Normal Distribution. Suppose that X1, ..., Xn is a sample from a N(μ, σ²) population. Then V(σ²) = (n − 1)s²/σ² has a χ²_{n−1} distribution and can be used as a pivot. Thus, if we let x(β) = x_{n−1}(β) denote the βth quantile of the χ²_{n−1} distribution, and if α1 + α2 = α, then

  P(x(α1) ≤ V(σ²) ≤ x(1 − α2)) = 1 − α.

By solving the inequality inside the probability for σ², we find that

  [(n − 1)s²/x(1 − α2), (n − 1)s²/x(α1)]    (4.4.2)

is a confidence interval with confidence coefficient (1 − α).
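For concreteness, here is a short Python sketch (our illustration, not the text's) of (4.4.1) and of Example 4.4.2 with equal tails α1 = α2 = α/2:

```python
import numpy as np
from scipy.stats import t, chi2

def t_interval(x, alpha=0.05):
    """The level (1 - alpha) t interval (4.4.1) for mu."""
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    half = s * t.ppf(1 - alpha / 2, df=n - 1) / np.sqrt(n)
    return xbar - half, xbar + half

def variance_interval(x, alpha=0.05):
    """Equal-tailed level (1 - alpha) interval for sigma^2 (Example 4.4.2)."""
    n, s2 = len(x), np.var(x, ddof=1)
    lo = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)
    hi = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)
    return lo, hi
```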
The length of this interval is random. There is a unique choice of α1 and α2 which uniformly minimizes expected length among all intervals of this type; it may be shown that for n large, taking α1 = α2 = α/2 is not far from optimal (Tate and Klett, 1959).

In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for σ² do not have confidence coefficient 1 − α even in the limit as n → ∞. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution, which typically is unknown. In the problems we give an interval with correct limiting coverage probability. The pivot V(σ²) similarly yields the respective lower and upper confidence bounds (n − 1)s²/x(1 − α) and (n − 1)s²/x(α). □

The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in n Bernoulli Trials. If X1, ..., Xn are the indicators of n Bernoulli trials with probability of success θ, then X̄ is the MLE of θ. There is no natural "exact" pivot based on X̄ and θ. However, by the De Moivre–Laplace theorem, √n(X̄ − θ)/√(θ(1 − θ)) has approximately a N(0, 1) distribution. If we use this function as an "approximate" pivot and let ≈ denote "approximate equality," we can write

  P[√n |X̄ − θ|/√(θ(1 − θ)) ≤ z(1 − α/2)] ≈ 1 − α.

Let k_α = z²(1 − α/2) and observe that this is equivalent to

  P[n(X̄ − θ)² ≤ k_α θ(1 − θ)] = P[g(θ, X̄) ≤ 0] ≈ 1 − α,

where

  g(θ, X̄) = (1 + k_α/n)θ² − (2X̄ + k_α/n)θ + X̄².

For fixed 0 < X̄ < 1, g(θ, X̄) is a quadratic polynomial with two real roots. In terms of S = nX̄, they are(1)

  θ(X) = {S + k_α/2 − k_α^{1/2}[S(n − S)/n + k_α/4]^{1/2}}/(n + k_α)    (4.4.3)
  θ̄(X) = {S + k_α/2 + k_α^{1/2}[S(n − S)/n + k_α/4]^{1/2}}/(n + k_α)    (4.4.4)

Because the coefficient of θ² in g(θ, X̄) is greater than zero, {θ : g(θ, X̄) ≤ 0} = [θ(X), θ̄(X)],
so that [θ(X), θ̄(X)] is an approximate level (1 − α) confidence interval for θ. We can similarly show that the endpoints of the level (1 − 2α) interval are approximate upper and lower level (1 − α) confidence bounds.

Note that in this example we can determine the sample size needed for desired accuracy. For instance, consider the market researcher whose interest is the proportion θ of a population that will buy a product. He draws a sample of n potential customers, calls willingness to buy success, and uses the preceding model. He can then determine how many customers should be sampled so that the interval has length 0.02 and is a confidence interval with confidence coefficient approximately 0.95. To see this, note that the length of the interval is

  l = 2k_α^{1/2}{[S(n − S)/n] + k_α/4}^{1/2}/(n + k_α).    (4.4.5)

Now use the fact that (S − n/2)² ≥ 0 implies S(n − S)/n ≤ n/4 to conclude that

  l ≤ k_α^{1/2}(n + k_α)^{1/2}/(n + k_α) = [k_α/(n + k_α)]^{1/2}.    (4.4.6)

Thus, to bound l above by l0, we choose n so that n = k_α/l0² − k_α. In this case, 1 − α/2 = 0.975, k_α^{1/2} = z(0.975) = 1.96, and we can achieve the desired length 0.02 by choosing

  n = (1.96/0.02)² − (1.96)² = 9,600.16, that is, n = 9,601.

This formula for the sample size is very crude because the bound (4.4.6) is only good when θ is near 1/2. Better results can be obtained if one has upper or lower bounds on θ such as θ ≤ θ0 < 1/2 or θ ≥ θ1 > 1/2; see the problems and Brown, Cai, and DasGupta (2000).

Another approximate pivot for this example is √n(X̄ − θ)/√(X̄(1 − X̄)). This leads to the simple interval

  X̄ ± z(1 − α/2)[X̄(1 − X̄)/n]^{1/2}.    (4.4.7)

These intervals and bounds are satisfactory in practice for the usual levels, if the smaller of nθ and n(1 − θ) is at least 6. For small n, it is better to use the exact level (1 − α) procedure developed in Section 4.5. A discussion is given in Brown, Cai, and DasGupta (2000). □
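A Python sketch of the two approximate intervals (ours; the roots (4.4.3)–(4.4.4) are often called the Wilson interval):

```python
import numpy as np
from scipy.stats import norm

def quadratic_interval(s, n, alpha=0.05):
    """The roots (4.4.3)-(4.4.4) of g(theta, X_bar) = 0, with S = n * X_bar."""
    k = norm.ppf(1 - alpha / 2) ** 2
    half = np.sqrt(k) * np.sqrt(s * (n - s) / n + k / 4)
    return (s + k / 2 - half) / (n + k), (s + k / 2 + half) / (n + k)

def simple_interval(s, n, alpha=0.05):
    """The simple interval (4.4.7) based on the plug-in pivot."""
    xbar = s / n
    half = norm.ppf(1 - alpha / 2) * np.sqrt(xbar * (1 - xbar) / n)
    return xbar - half, xbar + half
```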
Confidence Regions for Functions of Parameters

We can define confidence regions for a function q(θ) as random subsets of the range of q that cover the true value of q(θ) with probability at least (1 − α). We write this as

  P[q(θ) ∈ I(X)] ≥ 1 − α.

Example 4.4.4. Let X1, ..., Xn denote the number of hours a sample of Internet subscribers spend per week on the Internet, and suppose we want a confidence interval for the population proportion P(X ≥ x) of subscribers that spend at least x hours per week on the Internet. Suppose X1, ..., Xn is modeled as a sample from an exponential, E(θ⁻¹), distribution. Here

  q(θ) = 1 − F(x) = exp{−x/θ}.

The quantity 2nX̄/θ has a chi-square, χ²_{2n}, distribution. By using 2nX̄/θ as a pivot we find the (1 − α) confidence interval

  2nX̄/x(1 − α/2) ≤ θ ≤ 2nX̄/x(α/2),

where x(β) denotes the βth quantile of the χ²_{2n} distribution. Let θ and θ̄ denote the lower and upper boundaries of this interval. Because q(θ) is increasing in θ,

  exp{−x/θ} ≤ q(θ) ≤ exp{−x/θ̄}

is a confidence interval for q(θ) with confidence coefficient (1 − α). □

Note that if C(X) is a level (1 − α) confidence region for θ, then q(C(X)) = {q(θ) : θ ∈ C(X)} is a level (1 − α) confidence region for q(θ). If q is not 1-1, this technique is typically wasteful. For instance, if q(θ) = θ1, where θ = (θ1, θ2), we will later give confidence regions C(X) for the pair θ; then q(C(X)) is larger than the confidence set obtained by focusing on θ1 alone. That is, we can find confidence regions for q(θ) entirely contained in q(C(X)) with confidence level (1 − α).
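A quick Python sketch of this pivot-and-transform construction (our illustration; x0 is our name for the fixed number of hours x):

```python
import numpy as np
from scipy.stats import chi2

def exp_scale_interval(x, alpha=0.05):
    """Level (1 - alpha) interval for theta from the pivot 2n*X_bar/theta ~ chi2(2n)."""
    n, total = len(x), np.sum(x)          # total = n * X_bar
    lo = 2 * total / chi2.ppf(1 - alpha / 2, df=2 * n)
    hi = 2 * total / chi2.ppf(alpha / 2, df=2 * n)
    return lo, hi

def survival_interval(x, x0, alpha=0.05):
    """Induced interval for q(theta) = exp(-x0/theta); q is increasing in theta,
    so the interval endpoints simply transform."""
    lo, hi = exp_scale_interval(x, alpha)
    return np.exp(-x0 / lo), np.exp(-x0 / hi)
```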
Confidence Regions of Higher Dimension

We can extend the notion of a confidence interval for one-dimensional functions q(θ) to r-dimensional vectors q(θ) = (q1(θ), ..., qr(θ)). Suppose q_j(X) and q̄_j(X) are real-valued. Then the r-dimensional random rectangle

  I(X) = {q(θ) : q_j(X) ≤ q_j(θ) ≤ q̄_j(X), j = 1, ..., r}

is said to be a level (1 − α) confidence region if the probability that it covers the unknown but fixed true (q1(θ), ..., qr(θ)) is at least (1 − α). Note that if I_j(X) = [q_j(X), q̄_j(X)] is a level (1 − α_j) confidence interval for q_j(θ) and if the pairs (q_1, q̄_1), ..., (q_r, q̄_r) are independent, then the rectangle I(X) = I1(X) × ··· × Ir(X) has level

  Π_{j=1}^r (1 − α_j).    (4.4.8)

Thus, if we choose α_j = 1 − (1 − α)^{1/r}, then I(X) has confidence level 1 − α. That is, an r-dimensional confidence rectangle is in this case automatically obtained from the one-dimensional intervals.

Example 4.4.5. Confidence Rectangle for the Parameters of a Normal Distribution. Suppose X1, ..., Xn is a N(μ, σ²) sample and we want a confidence rectangle for (μ, σ²). From Example 4.4.1,

  I1(X) = X̄ ± s t_{n−1}(1 − α/4)/√n

is a confidence interval for μ with confidence coefficient (1 − α/2). From Example 4.4.2,

  I2(X) = [(n − 1)s²/x_{n−1}(1 − α/4), (n − 1)s²/x_{n−1}(α/4)]

is a reasonable confidence interval for σ² with confidence coefficient (1 − α/2). Thus, I1(X) × I2(X) is a level (1 − α) confidence rectangle for (μ, σ²). It is possible to show that the exact confidence coefficient is (1 − α/2)². □

An approach that works even if the I_j are not independent is to use Bonferroni's inequality. According to this inequality,

  P[q(θ) ∈ I(X)] ≥ 1 − Σ_{j=1}^r P[q_j(θ) ∉ I_j(X)] ≥ 1 − Σ_{j=1}^r α_j.

Thus, if we choose α_j = α/r, then I(X) has confidence level (1 − α).

Example 4.4.6. Suppose X1, ..., Xn are i.i.d. as X ~ P, and we are interested in the distribution function F(t) = P(X ≤ t); that is, ν(P) = F(·). We assume that F is continuous, in which case the distribution of

  D_n(F) = sup_{t ∈ R} |F̂(t) − F(t)|,

where F̂ is the empirical distribution function, does not depend on F and is known. Let d_α be chosen such that P_F(D_n(F) ≤ d_α) = 1 − α. Then, by solving D_n(F) ≤ d_α for F, we find that a simultaneous in t size 1 − α confidence region C(x)(·) is the confidence band which, for each t ∈ R, consists of the interval

  C(x)(t) = (max{0, F̂(t) − d_α}, min{1, F̂(t) + d_α}).

We have shown

  P(C(X)(t) ⊃ F(t) for all t ∈ R) = 1 − α for all P ∈ P,

where P is the set of P with P(−∞, t] continuous in t. □
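A Python sketch of this band (ours; instead of the exact distribution of D_n used in the text, we substitute the conservative Dvoretzky–Kiefer–Wolfowitz choice d = √(log(2/α)/2n), for which P(D_n > d) ≤ α):

```python
import numpy as np

def confidence_band(x, alpha=0.05):
    """Simultaneous level >= (1 - alpha) band for F at the order statistics,
    as in Example 4.4.6, with the conservative DKW choice of d_alpha."""
    xs = np.sort(x)
    n = len(xs)
    d = np.sqrt(np.log(2 / alpha) / (2 * n))
    F_hat = np.arange(1, n + 1) / n        # empirical cdf at x_(1) < ... < x_(n)
    return xs, np.maximum(F_hat - d, 0), np.minimum(F_hat + d, 1)
```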
We can apply the notions studied in Examples 4.4.4 and 4.4.6 to give confidence regions for scalar or vector parameters in nonparametric models.

Example 4.4.7. A Lower Confidence Bound for the Mean of a Nonnegative Random Variable. Suppose X1, ..., Xn are i.i.d. as X and that X has a density f(t) = F′(t), which is zero for t < 0 and nonzero for t > 0. By integration by parts, if

  μ = μ(F) = ∫₀^∞ t f(t) dt

exists, then

  μ = ∫₀^∞ [1 − F(t)] dt.

Let F−(t) and F+(t) be the lower and upper simultaneous confidence boundaries of Example 4.4.6. Then a (1 − α) lower confidence bound for μ is μ = μ(F+), because

  μ = inf{μ(F) : F ∈ C(X)} = μ(F+) and sup{μ(F) : F ∈ C(X)} = μ(F−) = ∞

(see the problems). Intervals for the case F supported on an interval (see the problems) arise in accounting practice (see Bickel, 1992), where such bounds are discussed and shown to be asymptotically strictly conservative. □

Summary. We define lower and upper confidence bounds (LCBs and UCBs), confidence intervals, and, more generally, confidence regions. In a parametric model {P_θ : θ ∈ Θ}, a level 1 − α confidence region for a parameter q(θ) is a set C(x) depending only on the data x such that the probability under P_θ that C(X) covers q(θ) is at least 1 − α for all θ ∈ Θ. For a nonparametric class P = {P} and parameter ν = ν(P), we similarly require P(C(X) ⊃ ν) ≥ 1 − α for all P ∈ P. We derive the (Student) t interval for μ in the N(μ, σ²) model with σ² unknown, and we derive an approximate confidence interval for the binomial parameter. In a nonparametric setting we derive a simultaneous confidence band for the distribution function F(t) and a bound for the mean of a positive variable X.

4.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS

Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least 1 − α. Acceptance regions of statistical tests are, for a given hypothesis H, subsets of the sample space with probability of accepting H at least 1 − α when H is true. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses. We begin by illustrating the duality in the following example.

Example 4.5.1. Two-Sided Tests for the Mean of a Normal Distribution. Suppose that an established theory postulates the value μ0 for a certain physical constant. A scientist has reasons to believe that the theory is incorrect and measures the constant n times obtaining
measurements X1, ..., Xn. Knowledge of his instruments leads him to assume that the Xi are independent and identically distributed normal random variables with mean μ and variance σ². If any value of μ other than μ0 is a possible alternative, then it is reasonable to formulate the problem as that of testing H : μ = μ0 versus K : μ ≠ μ0.

We can base a size α test on the level (1 − α) confidence interval (4.4.1) we constructed for μ as follows. We accept H if, and only if, the postulated value μ0 is a member of the level (1 − α) confidence interval

  [X̄ − s t_{n−1}(1 − α/2)/√n, X̄ + s t_{n−1}(1 − α/2)/√n].

If we let T = √n(X̄ − μ0)/s, then our test accepts H if, and only if,

  −t_{n−1}(1 − α/2) ≤ T ≤ t_{n−1}(1 − α/2).    (4.5.1)

Because P_{μ0}[|T| = t_{n−1}(1 − α/2)] = 0, the test is equivalently characterized by rejecting H when |T| > t_{n−1}(1 − α/2). This test is called two-sided because it rejects for both large and small values of the statistic T. In contrast to the one-sided tests considered earlier, it has power against parameter values on either side of μ0. Because the same interval (4.4.1) is used for every μ0, we see that we have, in fact, generated a family of level α tests {δ(X, μ)} where

  δ(X, μ) = 1 if √n|X̄ − μ|/s > t_{n−1}(1 − α/2), and 0 otherwise,    (4.5.2)

these tests corresponding to different hypotheses, δ(X, μ0) being of size α only for the hypothesis H : μ = μ0. Conversely, if we start out with (say) the level (1 − α) LCB X̄ − t_{n−1}(1 − α)s/√n and define δ*(X, μ) to equal 1 if, and only if, X̄ − t_{n−1}(1 − α)s/√n > μ, we achieve a similar effect, generating a family of level α tests. And if we start with the tests (4.5.2), we obtain the confidence interval (4.4.1) by finding, for each x, the set of μ where δ(x, μ) = 0. □

These are examples of a general phenomenon. Consider the general framework where the random vector X takes values in the sample space X ⊂ R^q and X has distribution P ∈ P. Let ν = ν(P) be a parameter that takes values in the set N. For instance, in Example 4.5.1, μ = μ(P) takes values in N = (−∞, ∞); in Example 4.4.2, σ² = σ²(P) takes values in N = (0, ∞); and in Example 4.4.5, θ = (μ, σ²) takes values in N = (−∞, ∞) × (0, ∞). For a function space example, consider ν(P) = F, as in Example 4.4.6, where F is the distribution function of Xi. Here an example of N is the class of all continuous distribution functions.

Let S = S(X) be a map from X to subsets of N. Then S is a (1 − α) confidence region for ν if the probability that S(X) contains ν is at least (1 − α), that is,

  P[ν ∈ S(X)] ≥ 1 − α, all P ∈ P.
Next consider the testing framework where we test the hypothesis H = H_{ν0} : ν = ν0 for some specified value ν0. Formally, let P_{ν0} = {P : ν(P) = ν0}, ν0 ∈ N. Suppose we have a test δ(X, ν0) with level α. Then the acceptance region A(ν0) = {x : δ(x, ν0) = 0} is a subset of X with probability at least 1 − α. For some specified ν0, H may be rejected; for other specified ν0, H may be accepted. Consider the set of ν0 for which H_{ν0} is accepted; this is a random set contained in N with probability at least 1 − α of containing the true value of ν(P), whatever be P. Conversely, if S(X) is a level 1 − α confidence region for ν, then the test that accepts H_{ν0} if and only if ν0 is in S(X) is a level α test for H_{ν0}. We have the following.

Duality Theorem. Let S(X) = {ν0 ∈ N : X ∈ A(ν0)}. Then P[X ∈ A(ν0)] ≥ 1 − α for all P ∈ P_{ν0} if and only if S(X) is a 1 − α confidence region for ν.

We next apply the duality theorem to MLR families:

Theorem 4.5.1. Suppose X ~ P_θ where {P_θ : θ ∈ Θ} is MLR in T = T(X), and suppose that the distribution function F_θ(t) of T under P_θ is continuous in each of the variables t and θ when the other is fixed. If the equation F_θ(t) = α has a solution θ̄_α(t) in Θ, then θ̄_α(T) is an upper confidence bound for θ with confidence coefficient 1 − α. Similarly, any solution θ_α(T) of F_θ(T) = 1 − α is a lower confidence bound for θ with confidence coefficient 1 − α. Moreover, if α1 + α2 < 1, then [θ_{α1}, θ̄_{α2}] is a confidence interval for θ with confidence coefficient 1 − (α1 + α2).

Proof. By Theorem 4.3.1, the acceptance region of the UMP size α test of H : θ = θ0 versus K : θ > θ0 can be written

  A(θ0) = {x : T(x) ≤ t_{θ0}(1 − α)},

where t_{θ0}(1 − α) is the 1 − α quantile of F_{θ0}. Also by Theorem 4.3.1, the power function P_θ(T > t) = 1 − F_θ(t) of a test with critical constant t is increasing in θ; that is, F_θ(t) is decreasing in θ. Let

  S(t) = {θ ∈ Θ : t ≤ t_θ(1 − α)}.

By applying F_θ to both sides of t ≤ t_θ(1 − α) we find

  S(t) = {θ ∈ Θ : F_θ(t) ≤ 1 − α},

and, because F_θ(t) is decreasing in θ, F_θ(t) ≤ 1 − α if and only if θ ≥ θ_α(t), where θ_α(t) solves F_θ(t) = 1 − α. By the duality theorem, S(T) is a 1 − α confidence region for θ; hence, θ_α(T) is a lower confidence bound with confidence coefficient 1 − α. The proofs for the upper confidence bound and the interval follow by the same type of argument. □
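As a numerical illustration (ours), the theorem can be applied by root-finding. For the N(θ, 1) model with T = X̄ we have F_θ(t) = Φ(√n(t − θ)), continuous and decreasing in θ, and solving F_θ(t) = 1 − α recovers the familiar lower bound t − z(1 − α)/√n:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def mlr_lower_bound(t, n, alpha=0.05):
    """Numerically solve F_theta(t) = 1 - alpha in theta (Theorem 4.5.1)
    for the N(theta, 1) model with T = X_bar."""
    f = lambda theta: norm.cdf(np.sqrt(n) * (t - theta)) - (1 - alpha)
    return brentq(f, t - 10, t + 10)

# mlr_lower_bound(0.0, 25) is about -0.329, i.e., 0 - 1.645/5.
```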
We next give connections between confidence bounds, acceptance regions, and p-values for MLR families. Let t denote the observed value t = T(x) of T(X) for the datum x, and let

  α(t, θ0) = P_{θ0}(T ≥ t) = 1 − F_{θ0}(t)

denote the p-value for the UMP size α test of H : θ = θ0 versus K : θ > θ0.

Corollary 4.5.1. Under the conditions of Theorem 4.5.1,

  S(t) = {θ : α(t, θ) > α}

is a level (1 − α) confidence region for θ, with acceptance regions A*(θ) = {t : α(t, θ) > α}.

Proof. We have seen in the proof of Theorem 4.5.1 that 1 − F_θ(t) is increasing in θ; because F_θ is a distribution function, 1 − F_θ(t) is decreasing in t. The result follows. □

In general, let α(t, ν0) denote the p-value of a test δ(T, ν0) = 1[T > c] of H : ν = ν0 based on a statistic T = T(X) with observed value t = T(x). Then the set

  C = {(t, ν) : α(t, ν) > α} = {(t, ν) : δ(t, ν) = 0}

gives the pairs (t, ν) where, for the given t, ν will be accepted and, for the given ν, t is in the acceptance region. We call C the set of compatible (t, ν) points. In the (t, θ) plane, vertical sections of C are the confidence regions S(t), whereas horizontal sections are the acceptance regions A*(ν) = {t : δ(t, ν) = 0}. We illustrate these ideas using the example of testing H : μ = μ0 when X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known. With T = X̄,

  C = {(t, μ) : |t − μ| < σz(1 − α/2)/√n}.

Figure 4.5.1 shows the set C, a confidence region S(t0), and an acceptance set A*(μ0) for this example.

Example 4.5.2. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. Let X1, ..., Xn be the indicators of n binomial trials with probability of success θ. We shall use some of the results derived earlier for this model. To find a lower confidence bound for θ, our preceding discussion leads us to consider level α tests for H : θ ≤ θ0. For α ∈ (0, 1), let k(θ0, α) denote the critical constant of the UMP level α test, which rejects if and only if S ≥ k(θ0, α), where S = Σ_{i=1}^n Xi. The corresponding level (1 − α) confidence region is

  C(X1, ..., Xn) = {θ : S < k(θ, α)}.

To analyze the structure of this region we need to examine k(θ, α). We claim that

  (i) k(θ, α) is nondecreasing in θ;
  (ii) k(θ, α) increases by exactly 1 at its points of discontinuity;
  (iii) k(θ, α) → k(θ0, α) as θ ↑ θ0;
  (iv) k(0, α) = 1 and k(1, α) = n + 1.
Figure 4.5.1. The shaded region is the compatibility set C for the two-sided test of H_{μ0} : μ = μ0 in the normal model. S(t0) is a confidence interval for μ for a given value t0 of T, whereas A*(μ0) is the acceptance region for H_{μ0}.

To prove (i), note that it was shown in Theorem 4.3.1(i) that P_θ[S ≥ j] is nondecreasing in θ for fixed j; clearly, it is also nonincreasing in j for fixed θ. Therefore, if θ1 < θ2, then k(θ1, α) > k(θ2, α) would imply

  P_{θ1}[S ≥ k(θ2, α)] > α ≥ P_{θ2}[S ≥ k(θ2, α)],

a contradiction.

The assertion (ii) is a consequence of the following remarks. If θ0 is a discontinuity point of k(·, α), let j be the limit of k(θ, α) as θ ↑ θ0. Then P_θ[S ≥ j] ≤ α for all θ < θ0 and, hence, by the continuity of θ ↦ P_θ[S ≥ j], P_{θ0}[S ≥ j] ≤ α, so that k(θ0, α) ≤ j. On the other hand, P_θ[S ≥ j + 1] < P_θ[S ≥ j], so that for θ in a right neighborhood of θ0 we still have P_θ[S ≥ j + 1] ≤ α and, hence, k(θ, α) ≤ j + 1 there. Because k is nondecreasing, the jump at θ0 is exactly 1. The claims (iii) and (iv) are left as exercises.

From (i), (ii), (iii), and (iv) we see that, if we define

  θ(S) = inf{θ : k(θ, α) = S + 1},

then

  C(X) = (θ(S), 1] if S > 0, and C(X) = [0, 1] if S = 0,
and θ(S) is the desired level (1 − α) LCB for θ.(2) From our discussion, when S > 0, we find θ(S) as the unique solution of the equation

  Σ_{r=S}^n (n choose r) θ^r (1 − θ)^{n−r} = α.

When S = 0, θ(S) = 0. Similarly, we define

  θ̄(S) = sup{θ : j(θ, α) = S − 1},

where j(θ, α) is the critical constant of the corresponding level α test of H : θ ≥ θ0, which rejects for small values of S. Then θ̄(S) is a level (1 − α) UCB for θ and, when S < n, θ̄(S) is the unique solution of

  Σ_{r=0}^S (n choose r) θ^r (1 − θ)^{n−r} = α.

When S = n, θ̄(S) = 1. Putting the bounds θ(S), θ̄(S) together we get the confidence interval [θ(S), θ̄(S)] of level (1 − 2α). These intervals can be obtained from computer packages that use algorithms based on the preceding considerations. As might be expected, if n is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3. □

Figure 4.5.2. Plot of k(θ, .16) for n = 2.
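These exact bounds are easy to compute by root-finding on the binomial tail equations above; a minimal Python sketch (ours):

```python
from scipy.stats import binom
from scipy.optimize import brentq

def exact_binomial_bounds(s, n, alpha=0.05):
    """Level (1 - alpha) lower and upper bounds for theta (Example 4.5.2):
    the lower bound solves P_theta(S >= s) = alpha and the upper bound
    solves P_theta(S <= s) = alpha."""
    lower = 0.0 if s == 0 else brentq(
        lambda p: binom.sf(s - 1, n, p) - alpha, 1e-12, 1 - 1e-12)
    upper = 1.0 if s == n else brentq(
        lambda p: binom.cdf(s, n, p) - alpha, 1e-12, 1 - 1e-12)
    return lower, upper
```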
Applications of Confidence Intervals to Comparisons and Selections

We have seen that confidence intervals lead naturally to two-sided tests. However, two-sided tests seem incomplete in the sense that if H : θ = θ0 is rejected in favor of K : θ ≠ θ0, we usually want to know whether θ > θ0 or θ < θ0. Thus, the two-sided test can be regarded as the first step in a decision procedure where, if H is not rejected, we make no claims of significance, but if H is rejected, we decide whether this is because θ is smaller or larger than θ0. The problem of deciding whether θ = θ0, θ < θ0, or θ > θ0 is an example of a three-decision problem and is a special case of the decision problems in Section 1.3. Here we consider the simple solution suggested by the level (1 − α) confidence interval I:

1. Make no judgment as to whether θ < θ0 or θ > θ0 if I contains θ0.
2. Decide θ < θ0 if I is entirely to the left of θ0.
3. Decide θ > θ0 if I is entirely to the right of θ0.    (4.5.3)

For this three-decision rule, the probability of falsely claiming significance of either θ < θ0 or θ > θ0 is bounded above by α/2. For instance, suppose X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known. In Section 4.4 we considered the level (1 − α) confidence interval X̄ ± σz(1 − α/2)/√n for μ. Using this interval and (4.5.3) we obtain the following three-decision rule based on T = √n(X̄ − μ0)/σ:

  Do not reject H : μ = μ0 if |T| ≤ z(1 − α/2).
  Decide μ < μ0 if T < −z(1 − α/2).
  Decide μ > μ0 if T > z(1 − α/2).

To see the error bound, consider first the case μ ≥ μ0. Then the wrong decision "μ < μ0" is made when T < −z(1 − α/2). This event has probability

  P(√n(X̄ − μ)/σ < −z(1 − α/2) − √n(μ − μ0)/σ) ≤ Φ(−z(1 − α/2)) = α/2.

Similarly, when μ ≤ μ0, the probability of the wrong decision "μ > μ0" is at most α/2. Therefore, by using this kind of procedure in a comparison or selection problem, we can control the probabilities of a wrong selection by setting the α of the parent test or confidence interval.

As an illustration, suppose θ is the expected difference in blood pressure when two treatments, A and B, are given to high blood pressure patients. Because we do not know whether A or B is to be preferred, we test H : θ = 0 versus K : θ ≠ 0. If H is rejected, it is natural to carry the comparison of A and B further by asking whether θ < 0 or θ > 0. If we decide θ < 0, then we select A as the better treatment, and vice versa. We can use the two-sided tests and confidence intervals introduced in later chapters in similar fashions.
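A sketch of the rule as code (ours), for the known-σ normal case:

```python
import numpy as np
from scipy.stats import norm

def three_decision_rule(x, mu0, sigma, alpha=0.05):
    """The rule (4.5.3); each wrong directional claim has probability <= alpha/2."""
    T = np.sqrt(len(x)) * (np.mean(x) - mu0) / sigma
    c = norm.ppf(1 - alpha / 2)
    if T > c:
        return "decide mu > mu0"
    elif T < -c:
        return "decide mu < mu0"
    return "no directional judgment"
```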
Summary. We explore the connection between tests of statistical hypotheses and confidence regions. If δ(x, ν0) is a level α test of H : ν = ν0, then the set S(x) of ν0 where δ(x, ν0) = 0 is a level (1 − α) confidence region for ν. Conversely, if S(x) is a level (1 − α) confidence region for ν, then the test that accepts H : ν = ν0 when ν0 ∈ S(x) is a level α test. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. We also give a connection between confidence intervals, two-sided tests, and the three-decision problem of deciding whether a parameter θ is θ0, less than θ0, or larger than θ0, where θ0 is a specified value.

4.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS

In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. If θ and θ* are two competing level (1 − α) lower confidence bounds for θ, they are both very likely to fall below the true θ. But we also want the bounds to be close to θ. Thus, we say that the bound with the smaller probability of being far below θ is more accurate. Formally:

Definition 4.6.1. A level (1 − α) LCB θ* of θ is said to be more accurate than a competing level (1 − α) LCB θ if, for any fixed θ and all θ′ < θ,

  P_θ[θ*(X) ≤ θ′] ≤ P_θ[θ(X) ≤ θ′].    (4.6.1)

Similarly, a level (1 − α) UCB θ̄* is more accurate than a competitor θ̄ if, for any fixed θ and all θ′ > θ,

  P_θ[θ̄*(X) ≥ θ′] ≤ P_θ[θ̄(X) ≥ θ′].    (4.6.2)

Lower confidence bounds satisfying (4.6.1) for all competitors are called uniformly most accurate, as are upper confidence bounds satisfying (4.6.2) for all competitors.

We next show that, for this notion of accuracy, which is connected to the power of the associated one-sided tests, optimality of the tests translates into accuracy of the bounds; indeed, (4.6.1) is nothing more than a comparison of power functions.

Example 4.6.1. Suppose X = (X1, ..., Xn) is a sample of N(μ, σ²) random variables with σ² known. A level α test of H : μ = μ0 vs K : μ > μ0 rejects H when √n(X̄ − μ0)/σ ≥ z(1 − α). The dual lower confidence bound is μ1(X) = X̄ − z(1 − α)σ/√n. A competing lower confidence bound is μ2(X) = X(k), where X(1) ≤ X(2) ≤ ··· ≤ X(n) denotes the ordered X1, ..., Xn and k is defined by P(S ≥ k) = 1 − α for a binomial, B(n, 1/2), random variable S (see the problems). Which lower bound is more accurate? It does turn out that
μ1(X) is more accurate than μ2(X) and is, in fact, uniformly most accurate in the N(μ, σ²) model. This is a consequence of the following theorem. However, X(k) does have the advantage that we don't have to know σ or even the shape of the density f of Xi to apply it. Also, the robustness considerations of Section 3.5 favor X(k). □

Theorem 4.6.1. Let θ* be a level (1 − α) LCB for θ, a real parameter, such that for each θ0 the associated test whose critical function δ*(x, θ0) is given by

  δ*(x, θ0) = 1 if θ*(x) > θ0, and 0 otherwise

is UMP level α for H : θ = θ0 versus K : θ > θ0. Then θ* is uniformly most accurate at level (1 − α).

Proof. Let θ be a competing level (1 − α) lower confidence bound. Define δ(x, θ0) by

  δ(x, θ0) = 1 if θ(x) > θ0, and 0 otherwise.

Because θ(X, θ0) is a level (1 − α) LCB, δ(X, θ0) is a level α test for H : θ = θ0 versus K : θ > θ0. Because δ*(X, θ0) is UMP level α, for θ1 > θ0 we must have

  E_θ1(δ(X, θ0)) ≤ E_θ1(δ*(X, θ0))

or

  P_θ1[θ(X) > θ0] ≤ P_θ1[θ*(X) > θ0];

hence,

  P_θ1[θ*(X) ≤ θ0] ≤ P_θ1[θ(X) ≤ θ0].

Identify θ0 with θ′ and θ1 with θ in the statement of Definition 4.6.1 and the result follows. □

Most accurate upper confidence bounds are defined similarly. Uniformly most accurate (UMA) bounds turn out to have related nice properties. For instance (see the problems), they have the smallest expected "distance" to θ:

Corollary 4.6.1. Suppose θ*(X) is a UMA level (1 − α) lower confidence bound for θ. Then for all θ,

  E_θ(θ − θ*(X))⁺ ≤ E_θ(θ − θ(X))⁺

for any other level (1 − α) LCB θ(X), where a⁺ = a if a > 0, and 0 otherwise.

We can extend the notion of accuracy to confidence bounds for real-valued functions of an arbitrary parameter. We define q* to be a uniformly most accurate level (1 − α) LCB for q(θ) if, and only if, for any other level (1 − α) LCB q,

  P_θ[q* ≤ q(θ′)] ≤ P_θ[q ≤ q(θ′)] whenever q(θ′) < q(θ).

Example 4.6.2. Bounds for the Probability of Early Failure of Equipment. Let X1, ..., Xn be the times to failure of n pieces of equipment where we assume that the Xi are independent E(λ) variables. We want a uniformly most accurate level (1 − α) upper confidence bound q* for q(λ) = 1 − e^{−λt0}, the probability of early failure (failure before a fixed time t0) of a piece of equipment. We begin by finding a uniformly most accurate level (1 − α) UCB λ* for λ. To find λ* we invert the family of UMP level α tests of H : λ ≥ λ0 versus K : λ < λ0.
By the results derived earlier (see the problems), the UMP test accepts H if

  Σ_{i=1}^n Xi < x_{2n}(1 − α)/2λ0,    (4.6.3)

or equivalently if

  λ0 < x_{2n}(1 − α)/2 Σ_{i=1}^n Xi,

where x_{2n}(1 − α) is the (1 − α) quantile of the χ²_{2n} distribution. Therefore, the confidence region corresponding to this test is (0, λ*), where λ* = x_{2n}(1 − α)/2 Σ_{i=1}^n Xi is, by Theorem 4.6.1, a uniformly most accurate level (1 − α) UCB for λ. Because q is strictly increasing in λ, it follows that q(λ*) is a uniformly most accurate level (1 − α) UCB for the probability of early failure. □

Discussion. We have only considered confidence bounds; the situation with confidence intervals is more complicated. Considerations of accuracy lead us to ask that, subject to the requirement that the confidence level is (1 − α), the confidence interval be as short as possible. The length t̄ − t is random, and it can be shown that in most situations there is no confidence interval of level (1 − α) that has uniformly minimum length among all such intervals. If we turn to the expected length E_θ(t̄ − t) as a measure of precision, the situation is still unsatisfactory because, as in the estimation problem, there does not exist in general a member of the class of level (1 − α) intervals that has minimum expected length for all θ. However, we can restrict attention to certain reasonable subclasses of level (1 − α) intervals for which members with uniformly smallest expected length exist. Thus, Pratt (1961) showed that in many of the classical problems of estimation there exist level (1 − α) confidence intervals that have uniformly minimum expected length among all level (1 − α) unbiased confidence intervals. Neyman defines unbiased confidence intervals of level (1 − α) by the property that

  P_θ[t ≤ q(θ′) ≤ t̄] ≤ P_θ[t ≤ q(θ) ≤ t̄] for every θ, θ′.

That is, the interval must be at least as likely to cover the true value of q(θ) as any other value. In particular, the intervals developed in Example 4.4.1 have this property. Confidence intervals obtained from two-sided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes. These topics are discussed in Lehmann (1997). There are also some large sample results in this direction (see Wilks, 1962, pp. 374–376).

Summary. We define a level (1 − α) lower confidence bound θ* of θ to be uniformly most accurate if it is less likely than any other level (1 − α) lower confidence bound to fall below any value θ′ below the true θ; uniformly most accurate upper bounds are defined similarly. By using the duality between one-sided tests and confidence bounds, we show that confidence bounds based on UMP level α tests are uniformly most accurate (UMA) level (1 − α) bounds.
4.7 FREQUENTIST AND BAYESIAN FORMULATIONS

We have so far focused on the frequentist formulation of confidence bounds and intervals where the data X ∈ X ⊂ R^q are random while the parameters are fixed but unknown. A consequence of this approach is that once a numerical interval has been computed from experimental data, no probability statement can be attached to this interval. Instead, the interpretation of a 100(1 − α)% confidence interval is that if we repeated an experiment indefinitely, each time computing a 100(1 − α)% confidence interval, then 100(1 − α)% of the intervals would contain the true unknown parameter value.

In the Bayesian formulation of Chapter 1, what are called level (1 − α) credible bounds and intervals are subsets of the parameter space which are given probability at least (1 − α) by the posterior distribution of the parameter given the data. Suppose that, given θ, X has distribution P_θ, θ ∈ Θ ⊂ R, and that θ has the prior probability distribution Π. Let Π(·|x) denote the posterior probability distribution of θ given X = x.

Definition 4.7.1. θ and θ̄ are level (1 − α) lower and upper credible bounds for θ if they respectively satisfy

  Π(θ ≤ θ | x) ≥ 1 − α and Π(θ ≤ θ̄ | x) ≥ 1 − α.

Turning to Bayesian credible intervals and regions, it is natural to consider the collection of θ that is "most likely" under the distribution Π(θ|x). Let π(·|x) denote the density of θ given X = x.

Definition 4.7.2. If π(θ|x) is a density, then

  C_k = {θ : π(θ|x) ≥ k}

is called a level (1 − α) credible region for θ if Π(C_k | x) ≥ 1 − α. If π(θ|x) is unimodal, then C_k will be an interval of the form [θ, θ̄]. We next give such an example.

Example 4.7.1. Suppose that, given μ, X1, ..., Xn are i.i.d. N(μ, σ0²) with σ0² known, and that μ ~ N(μ0, τ0²), with μ0 and τ0² known. Then, from Chapter 1, the posterior distribution of μ given X1, ..., Xn is N(μ_B, σ_B²) with

  μ_B = (n X̄/σ0² + μ0/τ0²) / (n/σ0² + 1/τ0²), σ_B² = (n/σ0² + 1/τ0²)⁻¹.

It follows that the level 1 − α lower and upper credible bounds for μ are

  μ = μ_B − z(1 − α) (σ0/√n)(1 + σ0²/(nτ0²))^{−1/2}
  μ̄ = μ_B + z(1 − α) (σ0/√n)(1 + σ0²/(nτ0²))^{−1/2},
while the level (1 − α) credible interval is [μ⁻, μ⁺] with

  μ± = μ_B ± z(1 − α/2) (σ0/√n)(1 + σ0²/(nτ0²))^{−1/2}.

Compared to the frequentist interval X̄ ± z(1 − α/2)σ0/√n, the center μ_B of the Bayesian interval is pulled in the direction of μ0, where μ0 is a prior guess of the value of μ, and the interval is a little narrower (see Chapter 1 for sources of such prior guesses). Note that as τ0 → ∞, the Bayesian interval tends to the frequentist interval. However, the interpretations of the intervals are different: In the frequentist confidence interval, the probability of coverage is computed with the data X random and θ fixed, whereas in the Bayesian credible interval, the probability of coverage is computed with X = x fixed and θ random with probability distribution Π(θ | X = x). □

Example 4.7.2. Suppose that, given σ², X1, ..., Xn are i.i.d. N(μ0, σ²) where μ0 is known. Let λ = σ⁻² and suppose λ has the gamma Γ(a/2, b/2) density, where a > 0, b > 0 are known parameters. Then, given X1, ..., Xn, (t + b)λ has a χ²_{a+n} distribution, where t = Σ(Xi − μ0)². Let x_{a+n}(α) denote the αth quantile of the χ²_{a+n} distribution. Then

  λ = x_{a+n}(α)/(t + b)

is a level (1 − α) lower credible bound for λ, and

  σ̄² = (t + b)/x_{a+n}(α)

is a level (1 − α) upper credible bound for σ². Compared to the frequentist bound (n − 1)s²/x_{n−1}(α) of Example 4.4.2, this credible bound is shifted in the direction of the reciprocal b/a of the mean of the prior π(λ). We shall analyze Bayesian credible regions further in Chapter 5. □

Summary. In the Bayesian framework we define bounds and intervals, called level (1 − α) credible bounds and intervals, that determine subsets of the parameter space that are assigned probability at least (1 − α) by the posterior distribution of the parameter θ given the data x. In the case of a normal prior π(θ) and normal model p(x | θ), we compare the Bayesian credible intervals and bounds to their frequentist counterparts.
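A small Python sketch of the credible interval of Example 4.7.1 (our illustration):

```python
import numpy as np
from scipy.stats import norm

def normal_credible_interval(x, sigma0, mu0, tau0, alpha=0.05):
    """Level (1 - alpha) credible interval for mu when X_i | mu ~ N(mu, sigma0^2)
    and the prior is mu ~ N(mu0, tau0^2) (Example 4.7.1)."""
    n, xbar = len(x), np.mean(x)
    precision = n / sigma0**2 + 1 / tau0**2
    mu_B = (n * xbar / sigma0**2 + mu0 / tau0**2) / precision
    sd_B = np.sqrt(1 / precision)          # posterior standard deviation
    z = norm.ppf(1 - alpha / 2)
    return mu_B - z * sd_B, mu_B + z * sd_B
```

As τ0 → ∞ the prior precision term vanishes and the output approaches the frequentist interval X̄ ± z(1 − α/2)σ0/√n, matching the remark above.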
4.8 PREDICTION INTERVALS

In Section 1.4 we discussed situations in which we want to predict the value of a random variable Y. In addition to point prediction of Y, it is desirable to give an interval [Y, Ȳ] that contains the unknown value Y with prescribed probability (1 − α). For instance, a doctor administering a treatment with delayed effect will give patients a time interval [t, t̄] in which the treatment is likely to take effect; or we may want an interval for the future GPA of a student or a future value of a portfolio. We define a level (1 − α) prediction interval as an interval [Y, Ȳ] based on data X such that P(Y ≤ Y ≤ Ȳ) ≥ 1 − α. The problem of finding prediction intervals is similar to finding confidence intervals using a pivot:

Example 4.8.1. The (Student) t Prediction Interval. As in Example 4.4.1, let X1, ..., Xn be i.i.d. as X ~ N(μ, σ²). We want a prediction interval for Y = X_{n+1}, which is assumed to be also N(μ, σ²) and independent of X1, ..., Xn. Let Ŷ = Ŷ(X) denote a predictor based on X = (X1, ..., Xn). Then Y and Ŷ are independent and the mean squared prediction error (MSPE) of Ŷ is

  E(Ŷ − Y)² = E(Ŷ − μ)² + E(Y − μ)² = MSE(Ŷ) + σ²,

where MSE denotes the estimation theory mean squared error. Note that Ŷ can be regarded as both a predictor of Y and as an estimate of μ, and when we do so, MSPE(Ŷ) = MSE(Ŷ) + σ²; thus, in this case, the optimal estimator when it exists is also the optimal predictor. In Chapter 3 we found that, in the class of unbiased estimators, X̄ is the optimal estimator. We define a predictor Y* to be prediction unbiased for Y if E(Y* − Y) = 0, and we can conclude that in the class of prediction unbiased predictors, the optimal MSPE predictor is Ŷ = X̄.

We next use the prediction error Ŷ − Y to construct a pivot that can be used to give a prediction interval. Note that

  Ŷ − Y = X̄ − X_{n+1} ~ N(0, [n⁻¹ + 1]σ²).

It follows that

  Z_p(Y) = (Ŷ − Y)/(σ√(1 + n⁻¹))

has a N(0, 1) distribution. Moreover, s² = (n − 1)⁻¹ Σ_{i=1}^n (Xi − X̄)² is independent of X̄ by Theorem B.3.3 and independent of X_{n+1} by assumption, so Z_p(Y) is independent of V = (n − 1)s²/σ², which has a χ²_{n−1} distribution. It follows, by the definition of the (Student) t distribution in Section B.3.1, that

  T_p(Y) = (Ŷ − Y)/(s√(1 + n⁻¹))

has the t distribution, T_{n−1}. Thus, T_p(Y) acts as a prediction interval pivot in the same way that T(μ) acts as a confidence interval pivot in Example 4.4.1. By solving −t_{n−1}(1 − α/2) ≤ T_p(Y) ≤ t_{n−1}(1 − α/2) for Y, we find the (1 − α) prediction interval

  [Y, Ȳ] = X̄ ± √(1 + n⁻¹) s t_{n−1}(1 − α/2).    (4.8.1)

Note that the prediction interval is much wider than the confidence interval (4.4.1). In fact, it can be shown using the methods of Chapter 5 that the width of the confidence interval (4.4.1) tends to zero in probability at the rate n^{−1/2}, whereas the width of the prediction interval tends to 2σz(1 − α/2). Moreover, the confidence level of (4.4.1) is approximately correct for large n even if the sample comes from a nonnormal distribution, whereas the level of the prediction interval (4.8.1) is not (1 − α) even in the limit as n → ∞ for samples from non-Gaussian distributions. □
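In code (our sketch), the interval (4.8.1) is:

```python
import numpy as np
from scipy.stats import t

def t_prediction_interval(x, alpha=0.05):
    """Level (1 - alpha) prediction interval (4.8.1) for a future X_{n+1}."""
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    half = t.ppf(1 - alpha / 2, df=n - 1) * s * np.sqrt(1 + 1 / n)
    return xbar - half, xbar + half
```

Note the √(1 + 1/n) factor: unlike the confidence interval, the half-width does not shrink to zero as n grows.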
We next give a prediction interval that is valid for samples from any population with a continuous distribution.

Example 4.8.2. Suppose X1, ..., Xn are i.i.d. as X ~ F, where F is a continuous distribution function with positive density f on (a, b), −∞ ≤ a < b ≤ ∞. We want a prediction interval for Y = X_{n+1} ~ F, where X_{n+1} is independent of the data X1, ..., Xn. Let X(1) < ··· < X(n) denote the order statistics of X1, ..., Xn. Set Ui = F(Xi), i = 1, ..., n + 1; then U1, ..., U_{n+1} are i.i.d. uniform, U(0, 1). Let U(1) < ··· < U(n) be U1, ..., Un ordered; then E(U(i)) = i/(n + 1) and, for j < k,

  P(X(j) ≤ X_{n+1} ≤ X(k)) = P(U(j) ≤ U_{n+1} ≤ U(k))
    = ∫∫ P(u ≤ U_{n+1} ≤ v | U(j) = u, U(k) = v) dH(u, v)
    = ∫∫ (v − u) dH(u, v) = E(U(k)) − E(U(j)) = (k − j)/(n + 1),    (4.8.2)

where H is the joint distribution of U(j) and U(k). It follows that [X(j), X(k)] with k = n + 1 − j is a level (n + 1 − 2j)/(n + 1) prediction interval for X_{n+1}. This interval is a distribution-free prediction interval; a simpler proof of (4.8.2) is outlined in the problems. □

Bayesian Predictive Distributions

Suppose that θ is random with θ ~ π and that, given θ = θ, X1, ..., X_{n+1} are i.i.d. p(x | θ). Here X1, ..., Xn are observable and X_{n+1} is to be predicted. The posterior predictive distribution Q(· | x) of X_{n+1} is defined as the conditional distribution of X_{n+1} given x = (x1, ..., xn); in the continuous case it has density

  q(y | x) = ∫ p(y | θ) π(θ | x) dθ,

with a sum replacing the integral in the discrete case. Now [Y_B, Ȳ_B] is said to be a level (1 − α) Bayesian prediction interval for Y = X_{n+1} if

  Q(Y_B ≤ Y ≤ Ȳ_B | x) ≥ 1 − α.
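The distribution-free interval of Example 4.8.2 amounts to picking symmetric order statistics; a Python sketch (ours, with 1-based order-statistic indexing):

```python
import numpy as np

def distribution_free_pi(x, j):
    """[X_(j), X_(n+1-j)] contains an independent future X_{n+1} with
    probability (n + 1 - 2j)/(n + 1), for any continuous F (Example 4.8.2)."""
    xs = np.sort(np.asarray(x))
    n = len(xs)
    k = n + 1 - j
    level = (n + 1 - 2 * j) / (n + 1)
    return xs[j - 1], xs[k - 1], level
```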
Example 4.8.3. Consider Example 3.2.1 where, given θ, Xi ~ N(θ, σ0²) with σ0² known, and π(θ) is N(η0, τ²), τ² known. A sufficient statistic based on the observables X1, ..., Xn is T = X̄ = n⁻¹ Σ_{i=1}^n Xi, and it is enough to derive the marginal distribution of Y = X_{n+1} from the joint distribution of X̄, X_{n+1}, and θ. Note that

  E[(X_{n+1} − θ)θ] = E{E[(X_{n+1} − θ)θ | θ]} = 0,

so that X_{n+1} − θ and θ are uncorrelated and, being jointly normal, independent. Given X̄ = t, X_{n+1} − θ and θ are still uncorrelated and independent. Thus,

  L{X_{n+1} | X̄ = t} = L{(X_{n+1} − θ) + θ | X̄ = t} = N(μ_B, σ0² + σ_B²),

where, as in Example 4.7.1,

  μ_B = (nt/σ0² + η0/τ²)/(n/σ0² + 1/τ²), σ_B² = (n/σ0² + 1/τ²)⁻¹.

It follows that a level (1 − α) Bayesian prediction interval for Y is [Y_B⁻, Y_B⁺] with

  Y_B± = μ_B ± z(1 − α/2)√(σ0² + σ_B²).    (4.8.3)

To consider the frequentist properties of the Bayesian prediction interval (4.8.3), we compute its probability limit under the assumption that X1, ..., Xn are i.i.d. N(θ, σ0²). Because σ_B² → 0 and X̄ → θ in probability as n → ∞, we find that the interval (4.8.3) converges in probability to θ ± z(1 − α/2)σ0 as n → ∞. This is the same as the probability limit of the frequentist interval (4.8.1). □

The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box, 1983).

Summary. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1 − α). In the case of a normal sample of size n + 1 with only n variables observable, we construct the Student t prediction interval for the unobservable variable. For a sample of size n + 1 from a continuous distribution we show how the order statistics can be used to give a distribution-free prediction interval. The Bayesian formulation is based on the posterior predictive distribution, which is the conditional distribution of the unobservable variable given the observable variables. The Bayesian prediction interval is derived for the normal model with a normal prior.
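A sketch of (4.8.3) in Python (ours):

```python
import numpy as np
from scipy.stats import norm

def bayes_prediction_interval(x, sigma0, eta0, tau, alpha=0.05):
    """Level (1 - alpha) Bayesian prediction interval (4.8.3) for X_{n+1}:
    X_i | theta ~ N(theta, sigma0^2), prior theta ~ N(eta0, tau^2)."""
    n, xbar = len(x), np.mean(x)
    precision = n / sigma0**2 + 1 / tau**2
    mu_B = (n * xbar / sigma0**2 + eta0 / tau**2) / precision
    var_B = 1 / precision
    half = norm.ppf(1 - alpha / 2) * np.sqrt(sigma0**2 + var_B)
    return mu_B - half, mu_B + half
```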
4.9 LIKELIHOOD RATIO PROCEDURES

4.9.1 Introduction

Up to this point, the results and examples in this chapter deal mostly with one-parameter problems in which it sometimes is possible to find optimal procedures. However, even in the case in which θ is one-dimensional, optimal procedures may not exist. For instance, there is no UMP test for testing H : μ = μ0 vs K : μ ≠ μ0 when X1, ..., Xn is a sample from a N(μ, σ²) population with σ² known. To see this, note that if μ1 > μ0, the MP level α test φ_α(X) rejects H for T ≥ z(1 − α), where T = √n(X̄ − μ0)/σ; on the other hand, if μ1 < μ0, the MP level α test δ_α(X) rejects H if T ≤ z(α). Because δ_α(x) ≠ φ_α(x), by the uniqueness of the NP test (Theorem 4.2.1(c)), there can be no UMP test of H : μ = μ0 vs K : μ ≠ μ0.

In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6; likelihood ratio tests also have weak optimality properties to be discussed there.

We start with a generalization of the Neyman–Pearson statistic p(x, θ1)/p(x, θ0). Suppose that X = (X1, ..., Xn) has density or frequency function p(x, θ) and we wish to test H : θ ∈ Θ0 vs K : θ ∈ Θ1. The test statistic we want to consider is the likelihood ratio given by

  L(x) = sup{p(x, θ) : θ ∈ Θ1} / sup{p(x, θ) : θ ∈ Θ0}.

Tests that reject H for large values of L(x) are called likelihood ratio tests. To see that this is a plausible statistic, recall from Section 2.2 that we think of the likelihood function L(θ, x) = p(x, θ) as a measure of how well θ "explains" the given sample x = (x1, ..., xn). So, if sup{p(x, θ) : θ ∈ Θ1} is large compared to sup{p(x, θ) : θ ∈ Θ0}, then the observed sample is best explained by some θ ∈ Θ1, and conversely. Also note that L(x) coincides with the optimal test statistic p(x, θ1)/p(x, θ0) when Θ0 = {θ0} and Θ1 = {θ1}.

In the cases we shall consider, p(x, θ) is a continuous function of θ and Θ0 is of smaller dimension than Θ = Θ0 ∪ Θ1, so that the likelihood ratio equals the test statistic

  λ(x) = sup{p(x, θ) : θ ∈ Θ} / sup{p(x, θ) : θ ∈ Θ0},    (4.9.1)

whose computation is often simple. Note that in general λ(x) = max(L(x), 1). We are going to derive likelihood ratio tests in several important testing problems. Although the calculations differ from case to case, the basic steps are always the same:

1. Calculate the MLE θ̂ of θ.
2. Calculate the MLE θ̂0 of θ where θ may vary only over Θ0.
3. Form λ(x) = p(x, θ̂)/p(x, θ̂0).
4. Find a function h that is strictly increasing on the range of λ such that h(λ(X)) has a simple form and a tabled distribution under H. Because h(λ(X)) is equivalent to λ(X), we specify the size α likelihood ratio test through the test statistic h(λ(X)) and its (1 − α)th quantile obtained from the table.
We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions and bounds. For instance, we can invert the family of size α likelihood ratio tests of the point hypothesis H : θ = θ0 and obtain the level (1 − α) confidence region

  C(x) = {θ : p(x, θ) ≥ [c(θ)]⁻¹ sup_Θ p(x, θ′)},    (4.9.2)

where sup_Θ denotes sup over θ′ ∈ Θ and the critical constant c(θ) satisfies

  P_θ[sup_Θ p(X, θ′)/p(X, θ) ≥ c(θ)] = α.

It is often approximately true (see Chapter 6) that c(θ) is independent of θ. In that case, C(x) is just the set of all θ whose likelihood is on or above some fixed value dependent on the data.

This section includes situations in which θ = (θ1, θ2), where θ1 is the parameter of interest and θ2 is a nuisance parameter. We shall obtain likelihood ratio tests for hypotheses of the form H : θ1 = θ10, which are composite because θ2 can vary freely. The family of such level α likelihood ratio tests obtained by varying θ10 can also be inverted to yield confidence regions for θ1; to see how the process works we refer to the specific examples in the sections that follow.

4.9.2 Tests for the Mean of a Normal Distribution—Matched Pair Experiments

Suppose X1, ..., Xn form a sample from a N(μ, σ²) population in which both μ and σ² are unknown. An important class of situations for which this model may be appropriate occurs in matched pair experiments. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age, diet, and other factors. In order to reduce differences due to the extraneous factors, we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. After the matching, the experiment proceeds as follows. In the ith pair one patient is picked at random (i.e., with probability 1/2) and given the treatment, while the second patient serves as control and receives a placebo. Response measurements are taken on the treated and control members of each pair. We can regard twins as being matched pairs. Studies in which subjects serve as their own control can also be thought of as matched pair experiments: we measure the response of a subject when under treatment and when not under treatment.

Let Xi denote the difference between the treated and control responses for the ith pair. We are interested in expected differences in responses due to the treatment effect. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo, sales performance before and after a course in salesmanship, mileage of cars with and without a certain ingredient or adjustment, and so on. If the treatment and placebo have the same effect, the difference Xi has a distribution that is
symmetric about zero. To this we have added the normality assumption; the test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the normality assumption is not satisfied.

Let μ = E(X1) denote the mean difference between the response of the treated and control subjects. We think of μ as representing the treatment effect. Our null hypothesis of no treatment effect is then H : μ = 0. However, for the purpose of referring to the duality between testing and confidence procedures, we test H : μ = μ0, where we think of μ0 as an established standard for an old treatment.

Two-Sided Tests

We begin by considering K : μ ≠ μ0. This corresponds to the alternative "The treatment has some effect, good or bad." As discussed in Section 4.5, the test can be modified into a three-decision rule that decides whether there is a significant positive or negative effect.

Form of the Two-Sided Tests

Let θ = (μ, σ²) and Θ0 = {(μ, σ²) : μ = μ0}. Under our assumptions,

  λ(x) = sup{p(x, θ) : θ ∈ Θ}/sup{p(x, θ) : θ ∈ Θ0}.

The problem of finding sup{p(x, θ) : θ ∈ Θ} was solved in Chapter 2: we found that

  sup{p(x, θ) : θ ∈ Θ} = p(x, θ̂), where θ̂ = (x̄, σ̂²), σ̂² = n⁻¹ Σ_{i=1}^n (xi − x̄)²,

is the maximum likelihood estimate of θ. Finding sup{p(x, θ) : θ ∈ Θ0} boils down to finding the maximum likelihood estimate σ̂0² of σ² when μ = μ0 is known and then evaluating p(x, (μ0, σ̂0²)). The likelihood equation is

  (∂/∂σ²) log p(x, θ) = (1/2)σ⁻⁴ [Σ_{i=1}^n (xi − μ0)² − nσ²] = 0,

which has the immediate solution

  σ̂0² = n⁻¹ Σ_{i=1}^n (xi − μ0)².
By Theorem 2.3.1, σ̂0² gives the maximum of p(x, θ) over Θ0. Therefore,

  log λ(x) = log p(x, θ̂) − log p(x, θ̂0)
    = {−(n/2)[(log 2π) + (log σ̂²)] − n/2} − {−(n/2)[(log 2π) + (log σ̂0²)] − n/2}
    = (n/2) log(σ̂0²/σ̂²).

Our test rule, therefore, rejects H for large values of σ̂0²/σ̂². To simplify the rule further we use the following equation, which can be established by expanding both sides:

  σ̂0²/σ̂² = 1 + (x̄ − μ0)²/σ̂².

Because s² = (n − 1)⁻¹ Σ(xi − x̄)² = nσ̂²/(n − 1), the ratio σ̂0²/σ̂² is a monotone increasing function of |Tn|, where

  Tn = √n(x̄ − μ0)/s.

Therefore, the likelihood ratio tests reject for large values of |Tn|; the statistic Tn is equivalent to the likelihood ratio statistic λ for this problem. Because Tn has a T_{n−1} distribution under H (see Example 4.4.1), the size α critical value is t_{n−1}(1 − α/2), and we can use calculators or software that give quantiles of the t distribution, or Table III, to find the critical value. For instance, suppose n = 25 and we want α = 0.05. Then the size α critical value is t_{24}(0.975) = 2.064, and we reject H if, and only if, |Tn| ≥ 2.064.

One-Sided Tests

The two-sided formulation is natural if two treatments, A and B, are considered to be equal before the experiment is performed. However, if we are comparing a treatment and control, the relevant question is whether the treatment creates an improvement. Thus, the testing problem H : μ ≤ μ0 versus K : μ > μ0 (with μ0 = 0) is suggested. In the problems we argue that P_θ[Tn ≥ t] is increasing in δ = √n(μ − μ0)/σ. Therefore, the test that rejects H for

  Tn ≥ t_{n−1}(1 − α)

is of size α for H : μ ≤ μ0. Similarly, the size α likelihood ratio test for H : μ ≥ μ0 versus K : μ < μ0 rejects H if, and only if, Tn ≤ −t_{n−1}(1 − α); a proof is sketched in the problems.
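A compact Python sketch of these tests (ours); the docstring records the monotone relation between λ and Tn derived above:

```python
import numpy as np
from scipy.stats import t

def t_test(x, mu0, alpha=0.05, alternative="two-sided"):
    """Size-alpha likelihood ratio (t) tests of Section 4.9.2.
    Equivalent to lambda(x), since lambda^(2/n) = 1 + Tn^2/(n - 1)."""
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    Tn = np.sqrt(n) * (xbar - mu0) / s
    if alternative == "two-sided":                 # K: mu != mu0
        reject = abs(Tn) >= t.ppf(1 - alpha / 2, df=n - 1)
    elif alternative == "greater":                 # H: mu <= mu0 vs K: mu > mu0
        reject = Tn >= t.ppf(1 - alpha, df=n - 1)
    else:                                          # H: mu >= mu0 vs K: mu < mu0
        reject = Tn <= -t.ppf(1 - alpha, df=n - 1)
    return Tn, reject
```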
Power Functions

To derive the distribution of Tₙ, note from Section B.3 that √n(X̄ − μ)/σ and (n − 1)s²/σ² are independent and that (n − 1)s²/σ² has a χ²ₙ₋₁ distribution. Because E[√n(X̄ − μ₀)/σ] = √n(μ − μ₀)/σ, the ratio √n(X̄ − μ₀)/σ has a N(δ, 1) distribution with δ = √n(μ − μ₀)/σ. To discuss the power of these tests, we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter δ, denoted by 𝒯ₖ,δ. This distribution is, by definition, the distribution of Z/√(V/k), where Z and V are independent with N(δ, 1) and χ²ₖ distributions, respectively. The density of Z/√(V/k) is given in Problem 4.9.12. Thus,

\[ T_n = \frac{\sqrt{n}(\bar X - \mu_0)/\sigma}{\sqrt{[(n-1)s^2/\sigma^2]/(n-1)}} \]

has a 𝒯ₙ₋₁,δ distribution. Note that the distribution of Tₙ depends on θ = (μ, σ²) only through δ, and the power can be obtained from computer software or tables of the noncentral t distribution. The power functions of the one-sided tests are monotone in δ (Problem 4.9.11), just as the power functions of the corresponding tests with σ known are monotone in √n μ/σ.

We can control both probabilities of error by selecting the sample size n large, provided we consider alternatives of the form |δ| ≥ δ₁ > 0 in the two-sided case, and δ ≥ δ₁ or δ ≤ −δ₁ in the one-sided cases. Computer software will compute n. If we consider alternatives of the form (μ − μ₀) ≥ Δ, however, we can no longer control both probabilities of error by choosing the sample size. The reason is that, whatever be n, by making σ sufficiently large we can force the noncentrality parameter δ = √n(μ − μ₀)/σ as close to 0 as we please and, thus, bring the power arbitrarily close to α. We have met similar difficulties (Problem 4.4.7) when discussing confidence intervals. A Stein solution, in which we estimate σ from a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with |μ − μ₀| ≥ Δ, is possible (Lehmann, 1997, p. 260, Problem 17). With this solution, however, we may be required to take more observations than we can afford on the second stage.

Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

\[ C(X) = \{\mu : |\sqrt{n}(\bar X - \mu)/s| \le t_{n-1}(1-\tfrac{1}{2}\alpha)\}. \]

We recognize C(X) as the confidence interval of Example 4.4.1. Similarly, the one-sided tests lead to the corresponding lower and upper confidence bounds.
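The power calculation described above is available numerically through the noncentral t distribution. The following is a minimal sketch, not from the text, assuming scipy's nct implementation; the numerical values of mu, mu0, and sigma are hypothetical.

```python
import numpy as np
from scipy import stats

def power_one_sided_t(n, mu, mu0, sigma, alpha=0.05):
    """Power of the size-alpha test rejecting for Tn >= t_{n-1}(1 - alpha),
    computed from the noncentral t distribution with nc = sqrt(n)(mu-mu0)/sigma."""
    delta = np.sqrt(n) * (mu - mu0) / sigma      # noncentrality parameter
    crit = stats.t.ppf(1 - alpha, df=n - 1)
    return stats.nct.sf(crit, df=n - 1, nc=delta)

# smallest n giving power at least 0.9 at a fixed delta-type alternative
n = 2
while power_one_sided_t(n, mu=0.5, mu0=0.0, sigma=1.0) < 0.9:
    n += 1
```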
. p..2 the likelihood function and its log are maximized In Problem 4. blood pressure). then x = 1.3 5 0.1 1.2 2 1..6.5 1. Tests We first consider the problem of testing H : Jl1 = J:L2 versus H : Jli =F Jl2. Then Xl. Jl2. A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.Section 4. . The preceding assumptions were discussed in Example 1. and so forth..1 1. while YI . .32.513.7 1.3).9.2 0.06.. .7 5. 'Yn2 are independent samples from N (JlI. we conclude at the 1% level of significance that the two drugs are significantly different...2 1. For instance.8 2.58.0 7 3..g.6 10 2.6 0. Let e = (Jlt. respectively.1.Xnl and YI . Because tg(0.25.9. 'Yn2 are the measurements on a sample given the drug. As in Section 4. 8 2 = 1. height. weight. Yn2 .3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations We often want to compare two populations with distribution F and G on the basis of two independent samples X I.9.0 3. suppose we wanted to test the effect of a certain drug on some biological variable (e..Xn1 and YI . 1958. . . In the control versus treatment example.Xn1 could be blood pressure measurements on a sample of patients given a placebo. .84].9 1. 121) giving the difference B . . ( 2) and N (Jl2..4 If we denote the difference as x's.4 1. Patient i A B BA 1 0.) 4. temperature.2.1 0. It suggests that not only are the drugs different but in fact B is better than A because no hypothesis Jl = Jl' < 0 is accepted at this level..995) = 3.99 confidence interval for the mean difference Jl between treatments is [0.3.0 4. Then 8 0 = {e : Jli = Jl2} and 9 1 = {e : Jli =F Jl2}' The log of the likelihood of (X. The 0.Y ) is n2 where n = nl + n2. ( 2). one from each popUlation. consider the following data due to Cushny and Peebles (see Fisher.0 6 3.4 4.'" .8 8 0.8 1. length. this is the problem of determining whether the treatment has any effect.1 0. YI .9 likelihood Ratio Procedures 261 Data Example As an illustration of these procedures.4 3 0. it is usually assumed that X I. and ITnl = 4. . For quantitative measurements such as blood pressure. it is shown that = over 8 by the maximum likelihood estimate e. . volume.8 9 0. Y) = (Xl. ( 2) populations... This is a matched pair experiment with each subject serving as its own control...6 0. e .Xn1 .6 4. .5. (See also (4.A in sleep gained using drugs A and B on 10 patients.3 4 1.. .4 1.
4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations

We often want to compare two populations with distributions F and G on the basis of two independent samples X₁, …, X_{n₁} and Y₁, …, Y_{n₂}, one from each population. For instance, suppose we wanted to test the effect of a certain drug on some biological variable (e.g., blood pressure). Then X₁, …, X_{n₁} could be blood pressure measurements on a sample of patients given a placebo, while Y₁, …, Y_{n₂} are the measurements on a sample given the drug. In the control versus treatment example, this is the problem of determining whether the treatment has any effect. For quantitative measurements such as blood pressure, height, weight, length, temperature, volume, and so forth, it is usually assumed that X₁, …, X_{n₁} and Y₁, …, Y_{n₂} are independent samples from N(μ₁, σ²) and N(μ₂, σ²) populations, respectively. The preceding assumptions were discussed in Example 1.1.3. A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.

Tests

Let θ = (μ₁, μ₂, σ²). We first consider the problem of testing H : μ₁ = μ₂ versus K : μ₁ ≠ μ₂. Then Θ₀ = {θ : μ₁ = μ₂} and Θ₁ = {θ : μ₁ ≠ μ₂}. The log of the likelihood of (X, Y) = (X₁, …, X_{n₁}, Y₁, …, Y_{n₂}) is

\[ -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{n_1}(x_i-\mu_1)^2 + \sum_{j=1}^{n_2}(y_j-\mu_2)^2\right], \]

where n = n₁ + n₂. In Problem 4.9.2 it is shown that the likelihood function and its log are maximized over Θ by the maximum likelihood estimate θ̂ = (X̄, Ȳ, σ̂²), where

\[ \hat\sigma^2 = \frac{1}{n}\left[\sum_{i=1}^{n_1}(X_i-\bar X)^2 + \sum_{j=1}^{n_2}(Y_j-\bar Y)^2\right]. \]

When μ₁ = μ₂ = μ, our model reduces to the one-sample model of Section 4.9.2, and the maximum of p over Θ₀ is obtained for θ = (μ̂, μ̂, σ̂₀²), where

\[ \hat\mu = \frac{n_1\bar X + n_2\bar Y}{n}, \qquad \hat\sigma_0^2 = \frac{1}{n}\left[\sum_{i=1}^{n_1}(X_i-\hat\mu)^2 + \sum_{j=1}^{n_2}(Y_j-\hat\mu)^2\right]. \]

If we use the identities

\[ \sum_{i=1}^{n_1}(x_i-\hat\mu)^2 = \sum_{i=1}^{n_1}(x_i-\bar x)^2 + n_1(\bar x-\hat\mu)^2, \qquad \sum_{j=1}^{n_2}(y_j-\hat\mu)^2 = \sum_{j=1}^{n_2}(y_j-\bar y)^2 + n_2(\bar y-\hat\mu)^2, \]

obtained by writing [xᵢ − μ̂] = [(xᵢ − x̄) + (x̄ − μ̂)] and expanding, and note that x̄ − μ̂ = −(n₂/n)(ȳ − x̄) and ȳ − μ̂ = (n₁/n)(ȳ − x̄), we find that the log likelihood ratio statistic is equivalent to the test statistic |T|, where

\[ T = \frac{\bar Y - \bar X}{s\sqrt{1/n_1 + 1/n_2}}, \qquad s^2 = \frac{\sum_{i=1}^{n_1}(X_i-\bar X)^2 + \sum_{j=1}^{n_2}(Y_j-\bar Y)^2}{n-2}. \]
To complete the specification of the size α likelihood ratio test, we show that T has a 𝒯ₙ₋₂ distribution when μ₁ = μ₂. By Theorem B.3.3, X̄ and Ȳ are independent and distributed as N(μ₁, σ²/n₁) and N(μ₂, σ²/n₂), respectively; thus, √(n₁n₂/n)(Ȳ − X̄)/σ has a N(0, 1) distribution under H. We conclude from this remark and the additive property of the χ² distribution that (n − 2)s²/σ² has a χ²ₙ₋₂ distribution and is independent of √(n₁n₂/n)(Ȳ − X̄)/σ. Therefore, by definition, T has a 𝒯ₙ₋₂ distribution under H, and the resulting two-sided, size α, two-sample t test rejects if, and only if,

\[ |T| \ge t_{n-2}(1-\tfrac{1}{2}\alpha). \]

As usual, corresponding to the two-sided test, there are two one-sided tests, with critical regions T ≥ tₙ₋₂(1 − α) for H : μ₂ ≤ μ₁, and T ≤ −tₙ₋₂(1 − α) for H : μ₂ ≥ μ₁. We can show that these tests are likelihood ratio tests for these hypotheses, and it is also true that these procedures are of size α for their respective hypotheses. If μ₁ ≠ μ₂, T has a noncentral t distribution with noncentrality parameter √(n₁n₂/n)(μ₂ − μ₁)/σ.

Confidence Intervals

To obtain confidence intervals for Δ = μ₂ − μ₁, we naturally look at likelihood ratio tests for the family of testing problems H : μ₂ − μ₁ = Δ versus K : μ₂ − μ₁ ≠ Δ. As for the special case Δ = 0, we find a simple equivalent statistic |T(Δ)|, where

\[ T(\Delta) = \frac{\bar Y - \bar X - \Delta}{s\sqrt{1/n_1 + 1/n_2}}. \]

If μ₂ − μ₁ = Δ, then T(Δ) has a 𝒯ₙ₋₂ distribution, and inversion of the tests leads to the interval

\[ \bar Y - \bar X \pm s\sqrt{1/n_1 + 1/n_2}\; t_{n-2}(1-\tfrac{1}{2}\alpha) \tag{4.9.3} \]

for μ₂ − μ₁. As in the one-sample case, the one-sided tests lead to the upper and lower endpoints of the interval as level 1 − ½α likelihood confidence bounds.
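For reference, a minimal Python sketch of the pooled two-sample t test and the interval (4.9.3); it is not part of the text, assumes numpy and scipy, and is written for arbitrary data vectors.

```python
import numpy as np
from scipy import stats

def two_sample_t(x, y, delta=0.0, alpha=0.05):
    """Equal-variance (pooled) two-sample t test of H: mu2 - mu1 = delta,
    with the level (1 - alpha) interval (4.9.3) for mu2 - mu1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = x.size, y.size
    s2 = (((x - x.mean())**2).sum() + ((y - y.mean())**2).sum()) / (n1 + n2 - 2)
    se = np.sqrt(s2 * (1/n1 + 1/n2))
    T = (y.mean() - x.mean() - delta) / se
    crit = stats.t.ppf(1 - alpha/2, df=n1 + n2 - 2)
    d = y.mean() - x.mean()
    return T, crit, (d - crit * se, d + crit * se)
```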
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results in terms of logarithms were (from Hald, 1952, p. 472)

x (machine 1): 1.845  1.790  2.042
y (machine 2): 1.583  1.627  1.282

We test the hypothesis H of no difference in expected log permeability. Here Ȳ − X̄ = −0.395, s² = 0.0264, and |T| = 2.977. Because H is rejected if |T| ≥ tₙ₋₂(1 − ½α) and t₄(0.975) = 2.776, we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is −0.395 ± 0.368. On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level (1 − α) confidence interval has probability at most ½α of making the wrong selection.

4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the X's and Y's have different variances. For instance, a treatment that increases mean response may also increase the variance of the responses. If normality holds, we are led to a model where X₁, …, X_{n₁} and Y₁, …, Y_{n₂} are two independent N(μ₁, σ₁²) and N(μ₂, σ₂²) samples. As a first step we may still want to compare mean responses for the X and Y populations. This is the Behrens-Fisher problem.

Suppose first that σ₁² and σ₂² are known. The log likelihood, except for an additive constant, is

\[ -\frac{1}{2\sigma_1^2}\sum_{i=1}^{n_1}(x_i-\mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{n_2}(y_j-\mu_2)^2 . \]

The MLEs of μ₁ and μ₂ are, respectively, μ̂₁ = X̄ and μ̂₂ = Ȳ for (μ₁, μ₂) ∈ R × R. When μ₁ = μ₂ = μ, setting the derivative of the log likelihood equal to zero yields the MLE
\[ \hat\mu = \frac{n_1\bar X + \lambda n_2\bar Y}{n_1 + \lambda n_2}, \qquad \lambda = \sigma_1^2/\sigma_2^2, \]

so that μ̂ − x̄ = λn₂(ȳ − x̄)/(n₁ + λn₂) and μ̂ − ȳ = −n₁(ȳ − x̄)/(n₁ + λn₂). Next we compute

\[ \lambda(x,y) = \exp\left\{\frac{n_1}{2\sigma_1^2}(\hat\mu-\bar x)^2 + \frac{n_2}{2\sigma_2^2}(\hat\mu-\bar y)^2\right\}, \]

where the form of the exponent follows by writing

\[ \sum_{i=1}^{n_1}(x_i-\hat\mu)^2 = \sum_{i=1}^{n_1}(x_i-\bar x)^2 + n_1(\hat\mu-\bar x)^2, \qquad \sum_{j=1}^{n_2}(y_j-\hat\mu)^2 = \sum_{j=1}^{n_2}(y_j-\bar y)^2 + n_2(\hat\mu-\bar y)^2 . \]

It follows that the likelihood ratio test is equivalent to the statistic |D|/σ_D, where D = Ȳ − X̄ and σ_D² is the variance of D, that is,

\[ \sigma_D^2 = \mathrm{Var}(\bar X) + \mathrm{Var}(\bar Y) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} . \]

D/σ_D has a N(Δ/σ_D, 1) distribution, where Δ = μ₂ − μ₁.

When σ₁² and σ₂² are unknown, σ_D² must be estimated. An unbiased estimate is

\[ s_D^2 = \frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2} . \]

It is natural to try and use D/s_D as a test statistic for the one-sided hypothesis H : μ₂ ≤ μ₁, |D|/s_D as a test statistic for H : μ₁ = μ₂, and more generally (D − Δ)/s_D to generate confidence procedures. For large n₁ and n₂, (D − Δ)/s_D has approximately a standard normal distribution (Problem 5.3.28), by Slutsky's theorem and the central limit theorem. Unfortunately, the distribution of (D − Δ)/s_D depends on σ₁²/σ₂² for fixed n₁, n₂. For small and moderate n₁, n₂, an approximation to the distribution of (D − Δ)/s_D due to Welch (1949) works well.
Let c = (s_X²/n₁)/s_D². Then Welch's approximation is 𝒯ₖ, where

\[ k = \left[\frac{c^2}{n_1-1} + \frac{(1-c)^2}{n_2-1}\right]^{-1} . \]

When k is not an integer, the critical value is obtained by linear interpolation in the t tables or using computer software. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the Behrens-Fisher problem. Wang (1971) has shown the approximation to be very good for α = 0.05 and α = 0.01, the maximum error in size being bounded by 0.003. Note that Welch's solution works whether the variances are equal or not. The LR procedure derived in Section 4.9.3, which works well if the variances are equal or n₁ = n₂, can unfortunately be very misleading if σ₁² ≠ σ₂² and n₁ ≠ n₂. See Figure 5.3.3 and Problem 5.3.28.
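A minimal Python sketch of Welch's solution, not from the text, assuming numpy and scipy; scipy accepts non-integer degrees of freedom, so no interpolation is needed.

```python
import numpy as np
from scipy import stats

def welch_test(x, y, delta=0.0):
    """(D - delta)/s_D referred to Welch's approximate t distribution T_k."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = x.size, y.size
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2
    sD2 = v1 + v2                       # s_D^2 = s_X^2/n1 + s_Y^2/n2
    T = (y.mean() - x.mean() - delta) / np.sqrt(sD2)
    c = v1 / sD2
    k = 1.0 / (c**2 / (n1 - 1) + (1 - c)**2 / (n2 - 1))   # Welch's k
    return T, k, 2 * stats.t.sf(abs(T), df=k)
```

For delta = 0 this agrees with scipy.stats.ttest_ind(x, y, equal_var=False).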
4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions

If n subjects (persons, machines, fields, mice, etc.) are sampled from a population and two numerical characteristics are measured on each case, then we end up with a bivariate random sample (X₁, Y₁), …, (Xₙ, Yₙ). Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X, Y) have a joint bivariate normal distribution, N(μ₁, μ₂, σ₁², σ₂², ρ), with σ₁² > 0, σ₂² > 0.

Testing Independence, Confidence Intervals for ρ

The question "Are two random variables X and Y independent?" arises in many statistical studies. Some familiar examples are: X = test score on English exam, Y = score on mathematics exam; X = percentage of fat in diet, Y = cholesterol level in blood; X = average cigarette consumption per day in grams, Y = age at death; X = weight, Y = blood pressure. If we have a sample as before and assume the bivariate normal model for (X, Y), our problem becomes that of testing H : ρ = 0.

Two-Sided Tests

The unrestricted maximum likelihood estimate θ̂ = (X̄, Ȳ, σ̂₁², σ̂₂², ρ̂) was given in Chapter 2, with

\[ \hat\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)^2, \quad \hat\sigma_2^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i-\bar Y)^2, \quad \hat\rho = \left[\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)\right]\Big/ n\hat\sigma_1\hat\sigma_2 . \]

ρ̂ is called the sample correlation coefficient and satisfies −1 ≤ ρ̂ ≤ 1. When ρ = 0 we have two independent samples, and θ̂₀ can be obtained by separately maximizing the likelihood of X₁, …, Xₙ and that of Y₁, …, Yₙ. We have θ̂₀ = (X̄, Ȳ, σ̂₁², σ̂₂², 0), and the log of the likelihood ratio statistic becomes

\[ \log\lambda(x) = -\tfrac{n}{2}\log(1-\hat\rho^2). \tag{4.9.4} \]

Thus, log λ(x) is an increasing function of ρ̂², and the likelihood ratio tests reject H for large values of |ρ̂|. To obtain critical values we need the distribution of ρ̂, or of an equivalent statistic, under H. If ρ = 0, then

\[ T_n = \sqrt{n-2}\,\hat\rho/\sqrt{1-\hat\rho^2} \tag{4.9.5} \]

has a 𝒯ₙ₋₂ distribution. Because |Tₙ| is an increasing function of |ρ̂|, the two-sided likelihood ratio tests can be based on |Tₙ|, and the critical values obtained from Table II.

There is no simple form for the distribution of ρ̂ (or Tₙ) when ρ ≠ 0. However, if Uᵢ = (Xᵢ − μ₁)/σ₁ and Vᵢ = (Yᵢ − μ₂)/σ₂, then (U₁, V₁), …, (Uₙ, Vₙ) is a sample from the N(0, 0, 1, 1, ρ) distribution, and because ρ̂ is unchanged by this standardization, the distribution of ρ̂ depends on ρ only. For ρ ≠ 0 the distribution of ρ̂ is available in computer packages, and a normal approximation is available; see Example 5.3.6. Qualitatively, the power function of the LR test is symmetric about ρ = 0 and increases continuously from α to 1 as |ρ| goes from 0 to 1. Thus, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
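A minimal Python sketch of the two-sided test based on (4.9.5); not part of the text, assuming numpy and scipy. It is equivalent to scipy.stats.pearsonr under the bivariate normal model.

```python
import numpy as np
from scipy import stats

def corr_test(x, y):
    """Two-sided LR test of H: rho = 0 via Tn = sqrt(n-2) r / sqrt(1 - r^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    r = np.corrcoef(x, y)[0, 1]          # sample correlation coefficient
    Tn = np.sqrt(n - 2) * r / np.sqrt(1 - r**2)
    return r, Tn, 2 * stats.t.sf(abs(Tn), df=n - 2)
```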
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood, we would test H : ρ = 0 versus K : ρ > 0 or H : ρ ≤ 0 versus K : ρ > 0. It can be shown that ρ̂ is equivalent to the likelihood ratio statistic for testing H : ρ ≤ 0 versus K : ρ > 0, and similarly that −ρ̂ corresponds to the likelihood ratio statistic for H : ρ ≥ 0 versus K : ρ < 0. We can show that P_ρ[ρ̂ ≥ c] is an increasing function of ρ for fixed c (Problem 4.9.15). Therefore, we obtain size α tests for each of these hypotheses by setting the critical value so that the probability of type I error is α when ρ = 0. The power functions of these tests are monotone.

Confidence Bounds and Intervals

Usually testing independence is not enough, and we want bounds and intervals for ρ giving us an indication of what departure from independence is present. To obtain lower confidence bounds, we can start by constructing size α likelihood ratio tests of H : ρ = ρ₀ versus K : ρ > ρ₀. These tests can be shown to be of the form "Accept if, and only if, ρ̂ ≤ c(ρ₀)," where P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = 1 − α. We obtain c(ρ) either from computer software or by the approximation of Chapter 5. Because c can be shown to be monotone increasing in ρ, inversion of this family of tests leads to level 1 − α lower confidence bounds. We can similarly obtain 1 − α upper confidence bounds and, by putting two level 1 − ½α bounds together, we obtain a commonly used confidence interval for ρ. These intervals do not correspond to the inversion of the size α LR tests of H : ρ = ρ₀ versus K : ρ ≠ ρ₀ but rather of the "equal-tailed" test that rejects if, and only if, ρ̂ ≥ d(ρ₀) or ρ̂ ≤ c(ρ₀), where P_{ρ₀}[ρ̂ ≥ d(ρ₀)] = P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = ½α. However, for large n the equal-tails and LR confidence intervals approximately coincide with each other.

Data Example

As an illustration, consider a bivariate sample of weights xᵢ of young rats at a certain age and the weight increases yᵢ during the following week. We want to know whether there is a correlation between the initial weights and the weight increases, and formulate the hypothesis H : ρ = 0. Here ρ̂ = 0.18 and Tₙ = 0.48; by using the two-sided test and referring to the tables, we find that there is no evidence of correlation: the p-value is bigger than 0.1.
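One widely used large-sample approximation of the kind referred to above (it is not derived in this section) is Fisher's z transformation. A minimal Python sketch, assuming numpy and scipy:

```python
import numpy as np
from scipy import stats

def rho_interval(r, n, alpha=0.05):
    """Approximate equal-tailed interval for rho from Fisher's
    z = arctanh(r), which is roughly N(arctanh(rho), 1/(n-3))."""
    z = np.arctanh(r)
    half = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)
```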
Summary. The likelihood ratio test statistic λ is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. We find the likelihood ratio tests and associated confidence procedures for four classical normal models:

(1) Matched pair experiments in which differences are modeled as N(μ, σ²) and we test the hypothesis that the mean difference μ is zero. The likelihood ratio test is equivalent to the one-sample (Student) t test.

(2) Two-sample experiments in which two independent samples are modeled as coming from N(μ₁, σ²) and N(μ₂, σ²) populations, respectively. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the two-sample (Student) t test.

(3) Two-sample experiments in which two independent samples are modeled as coming from N(μ₁, σ₁²) and N(μ₂, σ₂²) populations, respectively. When σ₁² and σ₂² are known, the likelihood ratio test is equivalent to the test based on |Ȳ − X̄|. When σ₁² and σ₂² are unknown, we use (Ȳ − X̄)/s_D, where s_D is an estimate of the standard deviation of D = Ȳ − X̄. Approximate critical values are obtained using Welch's t distribution approximation.

(4) Bivariate sampling experiments in which we have two measurements X and Y on each case in a sample of n cases. We test the hypothesis that X and Y are independent and find that the likelihood ratio test is equivalent to the test based on |ρ̂|, where ρ̂ is the sample correlation coefficient.

4.10 PROBLEMS AND COMPLEMENTS

Problems for Section 4.1

1. Suppose that X₁, …, Xₙ are independently and identically distributed according to the uniform distribution U(0, θ). Let Mₙ = max(X₁, …, Xₙ) and let δ_c(X) = 1 if Mₙ ≥ c, 0 otherwise. We test H : θ ≤ ½ versus K : θ > ½.
(a) Compute the power function of δ_c and show that it is a monotone increasing function of θ.
(b) What choice of c would make δ_c have size exactly 0.05?
(c) Draw a rough graph of the power function of the δ_c specified in (b) when n = 20.
(d) How large should n be so that the δ_c specified in (b) has power 0.98 for θ = ¾?
(e) If in a sample of size n = 20, Mₙ = 0.48, what is the p-value?
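For Problem 1, parts (a) and (b), the power function has a simple closed form, illustrated in the following minimal Python sketch (not part of the text; the formula is the one derived in part (a)).

```python
def power_uniform_max(theta, c, n):
    """Power of delta_c (reject iff M_n >= c) under U(0, theta):
    P_theta(M_n >= c) = 1 - (c/theta)^n for c <= theta, and 0 for c > theta."""
    if c >= theta:
        return 0.0
    return 1.0 - (c / theta) ** n

# size alpha at the boundary theta = 1/2 gives c = (1 - alpha)**(1/n) / 2
n, alpha = 20, 0.05
c = (1 - alpha) ** (1 / n) / 2
```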
1nz t Z i .. Establish (4. . Assume that F o and F are continuous.) 4.3.1. under H. .. T ~ Hint: See Problem B. (b) Give an expression of the power in tenns of the X~n distribution.r. then the pvalue <>(T) has a unifonn..Xn be a sample from a population with the Rayleigh density . j . . 1). 7. 0: (a) Construct a test of H : B = 1 versus K : B > 1 with approximate size complete sufficient statistic for this model....4 to show that the test with critical region IX > I"ox( 1  <» 12n]. Hint: Approximate the critical region by IX > 1"0(1 + z(1 . Draw a graph of the approximate power function. j=l . 6. X> 0.3. r. " i I 5. If 110 = 25. Let Xl.. (b) Show that the power function of your test is increasing in 8. f(x.ontinuous distribution. (0) Show that.~llog <>(Tj ) has a X~r distribution.2 L:. .. j = 1. .. Tr are independent test statistics for the same simple H and that each Tj has a continuous distribution. Suppose that T 1 .05? 3.3).<»1 VTi)J . Days until failure: 315040343237342316514150274627103037 Is H rejected at level <> ~ 0. (c) Give an approximate expression for the critical value if n is large and B not too close to 0 or 00. .. Let o:(Tj ) denote the pvalue for 1j. .2. .. (cj Use the central limit theorem to show that <l?[ (I"oz( <» II") + VTi(1"  1"0) I 1"] is an approximation to the power of the test in part (a). Show that if H is simple and the test statistic T has a c. .a)th quantile of the X~n distribution. is a size Q test.0 > 0..12.a) is the (1 . . 1 using a I . I . . Hint: See Problem B. . . . X n be a 1'(0) sample. Let Xl. Hint: Use the central limit theorem for the critical value. (d) The following are days until failure of air monitors at a nuclear plant. i I (b) Check that your test statistic has greater expected value under K than under H. give a normal approximation to the significance probability.O) = (xIO')exp{x'/20'}..4.270 Testing and Confidence Regions Chapter 4 (a) Use the result of Problem 8. distribution. (a) Use the MLE X of 0 to construct a level <> test for H : 0 < 00 versus K : 0 > 00 . (Use the central limit theorem. where x(I . U(O.
8. Monte Carlo critical values. Suppose that the distribution L₀ of the statistic T = T(X) is continuous under H and that H is rejected for large values of T. Let T⁽¹⁾, …, T⁽ᴮ⁾ be B independent Monte Carlo simulated values of T. (In practice these can be obtained by drawing B independent samples X⁽¹⁾, …, X⁽ᴮ⁾ from L₀ on the computer and computing T⁽ʲ⁾ = T(X⁽ʲ⁾), j = 1, …, B. For example, if X ∼ F₀, generate a U(0, 1) variable U on the computer and set X = F₀⁻¹(U) as in Problem B.2.12(b).) Next let T₍₁₎, …, T₍B+1₎ denote the ordered values of T, T⁽¹⁾, …, T⁽ᴮ⁾. Show that the test that rejects H if, and only if, T ≥ T₍B+1−m₎ has level α = m/(B + 1). Hint: If H is true, T(X), T(X⁽¹⁾), …, T(X⁽ᴮ⁾) is a sample of size B + 1 from L₀. Use the fact that T(X) is equally likely to be any particular order statistic.
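A minimal Python sketch of the Monte Carlo test of Problem 8; not from the text. The use of the Kolmogorov statistic and the uniform null are hypothetical choices for illustration.

```python
import numpy as np
rng = np.random.default_rng(0)

def monte_carlo_test(t_obs, sampler, statistic, B=999, m=50):
    """Reject H iff t_obs is among the m largest of the B+1 values
    {t_obs, T(1), ..., T(B)}; the level is m/(B+1) under a continuous null."""
    sims = np.array([statistic(sampler()) for _ in range(B)])
    rank_from_top = 1 + np.sum(sims >= t_obs)   # position among the B+1 values
    return rank_from_top <= m

def dn(x):
    """Kolmogorov statistic for H: F = U(0,1)."""
    x = np.sort(x)
    n = x.size
    grid = np.arange(1, n + 1) / n
    return max(np.max(grid - x), np.max(x - (grid - 1 / n)))

reject = monte_carlo_test(dn(rng.uniform(size=20)),
                          lambda: rng.uniform(size=20), dn, B=999, m=50)
```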
respectively. Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test. then the pvalue is U =I . A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time.O) = 0'. the other $105 • One second of transmission on either system COsts $103 each.)... .2 I i i 1. f(2. 00 . > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • .1. then the test that rejects H if. (For a recent review of expected p values see Sackrowitz and SamuelCahn.Fo(T). You want to buy one of two systems. and only if..272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale. = /10 versus K : J. ! J (c) Suppose that for each a E (0. which is independent ofT. the other has v/aQ = 1.L > J.0). and 3 occuring in the HardyWeinberg proportions f(I.. Consider Examples 3.2 and 4. f(3. ! . you intend to test the satellite 100 times. Let 0 < 00 < 0 1 < 1. Expected pvalues. X n . 2N1 + N. j 1 i :J . the UMP test is of the form I {T > c)../"0' Show that EPV(O) = if! ( .05. where" is known.0)'. (a) Show that L(x. (See Problem 4.3.Fe(F"l(l. Show that EPV(O) = P(To > T).. Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1. If each time you test.. Problems for Section 4. I). ! .0) = (1.1) satisfy Pe. 2.2. .0) = 20(1. [2N1 + N. you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0. and 3.10. Consider a population with three kinds of individuals labeled 1.1. Whichever system you buy during the year.. i i (b) Define the expected pvalue as EPV(O) = EeU. Suppose that T has a continuous distribution Fe. .nO /. the power is (3(0) = P(U where FO 1 (u)  < a) = 1.') sample Xl. which system is cheaper on the basis of a year's operation? 1 I Ii I il 2. Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 . 5 about 14% of the time. I X n from this population.2..) . Let To denote a random variable with distribution F o. let N I • N 2 • and N 3 denote the number of Xj equal to I. 01 ) is an increasing function of 2N1 + N. For a sample Xl. 2. take 80 = O.) I 1 . 12. whereas the other a I ~ a ..a)) = inf{t: Fo(t) > u}. I 1 (b) Show that if c > 0 and a E (0. where if! denotes the standard normal distribution. Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t). Without loss of generality.to on the basis of the N(p" . Let T = X/"o and 0 = /" . i • (a) Show that if the test has level a. 3. The first system costs $106 . 1999. One has signalMtonoise ratio v/ao = 2.. > cJ = a.
3. A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time, 5 about 14% of the time, whereas the other four numbers are equally likely to occur (i.e., with probability 0.17). Upon being asked to play, the gambler asks that he first be allowed to test his hypothesis by tossing the die n times.
(a) What test statistic should he use if the only alternative he considers is that the die is fair?
(b) Show that if n = 2 the most powerful level 0.0196 test rejects if, and only if, two 5's are obtained.
(c) Using the approximation of Problem 2(c), find an approximation to the critical value of the MP level α test for this problem.

4. Prove Corollary 4.2.1.

5. In Example 4.2.2, derive the UMP test defined by (4.2.6), where all parameters are known.

6. Prove Theorem 4.2.1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4.2.1.

7. Show that if randomization is permitted, MP size α likelihood ratio tests with 0 < α < 1 have power nondecreasing in the sample size. Hint: The MP test has power at least that of the test with test function δ(x) ≡ α.

8. A formulation of goodness of tests specifies that a test is best if the maximum probability of error (of either type) is as small as possible.
(a) Show that if, in testing H : θ = θ₀ versus K : θ = θ₁, there exists a critical value c such that

P_{θ₀}[L(X, θ₀, θ₁) ≥ c] = 1 − P_{θ₁}[L(X, θ₀, θ₁) ≥ c],

then the likelihood ratio test with critical value c is best in this sense.
(b) Find the test that is best in this sense for Example 4.2.1.

9. A newly discovered skull has cranial measurements (X, Y) known to be distributed either (as in population 0) according to N(0, 0, 1, 1, 0.6) or (as in population 1) according to N(1, 1, 1, 1, 0.6), where all parameters are known. Find a statistic T(X, Y) and a critical value c such that if we use the classification rule, "(X, Y) belongs to population 1 if T ≥ c, and to population 0 if T < c," then the maximum of the two probabilities of misclassification, P₀[T ≥ c] and P₁[T < c], is as small as possible. Hint: Use Problem 8 and recall from Section B.4 that linear combinations of bivariate normal random variables are normally distributed.

10. In Example 4.2.1, find the MP test for testing H : θ = θ₀ versus K : θ = θ₁.

Problems for Section 4.3
1. Let X₁, …, Xₙ be the times in months until failure of n similar pieces of equipment. If the equipment is subject to wear, a model often used (see Barlow and Proschan, 1965) is the one where X₁, …, Xₙ is a sample from a Weibull distribution with density

f(x, λ) = cλx^{c−1} exp{−λx^c}, x > 0,

where c is a known positive constant and λ > 0 is the parameter of interest.
(a) Show that Σᵢ Xᵢ^c is an optimal test statistic for testing H : 1/λ ≤ 1/λ₀ versus K : 1/λ > 1/λ₀. Hint: Show that Xᵢ^c has an E(λ) distribution, so that 2λ Σᵢ Xᵢ^c has a χ²₂ₙ distribution.
(b) Show that the critical value for the size α test with critical region [Σᵢ Xᵢ^c ≥ k] is k = x₂ₙ(1 − α)/2λ₀, where x₂ₙ(1 − α) is the (1 − α)th quantile of the χ²₂ₙ distribution, and that the power function of the UMP level α test is

β(λ) = 1 − G₂ₙ((λ/λ₀) x₂ₙ(1 − α)),

where G₂ₙ denotes the χ²₂ₙ distribution function.
(c) Suppose 1/λ₀ = 12. Find the sample size needed for a level 0.01 test to have power at least 0.95 at the alternative value 1/λ₁ = 15. Use the normal approximation to the critical value and to the probability of rejection.

2. Let Xᵢ be the number of arrivals at a service counter on the ith of a sequence of n days. A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and, hence, that the Xᵢ are a sample from a Poisson distribution with parameter θ, the expected number of arrivals per day. Suppose that if θ ≤ θ₀ it is not worth keeping the counter open.
(a) Exhibit the optimal (UMP) test statistic for H : θ ≤ θ₀ versus K : θ > θ₀.
(b) For what levels can you exhibit a UMP test?
(c) What distribution tables would you need to calculate the power function of the UMP test?

3. In Example 4.3.2, show that the power of the UMP test can be written as β(σ) = Gₙ(σ₀² xₙ(α)/σ²), where Gₙ denotes the χ²ₙ distribution function.

4. Consider the situation of Problem 2. You want to ensure that if the arrival rate is ≤ 10, the probability of your deciding to stay open is ≤ 0.01, but if the arrival rate is ≥ 15, the probability of your deciding to close is also ≤ 0.01. How many days must you observe to ensure that the UMP test of Problem 2 achieves this? (Use the normal approximation.)

5. Show that if X₁, …, Xₙ is a sample from a truncated binomial distribution with

p(x, θ) = [ (n choose x) θˣ(1 − θ)ⁿ⁻ˣ ] / [1 − (1 − θ)ⁿ], x = 1, …, n,

then Σᵢ Xᵢ is an optimal test statistic for testing H : θ = θ₀ versus K : θ > θ₀.
.6) is UMP for testing that the pvalues are uniformly distributed.XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0.1.6..10 Problems and Complements 275 then 2.. imagine a sequence X I.. A > O. 6. P().d. 1 ' Y > 0. against F(u)=u B.2 to find the mean and variance of the optimal test statistic..Q)th quantile of the X~n distribution. (a) Express mean income JL in terms of e. y > 0. > 1.6. survival times with distribution Fa.6. and let YI . In the goodnessoffit Example 4.. . x> c where 8 > 1 and c > O. Suppose that each Xi has the Pareto density f(x. b).~) = 1. A > O.12. Find the UMP test.1.. > po. X N }). 8.Fo(y)  }). (b) NabeyaMiura Alternative. X N e>.). In Problem 1. . (Ii) Show that if we model the distribotion of Y as C(min{X I . . survival times of a sample of patients receiving an experimental treatment. Assume that Fo has ~ density fo... . Let Xl. .:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ .) It follows that Fisher's method for cQmbining pvalues (see 4. (i) Show that if we model the distribution of Y as C(max{X I .O<B<1.O) = cBBx(l+BI. X 2.d. Show how to find critical values. ~ > O. 0 < B < 1. suppose that Fo(x) has a nonzero density on some interval (a. which is independent of XI. . Y n be the Ll. Let N be a zerotruncated Poisson. For the purpose of modeling.[1. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo.8 > 80 .  .B) = Fi!(x). Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a.. random variable. (b) Find the optimal test statistic for testing H : JL = JLo versus K . . 00 < a < b < 00. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b).1. we derived the model G(y. then e. To test whether the new treatment is beneficial we test H : ~ < 1 versus K : . (a) Lehmann Alte~p.1.Fo(y)l". X 2 . then P(Y < y) = 1 e  . 7."" X n denote the incomes of n persons chosen at random from a certain population.•• of Li.. Hint: Use the results of Theorem 1.Section 4.5. and consider the alternative with distribution function F(x. where Xl_ a isthe (1 .tive. (See Problem 4..
3. e e at l · 6. L(x. 0 = 0. > Er • . Oo)d. 'i . I .. 0. We want to test whether F is exponential. eo versus K : B = fh where I i I : 11. B> O. every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t.01.0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi . Problems for Section 4.i. 1 X n be i. Show that under the assumptions of Theorem 4.0". 1 • .(O) 6 .(O)/ 16' I • 00 p(x.j "'" . we test H : {} < 0 versus K : {} > O..X)' = 16. J J • 9.1). i = 1. F(x) = 1 .. Using a pivot based on 1 (Xi . The numerator is an increasing function of T(x).2. ..{Oo} = 1 . Oo)d.X)2.'? = 2. n. 12. Show that the UMP test is based on the statistic L~ I Fo(Y.4 1.2 the class of all Bayes tests is complete. .' L(x. x > 0. 1 X n be a sample from a normal population with unknown mean J.3.O)d.. Show that under the assumptions of Theorem 4. .6t is UMP for testing H : 8 80 versus K : 8 < 80 .(O) «) L > 1 i The lefthand side equals f6". Problem 2. j 1 To see whether the new treatment is beneficial. O)d...' (cf.exp( x).52.276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y.. Show that under the assumptions of Theorem 4. 10.  1 j . Let Xi (f)/2)t~ + €i. F(x) = 1 . Let Xl. = . with distribution function F(x). (a) Show how to construct level (1 . where the €i are independent normal random variables with mean 0 and known variance . ..L and unknown variance (T2. Show that the test is not UMP. 0) e9Fo (Y) Fo(Y)..(O) L':x. 0.).. I Hint: Consider the class of all Bayes tests of H : () = . Let Xl. /. Hint: A Bayes test rejects (accepts) H if J .. Assume that F o has a density !o(Y).0) UeB for '. Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1.{Od varies between 0 and 1.. the denominator decreasing..exp( _x B)..d.0 e6 1 0~0. 1 . . What would you .' " 2. n announce as your level (1 .. . x > 0.3. 00 p(x. . or Weibull.2.1 and 01 loss.
2. Let Xᵢ = (θ/2)tᵢ² + εᵢ be as in Problem 4.3.9.
(a) Using a pivot based on the MLE 2 Σᵢ tᵢ²Xᵢ / Σᵢ tᵢ⁴ of θ, find a fixed length level (1 − α) confidence interval for θ.
(b) If 0 < tᵢ ≤ 1, i = 1, …, n, but we may otherwise choose the tᵢ freely, what values should we use for the tᵢ so as to make our interval as short as possible for given α?

3. Show that if q(X) is a level (1 − α₁) LCB and q̄(X) is a level (1 − α₂) UCB for q(θ), then [q(X), q̄(X)] is a level (1 − (α₁ + α₂)) confidence interval for q(θ). (Define the interval arbitrarily if q > q̄.)

4. Show that if X₁, …, Xₙ are i.i.d. N(μ, σ²) with σ² known and α₁ + α₂ ≤ α, then the shortest level (1 − α) confidence interval of the form

[X̄ − z(1 − α₁)σ/√n, X̄ + z(1 − α₂)σ/√n]

is obtained by taking α₁ = α₂ = ½α. Hint: Reduce to α₁ + α₂ = α by showing that if α₁ + α₂ < α, there is a shorter interval with α₁′ + α₂′ = α. Use calculus.

5. Suppose that an experimenter, thinking he knows the value of σ², uses a lower confidence bound for μ of the form μ(X) = X̄ − c, where c is chosen so that the confidence level under the assumed value of σ² is 1 − α. What is the actual confidence coefficient of μ if σ² can take on all positive values?

6. Suppose that in Problem 5 we know that θ ≤ 0. Justify the level (1 − α) confidence interval obtained by intersecting the interval of Problem 4 with (−∞, 0].
.051 (c) Suppose that n (12 = 5. respectively.) I ~i . (c) If (12.  10. .O). use the result in part (a) to compute an approximate level 0. ..a) confidence interval for sin 1 (v'e).. X has a N(J1. X n1 and Yll . we may very likely be forced to take a prohibitively large number of observations.) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi..O)]l < z (1. (The sticky point of this approach is that we have no control over N. = 0.001.jn is an approximate level (b) If n = 100 and X = 0. Let XI.jn(X .4.3. and.fN(X . (a) Show that in Problem 4. (J'2 jk) distribution. Let 8 ~ B(n.. 4()./3z (1.) (lr!d observations are necessary to achieve the aim of part (a).0') for Jt of length at most 2d. and the fundamental monograph of Wald.a) interval defined by (4. ±. in order to have a level (1 . . The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook. Show that the endpoints of the approximate level (1 . T 2 I I' " I'.. 11. 0") distribution and is independent of sn" 8. .0)/[0(1.a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a).4. find ML estimates of 11. (1  ~a) / 2. Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment.: " are known..95 confidence interval for (J. I 1 j Hint: [O(X) < OJ = [. is _ independent of X no ' Because N depends only on sno' given N = k.1') has a N(o.3. v.. Hint: Set up an inequality for the length and solve for n. 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 . . Y n2 be two independent samples from N(IJ. > z2 (1  is not known exactly. 0) and X = 81n.l4.a) confidence interval for (1/1'). 0. Indicate what tables you wdtdd need to calculate the interval. Let 8 ~ B(n. Hence.1.jn is an approximate level (1  • 12. but we are sure that (12 < (It. exhibit a fixed length level (1 . (12. (b) Exhibit a level (1 . (12 2 9.a:) confidence interval of length at most 2d wheri 0'2 is known. (a) Show that X interval for 8. . (a) If all parameters are unknown. d = i II: . " 1 a) confidence .3) are indeed approximate level (1 . Show that these two quadruples are each sufficient. 1947. I . 1"2. it is necessary to take at least Z2 (1 (12/ d 2 observations. (a) Use (A.~ a) upper and lower bounds. . if (J' is large. 1986.. By Theorem B.18) to show that sin 1( v'X)±z (1 . i (b) What would be the minimum sample size in part (a) if ().6.. (12) and N(1/. Suppose that it is known that 0 < ~.!a)/ 4.') populations. Show that ~(). sn.~a)].
8. Let X₁, …, X_{n₁} and Y₁, …, Y_{n₂} be two independent samples from N(μ, σ²) and N(ν, τ²) populations, respectively. Such two-sample problems arise in comparing the precision of two instruments and in determining the effect of a treatment.
(a) If all parameters are unknown, find ML estimates of μ, ν, σ², τ². Show that the quadruples (X̄, Ȳ, σ̂², τ̂²) and (ΣXᵢ, ΣYⱼ, ΣXᵢ², ΣYⱼ²) are each sufficient.
(b) Exhibit a level (1 − α) confidence interval for τ²/σ² using a pivot based on the statistics of part (a). Indicate what tables you would need to calculate the interval.
(c) If σ² and τ² are known, exhibit a fixed length level (1 − α) confidence interval for ν − μ.

9. Show that, in order to have a level (1 − α) confidence interval for μ of length at most 2d when σ² is known, it is necessary to take at least z²(1 − ½α)σ²/d² observations. Hint: Set up an inequality for the length and solve for n.

10. Let S ∼ B(n, θ) and X̄ = S/n.
(a) Use (A.14.18) to show that sin⁻¹(√X̄) ± z(1 − ½α)/2√n is an approximate level (1 − α) confidence interval for sin⁻¹(√θ).
(b) If n = 100 and X̄ = 0.1, use the result in part (a) to compute an approximate level 0.95 confidence interval for θ.

11. Let S ∼ B(n, θ) and X̄ = S/n.
(a) Show that X̄ ± z(1 − ½α)/2√n is an approximate level (1 − α) confidence interval for θ. Hint: θ(1 − θ) ≤ ¼.
(b) What would be the minimum sample size in part (a) needed to guarantee that the interval has length at most 2d?

12. Suppose that σ² is not known exactly, but we are sure that σ² ≤ σ₁². Show that the interval X̄ ± z(1 − ½α)σ₁/√n has confidence level at least (1 − α), and exhibit the fixed sample size needed for the interval to have length at most 2d.
13. Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed. If S ∼ B(64, θ), give a 95% confidence interval for the true proportion of cures θ using (a) (4.4.3) and (b) (4.4.7). Compare them to the approximate interval given in part (a).

14. Show that the confidence coefficient of the rectangle of Example 4.4.5 is (1 − ½α)².

15. (a) Show that the critical values x(α₁) and x(1 − α₂) of the χ²ₙ₋₁ distribution can be approximated by

x(α₁) ≈ (n − 1) + √(2(n − 1)) z(α₁), x(1 − α₂) ≈ (n − 1) + √(2(n − 1)) z(1 − α₂).

Hint: (n − 1)s²/σ² can be written as a sum of squares of n − 1 independent N(0, 1) random variables; now use the central limit theorem.
(b) Suppose that Xᵢ does not necessarily have a normal distribution, but assume that μ₄ = E(Xᵢ − μ)⁴ < ∞ and that κ = Var[(Xᵢ − μ)²]/σ⁴ is finite; μ₄/σ⁴ − 3 is known as the kurtosis coefficient and, in the case where Xᵢ is normal, it equals 0. Find the limit of the distribution of √n(σ̂² − σ²), and use it to adjust the confidence interval for σ² based on the normal-theory pivot. Hint: Use Slutsky's theorem, the law of large numbers, and the central limit theorem as given in Appendix A.
(c) Suppose Xᵢ has a χ²ₖ distribution. Compute the (κ known) confidence intervals of part (b) when k = 1, 100, and 10 000.

16. Suppose that 25 measurements on the breaking strength of a certain alloy yield X̄ = 11.1 and s = 3.4. Assuming that the sample is from a N(μ, σ²) population, find
(a) A level 0.9 confidence interval for μ.
(b) A level 0.9 confidence interval for σ.
(c) A level 0.9 confidence interval for μ + σ.
(d) A level 0.9 confidence region for (μ, σ²).

17. Suppose X₁, …, Xₙ are i.i.d. as X, where X has density f(t) = F′(t). Assume that f(t) > 0 if and only if t ∈ (a, b) for some −∞ < a < 0 < b < ∞.
(a) Show that μ = E(X) satisfies

μ = ∫₀^b [1 − F(x)] dx − ∫ₐ⁰ F(x) dx.

(b) Using the simultaneous confidence band for F of Example 4.4.6, find a level (1 − α) confidence interval for μ.
= 1 versns K : Il. test given in part (a) has power 0. lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 .~a) I Xl is a confidence interval for Il.. (e) Similarly derive the level (1 .a) LCB corresponding to Oc of parr (a).3.0. .. Let Xl. respectively..5. for H : (12 > (15. Let Xl. = 0.al) and (1 . 1/ n J is the shortest such confidence interval.Section 4.a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 .5. .. Hint: Use the results of Problems B. of l.~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant.4. Yf (1 . X2) = 1 if and only if Xl + Xi > c.2 that the tests of H : . (d) Show that [Mn .2). [8(X) < 0] :J [8(X) < 00 ].) denotes the o.a) UCB for 0. and consider the prob2 + 8~ > 0 when (}"2 is known. Or . N( O . show that [Y f( ~a) I X.th quantile of the . respectively.13 show that the power (3(0 1 . Hint: If 0> 00 . = 1 rejected at level a 0.) samples. . Yn2 be independent exponential E( 8) and £(.1 O? 2.2n2 distribution.1 has size a for H : 0 (}"2 be < 00 . 3. What value of c givessizea? (b) Using Problems B. (15 = 1. Show that if 8(X) is a level (1 .5 1. and let Il.. .12 and B.. Give a 90% confidence interval for the ratio ~ of mean life times.90? 4.· (a) If 1(0. if and only if 8(X) > 00 . (b) Derive the level (1. ~ 01).Xn1 and YI .3.. (a) Find c such that Oc of Problem 4.2 are level 0.. M n /0. is of level a for testing H : 0 > 00 .(2) together.1. Experience has shown that the exponential assumption is warranted.4 and 8. "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function.05. .2).a. then the test that accepts. (e) Suppose that n = 16. 5. x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il.. X 2 be independent N( 01.10 Problems and Complements 281 Problems for Section 4. with confidence coefficient 1 . a (b) Show that the test with acceptance region [f(~a) for testing H : Il.1"2nl.2 = VCBs of Example 4. O ) is an increasing 2 function of + B~. How small must an alternative before the size 0.3. (a) Deduce from Problem 4.3.\. < XI? < f(1. .
2. Let X₁, …, X_{n₁} and Y₁, …, Y_{n₂} be independent exponential E(θ) and E(λ) samples, respectively, and let Δ = λ/θ.
(a) Derive the level (1 − α) UCB for Δ and exhibit the confidence intervals obtained by putting two such bounds of level (1 − α₁) and (1 − α₂) together.
(b) Show that the test with acceptance region [f(½α) ≤ (ΣYⱼ/n₂)/(ΣXᵢ/n₁) ≤ f(1 − ½α)] has size α for H : Δ = 1, where f(p) denotes the pth quantile of the F_{2n₂,2n₁} distribution. Hint: Use the results of Problems B.3.4 and B.3.5.
(c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant. Experience has shown that the exponential assumption is warranted. Give a 90% confidence interval for the ratio Δ of mean life times. Is H : Δ = 1 rejected at level α = 0.05?

x: 31 50 40 34 32 37 34 23 16 51 41 50 27 46 27 10 30 37
y: 8 26 10 8 29 20 10

3. Let X₁, X₂ be independent N(θ₁, σ₀²), N(θ₂, σ₀²), where σ₀² is known, and consider the problem of testing H : θ₁ = θ₂ = 0 versus K : θ₁² + θ₂² > 0.
(a) Let δ_c(X₁, X₂) = 1 if and only if X₁² + X₂² ≥ c. What value of c gives size α?
(b) Using Problems B.3.12 and B.3.13, show that the power β(θ₁, θ₂) is an increasing function of θ₁² + θ₂².
(c) Modify the test of part (a) to obtain a procedure that is level α for H : θ₁ = θ₁⁰, θ₂ = θ₂⁰, and exhibit the corresponding family of confidence circles for (θ₁, θ₂).

4. (a) Find c such that δ_c of Problem 3 has size α when σ₀² = 1.
(b) Give explicitly the power function of the test of part (a) in terms of the χ²₂ distribution function.

5. Show that if θ̄(X) is a level (1 − α) UCB for θ, then the test that accepts, if and only if, θ̄(X) ≥ θ₀, is of level α for testing H : θ ≥ θ₀. Hint: If θ > θ₀, then [θ̄(X) < θ] ⊂ [θ̄(X) < θ₀].
6. Let X₁, …, Xₙ be a sample from a population with density f(t − θ), where θ and f are unknown, and f is continuous and positive with f(t) = f(−t) for all t. Thus, we have a location parameter family.
(a) Show that testing H : θ ≤ 0 versus K : θ > 0 is equivalent to testing

H′ : P[X₁ > 0] ≤ ½ versus K′ : P[X₁ > 0] > ½.

(b) The sign test of H versus K is given by

δₖ = 1 if Σᵢ 1[Xᵢ > 0] ≥ k, 0 otherwise.

Determine the smallest value k = k(α) such that δ_{k(α)} is of level α for H, and show that for n large, k(α) ≈ ½n + ½ z(1 − α)√n.
(c) Show that δ_{k(α)}(X₁ − θ₀, …, Xₙ − θ₀) is a level α test of H : θ ≤ θ₀ versus K : θ > θ₀.
(d) Deduce that X₍ₙ₋ₖ₍α₎₊₁₎ (where X₍ⱼ₎ is the jth order statistic of the sample) is a level (1 − α) LCB for θ whatever be f satisfying our conditions.
(e) Show directly that P_θ[X₍ⱼ₎ ≤ θ] and P_θ[X₍ⱼ₎ ≤ θ ≤ X₍ₖ₎] do not depend on f.
(f) Suppose that α = 2⁻⁽ⁿ⁻¹⁾ Σⱼ₌ₖⁿ (n choose j). Show that P[X₍ₖ₎ ≤ θ ≤ X₍ₙ₋ₖ₊₁₎] = 1 − α.
(g) Suppose that we drop the assumption that f(t) = f(−t) for all t and replace θ by the median ν of F. Show that the conclusions of (a)-(f) still hold.

7. Suppose θ = (η, τ), where η is a parameter of interest and τ is a nuisance parameter. We are given, for each possible value η₀ of η, a level α test δ(X, η₀) of the composite hypothesis H : η = η₀. Let C(X) = {η : δ(X, η) = 0}.
(a) Show that C(X) is a level (1 − α) confidence region for the parameter η, and conversely that any level (1 − α) confidence region for η is equivalent to a family of level α tests of these composite hypotheses.
8. Let η denote a parameter of interest, let τ denote a nuisance parameter, and let θ = (η, τ). A level (1 − α) confidence interval [η(x), η̄(x)] for η is said to be unbiased if

P_θ[η(X) ≤ η′ ≤ η̄(X)] ≤ 1 − α for all η′ ≠ η.

That is, the interval is unbiased if it has larger probability of covering the true value η than the wrong value η′. Show that the Student t interval (4.5.1) is unbiased. Hint: You may use the result of Problem 4.5.7.

9. Let α(S, θ₀) denote the p-value of the test of H : θ = θ₀ versus K : θ > θ₀ in Example 4.5.2, and let [θ(S), θ̄(S)] be the exact level (1 − 2α) confidence interval for θ of Example 4.5.2. Show that as θ₀ ranges from θ(S) to θ̄(S), α(S, θ₀) ranges from α to a value no smaller than 1 − α. Thus, if θ₀ < θ(S) (S is inconsistent with H : θ = θ₀), the quantity Δ = θ(S) − θ₀ indicates how far we have to go from θ₀ before the value θ₀ is not at all surprising under H.

10. Suppose X, Y are independent and X ∼ N(ν, 1), Y ∼ N(η, 1). Define ρ = ν/η and let

δ(X, Y, ρ) = 0 if |X − ρY| < (1 + ρ²)^{1/2} z(1 − ½α), 1 otherwise.

(a) Show that δ(X, Y, ρ₀) is a size α test of H : ρ = ρ₀.
(b) Describe the confidence region obtained by inverting the family {δ(X, Y, ρ)}. Note that the region is not necessarily an interval or ray. This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line.

11. Let X ∼ N(θ, 1) and q(θ) = θ².
(a) Show that the lower confidence bound for q(θ) obtained from the image under q of the ray [X − z(1 − α), ∞) is

q(X, 1 − α) = 0 if X < z(1 − α), (X − z(1 − α))² if X ≥ z(1 − α).

(b) Show that P_θ[q(X) ≤ θ²] ≥ 1 − α for all θ, and equals 1 − α if θ > 0; for θ ≤ 0 the coverage probability is Φ(z(1 − α) − 2θ), hence larger than 1 − α.

12. Establish (iii) and (iv) of Example 4.5.2.
13. Confidence regions for quantiles. Let X₁, …, Xₙ be a sample from a population with continuous distribution F. Let x_p = ½[F⁻¹(p) + F_U⁻¹(p)] be the pth quantile of F, 0 < p < 1. (For example, 100x_{0.95} could be the 95th percentile of the salaries in a certain profession, or 100x_{0.05} could be the fifth percentile of the duration time for a certain disease.) Suppose that p is specified.
(a) Show that testing H : x_p ≤ 0 versus K : x_p > 0 is equivalent to testing H′ : P(X > 0) ≤ (1 − p) versus K′ : P(X > 0) > (1 − p).
(b) The quantile sign test δₖ of H versus K has critical region {x : Σᵢ 1[Xᵢ > 0] ≥ k}. Determine the smallest value k = k(α) such that δ_{k(α)} has level α for H, and show that for n large,

k(α) ≈ n(1 − p) + z_{1−α} √(np(1 − p)).

(c) Let x* be a specified number with 0 < F(x*) < 1. Show that δ_{k(α)}(X₁ − x*, …, Xₙ − x*) is a level α test for testing H : x_p ≤ x* versus K : x_p > x*.
(d) Deduce that X₍ₙ₋ₖ₍α₎₊₁₎ (X₍ⱼ₎ is the jth order statistic of the sample) is a level (1 − α) LCB for x_p whatever be f satisfying our conditions.
(e) Let S denote a B(n, p) variable and choose k and l such that

1 − α = P(k ≤ S ≤ n − l + 1) = Σⱼ (n choose j) pʲ(1 − p)ⁿ⁻ʲ.

Show that (X₍ₖ₎, X₍ₙ₋ₗ₎) is a level (1 − α) confidence interval for x_p.
(f) Show that k and l in part (e) can be approximated by h(½α) and h(1 − ½α), where h(α) is given in part (b).
(g) Let F̂(x) denote the empirical distribution. Show that the interval in parts (e) and (f) can be derived from the pivot

T(x_p) = √n [F̂(x_p) − F(x_p)] / √(F(x_p)[1 − F(x_p)]).

Hint: Note that F(x_p) = p.

14. Simultaneous confidence regions for quantiles. In the preceding Problem 13 we gave a distribution-free confidence interval for the pth quantile x_p for p fixed. Suppose we want a distribution-free confidence region for x_p valid for all 0 < p < 1; that is, the probability is (1 − α) that

x_p ≤ x_p ≤ x̄_p for all p ∈ (0, 1),

where x_p = inf{x : a ≤ x ≤ b, F̂(x) ≥ p} and x̄_p = sup{x : a ≤ x ≤ b, F̂₋(x) ≤ p}, and F̂₋, F̂, and F̂⁺ are as in Examples 4.4.6 and 4.4.7. We can proceed as follows: construct the simultaneous band for F and invert it.
(a) Express x_p and x̄_p in terms of the critical value of the Kolmogorov statistic and the order statistics.
(b) Express the band in terms of critical values for Aₙ(F) of Problem 4.4.18 and the order statistics; show how the statistic Aₙ(F) of Problem 4.4.18 can be used to give another distribution-free simultaneous confidence band for x_p.

15. Suppose X denotes the difference between responses after a subject has been given treatments A and B, where A is a placebo. We will write F_X for F when we need to distinguish it from the distribution F₋X of −X. The hypothesis that A and B are equally effective can be expressed as H : F₋X(t) = F_X(t) for all t ∈ R; the alternative is that F₋X(t) ≠ F_X(t) for some t ∈ R. Let F̂_X and F̂₋X be the empirical distributions based on the i.i.d. X₁, …, Xₙ and −X₁, …, −Xₙ.
(a) Consider the test statistic D(F̂_X, F̂₋X). Show that if F_X is continuous and H holds, then D(F̂_X, F̂₋X) has the same distribution as D(F̂_U, F̂₁₋U), where F̂_U and F̂₁₋U are the empirical distributions of U = F(X) and 1 − U, with U ∼ U(0, 1). Hint:

nF̂_X(x) = Σᵢ 1[F_X(Xᵢ) ≤ F_X(x)] = nF̂_U(F_X(x)), nF̂₋X(x) = Σᵢ 1[−Xᵢ ≤ x] = nF̂₁₋U(F_X(x)).

(b) Suppose we measure the difference between the effects of A and B by Δ = the difference between the quantiles of X and −X, v_F(p) = ½[x_p + x_{1−p}], 0 < p < 1. Give a distribution-free level (1 − α) simultaneous confidence band for the curve {v_F(p) : 0 < p < 1}.
(c) A distribution- and parameter-free confidence interval. Let θ(·) : F → R, where F is the class of distribution functions with finite support, be a location parameter as defined in Problem 3.5.17. Show that for given F ∈ F, the probability is (1 − α) that the interval [v_F⁻, v_F⁺] contains the location set L_F = {θ(F) : θ(·) is a location parameter}, where v_F⁻ = inf_{0<p<1} v_F(p) and v_F⁺ = sup_{0<p<1} v_F(p). Hint: Define H by H⁻¹(p) = ½[F⁻¹(p) − F⁻¹(1 − p)]. Then H is symmetric about zero, and x = H⁻¹(F(x)) + v_F(F(x)), so that x − 2Δ(x) = H⁻¹(F(x)) − v_F(F(x)). It follows that X is stochastically between X_s − v_F⁻ and X_s + v_F⁺, where X_s = H⁻¹(F(X)) has the symmetric distribution H. The result now follows from the properties of θ(·). Properties of this and other bands are given by Doksum, Fenstad and Aaberge (1977).
16. As in Example 1.1.3, let X₁, …, Xₙ be i.i.d. treatment A (placebo) responses and let Y₁, …, Yₙ be i.i.d. treatment B responses. We assume that the X's and Y's are independent and that they have respective continuous distributions F_X and F_Y. To test the hypothesis H that the two treatments are equally effective, we test H : F_X(t) = F_Y(t) for all t versus K : F_X(t) ≠ F_Y(t) for some t ∈ R. Let F̂_X and F̂_Y denote the X and Y empirical distributions and consider the test statistic

D(F̂_X, F̂_Y) = max_t |F̂_Y(t) − F̂_X(t)|.

(a) Show that if H holds, then D(F̂_X, F̂_Y) has the same distribution as D(F̂_U, F̂_V), where F̂_U and F̂_V are independent U(0, 1) empirical distributions. Hint: nF̂_X(t) = Σᵢ 1[F_X(Xᵢ) ≤ F_X(t)] = nF̂_U(F_X(t)) and nF̂_Y(t) = Σᵢ 1[F_Y(Yᵢ) ≤ F_X(t)] = nF̂_V(F_X(t)) under H.
(b) Consider the parameter δ_p(F_X, F_Y) = y_p − x_p, 0 < p < 1, where x_p and y_p are the pth quantiles of F_X and F_Y. Let Δ(x) = F_Y⁻¹(F_X(x)) − x, so that δ_p(F_X, F_Y) = Δ(x_p). Let d_α denote a size α critical value for D(F̂_U, F̂_V). Note that if Δ(x) is the true shift, then

nF̂_Y(x + Δ(x)) = Σᵢ 1[Yᵢ ≤ F_Y⁻¹(F_X(x))] = Σᵢ 1[F_Y(Yᵢ) ≤ F_X(x)] = nF̂_V(F_X(x)),

so that D(F̂_X, F̂_{Y−Δ}) has the same distribution as D(F̂_U, F̂_V). By solving D(F̂_X, F̂_{Y−Δ}) ≤ d_α for Δ, we get a distribution-free level (1 − α) simultaneous confidence band {[δ_p⁻, δ_p⁺] : 0 < p < 1} for the curve {δ_p(F_X, F_Y) : 0 < p < 1}. Properties of this and other bands are given by Doksum and Sievers (1976).
(c) A parameter θ = δ(F_X, F_Y), where δ : F × F → R and F is the class of distributions with finite support, is called a shift parameter if

δ(F_X, F_{X+a}) = a and Y₁ ≥ Y implies δ(F_X, F_{Y₁}) ≥ δ(F_X, F_Y).

Show that if δ(·, ·) is a shift parameter, then δ⁻ ≤ δ(F_X, F_Y) ≤ δ⁺, where δ⁻ = min_{0<p<1} δ_p(F_X, F_Y) and δ⁺ = max_{0<p<1} δ_p(F_X, F_Y). Hint: Set Y′ = X + δ; then δ(F_X, F_{Y′}) = δ and, moreover, X + δ⁻ ≤ Y′ ≤ X + δ⁺. Now apply the axioms.
(d) Show that E(Y) − E(X) and the difference of medians are shift parameters.
(e) A distribution- and parameter-free confidence interval. Let δ⁻ = min_{0<p<1} δ_p⁻ and δ⁺ = max_{0<p<1} δ_p⁺. Show that for given (F_X, F_Y) ∈ F × F, the probability is (1 − α) that the interval [δ⁻, δ⁺] contains the shift parameter set {δ(F_X, F_Y) : δ(·, ·) is a shift parameter} of the values of all shift parameters at (F_X, F_Y).

Problems for Section 4.6

1. Exhibit the UMA level (1 − α) UCB for θ in the model of Problem 4.5.6.

2. Consider the model of Problem 4.3.1 with c known. Show that θ* = 2 Σᵢ Xᵢ^c / x₂ₙ(α) is a uniformly most accurate level (1 − α) UCB for 1/λ = μ. Hint: If μ = 1/λ, then 2λ Σᵢ Xᵢ^c ∼ χ²₂ₙ.
(b) Consider the unbiased estimate of θ, T = 2 Σᵢ Xᵢ^c / 2n.
(c) Show that the statement that θ* is more accurate than another bound θ̄ is equivalent to the assertion that S = (2 Σᵢ Xᵢ^c)/x₂ₙ(α) has uniformly smaller expected overshoot than T.

3. Show that

{(Σᵢ tᵢ Xᵢ − z(1 − α)σ √(Σᵢ tᵢ²)) / Σᵢ tᵢ}

is a uniformly most accurate lower confidence bound for θ in the model Xᵢ = θtᵢ + εᵢ with the εᵢ i.i.d. N(0, σ²), σ known.

4. Suppose X₁, …, Xₙ is a Γ(p, λ) sample, where p is known. Show that λ* = x₂ₙₚ(α)/(2 Σᵢ Xᵢ)? no; rather, that 2λ Σᵢ Xᵢ ∼ χ²₂ₙₚ, and use this pivot to exhibit a uniformly most accurate level (1 − α) confidence bound for λ.

5. Prove Corollary 4.6.1.

6. Suppose θ* is a uniformly most accurate level (1 − α) LCB. Show that if E_θ(θ* − θ)₊ and E_θ(θ′ − θ)₊ are finite for a competing level (1 − α) LCB θ′, then

E_θ(θ* − θ)₊ ≤ E_θ(θ′ − θ)₊.

Hint: Apply Problem 7 below to U = (θ* − θ)₊ and V = (θ′ − θ)₊.

7. Establish the following result due to Pratt (1961). Let U, V be random variables with distribution functions F, G corresponding to densities f, g, respectively, satisfying the conditions of Problem B.2.12, so that F⁻¹ and G⁻¹ are well defined and strictly increasing. Show that if F(x) ≤ G(x) for all x and E(U), E(V) are finite, then E(U) ≥ E(V). Hint: E(U) = ∫₀^∞ [1 − F(x)] dx − ∫₋∞⁰ F(x) dx.

8. Suppose [θ*, θ̄*] and [θ, θ̄] are two level (1 − α) confidence intervals such that

P_θ[θ* ≤ θ′ ≤ θ̄*] ≤ P_θ[θ ≤ θ′ ≤ θ̄] for all θ′ ≠ θ.

Show that E_θ(θ̄* − θ*) ≤ E_θ(θ̄ − θ). Hint: E_θ(θ̄ − θ) = ∫ P_θ[θ ≤ u ≤ θ̄] du.

Problems for Section 4.7

1. Suppose that given θ = θ, X has a binomial, B(n, θ), distribution and that θ has a beta, β(r, s), distribution with r and s positive integers.
(a) Show that if θ has a β(r, s) distribution, then λ = sθ/(r(1 − θ)) has the F distribution F₂ᵣ,₂ₛ. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for λ, and hence for θ. Hint: See Sections B.2 and B.3.
(b) Apply the result of (a) to the posterior distribution of θ given X = x.

2. Suppose that given λ = λ, X₁, …, Xₙ are i.i.d. Poisson, P(λ), and that λ is distributed as V/s₀, where s₀ is some constant and V ∼ χ²ₐ with a > 0. Let T = Σᵢ Xᵢ.
(a) Show that (λ | T = t) is distributed as W/(s₀ + 2n), where W ∼ χ²ₛ with s = a + 2t.
(b) Show how quantiles of the χ² distribution can be used to determine level (1 − α) upper and lower credible bounds for λ.

3. Suppose that given θ = θ, X₁, …, Xₙ are i.i.d. uniform, U(0, θ), and that θ has the Pareto, Pa(c, s), density

π(t) = s cˢ t^{−(s+1)}, t ≥ c, c > 0, s > 0.
Section 4. y) of is (Student) t with m + n . r( TnI (c) Set s2 + n. y) is obtained by integrating out Tin 7f(Ll. . Problems for Section 4. In particular consider the credible bounds as n + 00.10 Problems and Complements 289 (a)LetM = Sf max{Xj.x) is aN(x.Xn + 1 be i.y. (d) Compare the level (1. . ". (a) Give a level (1 . .rlm) density and 7f(/12 1r.. Yn are two independent N (111 \ 'T) and N(112 1 'T) samples. /11 and /12 are independent in the posterior distribution p(O X. = PI ~ /12 and'T is 7f(Ll.r). T) = (111. Let Xl.Xn is observable and X n + 1 is to be predicted.Xn ). Show that (8 = S IM ~ Tn) ~ Pa(c'.1. 'T). Here Xl. respectively.. .a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B.r 1x. and Y1. .. Hint: p( 0 I x. y) and that the joint density of. (b) Find level (1 . r > O.d.x)1f(/1. Show thatthe posterior distribution 7f( t 1 x.y) is 1f(r 180)1f(/11 1 r. (b) Show that given r.l.1 )) distribution.. 1 r..Xm.0:) confidence interval for B. Hint: 7f(Ll 1x.0"5).y) = 7f(r I sO)7f(LlI x . (c) Give a level (1 .y. 4. (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll. (a) Let So = E(x.. as X.s') withe' = max{c. Suppose that given 8 = (ttl' J. Xl. y) is aN(y. Suppose 8 has the improper priof?r(O) = l/r.y. N(J.y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2. Show fonnaUy that the posteriof?r(O 1 x.. y). 'P) is the N(x .m} and + n.2 degrees of freedom.112..1.. r In) density. r 1x. = So I (m + n  2). y) is proportional to p(8)p(x 1/11.x)' + E(Yj y)'. .8 1.a) upper and lower credible bounds for O. r) where 7f(Ll I x . ... .i. .. where (75 is known.2... .0:) prediction interval for Xnt 1. 1f(/11 1r.r)p(y 1/12..
Xn + 1 be i.i.. and a = .a) lower and upper prediction bounds for X n + 1 .9. Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X .10. CJ2) with (J2 unknown. suppose Xl.12.8. random variable. b).'" . X n such that P(Y < Y) > 1. distribution. B( n.2) hy using the observation that Un + l is equally likely to be any of the values U(I)... Let X have a binomial. I x) is sometimes called the Polya q(y I x) = J p(y I 0). Let Xl. (b) If F is N(p" bounds for X n + 1 .. has a B(m. .·) denotes the beta function. let U(l) < ...5. F. ~o = 10.5. . .2. )B(r+x+y. and that (J has a beta.a). s). where Xl. . give level 3. (This q(y distribution.) Hint: First show that . 4. Then the level of the frequentist interval is 95%. 1 2.i. X n+ l are i. That is. Give a level (1 .a (P(Y < Y) > I .8. :[ = . Suppose XI.15.i. 0 > 0.'.d.x .9. Suppose that given (J = 0.. 5. Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 .Q) distribution free lower and upper prediction bounds for X n + 1 • < 00.05. (a) If F is N(Jl.' .0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl. 0'5)' Take 0'5 ~ 7 2 = I. X is a binomial. 00 < a < b (1 .11. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I . 0) distribution given (J = 8. .. i! Problems for Section 4. distribution. In Example 4..3) by doing a frequentist computation of the probability of coverage.d. Establish (4.8).Xn are observable and X n + 1 is to be predicted.• .s+n..d. . n = 100.(0 I x)dO.. x > 0. . • .8. . 0). . < u(n+l) denote Ul .. . which is not observable.. ! " • Suppose Xl.e. . 1 X n are i.8.....2n distribution. B(n.Xn are observable and we want to predict Xn+l. Find the probability that the Bayesian interval covers the true mean j.9 1. (3(r.. N(Jl.x ) where B(·.a) prediction interval for X n + l . . as X . give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a.nl. " u(n+l). Un + 1 ordered.t for M = 5. A level (1 .. Present the results in a table and a graph. give level (I . q(ylx)= ( . as X where X has the exponential distribution F(x I 0) =I .290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4.10. Suppose that Y.0'5) with 0'5 known.s+nx+my)/B(r+x.
Hint: log . 0' 4..(x) = 0. if ii' /175 < 1 and = .='7"nlog c l n /C2n CIn . Thus. (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12.f. In testing H : It < 110 versus K . Xn_1 (~a). Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise.. (i) F(c. (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B.. onesample t test is the likelihood ratio test (fo~ Q <S ~).~Q) n + V2nz(l. and only if.Q. = Xnl (1 . of the X~I distribution.Section 4. n Cj I L..~Q) approximately satisfy (i) and also (ii) in the sense that the ratio .(x) Hint: Show that for x < !n.F(CI) ~ 1.Xn be a N(/L. and only if. ° 3. = aD nO' 1"... ~2 . where F is the d.C2n !' 1 as n !' 00. (ii) CI ~ C2 = n logcl/C2. I .( x) ~ 0. OneSided Tests for Scale.X) aD· t=l n > C. log .3. We want to test H . 2. let X 1l . where Tn is the t statistic.n) and >.I)) for Tn > 0.. TwoSided Tests for Scale. (b) Use the nonnal approximatioh to check that CIn C'n n . JL > Po show that the onesided. 2 2 L. In Problems 24.) .(Xi .10 Problems and Complements 291 is an increasing function of (2x . . Show that (a) Likelihood ratio tests are of the form: Reject if. We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if. · ...(x) = . >. (c) Deduce that the critical values of the commonly used equaltailed test. 0'2 < O'~ versus K : 0'2 > O'~.Q).V2nz(1 .3. 0'2) sample with both JL and 0'2 unknown. if Tn < and = (n/2) log(1 + T.(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy.log (ii' / (5)1 otherwise./(n . XnI (1 .~Q) also approximately satisfy (i) and (ii) of part (a).(nx).
Suppose X has density p(x. The control measurements (x's) correspond to 0 pounds of mulch per acre. n..P. (b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use ct = 0. 1 I 794 2012 1800 2477 576 3498 411 2092 897 1808 I Assume the twosample normal model with equal variances.9. 8. find the sample size n needed for the level 0. (X.05. eo. i = 1. €n are independent N(O. .95 confidence interval for p.L2 versus K : j. Assume the onesample normal model. . . 9. 0 E e. x y vi I I .lIl (7 2) and N(Jl. (a) Find a level 0. 7. The following blood pressures were obtained in a sample of size n = 5 from a certain population: 124. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall.. (b) Consider the problem aftesting H : Jl.L.292 Testing and Confidence Regions Chapter 4 5. 6. respectively. a 2 ) random variables.1 < J.z .tl > {t2... . Show that A(X.3. Let Xl. 190. .95 when nl = nz = ~n and (1'1 1'2)/17 ~ ~. Y. 110.l. .1'2. whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre. Forage production is also measured in pounds per acre. 0'2 ~ ! corresponding to inversion of the (c) Compute a level 0. ..O)... I Yn2 be two independent N(J. .9.Xn are said to be serially correlated or to follow an autoregressive model if we can write ~ . and that T is sufficient for O. 0'2). . (c) Using the normal approximation <l>(z(a)+ nl nz/n(1'1 I'z) /(7) to the power.05 onesample t test. (a) Using the size 0: = 0.2' (12) samples. . can we conclude that the mean blood pressure in the population is significantly larger than IDO? (b) Compute a level 0. 100.01 test to have power 0. .Xn1 and YI . • Xi = (JXi where X o I 1 + €i. = 0 and €l.. 114.90 confidence interval for a by using the pivot 8 2 ja z . Assume a Show that the likelihood ratio statistic is equivalent to the twosample t statistic T.. I'· (c) Find a level 0. The nonnally distributed random variables Xl.. e 1 ) depends on X only throngh T.4.95 confidence interval for equaltailed tests of Problem 4.. (7 2) is Section 4. (a) Show that the MLE of 0 ~ (1'1. where a' is as defined in < ~..90 confidence interval for the mean bloC<! pressure J.
Y2 )dY2. has density !k . The F Test for Equality of Scale. . Then. = 0.2) In exp{ (I/Z(72) 2)Xi . 13. 12.. II.IIZI > tjvlk] is increasing in [61. Show that the noncentral t distribution. and only if..O) for 00 = (Z. 0) by the following table. Hint: Let Z and V be independent and have N(o. Define the frequency functions p(x.. From the joint distribution of Z and V. Condition on V and apply the double expectation theorem. 1 {= xj(kl)ellx+(tVx/k'I'ldx. . respectively. i = 1.[Z > tVV/k1 is increasing in 6. (a) P.2.(t) = . . (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl.. for each v > 0. Fix 0 < a < and <>/[2(1 . X powerful whatever be B. X n1 .. Let e consist of the point I and the interval [0. Then use py.[T > tl is an increasing function of 6.. with all parameters assumed unknown.. 7k.. Yn2 be two independent N(P. . Yl.and twosided t tests. Xo = o. P.. has level a and is strictly more 11. . J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint. distribution.6. Stein). XZ distributions respectively. (b) P. . (Yl.<»1 < e < <>. The power functions of one.'" p(X.. (Tn. I). . 7k.Section 4. Let Xl. Show that.OXi_d 2} i= 1 < Xi < 00.10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl. Suppose that T has a noncentral t. get the joint distribution of Yl = ZI vVlk and Y2 = V.6. samples from N(J1l. ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~. P.n.<» (1 .O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if.IITI > tl is an increasing function of 161. (Yl) = f PY"Y. (An example due to C. / L:: i 10. (T~). Consider the following model.
12 337 1.7 to conclude that tribution as 8 12 /81 8 2 . I ' LX. (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where . using (4.9. • . 0.V. Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model. 14.1)]E(Y. T has a noncentral Tnz distribution with noncentrality parameter p. F ~ [(nl . can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I .1)/(n.0 has the same distribution as R. where l _ . . si = L v.62 284 2. i 0 in xi !:. V.61 310 2. show that given Uz = Uz . .J J . (Xn .4.. Un). .71 240 2.. . and only if. (d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4.'. 1 . . <. .64 298 2.68 250 2. . > C. (11. the continuous versiop of (B.1I(a).4. p) distribution. Vn ) is a sample from a N(O. The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study. !I " . Let (XI. (Un.. as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2. and using the arguments of Problems B. distribution and that critical values can be . . Finally.. . Un = Un.1' Sf = L U. . Argue as in Proplem 4. r> (c) Justify the twosided F test: Reject H if. (a) Show that the LR test of H : af = a~ versus K : ai > and only if.a/2) or F < f( n/2). . vn 1 l i 16.9.' ! .. ! . Yn) be a sampl~ from a bivariateN(O. i=2 i=2 n n S12 =L i=2 n U.96 279 2.~ I .19 315 2.94 Using the likelihood ratio test for the bivariate nonnal model. • I " and (U" V. . . 15. L Yj'.0 has the same dis r j .37 384 2.~ I i " 294 Testing and Confidence Regions Chapter 4 .4.5) that 2 log '(X) V has a distribution.9.nt 1 distribution.Y)'/E(X. Because this conditional distribution does not depend on (U21 •.. . p) distribution. where f( t) is the tth quantile of the F nz . (1~. 1.8. Consider the problem of testing H : p = 0 versus K : p i O. .. then P[p > c] is an increasing function of p for fixed c. a?. p) distribution.1. 0. Sbow.X)' (b) Show that (aUa~)F has an F nz obtained from the :F table.'.).nl1 ar is of the fonn: Reject if. YI ). and use Probl~1ll4. 1. Hint: Use the transfonnations and Problem B.4. i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2. . · . .7 and B. Let R = S12/SIS" T = 2R/V1 R'. F > /(1 .4.24) implies that this is also the unconditional distribution.8. note that .4) and (4. . where a.1 .10. x y 254 2.. that T is an increasing function of R. . .
G. S in 8(X) would be replaced by 5 + and 5 in 8(X) is replaced by 5 . if the parameter space is compact and loss functions are bounded.9.12 1965.[ii .851.3. (2) We ignore at this time some reallife inadequacies of this experiment such as the placebo effect (see Example 1. it also has confidence level (1 . HAMMEL. !. P.398404 (975). Box.. respectively.2.01 + 0.. see also Ferguson (1967).Section 4. R. AND J. Consider the cases Ii.1.3) holds for some 8 if <p 't V. Box..).15 is used here. PROSCHAN. 00. and C. T. 4. E. 187.11 NOTES Notes for Section 4.035. 1983.895 and t = 0. Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding. . E.01. The term complete is then reserved for the class where strict inequality in (4. i]. Nntes for Section 4. "Is there a sex bias in graduate admissions?" Science. Essentially. (2) In using 8(5) as a confidence hound we are using the region [8(5).11 Notes 295 17. Data Analysis and Robustness. 4. T6 .I\ E. Apology for Ecumenism in Statistics and Scientific lnjerence. Rejection is more definitive.11. 1974) to the critical value is <. (b) Compare your solution to the Bayesian solution based on a continuous loss function rro = 0. the class of Bayes procedures is complete.9.(t) = tl(. . (2) The theory of complete and essentially complete families is developed in Wald (1950).2. W.05 and 0. 00. Wiley & Sons. AND F.0.3 (1) Such a class is sometimes called essentially complete...1 (1) The point of view usually taken in science is that of Karl Popper [1968]. Because the region contains C(X). F. REFERENCES BARLOW. 1973.. Notes for Section 4. Consider the bioequivalence example in Problem 3. G. (a) Find the level" LR test for testing H : 8 E given in Problem 3.[ii) where t = 1..10. and r.0.. i] versus K : 8 't [i.o = 0. Wu. P.819 for" = 0. BICKEL. O'CONNELL. Mathematical Theory of Reliability New York: J.3). Tl . Stephens. Editors New York: Academic Press. Leonard. (3) A good approximation (Durbin.4 (I) If the continuity correction discussed in Section A. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete.~.
! .. "EDF statistics for goodness of fit. 549567 (1961). New York: Springer. WILKS. WANG.674682 (1959). T. KLETI. "A twosample test for a linear hypothesis whose pOWer is independent of the variance. VAN ZWET.. "Plots and tests for symmetry. 9. Statistical Methods for MetaAnalysis Orlando. TATE. . 1958. 64. Amer. POPPER. AND I. Aspin's tables:' Biometrika. New York: Hafner Publishing Company. A. Wiley & Sons. WALD. "Further notes on Mrs. AND A.. W. Math. 8. 1961. . AND G." J. SIAM. Statistical Theory with Engineering Applications . K.. 1997." J. 54 (2000). 1950. Wiley & Sons. D.. Assoc. 243246 (1949). 2nd ed. AND 1. CAl. AARBERGE. WETHERDJ. OSTERHOFF. F. AND R. 473487 (1977). DAS GUPTA.. AND E." Ann. j 1 . I . New York: I J 1 Harper and Row. 326331 (1999). AND K. FISHER. "On the combination of independent test statistics.. DURBIN." Biometrika. 1947. 66. A." Regional Conference Series in Applied Math.• Statistical Methods for Research Workers. "Interval estimation for a binomial proportion. Assoc. A. The Theory of Probability Oxford: Oxford University Press. DOKSUM. Sequential Analysis New York: Wiley. K.. Statist. STEPHENs. 659680 (1967). j j New York: 1. 16. D. 1. A. FL: Academic Press.. A. PRATI. 3rd ed. H. GLAZEBROOK.. 36. . 13th ed. Amer. Sratisr. Y. Statist. 53. "Optimal confidence intervals for the variance of a normal distribution. . WALD.. 54. 56. Statist.. ." J. R. 1%2. A Decision Theoretic Approach New York: Academic Press. R. 1985. SIEVERS.. W. STEIN. G. 38. 421434 (1976). 605608 (1971). G. HAlO. Mathematical Statistics. R." 1. B. Testing Statistical Hypotheses. JEFFREYS. "Length of confidence intervals.. A." The American Statistician.• Conjectures and Refutations. J. . OLlaN.. Statist. S. Statistical Decision Functions New York: Wiley. "Probabilities of the type I errors of the Welch tests. "PloUing with confidence: Graphical comparisons of two populations. Sequential Methods in Statistics New York: Chapman and Hall. H.. T. 1968. AND G. L. "Distribution theory for tests based on the sample distribution function." The AmencanStatistician. SAMUEL<:AHN.296 Testing and Confidence Regions Chapter 4 BROWN. 1952.. Amer.. I . 730737 (1974). 63.. SACKRoWITZ. 1967. .. c. DOKSUM. K.. Statist. Assoc. M. 1986.." An" Math. the Growth ofScientific Knowledge. j 1 .243258 (1945). S. Mathematical Statistics New York: J. HEDGES. 69.. V.. L. ! I I I FERGUSON. ." Biometrika. L. WELCH. R. E. Pennsylvania (1973). "P values as random variablesExpected P values. FENSTAD. Amer. LEHMANN. Philadelphia.
computation even at a single point may involve highdimensional integrals. v(F) ~ F.i.Xn are i. and most of this chapter. .. k Evaluation here requires only evaluation of F and a onedimensional integration.• ..1. if n ~ 2k + 1.2.2) where.1 degrees of freedom.Chapter 5 ASYMPTOTIC APPROXIMATIONS 5. . our setting for this section EFX1 and use X we can write.F(x)k f(x).. If n is odd. Ifwe want to estimate J1(F} = (5.3) gn(X) =n ( 2k ) F k(x)(1 .3).1. 1 X n from a distribution F.13).1. consider a sample Xl.1. Worse. This distribution may be evaluated by a twodimensional integral using classical functions 297 .d.Xn ) as an estimate of the population median II(F). closed fonn computation of risks in tenns of known functions or simple integrals is the exception rather than the rule. and F has density f we can wrile (5. N(J1. from Problem (B. and calculable for any F and all n by a single onedimensional integration. Worse. To go one step further. In particular. If XI.1. a 2 ) we have seen in Section 4. the qualitative behavior of the risk as a function of n and simple parameters of F is not discernible easily from (5.1)..1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS Despite the many simple examples we have dealt with. the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Even if the risk is computable for a specific P by numerical integration in one dimension.fiiXIS has a noncentral t distribution with parameter 1"1 (J and n .2 that .9.2) and (5.1) This is a highly informative formula. . but a different one for each n (Problem 5.1 (D.1. consider med(X 11 •. (5. consider evaluation of the power function of the onesided t test of Chapter 4. telling us exactly how the MSE behaves as a function of n. However.
10=1 f=l There are two complementary approaches to these difficulties. I j {Tn (X" . _.4) i RB =B LI(F. Rn (F). Thus. always refers to a sequence of statistics I. The first. 1 < j < B from F using a random number generator and an explicit fonn for F. The classical examples are. for instance the sequence of means {Xn }n>l.1. Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or.. just as in numerical integration. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asyrnptotics briefly in Example 5. i . X nj }. Draw B independent "samples" of size n.2) and its qualitative properties are reasonably transparent. • . Monte Carlo is described as follows. we can approximate Rn(F) arbitrarily closely. X n as n + 00. . Xn)}n>l. based on observing n i.I I !' 298 Asymptotic Approximations Chapter 5 (Problem 5.o(X1).)2) } . X n + EF(Xd or p £F( . in this context. In its simplest fonn. We shall see later that the scope of asymptotics is much greater. but for the time being let's stick to this case as we have until now. .. Asymptotics. The other..i. is to use the Monte Carlo method.1.. . Approximately evaluate Rn(F) by _ 1 B i (5. {X 1j l ' •• . . 00.. VarF(X 1 ».d. of distributions of statistics. is to approximate the risk function under study by a qualitatively simpler to understand and easier to compute function. more specifically..Xnj ) . . . . which occupies us for most of this chapter.3.fiit !Xi. . . where X n = ~ :E~ of medians.Xn):~Xi<n_l (~2 (EX. S) and in general this is only representable as an ndimensional integral. fiB ~ R n (F).. .n (Xl. or the sequence Asymptotic statements are always statements about the sequence. It seems impossible to determine explicitly what happens to the power function because the distribution of fiX / S requires the joint distribution of (X.fii(Xn  EF(Xd) + N(O. where A= { ~ . save for the possibility of a very unlikely event.3. j=l • By the law of large numbers as B j. observations Xl. But suppose F is not Gaussian. or it refers to the sequence of their distributions 1 Xi. which we explore further in later chapters.
15.1. PF[IX n _  . 7).. DOD? Similarly. (see A.VarF(Xd. Is n > 100 enough or does it have to be n > 100. the weak law of large numbers tells us that.1.11) . (5.2 .1.1. . say.6) for all £ > O. For instance. (5.1.3). this reads (5.14.01. (5./' is as above and . then P [vn:(X F n .1. As an approximation. the right sup PF x [r. which are available in the classical situations of (5. £ = ..omes 11m'..6) and (5.. For instance. 1') < z] ~ ij>(z) (5.6) gives IX11 < 1. . X n ) we consider are closely related as functions of n so that we expect the limit to approximate Tn(Xl~"· .25 whereas (5.1. Thus.1. the much PFlIx n  Because IXd < 1 implies that ..9.1 Introduction: The Meaning and Uses of Asymptotics 299 In theory these limits say nothing about any particular Tn (Xl.2 is unknown be. and P F • What we in principle prefer are bounds.10) is .5) That is.1. Further qualitative features of these bounds and relations to approximation (5.9) when . the central limit theorem tells us that if EFIXII < 00...l1) states that if EFIXd' < 00.Xn ) or £F(Tn (Xl. Xn is approximately equal to its expectation.01. . for n sufficiently large.6) does not tell us how large n has to be for the chance of the approximation ~ot holding to this degree (the lefthand side of (5. say. .9) As a bound this is typically far too conservative. if EFXf < 00. .1') < x] _ ij>(x) < CEFIXd' v'~ 3 1/' " "n (5. x.. For € = .1.2 hand side of (5.9) is .10) < 1 with . 1'1 > €] < 2exp {~n€'}. if EFIXd < 00. We interpret this as saying that.f. X n )) (in an appropriate sense).. n = 400. by Chebychev's inequality.1. ' .8) Again we are faced with the questions of how good the approximation is for given n. The trouble is that for any specified degree of approximation.l) (5. Similarly.' 1'1 > €] < .1.1. below .1.Section 5.(Xn .' m (5.1.1..6) to fall.1..4.' = 1 possible (Problem 5..1. then (5.8) are given in Problem 5. X n ) but in practice we act as if they do because the T n (Xl.7) where 41 is the standard nonnal d. the celebrated BerryEsseen bound (A. if more delicate Hoeffding bound (B.14.
Although giving us some idea of how much (5. (5. consistency is proved for the estimates of canonical parameters in exponential families.7) says that the behavior of the distribution of X n is for large n governed (approximately) only by j.2 deals with consistency of various estimates including maximum likelihood. . Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo. . Yet. If they are simple. for all F in the n n . Bounds for the goodness of approximations have been available for X n and its distribution to a much greater extent than for nonlinear statistics such as the median. Practically one proceeds as follows: (a) Asymptotic approximations are derived. • model.I '1. for any loss function of the form I(F.1.1. The qualitative implications of results such as are very impor~ tant when we consider comparisons between competing procedures. I . . The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17  j • i. The arguments apply to vectorvalued estimates of Euclidean parameters.di) where '(0) = 0. i .8) differs from the truth. good estimates On of parameters O(F) will behave like  Xn does in relation to Ji.11) suggests.t and 0"2 in a precise way. (5.e(F)]) (1(e. We now turn to specifics.1.' (0)( (1 / y'n)( v'27i') (Problem 5.12) where (T(O. In particular.5) and quite generally that risk increases with (1 and decreases with n. • c F (y'n[Bn . as we have seen. even here they are not a very reliable guide.11) is again much too consctvative generally.1. behaves like . Section 5. It suggests that qualitatively the risk of X n as an estimate of Ji.1. Section 5. The estimates B will be consistent. As we shall see. F) typically is the standard deviation (SD) of J1iOn or an approximation to this SD.F) > N(O 1) .8) is typically much betler than (5.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for oneparameter exponential families.1. (5. asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. which is reasonable.300 Asymptotic Approximations Chapter 5 where C is a universal constant known to be < 33/4.3. and asymptotically normal. As we mentioned. For instance. Consistency will be pursued in Section 5. Asymptotics has another important function beyond suggesting numerical approximations for specific nand F. B ~ O(F).2 and asymptotic normality via the delta method in Section 5. although the actual djstribution depends on Pp in a complicated way.' I I • If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good.1. quite generally.(l) The approximation (5. 1 i . d) = '(II' . : i . "(0) > 0. (b) Their validity for the given n and Ttl for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation.
The stronger statement (5.lS.) However.Xn from Po where 0 E and want to estimate a real or vector q(O).7 without further discussion.2.7.2. by the WLLN. (5. That is.d. and probability bounds. e Example 5. Section 5. which is called consistency of qn and can be thought of as O'th order asymptotics. and B. . for all (5. ' . AlS.14. > O. But. We also introduce Monte Carlo methods and discuss the interaction of asymptotics.2) are preferable and we shall indicate some of qualitative interest when we can.2. The least we can ask of our estimate Qn(X I. case become increasingly valid as the sample size increases.. for . Monte Carlo.2 5. remains central to all asymptotic theory. A stronger requirement is (5. we talk of uniform cornistency over K.2 Consistency 301 in exponential families among other results. is a consistent estimate of p(P). asymptotics are methods of approximating risks. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A. . with all the caveats of Section 5. X ~ p(P) ~ ~  p = E(XJ) and p(P) = X.4 deals with optimality results for likelihoodbased procedures in onedimensional parameter models.2. 1 X n are i.Xn ) is that as e n 8 E ~ e. by quantities that can be so computed. We will recall relevant definitions from that appendix as we need them.2. In practice. and 8.14.i. and other statistical quantities that are not realistically computable in closed form. A. If is replaced by a smaller set K. Means.2) is called unifonn consistency. 1denotes Euclidean distance.2. Finally in Section 5.5 we examine the asymptotic behavior of Bayes procedures. distributions. The simplest example of consistency is that of the mean.2) Bounds b(n. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to 00.. For P this large it is not unifonnly consistent.1) where I . . If Xl. in accordance with (A. (See Problem 5. if.2.1 CONSISTENCY PlugIn Estimates and MlEs in Exponential Family Models Suppose that we have a sample Xl. 'in 1] q(8) for all 8. 00. where P is the empirical distribution. but we shall use results we need from A.1).i..1).) forsuP8 P8 lI'in .d. 5.. P where P is unknown but EplX11 < 00 then. . Summary. .14.1.1..7.q(8)1 > 'I that yield (5.Section 5.2. Most aSymptotic theory we consider leads to approximations that in the i. .1) and (B.2.
there exists 6 «) > 0 such that p.2. Pp [IPn  pi > 6] ~ .2.pi > 6«)] But. Asymptotic Approximations Chapter 5 X is uniformly consistent over P because o Example 5.2.4) I i (5.. 0 ~ To some extent the plugin method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plugin estimates. Evidently.. . Then qn q(Pn) is a unifonnly consistent estimate of q(p). By the weak law of large numbers for all p. · . (5. w( q. it is uniformly continuous on S. then by Chebyshev's inequality. Because q is continuous and S is compact. Theorem 5. Binomial Variance. we can go further. l iA) E S be the empirical distribution. Other moments of Xl can be consistently estimated in the same way. with Xi E X 1 .. 0 < p < 1. Suppose the modulus ofcontinuity of q. Suppose thnt P = S = {(Pl.3) that > <} (5. If q is continuous w(q.14.. Ip' pi < 6«). the kdimensional simplex.. where q(p) = p(1p). and (Xl. Pn = (iiI. p' E S.1) and the result follows. q(P) is consistent.b] < R+bedefinedastheinver~eofw.• X n be the indicators of binomial trials with P[X I = I] = p. for all PEP. P = {P : EpX'f < M < oo}. xd is the range of Xl' Let N i = L~ 11(Xi = Xj) and Pj _ Njln. Let X 1. Then N = LXi has a B(n.6)! Oas6! O. 0 In fact. w(q. Suppose that q : S + RP is continuous. Thus.2. implies Iq(p')q(p) I < <Then Pp[li1n . i . consider the plugin estimate p(Iji)ln of the variance of p.2.2.p) distribution.2.ql > <] < Pp[IPn .·) is increasing in 0 and has the range [a.6.p'l < 6}.. which is ~ q(ji). Letw. But further. . where Pi = PIXI = xi]' 1 < j < k.3) Evidently. .1 < j < k.1.l : [a..l «) = inf{6 : w(q. in this case. . by A. w(q. 0) is defined by .I l 302 instance.Pk) : 0 < Pi < 1. and p = X = N In is a uniformly consistent estimate of p. for every < > 0. 6) It easily follows (Problem 5. = Proof. sup{ Pp[liln pi > 61 : pES} < kl4n6 2 (Problem 5. 6 >0 O. w.6) = sup{lq(p)  q(p')I: Ip .b) say..5) I A simple and important result for the case in which X!." is the following: I X n are LLd..2. . .LJ~IPJ = I}.
i.3.1 that the empirical means.. = Proof. let mj(O) '" E09j(X. U implies !'.1.1 .v} = (u.V2. Theorem 5.1: D n 00. Var(U.p). 1 < j < d. af > o.lpl < 1. ' . Let g(u. Vi)..1. We need only apply the general weak law of large numbers (for vectors) to conclude that (5. (ii) ij is consistent. .UV) so that E~ I g(Ui .4.p).2. Suppose [ is open. Variances and Correlations.JL2. Var(V.7..9d) map X onto Y C R d Eolgj(X')1 < 00.2.2 Consistency 303 Suppose Proposition 5. and let q(O) = h(m(O)). )) and P = {P : Eplg(X . !:.V.a~.U 2. if h is continuous. Then. . and correlation coefficient are all consistent. where P is the empirical distribution. is a consistent estimate of q(O). N2(JLI. )1 < oo}. then vIP) h(g).) > 0.then which is well defined and continuous at all points of the range of m. where h: Y ~ RP. Jl2.2. Let g (91.2.2. Then. 1 < i < n be i. X n are a sample from PrJ E P. ai.) I < that h(D n) Foreonsisteney of h(g) apply Proposition B.2. .1 and Theorem 2. Let Xi = (Ui .. thus.) > 0. [ and A(·) correspond to P as in Section 1. Questions of uniform consistency and consistency when P = { DistribuCorr(Uj. . conclude by Proposition 5.)1 < 1 } tions such that EUr < 00. EVj2 < 00. o Example 5. h(D) for all continuous h.). is consistent for v(P). Let TI.Section 5.3. More generally if v(P) = h(E p g(X . If Xl.6) if Ep[g(X.6. Suppose P is a canonical exponentialfamily of rank d generated by T.. (i) Plj [The MLE Ti exists] ~ 1.Jor all 0.d.ar. If we let 8 = (JLI.2. variances. 1 < j < d. 11. 'Vi) is the statistic generating this 5parameter exponential family. 0 Here is a general consequence of Proposition 5. We may. ag. 1 are discussed in Problem 5.2. then Ifh=m.
1.1I)11: II E 6} ~'O (5.p(X" II) is uniquely minimized at 11 i=l for all 110 E 6.. as usual..  inf{D(II.2. = 1 n p. z=l 1 n i. By definition of the interior of the convex support there exists a ball 8. Rudin (1987).. n . for example.2. n LT(Xi ) i=I 21 En. II) 0 j =EII. Po. which solves 1 n A(1)) = . We showed in Theorem 2.8) I > O.LT(Xi )... By a classical result.2. is a continuous function of a mean of Li.3. But Tj. Note that. j 1 Pn(X. (T(Xl)) must hy Theorem 2. . . 5.2. (5. is solved by 110.lI) .2.E1).1 belong to the interior of the convex support hecause the equation A(1)) = to. II) L p(X n.304 Asymptotic Approximations Chapter 5 .1I 0 ) foreverye I . i I . Let Xl.. A more general argument is given in the following simple theorem whose conditions are hard to check.= 1 .3. I . '1 P1)'[n LT(Xi ) E C T) ~ 1. exists iff the event in (5. . if 1)0 is true. . D( 110 . i=l 1 n (5..d.2..L[p(Xi .1 that i)(Xj.1 that On CT the map 17 A(1]) is II and continuous on E. .9) n and Then (} is consistent.lI o): 111110 1 > e} > D(1I0 . I ! . see. . E1). • . i i • • ! . I I Proof Recall from Corollary 2. . where to = A(1)o) = E1)o T(X 1 ).3. D . T(Xdl < 6} C CT' By the law of large numbers. Let 0 be a minimum contrast estimate that minimizes I . The argument of the the previous subsection in which a minimum contrast estimate. X n be Li. .. T(Xl).7) I .2 Consistency of Minimum Contrast Estimates .D(1I0 .. Suppose 1 n PII sup{l.2. vectors evidently used exponential family properties. the inverse AI: A(e) ~ e is continuous on 8. {t: It . BEe c Rd. and the result follows from Proposition 5.j.3. .Xn ) exists iff ~ L~ 1 T(X i ) = Tn belongs to the interior CT of the convex support of the distribution of Tn.d.7) occurs and (i) follows.. 0 Hence.. the MLE.3. 1 i . II) = where. 7 I . I J Theorem 5.1 to Theorem 2.
11) sup{l.8) can often failsee Problem 5.1.2.IJol > E] < PIJ [inf{ . ifIJ is the MLE. (5.[IJ ¥ OJ] = PIJ.2..IJd ). IJ). EIJollogp(XI.IJor > <} < 0] because the event in (5.D(IJo. PIJ."'[p(X i . IJ) = logp(x. (5.10) tends to O.2. {IJ" . then. A simple and important special case is given by the following. Ie ¥ OJ] ~ Ojoroll j.14) (ii) For some compact K PIJ.13) By Shannon's Lemma 2.lJ o): IIJ IJol > oj.2.2.2. 1 n PIJo[inf{.p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1.8) and (5. IJ) .9) follows from Shannon's lemma. [I ~ L:~ 1 (p(Xi .2.1 we need only check that (5.IJ o)  D(IJo. IJ) n ~ i=l p(X"IJ o)] . 0 Condition (5.IJ) p(Xi.2. [inf c e.9) hold for p(x. 2 0 (5. .IJO)) ' IIJ IJol > <} n i=l .8) follows from the WLLN and PIJ.2.2. IJ j )) : 1 <j < d} > <] < d maxi PIJ.2.2. IJ)I : IJ K } PIJ !l o.2.inf{D(IJo.2 Consistency 305 proof Note that. .10) By hypothesis.8) by (i) For ail compact K sup In { c e.D( IJ o.2. IJo)) .2. _ I) 1 n PIJ o [16 .D(IJo. PIJ..L(p(Xi. is finite.12) which has probability tending to 0 by (5.IJol > <} < 0] (5. IJ) . and (5. Coronary 5. An alternative condition that is readily seen to work more widely is the replacement of (5. But because e is finite. But for ( > 0 let o ~ ~ inf{D(IJ.2.D(IJo. for all 0 > 0. IIJ .Section 5.IJ)1 < 00 and the parameterization is identifiable. 0) .[max{[ ~ L:~ 1 (p(Xi .8). IJ)II ' IJ n i=l E e} > . IJ) .2. IJj ) . IJ j ) . IIJ .D(IJ o.11) implies that 1 n ~ 0 (5.L[p(Xi .2. E n ~ [p(Xi .IIIJ  IJol > c] (5.5.11) implies that the righthand side of (5. o Then (5. IJj))1 > <I : 1 < j < d} ~ O. /fe e =   Proof Note that for some < > 0.
.Il E(X Il). Let h : R ~ R.) = <7 2 Eh(X) We have the following. Unfortunately checking conditions such as (5. let Ilglioo = sup{ Ig( t) I : t E R) denote the sup nonn. that sup{PIiBn .3.14) is in general difficult.. see Problem 5. . we require that On ~ B(P) as n ~ 00. As usual let Xl.B(P)I > Ej : PEP} t 0 for all € > O. . If (i) and (ii) hold. 5.. I ~. > 2. Unifonn consistency for l' requires more. When the observations are independent but not identically distributed.3. . ~l Theorem 5. VariX.Xn be i. then = h(ll) + L (j)( ) j=l h .1) where • . in Section 5. ! ! (ii) EIXd~ < 00 Let E(X1 ) = Il. We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families.8) and (5. We conclude by studying consistency of the MLE and more generally Me estimates in the case e finite and e Euclidean. Summary.1 that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk.33. A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems.i.3.d. and assume (i) (a) h is m times differentiable on R. Sufficient conditions are explored in the problems.306 Asymptotic Approximations Chapter 5 We shall see examples in which this modification works in the problems. If fin is an estimate of B(P). consistency of the MLE may fail if the number of parameters tends to infinity.AND HIGHERORDER ASYMPTOTlCS: THE DelTA METHOD WITH APPLICATIONS We have argued. X valued and for the moment take X = R.2.3. We denote the jth derivative of h by 00 . D .. m Mi) and assume (b) IIh(~)lloo j 1 . ~ 5. sequence of estimates) consistency. + Rm (5.2.1. We introduce the minimal property we require of any estimate (strictly speak ing. =suPx 1k<~)(x)1 <M < .1 The Delta Method for Moments • • We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders. We then sketch the extension to functions of vector means. J.3 FIRST.
. EIX I'li = E(X _1')' Proof.'" . for j < n/2. The expression in (c) is. _ . . X ij ) least twice. .2) where IX' that 1'1 < IX . ik > 2.) ~ 0.1'1..3. Moreover.. j > 2...Section 5.3.3.. tr i/o>2 all k where tl ...3.. then there are constants C j > 0 and D j > 0 such (5. 21. .3 First. + i r = j.[jj2] + 1) where Cj = 1 <r.5[jJ2J max {l:: { .3) and j odd is given in Problem 5...and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion..3) for j even.. If EjX Ilj < 00. . .3. +'r=j J .3. and the following lemma. In!.3.. Let I' = E(X. (b) a unless each integer that appears among {i I .3) (5. then (a) But E(Xi . so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r. . . bounded by (d) < t.4) Note that for j eveo. (5. .ij } • sup IE(Xi .i j appears at by Problem 5. . C [~]! n(n . . (n . Xij)1 = Elxdj il.3..'1 +. The more difficult argument needed for (5.1) ..3. 'Zr J .1. We give the proof of (5. . . .5. rl l:: . Lemma 5. :i1 + ..1 < k < r}}. .t" = t1 .4) for all j and (5.2. t r 1 and [t] denotes the greatest integer n. tI.
i • r' Var h(X) = 02[h(11(J.3.3.l) + 00 2n I' + G(n3f2).1 with m = 4.6) can be 1 Eh 2 (X) = h'(J.4) for j odd and (5.l)h(J.1') + {h(2)(J.6.3. I .5) can be replaced by Proof For (5.l) + {h(21(J.3f2 ).3.ll j < 2j EIXd j and the lemma follows. then h(')( ) 2 0 = h(J.3f') in (5.2.1.. n (b) Next.3.3.. I 1 . apply Theorem 5. 0 1 I .3. .l)}E(X .3. respectively. (5. give approximations to the bias of h(X) as an estimate of h(J.3. Then Rm = G(n 2) and also E(X . (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00. .6) n . Because E(X . if 1 1 <j (a) Ilh(J)II= < 00. EIX1 .3) for j even. then I • I.1. 1 < j < 3.3f2 ) in (5.3. if I' = O.1 with m = 3. (d).l) 6 + 2h(J. (5. 00 and 11hC 41 11= < then G(n. and (e) applied to (a) imply (5.1'11.1')3 = G(n 2) by (5.J.3..llJ' + G(n3f2) (5. Proof (a) Write 00. • .5) (b) if E(Xt) < G(n.3.3.308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) . < 3 and EIXd 3 < 00. In general by considering Xi .5) follows.1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J.3..1')2 = 0 2 In. then G(n. and EXt < replaced by G(n').+ O(n.l)h(1)(J. I I I Corollary 5. 0 The two most important corollaries of Theorem 5. I (b) Ifllh(j11l= < 00.5) apply Theorem 5. If the conditions of (b) hold.l) + [h(llj'(1')}':. Corollary 5.1.l) + [h(1)]2(J..4).3.l)h(J.l)E(X .l) and its variance and MSE.. . By Problem 5. 1 iln . I .JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 .3. (n li/2] + 1) < nIJf'jj and (c).2 ). using Corollary 5.
3.3.6). where /1.(h(X) .h(/1) is G(n. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX).Section 5.4) E(X .Z.2 (Problem (5. (5.3. then heX) is the MLE of 1 .3.1.9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n.2 ) 2e.3. 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5. then we may be interested in the warranty failure probability (5.l / Z) unless h(l)(/1) = O.(h(X)) E.5).. Bias and Variance of the MLE of the Binomial VanOance. Note an important qualitative feature revealed by these approximations. the MLE of ' is X . the bias of h(X) defined by Eh(X) .1 with bounds on the remainders.3.3.l ).exp( 2It). .Z) (5.1.  .Z ".3. as we nonnally would. as the plugin estimate of the parameter h(Jl) then.d.3.3.and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5.1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours. (1Z 5.3. . Bias. {h(l) (/1)h(ZI U')/13 (5..4 ) exp( 2/t).3. by Corollary 5. and.3.. We will compare E(h(X)) and Var h(X) with their approximations. If X [. X n are i.3 First. Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5.t.h(/1)) h(2~(II):: + O(n.i. T!Jus.7) If h(t) ~ 1 . which is G(n. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +. To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5.3.2.= E"X I ~ 11 A. If heX) is viewed.. Thus.3. A qualitatively simple explanation of this important phenonemon will be given in Theorem 5.8) because h(ZI (t) = 4(r· 3 .3(1_ ')In + G(n. when hit) = t(1 ./1)3 = ~~.10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 . for large n.11) Example 5.t) and X.i. Example 5.1) = 11. by Corollary 5.exp( 2') c(') = h(/1).3. which is neglible compared to the standard deviation of h(X).
p) n . . Theorem 5.2t. Next compute Varh(X)=p(I.p)J n p(1 ~ p) [I .2p)' .p) + .P) {(1_2 P)2+ 2P(I.p)(I. a Xd d} . R'n p(1 ~ p) [(I .p)'} + R~ p(l .3.3 ).5) yields E(h(X))  ~ p(1 .3.{2(1. .. • ~ O(n. and will illustrate how accurate (5.310 Asymptotic Approximations Chapter 5 B(l.p) = p ( l .2p)')} + R~.p). First calculate Eh(X) = E(X) .' .2(1 . 1 and in this case (5. • . D " ~ I . 1 < j < { Xl .p) {(I _ 2p)2 n Thus. D 1 i I . asSume that h has continuous partial derivatives of order up to m.p)] n I " . Let h : R d t R. + id = m. .pl.p)(I.p(1 .3.p) . . (5.3.' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2. I .10) is in a situation in which the approximation can be checked.2p(1 . and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i. Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi). M2) ~ 2.2p). 0 < i.. II'.2. < m. . n n Because MII(t) = I .E(X') ~ p ..2p)2p(1 .3. ..10) yields .. (5.p(1 .6p(l.amh i.P )} (n_l)2 n nl n Because 1'3 = p(l..p) . the error of approximation is 1 1 + .e x ) : i l + . " ! .2p) n n +21"(1 .5) is exact as it should be.j . ~ Varh(X) = (1.!c[2p(1 n p) .2p)p(l.9d(Xi )f.[Var(X) + (E(X»2] I nI ~ p(1 .
Suppose thot X = R. We get.I' < 00. Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00.14) + (g:.(J1. h : R ~ R.13).6).11.5). B. (5. then Eh(Y) This is a consequence of Taylor's expansion in d variables.2 ).2 The Delta Method for In law Approximations = As usual we begin with d !. 1 < j < d where Y iJ I = 9j(Xi ). ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi. and the appropriate generalization of Lemma 5.3.3.3.1 Then. Theorem 5... The proof is outlined in Problem 5. = EY 1 .3.))) ~ N(O. 5. for d = 2. then O(n.) + 2 g:'.3. Y12 ) (5.3.3. if EIY. by (5. and (5. is to m = 3.14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d . Lemma 5. EI Yl13 < 00 Eh(Y) h(J1.3.3.1.12) Var h(Y) ~ [(:. The results in the next subsection go much further and "explain" the fonn of the approximations we already have. Y 12 ) 3/2). and J. (5.) )' Var(Y.3.12) can be replaced by O(n..) Var(Y.3.) Cov(Y".3).'.h(!.~E(X. (7'(h)) where and (7' = VariX.) + a. + ~ ~:l (J1. The most interesting application.»)'var(Y'2)] +O(n') Approximations (5.3.) gx~ (J1.3 / 2 ) in (5.).x.~ (J1... (J1.3.hUll)~.) Var(Y.) + ~ {~~:~ (J1.C( v'n(h(X) .2.)} + O(n (5.3.and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00. Similarly. 1. 0/ The result follows from the more generally usefullernma.". E xl < 00 and h is differentiable at (5.).3.15) Then . .8. as for the case d = 1..3 First. (J1. under appropriate conditions (Problem 5.) Cov(Y".4.13) Moreover.Section 5.3.
u = /}" j \ V . V for some constant u.15) "explains" Lemma 5. .7.1.3. .312 Asymptotic Approximations Chapter 5 (i) an(Un .g(ll(U)(v  u)1 < <Iv . By definition of the derivative. V and the result follows. . 'j But (e) implies (f) from (b).u)!:.ul < Ii '* Ig(v)  g(u) .i. . • • .a 2 ).3. = I 0 . PIIUn €  ul < iii ~ 1 • "'. we expect (5. then EV.32 and B. an(Un . an = n l / 2. Formally we expect that if Vn !:.3. ~: . from (a). for every (e) > O. Therefore. (5.8). by hypothesis.16) Proof.N(O. (e) Using (e). for every € > 0 there exists a 8 > 0 such that (a) Iv . hence. ( 2 ). j " But. _ F 7 . ·i Note that (5. ~ EVj (although this need not be true. for every Ii > 0 (d) j. Consider 1 Vn = v'n(X !") !:. Thus. . V .N(O. I . (ii) 9 : R Then > R is d(fferentiable at u with derivative 9(1) (u). V.17) .ul Note that (i) (b) '* . The theorem follows from the central limit theorem letting Un X.u) !:.3.3.• ' " and. see Problems 5. I (g) I ~: .. 0 . j .
N(O. "t" Statistics.'" 1 X n1 and Y1 . 1 2 a P by Theorem 5. then Tn . (1 .d.j odd.X) n 2 .a) critical value (or Zla) is approximately correct if H is true and F is not Gaussian. Let Xl.3. we find (Problem 5. Using the central limit theorem. Example 5. Then (5. EZJ > 0. 1'2 = E(Y.9. FE :F where EF(X I ) = '"..> 0 ./2 ).2.3. VarF(Xd = a 2 < 00. For the proof note that by the central limit theorem. £ (5. (1  A)a~ .3 First. (b) The TwoSample Case.. (a) The OneSample Case.3. Consider testing H: /11 = j. v) = u!v. In general we claim that if F E F and H is true.3 we saw that the two sample t statistic Sn vn1n2 (Y X) ' n = = n s nl + n2 has a Tn2 distribution under H when the X's and Y's are normal withar = a~.2 and Slutsky's theorem.and HigherOrder Asymptotics: The Delta Method with Applications 313 where Z ~ N(O.X" be i. Let Xl.a) for Tn from the Tn . But if j is even. 1). A statistic for testing the hypothesis H : 11. Slutsky's theorem. then S t:.1 (Ia) + Zla butthat the t n.'" .O < A < 1.3. i=l If:F = {Gaussian distributions}. and the foregoing arguments. sn!a).+A)al + Aa~) .).s Tn =. and S2 _ n nl ( ~ X) 2) n~(X. ••• .). Now Slutsky's theorem yields (5. 1). else EZJ ~ = O.) and a~ = Var(Y.Section 5. where g(u. j even = o(nJ/').TiS X where S 2 = 1 nl L (X.= 0 versuS K : 11.l (1.18) because Tn = Un!(sn!a) = g(Un . N n (0 Aal.28) that if nI/n _ A.t2 versus K : 112 > 111.3. Yn2 be two independent samples with 1'1 = E(X I ).3.i. In Example 4.18) In particular this implies not only that t n. . we can obtain the critical value t n.17) yields O(n J.. al = Var(X.1 (1 .l distribution.
I = u~ and nl = n2. One sample: 10000 Simulations..'.95) approximation is only good for n > 10 2 . . 316. We illustrate such simulations for the preceding t tests by generating data from the X~ distribution M times independently. the For the twosample t tests.5 . Other distributions should also be tried. Y .05.. each lime computing the value of the t statistics and then giving the proportion of times out of M that the t statistics exceed the critical values from the t table... 20.. then the critical value t. i I . Figure 5.000 onesample t tests using X~ data. ur ..3. X~.5 3 Log10 sample size j i Figure 5. Each plotted point represents the results of 10.1. 0) for Sn is Monte Carlo Simulation As mentioned in Section 5.. Chisquare data I 0. and the true distribution F is X~ with d > 10.c:'::c~:____:_J 0.2 or of a~.1 shows that for the onesample t test. The X~ distribution is extremely skew.1. Here we use the XJ distribution because for small to moderate d it is quite different from the normal distribution.. or 50.1 (0. approximations based on asymptotic results should be checked by Monte Carlo simulations. the asymptotic result gives a good approximation when n > 10 1.. as indicated in the plot. when 0: = 0. .314 Asymptotic Approximations Chapter 5 It follows that if 111 = 11. 10.2 shows that when t n _2(10:) critical value is a very good approximation even for small n and for X. Figure 5.. oL....3. 32. and in this case the t n .02 .5 1 1.. where d is either 2. The simulations are repeated for different sample sizes and the observed significance levels are plotted. .3. ..5 2 2.5 . . i 2(1 approximately correct if H is true and the X's and Y's are not normal... " .
3. in this case.2 (1 .and HigherOrder Asymptotics: The Delta Method with Applications 315 1 This is because.h(JL)] ".' 0. 1 (ri . in the onesample situation.) i O.5 1 1. or 50. and the data are X~ where d is one of 2.02 d'::'.9.hoi n s[h(l)(X)1 .Xi have a symmetric distribution. 2:7 ar Two sample.a~ show that as long as nl = n2. ChiSquare Dala. For each simulation the two samples are the same size (the size indicated on the xaxis).3. and Yi .3.(})) do not have approximate level 0.Section 5_3 First. the t n _2(O. In this case Monte Carlo studies have shown that the test in Section 4. D Next.4 based on Welch's approximation works well.L) where h is con tinuously differentiable at !'.o'[h(l)(JLJf).n2 and at = a')._' 0.. when both nl I=.n2 and at I=.a~.5 3 log10 sample size Figure 5. then the twosample t tests with critical region 1 {S'n > t n . scaled to have the same means. the t n 2(1 .ll)(!. Y .Xd.0' H<I o 0.95) approximation is good for nl > 100.2. af = a~.. as we see from the limiting law of Sn and Figure 5. let h(X) be an estimate of h(J. Each plotted point represents the results of 10. _ y'ii:[h(X) .3. even when the X's and Y's have different X~ distributions. Equal Variances 0. Other Monte Carlo runs (not shown) with I=. }. Moreover. By Theoretn 5. However.X = ~.cCc":'::~___:. To test the hypothesis H : h(JL) = ho versus K : h(JL) > ho the natural test statistic is T.000 twosample t tests.3. 10000 Simulations.12 0. 10. and a~ = 12af. N(O.0:) approximation is good when nl I=. y'ii:[h(X) .5 2 2.
 0. then the sample mean X will be approximately normally distributed with variance 0"2 In depending on the parameters indexing the family considered. Poisson...~'ri i _ . It turns out to be useful to know transformations h. The data in the first sample are N(O. The xaxis denotes the size of the smaller of the two samples. Tn .5 Log10 (smaller sample size) l Figure 5. . " 0.3... o.J~ f).6.I _ __ 0 I +__ I :::. +___ 9.3. +2 + 25 J O'. Unequal Variances.£. '! 1 ..5 1 1. From (5. such that Var heX) is approximately independent of the parameters indexing the family we are considering. too.c:~c:_~___:J 0.{x)() twosample t tests.4. In Appendices A and B we encounter several important families of distributions.3. which are indexed by one or more parameters.3. 2nd sample 2x bigger 0.02""~". £l __ __ 0.". we see that here. If we take a sample from a member of one of these families. Variance Stabilizing Transfonnations Example 5. Combining Theorem 5.. N(O. Each plotted point represents the results of IO. We have seen that smooth transformations heX) are also approximately normally distributed. For each simulation the two samples differ in size: The second sample is two times the size of the first.3. 1) and in the second they are N(Ola 2) where a 2 takes on the values 1.3. and beta.0.. 1) so that ZI_Q is the asymptotic critical value. gamma. as indicated in the plot.10000 Simulations: Gaussian Data. called variance stabilizing. I: ..' OIlS .316 Asymptotic Approximations Chapter 5 Two Sample. such as the binomial. .12 . and 9.3 and Slutsky's theorem.oj:::  6K . if H is true .6) and .
3.. where d is arbitrary.19) for all '/. As an example. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II. To have Varh(X) approximately constant in A.3.\ + d. X n) is an estimate of a real parameter! indexing a family of distributions from which Xl.13) we see that a first approximation to the variance of h( X) is a' [h(l) (/. suppose that Xl. c) (5. which has as its solution h(A) = 2. finding a variance stabilizing transfonnation is equivalent to finding a function h such that for all Jl and (J appropriate to our family.3. . In this case (5. sample.Section 5.. which varies freely. 1/4) distribution. Also closely related but different are socaBed normalizing transformations.3. Suppose. p.3. Edgeworth Approximations The normal approximation to the distribution of X utilizes only the first two moments of X. If we require that h is increasing..hb)) + N(o.1 0 ) Vi' ' is an approximate 1.5.. h(t) = Ii is a variance stabilizing transformation of X for the Poisson family of distributions. in the preceding P( A) case.' ..)C. . Thus. See Problems 5..15 and 5.. 0 One application of variance stabilizing transformations. 538) one can improve on . Substituting in (5.(A)') has approximately aN(O. Some further examples of variance stabilizing transformations are given in the problems. A > 0. .3 First.16. by their definition.3. See Example 5. A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models. yX± r5 2z(1 .and HigherOrder Asymptotics: The Delta Method with Applications 317 (5. Thus. . this leads to h(l)(A) = VC/J>.6..19) is an ordinary differential equation.3..)] 2/ n . . Major examples are the generalized linear models of Section 6.d. 1n (X I.i.Xn are an i. h must satisfy the differential equation [h(ll(A)j2A = C > 0 for some arbitrary c > O. The notion of such transformations can be extended to the following situation.6) we find Var(X)' ~ 1/4n and Vi'((X) . Suppose further that Then again. a variance stabilizing transformation h is such that Vi'(h('Y) .Ct confidence interval for J>. Such a function can usually be found if (J depends only on fJ. 1976. Thus. is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. Under general conditions (Bhattacharya and Rao. In this case a'2 = A and Var(X) = A/n. I X n is a sample from a P(A) family.
1).9990 0.9900 0.' •• .95 :' 1.75 0.38 0.II 0.1254 0.3x) + _(xS .!.3. . P(Tn < x).0254 0.9500 'll .5000 0.0287 0. we need only compute lIn and 1'2n. ~ 2.0481 1./2 2 . " I.ססoo 0.21) The expansion (5.5.34 2.15 0. 0.9950 0.1000 0..0553 0.35 0.9097 o.JL) / a and let lIn and 1'211 denote the coefficient of skewness and kurtosis of Tn.0877 0.3513 0. i = 1.n)1 V2ri has approximately aN(O.7000 0.1051 0.ססOO I .8008 0.15 0.86 0.9506 0.4415 0.20) is called the Edgeworth expansion for Fn .2024 EA NA x 0.3.1.66 0.04 1.0005 0 0.9029 0.5999 0. ii.40 0.0397 0.0105 0.5421 0. ..0010 ~: EA NA x Exact .9995 0. Let F'1 denote the distribution of Tn = vn( X .79 0.IOx 3 + 15x)] I I y'n(X n 9n ! I i !. Therefore. H 3 (x) = x3  3x.6999 0. 1) distribution. To improve on this approximation. Exact to' EA NA {.4000 0.n)' _ 3 (2n)2 = 12 n 1 I .0208 ~1.0100 0.51 1. 0 x Exacl 2.2.4 to compute Xl.9876 0. Edgeworth(2) and nonna! approximations EA and NA to the X~o distribution..91 0. xro I • I I . • " .9997 4.72 .4000 0. It follows from the central limit theorem that Tn = (2::7 I Xi n)/V2ri = (V . Fn(x) ~ <!>(x) . V has the same distribution as E~ 1 where the Xi are independent and Xi N(O.0250 0. Edgeworth Approximations to the X 2 Distribution.0500 ! ! .9984 0.77 0.0050 0.9750 0.9996 1.9724 0..ססoo 1. We can use Problem B..4999 0.0655 O. . Then under some conditionsY) where Tn tends to zero at a rate faster than lin and H 2 • H 3 • and H s are Hermite polynomials defined by H 2 (x) ~ x 2  I.0284 1.6548 4.1) + 2 (x 3 .9999 1. Hs(x) = xS  IOx 3 + 15x.3.I .40 0.85 0 0.9999 1.1964 1.7792 5.ססoo 0. 0.1.<p(x) ' .95 3. 4 0.8000 0. (5. I' 0.318 Asymptotic Approximations Chapter 5 the normal approximation by utilizing the third and fourth moments.3000 0.61 0. "J 1 '~ • 'YIn = E(Vn? (2n)1 E(V .0032 ·1. Suppose V rv X~.3006 0.2706 0.9684 0." • [3 n .1 gives this approximation together with the exact distribution and the nonnal approximation when n = 10.6000 0.9943 0.9905 0. TABLE 5. According to Theorem 8.3.0001 0 0.2000 0.. where Tn is a standardized random variable. Table 5. Example 5.38 0.3.3.n.
2 + p'A2.u) ~ N(O. and let r 2 = C2 /(j3<.I. and Yj = (Y .u) where Un = (n1EXiYi.0.. >'1. It follows from Lemma 5..9) that vn(C . a~ = Var Y. .i.d..22) 2 g(l)(u) = (2U.p).2.2 >'1. Suppose {Un} are ddimensional random vectors and that for some sequence of constants {an} with an !' (Xl as n Jo (Xl. = Var(Xkyj) and Ak.1.3. Y)/(T~ai where = Var X. .6. Using the central limit and Slutsky's theorems.J 2 2 + 4 2 4 2 Tn P 120 + P T 02 +2{ _2 p3Al./U2U3.2.0 .9. Ui/U~U3.2 extends to the dvariate case. We can write r 2 = g(C.1.a~) : n 3 !' R.3 and (B.o}. vn(i7f . where g(UI > U2.2 >'2.J11)/a.2rJ >".).1.3. U3) = Ui/U2U3.j.3.0. Yn ) be i. (X n . E). (ii) g. (i) un(U n .3..3 First.3.0 >'1.2. . The proof follows from the arguments of the proof of Lemma 5.5 that in the bivariate normal case the sample correlation coefficient r is the MLE of the population correlation coefficient p and that the likelihood ratio test of H : p = is based on Irl.= (Xi . Let p2 = Cov 2(X. 1. _p2.ij~ where ar Recall from Section 4.5. = ai = 1. R d !' RP has a differential g~~d(U) at u._p2). with  r?) is asymptotically nonnal.0 TO .u) ~~ V dx1 forsome d xl vector ofconstants u. Lemma 5.Section 5. .n lEX.1. as (X.2 T20 . then by the 2 vn(U .J12) / a2 to conclude that without use the transformations Xi j loss of generality we may assume J11 = 1k2 = 0.ar.0.2.0 >'1.3. Let central limit theorem T'f. Let (X" Y. Because of the location and scale invariance of p and r. we can .3.3. P = E(XY).0. Then Proof.and HigherOrder Asymptotics: The Delta Method with Applications 319 The Multivariate Case Lemma 5.0.m. 4f. 0< Ey4 < 00.0.0 2 (5.2.1) and vn(i7~ .6) that vn(r 2 N(O.1) jointly have the same asymptotic distribution as vn(U n .1. Y) where 0 < EX 4 < 00.n1EY/) ° ~ ai ~ and u = (p.2. xmyl). UillL2U~) = (2p. >'2. Example 5.2. aiD. 1). we can show (Problem 5. E Next we compute = T11 .l = Cov(XkyJ.
. we see (Problem 5.10) that in the bivariate nonnal case a variance stahilizing transformation h(r) with .23) • .Y) ~ N(/li.p) !:.h= (h i .. 1938) that £(vn . Asymptotic Approximations Chapter 5 = 4p2(1_ p2)'.320 When (X. David.u 2. Var Y i = E and h : 0 ~ RP where 0 is an open subset ofRd.P The approximation based on this transfonnation.fii(Y .rn) N(o. .fii(r . N(O.g. o Here is an extension of Theorem 5. E) = .fii[h(r) .8. 2 1. j f· ~. .3. .UI.3. Theorem 5.3[h(c) .mil £ ~ hey) = h(rn) and (b) so that .5 (a) ! .24) I Proof.3. it gives approximations to the power of these tests. . Suppose Y 1.3(h(r) . ': I = Ilgki(rn)11 pxd. N(O. c E (1.3. .~a)/vn3} where tanh is the hyperbolic tangent.19). and (Prohlem 5. . .4.1'2.hp )andhhasatotaldifferentialh(1)(rn) f • .3. I Y n are independent identically distributed d vectors with ElY Ii 2 < 00.1)..p).3. (5. that is.a)% confidence interval of fixed length. p=tanh{h(r)±z (1. Argue as before using B.: ! + h(i)(rn)(y .h(p)J !:. . and it provides the approximate 100(1 .h(p)) is closely approximated by the NCO..l) distribution.3.9) Refening to (5.3. (5. Then ~. then u5 . ! f This expression provides approximations to the critical value of tests of H : p = O.rn) + o(ly . (1 _ p2)2).h(p)]). has been studied extensively and it has been shown (e. EY 1 = rn. .1) is achieved hy choosing h(P)=!log(1+ P ). which is called Fisher's z. per < c) '" <I>(vn . . .
' = Var(X..:: .1. we first note that (11m) Xl is the average ofm independent xi random variables. . the:F statistic T _ k. if k = 5 and m = 60. which is labeled m = 00. 00.m' To show this. we conclude that for fixed k.15.k+m L::l Xl 2 (5.7) implies that as m t 00. This row. where Z ~ N(O. gives the quantiles of the distribution of V/k. check the entries of Table IV against the last row.3.i=k+l Xi ". only that they be i.k.. 1 Xl has a x% distribution. then P[T5 . i=k+1 Using the (b) part of Slutsky's theorem. We do not require the Xi to be normal. with EX l = 0.21 for the distribution of Vlk. For instance. > 0 and EXt < 00..05 and the respective 0.1.and HigherOrder Asymptotics: The Delta Method with Applications 321 (c) jn(h(Y) . D Example 5. xi and Normal Approximation to the Distribution of F Statistics. When k is fixed and m (or equivalently n = k + m) is large.3.Xn is a sample from a N(O. But E(Z') = Var(Z) = 1. By Theorem B.3..~1.m.0 < 2.3 First. See also Figure B. we can use Slutsky's theorem (A.05 quantiles are 2..3. . Next we turn to the normal approximation to the distribution of Tk.0 distribution and 2. the :Fk.). the mean ofaxi variable is E(Z2).h(m)) = y'nh(l)(m)(Y  m) + op(I).37] = P[(vlk) < 2.1 in which the density of Vlk.x%. when the number of degrees of freedom in the denominator is large.m. 14. we can write. > 0 is left to the problems. as m t 00. (5.m  (11k) (11m) L. EX.3.m distribution can be approximated by the distribution of Vlk. Suppose for simplicity that k = m and k . where V .1. Suppose that Xl.:l ~ m k+m L X..m distribution.211 = 0.i. Then according to Corollary 8. is given as the :FIO . Thus. when k = 10.3.26) l.l) distribution. Suppose that n > 60 so that Table IV cannot be used for the distribution ofTk.7. By Theorem B.37 for the :F5 .d.25) has an :Fk. if .Section 5. Then. L::+. I). Now the weak law of large numbers (A.= density.9) to find an approximation to the distribution of Tk. We write T k for Tk. To get an idea of the accuracy of this approximation.3. The case m = )'k for some ). where k + m = n.
X n are a sample from PTJ E P and ij is defined as the MLE 1 . i. a').3 Asymptotic Normality of the Maximum likelihood Estimate in Exponential Families Our final application of the 8method follows. j m"': k (/k. v) ~ ~. l (: An interesting and important point (noted by Box. v'n(Tk . when Xi ~ N(o.»)T and ~ = Var(Yll)J.)/ad 2 .3. Then if Xl.)T.. h(i)(u.3. the distribution of'.k (t 1 (5.m} t 00.8(c» one has to use the critical value . 1953) is that unlike the t test.m critical value fk.m = (1+ jK(:: z. where J is the 2 x 2 v'n(Tk 1) ""N(0. Thus (Problem 5.28) . !1 . Equivalently Tk = hey) where Y i ~ (li" li. which by (5.3. i = 1.5.m • < tl P[) :'. v) identity. 322 Asymptotic Approximations Chapter 5 where Yi1 = and Yi2 = Xf+i/a2. .k(tl)] '" 'fI() :'. ~ N (0. 2) distribution. = E(X.m 1) <) :'. 0 ! j i · 1 where K = Var[(X. the upper h. 1)T. . = (~.' !k . and a~ = Var(X. the F test for equality of variances (Problem 5. a'). We conclude that xl jer}. _. Suppose P is a canonical exponential family of rank d generated by T with [open. :f. :. when rnin{k.k(T>. P[Tk.8(a» does not have robustness of level.29) is unknown. When it can be eSlimated by the method of moments (Problem 5.a). .3. In general. ! i l _ ..k(Tk.3.a) .2Var(Yll))' In particular if X.1) "" N(O. ifVar(Xf) of 2a'. k.m(l. ~ Ck. I)T and h(u.3.3. 1'. as k ~ 00. • • 1)/12) . By Theorem 5.)..).'.o) k) K (5.m(1 . I' Theorem 5.8(d)).1)/12 j 2(m + k) mk Z.I.3.m(1. (5.27) where 1 = (1. 1 5.7). Specifically. .. 1'.28) satisfies Zto '" 1 I I:. 4).3.4._o l or ~. if it exists and equal to c (some fixed value) otherwise.a) '" 1 + is asymptotically incorrect.3. .m 1) can be ap proximated by a N(O. i. E(Y i ) ~ (1. In general (Problem 53.
the asymptotic variance matrix /1(11) of . and 1).30) But D A . Example 5.24). in our case.1.8.8.2 where 0" = T. We showed in the proof ofTheorem 5.33) 71' 711 Here 711 = 1'/0'. = A by definition and.1. For (ii) simply note that.". T/2 By Theorem 5. The result is a consequence. therefore. iiI = X/O".2 that. hence. by Corollary 1. = 1/20'.11) eqnals the lower bound (3.).In(ij .and HigherOrder Asymptotics: The Delta Method with Applications 323 (i) ii = 71 + ~ I:71 A •• • I (T/)(T(X.3.T.3. Thus. Recall that A(T/) ~ VarT/(T) = 1(71) is the Fisher information.i. of Theorems 5.4. 0'). (5. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.2. I'.A(T/)) + oPT/ (n .Section 5. (i) follows from (5.3.38) on the variance matrix of y'n(ij . and.3. by Example 2.) . = 1/2. thus.1.. (I" +0')] .3. are sufficient statistics in the canonical model.3.3.'l 1 (ii) LT/(Vii(iiT/)).4 with AI and m with A(T/). where.3.6. o Remark 5. = (5. as X witb X ~ Nil'. (5. Ih our case.Nd(O. Then T 1 = X and 1 T2 = n. Thus.2 and 5.N(o.EX. X n be i.3. Note that by B.2.  (TIl' .3 First.4.I(T/) (5. . (ii) follows from (5.2..ift = A(T/). PT/[T E A(E)] ~ 1 and.32) Hence.71) for any nnbiased estimator ij.5.A 1(71))· Proof.3.31) .d.l4.23).O. Let Xl>' ..4. PT/[ii = AI(T)I ~ L Identify h in Theorem 5. Now Vii11.3.3. if T ~ I:7 I T(X.
4 ASYMPTOTIC THEORY IN ONE DIMENSION I: I " ! .33) and Theorem 5.i.. which lead to a result on the asymptotic nonnality of the MLE in multiparameter exponential families. i 7 •• .4. 5.1. I I I· < P.. . and il' = T. . Consistency is Othorder asymptotics. N = (No. testing.d.d. as Y. taking values {xo. and h is smooth. These stochastic approximations lead to Gaussian approximations to the laws of important statistics. is twice differentiable for 0 < j < k. the difference between a consistent estimate and the parameter it estimates. I .3. We begin in Section 5. .1) i . We focus first on estimation of O. In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families. We consider onedimensional parametric submodels of S defined by P = {(p(xo. 8 open C R (e. These "8 method" approximations based on Taylor's fonnula and elementary results about moments of means of Ll. Finally. . . Firstorder asymptotics provides approximations to the difference between a quantity tending to a limit and the limit. Secondorder asyrnptotics provides approximations to the difference between the error and its firstorder approximation.4. Eo) .1.".6. Summary. The moment and in law approximations lead to the definition of variance stabilizing transfonnations for classical onedimensional exponential families.L~ I l(Xi = Xj) is sufficient. 5. Y n are i. and so on. Thus. Nk) where N j . . Following Fisher (1958)'p) we develop the theory first for the case that X""" X n are i. < 1. the (k+ I)dimensional simplex (see Example 1.4 to find (Problem 5.. .... sampling. Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean.4 and Problem 2.d. 0. Xk} only so that P is defined by p . J J . Assume A : 0 ~ pix.d. and PES. see Example 2.i.(T. .7).P(Xk' 0)) : 0 E 8}. .3. stochastic approximations in the case of vector statistics and parameters are developed.26) vn(X /"..Pk) where (5. for instance. variables are explained in tenns of similar stochastic approximations to h(Y) . . under Li.15). 0). EY = J. and confidence bounds.: CPo. .s where Eo = diag(a 2 )2a 4 ). .)'. when we are dealing with onedimensional smooth parametric models..3. .') N(O.1 by studying approximations to moments and central moments of estimates.h(JL) where Y 1.. Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. . . we can use (5.3.. . I i I . Higherorder approximations to distributions (Edgeworth series) are discussed briefly.g.L. .1 Estimation: The Multinomial Case . 0 . i In this section we define and study asymptotic optimality for estimation. 0). il' .324 Asymptotic Approximations Chapter 5 Because X = T..
2 (8. > rl(O) M a(p(8)) P. .2) is twice differentiable and g~ (X I) 8) is a well~defined.2).3) Furthermore (See'ion 3.1. < k.8) logp(X I . for instance.8) . fJl E. 8) is similarly bounded and well defined with (5.8) ~ I)ogp(Xj. pC') =I 1 m . (5.11)) of (J where h: S satisfies ~ R h(p(8» = 8 for all 8 E e > (5. Assume H : h is differentiable.9) .4..4.4.Jor all 8.4.1. 88 (Xl> 8) and =0 (5. Moreover. Then we have the following theorem.0). Consider Example where p(8) ~ (p(xo.11).l (Xl.7) where .1.6) 1.4. Next suppose we are given a plugin estimator h (r.Section 5.4.8)1(X I =Xj) )=00 (5. Under H. Many such h exist if k 2. 8))T. (0) 80(Xj. 0 <J (5. .4..:) (see (2.. h) with eqUality if and only if.4.4..4) g.4. bounded random variable (5. h) is given by (5.p(Xk. (5.4 Asymptotic Theory in One Dimension 325 Note that A implies that k [(X I .8).4.5) As usual we call 1(8) the Fisher infonnation.4.2(0. if A also holds. Theorem 5.
II.8)).l6) as in the proof of the information inequality (3.12).14) ..4.4. &8(X.4.10) Thus.4. Taking expectations we get b(8) = O.8) j=O 'PJ Note that by differentiating (5.. using the correlation inequality (A.4. h(p(8)) ) ~ vn Note that.8) ) ~ . h k &h &pj(p(8)) (N p(xj. Apply Theorem 5. not only is vn{h (. 0) = • &h fJp (5. using the definition of N j .9).6).4.4. 2: • j=O &l a:(p(8))(I(XI = Xj) .8) gives a'(8)I'(8) = 1. Noting that the covariance of the right' anp lefthand sides is a(8). By (5..326 Asymptotic Approximations Chapter 5 Proof.13) I I · .11) • (&h (p(O)) ) ' p(xj. by noting ~(Xj.. we obtain 2: a:(p(8)) &0(xj. ~ir common variance is a'(8)I(8) = 0' (0. I h 2 (5. : h • &h &Pj (p(8)) (N . h).12) [t. 0)] p(Xj.I n.3. (5. ( = 0 '(8.8) = 1 j=o PJ or equivalently. which implies (5.4. by (5.10).p(xj.13). we s~~ iliat equality in (5. h) = I . (8.4. whil.p(xj.2 noting that N) vn (h (. 8). I (x j.8)&Pj 2: • &h a:(p(0))p(x. (5.4./: . for some a(8) i' 0 and some b(8) with prob~bility 1.4.4. • • "'i.h)I8) (5.16) o .')  h(p(8)) } asymptotically normal with mean 0.8)) = a(8) &8 (X" 8) +b(O) &h Pl (5.p(x). (5..h)Var.8) ) +op(I). kh &h &Pj (p(8))(I(Xi = Xj) . but also its asymptotic variance is 0'(8.15) i • . we obtain &l 1 <0 .8) with equality iff.4.
0)=0. if it exists and under regularity conditions. Then Theo(5. Xk}) and rem 5.. by ( 2.3 that the information bound (5.18) and k h(p) = [A]I LT(xj)pj )". J(~)) with the asymptotic variance achieving the infonnation bound Jl(B)..::7 We shall see in Section 5.4. : E Ell...Section 5A Asymptotic Theory in One Dim"e::n::'::io::n ~_ __'3::2:.3) L.Oo) = E. In the next two subsections we consider some more general situations.d. is a canonical oneparameter exponential family (supported on {xo. E open C R and corresponding density/frequency functions p('.. p) and HardyWeinberg models can both be put into this framework with canonical parameters such as B = log ( G) in the first case. . ."j~O (5. 5.8) is. As in Theorem 5. 0 Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena.1.8) = exp{OT(x) .4. .0)) ~N (0.3.17) L.0 (5. 0).4. 0) and (ii) solvesL::7~ONJgi(Xj.). which (i) maximizes L::7~o Nj logp(xj.I L::i~1 T(Xi) = " k T(xJ )". Xl.(vn(B .19) The binomial (n. Suppose i.4. Note that because n N T = n .xd).Xn are tentatively modeled to be distributed according to Po. then... .2 Asymptotic Normality of Minimum Contrast and MEstimates o e o We begin with an asymptotic normality theorem for minimum contrast estimates. Write P = {P. .5 applies to the MLE 0 and ~ e is open. Let p: X x ~ R where e D(O. Suppose p(x. .o(p(X"O)  p(X"lJo )) ..3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check. 0 E e. the MLE of B where It is defined implicitly ~ by: h(p) is the value of O.i. Example 5. achieved by if = It (r.4.2. OneParameter Discrete Exponential Families..3..A(O)}h(x) where h(x) = l(x E {xo.4.4.
=1 n ! (5. 8.21) i J A2: Ep.pix.4.1.328 Asymptotic Approximations Chapter 5 .20) In what follows we let p. # o. parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for.p = Then Uis well defined.23) I.. ~ (X" 0) has a finite expectation and i ! . Under AOA5. . { ~ L~ I (~(Xi. We need only that O(P) is a parameter as defined in Section 1. j. . . Theorem 5.4. f 1 . ~I AS: On £. L.t) . i' On where = O(P) +.0(P)) l. P) = .O(P)) +op(n. 0) is differentiable.4.~(Xi. O)ldP(x) < co.1.L . Let On be the minimum contrast estimate On ~ argmin . hence.!:.p(. o if En ~ O. On is consistent on P = {Pe : 0 E e}.1/ 2) n '. • is uniquely minimized at Bo.22) J i I .LP(Xi.p(x.4. O)dP(x) = 0 (5.O(p)))I: It . O(P). < co for all PEP.O). Suppose AI: The parameter O(P) given hy the solution of . . as pointed out later in Remark 5.21) and.. As we saw in Section 2.' '. denote the distribution of Xi_ This is because. rather than Pe.4. That is.p E p 80 (X" O(P») A3: .4. J is well defined on P.3. PEP and O(P) is the unique solution of(5. ~i  i .p(x. 1 n i=l _ .4.p2(X. O(Pe ) = O. under regularity conditions the properties developed in this section are valid for P ~ {Pe : 0 E e}. . (5.O(P)I < En} £.p(Xi. n i=l _ 1 n Suppose AO: .2.p(x..p(X" On) n = O. A4: sup. O(P))) . 0 E e.. O(P» / ( Ep ~~ (XI. (5. That is.
28) .p) = E p 1jJ2(Xl .p) < 00 by AI.l   2::7 1 iJ1jJ.On)(8n .22) because .20).L 1jJ(X" 8(P)) = Op(n.4.24) follows from the central limit theorem and Slutsky's theorem.4.4.4. applied to (5.O(P)) = n.O(P))) p (5.. 1 1jJ(Xi .25) where 18~  8(P)1 < IiJn . n..22) follows by a Taylor expansion of the equations (5. Let On = O(P) where P denotes the empirical probability.20) and (5. O(P))..4. where (T2(1jJ.27) we get.1j2 ).O(P)I· Apply AS and A4 to conclude that (5.Section 5.24) proof Claim (5.Ti(en .4.4 Asymptotic Theory in One Dimension 329 Hence. A2.29) . t=1 1 n (5.' " 1jJ(Xi .425)(5. 8(P)) + op(l) = n ~ 1jJ(Xi .O(P)) / ( El' ~t (X" O(P))) o while E p 1jJ2(X"p) = (T2(1jJ._ . we obtain. (iJ1jJ ) In (On .lj2 and L 1jJ(X" P) + op(l) i=1 n EI'1jJ(X 1. But by the central limit theorem and AI.O(P)) ~ .4. .4. and A3.27) Combining (5.O(P)) 2' (E '!W(X.4.21).4.O(P)) Ep iJO (Xl.26) and A3 and the WLLN to conclude that (5. By expanding n. Next we show that (5.4. using (5.4.' " iJ (Xi .8(P)) I n n n~ t=1 n~ t=1 e (5.4. (5. en) around 8(P).
B) where p('. essentially due to Cramer (1946). I o Remark 5.4.O(P) Ik = n n ?jJ(Xi .e. 1 •• I ! .4. 0) l' P = {Po: or more generally. O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I . We conclude by stating some sufficient conditions.?jJ(XI'O)).4. .4..4.4.2 may hold even if ?jJ is not differentiable provided that . and A3 are readily checkable whereas we have given conditions for AS in Section 5.=0 ?jJ(x. written as J i J 2::.30) Note that (5.4. P but P E a}. O(P)) + op (I k n n ?jJ(Xi . This is in fact truesee Problem 5.1. 0) is replaced by Covo(?jJ(X" 0). O(P)) ) and (5.28) we tinally obtain On .. =0 (5. B» and a suitable replacement for A3.1. I 'iiIf 1 I • I I .3.30) suggests that ifP is regular the conclusion of Theorem 5..O). n I _ . If further Xl takes on a finite set of values.31) for all O.O) = O. Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i.20) are called M~estimates. that 1/J = ~ for some p). it is easy to see that A6 is the same as (3. O)dp. A4 is found.4.Eo (XI.30) is formally obtained by differentiating the equation (5. A2. O)p(x.O)?jJ(X I .4.2. • Theorem 5.4.8). This extension will be pursued in Volume2.12).2 is valid with O(P) as in (I) or (2).4. Remark 5. and that the model P is regular and letl(x.2. X n are i.4. for A4 and A6. 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5. for Mestimates. Suppose lis differen· liable and assume that ! • &1 Eo &O(X I .22) follows from the foregoing and (5. e) is as usual a density or frequency function. Our arguments apply even if Xl.'.d.: 1. en j• . If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x. {xo 1 ••• 1 Xk}. 0) = 6 (x) 0.4. 0) Covo (:~(Xt. Solutions to (5. Remark 5. . 0 • . Z~ (Xl. (2) O(P) solves Ep?jJ(XI. Identity (5.! . Conditions AO.4.29).4. and we define h(p) = 6(xj )Pj.4.i.4. An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po. 0) = logp(x.21). we see that A6 corresponds to (5. A6: Suppose P = Po so that O(P) = 0. Our arguments apply to Mestimates.(x) (5. . AI.
0) . 10' . B). 0') is defined for all x. 5.4.35) with equality iff.33) so that (5. In this case is the MLE and we obtain an identity of Fisher's.O) = l(x..Section 5.0'1 < J(O)} 00.4. 0) (5.4.8) is a continuous function of8 for all x.1). satisfies If AOA6 apply to p(x.4. < M(Xl.:d in Section 3. 0) and . (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl.O) and P ~ ~ Pe. We also indicate by example in the problems that some conditions are needed (Problem 5.3. 0') . We can now state the basic result on asymptotic normality and e~ciency of the MLE. w.p = a(0) g~ for some a 01 o.O) = l(x.4. Pe) > 1(0) 1 (5.O)." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems. 0) < A6': ~t (x. "dP(x) = p(x)diL(x).4. ..4.20) occurs when p(x.4) but A4' and A6' are not necessary (Problem 5.01 < J( 0) and J:+: JI'Ii:' (x.4.4. = en en = &21 Ee e02 (Xl. Theorem 5. AOA6.p. then the MLE On (5. lO.34) Furthermore.4.p(x..O) 10gp(x. where EeM(Xl. 0) obeys AOA6. where iL(X) is the dominating measure for P(x) defined in (A. 0) g~ (x. s) diL(X )ds < 00 for some J ~ J(O) > 0.32) where 1(8) is t~e Fisher information introduq.4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI. then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5. That is.eo (Xl.
PoliXI < n1/'1 .1/4 I (5.36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!.. I I Example 5. we know that all likelihood ratio tests for simple BI e e eo.4.4.. 5. 1 . 00.l'1 Let X" .r 332 Asymptotic Approximations Chapter 5 Proof Claims (5. 1). {X 1. 1 I I.nBI < nl/'J <I>(n l/ ' . . Let Z ~ N(O..4.Xn be i.38) Therefore.4.4. thus. ... . (5. see Lehmann and Casella. Hodges's Example..4. We next compule the limiting distribution of .4.'(B) = I = 1(~I' B I' 0..34) follow directly by Theorem 5.. .<1>( _n '/4 .". 1998. 442.3 generalizes Example 5.2..nB) . 1.30) and (5. The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(.4. . Therefore. . Then . 0 e en e. (5. PoliXI < n 1/4 ] .• •. Consider the following competitor to X: B n I 0 if IXI < n. For this estimate superefficiency implies poor behavior of at values close to 0. N(B. cross multiplication shows that (5. 0 because nIl' . .4. 0 1 . B) = 0. The optimality part of Theorem 5.33) and (5.4.36) Because Eo 'lj. 1.. and.2. }.B).ne .'(0) = 0 < l(~)' The phenomenon (5. . ..37) .i .. PIIZ + . .4. for higherdimensional the phenomenon becomes more disturbing and has important practical consequences. I. B) with J T(x) . 1 .A'(B). claim (5. ..d.1). .35). for some Bo E is known as superefficiency. B) is an MLR family in T(X).3 is not valid without some conditions on the esti· mates being considered...39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo).4. We discuss this further in Volume II..35) is equivalent to (5. By (5. p.1/4 X if!XI > n.i.4. PolBn = Xl . if B I' 0.. However.4 Testing ! I i .n(Bn .4.(X B). Then X is the MLE of B and it is trivial to calculate [( B) _ 1.1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise. and Polen = OJ .. .439) where .4.nB).4. We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n.1 once we identify 'IjJ(x. If B = 0. j 1 " Note that Theorem 5.
46) PolBn > cn(a.42) is sometimes called consistency of the test against a fixed alternative. It seems natural in general to study the behavior of the test. ljO < 00 .45) which implies that vnI(Oo)(c.4. > AO.4. The test is then precisely. Theorem 5.40). . Suppose the model P = {Po .d.(a. X n distributed according to Po..Oo)] PolOn > cn(a.:cO"=~ ~ 3=3:..41) follows.3 versus simple 2 .44) But Polya's theorem (A. a < eo < b.4. On the (5. B E (a. Xl. 00) . 1) distribution.m='=":c'. this test can also be interpreted as a test of H : >. 00)] = PolvnI(O)(Bn  0) > VnI(O)(cn(a. 00) . e e e e. derive an optimality property.(1 1>(z))1 ~ 0.4. "Reject H for large values of the MLE T(X) of >. Let en (0:'.(a. The proof is straightforward: PoolvnI(Oo)(Bn .Oo)] = a and B is the MLE n n of We will use asymptotic theory to study the behavior of this test when we observe ij. eo) denote the critical value of the test using the MLE en based On n observations.4. Then ljO > 00 .I / 2 ) (5.Section 5.Oo) = 00 + ZIa/VnI(Oo) +0(n..4.(a. 00 )" where P8 .00) other hand. PoolBn > 00 + zl_a/VnI(Oo)] = POo IVnI(Oo) (Bn  00) > ZI_a] ~ a..2 apply to '0 = g~ and On.h the same behavior.l4..4. That is.43) Property (5. (5. and then directly and through problems exhibit other tests wi!.[Bn > c(a.:c". [o(vn(Bn 0)) ~N(O. "Reject H if B > c(a.42)  ~ o. the MLE.4. 00 )]  ~ 1. PolO n > c.22) guarantees that sup IPoolvn(Bn .a quantile o/the N(O.41) where Zlct is the 1 . . ~ 0.:. 0 E e} is such thot the conditions of Theorem 5.0)]. as well as the likelihood ratio test for H versus K. (5. ..4.0:c":c'=D. Proof. b). Zla (5.. Suppose (A4') holds as well as (A6) and 1(0) < oofor all O.4 Asymptotic Theo:c'Y:c. and (5. Thus.40) > Ofor all O. (5.. If pC e) is a oneparameter exponential family in B generated by T(X). < >"0 versus K : >. Then c.00) > z] .4...4.4. 1 < z . are of the form "Reject H for T( X) large" with the critical value specified by making the probability of type I error Q at eo.4. • • where A = A(e) because A is strictly increasing.::.r'(O)) where 1(0) (5.00) > z] ~ 11>(z) by (5.
8 + zl_o/.8) > .jnI(8)(80 . the power of the test based on 8n tends to I by (5. X n ) i. Claims (5. iflpn(X1 . .jnI(80) + 0(n. ..jnI(O)(Bn .40) hold uniformly for (J in a neighborhood of (Jo.1 / 2 )) .4. X n ) is any sequence of{possibly randomized) critical (test) functions such that . .   j Proof.4. the power of tests with asymptotic level 0' tend to 0'.8) < z] .jnI(8)(80 ... then limnE'o+. o if "m(8 .80) ~ 8)J Po[. < 1 lI(Zl_a ")'.42) and (5.4. Furthennore.jnI(80) +0(n.4. (3) = Theorem 5. j (5. 80 )  8) .4.50) i.51) I 1 I I .4.48) and (5.49) ! i .4.jnI(8)(Bn .50). . In fact.4.017". f.4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to Ct from the left at 00 and continues rising steeply to 1 to the right of 80 . On the other hand.jI(80)) ill' > 0 > llI(zl_a ")'.50)) and has asymptotically smallest probability of type I error for B < Bo.1/ 2 ))].4. Suppose the conditions afTheorem 5. assume sup{IP.4. Theorem 5.  i.")' ~ "m(8  80 ).80 ). 00 if 8 < 80 . the test based on 8n is still asymptotically MP.49)) the test based on rejecting for large values of 8n is asymptotically uniformly most powerful (obey (5. (5. then (5.2 and (5. (5. .J 334 Asymptotic Approximations Chapter 5 By (5.48).[.4. . 80)J 1 P. LetQ) = Po.4. these statements can only be interpreted as valid in a small neighborhood of 80 because 'Y fixed means () + B .(1 lI(z))1 : 18 .jnI(8)(Bn .4.48) . I. (5.43) follow. ~ " uniformly in 'Y.4.jnI(8)(80 .jnI(8)(Cn(0'. 0.80) tends to infinity.80 1 < «80 )) I J ..4. Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ")' "m( 8 . Write Po[Bn > cn(O'.jnI(8)(Cn(0'.(80) > O. 1 ~.80 ) tends to zero.8) > . That is.4.[.jI(80)) ill' < O. then by (5. 00 if8 > 80 0 and .8 + zl_a/.41).k'Pn(X1.4. I .47) lor. j .5.50) can be interpreted as saying that among all tests that are asymptotically level 0' (obey (5.8) + 0(1) . If "m(8 .4. ~! i • Note that (5. .···. In either case.
54) establishes that the test 0Ln yields equality in (5. " Xn. if I + 0(1)) (5. P.48) follows.' L10g i=1 p (X 8) 1.5 in the future will be referred to as a Wald test.5. 0) denotes the density of Xi and dn .. . .8 0 ) . > dn (Q. note that the Neyman Pearson LR test for H : 0 = 00 versus K : 00 + t.O+. i=1 P 1.53) tends to the righthand side of (5.4.8) that..4 Asymptotic Theory in One Dimension 335 If.8) 1. 0 (5. € > 0 rejects for large values of z1.4. To prove (5.50) for all. . Q). Thus.j1i(8  8 0 ) is fixed.4.4. 00 + E) logPn(X l E I " .4. 8 + fi) 0 P'o+:. 80)] of Theorems 5. Further Taylor expansion and probabilistic arguments of the type we have used show that the righthand side of (5.4. is asymptotically most powerful as well.7). These are the likelihood ratio test and the score or Rao test.[logPn( Xl. for all 'Y.jn(I(8o) + 0(1))(80 . hence.. .50) and.Xn) is the critical function of the Wald test and oLn (Xl.Xn ) = 5Wn (X" . 0 The asymptotic!esults we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n..Section 5.. [5In (X" .o + 7. The details are in Problem 5.4. . t n are uniquely chosen so that the right. It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form "Reject if L log[p(X" 8n )!p(X i=1 n i ." + 0(1) and that It may be shown (Problem 5. ..53) where p(x. is O..?. There are two other types of test that have the same asymptotic behavior. 1(8) = 1(80 + + fi) ~ 1(80) because our uniformity assumption implies that 0 1(0) is continuous (Problem 5. for Q < ~.4.. X n .OO)J .)] ~ L (5.jI(Oo)) + 0(1)) and (5.54) Assertion (5..Xn) is the critical function of the LR test then.4.4. . 1.53) is Q if. ~ .4.52) > 0.4. Finally.. k n (80 ' Q) ~ if OWn (Xl. Llog·· (X 0) =dn (Q.4 and 5.<P(Zl_a . hand side of (5. 00)]1(8n > 80 ) > kn(Oo. X. .4..80 ) < n p (Xi...4. The test li8n > en (Q.<P(Zla(1 + 0(1)) + .. . 0 n p(Xi .OO+7n) +EnP. .50) note that by the NeymanPearson lemma.4..
336
Asymptotic Approximations
Chapter 5
where PlI(X 1, ... ,Xn ,8) is the joint density of Xl, ... ,Xn . Fort: small. n fixed, this is approximately the same as rejecting for large values of a~o logPn(X 11 • • • 1 X n ) eo).
,
• ,
The preceding argument doesn't depend on the fact that Xl" .. , X n are i.i.d. with common density or frequency function p{x, 8) and the test that rejects H for large values of a~o log Pn (XI, ... ,Xn , eo) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming
"Reject H iff
t
i=I
iJ~
: ,
logp(X i ,90) > Tn (a,9 0)."
0
I I
(5.4.55)
It is easy to see (Problem 5.4.15) that
Tn(a, 90 ) = Z10 VnI(90) + o(n I/2 )
and that again if G (Xl, ... , X n ) is the critical function of the Rao test then
nn
,
•
.,
,•
Po,+' WRn (X t, ... ,Xn ) = Ow n(X t , ... ,Xn)J ~ 1, rn
(5.4.56)
(Problem 5.4.8) and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(90 ), which d' may require numerical integration, can be replaced by _n l d021n(Bn) (Problem 5.4.10).
5.4,5
Confidence Bounds
Q
We define an asymptotic Levell that
lower confidence bound (LCB) On by the requirement
(5.4.57)
r I I i ,
., '
.'
for all () and similarly define asymptotic level!  a DeBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:
(i) By using a natural pivot.
, f.
(
.,
(ii) By inverting the testing regions derived in Section 5.4.4.
,. "',
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (AO)(A6), (A4'), and I(9) finite for all it follows (Problem 5.4.9) that
e.
Co(
V n  e)) ~ N(o, 1) nI(lin)(li
Z1a/VnI(lin ).
(5.4.58)
for all () and, hence. an asymptotic level!  a lower confidence bound is given by
9~ = lin e~,
(5.4.59)
Turning tto method (ii), inversion of 8Wn gives fonnally
= inf{9: en(a, 9) > 9n }
(5.4.60)
,
=
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
337
or if we use the approximation C (0, e) ~ n
e+ zlQ/vnI(iJ), (5,4,41),
e) > en}'
~~, = inf{e , Cn(C>,

(5,4,61)
In fact neither e~I' or e~2 properly inverts the tests unless cn(Q, e) and Cn (Q, e) are increasing in The three bounds are different as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, e~l is preferable because this bound is not only approximately but genuinely level 1 Q. But computationally it is often hard to implement because cn(Q, 0) needs, in general, to be computed by simulation for a grid of values. Typically, (5.4.59) or some equivalent alternatives (Problem 5,4,10) are preferred but Can be quite inadequate (Problem 5,4,1 I), These bounds e~, O~I' e~2' are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5,4,12 and 5,4,13),
e.
e
Summary. We have defined asymptotic optimality for estimates in oneparameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of in a onedimensional subfamily of the multinomial distributions, showed that the MLE fonnally achieves this bound, and made the latter result sharp in the context of oneparameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M estimates, generalizations of the MLE, along the lines of Huber (1967), The asymptotic formulae we derived are applied to the MLE both under the mooel that led to it and tmder an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980),
e
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as n t 00 in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4, i.i.d. observations from a regular madel in which is open C R or = {e 1 , , , , , e,} finite, and e is identifiable, Most of the questions we address and answer are under the assumption that fJ = 0, an arbitrary specified value, or in frequentist tenns, that 8 is true.
e
e
Consistency The first natural question is whether the Bayes posterior distribution as n + 00 concentrates all mass more and more tightly around B. Intuitively this means that the data that are coming from Po eventually wipe out any prior belief that parameter values not close to are likely, Formalizing this statement about the posterior distribution, II(· I X It •.• , X n ), which is a functionvalued statistic, is somewhat subtle in general. But for = {O l ,' .. , Ok} it is
e
e
i
338
straightforward. Let
Asymptotic Approximations Chapter 5
I.
11(8 i XI,···, Xn)
Then we say that II(·
=PIO = 8 I Xl,···, Xn].
(5.5.1)
e, P,li"(8 I Xl, ... , Xn)  11 > ,] ~ 0 for all f. > O. There is a slightly stronger definition: rIC I XI,' .. ,Xn ) is a.S. iff for all 8 E e, 11(8 I Xl, ... , Xn) ~ 1 a.s. P,.
is consistent iff for all 8 E
General a.s. consistency is not hard to formulate:
I Xl, ... , Xn)
(5.5.2)
consistent
(5.5.3)
11(· I X), ... , Xn)
=}
OJ'} a.s. P,
(5.5.4)
where::::} denotes convergence in law and <5{O} is point mass at satisfactory result for finite.
e
e.
There is a completely
, ,
,
Theorem 5.5.1. Let 1rj  p[e = Bj ], j = 1, ... 1 k denote the prior distribution 0/8. Then II(· I Xl, ... ,Xn ) is consistent (a.s. consistent) iff 7fj > afor j = I, ... , k.
Proof. Let p(., B) denote the frequency or limit j function of X. The necessity of the condition is immediate because 1["] = 0 for some j implies that 1f(Bj I Xl, ... ,Xn ) = 0 for all Xl, .. . , X n because, by (1.2.8),
.,
,J,
11(8j
I Xl, ... ,Xn)
PIO = 8j I Xl, ... ,Xn ] 11j Ir~l p(Xi , 8il
L:.~l 11. ni~l p(Xi , 8.)
k
n .
(5.5.5)
,
,
, ,,
Intuitively, no amount of data can convince a Bayesian who has decided a priori that OJ is impossible. On the other hand, suppose all 71" j are positive. If the true is (J j or equivalently 8 = (J j, then
e
log
11(8.IXl, ... ,Xn) =n 11(8 j I X), ... ,Xn)
(11og+ L.. og P(Xi,8.)) . 11. 1{f.,1 n
7fj
n i~l
p(Xi ,8j)
,
By the weak (respectively strong) LLN, under POi'
.,i
1{f.,log p(Xi ,8.)  Lni~l
p(Xi ,8j )
+
E
OJ
(I
I
P(XI ,8.)) og p(X I ,8j )
i
I
in probability (respectively a.s.). But Eo;
(log: ~:::;))
~
< 0, by Shannon's inequality, if
Ba
• .,
=I=
Bj' Therefore,
11(8.IXI, ... ,Xn) 1 og 11(8j I X), ... , X n )
00
,
in the appropriate sense, and the theorem follows.
o
i ~,
h
_
Section 55
Asymptotic Behavior and Optimality of the Posterior Distribution
339
e
Remark 5.5.1. We have proved more than is stated. Namely. that for each I XI, . .. ,Xn ]  a exponentially.
e E e. Po[O =l0
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution
Under conditions AOA6 for p(x, B) that if B is the MLE,

= lex, Bj
=logp(x, B), we showed in Section 5.4
(5.5.6)
Ca(y'n(e  B» ~ N(O,rl(B)).
Consider C( ..;ii((}  B) I Xl, ... , X n ), the posterior probability distribution of y'n((} B( Xl, ... , X n )), where we emphasize that (j depends only on the data and is a constant given XI, ... , X n . For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in Po probability by convergence a.s. p•. We also add,



A7: For all (), and all 0> o there exists t(o,(})
p. [sup
> 0 such that
{~ t[I(Xi,B') /(Xi,B)]: 18'  BI > /j} < '(0, B)] ~ I.
e such that 1r(') is continuous and positive
AS: The prior distribution has a density 1f(') On at all B. Remarkably,
Theorem 55.2 (UBernsteinlvon Mises"). If conditions ADA3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold. then
C(y'n((}(})
I X1, ... ,Xn )
~N(O,l
1
(B»)
(5.5.7)
a.s. under P%ralle.
We can rewrite (5.5.7) more usefully as
sup IP[y'n((}  e) < x I Xl, ... , X n]  of>(xVI(B»)j ~ 0
x
(5.5.8)
for all a.s. Po and, of course, the statement holds for our usual and weaker convergence in Po probability also. From this restatement we obtain the important corollary. Corollary 5.5.1. Under the conditions of Theorem 5.5.2,
e
sup IP[y'n(O  e) < x j Xl, ... , XnJ  of>(xVl(e)1
x
~0
(5.5.9)
a.s. P%r all B.
1 , ,
340
Asymptotic Approximations Chapter 5
Remarks
(I) Statements (5,5.4) and (5,5,7)(5,5,9) are, in fact, frequentist statements about the
asymptotic behavior of certain functionvalued statistics.
(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in P, probability if A4 and
A5 are used rather than their strong formssee Problem 5.5.7.
(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and
identifiability guarantees consistency of Bin a regular model.

Proof We compute the posterior density of .,fii(O  B) as
(5.5.10)
where en = en(X!, . .. ,Xn) is given by

Divide top and bottom of (5.5.10) by
;,'
II7
1 p(Xi ,
B) to obtain
(5.5.11)

where l(x,B)
= 10gp(x,B) and
,
,
We claim that
for all B. To establish this note that (a) sup { 11" + 1I"(B) : ItI < M} tent and 1T' is continuous. (b) Expanding, (5.5.13)
(e In) 
~ 0 a.s.
for all M because
eis a.s. consis
I
I
1
i
1 ! , ,
I
p
J
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
341
where
Ie  Bit)1 < )n.
We use I:~
1
g~ (Xi, e) ~ 0 here. By A4(a.s.), A5(a.s.),
1
n
sup { n~[}B,(Xi,B'(t))n~[}B,(Xi,B):ltl<M
In [}'I
[}'l
}
~O,
for all M, a.s. Po. Using (5.5.13). the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),
Po
[dnqn(t)~1f(B)exp{Eo:;:(Xl,B)~}
forallt] =1.
(5.5.14)
Using A6 we obtain (5.5.12).
Now consider
dn =
I:
r
+y'n
1f(e+
;")exp{~I(Xi,9+;,,) 1(Xi,e)}ds
(5.5.15)
dnqn(s)ds
J1:;I<o,fii
J
1f(t) exp
{~(l(Xi' t) 1(X
i,
9)) } l(lt 
el > o)dt
By AS and A7,
Po [sup { exp
{~(l(Xi,t) 1(Xi , e») } : It  el > 0} < e"'("O)] ~ 1
(5.5.16)
for all 0 so that the second teon in (5.5.14) is bounded by y'ne"'("O) ~ 0 a.s. Po for all 0> O. Finally note that (Problem 5.5.4) by arguing as for (5.5.14), tbere exists o(B) > 0 such that
Po [dnqn(t) < 21f(8) exp {~ Eo (:;: (Xl, B))
By (5.5.15) and (5.5.16), for all 0
~}
for all It 1 < 0(8)y'n]
~ I.
(5.5.17)
> 0,
(5.5.18)
Po [dn 
r dnqn(s)ds ~ 0] = I. J1:;I<o,fii
exp {_ 8'I(B)} ds
2
Finally, apply the dominated convergence theorem, Theorem B.7.5, to dnqn(sl(lsl < 0(8)y'n)), using (5.5.14) and (5.5.17) to conclude that, a.s. Po,
d ~ 1f(B)
n
r= L=
= 1f(8)v'21i'.
JI(B)
(5.5.19)
,
I
342
Hence, a.S. Po,
Asymptot'lc Approximations Chapter 5
qn(t) ~ V1(e)<p(tvI(e))
where r.p is the standard Gaussian density and the theorem follows from Scheffe's Theorem B.7.6 and Proposition B.7.2. 0 Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N{ (), ( 2 ) distribution with a 2 known and we put aN ('TJ, 7 2 ) prior on 8. Then the posterior
distribution of8 isN(Wln7J!W2nX,
(~I r12)1) where
,,2
W2n
.,
I
• •
WIn
= nT 2 +U2'
= !WIn
(5.5.20)
,
'"
,
r
, .,,
I
Evidently, as n + 00, WIn + 0, X + 8, a.s., if () = e, and (~I T\) 1 + O. That is, the posterior distribution has mean approximately (j and variance approximately 0, for n large, or equivalently the posterior is close to point mass at as we vn(O  9) has posterior distribution expect from Theorem 5.5.1. Because 9 =
;
N ( .,!nw1n(ry  X), n (~+
;'»
1).
x,
e
Now, vnW1n
=
O(n 1/ 2) ~ 0(1) and
0
n (;i + ~ ) 1
rem 5.5.2.
+ (12 =
II (8) and we have directly established the conclusion of Theo
I ,
I
I
Example 5.5.2. Posterior Behavior in the Binomial~Beta Model. (Example 3.2.3 continued). If we observe Sn with a binomial, B(n, 8), distribution, or equivalently we observe X" ... , X n Li.d. Bernoulli (I, e) and put a beta, (3(r, s) prior on e, then, as in Example 3.2.3, (J has posterior (3(8n +r, n+s  8 n ). We have shown in Problem 5.3.20 that if Ua,b has a f3(a, b) distribution, then as a + 00, b + 00,
I
If 0
a) £ (a+b)3]1( Ua,b a+b ~N(o,I). [ ab
< B<
(5.5.21)
j I ,
i j
1 is true, Sn/n ~. () so that Sn + r + 00, n + s  Sn + 00 a.s. Po. By identifying a with Sn + r and b with n + s  Sn we conclude after some algebra that because 9 = X,
vn((J  X)!:' N(O,e(l e))
a.s. Po, as claimed by Theorem 5.5.2.
o
Bayesian optimality of optimal frequentist procedures and frequentist optimality of
Bayesian procedures
•
Theorem 5.5.2 has two surprising consequences. (a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.
,
1 ,
I
I
t
hz _
I
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
343
(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions. As an example of this phenomenon consider the following.
~
Theorem 5.~.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let B be the MLE ofB and let B* be the median ofthe posterior distribution ofB. Then
(i)
(5.5.22)
a.s. Pe for all
e. Consequently,
~, _ I ~ 1 az. I (e)ae(X" e) +op,(n 1/2 ) e  e+ n L.
l=l
(5.5.23)
and LO( .,fii(rr  e)) ~ N(o, rl(e)).
(ii)
(5.5.24)
E( .,fii(111 
el11111 Xl,'"
,Xn) = mjn E(.,fii(111  dl 
1(11) 1Xl.··· ,Xn) + op(I).
(5.5.25)
Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions In (e, d) = .,fii(18 dlIell· But the Bayes estimatesforl n and forl(e, d) = 18dl must agree whenever E(11111 Xl, ... , Xn) < 00. (Note that if E(1111 I Xl, ... , X n ) = 00, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.l4.22)
sup IP[.,fii(O  e)
< x I Xl,'" ,Xu) 1>(xy'""'I(""'e))1 ~
Oa.s. Po.
(5.5.26)
But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.1 I). Thus, any median of the posterior distribution of .,fii(11  e) tends to 0, the median of N(O, II (~)), a.s. Po. But the median of the posterior of .,fii(0  (1) is .,fii(e'  e), and (5.5.22) follows. To prove (5.5.24) note that
~ ~ ~
and, hence, that
E(.,fii(IOelll1e'l) IXl, .. ·,Xn) < .,fiile
e'l ~O
(5.5.27)
a.s. Po, for all B. Because a.s. convergence Po for all B implies. a.s. convergence P (B.?). claim (5.5.24) follows and, hence,
E( .,fii(10 
01 101) I h ... , Xn)
= E( .,fii(10 
0'1  101) I X" ... , X n ) + op(I).
(5.5.28)
344
~
Asymptotic Approximations
Chapter 5
Because by Problem 1.4.7 and Proposition 3.2.1, B* is the Bayes estimate for In(e,d), (5.5.25) and the theorem follows. 0 Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
Bayes credible regions
~
There is another result illustrating that the frequentist inferential procedures based on f) agree with Bayesian procedures to first order.
Theorem 5.5.4. Suppose the conditions afTheorem 5.5.2 are satisfied. Let
where en is chosen so that 1l"(Cn I Xl, ... ,Xn) = 1  0', be the Bayes credible region defined in Section 4.7. Let Inh) be the asymptotically level 1  'Y optimal interval based on B, given by
~
where dn(y)
i •
= z (! !)
JI~). ThenJorevery€ >
0, 0,
~ 1.
P.lIn(a + €) C Cn(X1 , .•. ,Xn ) C In(a  €)J
(5.5.29)
I
I
I
The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of Jii( IJ  0) converges to the N(O,Il (0)) density nnifonnly over compact neighborhoods of 0 for each fixed 0, is sketched in Problem 5.5.6. The message of the theorem should be clear. Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis both in this case and in estimation reveals that any approximations to Bayes procedures on a scale finer than n 1j2 do involve the prior. A particular choice, the Jeffrey's prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n 1 order (see Schervisch, 1995).
Thsting
! ,
•
Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of 00 given X I, ... ,Xn if H is false is of a different magnitude than the pvalue for the same data. For more on this socalled Lindley paradox see Berger (1985) and Schervisch (1995). However, if instead of considering hypothesis specifying one points 00 we consider indifference regions where H specifies [00 + D.) or (00  D., 00 + D.), then Bayes and freqnentist testing procedures agree in the limit. See Problem 5.5.2. Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a prior possible. Second. we established
i
! I
I b
II
!
_
j
TI
Section 5.6
Problems and Complements
345
the socalled Bernsteinvon Mises theorem actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the socalled Bernstein~von Mises theorem and frequentist contjdence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose Xl, ... , X n are i.i.d. as X ous case density.
rv
F, where F has median F 1 (4) and a continu
(a) Show that, if n
= 2k + 1,
, Xn)
EFmed(X),
n (
_
~
)
l'
k (1  t)kdt F' (t)t
EF
med (X"
2
,Xn )
n( 2;) [1P'(t)f tk (1t)k dt
= 1, 3,
(b) Suppose F is unifonn, U(O, 1). Find the MSE of the sample median for n and 5.
2. Suppose Z ~ N(I', 1) and V is independent of Z with distribution X;'. Then T
Z/
=
(~)!
is said to have a noncentral t distribution with noncentrality J1 and m degrees
of freedom. See Section 4.9.2. (a) Show that
where fm(w) is the x~ density, and <P is the nonnal distribution function.
(b) If X" ... ,Xn are i.i.d.N(I',<T2 ) show that y'nX /
(,.~, L:(Xi
_X)2)! has a
noncentral t distribution with noncentrality parameter .fiiJ1/IT and n  1 degrees of freedom. (c) Show that T 2 in (a) has a noncentral :FI,m distribution with noncentrality parameter J12. Deduce that the density of T is
p(t)
= 2L
i=O
00
P[R = iJ . hi+,(f)[<p(t 1')1(t > 0)
+ <p(t + 1')1(t < 0)1
where R is given in Problem B.3.12.
, " I I. ,
346
Hint: Condition on
Asymptotic Approximations
Chapter 5
ITI.
1, then Var(X) < 1 with equality iff X
3. Show that if P[lXI < 11
±1 with
probability! .
Hint: Var(X) < EX 2
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and f. through ..jiif..
(a) Show that the ratio of the Hoeffding function h( VilE) to the Chebychev function e( Jii€) tends to as Jii€ ~ 00 so that he) is arbitrarily better than en in the tails.
°
I,.,'
i~
(b) Show that the normal approximation 24> (
V;€)  1 gives lower results than h in
00.
,
the tails if P[lXI < 1] = 1 because, if ,,2 < 1. 1  <p(t)  <p(t)lt as t ~ Note: Hoeffding (1963) exhibits better bounds for known a 2 .
R has .\(0) ~ 0, is bounded, and has a hounded second derivative .\n. Show that if Xl, ... , X n are i.i.d., EX l = f.L and Var Xl = 02 < 00, then
5. Suppose.\ : R
~
E.\(X 
1') = .\'(0)
;'/!; +
0
(~)
as n >
00.
= E>.'(O)JiiIX  1'1 + E (";' (X  I')(X I'?) where IX  1'1 < Ix  1'1· The last term is < suPx I>." (x) 1,,2 In and the first tends to
Hint: JiiE(.\(IX 1,1) .\(0))
I
",
1! "
.\'(0)" f== Izl<p(z)dz by Remark B.7. 1(2).
Problems for Section 5,2
1. Using the notation of Theorern 5.2.1, show that
"
2. Let X" ... ,Xn be ij,d. N(I',,,2), Show that for all n
Sup p(•• u) [IX
u
>
1, all €
>
°
/,1 > ,] ~ 1.
Hint: Let (J
)
00.
•
3, Establish (5.2.5). Hint: Iiln  q(p)1
> € =} IPn  pi > w (€).
l
J
4. Let (Ui , V;), 1 < i < n, be i.i.d.  PEP.
(a) Let y(P)
= PIU, > 0, V, > OJ. Show that if P = N(O, 0,1,1, p), then
p
~ sin21l' (Y(P)  ~).
eo) : e' E S(O. 5_0  lim Eo. .2. .14)(i). then is a consistent estimate of p.eol > . Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent.sup{lp(X.p(X.l)2 _ + tTl = o:J. t e.(eo)} < 00. Hint: From continuity of p..p(X. 7. sup{lp(X. n Lt= ". N (/J. e') .n 1 (XiJ. (c) Suppose p{P) is defined generally as Covp (U.. i (e))} > <.8) fails even in this simplest case in which X ~ /J is clear. Show that the maximum contrast estimate 8 is consistent.~ . and < > 0 there is o{0) > 0 such that e.2. Or of sphere centers such that K Now inf n {e: Ie .p(Xi . Hint: K can be taken as [A. inf{p(X. Suppose Xl. (Wald) Suppose e ~ p( X. A]. (i).2 rII.\} c U s(ejJ(e j=l T J) {~ t n . ". Eo.\} .e)1 : e' E 5'(e.lO)2)) . eo) : Ie  : Ie .. VarpUVarpV > O}. Hint: sup 1. . (1 (J. e) is continuous. Prove that (5. 5.(ii) holds. e') . e E Rand (i) For some «eo) >0 Eo.6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution.d. for each e eo.eol > . (Ii) Show that condition (5.i. 6. 05) where ao is known.lJ. {p(X" 0) . . V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00. eo)} : e E K n {I : 1 IB . . and the dominated convergence theorem. By compactness there is a finite number 8 1 .2.•. where A is an arbitrary positive and finite constant. (a) Show that condition (5.o)} =0 where S( 0) is the 0 ball about Therefore. inf{p(X.Xn are i.e'l < . e) .e')p(X. e)1 (ii) Eo.p(X.Section 5.14)(i) add (ii) suffice for consistency. eol > A} > 0 for some A < 00. by the basic property of maximum contrast estimates.
i . with the same distribution as Xl.. Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K.. 1 + . Problems for Section 5. and if CI.)I' < MjE [L:~ 1 (Xi .3.11). Hint: Taylor expand and note that if i .I'lj < EIX . and let X' = n1EX.X.Xn but independent of them. then by Jensen's inequality. Establish (5.1'. in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x. Hint: See part (a) of the proof of Lemma 5. Extend the result of Problem 7 to the case () E RP. and take the values ±1 with probability ~. . For r fixed apply the law of large numbers.. . n. Establish (5. (72). L:!Xi . J (iii) Condition on IXi  X. . < Mjn~ E (. The condition of Problem 7(ii) can also fail.d.1.3.O'l n i=l p(X"Oo)}: B' E S(Bj.i.L.l m < Cm k=I n. < > OJ. Show that the log likelihood tends to 00 as a + 0 and the condition fails.d. = 1.O(BJll}. en are constants.3. Then EIX . . • • (i) Suppose Xf. ~ .) 2] .. .)2]' < t Mjn~ E [. 1/(J. P > 1.d.xW) < I'lj· 3.1. < < 17 < 1/<. 2."d' k=l d < mm+1 L d ElY. 8. L:~ 1(Xi . .. . i .3) for j odd as follows: . (ii) If ti are ij.348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi. ..X'i j .i.. II I ..X.. 9. .. Establish Theorem 5. + id = m E II (Y. 4. .3. Let X~ be i. for some constants M j .3 I. Compact sets J( can be taken of the form {II'I < A.X~ are i. .[.2. ...k / 2 . . ..3..3. I 10. Establish (5.9) in the exponential model of Example 5. li .
i. Establish 5. . L~=1 i j = m then m a~l. i. LetX1"". X. (b) Show that when F and G are nonnal as in part (a).1)'2:7' .d. .(l'i .i..1) +E{IX. )=1 5.. ~ xl". I < liLl}P(IXd < liLl)· 7.. s~ ~ (n2 .a~d < [max(al"" .m 00. 2 = Var[ ( ) /0'1 ]2 .l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck. Show that under the assumptions of part (c).Xn1 bei. theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~. R valued with EX 1 = O..d. Show that if m = .pli ~ E{IX.ad)]m < m L aj j=1 m <mmILaj. . j = 1.c> . 1 < j < m. .Fk. ~ iLli I IX.m be Ck.I. 0'1 = Var(Xd· Xl /LI Ck. ~ iLl' I IXli > liLl}P(lXd > 11.6 Problems and Complements 349 Suppose ad > 0.d. .m distribution with k = nl . FandYi""'Yn2 bei./LI = E(Xt}. Let XI.andsupposetheX'sandY's are independent.m) Then. Show that ~ sup{ IE( X.m with K.aD.Tn) + 1 . where s1 = (n. ..Section 5. if a < EXr < 00. Show that if EIXlii < 00.) I : i" . km K.i.I) Hint: By the iterated expectation theorem EIX.r'j". j > 2. replaced by its method of moments estimate.\ > 0 and = 1 + JI«k+m) ZIa. ~ iLli < 2 i EIX... 6. P(sll s~ < ~ 1~ a as k ~ 00.28. 1)'2:7' . n} EIX. .\k for some .3. G. PH(st! s~ < Ck. (d) Let Ck.). then EIX.. under H : Var(Xtl ~ Var(Y.a as k t 00. 8.1 and 'In = n2 .. .(X."" X n be i. (a) Show that if F and G are N(iL"af) and N(iL2. respectively. then (sVaf)/(s~/a~) has an ..
Wnte (l 1pP .. . (a) jn(X . if p i' 0. then jn(r' . .(I_A)a2). as N 2 t Show that. (b) Use the delta method for the multivariate case and note (b opt  b)(U .. . ! .l EY? . (b) If (X.1 that we use Til. = ("t .m with KI and K2 replaced by their method of moment estimates..x) ~ N(O. (I _ P')'). (a) If 1'1 = 1'2 ~ 0.". if 0 < EX~ < 00 and 0 < Eyl8 < 00.1 · /. show that i' ~ .Ii > 0. . .. "i.4. ' . jn((C . In particular. p). .l EXiYi .m) + 1 . (iJi .d.350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g.1. N where the t:i are i. . t N }. Instead assume that 0 < Var(Y?) < x.." •.p) ~ N(O. . 7 (1 .a as k .XC) wHere XC = N~n Et n+l Xi.p. Tin' which we have sampled at random ~ E~ 1 Xi. .i.1 EX.1JT.I' ill "rI' and "2 = Var[( X. (cj Show that if p = 0.2< 7'.XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters. 0 < .3.xd. .m (0 Let 9. . (iJl.p) ~ N(O. Show that under the assumptions of part (e). suppose i j = j. • 10. TN i.+. from {t I. "1 = "2 = I. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3.XN} or {(uI..d. i = I. Po." . Et:i = 0.2" ( 11. . + I+p" ' ) I I I II._ J . Consider as estimates e = ') (I X = x.t .xd. Hint: Use the centra] limit theorem and Slutsky's theorem. and if EX. i = 1.) =a 2 < 00 and Var(U) Hint: (a) X .. . H En. In survey sampling the modelbased approach postulates that the population {Xl.".1'2)1"2)' such that PH(SI!S~ < Qk.\ < 1. n.. 4p' (I .p')') and.m (depending on ~'l ~ Var[ (X I .iL) • . . JLl = JL2 = 0. err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5. use a normal approximation to tind an approximate critical value qk.3.I))T has the same asymptotic distribution as n~ [n. . Show that jn(XRx) ~N(0. .. ~ ] ...+xn n when T.lIl) + 1 . .. then jn(r .~) (X ..N. n. Without loss of generality.I. = (1. then P(sUs~ < qk. 1). . iik..Il2. . Without loss of generality. jn(r .1. there exists T I . Y) ~ N(I'I. In Example 5. = = 1.a: as k + 00.. . suppose in the context of Example 3. if ~ then .i.. .6. . i (b) Suppose Po is such that X I = bUi+t:i. • . Var(€. be qk.A») where 7 2 = Var(XIl. () E such that T i = ti where ti = (ui. that is. Under the assumptions of part (c).I).00. • op(n').. < 00 (in the supennodel).p') ~ N(O. to estimate Ii 1 < j < n.4. .p).6.\ 00. (UN.
16..3.6) to explain why.90. This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl)...E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d.3' Justify fonnally j1" variance (72.: ..13(c).i'c) I < Mn'. Hint: If U is independent of (V.14 to explain the numerical results of Problem 5. X n is a sample from a population with mean third central moment j1. W). (h) Deduce fonnula (5. .14). The following approximation to the distribution of Sn (due to Wilson and Hilferty. Suppose XI. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x . . then E(UVW) = O. X n are independent.3. Here x q denotes the qth quantile of the distribution. . X = XO.s..y'ri has approximately aN (0.3. Suppose X 1l .3. 1931) is found to be excellent Use (5.3. . Show that IE(Y. X. Let Sn have a X~ distribution.99. (a) Show that if n is large.3 < 00.i'b)(Yc .. It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes. E(WV) < 00.25. (a) Suppose that Ely. 14. (a) Show that the only transformations h that make E[h(X) .. . 15. = 0..6 Problems and Complements 351 12.. each with HardyWeinberg frequency function f given by .10. n = 5. (a) Use this fact and Problem 5. and Hint: Use (5. (h) Use (a) to justify the approximation 17. Suppose X I I • • • l X n is a sample from a peA) distrihution..Section 5. !) distribution.n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO.. Normalizing Transformation for the Poisson Distribution.12). EU 13. is known as Fisher's approximation... X~. (b) Let Sn .i'a)(Yb .
. Yn) is a sample from a bivariate population with E(X) ~ 1'1. Justify fonnally the following expressions for the moments of h(X. ° . which are integers. Yn are < < . Let then have a beta distribution with parameters m and n.y).h(l'l. Y) where (X" Yi). i 21. . Let X I. Show directly using Problem B.• • • ~{[hl (1".. (1'1. h(l) = I. 1'2)(X . is given by h(t) = (2/1r) sin' (Yt).a tends to zero at the rate I/(m + n)2. Cov(X. Bm.1') + X 2 .n  m/(m + n») < x] 1 > 'li(x). . E(Y) = 1'2..)? 18.n = (mX/nY)i1 independent standard eXJX>nentials. [vm +  n (Bm.n P .. 1'2)(Y  + O( n '). 172 + [h 2(1'l. if m/(m + n) . • j b _ . + (mX/nY)] where Xl. Var(X) = 171. Show that ifm and n are both tending to oc in such a way that m/(m + n) > a. . (1'1.y) = a ayh(x.2. V)) '" . . (Xn .. v'a(l. X n be the indicators of n binomial trials with probability of success B. 1'2)h2(1'1. where I' = (a) Find an approximation to P[X < E(X.. (e) What is the approximate distribution of Vr'( X . 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t.a) E(Bmn ) = . 1'2)]2C7n + O(n 2) i . where I I h. Y) .1'2)]2 177 n +2h. Var(Y) = C7~.(x. 1'2) Hint: h(X.. 0< a < I. 20. (b) Find an approximation to P[JX' < t] in terms of 0 and t.. Y) ~ p!7IC72. 1'2)pC7.5 that under the conditions of the previous problem. . (a) (b) Var(h(X. and h'(t) > for all t. Var Bmnm + n +Rmn = 'm+n ' ' where Rm.. I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1.n tends to zero at the rate I/(m + n)2.y).a) Hmt: Use Bm. 1'2) = h.. Variance Stabilizing Transfo111U1tion for the Binomial Distribution. Show that the only variance stabilizing transformation h such that h(O) = 0.y) = a axh(x. h 2 (x. then I m a(l.Xm . .I'll + h2(1'1. Y" . 19.
Suppose that Xl. n[X(I .l = ~.J1. .2) using the delta method.X2 . _. xi. 1 X n be a sample from a population with mean J. . 24.)] ~ as ~h(2)(J1.3 1/6n2 + M(J1.6 Problems and Complements 353 22.J1. (a) When J1.) 23..(1. and that h(1)(J.2V with V "' Give an approximation to the distribution of X (1 .J1.Section 5..) + ~h(2)(J1..h(J1.. (e) Fiud the limiting laws of y'n(X . = ~.). X n is a sample from a population and that h is a realvalued function 01 X whose derivatives of order k are denoted by h(k). find the asymptotic distribution of y'n(T.)] is asymptotically distributed (b) Use part (a) to show that when J1.l and variance (J2 < Suppose h has a second derivative h<2) continuous at IJ. Usc Stirling's approximation and Problem 8.J1. .2.4 to give a directjustification of where R n / yin 0 as in n Recall Stirling's approximation: + + 00. (a) Show that y'n[h(X) . (It may be shown but is not required that [y'nR" I is bounded. ° while n[h(X h(J1.. k > I.2V where V ~ 00. xi 25. ()'2 = Var(X) < 00.)1J1.l4 is finite. Let Sn "' X~. < t) = p(Vi < (b) When J1. = 0.2)/24n2 Hint: Therefore.4 + 3o. Let Xl.X) in tenns of the distribution function when J. J1.l) = o. Show that Eh(X) 3 h(J1. Compare your answer to the answer in part (a).J1. XI. find the asymptotic distrintion of nT nsing P(nT y'nX < Vi). .)] !:.)2.) + ": + Rn where IRnl < M 1(J1..)2 and n(X . Suppose IM41 (x)1 < M for all x and some constant M and suppose that J. Let Xl>"" X n be a sample Irom a population with and let T = X2 be an estimate of 112 . = E(X) "f 0.
. < tJ ~ <P(t[(. (d) Make a comparison of the asymptotic length of (4. 1 Xn. Suppose Xl. " Yn as separate samples and use the twosample t intervals (4. Yn2 are as in Section 4. then limn PIt.\)al + .9. b l .I I (a) Show that T has asymptotically a standard Donnal distribution as and I I . (a) Show that P[T(t. 2pawz)z(1  > 0 and In is given by (4.1/ S D where D and S D are as in Section 4. 'I :I ! t.. . . . E In] > 1 . Suppose (Xl. so that ntln ~ .3) has asymptotic probability of coverage < 1 . n2 + 00. cyi .4.\ probability of coverage. I . .t. Hint: (a).) (b) Deduce that if.3).3 independent samples with III = E(XIl. (c) Apply Slutsky's theorem. the interval (4.2a 4 ).9. al = Var(XIl. Viillnl ~ 2vul + a~z(l . Suppose nl + 00 . = = az. whereas the situation is reversed if the sample size inequalities and variance inequalities agree. .\. Yd. Show that if Xl" .3.o.a. 27. ..Xi we treat Xl.) of Example 4.3).t.4.\. .33) and Theorem 5. . 0 < . and a~ = Var(Y1 ).') < 00. 1 X n1 and Y1 .9. XII are ij. . p) distribution..(j ' a ) N(O. (c) Show that if IInl is the length of the interval In. Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X . I i t I ./". What happens? Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution.t2.3. Eo) where Eo = diag(a'.. We want to study the behavior of the twosample pivot T(Li.JTi times the length of the onesample t interval based on the differences.9. .. a~. that E(Xt) < 00 and E(Y. . Y1 .(~':~fJ').2 .3.)/SD where D.3) and the intervals based on the pivot ID .) < (b) Deduce that if p tl ~ <P (t [1. n2 + 00..9. .3) have correct asymptotic (c) Show that if a~ > al and .a.9. .Jli = ~. Hint: Use (5. . .\ > 1 . . Assume that the observations have a common N(jtl.\ul + (1 ~ or al . 29. Let n _ 00 and (a) Show that P[T(t.4. if n}.. the intervals (4.9. /"' = E(YIl.~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of .}. and SD are as defined in Section 4..d..•. (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment.\a'4)I'). We want to obtain confidence intervals on J1.9.\ < 1. Let T = (D . Suppose that instead of using the onesample t intervals based on the differences Vi .9. 28.\)a'4)/(1 . N(Pl (}2).354 26. .
I) + y'i«n .9 and 4. then lim infn Var(Tn ) > Var(T). vectors and EIY 1 1k < 00. Tn .xdf and Ixl is Euclidean distance. U( 1.I k and k only.z = Var(X).1)1 L." .d. (a) Show that the MLEs of I'i and (). (). I< = Var[(X 1')/()'jZ.z are Iii = k.:~l IXjl. Show that as n t 00. . . . I is the Euclidean norm... < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X. .1.16. X. but Var(Tn ) + 00..6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4. Vn = (n .1). EIY. Carry out a Monte Carlo study such as the one that led to Figure 5.p.3.3... Hint: See Problems B.~ t (Xi . Let .£.. j = I.£.3 by showing that if Y 1. as X ~ F and let I' = E(X). (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l . then for all integers k: where C depends on d. Plot your results.a). has asymptotic level a. Let X ij (i = I. Hint: If Ixll ~ L. ().3. k) be independent with Xi} ~ N(l'i. where I. . and let Va ~ (n .. then P(Vn < va<) t a as 11 t 00.. Let XI. T and where X is unifonn.a) is the critical value using the Welch approximation.8Z = (n .Z).1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" . Show that k ~ ex:: as 111 ~ 00 and 112 t 00. .X)" Then by Theorem B. then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31. 32.9. (d) Find or write a computer program that carries out the Welch test.. It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. x = (XI. Show that if 0 < EX 8 < 00._I' Find approximations to P(Yn and P{Vn < XnI (I . X n be i. Y nERd are Li.I)z(a).I)sz/(). (c) Let R be the method of moment estimaie of K.d.3.3 using the Welch test based on T rather than the twosample t test based on Sn. (a) SupposeE(X 4 ) < 00.4. 33.4. . where tk(1.z has a X~I distribution when F is theN(fJl (72) distribution. . 30. Generalize Lemma 5.Section 5.a)) and evaluate the approximations when F is 7. .i.
Xn be i.p( 00) as 0 ~ 00.)).1)cr'/k... Assmne lbat '\'(0) < 0 exists and lbat . I ! 1 i .n7(0) ~[..On) < 0). Show lbat if I(x) ~ (d) Assume part (c) and A6. . n 1 F' (x) exists. .0) ~ N Hint: P( .p(X.) Hint: Show that Ep1/J(X1  8) is nonincreasing in B. Show that.. [N(O))' . I ..O(P)) > 0 > Ep!/!(X.. .0) !:. Let Xl. where P is the empirical distribution of Xl.. O(P) is lbe unique solution of Ep. . . is continuous and lbat 1(0) = F'(O) exists.. (c) Assume the conditions in (a) and (b)... then < t) = P(O < On) = P (. random variables distributed according to PEP. Let i !' X denote the sample median..0) !.. Let On = B(P).p(Xi . (b) Show that if k is fixed and p ~ 00. . .. (Use .. .. 1 X n.p(X. ! Problems for Section 5. F(x) of X. . Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique.n for t E R. I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 .i.. 1948)..p(x) (b) Suppose lbat for all PEP.. Use the bounded convergence lbeorem applied to . 1/). I i .On) ...f. N(O. I(X.(0) Varp!/!(X.0).. 1/4/'(0)). N(O) = Cov(. .L~. .p(X.p(X. _ .. Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) . 1) for every sequence {On} wilb On 1 n = 0 + t/.0))  7'(0)) 0. (a) Show that (i). \ . then 8' !.\(On)] !:. Show that On is consistent for B(P) over P. (ii). under the conditions of (c). and (iii) imply that O(P) defined (not uniquely) by Ep. aliI/ > O(P) is finite.p(X.nCOn .d..n(X . Set (. . . . 00 (iii) I!/!I(x) < M < for all x. .4 1. • 1 = sgn(x). . .356 Asymptotic Approximations Chapter 5 .n(On .. . N(O. Ep...0).. . Show that _ £ ( . (e) Suppose lbat the d. (c) Give a consistent estimate of (]'2. .p(X. (k .0) and 7'(0) . That is the MLE jl' is not consistent (Neyman and Scott. ..0) = O.
denotes the N(O.(x. 2.b)dl"(x) J 1J.c)'P.. Find the efficiency ep(X.Xn ) is the MLE.O)p(x.20 and note that X is more efficient than X for these gross error cases.X) =0.B) < xl = 1.. lJ :o(1J.0. J = 1.05.7. Conclude that (b) Show that if 0 = max(XI.6 Problems and Complements 357 (0 For two estImates 0. Hint: Apply A4 and the dominated convergence theorem B. .~ for 0 > x and is undefined for 0 g~(X. 4.(x..B)) ~ [(I/O).. 0 > 0. 82 ) Pis N(I".a)p(x.Section 5.O).5  and '1'. 0» density..O))dl"(x) ifforalloo < a < b< 00.(x.4.0. Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley. If (] = I. 0) = .O)p(x.'" ..(1:e )n ~ 1.(x . 0 «< 0. U(O.(x.0) ~ N(O.0)p(x. ( 2 ). evaluate the efficiency for € = . then 'ce(n(B . x E R."' c .    . 1979) which you may assume.15 and 0. This is compatible with I(B) = 00.(x. .2. X n be i..' .5) where f. X) as defined in (f).10.d. Let XI. Hint: teJ1J. Show that   ep(X. oj).i.0) (see Section 3. .(x). Show that A6' implies A6. Show that assumption A4' of this section coupled with AGA3 implies assumption A4. Show that if (g) Suppose Xl has the gross error density f. (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2). . 3. not O! Hint: Peln(B . then ep(X.  = a?/o. (a) Show that g~ (x. the asymptotic 2 .jii(Oj . Thus.9)dl"(x) = J te(1J..I / 2 .(x) = (I .exp(x/O).y.b)p(x. " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 . ~"'.(x) + <'P. not only does asymptotic nonnality not hold but 8 converges B faster than at rate n. X) = 1r/2. 'T = 4. and O WIth .a)dl"(x).O) is defined with Pe probability I but < x.O)dl"(x)) = J 1J.
g. 80 +.18 . Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+".. . I .< ~log n p(X. O. ) n en] .5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.In) + .0'1 < En}'!' 0 . 8 ) 0 .·_t1og vn n j = p( x"'o+.. 8)p(x.i (x.4."('1(8) .00) . 1(80 ).in) . 0 ~N ~ (.i. 0). Suppose A4'.in. under F oo ' and conclude that . ~log (b) Show that n p(Xi . Apply the dominated convergence theorem (B. Show that the conclusions of PrQple~ 5.358 5.4.In ~ "( n & &8 10gp(Xi.7. i I £"+7. &l/&O so that E.80 + p(X . Show that 0 ~ 1(0) is continuous.) P"o (X .. and A6 hold for. I I j 6. 0 for any sequence {en} by using (b) and Polyii's theorem (A. ..5 continue to hold if . 8) is continuous and I if En sup {:.80 + p(Xi .i'''i~l p(Xi.O) 1(0) and Hint: () _ g.4) to g..p ~ 1(0) < 00.i(X. (8) Show that in Theorem 5. 8 ) i .80 p(X" 80) +.~ (X.(X.50). ~ L.in) ~ . j. .in) 1 is replaced by the likelihood ratio statistic 7. ) ! .. I 1 . I (c) Prove (5. A2.O') .. .[ L:."log . I .14.4.22).Oo) p(Xi . i.
3). f is a is an asymptotic lower confidence bound for fJ.4.4.6 Problems and Complements 359 8. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 . (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ).4. Let B~ be as in (5. if X = 0 or 1. Let B be two asymptotic l.2.54) and (5.59).4.5. which is just (4.5 hold. (a) Show that under assumptions (AO)(A6) for all Band (A4').4. B' _n + op (n 1/') 13.B lower confidence bounds. Compare Theorem 4.Q" is asymptotically at least as good as B if.7). and give the behavior of (5.4.4. (Ii) Compare the bounds in Ca) with the bound (5.4.4.Section 5. Let [~11.4.5 and 5.2. Establish (5. A. B' nj for j = 1.6 and Slutsky's theorem. 10. Consider Example 4.61). gives B Compare with the exact bound of Example 4.57) can be strengthened to: For each B E e. setting a lower confidence bound for binomial fJ.a. for all f n2 nI n2 .4. Hint: Use Problem 5. (a) Establish (5. (Ii) Deduce that g.:vell . 11. Hint. all the 8~i are at least as good as any competitors. which agrees with (4. at all e and (A4').61) for X ~ 0 and L 12. hence.4.4.58). . We say that B >0 nI Show that 8~I and. Then (5.4. (Ii) Suppose the conditions of Theorem 5. Hint: Use Problems 5.14. B).4. ~. (a) Show that.59) coincide. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5. 9.4. = X.4.2.59).3.7. ~ 1 L i=I n 8'[ ~ 8B' (Xi. the bound (5.56).6.
I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox"). J1. 2.1 and 3.1. Now use the central limit theorem.=2[1  <1>( v'nIXI)] has a I U(O.i. 1) distribution.'" XI! are ij.\ > o..\ l ... X > 0.5 ! 1. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : . X n i..11).i. the test statistic is a sum of i. I' has aN(O. is a given number.4. By Problem 4. =f:. Problems for Section 5.riX) = T(l + nr 2)1/2rp (I+~~. T') distribution. (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation. I ... . 0 given Xl.LJ. . N(Jt. (c) Suppose that I' = 8 > O. 1) distribution.1.i. LJ.d.2. 1. Jl > OJ . the evidence I . Hint: By (3..2. N(I'.360 1 Asymptotic Approximations Chapter 5 14.d.4.6. where . variables with mean zero and variance 1(0). Let X 1> . Suppose that Xl.L and A. each Xi has density ( 27rx3 A ) 1/2 {AX + J.\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO. Establish (5.4..)) and that when I' = LJ. (b) Suppose that I' ).d. 00. (a) Show that the test that rejects H for large values of v'n(X . That is.\ < Ao. phas a U(O.t = 0 versus K : J. 7f(1' 7" 0) = 1 . ..55). . where Jl is known.A 7" 0.d.] versus K . 1 .1.. if H is false.)l/2 Hint: Use Examples 3. That is. the pvalue fJ . 1. (d) Find the Wald test for testing H : A = AO versus K : A < AO.•. Show that (J/(J l'. . is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0.I). Consider the Bayes test when J1. ! (a) Show that the posterior probability of (OJ is where m n (. = O. ! I • = '\0 versus K : .2. inverse Gaussian with parameters J. Consider the problem of testing H : I' E [0. Consider testing H : j. 1). " " I I .) has pvalue p = <1>(v'n(X . > ~. X n be i.LJ. 1 15.5. A exp .01 .21J2 2x A} .. (e) Find the Rao score test for testing H : . Show that (J l'.10) and (3.
17). 1) and p ~ L U(O. Extablish (5. 10 .0 3. ~ L N(O.5.Oo»)+log7f(O+ J.01 < 0 } continuThen 00. all . J:J'). 1) prior.1 is not in effect.046 . (e) Show that when Jl ~ ~.s.05.034 .I(Xi. Hint: By (5.13) and the SLLN..5. Establish (5...2..4.Section 5. Apply the argnment used for Theorem 5. Suppose that in addition to the conditions of Theorem 5.14). L M M tqn(t)dt ~ 0 a. P.054 50 . yn(anX   ~)/ va.6 Problems and Complements 361 (b) Suppose that J.0'): iO' .} dt < .for M(.:. [0.. By Theorem 5..( Xt.1.2 and the continuity of 7f( 0).052 100 . By (5. (Lindley's "paradox" of Problem 5. Fe.(Xi.5.2 it is equivalent to show that J 02 7f (0)dO < a..s.sup{ :.042 .. for all M < 00. yn(E(9 [ X) . (e) Verify the following table giving posterior probabilities of 1. Show that the posterior probability of H is where an = n/(n + I).O'(t)).050 logdnqn(t) = ~ {I(O). 4.~~(:. Hint: 1 n [PI . oyn.5. i ~.)}.17).1 I'> = 1.029 .) sufficiently large.I).O') : 100'1 < o} 5.2.058 20 .) (d) Compute plimn~oo p/pfor fJ.i It Iexp {iI(O)'..i Itlqn(t)dt < J:J'). 0' (i» n L 802 i=l < In n ~sup {82 8021(X i .5.5.S. ifltl < ApplytheSLLN and 0 ous at 0 = O.(Xi..5.645 and p = 0.l has a N(O.8) ~ 0 a. ~ E. Hint: In view of Theorem 5.' " . ~l when ynX = n 1'>=0.
Pn where Pn a or 1. (1) This famous result appears in Laplace's work and was rediscovered by S.5. von Misessee Stigler (1986) and Le Cam and Yang (1990). 'i roc J. 362 f Asymptotic Approximations Chapter 5 > O..5. convergence replaced by convergence in Po probability. i=l }()+O(fJ) I Apply (5.5. Suppose that in Theorem 5. See Bhattacharya and Ranga Rao (1976) for further discussion. for all 6. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5.s.5.16) noting that vlnen. . Finally.f. .I Notes for Section 5. .1 .9) hold with a.1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 .!. " . (0» (b) Deduce (5. . I i. A. Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d. L . .S. all d and c(d) / in d. I: . ~ II• '. I : ItI < M} O.d) for some c(d). . For n large these do not correspond to distributions one typically faces. A proof was given by Cramer (1946).". Fisher (1925).) and A5(a. • Notes for Section 5. Notes for Section 5. Fn(x) is taken to be O. .I(Xi}»} 7r(t)dt.3 ). .s.) ~ 0 and I jtl7r(t)dt < co. Bernstein and R. J I . dn Finally. (I) If the rightband side is negative for some x.29). The sets en (c) {t .. . ~ 0 a. .7 NOTES Notes for Section 5. (O)<p(tI. 5.s.1.s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) . qn (t) > c} are monotone increasing in c.5 " I i . (a) Show that sup{lqn (t) . .4 (1) This result was first stated by R.2 we replace the assumptions A4(a. jo I I . 7.5. . Show tbat (5. ! ' (2) Computed by Winston Cbow.5. . • .8) and (5.) by A4 and A5.
.. RANGA RAO. HANSCOMB. 1985. WILSON. Prob. . E. 1999. Wiley & Sons. "Nonnormality and tests on variances:' Biometrika. Some Basic Concepts New York: Springer. Approximation Theorems of Mathematical Statistics New York: J. FISHER... Sci.S." Proc. SERFLING. CRAMER. 1987. 1998. Linear Statistical Inference and Its Applications. 16. RAo. SCOTT." Proc.. AND D. NEYMAN. C. W. "Probability inequalities for sums of bounded random variables. L. 58. H. Nat. Acad. L. Mathematical Methods of Statistics Princeton. 1990. 318324 (1953).. S. MA: Harvard University press. Symp. Math. Vth Berk. 1980. 1964. p.. BILLINGSLEY. "Consistent estimates based on partially consistent observations. 17.. C. J. 684 (1931). 132 (1948).Section 5. LEHMANN. J. "Theory of statistical estimation. Asymptotics in Statistics.. R. E. Monte Carlo Methods London: Methuen & Co. LE CAM. J. Camb. Phil. New York: J. Statisticallnjerence and Scientific Method. M. S.. 700725 (1925). HOEFFDING. Proc. L. FISHER. 1967. L.. AND G. B. A. P. 1976. 1996. LEHMANN. DAVID. Vol.8 REFERENCES BERGER. 1995. H. 1380 (1963). 1973. Statist. Elements of LargeSample Theory New York: SpringerVerlag. AND R. Box. Pearson. T. R. SCHERVISCH.. 3rd ed. L. W. YANG. M. 1986. CASELLA. 2nd ed. FERGUSON. Hartley and E. 1938. Vol.. HUBER. Wiley & Sons. R. STIGLER. E.. I. P. New York: McGraw Hill. The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge. Vth Berkeley Symposium.." Econometrica. I Berkeley. Cambridge University Press. HfLPERTY. 40. The Behavior of the Maximum Likelihood Estimator Under NonStandard Conditions. G. O. J. 'The distribution of chi square. AND M. Tables of the Correlation Coefficient. RUDIN. reprinted in Biometrika Tables/or Statisticians (1966). NJ: Princeton University Press. 22. AND E. P.. Probability and Measure New York: Wiley. Normal Approximation and Asymptotic Expansions New York: Wiley. Amer. E.. R. AND G. Assoc. 1946. Editors Cambridge: Cambridge University Press. S. CA: University of California Press. BHATTACHARYA. H. M. A CoUrse in Large Sample Theory New York: Chapman and Hall." J. HAMMERSLEY... R... U. 1979. J. A.A. N. F. 1958. Mathematical Analysis. Theory o/Statistics New York: Springer. Statistical Decision Theory and Bayesian Analysis New York: SpringerVerlag.8 References 363 5. Theory ofPoint EstimatiOn New York SpringerVerlag. Soc.. 3rd ed. Statist.
. (I I i : " .I 1 .\ . ' i ' .. . . I . .
1 INFERENCE FOR GAUSSIAN LINEAR MODElS • Most modern statistical questions iovol ve large data sets.3).or nonpararnetric models. and efficiency in semiparametric models.6.3. and prediction in such situations. We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible.2.3.3). with the exception of Theorems 5. the number of parameters. often many. There is. an important aspect of practical situations that is not touched by the approximation.2 and 5. 2. for instance.5. the bootstrap.2.1) and more generally have studied the theory of multiparameter exponential families (Sections 1. 2. however. the properties of nonparametric MLEs. However. [n this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates. testing. the number of observations. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for functionvalued statistics.and semiparametric models. We shall show how the exact behavior of likelihood procedures in this model correspond to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular ddimensional parametric models and shall illustrate our results with a number of important examples. curve estimates. 365 . This chapter is a leadin to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non. tests. 2. the fact that d. are often both large and commensurate or nearly so. real parameters and frequently even more semi.6. the multinomial (Examples 1. Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II. in which we looked at asymptotic theory for the MLE in multiparameter exponential families.2. confidence regions. we have not considered asymptotic inference. and confidence regions in regular onedimensional parametric models for ddirnensional models {PO: 0 E 8}.1. The inequalities ofVapnikChervonenkis.7.8 C R d We have presented several such models already. 2. and n.Chapter 6 INFERENCE IN THE MUlTIPARAMETER CASE 6. the modeling of whose stochastic structure involves complex models governed by several. 1.1.4. multiple regression models (Examples 1.4.3.
. " 'I and Y = Z(3 + e. 1 en = LZij{3j j=1 +Ei. i = l..1. It turn~ out that these techniques are sensible and useful outside the narrow framework of model (6.a2). Notational Convention: In this chapter we will. n (6. In Section 6...1.3).. I I. • i .. Regression.1) j .3) whereZi = (Zil.Zip' We are interested in relating the mean of the response to the covariate values." . 0"2).1.4) 0 where€I.1.'" .1 is also of the fonn (6.zipf.1. Z = (Zij)nxp. andJ is then x nidentity matrix.. .2J) (6. are U. I The regression framewor~ of Examples 1. The model is Yi j = /31 + €i. In this section we will derive exact statistical procedures under the assumptions of the model (6.1.6 we will investigate the sensitivity of these procedures to the assumptions of the model. the Zij are called the design values. . we have a response Yi and a set of p .N(O.1.3): l I • Example 6... 6. '. We consider experiments in which n cases are sampled from a population. .1. In vector and matrix notation.Herep= 1andZnxl = (l.1 covariate measurements denoted by Zi2 • . Here is Example 1.1.€nareij. .1.. " n (6.1 The Classical Gaussian linear Model l' f Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Yi among n independent observations has a distribution that depends on known constants Zil.2. say the ith case. .d. when there is no ambiguity. . 366 Inference in the Multiparameter Case Chapter 6 ! . i = 1.Zip.. and Z is called the design matrix. .1. i = 1.. We have n independent measurements Y1 . .l p..n (6. we write (6. The normal linear regression model is }i = /31 + L j=2 p Zij/3j + €i.j . The OneSample Location Problem.1. let expressions such as (J refer to both column and row vectors.1. . . and for each case. e ~N(O.. .. In the classical Gaussian (nannal) linear model this dependence takes the fonn p 1 Yi where EI.3). . . • .. i .d. Example 6.. N(O. . These are among the most commonly used statistical techniques. 1 Yn from a population with mean /31 = E(Y).1.l)T.1.4 and 2.5) .2) Ii .. .2(4) in this framework. Here Yi is called the response variable. .
.. If we are comparing pollution levels. To see that this is a linear model we relabel the observations as Y1 . where YI.0: because then Ok represents the difference between the kth and average treatment . The model (6. In this case.1.. 0 Example 6.9.2) and (6. nl + . and the €kl are independent N(O.1 Inference for Gaussian Linear Models 367 where (31 is called the regression intercept and /32. TWosample models apply when the design values represent a qualitative factor taking on only two values.~) random variables.1fwe set Zil = 1.5) is called the fixed design norrnallinear regression model. Generally. . (6. We treat the covariate values Zij as fixed (nonrandom).6) is an example of what is often called analysis ojvariance models. + n p = n. the design matrix has elements: 1 if L jl nk + 1<i < L nk k=1 j k=l ootherwise and z= o 0 ••• o o Ip where I j is a column vector of nj ones and the 0 in the "row" whose jth member is I j is a column vector of nj zeros.•.3 we considered experiments involving the comparisons of two population means when we had available two independent samples. In Example I. we are interested in qualitative factors taking on several values. Then for 1 < j < p.1.1. this terminology is commonly used when the design values are qualitative.. and so on. n. Ynt +n2 to that getting the second.. we arrive at the one~way layout or psample model. . To fix ideas suppose we are interested in comparing the performance of p > 2 treatments on a population and that we administer only one treatment to each subject and a sample of nk subjects get treatment k.. .. .3. one from each population.3.Way Layout. Yn1 + 1 . The pSample Problem Or One. . then the notation (6.1.Yn . The model (6. we often have more than two competing drugs to compare. .3) applies. · · .6) is often reparametrized by introducing ( l ~ pl I:~~1 (3k and Ok = 13k . Frequently. and so on.4."i = 1./. . We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional distribution of Y given a set of observed covariate values..6) where Y kl is the response of the lth subject in the group obtaining the kth treatment. Yn1 correspond to the group receiving the first treatment.1. (6. we want to do so for a variety of locations.. if no = 0. 1 < k < p.1. If the control and treatment responses are independent and nonnally distributed with the same variance a 2 . .Section 6. /3p are called the regression coefficients.1. The random design Gaussian linear regression model is given in Example 1.3 and Section 4. 13k is the mean response to the kth treatment.
•• Note that w is the linear space spanned by the columns Cj. I5 h . .368 effects.Y. i5p )T. then r is the rank of Z and w has dimension r. j = 1.(T2J) where Z. 1 JLn)T I' where the Cj = Z(3 = L j=l P .! x (p+l) = (1. E ~N(O.. (3 E RP}. i = 1. .2). . . is common in analysis of variance models.3... We assume that n > r..a 2 ) is identifiable if and only ifr = p(Problem6.. n. j = I •. Ui = v.. . V r span w. . .r additional linear restrictions have been specified. Let T denote the number of linearly independent Cj... •. k model is Inference in the Multiparameter Case Chapter 6 1.. Recall that orthononnal means Vj = 0 fori f j andvTvi = 1. • . When VrVj = O. Note that Z· is of rank p and that {3* is not identifiable for {3* E RP+l. This type oflinear model with the number of columns d of the design matrix larger than its rank r. an orthononnal basis VI. . . b I _ .. .po In tenns of the new parameter {3* = (0'. . It follows that the parametrization ({3....6jCj are the columns of the design matrix. r i' . the linear Y = Z'(3' + E. the vector of means p of Y always is. Note that any t E Jl:l can be written . . It is given by 0 = (itl.. TJi = E(Ui) = V'[ J1. The Canonical Fonn of the Gaussian Linear Model The linear model can be analyzed easily using some geometry. We now introduce the canonical variables and means . However.g. Z) and Inx I is the vector with n ones. • I t Ew ¢:> t = L:(v[t)Vi i=l ¢:> vTt = 0. . n.. of the design matrix. vT n i I t and that T = L(vTt)v.7) !.. . by the GramSchmidt process) (see Section B. Cj = (Zlj. j = I.. {3* is identifiable in the pdimensional linear subspace {(3' E RP+l : L:~~l dk = O} of RP+ 1 obtained by adding the linear restriction E~=l 15k = 0 forced by the definition of the 15k '5.. n. and with the parameters identifiable only once d ...1.. . i=I (6. " .• znj)T. V n for Rn such that VI. . Because dimw = r. i = T + 1. we call Vi and Vj orthogonal... The parameter set for f3 is RP and the parameter set for It is W = {I' = Z(3. i . . .17).1. Even if {3 is not a parameter (is unidentifiable).p.P. there exists (e.
. IJ.2 ) using the parametrization (11.2 known (i) T = (UI •. . (6. and then translate them to procedures for . .1... .2<7' L.1.". Theorem 6.. where = r + 1..2. n. . (iii) U i is the UMVU estimate of1]i. also UMVU for CY. it = E.Ur istheMLEof1]I. .1 Inference for G::'::"::"::"::"::L::'::"e::'::'::M::o::d::e'::' '" '3::6::::. . Theorem 6.8) So...L = 0 for i = r + 1.3.. • .9 N(rli. then the MLE of 0: Ci1]i is a = E~ I CiUi. which are sufficient for (IJ. '" ~ A 1'1...1.2 Estimation 0.2 L i=l + _ '" ryiui _ 02~ i=l 1 r r 2 (6. observing U and Y is the same thing. while (1]1.10). and by Theorem B. U1 .). ..1. n. u) based on U £('I. 20.u) 1 ~ 2 n 2 .(Ui . . is Ui = making vTit. = 1... 1 ViUi and Mi is UMVU for Mi.1. . . Then we can write U = AY. 1 If Cl. In the canonical/orm ofthe Gaussian linear model with 0.2. r.'J nxn . (3.. . .1. The U1 are independent and Ui 11i = 0.2 and E(Ui ) = vi J.Cr are constants. ais (v) The MLE of. 7J = AIJ... Un are independent nonnal with 0 variance 0. n because p.1... U. .10) It will be convenient to obtain our statistical procedures for the canonical variables U.1.1]r. i it and U equivalent.• .1.'Ii) . i i = 1. 0. E w.)T is sufJicientfor'l.•• (ii) U 1 . . 1]r)T varies/reely over Rr.1... Moreover. = L. We start by considering the log likelihood £('1..2"log(21r<7 ) t=1 n 1 n _ _ ' " u.. (6.. 2 ' " 'Ii i=l ~2(T2 6.9) Var(Y) ~ Var(U) = .Section 6.' based on Y using (6. . . whereas (6. .2 We first consider the known case..and 7J are equivalently related. v~.11) _ n log(21r"'). . . (T2)T. and . which is the guide to asymptotic inference in general.8)(6.. i (iv) = 1. 2 0. . n. Note that Y~AIU. Let A nxn be the orthogonal matrix with rows vi. Proof. .. •.
. Theorem 6..4. (ii) U1. 1 Ur .... .• EUl U? Ul Projections We next express j1 in terms of Y.) = '7" i = 1. . = 0. WI = Q is sufficient for 6 = Ct..6. By (6..11) by setting T j = Uj .) (iii) By Theorem 3. . (6. .2 (ii).11) is an exponential family with sufficient statistic T. (iii) 8 2 =(n . Ui is UMVU for E(U.2. r. . . 1 U1" are the MLEs of 1}1. ( 2 )T. .. and . {3.1.1.'7.1.2 . this statistic is equivalent to T and (i) follows. we need to maximize n (log 27f(J2) 2 II ". = 1. recall that the maximum of (6. (i) T = (Ul1" . To show (ii). then W . T r + 1 = L~ I and ()r+l = 1/20. (iv) The conclusions a/Theorem 6. 0. (ii) The MLE ofa 2 is n.2. By GramSchmidt orthogonalization. the MLE of q(9) = I:~ I c.. . and give a geometric interpretation of ii. Proof. there exists an orthonormal basis VI) .4 and Example 3.. . 370 Inference in the Multiparameter Case Chapter 6 I iI .. I.4.1. .1.. . That is.4. To this end.4. To show (iv). i > r+ 1.11) has fJi 1 2 = Ui .10. The maximizer is easily seen to be n 1 L~ r+l (Problem 6. The distribution of W is an exponential family.. define the norm It I of a vector tERn by  ! It!' = I:~ .N(~. . .Ui · If all thee's are zero. . n " .. l Cr ..2 . Q is UMVU. " as a function of 0. Proof..ti· 1 . where J is the n x n identity matrix.1.3 and Example (iii) is clear because 3. n. I i > .3. . "7r because.. is q(8) = L~ 1 <.1.. (v) Follows from (iv).(v) are still valid. .O)T E R n .r) 2:7 1 L~ r+l ul is an unbiased estimator of (J2.6 to the canonical exponential family obtained from (6.1]r.11).. (iv) By the invariance of the MLE (Section 2. and is UMVU for its expectation E(W. by observation. I .1. . (i) By observation. By Problem 3.11) is a function of 1}1. 2:7 r+l un Tis sufficientfor (1]1.Ur'L~ 1 Ul)T is sufficient. 0  i Next we consider the case in which a 2 is unknown and assume n >r+ 1. (U I . j = 1. (6. .3.4. . (We could also apply Theorem 2. i = 1."7i)2 and is minimized by setting '(Ii = Ui. I i I .3. apply Theorem 3.'  . . obtain the MLE fJ of fl.. " V n of R n with VI = C = (c" .... wecan assume without loss of generality that 2:::'=1 c.. 2(J i=r+l ~ U. Assume that at least one c is different from zero... But because L~ 1 Ui2 L:~ I Ui2 + r+l Ul.2). In the canonical Gaussian linear model with a 2 unknown. 1lr only through L~' 1 CUi ..1 L~ r+l U.(J2J) by Theorem B. ~i = vi'TJ. 2 2 ()j = 1'0/0.1).) = a.2. Let Wi = vTu. r.1.
the MLE = LSE of /3 is unique and given by (6.2.1.1. Proof.1. by (6.n log(21f<T .J) of a point y Yo =argmin{lyW: tEw}.Ii = . spans w. IY .jil' /(n r) (6.14) (v) f3j is the UMVU estimate of (3j. = Z T Z/3 and ZTji = ZTZfj and.4.' and is given by ji (ii) jl is orthogonal to Y (iii) 8' ~ = zfj (6.L of vectors s orthogonal to w can be written as ~ 2:7 w. and  Jii is the UMVU estimate of J. .3 because Ii = l:~ 1 ViUi and Y . note that the space w.9).L. In the Gaussian linear model (i) jl is the unique projection oiY on L<..1.1. Thus. then /3 is identifiable. 0 ~ . . note that 6...n. /3. the MLE of {3 equal~ the least squares estimate (LSE) of {3 defined in Example 2.2<T' IY .1 and Section 2.14) follows.ji) = 0 and the second equality in (6.L = {s ERn: ST(Z/3) = Oforall/3 E RP}. any linear combination of U's is a UMVU estimate of its expectation. To show (iv). (i) is clear because Z.. We have Theorem 6. i=l.13) (iv) lfp = r. It follows that /3T (ZT s) = 0 for all/3 E RP..3.Z/3 I' . To show f3 = (ZTZ)lZTy. fj = (ZTZ)lZTji. _..ti. . . any linear combination ofY's is also a linear combination of U's.1 Inference for Gaussian linear Models 371 E Rn on w is the point Definition 6..1.1. j = 1.Section 6. which implies ZT s = 0 for all s E w.12) implies ZT.. equivalently. ) 2 or. /3 = (ZTZ)l ZT.3(iv)..3 maximizes 1 log p(y. That is.3 of.1. and /3 = (ZTZll ZT 1".' and by Theorems 6.1.3 E RP. because Z has full rank.1.. ZTZ is nonsingular. <T) = .... ZT (Y .Z/3I' : /3 E W}. The projection Yo = 7l"(Y I L.12) ii.1. = Z/3 and (6.. ~ The maximum likelihood estimate .1.2(iv) and 6. fj = arg min{Iy . f3j and Jii are UMVU because.1.1.p. (ii) and (iii) are also clear from Theorem r+l VjUj .
H)Y. .1.12) and (6. = [y.372 Inference in the Multiparameter Case Chapter 6 Note that in Example 2.. and I." As a projection matrix H is necessarily symmetric and idempotent. w~ obtain Pi = Zi(3.1.'.2 illustrates this tenninology in the context of Example 6. Note that by (6. (12(J . see also Section RIO. it is commOn to write Yi for 'j1i. . n}.1. 1 < i < n. (12 (ZTz) 1 ).H)). ~ ~ ·. : H = Z(Z T Z)IZT The matrix H is the projection matrix mapping R n into w. That is.16) J I I CoroUary 6. In the Gaussian linear model (i) the fitted values Y = iJ.' € = Y . L7 1 .1.1.({3t + {3. The estimate fi = Z{3 of It is called thefitted value and € = y . . In statistics it is also called the hat matrix because it "puts the hat on Y. i = 1.)I are the vertical distances from the points to the fitted line. moreover. (iv) ifp = r. whenp = r.3) is to be taken. the residuals €.1.ill 2 = 1 q. (j ~ N(f3.I" (12H). i = 1. =H . the best MSPE predictor of Y if f3 is known as well as z is E(Y) = ZT{3 and its best (UMVU) estimate not knowing f3 is Y = ZT {j. By Theorem 1. n lie on the regression line fitted to the data {('" y.H). Var(€) = (12(J . The goodness of the fit is measured by the residual sum of squares (RSS) IY . 1 < i < n. We can now conclude the following.4.1.14) and the normal equations (Z T Z)f3 = Z~Y.. . H I I It follows from this and (B. (ii) (iii) y ~ N(J. .2. • ~ ~ Y=HY where i• 1 . Suppose we are given a value of the covariate z at which a value Y following the linear model (6.5. H T 2 =H. There the points Pi = {31 + fhzi. Taking z = Zi.2 with P = 2.3) that if J = J nxn is the identity matrix. • . . then (6.1.Y = (J . Y = il. the ith component of the fitted value iJ.1. The residuals are the projection of Y on the orthocomplement of wand ·i . (6..1. . and the residual € are independent.. Example 2. 1 ~ ~ ! I ~ ~ ~ i .. I. In this method of "prediction" of Yi..1 we give an alternative derivation of (6.1.). 1 · .14). € ~ N(o. . i .15) Next note that the residuals can be written as ~ I . I .ii is called the residual from this fit.
4. In the Gaussian case. The independence follows from the identification of j1 and € in tenns of the Ui in the theorem. . 0 ~ We now return to our examples.1. . The error variance (72 = Var(El) can be unbiasedly estimated by 8 2 = (n _p)lIY . .1 ~ Inference for Gaussian Linear Models 373 Proof (Y.8) = (J2(Z T Z)1 follows from (B.a of the kth treatment is ~ Pk = Yk.1.. and€"= Y . o . 'E) is a linear transformation of U and.ii. (not Y. We now see that the MLE of It is il = Z. If {Cijk _. In this example the nonnal equations (Z T Z)(3 = ZY become n.. i = 1. Regression (continued). . .. The OneWay Layout (continued). By Theorem 6..p. k = 1. the treatments. If the design matrix Z has rank P.. in the Gaussian model.. One Sample (continued). + n p and we can write the least squares estimates as {3k=Yk . then the MLE = LSE estimate is j3 = (ZTZr1 ZTy as seen before in Example 2. } is a multipleindexed sequence of numbers or variables..1 and Section 2.no The variances of .1. k = 1. is th~ UMVU estimate of the average effect of all a 1 p = Yk . a = {3.. 1=1 At this point we introduce an important notational convention in statistics. .1.8. 0 Example 6. k=I. hence..1.p.. . . Var(.1.2). joint Gaussian.1.2. .3). Here Ji = . . Moreover.8 and (3.i respectively.8.Section 6.ii = Y. Example 6.Ill'.1. and € are nonnally distributed with Y and € independent.2.8 and that {3j and f1i are UMVU for {3j and J1.j = 1. where n = nl + .. Thus. 0 ~ ~ ~ ~ ~ ~ ~ Example 6.Y are given in Corollary 6. .3.p..3...81 and i1 = {31 Y. nk(3k = LYkl..3. in general) p k=l L and the UMVU estimate of the incremental effect Ok = {3k .1).5. then replacement of a subscript by a dot indicates that we are considering the average over that subscript. Y.p.1. (n . which we have seen before in the unbiased estimator 8 2 of (72 is L:~ I (Yi ~  Yf/ Problem 1.
under H. see Problems 1.p}. j where Zi2 is the dose level of the drug given the ith patient.4. in the context of Example 6. In general.. j 1 • 1 ..1.7 and 2.374 Inference in the Multiparameter Case Chapter 6 Remark 6. n} is a twodimensional linear subspace of the full model's threedimensional linear subspace of Rn given by (6.Li = 13 E R. . Thus. Zi3 is the age of the ith patient. a regression equation of the form mean response ~ . For more on LADEs. 1 and the estimates of f3 and J. j.. 1 < . However. The LSEs are preferred because of ease of computation and their geometric properties. q < r. whereas for the full model JL is in a pdimensional subspace of Rn.al. .JL): JL E w} sup{p(Y. in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider. see Koenker and D'Orey (1987) and Portnoy and Koenker (1997).1..3 Tests and Confidence Intervals . The most important hypothesistesting questions in the context of a linear model correspond to restriction of the vector of means JL to a linear subspace of the space w.. and the matrix IIziillnx3 with Zil = 1 has rank 3. . = {3p = 13 for some 13 E R versus K: "the f3's are not all equal. .1. i = 1." Now. i = 1. ! j ! I I . . . I. under H. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEssee Stigler (1986).3 with 13k representing the mean resJXlnse for the kth population. Now we would test H : 132 = 0 versus K : 132 =I O. {JL : Jti = 131 + f33zi3.JL): JL E wo} ~ ! .1. (6. the LADEs are obtained fairly quickly by modem computing methods.zT . which is a onedimensional subspace of Rn.\(y) = sup{p(Y.En in (6.2.1.17) I. The first inferential question is typically "Are the means equal or notT' Thus we test H : 131 = . which together with a 2 specifies the model. . . We first consider the a 2 known case and consider the likelihood ratio statistic ..L are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance L~ 1 IYi ..17). " .6.2.. An alternative approach 10 the MLEs for the nonnal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors El. .1.i 1 6. For instance. we let w correspond to the full model with dimension r and let Wo be a qdimensional linear subspace over which JL can range under the null hypothesis H.31.1. Next consider the psample model of Example 1. . • . the mean vector is an element of the space {JL : J. . • = 131 + f32Zi2 + f33Zi3.1) have the Laplace distribution with density .
1) distribution with OJ = 'Ida. then r. In this case the distribution of L:~q+1 (Uda? is called a chisquare distribution with r .\(Y) Proof We only need to establish the second equality in (6. (}r.\(Y) has a X.iii' .\(Y) = L i=q+l r (ud a )'.1. But if we let A nx n be an orthogonal matrix with rows vi. Write 'Ii is as defined in (6.iio 2 1 } where i1 and flo are the projections of Yon wand wo. respectively..20) i=q+I 210g. i=q+l r = a 21 fL  Ito [2 (6.. . .I8) then.3. V q (6.Section 6_1 Inference for Gaussian Linear Models 375 for testing H : fL E Wo versus j{ : JL E W  woo Because (6.1.1..1.q {(}2) for this distribution. v~' such that VI. when H holds. 2 log .. .\(Y) ~ exp {  2t21Y .l2). V r span wand set .1.'" (}r)T (see Problem B. In the Gaussian linear model with 17 2 known.q degrees offreedom and noncentrality parameter Ej2 = 181 2 = L:=q+ I where 8 = «(}q+l.\(Y) It follows that = exp 1 2172 L '\' Ui' (6. .J X.l9) then. . span Wo and VI. . We write X..1.21).2(v).21 ) where fLo is the projection of fL on woo In particular.19). Proposition 6._q' = AJL where A L 'I.. We have shown the following.1..1. by Theorem 6.1. . o . = IJL i=q+l r JLol'...1.. 2 log . Note that (uda) has a N(Oi. by Theorem 6.1.q( (}2) distribution with 0 2 =a 2 ' \ '1]i L.IY .
q and m = n .23) .I.n_r(02) where 0 2 = u.2.1.r.. which has a X~r distribution and is independent of u. (6.iT'): /L E w} . T has the (central) Frq.5) that if we introduce the variance equal likelihood ratio w'" .376 Inference in the Multiparameter Case Chapter 6 Next consider the case in which a 2 is unknown.') denotes the righthand side of (6.q and n . We know from Problem 6.1. /L..r IY . statistic :\(y) _ max{p{y.m{O') for this distrihution where k = r .1. In Proposition 6. T is called the F statistic for the general linear hypothesis./to I' ' n J respectively.I}.liol' (n _ r) 'IY _ iii' . E In this case.. Thus. has the noncentral F distribution F r _q. which is equivalent to the likelihood ratio statistic for H : Ii.wo. T has the representation .1. I .3. as measured by the residual sum of squares under the model specified by H. Substituting j).q variable)/df (central X~T variahle)/df with the numerator and denominator independent. The resulting test is intuitive.1 that the MLEs of a 2 for It E wand It E wQ are . The distribution of such a variable is called the noncentral F distribution with noncentrality parameter 0 2 and r .2 1JL ./L. . .iT') : /L E wo} (6.0./L.21ii.1._q(02) distribution with 0' = .~I. it can be shown (Problem 6.2 is known" is replaced by '.2.1 suppose the assumption "0.1.14). we can write . J • T = (noncentral X:. For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >. and 86 into the likelihood ratio statistic.. In the Gaussian linear model the F statistic defined by (6.'}' YJ./Lol'.1.q)'Ili .22) Because T = {n .nr distribution. .a = ~(Y'E.. .q)'{[A{Y)J'/n .• j .1 that 021it . T is an increasing function of A{Y) and the two test statistics are equivalent.JLo. We have shown the following.JtoI 2 .L a =  n I' and .r degrees affreedam (see Prohlem B. IIY .liol' IY r _q IY _ iii' iii' (r . 8 2 .r){r .max{p{y. when H I lwlds.'I/L..1.(Y)..L n where p{y. we obtain o A(Y) = P Y. Remark 6.19).2 is the same under Hand K and estimated by the MLE 0:2 for /. is poor compared to the fit under the general model. We write :h. We have seen in PmIXlsition 6. By the canonicaL representation (6.18).itol 2 have a X..E Wo for K : Jt E W .22). = aD IIY . {io. In particular. Proposition 6.l) {Ir . . ]t consists of rejecting H when the fit.itol 2 = L~ q+ 1 (Ui /0) 2.1.1.a:. T = n .'IY _ iii' = L~ r+l (Ui /0)2.J.1.
2.22).1.1.3. q = 0.IO.1. One Sample (continued).iLl' ~++Yl I' I' 1 .1.iLol'.1.1.1'0 I' y. r = 1 and T = 1'0 versus K : (3 'f 1'0. and the Pythagorean identity. We next return to our examples. Y3 y Iy . 0 .(Y) equals the likelihood ratio statistic for the 0'2.~T = (n . The canonical representation (6. It follows that ~ (J2 known case with rT 2 replaced by r. where t is the onesample Student t statistic of Section 4.r)ln central X~_cln where T is the F statistic (6.25) which we exploited in the preceding derivations.\(Y) = .iLol' = IY . See Pigure 6. We test H : (31 case wo = {I'o}.1.9. (6.1.19) made it possible to recognize the identity IY . Yl = Y2 Figure 6.1.Section 6. (6. This is the Pythagorean identity.24) Remark 6. In this = (Y 1'0)2 (nl) lE(Y.1 Inference for Gaussian Linear Models ~ 377 then >.q noncentral X? q 2Iog.Y)2' which we recognize as t 2 / n.iLl' + liL . The projections il and ilo of Yon w and Wo.1 and Section B. Example 6.1.
)2' .2.: I :I T _ n ~ p L~l nk(Y .1. respectively.27) 1 .P . _y. Regression (continued). (6. I3I) where {32 is a (p . j ! j • litPol2 = LL(Yk _y...{3p are Y1.)2.=11=1 k=1 •• Substituting in (6.1 L~=. respectivelY.. which only depends on the second set of variables and coefficients.' • ito = Thus. 1 02 = (J2(p .ZfZl(ZiZl)'ziz. We consider the possibility that a subset of p . In this case f3 (ZrZ)lZry and f3 0 = (Z[Ztl1Z[Y are the ML& under the full model (6.ito 1 are the residual sums of squares under the full model and H. age.vy. •.  I 1.n 378 Inference in the Multiparameter Case Chapter 6 iI II " • Example 6. P .q)1f3nZfZ2 . Now the linear model can be written as i i I I ! We test H : f32 (6. and we partition {3 as f3T = (f3f. • 1 F = (RSSlf . I • ! . The One.26) and H.. Under H all the observations have the same mean so that.1. we partition the design matrix Z by writing it as Z = (ZI.. Under the alternative F has a noncentral Fp_q. 0 I j Example 6. we want to test H : {31 = . Using (6.np distribution.RSSF)/(dflf .q).}f32.Yk.26) 0 versus K : f3 2 i' O..Q)f3f(ZfZ2)f32.. .Way Layout (continued). in general 02 depends on the sample correlations between the variables in Zl and those in Z2' This issue is discussed further in Example 62.)2= Lnk(Yk.g.1. . As we indicated earlier.. L~"(Yk' .. However. 02 simplifies to (J2(p . Without loss of generality we ask whether the last p .q) x 1 vector of main (e. 7) iW l. Z2) where Z1 is n x q and Z2 is 11 X (p . The F test rejects H if F is large when compared to the ath quantile of the Fpq.q covariates in multiple regression have an effect after fitting the first q.g.y)2 k .1.dfp) RSSF/dh I 2 where RSSF = IY and RSSH = IY . In the special case that ZrZ2 = 0 so the variables in Zl are orthogonal to the variables in Z2..1.1. anddfF = np and dfH = nq are the corresponding degrees of freedom. . .1.. I • I b . = {3p. Recall that the least squares estimates of {31. . economic status) coefficients.1.3.. To formulate this question.. . P nk (Y..22) we can write the F statistic version of the likelihood ratio test in the intuitive fonn =   i .22) we obtain the F statistic for the hypothesis H in the oneway layout .q covariates does not affect the mean response. treatment) effect coefficients and f3 1 is a q x 1 vector of "nuisance" (e.n_p(fP) distribution with noncentrality parameter (Problem 6. '1 Yp.1. ..
the total sum of squares. If the fJ.. We assume the oneway layout is valid. respectively. sum of squares in the denominator. . n. Ypnp' The 88w ~ L L(Y Y p "' k'  k )'. identifying 0' and (p . (3" . Note that this implies the possibly unrealistic .. . SST/a 2 is a (noncentral) X2 variable with (n .3. the within groups (or residual) sum of squares.)'. k=l 1=1 measures variation within the samples. Yi nl . .p). Because 88B/0" and SSw / a' are independent X' variables with (p . (31. we have a decomposition of the variability of the whole set of data.···.p) degrees of freedom. the between groups (or treatment) sum of squares and SSw. we see that the decomposition (6. This information as well as S8B/(P . [f we define the total sum of squares as 88T =L l' L(Y nk k'  Y. .29) Thus.1 Inference for Gaussian Linear Models 379 When H holds.1. ijjT = There is an interesting way of looking at the pieces of infonnation summarized by the F statistic.)' . is a measure of variation between the p samples YII . .fJ..30) can also be viewed stochastically. into two constituent components.1. and III with I being the "high" end. (3p)T and its projection 1'0 = (ij. SSB.fLo 12 for the vector Jt (rh. ... See Tables 6.Y. 888 =L k=I P nk(Yk.28) where j3 = n I :2:=~"" 1 nifJi. and the F statistic.1) and 8Sw /(n . . .25) 88T = 888 + 88w . As an illustration.1. (6..2 1M .p) of the (n 1) degrees offreedom of SST /0" as "cooling" from S8w/a'.1. Ypl . SST. To derive IF..1) and (n .1. .. compute a. (3p. . k=1 1=1 which measures the variability of the pooled samples..1..1 and 6. The sum of squares in the numerator..1) degrees of freedom and noncentrality parameter 0'. .Section 6. are not all equal. are often summarized in what is known as an analysis a/variance (ANOVA) table. which is their ratio. the unbiased estimates of 02 and a 2 . consider the following data(I) giving blood cholesterol levels of men in three different socioeconomic groups labeled I.... . .np distribution with noncentrality parameter (6. T has a Fpl. T has a noncentral Fp1.1) degrees offreedom as "coming" from SS8/a' and the remaining (n .. then by the Pythagorean identity (6.np distribution.
j:
I ,
380
Inference in the Multiparameter Case Chapter 6
,
:
TABLE 6.1.1. ANOYA table for the oneway layout
Sum of squares
d.f.
,
Between samples
Within samples
SSe
I:
r I lld\'k_
Mean squares
F value
1\1 S '
i'dS B
Lf!
P
1
MSB
~
, ,
Total
58W  ""k '1'1n I "(I'k l  I')'  I " k 55T  ,. P 1 L "I" 1 (I'kl _ I' )' ~ k I
np
Tl  1
" '''
A1S w = SSw
1
1
j
TABLE 6.1.2. Blood cholesterol levels
I
J
286 290
II III
403 312 403
311 222 244
269 302 353
336 420 235
259 420 319
386 260
353
210
l 1
I
I
,
i,
.!
, , I
I , ,
assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol of the three groups. Here p = 3, nl = 5, n2 = 10, n3 = 6. n = 21, and we compute
TABLE 6.1.3. ANOYA table for the cholesterol data
;
,
88
Between groups Within groups Total
I
,
,I
1202.5 85,750,5 86,953.0
dJ, 2 18 20
M8
601.2 4763,9
F~value
0.126
'I
From :F tables, we find that the pvalue corresponding to the Fvalue 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups. 0 Remark 6.1.4. Decompositions such as (6.1.29) of the response total sum of squares SST into a variety of sums of squares measuring variability in the observations corresponding to variation of covariates are referred to as analysis oj variance. They can be fonnulated in any linear model including regression models. See Scheff" (1959, pp. 4245) and Weisberg (1985, p. 48). Originally such decompositions were used to motivate F statistics and to establish the distribution theory of the components via a device known as Cochran's theorem (Graybill, 1961, p. 86). Their principal use now is in the motivation of the convenient summaries of infonnation we call ANOVA tables.
,
,I "
,
,
,
Section 6.1
Inference for Gaussian linear Models
~
381
Confidence Intervals and Regions We next use our distributional results and the method of pivots to find confidence intervals for J.li, 1 < i < n, !3j, 1 <j < p, and in general, any linear combination
n
,p
= ,p(/l) = Lai/Li
i::: 1
~ aT /l
of the J.l's. If we set;j;
= 1:7
1 ai/Ii
= aT fl and
~
~
where H is the hat matrix, then (,p  ,p)ja(,p) has a N(O, I) distribution. Moreover,
(n  r)8 2ja 2 ~
IY  iii 2 ja 2 =
L
i=r+l
n
(Uda 2 )'
has a X~r distribution and is independent of ;;;. Let
~
~
be an estimate of the standard deviation a('IjJ) of
~
'0. This estimated standard deviation is
called the standard error of 'IjJ. By referring to the definition of the t distribution, we find that the pivot
has a TnT. distribution. Let t n  r (1  40) denote the 1 ~o: quantile of the bution, then by solving IT(,p)1 < t n _, (I  ~",) for,p, we find that
Tn r
distri
is, in the Gaussian linear model, a 100(1  a)% confidence interval for 1/J. Example 6.1.1. One Sample (continued). Consider'IjJ = p. We obtain the interval
i' ~ Y ± t n 1
(1
~q) sj.,fii,
which is the same as the interval of Example 4.4.1 and Section 4.9.2. Example 6.1.2. Regression (continued). Assume that p = T. First consider 1/J = f3j for some specified ~gression coefficient (3j. The 100(1  a)% confidence interval for (3j is
(3j
}" = (3j ± t n p (','" s{ [ (ZT Z) 11 j j ' 1 )
382
Inference in the Multiparameter Case
Chapter 6
where [(ZTZ)~l]j) is the jthdiagonal element of (ZTZ)~ '. Computersoftware computes (ZTZ)~I and labels S{[(ZTZ)~lI]j); as the standard errOr of the (estimated) jth regression coefficient. Next consider ljJ = j.ti = mean response for the ith case, 1 < i < n. The level (1  0:) confidence interval is
J.Li =
Jii ± t n _ p (1 
!a) sJh:
where hii is the ith diagonal element of the hat matrix H. Here sjh;; is called the standard error of the (estimated) mean of the ith case. Next consider the special case in which p = 2 and
Yi = 131 + 132Zi2 + Eil i = 1 ", n.
"
If we use the identity
n
~)Zi2  Z2)(l'i  Y)
i=1
~)Zi2  Z.2)l'i,
We obtain from Example 2.2.2 that
ih ~
Because Var(Yi) = a 2 , we obtain
~ Var(.6,)
L~l (Zi2  z.,)l'i
L~ 1 (Zi2  Z.2)2 .
(6.1.30)
= (J I L)Zi' i=l
''''
n
Z.2) ,
,
and the 100(1  a)% confidence interval for .6, has the form
732 ± t n p (I  ~a) sl J2:(Zi2  Z.2)'. The confidence interval for 131 is given in Problem 6.1.10. .62
=
Similarly, in the p = 2 case, it is straightforward (Problem 6.1.10) to compute
i.
h ii
I ,
•
;
_ 1

(Zi2  z.,)2
",n
+ n
L....i=l Zi2 
(
Z·2
)'
,
0
•
I
and the confidence interval for th~ mean response J1.i of the ith case has a simple explicit
fonn.
,
1
,
i ,
.I
Example 6.1.3. OneWay Layout (continued). We consider 'I/J = 13k. 1 < k .6k = Yk. ~ N(.6k, (J'lnk), we find the 100(1  a)% confidence interval
~
< p. Because
•
,
I
= 7J. ± t n  p (1 ~a) slj'nj; where 8' = SSwl(np). The intervals for I' = .6. and the incremental effect Ok =.6k1'
.6k
are given in Problem 6.1.11. 0
j
j
"I
I
i
,
I
I
Do
l
Section 6.2
Asymptotic Estimation Theory in p Dimensions
383
Joint Confidence Regions
We have seen how to find confidence intervals for each individual (3j, 1 < j < p. We next consider the problem of finding a confidence region C in RP that covers the vector /3 with prescribed probability (1  0:). This can be done by inverting the likelihood ratio test or equivalently the F test That is, we let C be the collection of f3 0 that is accepted when the level (1  a) F test is used to test H : (3 ~ (30' Under H, /' = /'0 = Z(3o; and the numerator of the F statistic (6.1.22) is based on
Iii  /'01'
C "
=
Izi3 
Z(3ol' =
(13  (30)T(Z T Z)(i3 (30)
(30)
Thus, using (6.1.22), the simultaneous confidence region for
f3 is the ellipse
= {(30 .. (13  (30)T(Z 2 Z)(i3 rs
T
< f r,ni"" (1 _ 1 20
'J}
. .T
(6.1.31)
where fr,nr
(1  40:)
is the 1  !o quantile of the :Fr,nr distribution.
Example 6.1.2. Regression (continued). We consider the case p = r and as in (6.1.26) write Z = (ZJ, Z2) and /3T = (f3j, {3f), where f32 is a vector of main effect coefficients and f3 1is a vector of "nuisance" coefficients. Similarly, we partition {3 as {3 = ({31 ,(32 ) where (31 is q x 1 and (3, is (p  q) x 1. By Corollary 6.1.1, O"(ZTZ) is the variancecovariance matrix of {3. It follows that if we let 8 denote the lower right (p  q) x (p ~ q) comer of (ZTZ)I, then (728 is the variancecovariance matrix of 132' Thus, a joint 100(1  0:)% confidence region for {32 is the p  q dimensional ellipse
~ ~ ~
.T ",T
C={(3
0' .
. (i3,(302)TSl(i3,_(3o,) <f
() 2
pqs

pq,np
(11
20:·
l}
o
Summary. We consider the classical Gaussian linear model in which the resonse Yi for (3jZij of the ith case in an experiment is expressed as a linear combination J1i = covariates plus an error fi, where Ci, ... 1 f n are i.i.d. N (0, (72). By introducing a suitable orthogonal transfonnation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transfonnation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients {(3j}, the resJXlnse means {J1i}, and linear combinations of these.
LJ=l
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for onedimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
384'
f ",'c:,c:""""ce=:::;'=''''h':.M=,'"';,,pa::.'::'m'''e::.t::"...:C::':::",..::Cc::h,:,p:::'e::'~6
6.2.1
Estimating Equations
OUf assumptions are as before save that everything is made a vector: X!, ... , X n are i.i.d. Pwhere P E Q, a model containing P = {PO: 0 E e} such that
(i)
e open C RP.
e.
(ii) Densities of P, are pC 0),9 E
1 , I
The following result gives the general asymptotic behavior of the solution of estimating equations.
AO. 'I'
=(,p" ... , ,pp)T where,p) ~ g:., is well defined and
 L >I'(X n.
1=1
1
n
..
i,
On) = O.
(6.2.1)
A solution to (6.2.1) is called an estimating equation estimate or an M estimate.
AI. The parameter 8( P) given by the solution of (the nonlinear system of p equations in p
unknowns):
J
A2. Epl'l'(X" 0(P)1 2
>I'(x, O)dP(x)
=0
(6.2.2)
, , ,
I
is well defined on Q so that O(P) is the unique solution of (6.2.2). Necessarily O(PO) because Q => P.
=0
I , ,
< 00 where I· I is the Euclidean nonn.
I I:
.' , ,
,
I
A3. 'l/Ji{', 8), 1 < i < P. have firstorder partials with respect to all coordinates and using the notation of Section B.8,
I
where
': I
,
~
.,
i
~~
l:iI ~
is nonsingular.
A4. sup
..
{I ~ L~ 1 (D>I'(X"
p
t)  D>I'(Xi , O(P))) I: It  O(P)I < En} ~ 0 if En ~ O.
, .I ,
,
I
I
AS. On ~ O(P) for all P E Q.
Theorem 6.2.1. Under AOA5 ofthis section
8n = O(P) + where
L iii(X n.
t=l
1
n
i,
O(P))
+ op(n 1 / 2 )
(6.2.3)
i..I , ,
• •
iii(x,o(p))
= (EpD>I'(X" 0(P)W 1 >1'(x, O(P)).
(6.2.4)
b
Section 6.2
Asymptotic Estimation Theory in p Dimensions
385
Hence,
(6.2.5)
where
E(q" P) ~ J(O. p)Eq,q,T(X" O(p))F (0, P)
and
81/J~
J
(6.2.6)
r'(o,p)
= EpDq,(X"O(P))
~
E p 011 (X"O(P))
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
1
 n
2.:= q,(Xi , O(P)) = n 2.:= Dq,(Xi,0~)(9n
i=l
n
1
n
 O(P)).
(6.2.7)
i=l
Note that the lefthand side of (6.2.7) is a p x 1 vector, the right is the product of a p x p matrix and a p x 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular p x p matrices, when viewed as vectors, is an open , subset of RP , representable, for instance, as the Set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that with probability tending to 1, ~ l::~ I Dq,(Xi , 6~) is nonsingular. Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of On is motivated by p, the behavior in (6.2.3) is guaranteed for P E Q, which can include P <1c P. In fact, typically Q is essentially the set of P's for which O(P) can be defined uniquely by (6.2.2). We can again extend the assumptions of Section 5.4.2 to: A6. If 1(,0) is differentiabLe
EODq,(X"O)

EOq,(X" O)DI(X" 0) CovO(q,(X" 0), DI(X, , 0))
(6.2.8)
defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the onedimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A61 extend to the multivariate case readily. Note that consistency of On is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however. be shown that with probability tending to 1, a rootfinding algorithm starting at a consistent estimate 6~ will find a solution On of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
386
Inference in the Multiparameter Case
Chapter 6
6,2,2
comes
Asymptotic Normality and Efficiency of the MLE
[fwe take p(x,O)
=
l(x,O)
= 10gp(x,0), and >I>(x,O)
obeys AOA6, then (62,8) be.
T
l
EODl (X" O)D l( X 1,0» VarODl(X"O)
(62,9)
where
j 1 ,
, ,
is the Fisher information matrix I(e) introduced in Section 3.4. If p: e _ R, e c R d , is a scalar function, the matrix)! 8~~P(}J (e) is known as the Hessian or curvature matrix of the sutface p. Thus, (6.2.9) stateS that the expected value of the Hessian of l is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
Theorem 6.2.2. If AOA6 holdfor p(x, 0)
,1, , ,
I
j
=10gp(x, 0), then the MLE  satisfies On
i,
(62,10)
(62,11)
On ~ 0+  :Lr'(O)DI(Xi,O) + op(n"/')
n
i=1
1
n
,,
so that
is a minimum contrast estimate with p and 'f/J satisfying AOA6 and corresponding asymptotic variance matrix E('I1, Pe), then
"
"
If en
"
, ,
E(>I>,PO)
On
> r'(O)
(62,12)
in the sense of Theorem 3.4.4 with equality in (6,2,12) for 0
,
= 0 0 iff, undRr 0 0 ,
(6,2,13)
~
On + op(n 1/2 ),
, i ,
•
,
Proof. The proofs of (6,2,10) and (6,2,11) parallel those of (5.4.33) and (5.4,34) exactly, The proof of (6,2,12) parallels that of Theorem 3.4.4, For completeness we give it Note that hy (6,2,6) and (6,2,8)
I
" :I I.
1" ,
where U >I>(Xt,O). V Var(U T , VT)T nonsingular Var(V)
=
E(>I>,PO)
= CovO'(U, V)VarO(U)CovO'(V, U)
~
(62,14)
DI(X1,0),
But hy (B,lO.8), for any U,V with
(6,2.15)
> Cov(U, V)Var1(U)Cov(V, U),
Taking inverses of both sides yields
r1(0)
= Var01(V) < E(>I>,O).
(6.2.16)
• • !
Section 6.2
Asymptotic Estimation Theory in p Dimensions
387
Equality holds in (6.2.15) by (B. 10.2.3) iff for some b U
= b(O)
(6.2.17)
= b + Cov(U, V)Var1(V)V with probability 1. This means in view of Eow = EODl = 0 that
w(X"O) = b(0)Dl(X1,0).
In the case of identity in (6.2.16) we must have
[EODw(X 1, OW'W(X 1 , 0)
=
r1(0)DI(X" 0).
(6.2.18)
Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds.
o
apxl, a T 8 n
~
We see that, by the theorem, the MLE Is efficient in the sense that for any has asymptotic bias o(n1/2) and asymptotic variance nlaT ]1(8)a, which is n.? larger than that of any competing minimum contrast estimate. Further any competitor 8 n such that aTO n has the same asymptotic behavior as a T 8 n for all a in fact agrees with On to ordern 1/2


A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6
on the asymptotic nonnality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example. Example 6.2.1. The Linear Model with Stochastic Covariates. Let Xi = (Zr, Yi)T. 1 < i < n, be ij.d. as X = (ZT, Y) T where Z is a p x 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:
(il
(6.2.19) where, is distributed as N(O, (),2) independent of Z and E(Z) Z, Y has aN (a + ZT [3, (),2) distrihution.
= O.
That is, given
(ii) The distribution Ho of Z is known with density h o and E(ZZT) is nonsingular.
The second assumption is unreasonable but easily dispensed with. It readily follows (Prohlem 6.2.6) that the MLE of [3 is given by (with probability 1) [3 = [Z(n}Z(n}]
T
~
lT
Zen} Y.
(6.2.20)
Here Zen) is the n x p matrix IIZij ~ Z.j II where z.j = ~ 1 Zij. We used subscripts (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Zen) = (Zl, .. , 1 Zn)T is referred to as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of a and ()'2 are
p
2:7
Ci
=Y
 ~ Zj{3j, (j
J=l
"

2
.
1 ~2 = IY  (Ci + Z(n)[3)1 .
n
(6.2.21 )
I
388
~
Inference in the Multiparameter Case
Chapter 6
Note that although given ZI," " Zn, (3 is Gaussian, this is not true of the marginal distribution of {3. It is not hard to show that AOA6 hold in this case because if H o has density k o and if 8 denotes (a,{3T,a 2 )T. then
~
j
1
j
I(X,8) Dl(X,8)
and
 20'[Y  (a + Z T 13W
(
1
)
2(logo'
1
)
+ log21T) + logho(z)
(6.2.22)
;2'
1
i,
Z ;" 20 4
(.2
o o
1)
I
.,,
a 2
1(8) =
I
0 0' E(ZZT)
0
,
o o
20 4
(6.2.23)
i ,
,
1
so that by Theorem 6.2.2
iI
I'
L(y'n(ii 
a,(3  13,8'  0 2»
~ N(O,diag(02,02[E(ZZ T W',20 4 ».
(6.2.24)
1
!
I
!
,,
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of H o 2 known plays no role in the limiting result for a,j3,Ci • Of course, these will only be the MLEs if H o depends only on parameters other than (a, f3,( 2 ). In this case we can estimate E(ZZT) by ~ L~ 1 ZiZ'[ and give approximate confidence intervals for (3j. j = 1 .. ,po " An interesting feature of (6.2.23) is that because 1(8) is a block diagonal matrix so is II (6) and, consequently, f3 and 0'2 are asymptotically independent. In the classical linear model of Section 6.1 where we perfonn inference conditionally given Zi = Zi, 1 < i < n, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew 0 2 , the MLE would still be and its asymptotic varianc~ optimal for this model. If we knew a and 13. ;;2 would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, 0=2 would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model P = {P(9,"} : 8 E e, '/ E £}
~
j,
1 ,
l
I ,
1 ,
1
13
. ,
:
~
•
L
.' ,
we say we can estimate B adaptively at 'TJO if the asymptotic variance of the MLE (J (or more generally, an efficient estimate of e) in the pair (e, iiJ is lhe same as that of e(,/o), the efficient estimate for 'P'I7o = {P(9,l'jo) : E 8}. The possibility of adaptation is in fact rare. though it appears prominently in this way in the Gaussian linear model. In particular consider estimating {31 in the presence of a, ({32 ... , /3p ) with
~ ~
e
(i) a, (3" ... , {3p known. (ii)
13 arbitrary.
/32
... = {3p = O. LetZi
In case (i), we take, without loss of generality, a = (ZiI, ... , ZiP) T. then the efficient estimate in case (i) is
,
•• "
I
;;n _ L~l Zit Yi Pi n 2
Li=l Zil
(6,2.25)
, , ,
, I
.
I
Section 6.2
Asymptotic Estimation Theory in p Dimensions
389
with asymptotic variance (T2[EZlJ1. On the other hand, [31 is the first coordinate of {3 given by (6.2.20). Its asymptotic variance is the (1,1) element of O"'[EZZTjl, which is strictly bigger than ,,' [EZfj1 unless [EZZTj1 is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate [31 adaptively if {3z, .. . , (3p are regarded as nuisance parameters. What is happening can be seen by a representation of [Z'fn) Z(n)ll Zen) Y and [11(11) where ['(II) Ill'j(lI) II· We claim that


=
2i
fJl
= 2:~_l(Z"
",n
 Zill))y; (Z _ 2(11 ),
tl
t
(6.2.26)
L....t=1
where Z(1) is the regression of (Zu, ... , Z1n)T on the linear space spanned by (Zj1,' . " Zjn)T, 2 < j < p. Similarly.
[11 (II)
= 0"'/ E(ZlI
 I1(ZlI
1
Z'l,"" Zp,»)'
(6.2.27)
where II(Z11 I ZZ1,' .. , Zpl) is the projection of ZII on the linear span of Z211 .. , , Zpl (Problem 6.2.11). Thus, I1(ZlI I Z'lo"" Zpl) = 2:;=, Zjl where (ai, ... ,a;) minimizes E(Zll  2: P _, ajZjl)' over (a" ... , ap) E RPI (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing [3" ... , f3p when the variables Z2, .. . ,Zp are in any way correlated with Z1 and the price is measured by
a;
[E(Zll  I1(Zll I Z'lo"" Zpl)'jl = ( _ E(I1(Zll 1 Z'l, ... ,ZPI)),)I '''''';:;f;c;o,~''"'''''1 , E(ZII) E(ZII)
(6.2.28) In the extreme case of perfect collinearity the price is 00 as it should be because (31 then becomes unidentifiable. Thus, adaptation corresponds to the case where (Z2, . .. , Zp) have no value in predicting Zl linearly (see Section 1.4). Correspondingly in the Gaussian linear model (6.1.3) conditional on the Zi, i = 1, ... , n, (31 is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if E(Zu  I1(ZlI I Z'l, ... , Zpt})' = O. 0

Example 6.2.2. M Estimates Generated by Linear Models with General Error Structure. Suppose that the €i in (6.2.19) are ij.d. but not necessarily Gaussian with density ~fo (~). for instance, e x
fo(x)
~
=
(1 + eX)"
the logistic density. Such error densities have the often more realistic, heavier tails(l) than the Gaussian density. The estimates {301 0'0 now solve
and
390
Inference in the Multiparameter Case
Chapter 6
f' ~ ~ where1/; = 'X;,X(Y) =  (iQ)~ Yfo(y)+1 ,(30 (fito, ... ,ppO)T The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if
=
(i) log fa is strictly concave, i.e.,
*
is strictly decreasing.
(ii) (log 10)" exists and is bonnded.
Then, if further fa is symmetric about 0,
I(IJ)
cr"I((3T, I)
= cr"
where
Cl
(C 1 Z ) E(t
~)
(6.2.29)
, ,
= J (f,(x»)' lo(x)dx, c, = J (xf,(x) + I)' 10 (x)dx.
~
Thus, ,Bo, ao are opti
mal estimates of {3 and (l in the sense of Theorem 6.2.2 if fo is true. Now suppose fa generating the estimates Po and (T~ is symmetric and satisfies (i) and (ii) but the true error distribution has density f possibly different from fo. Under suitable conditions we can apply Theorem 6.2.1 with
1 ,
,
i
where
i
1f;j(Z,y,(3,(J)
,
;
"
~ 1f; (y  L~1 Zkpk )
, I
<j < p
(6.2.30)
!
,
•
> (y  L~1
to conclude that
where (301 ao solve
I 1
j
Zkpk )
1
I
I
I
L:o( y'n(,Bo  (30)) ~ N(O, I: ('I', P») L:( y'n(a  cro) ~ N(O, (J'(P))
1 •
,
(6.2.31)
• ,
J'I'(y zT(30)dP=O
• ,
I
p
II . ,.I'
ii
I:
and E('I', P) is as in (6.2.6). What is the relation between (30' (Jo and (3, (J given in the Gaussian model (6.2.19)? If 10 is symmetric about 0 and the solution of (6.2.31) is unique, . then (30 = (3. But (Jo = c(fo)q for some, c(to) typically different from one. Thus, (30 can be used for estimating (3 althougll if t1j. true distribution of the is N(O, (J') it should ' perform less well than (3. On the Qther hand, o is an estimate of (J only if normalized by a constant depending on 10. (See Problem 6.2.5.) These are issues of robustness, that
~

a
'i
1:
is, to have a bounded sensitivity curve (Section 3~5. Problem 3.5.8), we may well wish to use a nonlinear bounded '.Ii = ('!f;11'" l ,pp)T to estimate f3 even though it is suboptimal when € rv N(O, (12). and to use a suita~ly nonnalized version of {To for the same purpose. One effective choice of.,pj is the Hu~er function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. 0
•
We defined minimum contrast (Me) and M estimates in the case of pdimensional parameters and established their convergence in law to a nonnal distribution. ife denotes the MLE. for fixed n. 6. . the likelihood ratio principle. Wald tests (a generalization of pivots). Although it is easy to write down the posterior density of 8. The three approaches coincide asymptotically but differ substantially. Op). A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems.3 The Posterior Distribution in the Multiparameter Case The asymptotic theory of the posterior distribution parallels that in the onedimensional case exactly. We simply make 8 a vector. .I as the Euclidean nonn in conditions A7 and A8. Using multivariate expansions as in B.3. .2. the equivalence of Bayesian and frequentist optimality asymptotically. say. up to the proportionality constant f e . Ifthe multivariate versions of AOA3.s. This approach is refined in Kass. All of these will be developed in Section 6. .2 Asymptotic Estimation Theory in p Dimensions 391 Testing and Confidence Bounds There are three principal approaches to testing hypotheses in multiparameter models. the latter can pose a fonnidable problem if p > 2. 2 The asymptotic theory we have developed pennits approximation to these constants by the procedure used in deriving (5. and Tierney (1989).19). A4(a. because we then need to integrate out (03 . A5(a. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as H : fJ 1 < 0 versus K : fh > 0 where fJ 2 .. .s.) and A6A8 hold then.(t) n~ 1 p(Xi . and Rao's tests. 1T(8) rr~ 1 P(Xi1 8)..2. A new major issue that arises is computation. These methods are beyond the scope of this volume but will be discussed briefly in Volume II. under PeforallB. Confidence regions that parallel the tests will also be developed in Section 6. O ). We have implicitly done this in the calculations leading up to (5. Kadane.s. The problem arises also when.2. t)dt. as is usually the case. we are interested in the posterior distribution of some of the parameters. The consequences of Theorem 6.).2.Section 6. However. fJ p vary freely.19) (Laplace's method). See Schervish (1995) for some of the relevant calculations.3 are the same as those of Theorem 5. ~ (6.32) a..3.5.3.8 we obtain Theorem 6.5. and interpret I .. Again the two approaches differ at the second order when the prior begins to make a difference.5. typically there is an attempt at "exact" calculation. Summary.2. When the estimating equations defining the M estimates coincide with the likelihood . in perfonnance and computationally. say (fJ 1.
d. '(x) = sup{p(x. ! I In Sections 4. LARGE SAMPLE TESTS AND CONFIDENCE REGIONS ~ 0) converges a. confidence regions. covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to he appropriate approximations.3 • . X n are i.I . ry E £} if the asymptotic distribution of In(() .2 to extend some of the results of Section 5.1 we developed exact tests and confidence regions that are appropriate in re~ gression and anaysis of variance (ANOVA) situations when the responses are normally distributed. 1 Zp. 170 specified. In these cases exact methods are typically not available.denotes the MLE for PO' then the posterior distribution of yn(O if 0 r' (9)) distribution. in many experimental situations in which the likelihood ratio test can be used to address important questions. Finally we show that in the Bayesian framework where given 0. we need methods for situations in which.B} has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of () in the family of models {PO. We shall show (see Section 6. and other methods of inference.9) . I j I I .9 and 6.1 we considered the likelihood ratio test statistic. and showed that in several statistical models involving normal distributions. In this section we will use the results of Section 6. We use an example to introduce the concept of adaptation in which an estimate f) is called adaptive for a model {PO.8 0 . this result gives the asymptotic distribution of the MLE.9).. I .1 .i.4.I . 1 . . A(X) simplified and produced intuitive tests whose critical values can be obtained from the Student t and :F distributions. adaptive estimation of 131 is possible iff Zl is uncorre1ated with every linear function of Z2) . to the N(O.f/ : () E 8. Another example deals with M estimates based on estimating equations generated by linear models with nonGaussian error distribution. These were treated for f) real in Section 5. However. the exact critical value is not available analytically. In Section 6. Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic i. I . under P. I H I! ~. 9 E 8} sup{p(x. 9 E 8" 8 1 = 8 . Wald and Rao large sample tests. However. as in the linear model. 9 E 8 0 } for testing H : 9 E 8 0 versus K . 1 '~ J .110 : () E 8}.4. ..4 to vectorvalued parameters. . . We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MD or AIestimate if we know the correct model P = {Po : BEe} and use the MLE for this model. 6.•• . "' ~ 6. PO'  . In such . equations. X II .4. and confidence procedures. and we tum to asymptotic approximations to construct tests.392 Inference in the Multiparameter Case Chapter 6 .3. We present three procedures that are used frequently: likelihood ratio. In linear regression.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal.". .s.
Section 6...3. 0 = (iX. .2. 0) ~ iJ"x·. SetBi = TU/uo. when testing whether a parameter vector is restricted to an open subset of Rq or R r .\(x) is available as p(x.1.l.Wo where w is an rdimensional linear subspace of Rn and W ::> Wo. . Wilks's approximation is exact. the numerator of . The MLE of f3 under H is readily seen from (2.\(X) based on asymptotic theory. the X~_q distribution is an approximation to £(21og . iJ > O.n.1. i = 1.£ X~" for degrees of freedom d to be specified later.. . .4. This is not available analytically.Xn are i. =6r = O. ..5) to be iJo = l/x and p(x. iJ). Y n be nn.1 exp{ iJx )/qQ). 0 It remains to find the critical value.3. . . ~ ~ ~ ~ ~ The approximation we shall give is based on the result "2 log '\(X) .i. iJ). 0) = IT~ 1 p(Xi. The Gaussian Linear Model with Known Variance.2 we showed how to find B as a nonexplicit solution of likelihood equations.. In this u 2 known example.3. .. Moreover.1). such that VI. . .tl.. the hypothesis H is equivalent to H: 6 q + I = .1. as X where X has the gamma. . . As in Section 6. Here is an example in which Wilks's approximation to £('\(X)) is useful: Example 6..i = 1.1.3 we test whether J. we conclude that under H. 1). 0 We iHustrate the remarkable fact that X~q holds as an approximation to the null distribution of 2 log A quite generally when the hypothesis is a nice qdimensional submanifold of an rdimensional parameter space with the following. We next give an example that can be viewed as the limiting situation for which the approximation is exact: independent with Yi rv N(Pi. E W . ~ In Example 2.3. 2 log . = (J.'··' J.n.i = 1.\(Y) = L X. •. iJo) is the denominator of the likelihood ratio statistic. V q span Wo and VI. . where A nxn is an orthogonal matrix with rows vI. 0).3 Large Sample Tests and Confidence Regions 393 cases we can turn to an approximation to the distribution of . Suppose XL X 2 . Using Section 6.. distribution with density p(X. which is usually referred to as Wilks's theorem or approximation.d. Then Xi rvN(B i . ~ X. x ~ > 0.. and we transform to canonical form by setting Example 6. under regularity conditions..."" V r spanw..\(Y)).v'[..3.randXi = Uduo. .2 we showed that the MLE. Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations. Q > 0. exists and in Example 2. i = r+ 1. q < r... 1.q' i=q+1 r Wilks's theorem states that. r and Xi rv N(O. Let YI . Suppose we want to test H : Q = 1 (exponential distribution) versus K : a i.tn)T is a member of a qdimensional linear subspace of woo versus the alternative that J.l. qQ. . <1~) where {To is known. Thus.. .
where Irxr«(J) is the Fisher information matrix... . where x E Xc R'..1. 21og>. (J = (Jo.(In[ < l(Jn . ". c = and conclude that 210g . . . "'" T.2 but (T2 is unknown then manifold whereas under H..2. Then. : .3." . and conclude that Vn S X.2. = 2[ln«(Jn) In((Jo)] ~ ~ ~ £ X. . .(Jol < I(J~ . In Section 6. I I i Example 6. o . The Gaussian Linear Model with Unknown Variance. I(J~ . Proof Because On solves the likelihood equation DOln(O) = 0..(Jo) for some (J~ with [(J~ . I1«(Jo)). 2 log >'(Y) = Vn ". . X. Note that for J.2. Suppose the assumptions of Theorem 6.. Eln«(Jo) ~ I«(Jo).(X) ~ 1 . "'" (6.7 to Vn = I:~_q+lXl/nlI:~ r+IX. and (J E c W.3.2. an = n.(Jo) "..(J) Uk uJ rxr c'· • 1 .Xn a sample from p(x.q' Finally apply Lemma 5. Apply Example 5..· . i\ I . hy Corollary B.2) V ~ N(o.1 ((Jo)).394 Inference in the Multiparameter Case Chapter 6 1 .q also in the 0. .2. (6.3. ranges over a q + Idimensional manifold.6...logp(X. V T I«(Jo)V ~ X. we can conclude arguing from A..1.2 unknown case. .~ L H .d. Write the log likelihood as e In«(J) ~ L i=I n logp(X i . N(o.1) " I._q as well.i. ! . .\(Y) £ X.2 are satisfied.. under H .2 with 9(1) = log(l + I). j . an expansion of lnUI) about en evaluated at 8 = 0 0 gives 2[ln«(Jn) In((Jo)] = n«(Jn .Onl + IOn  (Jol < 210n . 0 Consider the general i. ·'I1 . where DO is the derivative with respect to e. I( I In((J) ~ n 1 n L i=l & & &" &".. (72) ranges over an r + Idimensional Example 6. If Yi are as in = (IL. Hence.1.(Jol.(Y) defined in Remark ° 6.3 and AA that that In((J~) "'" 2[ln((Jn) In((Jo)] £r ~ V I((Jo)V. we derived e e 2log >'(Y) ~ n log ( 1 + 2::" L. 1 • ! We tirst consider the simple hypothesis H : 6 = 8 0 _ Theorem 6. Here ~ ~ j. !. y'ri(On .3. I .(Jol. case with Xl. By Theorem 6. The result follows because.3. Because . B).3.. 0).3.3.3.(Jo) In«(Jn)«(Jn .. . I.i=q+l 'C'" I=r+l Xi' ) X.
10 (0 0 ) = VarO.8..2.". Let 0 0 E 8 0 and write 210g >'(X) ~ = 2[1n(9 n )  In(Oo)] .6) .(X) 8 0 when > x.3. 0(2) = (8. Then under H : 0 E 8 0. where x r (1.a)) (6.8. (6.(9 0 .2[1. . Suppose that the assumptions of Theorem 6.j.3. dropping the dependence on 6 0 • M ~ PI 1 / 2 (6.3) is a confidence region for 0 with approximat~ coverage probability 1 .35) where 8(00) = n 1/2 L Dl(X" 0) i=1 n and 8 = (8 1 . SupposetharOo istheMLEofO underH and that 0 0 satisfiesA6for Po.3.2 illustrate such 8 0 . for given true 0 0 in 8 0 • '7=M(OOo) where.0) is the 1 and C!' quantile of the X. 0). 8 2)T where 8 1 is the first q coordinates of8. Examples 6.n and (6. distribution..2. where e is open and 8 0 is the set of 0 E e with 8j = 80 .8q ). the test that rejects H : () 2Iog>.j} are specified values. We set d ~ r .10) and (6.(] . Theorem 6. Next we tum to the more general hypothesis H : 8 E 8 0 . ~ {Oo : 2[/ n (On) /n(OO)] < x.2 hold for p(x.2..4) It is easy to see that AOA6 for 10 imply AOA5 for Po.T.80.)T. 0b2) = (80"+1". has approximately level 1. 0 E e. Make a change of parameter.Section 6. j ~ q + 1..81(00).)1'.3.. T (1) (2) Proof.+l. (JO.1) applied to On and the corresponding argument applied to 8 0 . OT ~ (0(1).3.0:'.q. ~(11 ~ (6. Furthennore.0(2».3.)T. Let 00..n) /n(Oo)l.3.(Ia)..3 Large Sample Tests and Confidence Regions 395 = As a consequence of the theorem..3.4).0(1) = (8 1" ".' ".n = (80 .1 and 6. . Let Po be the model {Po: 0 E eo} with corresponding parametn"zation 0(1) = ~(1) (1) ~(1) (8 1. By (6.0 0 ). and {8 0 .a.
if A o _ {(} .1I0 + M./ L i=I n D'1I{X" liD + M. because in tenus of 7].1 '1ll/sup{p(x.3.8 0 : () E 8}.I '1): 11 0 + M.a) is an asymptotically level a test of H : 8 E Of equal importance is that we obtain an asymptotic confidence region for (e q+ 1.3. Moreover.(X) > x r .. Such P exists by the argument given in Example 6. • 1 y{X) = sup{P{x. .+ 1 .+1 ~ .2. rejecting if .1I0 + M.13) D'1I{x. . . This asymptotic level 1 ..1 because /1/2 Ao is the intersection of a q dirnensionallinear subspace of R r with JI/2{8 .396 Inference in the Multiparameter Case Chapter 6 and P is an orthogonal matrix such that. His {fJ E applying (6.3. ••. (O) r • + op(l) (6. which has a limiting X~q distribution by Slutsky's theorem because T(O) has a limiting Nr(O..3.3..(l .7).9) .l. then by 2 log "(X) TT(O)T(O) .2.. .1 '1).1> . by definition..0. ).5) to "(X) we obtain. 18q acting as r nuisance parameters.9).10) LTl(o) . Or)J < xr_. '1 E Me}..1/ 2 p = J.1I 0 + M. (6. Now write De for differentiation with respect to (J and Dry for differentiation with respect to TJ.0: confidence region is i . ~ 'I. under the conditions of Theorem 6.II).3.1 '1) = [M.LTl(o) + op(l) i=l i=l L i=q+l r Ti2 (0) + op(l).1'1 '1 E eo} and from (B. Thus. 0 Note that this argument is simply an asymptotic version of the one given in Example 6.00 . IIr ) : 2[ln(lI n ) .In( 00 . Tbe result follows from (6.11) .8.3.(1 . : • Var T(O) = pT r ' /2 II..8) that if T('1) then = 1 2 n.1ITD III(x.all (6.6) and (6. J) distribution by (6. We deduce from (6. 1e ). Me : 1]q+l = TJr = a}.+I> . with ell _ . = .. .• I . Note that.l.Tf(O)T ... is invariant under reparametrization .. = 0.3. • . a piece of 8..00 : e E 8 0} MAD = {'1: '/. I I eo~ ~ ~ {(O.(X) = y{X) where ! 1 1 ..3.2..3.
n under H is consistent for all () E 8 0. will appear in the next section..3.... B2 .. q + 1 < j < r. Suppose the assumptions of Theorem 6.3.:::f.p:::'e:.. 9j .. "'0 ~ {O : OT Vj = 0. As an example ofwhatcau go wrong. " ' J v q and Vq+l " . More sophisticated examples are given in Problems 6.3..:::m.13) then the set {(8" .d.3. J). The proof is sketched iu Problems (6. Suppose the MLE hold (JO.2.( 0 ) ~ o} (6. . Then. Here the dimension of 8 0 and 8 is the same but the boundary of 8 0 has lower dimension. Examples such as testing for independence in contingency tables. Wilks's theorem depends critically On the fact that nOt only is open but that if giveu in (6. Suppose H is specified by: There exist d functions..14) a hypothesis of the fonn (6. It can be extended as follows.3.::. if A(X) is ~ the likelihood ratio statistic for H : 0 E 8 0 given in (6.. R.. .3. 2 e eo Ii o ~ (Xl + X 2) 2 + ~ 1 _ (X.13). .cTe:::. q + 1 < j < r written as a vector g.8 0 E Wo where Wo is a linear space of dimension q are also covered.3. A(X) behaves asymptotically like a test for H : 0 E 8 00 where 800 = {O E 8: Dg(Oo)(O . l B. X~q under H.6.3. themselves depending on 8 q + 1 . . of f)l" .3). . More complicated linear hypotheses such as H : 6 .(0) ~ 8j .e:::S:::.. If 81 + B2 = 1. X. 8q)T : 0 E 8} is opeu iu Rq. 8q o assuming that 8q + 1 .'! are the MLEs.3..2 to this situation is easy and given in Problem 6. .2 is still inadequate for most applications.. Vl' are orthogonal towo and v" . (6.o:::"'' 397 CC where 00."de"'::'"e:::R::e"g.2 falls under this schema with 9. if 0 0 is true.12) The extension of Theorem 6..3.80.. The esseutial idea is that.3.3..g.i.. Theorem 6.8r are known. which require the following general theorem. t..'':::':::"::d_C:::o:::. Defiue H : 0 E 8 0 with e 80 = {O E 8 : g(O) = o}..l:::... where J is the 2 x 2 identity matrix aud 8 0 = {O : B1 + O < I}.t.) be i. (6.3.::Se::..3.. v r span " R T then.3.q at all 0 E 8. .. We need both properties because we need to analyze both the numerator and denominator of A(X).3. such that Dg(O) exists and is of rauk r .5 and 6. 2' 2 + X 2 )) 2 and 210g A(X) ~ xf but if 81 + 82 < 1 clearly 2 log A(X) ~ op(I).3. let (XiI. N(B 1 . The formulation of Theorem 6.o''C6::.... q + 1 <j < r}. 8..2 and the previously conditions on g.2)(6.13) Evideutly... .13). We only need note that if WQ is a linear space spanned by an orthogonal basis v\. 210g A(X) ".3. Theorem 6.
hence.2. the Hessian (problem 6.Nr(O.3. ~ ~ 00.3.3.: b.. But by Theorem 6.3.l(a) that I(lJ n ) ~ I(IJ) asn By Slutsky's theorem B. y T I(IJ)Y . I 22 (9 n ) is replaceable by any consistent estimate of J22( 8). For the more general hypothesis H : (} E 8 0 we write the MLE for 8 E as = ~ ! I: (IJ n . I. I .3.3.10). More generally /(9 0 ) can be replaced by any consistent estimate of I( 1J0 ).6. in particular . (6.2.X.a)} is an ellipsoid in W easily interpretable and computablesee (6.Or) and define the Wald (6.9). theorem completes the proof.2 Wald's and Rao's Large Sample Tests The Wald Test Suppose that the assumptions of Theorem 6.2. y T I(IJ)Y.3.2. If H is true..7.17) ~ ~ e ~ en 1 • I I Wn(IJ~2)) ~ n(9~2) 1J~2)l[I"(9nJrl(9~) _1J~2») where I 22 (IJ) is the lower diagonal block of II (IJ) written as I2 I 1 _ (Ill(lJ) (IJ) I21(IJ) I (1J)) I22(IJ) with diagonal blocks of dimension q x q and d x d. .31).16). has asymptotic level Q.398 Inference in the Multiparameter Case Chapter 6 6. respectively. the lower diagonal block of the inverse of .. r 1 (1J)) where.2. Slutsky's it I' I' . . . (6. i .fii(lJn (2) L 1J 0 ) ~ Nd(O. Y .. ~ Theorem 6.16) n(9 n _IJ)TI(9 n )(9 n IJ) !:. 0 ""11 F . It and I(8 n ) also have the advantage that the confidence region One generates {6 : Wn (6) < xp(l .2. j Wn(IJ~2)) !:.q' ~(2) (6. Under the conditions afTheorem 6. The last Hessian choice is ~ ~ . 1(8) continuous implies that [1(8) is continuous and.rl(IJ» asn ~ p ~ L 00. It follows that the Wald test that rejects H : (J = 6 0 in favor of K : () • i= 00 when .18) i 1 Proof..7.) and IJ n ~ ~ ~(2) ~ (Oq+" . [22 is continuous. More generally.0. according to Corollary B. X.3... i favored because it is usually computed automatically with the MLE. Then .3.15) Because I( IJ) is continuous in IJ (Problem 6. .~ D 2 1n ( IJ n).fii(1J IJ) ~N(O. for instance. (6. it follows from Proposition B.2.1.lJ n ) where IJ n statistic as ~(1) ~(2) ~(1) = (0" .4. .2 hold. I22(lJo)) if 1J 0 E eo holds.~ D 2ln (IJ n ).15) and (6.3.~ D 2l n (1J 0 ) or I (IJ n ) or . .
0:) is called the Rao score test. The argument is sketched in Problem 6. which rejects iff HIn (9b ») > x 1_ q (1 . is.. The Rao Score Test For the simple hypothesis H that.n)[D. Rao's SCOre test is based on the observation (6. The Wald test leads to the Wald confidence regions for (B q +] . Let eo '1'n(O) = n..9.n) ~  where :E is a consistent estimate of E( (}o). It follows from this and Corollary B. What is not as evident is that. asymptotically level Q.. Rn(Oo) = nt/J?:(Oo)r' (OO)t/Jn(OO) !. It can be shown that (Problem 6.19) indicates.nl]  2  1 . N(O. the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n.nlE ~ (2) _ T  .n] D"ln(Oo. requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests.I Dln ((Jo) is the likelihood score vector.3.3. The Rao test is based on the statistic Rn(Oo ) ~n'1'n(Oo. (J = 8 0 .\(X) is the LR statistic for H Thus. 112 is the upper right q x d block.19) : (J c: 80.. This test has the advantage that it can be carried out without computing the MLE. (Problem 6..:El ((}o) is ~ n 1 [D.. 1(0 0 )) where 1/J n = n ..8) E(Oo) = 1.n) under n H. The extension of the Rao test to H : (J E runs as follows. un.3.n) 2  + D21 ln (00. under H...Q). therefore. by the central limit theorem.3. The test that rejects H when R n ( (}o) > x r (1. as (6. and the convergence Rn((}o) ~ X.3. 2 Wn(Oo ) ~(2) where .3. X.1/ 2 D 2 In (0) where D 1 l n represents the q x 1 gradient with respect to the first q coordinates and D 2 l n the d x 1 gradient with respect to the last d.n W (80 .20) vnt/Jn(OO) !..21) where III is the upper left q x q block of the r x r infonnation matrix I (80 ). ~ B ) T given by {8(2) : r W n (0(2)) < x r _ q (1 These regions are ellipsoids in R d Although.2 that under H.3.ln(8 0 . a consistent estimate of ..n under H. and so on. the asymptotic variance of . Furthermore. . the two tests are equivalent asymptotically.6.22) . ~1 '1'n(OO.3 Large Sample Tests and Confidence Regions 399 The Wold test.(00 )  121 (0 0 )/. in practice they can be very different..' (0 0 )112 (00) (6. = 2 log A(X) + op(l) (6.3.Section 6. (6.ln(Oo. as n _ CXJ.9) under AOA6 and consistency of (JO.
Finally.8 0 where is an open subset of R: and 8 0 is the collection of 8 E 8 with the last r ~ q coordinates 0(2) specified.0 0 ) I(On)(On .0 0 ) = T'" "" (vn(On .a)} and • The Rao large sample critical and confidence regions are {R n (Ob » {0(2) : Rn (0(2) < xd(l. it shares the disadvantage of the Wald test that matrices need to be computed and inverted.3.. The asymptotic distribution of this quadratic fonn is also q• I e X: X: 6.3. We considered the problem of testing H : 8 E 80 versus K : 8 E 8 . X. we introduced the Rao score test.2 but with Rn(lJb 2 1 1 » !:. and so on. .)I(On)( vn(On  On) + A) !:. which is based on a quadratic fonn in the gradient of the log likelihood. and Wald Tests It is possible as in the onedimensional case to derive the asymptotic power for these tests for alternatives of the form On = ~ eo + ~ where 8 0 E 8 0 .a)}. D 21 the d x d Theorem 6.2. 0(2). then. I . e(l). and showed that this quadratic fonn has limiting distribution q .. . for the Wald test neOn . under regularity conditions. We also considered a quadratic fonn. On the other hand. The analysis for 8 0 = {Oo} is relatively easy.19) holds under On and that the power behavior is unaffected and applies to all three tests. . Rao. Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score testssee Rao (1973) for more on this. . .400 Inference in the Multiparameter Case Chapter 6 where D~ is the d x d matrix of second partials of ill with respect to matrix of mixed second partials with respect to e(l). which measures the distance between the hypothesized value of 8(2) and its MLE. ._q(A T I(Oo)t. The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under H. Power Behavior of the LR. I I . called the Wald statistic. We established Wilks's theorem.) . 2 log A(X) has an asymptotic X.. X~' 2 j > xd(1 . Under H : (J E A6 required only for Po eo and the conditions ADAS a/Theorem 6.On) + t. .4 LARGE SAMPLE METHODS FOR DISCRETE DATA In this section we give a number of important applications of the general methods we have developed to inference for discrete data. which stales that if A(X) is the LR statistic. where X~ (')'2) is the noncentral chi square distribution with m degrees of freedom and noncentrality parameter "'? It may he shown that the equivalence (6. b . For instance. Summary. In particular we shall discuss problems of .5. .q distribution under H.
+ n LL(Bi .4 Large Sample Methods for Discret_e_D_a_ta ~_ _ 401 goodnessoffit and special cases oflog linear and generalized linear models (GLM).0). where N. I ij [ . Because ()k = 1 .3. j = 1. . or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory. . . with ()Ok =  1 1 E7 : kl j=l ()OJ.) /OOk = n(Bk .()k_l)T and test the hypothesis H : ()j = ()OJ for specified (JOj.+ ] 1 1 0.6.8 we found the MLE 0.00k)2lOOk. In Example 2.Section 6.OOi)(B.1. . . and 2.E~. j=1 To find the Wald test. ()j. Ok if i if i = j. k1.2.OO.2.7.)/OOk.32) thaI IIIij II.]2100. we consider the parameter 9 = (()l. the Wald statistic is klkl Wn(Oo) = n LfB. For i. Pearson's X2 Test As in Examples 1.. trials in which Xi = j if the ith trial prcx:luces a result in the jth category. Thus.. treated in more detail in Section 6. k. j = 1.lnOo. L(9. 2.4.33) and (3. Ok Thus. = L:~ 1 1{Xi = j}. consider i. j=l i=1 The second term on the right is kl 2 n Thus.. j = 1.d. log(N.) > Xkl(1.2. k .4.i. In. j=1 Do.nOo.. we need the information matrix I = we find using (2.1 GoodnessofFit in a Multinomial Model.00. ..5.3. . . Let ()j = P(Xi = j) be the probability of the jth category. j=l ..)2InOo. . ~ N. # j. we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution. 6. k Wn(Oo) = L(N..." .8. It follows that the large sample LR rejection region is ~ k 2IogA(X) = 2 LN.
~ 8 8 0j 0k  2 80j ) ( n LL _ _ L kl kl kl ( J=1 z=1 ~ 8_Z 80i ~) 8k 80k (~ 8 _J 80J ~) ) 8k . by A..L ..1S.2.4.4. note that from Example 2.. we write ~ ~  8j 80j . . ~ . ~ = Ilaijll(kl)X(kl) with Thus.2. n ( L kl ( j=1 ~ ~) !.80k). 80i 80j 80k (6.1) X where the sum is over categories and "expected" refers to the expected frequency E H (Nj ). The general form (6. ]1(9) = ~ = Var(N).8. 8k 80 k = {[8 0k (8 j  80j )]  [8 0j (8 k ~  80k ) ] } .11).L . It is easily remembered as 2 = SUM (Observed .13. To derive the Rao test.4. Then. The second term on the right is n [ j=1 (!..2).. j=1 . the Rao statistic is . because kl L(Oj .4. '" . where N = (N1 . with To find ]1.Nk_d T and.. 80j 80k 1 and expand the square keeping the square brackets intact.80j ) = (Ok ..402 Inference in the Multiparameter Case Chapter 6 The term on the right is called Pearson's chisquare (X 2 ) statistic and is the statistic that is typically used for this multinomial testing problem. .~) 8 80j 80 k ~ ~ ~ ] 2 0j = n [~ 80 k ~ 1] 2 To simplify the first term on the right of (6.. we could invert] or note that by (6.Expected)2 Expected (6. ..2) . .1) of Pearson's X2 will reappear in other multinomial applications in this section.
75)2 104.i::. . For comparison 210g A = 0. n2 = 101. In experiments on pea breeding. 02.75 .25)2 104. this value may be too small! See Note 1.4). Then.75. 8). Mendel observed nl = 315. and we want to test whether the distribution of types in the n = 556 trials he performed (seeds he observed) is consistent with his theory. 04. and (4) wrinkled green. 4 as above and associated probabilities of occurrence fh. we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1.25 + (3.25)2 312.4.1) dimensional parameter space k 8={8:0i~0.9 when referred to a X~ table.4 Large Sample Methods for Discrete Data 403 the first term on the right of (6. distribution. Contingency Tables Suppose N = (NI .75. . nOlO = 312. k = 4 X 2 = (2. in the HardyWeinberg model (Example 2. 04 = 1/16. Testing a Genetic Theory. fh = 03 = 3/16. Possible types of progeny were: (1) round yellow. n020 = n030 = 104. Example 6.. which has a pvalue of 0.25. where 8 0 is a composite "smooth" subset of the (k .75 + (3. which is a onedimensional curve in the twodimensional parameter space 8. 7. j.75)2 = 04 34.Section 6.1.2 GoodnessofFit to Composite Multinomial Models. Mendel's theory predicted that 01 = 9/16. M(n. 2.fh. (2) wrinkled yellow. • 6. However. There is insufficient evidence to reject Mendel's hypothesis. n040 = 34.: .2) becomes It follows that the Rao statistic equals Pearson 's X2. 8 0 . n3 = 108. l::.. i=1 For example.1. Nk)T has a multinomial.4. 3.4. (3) round green.48 in this case. We will investigate how to test H : 0 E 8 0 versus K : 0 ¢:. Here testing the adequacy of the HardyWeinberg model means testing H : 8 E 8 0 versus K : 8 E . Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. If we assume the seeds are produced independently.k. LOi=l}. n4 = 32.25 + (2.
.. . i=l k If 11 ~ e( 11) is differentiable in each coordinate. j Oj is." For instance.q' Moreover. 8) denote the frequency function of N. 2 log . .nk. j = 1.3) Le. .. nk. the log likelihood ratio is given by log 'x(nb .1). .4. we obtain the Rao statistic for the composite multinomial hypothesis by replacing eOj in (6. .. To avoid trivialities we assume q < k . That is.4.2) by ej(ij). thus..2~(1 .. . approximately X.r for specified eOj . where 9j is chosen so that H becomes equivalent to "( e~ .4. Maximizing p(n1.r.3. . . ek(11))T takes £ into 8 0 . .3). j = q + 1. {p(. 8(11)) : 11 E £}. and ij exists.. e~) T ranges over an open subset of Rq and ej = eOj .3 that 2 log . involve restrictions on the e obtained by specifying independence assumptions on i classifications of cases into different categories..X approximately has a xi distribution under H. 8(11)) for 11 E £. then it must solve the likelihood equation for the model. .. Consider the likelihood ratio test for H : e E 8 0 versus K : e 1:. which will be pursued further later in this section. Other examples. .l now leads to the Rao statistic R (8("") n 11 =~ ~ j=l [Ni . .8 0 .. a 11 'r/J .ne j (ij)]2 = 2 ne. we define ej = 9j (8).1.. also equal to Pearson's X2 • .. 8) for 8 E 8 0 is the same as maximizing p( n1.. 8 0 • Let p( n1. Then we can conclude from Theorem 6.. ij = (iiI..X approximately has a X. nk.3 and conclude that. If a maximizing value. under H. .4. r.("") X J 11 I where the righthand side is Pearson's X2 as defined in general by (6. l~j~q. 1 ~ j ~ q or k (6. the Wald statistic based on the parametrization 8 (11) obtained by replacing e by e (ij). .. £ is open. The Rao statistic is also invariant under reparametrization and. nk) = L ndlog(ni/n) log ei(ij)]. by the algebra of Section 6. .4.404 Inference in the Multiparameter Case Chapter 6 8 1 where 8 1 = 8 .4. is a subset of qdimensional space. However.1."" 'r/q) T. 8(11)) = 0. e~ = e2 . to test the HardyWeinberg model we set e~ = e1 . .. nk. i=l t n. We suppose that we can describe 8 0 parametrically as where 11 = ('r/1..~) and test H : e~ = O.q distribution for large n. and the map 11 ~ (e 1(11). . .q) exists.(t)~ei(11)=O. If £ is not open sometimes the closure of £ will contain a solution of (6. The algebra showing Rn(8 0 ) = X2 in Section 6. The Wald statistic is only asymptotically invariant under reparametrization.. ... To apply the results of Section 6. ij satisfies l a a'r/j logp(nl..
TJ). Then a randomly selected individual from the population can be one of four types AB.4) which reduces to a quadratic equation in if. If Ni is the number of offspring of type i among a total of n offspring. specifies that where TJ is an unknown number between 0 and 1. p. For instance. N 4 ) has a M(n. then (Nb .5.4. (}4) distribution. 0 Testing Independence of Classifications in Contingency Tables Many important characteristics have only two categories. A selfcrossing of maize heterozygous on two characteristics (starchy versus sugary. 9(ij) ((2nl + n2) 2n 2n 2.4. To study the relation between the two characteristics we take a random sample of size n from the population. (2nl + n2)(2 3 + n2) . 301). 1} a "onedimensional curve" of the threedimensional parameter space 8. (h.a) with if (2nl + n2)/2n.2.6 that Thus. Because q = 1. HardyWeinberg. {}22. A linkage model (Fisher. do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic A and A and of the second Band B. T() test the validity of the linkage model we would take 8 0 {G(2 + TJ). H is rejected if X2 2:: Xl (1 . (2) sugarygreen.TJ). iTJ) : TJ :. An individual either is or is not inoculated against a disease.• . The likelihood equation (6. We found in Example 2. (6.. k 4. (3) starchywhite. {}21.4. AB. and so on. ( 2n3 + n2) 2) T 2n 2n o Example 6. is or is not a smoker. We often want to know whether such characteristics are linked or are independent. is male or female. Independent classification then means that the events [being an A] and [being a B] are independent or in terms of the B . 1958.1).3) becomes i(1 °:. The Fisher Linkage Model. Denote the probabilities of these types by {}n. (4) starchygreen.4. . . . nl (2+TJ) (n2 + n3) n4 (1TJ) +:ry 0.4. . green base leaf versus white base leaf) leads to four possible offspring types: (1) sugarywhite.4. {}12. ij {}ij = (Bil + Bi2 ) (Blj + B2j ). AB. AB. i(1 ..4 Methods for Discrete Data 405 Example 6.. The only root of this equation in [0. 1] is the desired estimate (see Problem 6. respectively. we obtain critical values from the X~ tables. The results are assembled in what is called a 2 x 2 contingency table such as the one shown.Section 6.
for example N12 is the number of sampled individuals who fall in category A of the first characteristic and category B of the second characteristic.'r/2)) : 0 ::. the likelihood equations (6.4. ( 22 ). 'r/2 to indicate that these are parameters. In fact (Problem 6.'r/1) (1 .406 Inference in the Multiparameter Case Chapter 6 l A 11 The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column.'r/1). Thus. 0 12 . Pearson's statistic is then easily seen to be (6.4. This suggests that X2 may be written as the square of a single (approximately) standard normal variable. because k = 4.4. 'r/2 ::. the (Nij . (1 . 'r/1 (1 .2). q = 2. C j = N 1j + N 2j is the jth column sum. where z tt [R~jl1 2=1 J=l . N 22 )T. N 21 . These solutions are the maximum likelihood estimates. if N = (Nll' N 12 . By our theory if H is true. respectively. We test the hypothesis H : () E 8 0 versus K : 0 f{. Then. 'r/2 (1 .011 + 021 as 'r/1. I}. 8 0 .7) where Ri = Nil + Ni2 is the ith row sum.Tid (n12 + n22) (1 .'r/2). X2 has approximately a X~ distribution.~ + n12) iiI (nll + n2d (nll i72 (n21 + n22) (1 . 'r/1 ::.6) the proportions of individuals of type A and type B. 1. 0 11 . 021 .4.Ti2) (6. we have N rv M(n. 0 ::.3) become .5) whose solutions are Til 'r/2 = (n11+ n 12)/n (nll + n21)/n.4. For () E 8 0 .RiCj/n) are all the same in absolute value and. Here we have relabeled 0 11 + 0 12 . (6. which vary freely. where 8 0 is a twodimensional subset of 8 given by 8 0 = {( 'r/1 'r/2.
. hair color).. b NIb Rl a Nal C1 C2 .a) as a level a onesided test of H : peA I B) = peA I B) (or peA I B) :::. TJj2 are nonnegative and 2:~=1 = 2:~=1 TJj2 = 1. 1 :::. a.. The N ij can be arranged in a a x b contingency table. B. . i :::.. peA I B)) versus K : peA I B) > peA I B).. Z indicates what directions these deviations take.8) Thus. B.4.. . and only if. Positive values of Z indicate that A and B are positively associated (i.g.. a. Next we consider contingency tables for two nonnumerical characteristics having a and b states. B.... 1 :::.. if and only if. Thus. 1). . j = 1. that A is more likely to occur in the presence of B than it would in the presence of B). i = 1.. respectively. it is reasonable to use the test that rejects. (Jij : 1 :::. a.4. b ~ 2 (e.3) that if A and B are independent. B to denote the event that a randomly selected individual has characteristic A. (Jij TJil The hypothesis that the characteristics are assigned independently becomes H : TJil TJj2 for 1 :::. that is. eye color. 1 :::. j :::. . .. . i :::. . b} "J M(n. b where the TJil. b where N ij is the number of individuals of type i for characteristic 1 and j for characteristic 2. i :::. .Section 6. a. Nab Cb Ra n . a. If (Jij = P[A randomly selected individual is of type i for 1 and j for 2]. The X2 test is equivalent to rejecting (twosidedly) if. then {Nij : 1 :::..4 Large Sample Methods for Discrete Data 407 An important altemative form for Z is given by (6. then Z is approximately distributed as N(O. j :::. if X2 measures deviations from independence. Z = v'n[P(A I B) . If we take a sample of size n from a population and classify them according to each characteristic we obtain a vector N ij . . Z ~ z(l. j :::. .. 1 Nu 2 N12 . It may be shown (Problem 6.peA I B)] [~(B) ~(~)ll/2 peA) peA) where P is the empirical distribution and where we use A.4.. b).e.4. peA I B) = peA I B).. Therefore.
Examples are (1) medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0).1 we considered linear models that are appropriate for analyzing continuous responses {Yi} that are. The log likelihood of 7r = (7rl.7r)] are also used in practice." We assume that the distribution of the response Y depends on the known covariate vector ZT. (2) election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0). and the loglog transform g2(7r) = 10g[log(1 . with Xi binomial. we obtain what is called the logistic linear regression model where . (6. 1) d. 6. perhaps after a transfonnation. B(mi' 7ri)' where 7ri = 7r(Zi) is the probability of success for a case with covariate vector Zi.. The argument is left to the problems as are some numerical applications. log(l "il] + t.10.) over the whole range of z is impossible. [Xi C :i"J log + m. we call Y = 1 a "success" and Y = 0 a "failure.. Other transforms."F~1 Yij where Yij is the response on the jth of the mi trials in block i. or (3) market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0).f.Li = L. we observe the number of successes Xi = L. .408 Inference in the Multiparameter Case Chapter 6 with row and column sums as indicated. 1 :::.7r)].4. i :::. In this section we will consider Bernoulli responses Y that can only take on the values 0 and 1.8 as the canonical parameter zT TJ = g ( 7r) = log [7r / (1 .6. (6. approximately normally distributed ~ for known constants {Zij} and and whose means are modeled as J. In this section we assume that the data are grouped or replicated so that for each fixed i. we observe independent Xl' .{3p. .3 Logistic Regression for Binary Responses In Section 6. a simple linear representation zT ~ for 7r(.4. Thus. log ( ~: ) . Instead we turn to the logistic transform g( 7r).11) When we use the logit transfonn g(7r)."" 7rk)T based on X = (Xl"'" Xk)T is t. Next we choose a parametric model for 7r(z) that will generate useful procedures for analyzing experiments with binary responses.9) which has approximately a X(al)(bl) distribution under H.~ =1 Zij {3j = unknown parameters {31. such as the probit gl(7r) = <I>1(7r) where <I> is the N(O.4. whicll we introduced in Example 1. usually called the log it. ' XI..4. k. As is typical. Maximum likelihood and dimensionality calculations similar to those for the 2 x 2 table show that Pearson's X2 for the hypothesis of independence is given by (6. Because 7r(z) varies between 0 and 1..
it follows by Theorem 5.. ( 1 . mi 2mi Xi 1 Xi 1 = ..f3p )T is. .3. . W is estimated using Wo = diag{mi1f.(1.4. As the initial estimate use (6. k m} (V. the solution to this equation exists and gives the unique MLE 73 of f3.1fi )* = 1 .L) = O.log(l:'.4 Large Sample Methods for Discrete Data 409 The special case p = 2.. (6.3..! ) ( mi .l..1. The coordinate ascent iterative procedure of Section 2.16) 1fi) 1]. It follows that the NILE of f3 solves E f3 (Tj ) = T . IN(f3) of f3 = (f31.3. in (6. Although unlike coordinate ascent NewtonRaphson need not converge. Tp) T. Then E(Tj ) = 2:7=1 Zij J. Similarly.3. the empirical logistic transform. Zi The log likelihood l( 7r(f3)) = == k (1. I N(f3) = f. Because f3 = (ZTz) 1 ZT 'T] and TJi = 10g[1fi (1 1fi) in TJi has been replaced by 730 is a plugin estimate of f3 where 1fi and (1 1f.4.J. Alternatively with a good initial value the NewtonRaphson algorithm can be employed..+ . Zi)T is the logistic regression model of Problem 2.4 the Fisher information matrix is I(f3) = ZTWZ (6.15) Vi = log X+..4.1 and by Proposition 3. The condition is sufficient but not necessary for existencesee Problem 2.14) where W = diag{ mi1fi(l1fi)}kxk.13) By Theorem 2. We let J. . .4.1 applies and we can conclude that if 0 < Xi < mi and Z has rank p.4.J) SN(O.14). k ( ) (6. p (Jj Tj  ~ mi loge1+ exp{Zif3} ) + ~ log '. i ::.4. or Ef3(Z T X) = ZTX.3. j = 1.1fi)*}' + 00. we can guarantee convergence with probability tending to 1 as N + 00 as follows.3.Li.p. 7r Using the 8method.~ Xi 2 1 +"2 ' (6. j Thus.2 can be used to compute the MLE of f3.4.Li = E(Xi ) = mi1fi. where Z = IIZij Ilrnxp is the design matrix.12) where T j = 2:7=1 ZijXi and we make the dependence on N explicit.+ . Note that IN(f3) is the log likelihood of a pparameter canonical exponential model with parameter vector f3 and sufficient statistic T = (T1' . Theorem 2.4. mi 2mi Here the adjustment 1/2mi is used to avoid log 0 and log 00. the likelihood equations are just ZT(X . . if N = 2:7=1 mi..[ i(l7r iW')· ..Section 6.. that if m 1fi > 0 for 1 ::.
Testing

In analogy with Section 6.1, linear subhypotheses are important. We let

    omega = { eta : eta_i = SUM_{j=1}^p z_ij beta_j, 1 <= i <= k, beta in R^p }

and let r be the dimension of omega. We want to contrast omega to the case where there are no restrictions on eta; that is, we set Omega = R^k and consider eta in Omega. In this case the likelihood is a product of independent binomial densities, and the MLEs of pi_i and mu_i are X_i/m_i and X_i, i = 1, ..., k. The LR statistic 2 log lambda for testing H : eta in omega versus K : eta in Omega - omega is denoted by D(X, mu-hat), where mu-hat is the MLE of mu for eta in omega. From (6.4.11) and (6.4.12),

    D(X, mu-hat) = 2 SUM_{i=1}^k [ X_i log(X_i/mu-hat_i) + X_i' log(X_i'/mu-hat_i') ],     (6.4.17)

where X_i' = m_i - X_i and mu-hat_i' = m_i - mu-hat_i. D(X, mu-hat) measures the distance between the fit mu-hat based on the model omega and the data X. It follows (Problem 6.4.15) that 2 log lambda has an asymptotic X2_{k-r} distribution for eta in omega as m_i -> infinity, i = 1, ..., k, k < infinity.

If omega_0 is a q-dimensional linear subspace of omega with q < r, then we can form the LR statistic for H : eta in omega_0 versus K : eta in omega - omega_0,

    2 log lambda = D(X, mu-hat_0) - D(X, mu-hat) = 2 SUM_{i=1}^k [ X_i log(mu-hat_i/mu-hat_0i) + X_i' log(mu-hat_i'/mu-hat_0i') ],     (6.4.18)

where mu-hat_0 is the MLE of mu under H and mu-hat_0i' = m_i - mu-hat_0i. By Problem 6.4.15, 2 log lambda has an asymptotic X2_{r-q} distribution as m_i -> infinity, i = 1, ..., k. Here is a special case.

Example 6.4.3. The Binomial One-Way Layout. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of m_i patients and recording the number X_i of patients that recover. The samples are collected independently and we observe X_1, ..., X_k independent with X_i ~ B(m_i, pi_i). For a second example, suppose we want to compare k different locations with respect to the percentage that have a certain attribute, such as the intention to vote for or against a certain proposition. We obtain k independent samples, one from each location, and for the ith location count the number X_i among m_i that have the given attribute.

To get expressions for the MLEs of pi and mu, recall from Example 1.6.8 that the inverse of the logit transform g is the logistic distribution function

    g^{-1}(y) = e^y / (1 + e^y).

Thus, under omega, the MLE of pi_i is pi-hat_i = g^{-1}( SUM_{j=1}^p z_ij beta-hat_j ).
This model corresponds to the one-way layout of Section 6.1 and, as in that section, an important hypothesis is that the populations are homogeneous. Thus, we test

    H : pi_1 = pi_2 = ... = pi_k = pi,   pi in (0, 1),

versus the alternative that the pi's are not all equal. Under H the log likelihood in canonical exponential form is

    beta T - N log(1 + exp{beta}) + SUM_{i=1}^k log( m_i choose X_i ),

where T = SUM_{i=1}^k X_i, N = SUM_{i=1}^k m_i, and beta = log[pi/(1 - pi)]. Using Z as given in the one-way layout in Example 6.4.3, it follows from Theorem 2.3.1 that if 0 < T < N the MLE exists and is the solution of (6.4.13); we find that the MLE of pi under H is pi-hat = T/N. The LR statistic is given by (6.4.18) with mu-hat_0i = m_i pi-hat. The Pearson statistic

    X2 = SUM_{i=1}^k (X_i - m_i pi-hat)^2 / [ m_i pi-hat (1 - pi-hat) ]

is a Wald statistic, and the X2 test is equivalent asymptotically to the LR test (Problem 6.4.14).
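A minimal sketch, with hypothetical data, of the homogeneity test just derived: the deviance statistic (6.4.18) with mu-hat_0i = m_i pi-hat and the Pearson X2, both referred to a chi-square distribution with k - 1 degrees of freedom.

import numpy as np
from scipy.stats import chi2

def homogeneity_tests(m, X):
    k = len(m)
    pi_hat = X.sum() / m.sum()            # MLE of pi under H
    mu0 = m * pi_hat
    Xp = m - X                            # failures X_i'
    # Deviance of the saturated fit mu-hat_i = X_i to mu_0; terms with
    # X_i = 0 or X_i = m_i contribute 0 via the convention 0*log 0 = 0.
    def xlogx(a, b):
        return np.where(a > 0, a * np.log(np.where(a > 0, a / b, 1.0)), 0.0)
    dev = 2 * (xlogx(X, mu0) + xlogx(Xp, m - mu0)).sum()
    pearson = ((X - mu0) ** 2 / (mu0 * (1 - pi_hat))).sum()
    df = k - 1
    return dev, pearson, chi2.sf(dev, df), chi2.sf(pearson, df)

m = np.array([30., 30., 40., 50.])
X = np.array([10., 14., 15., 30.])
print(homogeneity_tests(m, X))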
Summary. We used the large sample testing results of Section 6.3 to find tests for important statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's X2," which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H. When the hypothesis is that the multinomial parameter is in a q-dimensional subset of the (k - 1)-dimensional parameter space Theta, the Rao statistic is again of the Pearson X2 form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, the Pearson statistic is shown to have a simple intuitive form. We also considered logistic regression for binary responses, in which the logit transformation of the probability of success is modeled to be a linear function of covariates. We derived the likelihood equations, discussed algorithms for computing MLEs, and gave the LR test. In the special case of testing equality of k binomial parameters, we gave the MLEs and X2 test explicitly.

6.5 GENERALIZED LINEAR MODELS

In Sections 6.1 and 6.4.3 we considered experiments in which the mean mu_i of a response Y_i is expressed as a function of a linear combination

    xi_i = z_i^T beta = SUM_{j=1}^p z_ij beta_j

of covariate values. In particular, in the case of a Gaussian response, mu_i = xi_i, whereas in the Bernoulli case, pi_i = g^{-1}(xi_i), where g^{-1}(y) is the logistic distribution function. McCullagh and Nelder (1983, 1989) synthesized a number of previous generalizations of the linear model, most importantly the log linear model developed by Goodman and Haberman. See Haberman (1974).

The generalized linear model with dispersion depending only on the mean

The data consist of an observation (Z, Y), where Y = (Y_1, ..., Y_n)^T is n x 1 and Z^T_{p x n} = (z_1, ..., z_n) with z_i = (z_i1, ..., z_ip)^T nonrandom, and Y has density p(y, eta) given by (6.5.1). Typically, eta is not free in E, the natural parameter space of the n-parameter canonical exponential family (6.5.1), but is in a subset of E obtained by restricting eta_i to be of the form eta_i = h( SUM_{j=1}^p z_ij beta_j ), where h is a known function. As we know from Corollary 1.6.1, the mean mu of Y is related to eta via mu = Adot(eta). Note that if Adot is one-to-one, then mu determines eta and thereby Var(Y) = Adotdot(eta). Typically A(eta) = SUM_i A_0(eta_i) for some A_0, in which case mu_i = A_0'(eta_i).

We assume that there is a one-to-one transform g(mu) of mu, called the link function, such that

    g(mu) = SUM_{j=1}^p beta_j z^(j) = Z beta,

where z^(j) = (z_1j, ..., z_nj)^T is the jth column vector of Z and beta = (beta_1, ..., beta_p)^T, which is p x 1. Here g(mu) is applied coordinatewise; that is, g(mu) is of the form (g(mu_1), ..., g(mu_n))^T.

Canonical links

The most important case corresponds to the link being canonical, that is,

    g = Adot^{-1},   or   eta = SUM_{j=1}^p beta_j z^(j) = Z beta.

In this case, the GLM is the canonical subfamily of the original exponential family generated by Z^T Y, and g is also called the canonical link function. Special cases are:

(i) The linear model with known variance. Here the Y_i are independent Gaussian with known variance 1 and means mu_i = SUM_{j=1}^p z_ij beta_j. The model is a GLM with canonical link g(mu) = mu, the identity.

(ii) Log linear models.
Suppose (Y_1, ..., Y_p)^T is M(n, theta_1, ..., theta_p), with theta_j > 0 and SUM_{j=1}^p theta_j = 1. If we take g(mu) = (log mu_1, ..., log mu_p)^T, the link is canonical, and eta_j = log theta_j, 1 <= j <= p, are canonical parameters. The models we obtain are called log linear; see Haberman (1974) for an extensive treatment. Suppose, for example, that Y = ||Y_ij||, 1 <= i <= a, 1 <= j <= b, so that Y_ij is the indicator of, say, classification i on characteristic 1 and j on characteristic 2. Then theta = ||theta_ij||, 1 <= i <= a, 1 <= j <= b, and the log linear model corresponding to

    log theta_ij = alpha_i + beta_j,   1 <= i <= a, 1 <= j <= b,

where alpha_i, beta_j are free (unidentifiable) parameters, is that of independence, theta_ij = theta_{i+} theta_{+j}, where theta_{i+} = SUM_{j=1}^b theta_ij and theta_{+j} = SUM_{i=1}^a theta_ij. See Haberman (1974) for a further discussion.

The log linear label is also attached to models obtained by taking the Y_i independent Bernoulli (theta_i), 0 < theta_i < 1, with canonical link g(theta) = log[theta/(1 - theta)]. This is just the logistic linear model of Section 6.4.3, as seen in Example 1.6.8.

Algorithms

If the link is canonical then, by Theorem 2.3.1, if maximum likelihood estimates beta-hat exist, they necessarily uniquely satisfy the equation

    Z^T Y = Z^T E_{beta-hat} Y = Z^T Adot(Z beta-hat),

or

    Z^T ( Y - Adot(Z beta-hat) ) = 0.                                 (6.5.2)

It is interesting to note that (6.5.2) can be interpreted geometrically in somewhat the same way as in the Gaussian linear model: the "residual" vector Y - mu(beta-hat) is orthogonal to the column space of Z. But, in general, mu(beta-hat) is not a member of that space.

The coordinate ascent algorithm can be used to solve (6.5.2) (or ascertain that no solution exists). With a good starting point beta_0 one can achieve faster convergence with the Newton-Raphson algorithm of Section 2.4.3. In this case, Newton-Raphson coincides with Fisher's method of scoring described in Problem 6.5.1.
Let Delta_{m+1} == beta_{m+1} - beta_m, where beta_m is the mth iterate. Then Delta_{m+1} satisfies the equation

    (Z^T W_m Z) Delta_{m+1} = Z^T ( Y - Adot(Z beta_m) ),   W_m = Adotdot(Z beta_m);     (6.5.3)

that is, the correction Delta_{m+1} is given by the weighted least squares formula (2.2.20) when the data are the residuals from the fit at stage m, the variance-covariance matrix is W_m, and the regression is on the columns of W_m Z (Problem 6.5.1). In this context, the algorithm is also called iterated weighted least squares. If the starting point is close enough to the true value of beta, then with probability tending to 1 the algorithm converges to the MLE if it exists.
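A minimal sketch of iterated weighted least squares for a canonical GLM, illustrated for Poisson regression with log link (so mu = Adot(eta) = e^eta and W = diag(Adotdot(eta)) = diag(mu)). All data here are hypothetical.

import numpy as np

def irls_poisson(Z, Y, n_iter=25):
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        eta = Z @ beta
        mu = np.exp(eta)                       # Adot(Z beta)
        W = mu                                 # Adotdot for the Poisson family
        # Newton-Raphson step = weighted least squares on the residuals:
        # (Z' W Z) delta = Z'(Y - mu), cf. (6.5.2)-(6.5.3).
        delta = np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (Y - mu))
        beta = beta + delta
    return beta

rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(100), rng.normal(size=100)])
Y = rng.poisson(np.exp(0.5 + 0.8 * Z[:, 1]))
print(irls_poisson(Z, Y))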
Testing in GLM

Testing hypotheses in GLM is done via the LR statistic. As in the linear model, we can define the biggest possible GLM M of the form (6.5.1), for which p = n. In that case the MLE of mu is mu-hat_M = (Y_1, ..., Y_n)^T (assume that Y is in the interior of the convex support of {y : p(y, eta) > 0}). Write eta(.) for Adot^{-1}. The LR statistic for the hypothesis that mu = mu_0 within M is

    D(Y, mu_0) == 2 [ l(Y, eta(Y)) - l(Y, eta(mu_0)) ],               (6.5.4)

which is always >= 0. This quantity, called the deviance between Y and mu_0, can be thought of as a "measure" of (squared) distance between Y and mu_0. The LR statistic for H : mu in omega_0 is just

    D(Y, mu-hat_0) == inf{ D(Y, mu) : mu in omega_0 },                (6.5.5)

where mu-hat_0 is the MLE of mu in omega_0. The LR statistic for H : mu in omega_0 versus K : mu in omega_1 - omega_0, with omega_1 containing omega_0, is

    D(Y, mu-hat_0) - D(Y, mu-hat_1),

where mu-hat_1 is the MLE under omega_1. If omega_0 is contained in omega_1 we can write

    D(Y, mu-hat_0) = D(Y, mu-hat_1) + Delta(mu-hat_0, mu-hat_1),      (6.5.6)

a decomposition of the deviance between Y and mu-hat_0 as the sum of two nonnegative components, the deviance of Y to mu-hat_1 and Delta(mu-hat_0, mu-hat_1), each of which can be thought of as a squared distance between its arguments. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6.1. Unfortunately Delta is not equal to D in general, except in the Gaussian case. Formally, if omega_0 is a GLM of dimension p and omega_1 one of dimension q > p, with canonical links, then Delta(mu-hat_0, mu-hat_1) is thought of as being asymptotically X2_{q-p}; this can be made precise for stochastic GLMs, obtained by conditioning on Z_1, ..., Z_n, under the theory discussed in what follows.
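As a quick check of the Gaussian claim (our computation, not from the text): for the Gaussian linear model with known variance $\sigma_0^2$ one has $l(Y,\eta(\mu)) = -|Y-\mu|^2/2\sigma_0^2 + \mathrm{const}$, so

\[
D(Y,\mu_0) \;=\; 2\,[\,l(Y,\eta(Y)) - l(Y,\eta(\mu_0))\,] \;=\; \frac{|Y-\mu_0|^2}{\sigma_0^2},
\]

and for nested linear subspaces the decomposition (6.5.6) is the Pythagorean identity $|Y-\hat\mu_0|^2 = |Y-\hat\mu_1|^2 + |\hat\mu_1-\hat\mu_0|^2$ of the analysis of variance, which is why $\Delta = D$ in this case.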
Asymptotic theory for estimates and tests

If (Z_1, Y_1), ..., (Z_n, Y_n) can be viewed as a sample from a population and the link is canonical, the theory of Sections 6.2 and 6.3 applies straightforwardly in view of the general smoothness properties of canonical exponential families. For instance, if we assume the covariates in logistic regression with canonical link to be stochastic and take Z_i as having marginal density q_0, then (Z_1, Y_1), ..., (Z_n, Y_n) has density

    PROD_{i=1}^n q_0(z_i) exp{ (z_i^T beta) y_i - A_0(z_i^T beta) } h(y_i).     (6.5.8)

This is not unconditionally an exponential family in view of the A_0(z_i^T beta) term. However, there are easy conditions under which the conditions of Theorems 6.2.2 and 6.3.3 hold (Problem 6.5.2), so that the MLE beta-hat asymptotically exists, is unique and consistent, and

    sqrt(n) (beta-hat - beta) -> N(0, I_1^{-1}(beta)).                (6.5.9)

What is I_1(beta)? The efficient score function (d/d beta_i) log p(z, y, beta) is (y - A_0'(z^T beta)) z_i, and so

    I_1(beta) = E( A_0''(Z^T beta) Z Z^T ),

which can be estimated by its empirical counterpart with beta-hat substituted for beta. If we wish to test hypotheses such as H : beta_1 = ... = beta_d = 0, d < p, we can calculate

    2 log lambda = 2 [ l_n(beta-hat) - l_n(beta-hat_H) ],             (6.5.10)

where beta-hat_H is the (p x 1) MLE for the GLM with beta^(0) = (0, ..., 0, beta_{d+1}, ..., beta_p)^T, and can conclude that the statistic of (6.5.10) is asymptotically X2_d under H. Similar conclusions follow for the Wald and Rao statistics. Note that these tests can be carried out without knowing the density q_0 of Z_1. These conclusions remain valid for the usual situation in which the z_i are not random, but their proof depends on asymptotic theory for independent nonidentically distributed variables, which we postpone to Volume II.

General link functions

Links other than the canonical one can be of interest. For instance, if in the binary data regression model of Section 6.4.3 we take g(mu) = Phi^{-1}(mu), so that pi_i = Phi(z_i^T beta), we obtain the so-called probit model. From the point of view of analysis for fixed n, noncanonical links can cause numerical problems because the models are now curved rather than canonical exponential families. Existence of MLEs and convergence of algorithms become more difficult questions, and so canonical links tend to be preferred. Cox (1970) considers the variance stabilizing transformation, which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. As he points out, the results of analyses with these various transformations over the range 0.1 <= pi <= 0.9 are rather similar.

The generalized linear model with dispersion

The GLMs considered so far force the variance of the response to be a function of its mean. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6.5.1) depend on an additional scalar parameter tau. It is customary to write the model as, for c(tau) > 0,

    p(y, eta, tau) = exp{ c^{-1}(tau) ( eta^T y - A(eta) ) } h(y, tau).     (6.5.11)

Because INT p(y, eta, tau) dy = 1,

    A(eta)/c(tau) = log INT exp{ c^{-1}(tau) eta^T y } h(y, tau) dy.     (6.5.12)

The left-hand side of (6.5.12) is of product form A(eta)[1/c(tau)], whereas the right-hand side cannot always be put in this form. However, when it can, then it is easy to see that

    E(Y) = Adot(eta)                                                  (6.5.13)
    Var(Y) = c(tau) Adotdot(eta),                                     (6.5.14)

so that the variance can be written as the product of a function of the mean and a general dispersion parameter. Important special cases are the N(mu, sigma^2) and gamma (p, lambda) families. For further discussion of this generalization see McCullagh and Nelder (1983, 1989).
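For instance (our verification, using standard facts about the normal density), the $N(\mu,\sigma^2)$ family fits (6.5.11) with $\eta=\mu$, $A(\eta)=\eta^2/2$, $c(\tau)=\tau=\sigma^2$, and $h(y,\tau)=(2\pi\tau)^{-1/2}e^{-y^2/2\tau}$:

\[
p(y,\eta,\tau)=\exp\Big\{\frac{1}{\tau}\Big(\eta y-\frac{\eta^2}{2}\Big)\Big\}h(y,\tau),
\qquad
E(Y)=\dot A(\eta)=\mu,
\qquad
\operatorname{Var}(Y)=c(\tau)\ddot A(\eta)=\sigma^2,
\]

in agreement with (6.5.13) and (6.5.14).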
Summary. We considered generalized linear models, defined as canonical exponential models in which the mean vector of the vector Y of responses can be written as a function, called the link function, of a linear predictor of the form SUM_j beta_j z^(j), where the z^(j) are observable covariate vectors and beta is a vector of regression coefficients. We considered the canonical link function, which corresponds to the model in which the canonical exponential model parameter equals the linear predictor, and we discussed algorithms for computing MLEs of beta. In the random design case, we used the asymptotic results of the previous sections to develop large sample estimation results, confidence procedures, and tests.

6.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS

As we most recently indicated in Example 6.2.2, the distributional and implicit structural assumptions of parametric models are often suspect. In Example 6.2.2 we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian. We found that if we assume an error density f_0 symmetric about 0, the resulting MLEs for beta, optimal under f_0, continue to estimate beta as defined by (6.2.19) with e_i i.i.d., even if the true errors are N(0, sigma^2) or, in fact, have any distribution symmetric about 0 (Problem 6.2.5). That is, if we consider the semiparametric model

    P_1 = { P : (Z^T, Y)^T ~ P that satisfies E_P(Y | Z) = Z^T beta, E_P Z^T Z nonsingular, E_P Y^2 < infinity },

then the LSE is, in fact, still a consistent, asymptotically normal estimate of beta, and for this model it turns out that the LSE is optimal in a sense to be discussed in Volume II. There is another semiparametric model,

    P_2 = { P : (Z^T, Y)^T ~ P given by (6.2.19) with e_i i.i.d. with density f for some f symmetric about 0 },

and, roughly speaking, the right thing to do if one assumes the (Z_i, Y_i) are i.i.d. from P_2 is to "act as if the model were the one given in Example 6.2.2." Of course, if P_3 is the nonparametric model where we assume only that (Z, Y) ~ P and if we are interested in estimating the best linear predictor mu_L(Z) of Y given Z, then the LSE remains a consistent, asymptotically normal estimate of its coefficients, and under further mild conditions on P so is any estimate solving the corresponding estimating equations. Furthermore, for estimating mu_L(Z) in a submodel of P_3 with

    Y = Z^T beta + a_0(Z) e,                                          (6.6.1)

where e is independent of Z but a_0(Z) is not constant and a_0 is assumed known, suitably weighted least squares estimates are appropriate. These issues, among them the Gauss-Markov theorem, are discussed below. Another, even more important, set of questions, having to do with selection between nested models of different dimension, is touched on in Problem 6.6.8 but otherwise also postponed to Volume II.
Robustness in Estimation

We drop the assumption that the errors e_1, ..., e_n in the linear model are normal. Instead we assume the Gauss-Markov linear model

    Y = Z beta + e,   E(e) = 0,   Var(e_i) = sigma^2,   Cov(e_i, e_j) = 0, i != j,     (6.6.2)

where Z is an n x p matrix of constants, beta is p x 1, and Y, e are n x 1. Many of the properties stated for the Gaussian case carry over to the Gauss-Markov case.

Proposition 6.6.1. Suppose the Gauss-Markov linear model (6.6.2) holds. Then beta-hat and mu-hat are still unbiased, Var(mu-hat) = sigma^2 H, Var(beta-hat) = sigma^2 (Z^T Z)^{-1}, Var(e-hat) = sigma^2 (I - H), s^2 is an unbiased estimate of sigma^2, and, if p = r, the first two conclusions of the corresponding Gaussian-model theorem of Section 6.1 are still valid.

The optimality of the estimates mu-hat_i, beta-hat_j of Section 6.1, and of the LSE in general, still holds when they are compared to other linear estimates.

Theorem 6.6.1. (Gauss-Markov). Suppose the Gauss-Markov linear model (6.6.2) holds. Then, for any parameter of the form alpha = SUM_{i=1}^n a_i mu_i for some constants a_1, ..., a_n, the estimate alpha-hat = SUM_{i=1}^n a_i mu-hat_i has uniformly minimum variance among all unbiased estimates linear in Y_1, ..., Y_n.

Proof. Let a-tilde stand for any estimate linear in Y_1, ..., Y_n. Because E(e_i) = 0 and Cov(Y_i, Y_j) = Cov(e_i, e_j), the mean of a-tilde = SUM_i d_i Y_i depends only on the mu_i, and

    Var_GM(a-tilde) = SUM_{i=1}^n d_i^2 Var(e_i) + 2 SUM_{i<j} d_i d_j Cov(e_i, e_j) = Var_G(a-tilde),

where Var_G(a-tilde) stands for the variance computed under the Gaussian assumption that e_1, ..., e_n are i.i.d. N(0, sigma^2) and Var_GM refers to the variance computed under the Gauss-Markov assumptions. Because alpha-hat, in addition to being UMVU in the normal case, is unbiased, Var_G(alpha-hat) <= Var_G(a-tilde) for all unbiased linear a-tilde; and because Var_G = Var_GM for all linear estimators, the result follows.

Note that the preceding result and proof are similar to Theorem 1.4.3, where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case. In fact, our current mu-hat coincides with the empirical plug-in estimate of the optimal linear predictor (1.4.14); see Problem 6.1.3.

Example 6.6.1. One Sample (continued). In Example 6.1.1, mu = beta_1 and mu-hat = beta-hat_1 = Y-bar. The Gauss-Markov theorem shows that Y-bar, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with E Y_i^2 < infinity.
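A minimal simulation sketch (ours, with arbitrary weights) of the Gauss-Markov conclusion in the one-sample case: among linear unbiased estimates of mu, the equal-weight average Y-bar has the smallest variance whatever the mean-zero, homoscedastic error law. The competing weights below are arbitrary but sum to one, which makes the competitor unbiased.

import numpy as np

rng = np.random.default_rng(1)
n, reps, mu = 10, 100_000, 2.0
w = np.arange(1, n + 1, dtype=float)
w /= w.sum()                              # a linear unbiased competitor
for errors in ("uniform", "centered exponential"):
    if errors == "uniform":
        E = rng.uniform(-1, 1, size=(reps, n))
    else:
        E = rng.exponential(1.0, size=(reps, n)) - 1.0
    Y = mu + E
    print(errors, Y.mean(axis=1).var(), (Y @ w).var())  # Var(Y-bar) is smaller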
However, although Y-bar is unbiased (Problem 3.4.12), for n large Y-bar has a larger variance (Problem 6.6.5) than the nonlinear estimate median{Y_1, ..., Y_n} when the density of Y is the Laplace density

    f(y) = (lambda/2) exp{ -lambda |y - mu| },   y in R, mu in R, lambda > 0.

Thus, a major weakness of the Gauss-Markov theorem is that it only applies to linear estimates.

Remark 6.6.1. Another weakness is that the theorem only applies to the homoscedastic case where Var(Y_i) is the same for all i. Suppose we have a heteroscedastic version of the linear model where E(e_i) = 0 but Var(e_i) depends on i. If the Var(e_i) are known, we can use the Gauss-Markov theorem to conclude that the weighted least squares estimates of Section 2.2.1 are optimal among linear unbiased estimates. If the Var(e_i) are unknown, our estimates are not optimal even in the class of linear estimates.
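A minimal simulation sketch of the caveat just noted: for Laplace errors the nonlinear sample median beats the sample mean, even though the mean is optimal among linear unbiased estimates.

import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 50_000
E = rng.laplace(0.0, 1.0, size=(reps, n))   # density (1/2) exp(-|y|), variance 2
print("Var(mean):  ", E.mean(axis=1).var())         # approximately 2/n
print("Var(median):", np.median(E, axis=1).var())   # approximately 1/n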
Robustness of Tests

In Section 5.3 we investigated the robustness of the significance levels of t tests for the one- and two-sample problems using asymptotic and Monte Carlo methods. Now we will use asymptotic methods to investigate the robustness of levels more generally. From the theory developed in Section 6.2 we know that if Psi(., theta) = Dl(., theta) and P is true, we expect that theta-hat_n -> theta(P), which is the unique solution of

    INT Psi(x, theta) dP(x) = 0.                                      (6.6.3)

Consider H : theta = theta_0 and the Wald test statistic

    T_W = n (theta-hat - theta_0)^T I(theta_0) (theta-hat - theta_0).     (6.6.4)

If theta(P) != theta_0, evidently T_W -> infinity. But if theta(P) = theta_0, then sqrt(n)(theta-hat - theta(P)) -> N(0, Sigma(Psi, P)), with Sigma given by Section 6.2. Because Sigma(Psi, P) != I^{-1}(theta_0) in general when P does not belong to the model, the asymptotic distribution of T_W is not X2_p. This observation holds for the LR and Rao tests as well, but see Problem 6.6.2. There is an important special case where all is well: the linear model we have discussed in Example 6.2.1.

Example 6.6.2. The Linear Model with Stochastic Covariates. Suppose (Z_1, Y_1), ..., (Z_n, Y_n) are i.i.d. as (Z, Y), where Z is a (p x 1) vector of random covariates and we model the relationship between Z and Y as

    Y = Z^T beta + e,                                                 (6.6.5)

with the distribution P of (Z, Y) such that e and Z are independent, E_P e = 0, E_P e^2 < infinity. We consider, say, H : beta = 0. When e is N(0, sigma^2), the LR, Wald, and Rao tests all are equivalent to the F test: Reject if

    T_n = beta-hat^T Z_(n)^T Z_(n) beta-hat / p s^2 >= f_{p,n-p}(1 - alpha),     (6.6.6)

where beta-hat is the LSE, Z_(n) = (Z_1, ..., Z_n)^T, and

    s^2 = (n - p)^{-1} SUM_{i=1}^n (Y_i - Z_i^T beta-hat)^2 = (n - p)^{-1} |Y - Z_(n) beta-hat|^2.

Now, even if e is not Gaussian, beta(P) specified by (6.6.3) equals beta in (6.6.5), sigma^2(P) = Var_P(e), and it is still true by Theorem 6.2.1 that

    sqrt(n) (beta-hat - beta) -> N(0, sigma^2(P) [E(Z Z^T)]^{-1}).    (6.6.7)

Moreover, by the law of large numbers,

    n^{-1} Z_(n)^T Z_(n) -> E(Z Z^T),   s^2 -> sigma^2(P).            (6.6.8)

It follows by Slutsky's theorem that, under H : beta = 0, the limiting distribution of p T_n is X2_p. Because p f_{p,n-p}(1 - alpha) -> x_p(1 - alpha), the test (6.6.6) has asymptotic level alpha even if the errors are not Gaussian. Hence, confidence procedures based on approximating Var(sqrt(n) beta-hat) by

    [ n^{-1} Z_(n)^T Z_(n) ]^{-1} s^2                                 (6.6.9)

are still asymptotically of correct level. This kind of robustness also holds for H : theta_{q+1} = theta_{0,q+1}, ..., theta_p = theta_{0,p}, or more generally for beta in L_0 + beta_0, a q-dimensional affine subspace of R^p. It is intimately linked to the fact that even though the parametric model on which the test was based is false, (i) the set of P satisfying the hypothesis remains the same and (ii) the (asymptotic) variance of beta-hat is the same as under the Gaussian model.
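A minimal simulation sketch of the level robustness just established: the F test (6.6.6) for H : beta = 0, derived under Gaussian errors, holds its level approximately for centered exponential errors as well. The design and sample size are hypothetical.

import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(3)
n, p, alpha, reps = 200, 3, 0.05, 20_000
crit = f_dist.ppf(1 - alpha, p, n - p)
rejections = 0
for _ in range(reps):
    Z = rng.normal(size=(n, p))
    Y = rng.exponential(1.0, size=n) - 1.0        # non-Gaussian errors; H true
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]
    resid = Y - Z @ beta
    s2 = resid @ resid / (n - p)
    Tn = beta @ (Z.T @ Z) @ beta / (p * s2)
    rejections += Tn >= crit
print(rejections / reps)                          # close to alpha = 0.05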
If the first of these conditions fails, then it is not clear what H : theta = theta_0 means anymore. If it holds but the second fails, the theory goes wrong. We illustrate with two final examples.

Example 6.6.3. The Two-Sample Scale Problem. Suppose our model is that X = (X^(1), X^(2)), where X^(1), X^(2) are the independent lifetimes E(lambda_1), E(lambda_2) of paired pieces of equipment, respectively. If we take theta_j = 1/lambda_j = E(X^(j)), j = 1, 2, the standard Wald test of H : lambda_1 = lambda_2 is

    "Reject iff n (theta-hat_1 - theta-hat_2)^2 / sigma-hat^2 >= x_1(1 - alpha),"     (6.6.10)

where

    theta-hat_j = n^{-1} SUM_{i=1}^n X_i^(j),  j = 1, 2,              (6.6.11)

and

    sigma-hat = (sqrt(2)/2n) SUM_{i=1}^n ( X_i^(1) + X_i^(2) ).       (6.6.12)

The conditions of Theorem 6.2.1 clearly hold under the exponential model. But suppose now that X^(1) - theta_1 and X^(2) - theta_2 are identically distributed but not exponential. Then H is still meaningful: under H, X^(1) and X^(2) are identically distributed. However,

    sigma-hat -> sqrt(2) E_P(X^(1))                                   (6.6.13)

but sqrt(2) E_P(X^(1)) != sqrt(2 Var_P X^(1)) in general, and the test (6.6.10) does not have asymptotic level alpha in general. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general. Simply replace sigma-hat^2 by

    sigma-tilde^2 = n^{-1} SUM_{i=1}^n { ( X_i^(1) - (X-bar^(1) + X-bar^(2))/2 )^2 + ( X_i^(2) - (X-bar^(1) + X-bar^(2))/2 )^2 }.     (6.6.14)
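A minimal sketch of the corrected test: the Wald statistic with sigma-hat^2 replaced by sigma-tilde^2 of (6.6.14), built from deviations about the pooled mean, so the level is approximately right whether or not the lifetimes are exponential. Data here are hypothetical (Weibull, equal in law under H).

import numpy as np
from scipy.stats import chi2

def robust_two_sample(X1, X2, alpha=0.05):
    n = len(X1)
    pooled = (X1.mean() + X2.mean()) / 2.0
    sig2 = ((X1 - pooled) ** 2 + (X2 - pooled) ** 2).mean()  # sigma-tilde^2
    T = n * (X1.mean() - X2.mean()) ** 2 / sig2
    return T, T >= chi2.ppf(1 - alpha, df=1)

rng = np.random.default_rng(4)
X1 = rng.weibull(1.5, size=200)    # identically distributed under H, not exponential
X2 = rng.weibull(1.5, size=200)
print(robust_two_sample(X1, X2))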
Example 6.6.4. The Linear Model with Stochastic Covariates with e and Z Dependent. Suppose E(e | Z) = 0, so that the parameters beta of (6.6.5) are still meaningful, but e and Z are dependent. For simplicity, let the distribution of e given Z = z be that of a(z) e', where e' is independent of Z; that is, we assume the variances are heteroscedastic. Suppose without loss of generality that Var(e') = 1. Then, by Theorem 6.2.1,

    sqrt(n) (beta-hat - beta) -> N(0, Q),                             (6.6.15)

where

    Q = [E(Z Z^T)]^{-1} E( a^2(Z) Z Z^T ) [E(Z Z^T)]^{-1}.

It is easy to see that our tests of H : beta_2 = ... = beta_d = 0 fail to have correct asymptotic levels in general unless a^2(Z) is constant or, in the stochastic version of the d-sample model in which (I_1, ..., I_d) has a multinomial (lambda_1, ..., lambda_d) distribution, 0 < lambda_j < 1, SUM lambda_j = 1, unless lambda_1 = ... = lambda_d = 1/d (see the problems). For d = 2 this is just the asymptotic solution of the Behrens-Fisher problem, the two-sample problem with unequal sample sizes and variances, discussed in Section 4.9.4.

A solution is to replace Z_(n)^T Z_(n) / s^2 in (6.6.6) by Q-hat^{-1}, where Q-hat is a consistent estimate of Q. The simplest solution, at least for Wald tests, is to use as an estimate of [Var sqrt(n) beta-hat]^{-1} not I(theta-hat) or -n^{-1} D^2 l_n(theta-hat) but the inverse of the so-called "sandwich estimate" (Huber, 1967),

    Q-hat = [ n^{-1} Z_(n)^T Z_(n) ]^{-1} [ n^{-1} SUM_{i=1}^n e-hat_i^2 Z_i Z_i^T ] [ n^{-1} Z_(n)^T Z_(n) ]^{-1},     (6.6.16)

where e-hat_i = Y_i - Z_i^T beta-hat.
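A minimal sketch of the sandwich estimate (6.6.16) for the LSE in the heteroscedastic model just described; the variance of sqrt(n) beta-hat is estimated by A^{-1} B A^{-1} with A = n^{-1} Z'Z and B = n^{-1} SUM e-hat_i^2 z_i z_i', instead of s^2 (n^{-1} Z'Z)^{-1}. Data are hypothetical.

import numpy as np

def sandwich(Z, Y):
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]
    resid = Y - Z @ beta
    n = len(Y)
    A = Z.T @ Z / n
    B = (Z * (resid[:, None] ** 2)).T @ Z / n
    Ainv = np.linalg.inv(A)
    return beta, Ainv @ B @ Ainv         # estimate of Var(sqrt(n) beta-hat)

rng = np.random.default_rng(5)
z = rng.normal(size=500)
Z = np.column_stack([np.ones(500), z])
Y = 1.0 + 2.0 * z + np.abs(z) * rng.normal(size=500)   # a(z) not constant
print(sandwich(Z, Y))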
Summary. We considered the behavior of estimates and tests when the model that generated them does not hold. The Gauss-Markov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i.i.d. N(0, sigma^2) assumption on the errors is replaced by the assumption that the errors have mean zero, identical variances, and are uncorrelated. In the linear model with a random design matrix, we showed that the MLEs and tests generated by the model where the errors are i.i.d. N(0, sigma^2) are still reasonable when the true error distribution is not Gaussian. We also demonstrated that when either the hypothesis H or the variance of the MLE is not preserved when going to the wider model, the LR, Wald, and Rao tests need to be modified to continue to be valid asymptotically, and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.

6.7 PROBLEMS AND COMPLEMENTS

Problems for Section 6.1

1. Show that in the canonical exponential model (6.1.11) with both eta and sigma^2 unknown,
(i) the MLE does not exist if n = r;
(ii) if n >= r + 1, then the MLE (eta-hat_1, ..., eta-hat_r, sigma-hat^2) of (eta_1, ..., eta_r, sigma^2) is (U_1, ..., U_r, n^{-1} SUM_{i=r+1}^n U_i^2).

2. In the model of Problem 6.1.1, use Theorem 3.4.3 to compute the information lower bound on the variance of an unbiased estimator of sigma^2. Compare this bound to the variance of s^2. Specialize to the case Z = (1, ..., 1)^T.
Hint: By A.13, Var(U_i^2) = 2 sigma^4.

3. Show that mu-hat of Example 6.1.2 with p = r coincides with the empirical plug-in estimate of mu_L = (mu_L1, ..., mu_Ln)^T, where mu_Li = mu_Y + (z_i - mu_z)^T beta, mu_Y = beta_1, and beta and mu_z are as defined in (1.4.14). Here the empirical plug-in estimate is based on (z_1', Y_1), ..., (z_n', Y_n), where z_i' = (z_i2, ..., z_ip)^T.

4. Show that lambda-tilde(Y) defined in Remark 6.1.2 coincides with the likelihood ratio statistic lambda(Y) for the sigma^2 known case with sigma^2 replaced by sigma-hat^2.

5. Let Y_i denote the response of a subject at time i. Consider the model (see Example 1.1.5)

    Y_i = theta + e_i,  i = 1, ..., n,

where e_i can be written as e_i = c epsilon_{i-1} + epsilon_i for a given constant c satisfying 0 <= c <= 1, epsilon_0 = 0, and epsilon_1, ..., epsilon_n are independent, identically distributed with mean zero and variance sigma^2 (the e_i are called moving average errors; see Problem 2.2.13).
(a) Find the weighted least squares estimate theta-hat of theta.
(b) Show that if epsilon_i ~ N(0, sigma^2), then theta-hat is the MLE of theta.
(c) Show that Y-bar and theta-hat are unbiased.
(d) Show that Var(theta-hat) <= Var(Y-bar).
(e) Show that Var(theta-hat) < Var(Y-bar) unless c = 0.

6. Show that in the regression example with p = r = 2, mu-hat_i = Y-bar + (z_i2 - z-bar_2) beta-hat_2, i = 1, ..., n.

7. Derive the formula (6.1.28) for the noncentrality parameter delta^2 in the regression example.

8. Derive the formula (6.1.29) for the noncentrality parameter delta^2 in the one-way layout.

9. Show that in the regression example with p = r = 2, the 100(1 - alpha)% confidence interval for beta_1 in the Gaussian linear model is

    beta-hat_1 +/- s t_{n-2}(1 - alpha/2) [ 1/n + z-bar_2^2 / SUM_{i=1}^n (z_i2 - z-bar_2)^2 ]^{1/2}.
..Z. then Var(8k ) is minimized by choosing ni = n/2. (a) Find a level (1 . .k)' C (b) If n is fixed and divisible by p.2) L(Zi2 . a 2 ::. (b) Find a level (1 a) prediction interval for Y (i.a). .Z. Yn ) such that P[t.2).{3i are given by = ~(f.p (~a) (n . (n ..1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::.~ L~=1 nk = (p~:)2 + Lk#i . Assume the linear regression model with p future observation Y to be taken at the pont z. Yn be a sample from a population with mean f. (Zi2 Z. Let Y 1 . The following model is useful in such situations. Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6. (b) Find confidence intervals for 'lj.1. Often a treatment that is beneficial in small doses is harmful in large doses.. (c) If n is fixed and divisible by 2(p . ••.:1 Yi + (3/ 2n) L~= ~ n+ 1 }i. Show that for the estimates (a) Var(a) = +==_=:.. then Var( a) is minimized by choosing ni = n / p. Note that Y is independent of Yl. Y ::. We want to predict the value of a 14. Consider the three estimates T1 = and T3 Y.• 424 10.. n2 = . (d) Give the 100(1 . (a) Show that level (1 .p (1 .e . .2)(Zj2 .. Consider a covariate x.1 and variance a 2 . Yn ). 15. = np n/2(p 1).2. In the oneway layout.p)S2 /x n .• statistics HYl. where n is even.a)% confidence intervals for a and 6k • 12. ~(~ + t33) t31' r = 2. = 2(Y . which is the amount .. 'lj..a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z. then the hat matrix H = (h ij ) is given by 1 n 11.2)2 a and 8k in the oneway layout Var(8k ) . ..I)... .":=::. 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13. Yn . . ::.a) confidence intervals for linear functions of the form {3j . T2 1 (1/ 2n) L. l(Yh . ~ 1 .p)S2 /x n .
Let 8. Problems for Section 6. . 1 2 where <t'CJL. compute confidence intervals for (31.0289. En Xi. and p be as in Problem 6. ( 2 )T is identifiable if and only if r = p. and a response variable Y. a 2 > O}. Check AO. ( 2 ) density. (a) For the following data (from Hald. r :::. fli) where fi1. Yi) and (Xi. nonsingular. and Q = P. 17. hence.2. (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL. •. Show that if C is an n x r matrix of rank r.r2 )l 1 2 7 > O. Y (yield) 3. (33.. (b) Show that the MLE of ({L. p. 1952. (b) Plot (Xi. E :::.A6 when 8 = ({L.emEmts 425 . 653).I::llogxi' You may use X = 0.2 1. 16.r2) (x). (32. p(x. Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1.. P. 8). But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O. : {L E R.95 Yi = e131 e132xi Xf3. 8) = 2. i = 1. ( 2 ). . X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi .0'2)(x ) + E<t'(JL.or dose of a treatment. a 2 <7 2 . fi3 and level 0.0'2) is the N(Il'. ( 2 ) logp(x.• . n are independent N(O.77 167. P = {N({L. ( 2 ). . . fi2. Hint: Because C' is of rank r.1) + 2<t'CJL.. .5 Zi2 * + (33zi2 where Zil = Xi  X.1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL. In the Gaussian linear model show that the parametrization ({3. n. which is yield or production. xC' = 0 => x = 0 for any rvector x.7 Probtems and Lornol.. then the r x r matrix C'C is of rank r and. 7 2 ) does not exist so that A6 doesn't hold. Assume the model = (31 + (32 X i + (33 log Xi + Ei.Section 6. Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133.
2.24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then. show that ([EZZ T jl )(1... T 2 ) based on the first two I (d) Deduce from Problem 6. Z i ~O.2.2.)Z(n)(fj . In Example 6.Pe"H).2. T(X) = Zen). 8.2. H E H}..10 that the estimate Raphson estimates from On is efficient.fii(ji . '". (c) Apply Slutsky's theorem to conclude that and. The MLE of based on (X. .2.i > 1. (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4. that (d) (fj  (3)TZ'f. . (e) Construct a method of moment estimate moments which are ~ consistent. t) is then called a conditional MLE. and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ). /1.. . I . In Example 6. hence.. (I) Combine (aHe) to establish (6. H abstract. In Example 6. show that the assumptions of Theorem 6. ell of (} ~ = (Ji.IY . Establish (6.2.I3). Fill in the details of the proof of Theorem 6. Y).  I .20). (6.(b) The MLE minimizes .. i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric.2. . .1.24). e Euclidean.1) EZ . (fj multivariate nannal distribution with mean 0 and variance.(3) = op(n. n[z(~)~en)jl)' X. (a) In Example 6._p distribution. Hint: !x(x) = !YIZ(Y)!z(z).2.". given Z(n).1.. (P("H) : e E e.2.  (3)T)T has a (".2 are as given in (6. and . 6. 3. I I .2. show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic.426 Inference in the Multiparameter Case Chapter 6 I .2. then ({3. 5.1 show thatMLEs of (3. Hint: (a).21).2 hold if (i) and (ii) hold. (e) Show that 0:2 is unconditionally independent of (ji.2. iT 2 ) are conditional MLEs.1/ 2 ). Show that if we identify X = (z(n). X . b .Zen) (31 2 e ~ 7.'. In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H.
9. Let Y_1, ..., Y_n be real, independent, identically distributed,

    Y_i = mu + a epsilon_i,

where mu in R, a > 0 are unknown and epsilon has known density f > 0 such that if rho(x) = -log f(x), then rho'' > 0 and, hence, rho is strictly convex. Examples are f Gaussian and f(x) = e^{-x}(1 + e^{-x})^{-2} (logistic).
(a) Show that if a = a_0 is assumed known, a unique MLE for mu exists and uniquely solves

    SUM_{i=1}^n rho'( (Y_i - mu) / a_0 ) = 0.

(b) Show that, more generally, a unique MLE for (mu, a) exists and uniquely solves the corresponding pair of likelihood equations.

10. Suppose A0-A4 hold and theta-bar_n is sqrt(n) consistent, that is, theta-bar_n = theta_0 + O_p(n^{-1/2}).
(a) Let theta-hat_n be the first iterate of the Newton-Raphson algorithm for solving (6.2.1) starting at theta-bar_n,

    theta-hat_n = theta-bar_n - [ n^{-1} SUM_{i=1}^n D psi(X_i, theta-bar_n) ]^{-1} [ n^{-1} SUM_{i=1}^n psi(X_i, theta-bar_n) ].

Show that theta-hat_n satisfies (6.2.3).
Hint:

    n^{-1} SUM_{i=1}^n psi(X_i, theta-bar_n) = n^{-1} SUM_{i=1}^n psi(X_i, theta_0) + [ n^{-1} SUM_{i=1}^n D psi(X_i, theta_0) + o_p(1) ] (theta-bar_n - theta_0).

(b) Show that under A0-A4 there exists epsilon > 0 such that with probability tending to 1, the estimating equation has a unique solution in S(theta_0, epsilon), the epsilon-ball about theta_0.
Hint: You may use a uniform version of the inverse function theorem: If g_n : R^d -> R^d are such that:
(i) sup{ |D g_n(theta) - D g(theta)| : theta in S(theta_0, epsilon) } -> 0,
(ii) g_n(theta_0) -> g(theta_0),
(iii) D g(theta_0) is nonsingular,
(iv) D g(theta) is continuous at theta_0,
then there exists epsilon > 0 such that, for n sufficiently large, the g_n are 1-1 on S(theta_0, epsilon) and their image contains a ball S(g(theta_0), delta).
(c) Conclude that with probability tending to 1, iteration of the Newton-Raphson algorithm starting at theta-bar_n converges to the unique root theta-hat_n described in (b), and that theta-hat_n satisfies (6.2.3).
Hint: You may use the fact that if the initial value of Newton-Raphson is close enough to a unique solution, then it converges to that solution.

11. Suppose responses Y_1, ..., Y_n are independent Poisson variables with Y_i ~ P(lambda_i), where

    log lambda_i = theta_1 + theta_2 z_i,   0 <= z_1 < ... < z_n <= 1,

for given covariate values z_1, ..., z_n. Find the asymptotic likelihood ratio, Wald, and Rao tests for testing H : theta_2 = 0 versus K : theta_2 != 0.

12. Establish (6.2.26) and (6.2.27).

13. Let {v_j} be orthonormal. Show that minimizing SUM_{i=1}^n (Y_i - z_i^T beta)^2 over all beta is the same as minimizing the corresponding criterion after the orthogonal reparametrization in terms of the v_j, where the coefficients c_j of the new basis do not depend on beta. Similarly, compute the information matrix when the model is written in terms of the orthonormal basis.
Hint: Write SUM_i (Y_i - SUM_j c_j theta_j z_i(j))^2 and differentiate with respect to beta_1.

Problems for Section 6.3

1. Suppose the assumptions of Theorem 6.3.2 hold. Reparametrize by eta(theta) = SUM_{j=1}^r eta_j(theta) v_j, where eta_j(theta) = theta^T v_j and {v_j} are orthonormal, and set q(., eta) = p(., theta) for eta in Xi, Xi = {eta(theta) : theta in Theta}. Show that if Xi_0 = {eta in Xi : eta_j = 0, q + 1 <= j <= r}, then lambda(X) for the original testing problem is given by

    lambda(X) = sup{ q(X, eta) : eta in Xi } / sup{ q(X, eta) : eta in Xi_0 },

and, hence, that Theorem 6.3.2 applies to the reparametrized problem.

2. Suppose that omega_0 is given by (6.3.12) and the assumptions of Theorem 6.3.3 hold for p(x, theta), theta in Theta. Show that there exists an open ball about theta_0, S(theta_0), and a map eta which is continuously differentiable such that
(i) eta_j(theta) = g_j(theta) on S(theta_0), q + 1 <= j <= r;
(ii) eta is 1-1 on S(theta_0) and D eta(theta) is a nonsingular r x r matrix for all theta in S(theta_0).
(Adjoin to g_{q+1}, ..., g_r the functions a_1^T theta, ..., a_q^T theta, where a_1, ..., a_q are orthogonal to the linear span of Dg(theta_0).) Show that if we reparametrize {P_theta : theta in S(theta_0)} by q(., eta), where q(., eta(theta)) = p(., theta), then q(., eta) is uniquely defined on Xi = {eta(theta) : theta in S(theta_0)}, and

    omega_0 intersect S(theta_0) = { theta in S(theta_0) : eta_j(theta) = 0, q + 1 <= j <= r }.

Deduce that Theorem 6.3.3 is valid.
3. Testing Simple versus Simple. Let X_1, ..., X_n be i.i.d. with density p(., theta), theta in Theta = {theta_0, theta_1}. Consider testing H : theta = theta_0 versus K : theta = theta_1. Assume that P_{theta_1} != P_{theta_0} and that for some b > 0,

    E_{theta_0} | log[ p(X_1, theta_1) / p(X_1, theta_0) ] |^{2+b} < infinity.

(a) Let lambda(X_1, ..., X_n) be the likelihood ratio statistic. Show that under H, the mean of -2 log lambda(X_1, ..., X_n) is asymptotically n K(theta_0, theta_1), where K(theta_0, theta_1) is the Kullback-Leibler information

    K(theta_0, theta_1) = E_{theta_0} log[ p(X_1, theta_0) / p(X_1, theta_1) ].

(b) Show that, asymptotically, the critical value of the most powerful (Neyman-Pearson) test based on T_n = SUM_{i=1}^n log[ p(X_i, theta_1) / p(X_i, theta_0) ] is -n K(theta_0, theta_1) + sqrt(n) a(theta_0, theta_1) z_{1-alpha}, where a^2(theta_0, theta_1) = Var_{theta_0}( log[ p(X_1, theta_1) / p(X_1, theta_0) ] ).

4. Let X_i, Y_i, 1 <= i <= n, be independent with X_i ~ N(theta_1, 1), Y_i ~ N(theta_2, 1), and Theta = {(theta_1, theta_2) : theta_1 >= 0, theta_2 >= 0}. Consider testing H : theta_1 = theta_2 = 0 versus K : theta_1 > 0 or theta_2 > 0.
(a) Show that, whatever be n, the null distribution of 2 log lambda(X_i, Y_i : 1 <= i <= n) is the mixture that puts mass 1/4 at 0, mass 1/2 on X2_1, and mass 1/4 on X2_2.

5. (a) Let (X_i, Y_i), 1 <= i <= n, have an N(theta_1, theta_2, sigma_10^2, sigma_20^2, rho_0) distribution with sigma_10^2, sigma_20^2, rho_0 known, and let Theta = {(theta_1, theta_2) : theta_1 >= 0, theta_2 >= 0}. Let H and K be as in Problem 4. Show that under H, 2 log lambda(X_i, Y_i : 1 <= i <= n) is distributed as a mixture of point mass at 0, X2_1, and X2_2, with probabilities

    1/4 - (sin^{-1} rho_0)/2 pi,   1/2,   1/4 + (sin^{-1} rho_0)/2 pi.

Hint: Consider Z_1 = X_1 and Z_2 = (Y_1 - rho_0 X_1)/sqrt(1 - rho_0^2).
(b) Let theta_1, theta_2 > 0 and H be as above; exhibit the null distribution of 2 log lambda for the corresponding one-sided alternatives.
(d) Relate the result of (b) to the result of Problem 4(a).
Note: The results of Problems 4 and 5 apply generally to models obeying A0-A6 when we restrict the parameter space to a cone (Robertson, Wright, and Dykstra, 1988). Such restrictions are natural if, for instance, we test the efficacy of a treatment on the basis of two correlated responses per individual.
Hint: By sufficiency reduce to n = 1.

6. In the model of Problem 5(a), compute the MLE (theta-hat_1, theta-hat_2) under the model and show that
(a) if theta_1 > 0 and theta_2 > 0, then sqrt(n)(theta-hat_1 - theta_1, theta-hat_2 - theta_2) is asymptotically normal;
(b) if theta_1 = 0 and theta_2 > 0, the limit law of sqrt(n) theta-hat_1 is the mixture that puts mass 1/2 at 0 and mass 1/2 on the positive half of an N(0, 1) variable;
(c) obtain the limit distribution of sqrt(n)(theta-hat_1, theta-hat_2) when theta_1 = theta_2 = 0, exhibiting it as a mixture of point mass at 0 and half-normal components.

7. Show that (6.3.19) holds.
Hint:
(i) Show that I(theta-hat_n) can be replaced by I(theta).
(ii) Show that W_n(theta-hat^(2)) is invariant under affine reparametrizations eta = a + B theta, where B is nonsingular.
(iii) Reparametrize as in Theorem 6.3.2 and compute W_n(theta-hat^(2)), showing that its leading term is the same as that obtained in the proof of Theorem 6.3.2.

8. Show that under A0-A5, and A6 for theta-hat^(1), sqrt(n)(theta-hat^(1) - theta^(1)) is asymptotically normal with variance matrix Sigma^(1)(theta_0) given by (6.3.21).

9. Show that (6.3.22) is a consistent estimate of Sigma^(1)(theta_0), and, under conditions A0-A6, that [-n^{-1} D^2 l_n(theta-hat_n)]^{-1} is a consistent estimate of I^{-1}(theta_0).
Hint: Write the matrix as an average and apply Theorem 6.2.2 together with A6.

10. Show that under A2, A3, and A6, theta -> I(theta) is continuous.

Problems for Section 6.4

1. In the 2 x 2 contingency table model, let X_i = 1 or 0 according as the ith individual sampled is an A or A-bar, and Y_i = 1 or 0 according as the ith individual sampled is a B or B-bar.
(a) Show that the correlation of X_1 and Y_1 is

    rho = [ P(A and B) - P(A) P(B) ] / sqrt( P(A)(1 - P(A)) P(B)(1 - P(B)) ).

(b) Show that the sample correlation coefficient r studied in Example 5.3.6 is related to Z of (6.4.8) by Z = sqrt(n) r.
(c) Conclude that if A and B are independent, 0 < P(A) < 1, 0 < P(B) < 1, then Z has a limiting N(0, 1) distribution.

2. (a) Show that for any 2 x 2 contingency table, the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero.
(b) Deduce that X2 = Z^2, where Z is given by (6.4.8).
(c) Derive the alternative form (6.4.8) for Z.

3. Exhibit the two solutions of (6.4.4) explicitly and find the one that corresponds to the maximizer of the likelihood.

4. (a) Let (N_11, N_12, N_21, N_22) ~ M(n, theta_11, theta_12, theta_21, theta_22), and let R_i = N_i1 + N_i2, C_j = N_1j + N_2j. Show that given R_1 = r_1, R_2 = r_2, the counts N_11 and N_21 are independent B(r_1, theta_11/(theta_11 + theta_12)) and B(r_2, theta_21/(theta_21 + theta_22)).
(b) Show that theta_11/(theta_11 + theta_12) = theta_21/(theta_21 + theta_22) iff R_1 and C_1 are independent.
(c) Show that under independence the conditional distribution of N_11 given R_1 = r_1, C_1 = c_1 is H(c_1, r_1, n) (the hypergeometric distribution).

5. Let N_ij be the entries of an a x b contingency table with associated probabilities theta_ij, and let eta_i1 = SUM_{j=1}^b theta_ij, eta_j2 = SUM_{i=1}^a theta_ij. Consider the hypothesis H : theta_ij = eta_i1 eta_j2 for all i, j.
(a) Show that the maximum likelihood estimates of eta_i1, eta_j2 are given by

    eta-hat_i1 = R_i / n,   eta-hat_j2 = C_j / n,

where R_i = SUM_j N_ij, C_j = SUM_i N_ij.
(b) Show that Pearson's X2 is given by (6.4.9) and has approximately a X2_{(a-1)(b-1)} distribution under H.
Hint: (a) Consider the likelihood as a function of eta_i1, eta_j2 only.

6. Fisher's Exact Test. From the result of Problem 6.4.4 deduce that if j(alpha) (depending on r_1, c_1, n) can be chosen so that

    P[ N_11 >= j(alpha) | R_1 = r_1, C_1 = c_1 ] = alpha,

then the test that rejects (conditionally on R_1 = r_1, C_1 = c_1) if N_11 >= j(alpha) is exact level alpha. This is known as Fisher's exact test.

7. (a) Show that under H of Problem 6.4.5,

    P[ N_ij = n_ij, i = 1, ..., a, j = 1, ..., b | R_i = r_i, C_j = c_j ]

is the ratio of products of multinomial coefficients

    ( r_1 choose n_11, ..., n_1b ) ... ( r_a choose n_a1, ..., n_ab ) / ( n choose c_1, ..., c_b ).

(b) How would you, in principle, use this result to construct a test of H similar to the Fisher test, with probability of type I error independent of eta_i1, eta_j2?
It may be shown (see Volume II) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5.4.54).
8. (a) If A, B, C are three events, consider the assertions
(i) P(A and B | C) = P(A | C) P(B | C)   (A, B independent given C);
(ii) P(A and B | C-bar) = P(A | C-bar) P(B | C-bar)   (A, B independent given C-bar);
(iii) P(A and B) = P(A) P(B)   (A, B independent).
(C-bar is the complement of C.) Show that (i) and (ii) imply (iii) if A and C are independent or B and C are independent.
(b) Construct an experiment and three events for which (i) and (ii) hold but (iii) does not.
(c) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. Test in both cases whether the events [being a man] and [being admitted] are independent. Then combine the two tables into one, and perform the same test on the resulting table. Give p-values for the three cases.

              Admit   Deny
    Men        235     38     273
    Women       35      7      42
                              n = 315

              Admit   Deny
    Men        122    103     225
    Women       93     69     162
               215    172     n = 387

(d) Relate your results to the phenomenon discussed in (a) and (b).

9. The following table gives the number of applicants to the graduate program of a small department of the University of California, classified by sex and admission status.

              Admit   Deny
    Men         19     12
    Women        1      5

Would you accept or reject the hypothesis of independence at the 0.05 level
(a) using the X2 test with approximate critical value?
(b) using Fisher's exact test of Problem 6.4.6?
Hint: (b) It is easier to work with N_22. Argue that the Fisher test is equivalent to rejecting H if N_22 >= q_2 + n - (r_1 + c_1) or N_22 <= q_1 + n - (r_1 + c_1), and that under H, N_22 is conditionally distributed H(r_2, c_2, n).

10. Suppose that we know that beta_1 = 0 in the logistic model with xi_i = beta_1 + beta_2 z_i, z_i not all equal, and that we wish to test H : beta_2 <= beta_2^0 versus K : beta_2 > beta_2^0. Show that, for suitable alpha, there is a UMP level alpha test, which rejects if and only if SUM_{i=1}^n z_i N_i >= k, where P_{beta_2^0}[ SUM z_i N_i >= k ] = alpha.

11. Establish (6.4.14).

12. (a) Compute the Rao test for H : beta_2 <= beta_2^0 and show that it agrees with the test of Problem 6.4.10.
(b) Suppose that beta_1 is unknown. Compute the Rao test statistic for H : beta_2 <= beta_2^0 in this case.
(c) By conditioning on SUM_{i=1}^n X_i and using the approach of Problem 6.4.10, construct an exact test (level independent of beta_1).

13. Let X_1, ..., X_k be independent, X_i ~ N(theta_i, sigma^2), where either sigma^2 = sigma_0^2 (known) and theta_1, ..., theta_k vary freely, or theta_i = theta_i0 (known), i = 1, ..., k, and sigma^2 is unknown. Show that the likelihood ratio test of H : theta_1 = theta_10, ..., theta_k = theta_k0, sigma^2 = sigma_0^2 is of the form: Reject if

    (1/sigma_0^2) SUM_{i=1}^k (X_i - theta_i0)^2 >= k_2  or  <= k_1.

This is an approximation (for large k, n) and simplification of a model under which (N_1, ..., N_k) ~ M(n, theta_10, ..., theta_k0) under H, but under K may be either multinomial with theta != theta_0 or have E(N_i) <= n theta_i0 and Var(N_i) < n theta_i0 (1 - theta_i0) ("cooked data").

14. In the binomial one-way layout show that the LR test is asymptotically equivalent to Pearson's X2 test in the sense that 2 log lambda - X2 -> 0 in probability under H.

15. Show that if omega_0 is contained in omega_1, nested logistic regression models of dimension q < r <= k, and m_1, ..., m_k -> infinity and H : eta in omega_0 is true, then the law of the statistic of (6.4.18) tends to X2_{r-q}.
Hint: (X_i - m_i pi_i)/sqrt(m_i pi_i (1 - pi_i)), 1 <= i <= k, are asymptotically N(0, 1) under H. Use this to imitate the argument of Theorem 6.3.3.

Problems for Section 6.5

1. Fisher's Method of Scoring. The following algorithm for solving likelihood equations was proposed by Fisher; see Rao (1973). Given an initial value theta_0, define iterates

    theta_{m+1} = theta_m + I^{-1}(theta_m) Dl(theta_m).

Show that for GLM with canonical link this method coincides with the Newton-Raphson method of Section 2.4.3.

2. Suppose that (Z_1, Y_1), ..., (Z_n, Y_n) are i.i.d. as (Z, Y), with the conditional model for Y given Z = z of logistic regression type, and that
(a) P[ Z_1 in {z^(1), ..., z^(k)} ] = 1;
(b) the linear span of {z^(1), ..., z^(k)} is R^p;
(c) P[ Z_1 = z^(j) ] > 0 for all j.
Show that the conditions A0-A6 hold for P = P_{beta_0} in P (where q_0 is assumed known), so that the MLE beta-hat is unique, asymptotically exists, and is consistent; in particular, beta-hat_0 as defined by (6.4.16) is consistent.
Hint: Show that if the convex support of the conditional distribution of Y_1 given Z_1 = z^(j) contains an open interval about mu_j for j = 1, ..., k, then the convex support of the conditional distribution of SUM_{j=1}^k lambda_j Y_j z^(j) contains an open ball about SUM_{j=1}^k lambda_j mu_j z^(j) in R^p.

3. Let Y_1, ..., Y_n be independent responses and suppose the distribution of Y_i depends on a covariate vector z_i. Assume that there exist functions h(y, tau), b(theta), c(tau) such that the model for Y_i can be written as

    p(y, theta_i, tau) = exp{ [theta_i y - b(theta_i)] / c(tau) } h(y, tau),

where tau is known, and b' and g are monotone. Set xi_i = g(mu_i) = z_i^T beta, where mu_i = E(Y_i), and v(mu_i) = Var(Y_i)/c(tau) = b''(theta_i).
(a) Show that the likelihood equations are

    SUM_{i=1}^n [ (y_i - mu_i) / v(mu_i) ] (d mu_i / d xi_i) z_ij = 0,  j = 1, ..., p.

Hint: By the chain rule,

    (d/d beta_j) l(y, theta) = (dl/d theta)(d theta/d mu)(d mu/d xi)(d xi/d beta_j).

(b) Show that the Fisher information is Z_D^T W Z_D, where Z_D = ||z_ij|| is the design matrix and W = diag(w_1, ..., w_n), with

    w_i = w(mu_i) = (d mu_i / d xi_i)^2 / v(mu_i).

(c) Suppose that (Z_1, Y_1), ..., (Z_n, Y_n) are i.i.d. as (Z, Y) and that given Z = z, Y follows the model p(y, theta(z), tau), where theta(z) solves b'(theta) = g^{-1}(z^T beta). Give the asymptotic distribution of sqrt(n)(beta-hat - beta) under appropriate conditions.
(d) Find the canonical link function g = (b')^{-1} and show that when g is the canonical link, the results of (a)-(c) coincide with those of this section.
(e) Show that for the Gaussian linear model with known variance, Y_i ~ N(mu_i, sigma_0^2), the deviance is D(y, mu_0) = |y - mu_0|^2 / sigma_0^2; and for Y_i with the Poisson, P(mu_i), distribution, give the deviance.
Problems for Section 6.6

1. Let sigma^2(P) = 1/[4 f^2(v(P))] denote the asymptotic variance of the sample median (Problem 6.6.5), where v(P) is the unique median of P and f its density.
(a) Show that if f is symmetric about mu, then v(P) = mu.
(b) Show that if f is N(mu, sigma^2), then sigma^2(P) > sigma^2 = Var_P(X_1), and in fact sigma^2(P)/sigma^2 = pi/2; but that if f_lambda(x) = (lambda/2) exp{-lambda |x - mu|}, then sigma^2(P) < sigma^2.

2. Consider the Rao test for H : theta = theta_0 for the model P = {P_theta : theta in Theta}, where A0-A6 hold. Suppose that the true P does not belong to P but that theta(P), defined by (6.6.3), equals theta_0. Show that if Var_P Dl(X, theta_0) is estimated by I(theta_0), then the Rao test does not in general have the correct asymptotic level, but that if the estimate n^{-1} SUM_{i=1}^n Dl Dl^T (X_i, theta_0) is used, the resulting test has asymptotic level alpha.

3. Show that sigma-tilde^2 given in (6.6.14) is a consistent estimate of 2 Var_P X^(1) in Example 6.6.3.

4. Show that the standard Wald test for the problem of Example 6.6.3 is as given in (6.6.10), and that replacing sigma-hat^2 by sigma-tilde^2 in (6.6.10) creates a valid asymptotic level alpha test under the sole assumption that X^(1) and X^(2) are identically distributed under H with finite variance.

5. Suppose X_1, ..., X_n are i.i.d. P, where P has a positive density f at v(P), the unique median of P. Show that the sample median X-tilde satisfies

    sqrt(n) ( X-tilde - v(P) ) -> N(0, sigma^2(P)),   sigma^2(P) = 1/[4 f^2(v(P))].

6. Establish (6.6.15) by verifying the condition of Theorem 6.2.1 under this model and verifying the formula given.
Hint: Apply Theorem 6.2.1.
7. Suppose A0-A6 are valid. Show that the LR, Wald, and Rao tests are still asymptotically equivalent in the sense that if 2 log lambda_n, W_n, and R_n are the corresponding test statistics, then under H,

    2 log lambda_n = W_n + o_p(1),   R_n = W_n + o_p(1).

Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under the parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation.

In the binary data regression model of Section 6.4.3, let pi = s(z^T beta), where s(t) is the continuous distribution function of a random variable symmetric about 0, that is,

    s(t) = 1 - s(-t),  t in R.

(a) Show that pi can be written in this form for both the probit and logit models.
(b) Suppose that the Z_i are realizations of i.i.d. Z_i, that Z_1 is bounded with probability 1, and let beta-hat_L(X^(n)), where X^(n) = {(Z_i, Y_i) : 1 <= i <= n}, be the MLE for the logit model. Show that if the correct model has pi_i given by s as above and beta = beta_0, then beta-hat_L is not a consistent estimate of beta_0 unless s(t) is the logistic distribution. But if beta_L is defined as the solution of

    E( Z_1 s(Z_1^T beta_0) ) = Q(beta),   Q(beta) = E( Z_1 Adot(Z_1^T beta) )  (p x 1),

then sqrt(n)(beta-hat_L - beta_L) has a limiting normal distribution with mean 0 and variance

    Q_1^{-1}(beta) Var( Z_1 (Y_1 - Adot(Z_1^T beta)) ) [Q_1^{-1}(beta)]^T,

where Q_1(beta) = E( Z_1 Adotdot(Z_1^T beta_0) Z_1^T ) is p x p and necessarily nonsingular.

8. (Model Selection) Consider the classical Gaussian linear model (6.1.1),

    Y_i = mu_i + e_i,   mu_i = z_i^T beta,   i = 1, ..., n,

where the e_i are i.i.d. Gaussian with mean zero and variance sigma^2, and the z_i are d-dimensional vectors of covariate (factor) values. Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d - p don't matter, that is, beta_{p+1} = ... = beta_d = 0. Let beta-hat^(p) be the LSE under this assumption and mu-hat_i^(p) = z_i^T beta-hat^(p) the corresponding fitted value. A natural goal to entertain is to obtain new values Y_1*, ..., Y_n* at z_1, ..., z_n and evaluate the performance of Y-hat^(p) = (mu-hat_1^(p), ..., mu-hat_n^(p)) by the (average) expected prediction error

    EPE(p) = n^{-1} SUM_{i=1}^n E( Y_i* - mu-hat_i^(p) )^2,

where Y_1*, ..., Y_n* are independent of Y_1, ..., Y_n and Y_i* is distributed as Y_i. Let RSS(p) = SUM_{i=1}^n (Y_i - mu-hat_i^(p))^2 be the residual sum of squares. Model selection consists in selecting p to minimize EPE(p) and then using Y-hat^(p) as a predictor (Mallows, 1973, for instance). Suppose sigma^2 is known.
(a) Show that

    EPE(p) = sigma^2 (1 + p/n) + n^{-1} SUM_{i=1}^n ( mu_i - E mu-hat_i^(p) )^2.

(b) Show that E RSS(p) = (n - p) sigma^2 + SUM_{i=1}^n ( mu_i - E mu-hat_i^(p) )^2.
(c) Deduce that

    EPE-hat(p) = RSS(p)/n + 2 sigma^2 p / n

is an unbiased estimate of EPE(p).
(d) Show that (a)-(c) continue to hold if we assume the Gauss-Markov model.
(e) Suppose d = 2. Evaluate EPE for (i) eta_i = beta_1 z_i1 and (ii) eta_i = beta_1 z_i1 + beta_2 z_i2. Use sigma^2 = 1 and n = 10. Give values of beta_1, beta_2, and {z_i1, z_i2} such that the EPE in case (i) is smaller than in case (ii), and vice versa.
Hint: (a) Note that mu-hat^(p) is a projection of Y. (b) The result depends only on the mean and covariance structure of the Y_i, i = 1, ..., n. Derive the result for the canonical model.
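A minimal sketch of the model-selection rule in part (c) of the problem: estimate EPE(p) by RSS(p)/n + 2 sigma^2 p / n and pick the submodel minimizing it (Mallows, 1973). Here sigma^2 is taken as known and the data are hypothetical; only the first two columns actually matter.

import numpy as np

def epe_hat(Z, Y, sigma2):
    n, p = Z.shape
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]
    rss = ((Y - Z @ beta) ** 2).sum()
    return rss / n + 2.0 * sigma2 * p / n

rng = np.random.default_rng(6)
n, d, sigma2 = 200, 5, 1.0
Z = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(d - 1)])
Y = Z[:, 0] + 0.9 * Z[:, 1] + rng.normal(size=n)
for p in range(1, d + 1):
    print(p, epe_hat(Z[:, :p], Y, sigma2))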
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.

GRAYBILL, F., An Introduction to Linear Statistical Models, Vol. I. New York: McGraw-Hill, 1961.

HABERMAN, S., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.

HALD, A., Statistical Theory with Engineering Applications New York: Wiley, 1952.

HUBER, P., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob., 1, 221-233 (1967).

KOENKER, R., AND V. D'OREY, "Computing regression quantiles," J. Roy. Statist. Soc. Ser. C, 36, 383-393 (1987).

LAPLACE, P.-S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (1789). (Reprinted in Oeuvres Complètes, 11, 475-558. Paris: Gauthier-Villars.)

MALLOWS, C., "Some comments on Cp," Technometrics, 15, 661-675 (1973).

McCULLAGH, P., AND J. NELDER, Generalized Linear Models, 2nd ed. London: Chapman and Hall, 1989.

PORTNOY, S., AND R. KOENKER, "The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROBERTSON, T., F. WRIGHT, AND R. DYKSTRA, Order Restricted Statistical Inference New York: Wiley, 1988.

SCHEFFÉ, H., The Analysis of Variance New York: Wiley, 1959.

SCHERVISH, M., Theory of Statistics New York: Springer, 1995.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.

TIERNEY, L., R. KASS, AND J. KADANE, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need.

The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics. Therefore, we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply.

The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that, to be called an experiment, such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not, in addition, require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize.
This long-term relative frequency n_A/n, where n_A is the number of times the possible outcome A occurs in n repetitions, is to many statisticians, including the authors, the operational interpretation of the mathematical concept of probability. Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, from horse races to genetic experiments, whether the situation is conceptually repeatable or not. For a discussion of this approach and further references, the reader may wish to consult Savage (1954), Savage (1962), Raiffa and Schlaiffer (1961), Lindley (1965), de Groot (1970), and Berger (1985).

We now turn to the mathematical abstraction of a random experiment, the probability model. A random experiment is described mathematically in terms of the following quantities.

A.1.1 The sample space is the set of all possible outcomes of a random experiment. We denote it by Ω.

A.1.2 A sample point is any member of Ω and is typically denoted by ω.

A.1.3 Subsets of Ω are called events. We denote events by A, B, and so on, or by a description of their members. The relation between the experiment and the model is given by the correspondence "A occurs if and only if the actual outcome of the experiment is a member of A." If ω ∈ Ω, {ω} is called an elementary event; if A contains more than one point, it is called a composite event. The sample space Ω is the sure event; its complement, the null set or impossible event, is denoted by ∅.

In this section and throughout the book, we presume the reader to be familiar with elementary set theory and its notation at the level of Chapter 1 of Feller (1968) or Chapter 1 of Parzen (1960). We shall use the symbols ∪, ∩, ⊂, c for union, intersection, inclusion, and complementation as is usual in elementary set theory. The set operations we have mentioned have interpretations also. For example, the relation A ⊂ B between sets, considered as events, means that the occurrence of A implies the occurrence of B.

A.1.4 We will let A denote the class of subsets of Ω to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability P to every subset of Ω. A is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complementation (cf. Chung, 1974; Grimmett and Stirzaker, 1992; Loève, 1977). For convenience, when we refer to events we shall automatically exclude those that are not members of A.

A.1.5 A probability distribution or measure is a nonnegative function P on A having the following properties:

(i) P(Ω) = 1;

(ii) if A₁, A₂, ... are pairwise disjoint sets in A, then

P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

Recall that ∪_{i=1}^∞ A_i is just the collection of points that are in any one of the sets A_i, and that two sets are disjoint if they have no points in common.

A.1.6 The three objects Ω, A, and P together describe a random experiment mathematically. We shall refer to the triple (Ω, A, P) either as a probability model or identify the model with what it represents as a (random) experiment.

References
Gnedenko (1967) Chapter I, Section 8
Grimmett and Stirzaker (1992) Section 1.3
Hoel, Port, and Stone (1971) Section 1.2
Parzen (1960) Chapter 1, Sections 1-5
Pitman (1993) Section 1.3
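As a purely illustrative rendering of (A.1.5), not part of the text, a discrete probability model can be represented on a computer by a table of point masses; the sketch below, with names of our own choosing, checks the two axioms and evaluates P for a composite event.

```python
from fractions import Fraction

# A finite sample space and a probability assignment P({w}) for each point w.
omega = ["HH", "HT", "TH", "TT"]
mass = {w: Fraction(1, 4) for w in omega}

def P(event):
    """P(A) for an event A, a subset of omega, by summing point masses."""
    return sum(mass[w] for w in event)

assert P(set(omega)) == 1              # axiom (i): P(Omega) = 1
A, B = {"HH", "HT"}, {"TT"}            # two disjoint events
assert P(A | B) == P(A) + P(B)         # additivity over disjoint events, axiom (ii)
print(P({"HH", "HT", "TH"}))           # a composite event; probability 3/4
```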
A.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS

The following are consequences of the definition of P.

A.2.1 P(∅) = 0.

A.2.2 P(A^c) = 1 − P(A).

A.2.3 If A ⊂ B, then P(B − A) = P(B) − P(A).

A.2.4 0 ≤ P(A) ≤ 1, and if A ⊂ B, then P(B) ≥ P(A).

A.2.5 P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i).

A.2.6 If A₁ ⊂ A₂ ⊂ ⋯ ⊂ A_n ⊂ ⋯, then P(∪_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).

A.2.7 (Bonferroni's inequality)

P(∩_{i=1}^n A_i) ≥ 1 − Σ_{i=1}^n P(A_i^c). (A.2.2)

References
Gnedenko (1967) Chapter I, Sections 1-3
Grimmett and Stirzaker (1992) Section 1.3
Hoel, Port, and Stone (1971) Section 1.3
Parzen (1960) Chapter 1, Sections 4-5
Pitman (1993) Sections 1.2 and 1.3

A.3 DISCRETE PROBABILITY MODELS

A.3.1 A probability model is called discrete if Ω is finite or countably infinite and every subset of Ω is assigned a probability. That is, we can write Ω = {ω₁, ω₂, ...} and A is the collection of all subsets of Ω. In this case, by axiom (ii) of (A.1.5), we have for any event A,

P(A) = Σ_{ω_i ∈ A} P({ω_i}). (A.3.1)

A.3.2 An important special case arises when Ω has a finite number of elements, say N, all of which are equally likely. Then P({ω}) = 1/N for every ω ∈ Ω, and

P(A) = (Number of elements in A)/N. (A.3.3)

A.3.4 Suppose that ω₁, ..., ω_N are the members of some population (humans, guinea pigs, flowers, machines, etc.). Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another, selecting at random, is an experiment leading to the model of (A.3.3). Such selection can be carried out if N is small by putting the "names" of the ω_i in a hopper, shaking well, and drawing. For large N, a random number table or computer can be used.

References
Gnedenko (1967) Chapter I, Sections 4-5
Parzen (1960) Chapter 1, Sections 6-7
Pitman (1993) Section 1.1

A.4 CONDITIONAL PROBABILITY AND INDEPENDENCE

Given an event B such that P(B) > 0 and any other event A, we define the conditional probability of A given B, which we write P(A | B), by

P(A | B) = P(A ∩ B)/P(B). (A.4.1)

From a heuristic point of view, P(A | B) is the chance we would assign to the event A if we were told that B has occurred. If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment, then P(A | B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur. In fact, for fixed B as before, the function P(· | B) is a probability measure on (Ω, A), which is referred to as the conditional probability measure given B.

Transposition of the denominator in (A.4.1) gives the multiplication rule,

P(A ∩ B) = P(B)P(A | B). (A.4.2)

If B₁, ..., B_n are (pairwise) disjoint events of positive probability whose union is Ω, the identity A = ∪_{j=1}^n (A ∩ B_j) and (A.4.2) yield

P(A) = Σ_{j=1}^n P(A | B_j)P(B_j). (A.4.3)

If P(A) is positive, we can combine (A.4.1), (A.4.2), and (A.4.3) and obtain Bayes rule,

P(B_j | A) = P(A | B_j)P(B_j) / Σ_{i=1}^n P(A | B_i)P(B_i). (A.4.4)

Simple algebra leads to the general multiplication rule,

P(B₁ ∩ ⋯ ∩ B_n) = P(B₁)P(B₂ | B₁)P(B₃ | B₁, B₂) ⋯ P(B_n | B₁, ..., B_{n−1}) (A.4.5)

whenever P(B₁ ∩ ⋯ ∩ B_{n−1}) > 0. Here the conditional probability of A given B₁, ..., B_n, written P(A | B₁, ..., B_n), is defined by

P(A | B₁, ..., B_n) = P(A | B₁ ∩ ⋯ ∩ B_n) (A.4.6)

whenever P(B₁ ∩ ⋯ ∩ B_n) > 0.

Two events A and B are said to be independent if

P(A ∩ B) = P(A)P(B). (A.4.7)

If P(B) > 0, the relation (A.4.7) may equivalently be written

P(A | B) = P(A). (A.4.8)

In other words, A and B are independent if knowledge of B does not affect the probability of A.

The events A₁, ..., A_n are said to be independent if

P(A_{i₁} ∩ ⋯ ∩ A_{i_k}) = Π_{j=1}^k P(A_{i_j}) (A.4.9)

for any subset {i₁, ..., i_k} of the integers {1, ..., n}. If all the P(A_i) are positive, the relation (A.4.9) is equivalent to requiring that

P(A_j | A_{i₁}, ..., A_{i_k}) = P(A_j) (A.4.10)

for any j and {i₁, ..., i_k} such that j ∉ {i₁, ..., i_k}.

References
Gnedenko (1967) Chapter I, Section 9
Grimmett and Stirzaker (1992) Section 1.4
Hoel, Port, and Stone (1971) Sections 1.4-1.5
Parzen (1960) Chapter 2, Sections 1-4; Chapter 3, Section 4
Pitman (1993) Section 1.4
A.5 COMPOUND EXPERIMENTS

There is an intuitive notion of independent experiments. For example, if we toss a coin twice, the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. On the other hand, it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip, and if we do not replace the first chip drawn before the second draw, then the probability of a given chip in the second draw will depend on the outcome of the first draw. To be able to talk about independence and dependence of experiments, we introduce the notion of a compound experiment. Informally, a compound experiment is one made up of two or more component experiments. There are certain natural ways of defining sigma fields and probabilities for these experiments. These will be discussed in this section. The reader not interested in the formalities may skip to Section A.6, where examples of compound experiments are given.

A.5.1 Recall that if A₁, ..., A_n are events, the Cartesian product A₁ × ⋯ × A_n is by definition {(ω₁, ..., ω_n) : ω_i ∈ A_i, 1 ≤ i ≤ n}. If we are given n experiments (probability models) E₁, ..., E_n with respective sample spaces Ω₁, ..., Ω_n, then the sample space Ω of the n-stage compound experiment is by definition Ω₁ × ⋯ × Ω_n. The (n-stage) compound experiment consists in performing the component experiments E₁, ..., E_n and recording all n outcomes. The interpretation of the sample space Ω is that (ω₁, ..., ω_n) is a sample point in Ω if and only if ω₁ is the outcome of E₁, ω₂ is the outcome of E₂, and so on. To say that E_i has had outcome ω_i⁰ ∈ Ω_i corresponds to the occurrence of the compound event (in Ω)

Ω₁ × ⋯ × Ω_{i−1} × {ω_i⁰} × Ω_{i+1} × ⋯ × Ω_n = {(ω₁, ..., ω_n) ∈ Ω : ω_i = ω_i⁰}.

More generally, if A_i ∈ A_i, the sigma field corresponding to E_i, then A_i corresponds in the compound experiment to Ω₁ × ⋯ × Ω_{i−1} × A_i × Ω_{i+1} × ⋯ × Ω_n, and A₁ × ⋯ × A_n makes sense as a subset of Ω. The subsets A of Ω to which we can assign probability form the sigma field A specified in note (1) at the end of this appendix.

If we want to make the E_i independent, then intuitively we should have, for all events A₁, ..., A_n with A_i ∈ A_i,

P(A₁ × ⋯ × A_n) = P(A₁ × Ω₂ × ⋯ × Ω_n)P(Ω₁ × A₂ × ⋯ × Ω_n) ⋯ P(Ω₁ × ⋯ × Ω_{n−1} × A_n). (A.5.1)

If we are given probabilities P₁ on (Ω₁, A₁), ..., P_n on (Ω_n, A_n), and if we define

P(A₁ × ⋯ × A_n) = P₁(A₁) ⋯ P_n(A_n) (A.5.2)

for events A₁ × ⋯ × A_n, then it may be shown (Billingsley, 1995; Chung, 1974; Loève, 1977) that P can be uniquely extended to the sigma field A, and that (A.5.1) then holds. We shall speak of independent experiments E₁, ..., E_n if the n-stage compound experiment has its probability structure specified by (A.5.2). In the discrete case (A.5.2) becomes

P({(ω₁, ..., ω_n)}) = P₁({ω₁}) ⋯ P_n({ω_n}) for all ω_i ∈ Ω_i. (A.5.3)

Specifying P when the E_i are dependent is more complicated. In the discrete case we know P once we have specified P({(ω₁, ..., ω_n)}) for each (ω₁, ..., ω_n) with ω_i ∈ Ω_i. By the multiplication rule (A.4.5) we have

P({(ω₁, ..., ω_n)}) = P(E₁ has outcome ω₁)P(E₂ has outcome ω₂ | E₁ has outcome ω₁) ⋯ P(E_n has outcome ω_n | E₁ has outcome ω₁, ..., E_{n−1} has outcome ω_{n−1}). (A.5.4)

The probability structure is determined by these conditional probabilities, and conversely.

References
Grimmett and Stirzaker (1992) Section 1.5
Hoel, Port, and Stone (1971) Section 1.5
Parzen (1960) Chapter 3
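For discrete component experiments, (A.5.3) can be checked mechanically. The following sketch (ours, with hypothetical names) builds the two-stage compound experiment for a coin toss followed by a die roll and verifies that the product assignment gives probability P₁(A₁)P₂(A₂) to A₁ × A₂.

```python
from fractions import Fraction
from itertools import product

# Component experiments: a fair coin and a fair die, each given by point masses.
P1 = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
P2 = {k: Fraction(1, 6) for k in range(1, 7)}

# The compound sample space Omega1 x Omega2 with the product assignment (A.5.3).
P = {(w1, w2): P1[w1] * P2[w2] for w1, w2 in product(P1, P2)}
assert sum(P.values()) == 1                    # P is a probability on Omega

A1, A2 = {"H"}, {2, 4, 6}                      # events in the two component experiments
prob_rect = sum(P[(w1, w2)] for w1 in A1 for w2 in A2)     # P(A1 x A2)
assert prob_rect == sum(P1[w] for w in A1) * sum(P2[w] for w in A2)   # matches (A.5.2)
print(prob_rect)                               # 1/4
```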
A.6 BERNOULLI AND MULTINOMIAL TRIALS, SAMPLING WITH AND WITHOUT REPLACEMENT

A.6.1 Suppose that we have an experiment with only two possible outcomes, which we shall denote by S (success) and F (failure). If we assign P({S}) = p, we shall refer to such an experiment as a Bernoulli trial with probability of success p. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). Other examples will appear naturally in what follows. If we repeat such an experiment n times independently, we say we have performed n Bernoulli trials with success probability p. If Ω is the sample space of the compound experiment, any point ω ∈ Ω is an n-dimensional vector of S's and F's and, by (A.5.3),

P({ω}) = p^{k(ω)}(1 − p)^{n−k(ω)}, (A.6.2)

where k(ω) is the number of S's appearing in ω.

If A_k is the event [exactly k S's occur], then

P(A_k) = C(n, k) p^k (1 − p)^{n−k}, k = 0, 1, ..., n, (A.6.3)

where C(n, k) = n!/(k!(n − k)!). The formula (A.6.3) is known as the binomial probability.

A.6.4 More generally, if an experiment has q possible outcomes ω₁, ..., ω_q and P({ω_i}) = p_i, i = 1, ..., q, we refer to such an experiment as a multinomial trial with probabilities p₁, ..., p_q. If the experiment is performed n times independently, the compound experiment is called n multinomial trials with probabilities p₁, ..., p_q. If Ω is the sample space of this compound experiment and ω ∈ Ω, then

P({ω}) = p₁^{k₁(ω)} ⋯ p_q^{k_q(ω)}, (A.6.5)

where k_i(ω) is the number of times ω_i appears in the sequence ω.
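The binomial formula (A.6.3) can be illustrated by simulation. The sketch below is our own, not from the text: it performs repeated sets of n Bernoulli trials and compares the relative frequency of A_k with the probability that the formula assigns.

```python
import math
import random

random.seed(1)
n, p = 10, 0.3
reps = 100_000

# Simulate `reps` runs of n Bernoulli trials and record the number of successes.
counts = [0] * (n + 1)
for _ in range(reps):
    k = sum(random.random() < p for _ in range(n))
    counts[k] += 1

for k in range(n + 1):
    binom = math.comb(n, k) * p**k * (1 - p) ** (n - k)   # formula (A.6.3)
    print(k, round(counts[k] / reps, 5), round(binom, 5))
```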
A.6.6 If A_{k₁,...,k_q} is the event (exactly k₁ ω₁'s are observed, exactly k₂ ω₂'s are observed, ..., exactly k_q ω_q's are observed), then

P(A_{k₁,...,k_q}) = (n!/(k₁! ⋯ k_q!)) p₁^{k₁} ⋯ p_q^{k_q}, (A.6.6)

where the k_i are natural numbers adding up to n.

A.6.7 If we perform an experiment given by (Ω, A, P) independently n times, we shall sometimes refer to the outcome of the compound experiment as a sample of size n from the population given by (Ω, A, P). When Ω is finite, the term with replacement is added to distinguish this situation from that described in (A.6.8).

A.6.8 If we have a finite population of cases Ω = {ω₁, ..., ω_N} and we select cases successively at random n times without replacement, then, for any outcome a = (ω_{i₁}, ..., ω_{i_n}) of the compound experiment,

P({a}) = 1/(N)_n, (A.6.9)

where (N)_n = N!/(N − n)!. The component experiments are not independent. If the case drawn is replaced before the next drawing, we are sampling with replacement; the component experiments are then independent and P({a}) = 1/N^n.

If Np of the members of Ω have a "special" characteristic S and N(1 − p) have the opposite characteristic F, and A_k = (exactly k "special" individuals are obtained in the sample), then

P(A_k) = C(n, k)(Np)_k (N(1 − p))_{n−k} / (N)_n (A.6.10)

for max(0, n − N(1 − p)) ≤ k ≤ min(n, Np), and P(A_k) = 0 otherwise. The formula (A.6.10) is known as the hypergeometric probability.

References
Gnedenko (1967) Chapter 2, Section 11
Hoel, Port, and Stone (1971) Section 2.4
Parzen (1960) Chapter 3, Sections 1-4
Pitman (1993) Section 2.1

A.7 PROBABILITIES ON EUCLIDEAN SPACE

Random experiments whose outcomes are real numbers play a central role in theory and practice. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space. We shall use the notation R^k for k-dimensional Euclidean space and denote members of R^k by symbols such as x or (x₁, ..., x_k)^T, where ( )^T denotes transpose.
A.7.1 If (a₁, b₁), ..., (a_k, b_k) are k open intervals, we shall call the set (a₁, b₁) × ⋯ × (a_k, b_k) = {(x₁, ..., x_k) : a_i < x_i < b_i, 1 ≤ i ≤ k} an open k-rectangle.

A.7.2 The Borel field in R^k, which we denote by B^k, is defined to be the smallest sigma field having all open k-rectangles as members. Any subset of R^k we might conceivably be interested in turns out to be a member of B^k. We will write R for R¹ and B for B¹.

A.7.3 A discrete (probability) distribution on R^k is a probability measure P such that Σ_{i=1}^∞ P({x_i}) = 1 for some sequence of points {x_i} in R^k. This definition is consistent with (A.3.1) because the study of this model and that of the model that has Ω = {x₁, x₂, ...} are equivalent: only an x_i can occur as an outcome of the experiment.

The frequency function p of a discrete distribution is defined on R^k by

p(x) = P({x}). (A.7.4)

Conversely, any nonnegative function p on R^k that vanishes except on a sequence {x₁, x₂, ...} of vectors and that satisfies Σ_{i=1}^∞ p(x_i) = 1 defines a unique discrete probability distribution by the relation

P(A) = Σ_{x_i ∈ A} p(x_i). (A.7.5)

It may be shown that a function P so defined satisfies the axioms of (A.1.5).

A.7.6 A nonnegative function p on R^k, which is integrable and which has

∫_{R^k} p(x) dx = 1,

is called a density function.

A.7.7 A continuous probability distribution on R^k is a probability P that is defined by the relation

P(A) = ∫_A p(x) dx (A.7.8)

for some density function p and all events A, where dx denotes dx₁ ⋯ dx_k. Integrals should be interpreted in the sense of Lebesgue; however, for practical purposes, Riemann integrals are adequate. Recall that the integral on the right of (A.7.8) is by definition ∫_{R^k} 1_A(x)p(x) dx, where 1_A(x) = 1 if x ∈ A and 0 otherwise. Geometrically, P(A) is the volume of the "cylinder" with base A and height p(x) at x. Distributions defined by (A.7.8) are usually called absolutely continuous. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely.

It turns out that a continuous probability distribution determines the density that generates it "uniquely."(1) Although in a continuous model P({x}) = 0 for every x, the density function has an operational interpretation close to that of the frequency function. For instance, if p is a continuous density on R, x₀ and x₁ are in R, and h is close to 0, then by the mean value theorem,

P([x₀ − h, x₀ + h]) ≈ 2h p(x₀) and P([x₁ − h, x₁ + h]) ≈ 2h p(x₁),

and thus

P([x₀ − h, x₀ + h]) / P([x₁ − h, x₁ + h]) ≈ p(x₀)/p(x₁). (A.7.10)

The ratio p(x₀)/p(x₁) can, therefore, be thought of as measuring approximately how much more or less likely we are to obtain an outcome in a neighborhood of x₀ than one in a neighborhood of x₁.

A.7.11 The distribution function (d.f.) F is defined by

F(x₁, ..., x_k) = P((−∞, x₁] × ⋯ × (−∞, x_k]). (A.7.12)

The d.f. defines P in the sense that if P and Q are two probabilities with the same d.f., then P = Q. When k = 1, F is a function of a real variable characterized by the following properties:

x < y ⟹ F(x) ≤ F(y) (monotone); (A.7.13)
lim_{x→∞} F(x) = 1; lim_{x→−∞} F(x) = 0; (A.7.14), (A.7.15)
x_n ↓ x ⟹ F(x_n) → F(x) (continuous from the right). (A.7.16)

It may be shown that any function F satisfying (A.7.13)-(A.7.16) defines a unique P on the real line. We always have

F(x) − F(x − 0) = P({x}),(2) (A.7.17)

so that F is continuous at x if and only if P({x}) = 0.

References
Gnedenko (1967) Chapter 4, Sections 21-22
Hoel, Port, and Stone (1971) Sections 3.1-3.2 and 5.1-5.2
Parzen (1960) Chapter 4, Sections 1-4
Pitman (1993) Sections 3.1, 4.1, and 4.5

A.8 RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS

Although sample spaces can be very diverse, the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. For example, we measure the weight of pigs drawn at random from a population, the time to breakdown and length of repair time for a randomly chosen machine, the yield per acre of a field of wheat in a given year, the concentration of a certain pollutant in the atmosphere. In the probability model, these quantities will correspond to random variables and vectors.

A.8.1 A random variable X is a function from Ω to R such that the set {ω : X(ω) ∈ B} = X⁻¹(B) is in A for every B ∈ B.(1)

A.8.2 A random vector X = (X₁, ..., X_k)^T is a k-tuple of random variables, or equivalently a function from Ω to R^k such that the set {ω : X(ω) ∈ B} = X⁻¹(B) is in A for every B ∈ B^k.(1) For k = 1, random vectors are just random variables. The event X⁻¹(B) will usually be written [X ∈ B], and P([X ∈ B]) will be written P[X ∈ B].

A.8.3 The probability distribution of a random vector X is, by definition, the probability measure P_X in the model (R^k, B^k, P_X) given by

P_X(B) = P[X ∈ B]. (A.8.1)

The probability of any event that is expressible purely in terms of X can be calculated if we know only the probability distribution of X. When we are interested in particular random variables or vectors, we will, therefore, describe them purely in terms of their probability distributions, without any further specification of the underlying sample space on which they are defined.

A.8.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete. In the discrete case this means we need only know the frequency function, and in the continuous case the density. Thus, we will refer to the frequency function, density, d.f., and so on of a random vector when we are, in fact, referring to those features of its probability distribution. The subscript X (or X) will be used for frequency functions, densities, d.f.'s, and so on to indicate which vector or variable they correspond to, unless the reference is clear from the context, in which case it will be omitted.

The study of real- or vector-valued functions of a random vector X is central in the theory of probability and of statistics. Here is the formal definition of such transformations. Let g be any function from R^k to R^m, k, m ≥ 1, such that(2) the set g⁻¹(B) = {y ∈ R^k : g(y) ∈ B}
a random vector obtained by putting two random vectors together.Y).7) and (A. then g(X) is discrete and has frequency function Pg(X)(t) = L {x:g(x)=t} Px(x).11) and (A.y)(x.9) t E g(S). These notions generalize to the case Z = (X. Another common example is g(X) = (min{X.12) by summing or integrating out over yin P(X.7. V). y)T is continuous with density p(X.1 1 Xi = X and 92(X) = k.8.Y) (x. ! for Pg(x)(I) ~ PX(gl(t)) Ig'(g 1(1))1 (A. j . The (marginal) frequency or density of X is found as in (A.(5) (A.8.Y).1O) From (A. • II ! I • .8.8) Suppose that X is continuous with density PX and 9 is realvalued and onetoone(3) on an open set S such that P[X E 5] = 1.452 A Review of BClsic ProbClbility Theory Appendix A [J} E BI. .8)) that X is a marginal density function given by px(x) ~ 1: P(X. max{X. .). then the frequency function of X.11) Similarly. assume that the derivative l of 9 exists and does not vanish on S. Furthennore.jPX a (A. a # 0. .y)(x.8. (A. This is called the change of variable formula.7) If X is discrete with frequency function Px.S. Then the random tran~form(lti(m g( X) is defined by g(X)(w) = g(X(w)).8) it follows that if (X.1 E~' l(X i . Yf is a discrete random vector with frequency function p(X.8.8. and X is continuous. Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa. y). The probability distribution of g(X) is completely detennined by that of X through L:: P[g(X) E BI = PIX E gI(B)]. then 1 (I . known as the marginal frequency function. .8. i . if (X.8. Then g(X) is continuous with density given by .S.6) An example of a transformation often used in statistics is g (91.1') . . .: for every B E Bill.8. is given by(4) i I ( PX(X) = LP(X.X)2.y). Pg(X) (I) = j. (A. If g(X) ~ aX + 1'. it may be shown (as a consequence of (A. (A. . y (A.12) 1 .y)dy.92/ with 91 (X) = k. and 0 otherwise.)'.
. Sections 2124 Grimmett and Stirzaker (1992) Section 4. and only if....7). . X n with X (XI>"" Xn)' . A.13 A convention: We shan write X = Y if the probability of the event IX fe YI is O. A.3 By (A.[Xn E An] are independent. X 2 ) and (Yt> Y2 ) are independent.9 Independence of Random Variables and Vectors 453 In practice. Theorem A.9.3.9.. so are Xl + X2 and YI Y2 . A.4) (A.9. the events [XI E AI and IX.9.7 The preceding equivalences are valid for random vectors XlJ .4 A.6IftheXi are all continuous and independent. ' . 6. " X n ) is either a discrete or continuous random vector. and Stone (1971) Sections 3..1.2. .9. Nevertheless.Xn are said to be (mutually) independent if and only if for any sets AI. Sections 15. the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A. 9 Pitman (1993) Section 4. . References Gnedenko (1967) Chapter 4. all random variables are discrete because there is no instrument that can measure with perfect accuracy. whatever be g and h. . " X n ) is continuous. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical modeL Or else.9.. so are g(X) and h(Y).4 Parzen (1960) Chapter 7.Xn (not necessarily of the same dimensionality) we need only use the events [Xi E Ail where Ai is a set in the range of Xi' A. .An in S.l5 and B.7.6. . .1. which may be easier to deal with. Port.. (X"XIX. the events [X I E All. then X = (Xl. S. The justification for this may be theoretical or pragmatic.5) = following two conditions hold: A. For example. ..) and Y" and so on. it is common in statistics to work with continuous distributions. either ofthe (A.2 The random variables X I..Section A.9 INDEPENDENCE OF RANDOM VARIABLES AND VECTORS A... . 5.. 1 X n are independent if.1 Two random variables X I and X 2 are said to be independent if and only if for sets A and B in E. To generalize these definitions to random vectors Xl.9.9.S. Suppose X (Xl. if (Xl. E BI are independent..S. if X and Yare independent.7 Hoel.. . Then the random variables Xl.
A.9.8 If X₁, ..., X_n are independent, identically distributed k-dimensional random vectors with d.f. F_X or density (frequency function) p_X, then X₁, ..., X_n is called a random sample of size n from a population with d.f. F_X or density (frequency function) p_X. In statistics, such a random sample is often obtained by selecting n members at random in the sense of (A.3.4) from a population and measuring k characteristics on each member.

A.9.9 If we perform n Bernoulli trials with probability of success p and we let X_i be the indicator of the event (success on the ith trial), then X₁, ..., X_n form a sample from a distribution that assigns probability p to 1 and (1 − p) to 0. Such samples will be referred to as the indicators of n Bernoulli trials with probability of success p.

References
Gnedenko (1967) Chapter 4, Sections 23-24
Grimmett and Stirzaker (1992) Sections 3.2 and 4.2
Hoel, Port, and Stone (1971) Section 3.4
Parzen (1960) Chapter 7, Sections 6-7
Pitman (1993) Sections 2.5 and 3.1

A.10 THE EXPECTATION OF A RANDOM VARIABLE

Let X be the height of an individual sampled at random from a finite population. Then a reasonable measure of the center of the distribution of X is the average height of an individual in the given population. If x₁, ..., x_q are the only heights present in the population, it follows that this average is given by Σ_{i=1}^q x_i P[X = x_i], where P[X = x_i] is just the proportion of individuals of height x_i in the population. The same quantity arises (approximately) if we use the long-run frequency interpretation of probability and calculate the average height of the individuals in a large sample from the population in question. In line with these ideas we develop the general concept of expectation as follows.

If X is a nonnegative, discrete random variable with possible values {x₁, x₂, ...}, we define the expectation or mean of X, written E(X), by

E(X) = Σ_{i=1}^∞ x_i p_X(x_i). (A.10.1)

(Infinity is a possible value of E(X). Take p_X(i) = 1/(i(i + 1)), i = 1, 2, 3, ...; then E(X) = Σ_{i=1}^∞ 1/(i + 1) = ∞.)

If A is any event, we define the indicator of the event A as the random variable 1_A given by

1_A(ω) = 1 if ω ∈ A
= 0 otherwise.
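The definition (A.10.1) and the possibility E(X) = ∞ can both be seen numerically. In this sketch of ours, the first sum is the mean of a fair die, while the partial sums for the frequency function p_X(i) = 1/(i(i + 1)) mentioned above keep growing (roughly like log n) and never settle down.

```python
from fractions import Fraction

# E(X) for a fair die by (A.10.1).
die = {k: Fraction(1, 6) for k in range(1, 7)}
print(sum(k * p for k, p in die.items()))          # 7/2

# For p_X(i) = 1/(i(i+1)) the defining series diverges: E(X) = infinity.
for n in (10, 1000, 100_000):
    partial = sum(i * 1.0 / (i * (i + 1)) for i in range(1, n + 1))
    print(n, round(partial, 3))                    # grows without bound
```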
If X = c for all ω (a constant), then E(X) = c. If X = 1_A, the indicator of an event A, then E(X) = P(A).

A.10.2 More generally, if X is a discrete random variable with possible values {x₁, x₂, ...}, decompose {x₁, x₂, ...} into two sets A and B, where A consists of all nonnegative x_i and B of all negative x_i. If either Σ_{x_i ∈ A} x_i p_X(x_i) < ∞ or Σ_{x_i ∈ B} |x_i| p_X(x_i) < ∞, we define E(X) unambiguously by (A.10.1). Otherwise, E(X) is left undefined. We have

E(|X|) = Σ_{i=1}^∞ |x_i| p_X(x_i). (A.10.3)

Here are some properties of the expectation that hold when X is discrete.

A.10.5 If X is an n-dimensional discrete random vector, g is a real-valued function on R^n, and E(|g(X)|) < ∞, then it may be shown that

E(g(X)) = Σ_{i=1}^∞ g(x_i) p_X(x_i). (A.10.6)

Taking g(x) = Σ_{i=1}^n α_i x_i, we obtain the fundamental relationship

E(Σ_{i=1}^n α_i X_i) = Σ_{i=1}^n α_i E(X_i) (A.10.7)

if α₁, ..., α_n are constants and E(|X_i|) < ∞, i = 1, ..., n.

A.10.8 From (A.10.7) it follows that if X ≤ Y and E(X), E(Y) are defined, then E(X) ≤ E(Y).

If X is a continuous random variable, it is natural to attempt a definition of the expectation via approximation by discrete random variables. Those familiar with Lebesgue integration will realize that this leads to

E(X) = ∫_{−∞}^∞ x p_X(x) dx (A.10.9)

whenever ∫_0^∞ x p_X(x) dx or ∫_{−∞}^0 |x| p_X(x) dx is finite; in that case we take (A.10.9) as the definition of the expectation or mean of X. Otherwise, we leave E(X) undefined.

A.10.10 A random variable X is said to be integrable if E(|X|) < ∞.
3.2) xkpx(x)dx if X is continuous. 10. • g(x)dP(x) = L g(Xi)p(Xi).1 parzen (1960) Chapter 5.1 I 1 . j i . In the continuous case dJ1(x) = dx and M(X) is called Lebesgue measure. the kth moment of X is defined to be the expectation of X k. 1 . Chapter 3.1. A convenient notation is dP(x) = p(x)dl'(x).! If k is any natural number and X is a random variable. The fonnulae (A 10.6) hold. Chapter S.. (A.3 Hoel. and (AIO. g(X)Px(x)dx.3).B) g(x)p(x)dx. continuous case.5).IRk g(x)dP(x) (A.5) and (A 10.ll MOMENTS 1 ...3.S) as well as continuous analogues of (A..lO. I In general. 7. j A. It I! II .12) where F denotes the distribution function of X and P is the probability function of X defined by (AS.11). lOA).. I0. 4. (AIO.IO. In the discrete case J1 assigns weight one to each of the points in {x : p(x) > O} and it is called counting measure.. It is possible to define the expectation of a random variable in general using discrete approximations.. I ... We will often refer to p(x) as the density of X in the discrete case as well as the continuous case. POrl.. j . A. and Stone (1971) Sections 4. which means J . The interested reader may consult an advanced text such as Chung (1974). 4.H. Section 26 Grimmett and Stirzaker (1992) Sections 3. . I I) are both sometimes written as E(g(X)) ~ r ink = g(x)dF(x) or r . Chung (1974) Chapter 3 Gnedenko (1967) Chapter 5.. 10.' • 1: x I (A I 1.. 10. Sections 14 Pitman (1993) Sections 3.5) and (A. 'I I' I. E(X k ) j = Lxkpx(x) if X is discrete . the moments depend on the distribution of X only... I: We refer to J1 = IlP as the dominating measure for P.456 then E(g(X» exists and E(g(X» A Review of BaSIC Probability Theory Appendix A = 1. discrete case f: f References i~l (AIO.5) and (A.7)... 3A. (A. .. .!I) In the continuous case expectation properties (A. By (A. I0.. II. We assume that all moments written here exist.
A.11.3 The distribution of a random variable is typically uniquely specified by its moments. This is the case, for instance, if the random variable possesses a moment generating function (cf. (A.12.1)).

A.11.4 The kth central moment of X is by definition E[(X − E(X))^k] and is denoted by μ_k.

A.11.5 The second central moment μ₂ is called the variance of X and is written Var X. The nonnegative square root of Var X is called the standard deviation of X. The variance of X is finite if and only if the second moment of X is finite (cf. (A.11.15)). The standard deviation measures the spread of the distribution of X about its expectation; it is also called a measure of scale. Another measure of the same type is E(|X − E(X)|), which is often referred to as the mean deviation.

A.11.6 If a and b are constants, then by (A.10.7),

Var(aX + b) = a² Var X. (A.11.6)

A.11.7 If X is any random variable with well-defined (finite) mean and variance, the standardized version or Z-score of X is the random variable Z = (X − E(X))/√Var X. By (A.10.7) and (A.11.6), E(Z) = 0 and Var Z = 1.

A.11.8 If Var X = 0, then X = E(X) (a constant).

A.11.9 If E(X²) = 0, then X = 0.

A.11.10 The third and fourth central moments are used in the coefficient of skewness γ₁ and the kurtosis γ₂, which are defined by

γ₁ = μ₃/σ³, γ₂ = μ₄/σ⁴ − 3,

where σ² = Var X. If X ~ N(μ, σ²), then γ₁ = γ₂ = 0. These descriptive measures are useful in comparing the shapes of various frequently used densities. See also Section A.12, where γ₁ and γ₂ are expressed in terms of cumulants.

A.11.11 If Y = a + bX with b > 0, then the coefficient of skewness and the kurtosis of Y are the same as those of X.

A.11.12 It is possible to generalize the notion of moments to random vectors. For simplicity we consider the case k = 2. If X₁ and X₂ are random variables and i, j are natural numbers, the product moment of order (i, j) of X₁ and X₂ is, by definition, E(X₁^i X₂^j). The central product moment of order (i, j) of X₁ and X₂ is by definition E[(X₁ − E(X₁))^i (X₂ − E(X₂))^j]. The central product moment of order (1, 1) is called the covariance of X₁ and X₂ and is written Cov(X₁, X₂). The covariance is defined whenever X₁ and X₂ have finite variances. By expanding the product (X₁ − E(X₁))(X₂ − E(X₂)) and using (A.10.6) and (A.10.7), we obtain the relations

Cov(X₁, X₂) = E(X₁X₂) − E(X₁)E(X₂) (A.11.13)

(one side of the equation exists if and only if the other does),

Cov(aX₁ + bX₂, cX₃ + dX₄) = ac Cov(X₁, X₃) + bc Cov(X₂, X₃) + ad Cov(X₁, X₄) + bd Cov(X₂, X₄), (A.11.14)

and, putting X₁ = X₂ = X in (A.11.13),

Var X = E(X²) − [E(X)]². (A.11.15)

If X₁* and X₂* are distributed as X₁ and X₂ and are independent of X₁ and X₂, then

Cov(X₁, X₂) = ½ E[(X₁ − X₁*)(X₂ − X₂*)]. (A.11.16)

For any two random variables Z₁, Z₂ such that E(Z₁²) < ∞ and E(Z₂²) < ∞, the Cauchy-Schwarz inequality states that

[E(Z₁Z₂)]² ≤ E(Z₁²)E(Z₂²), (A.11.17)

with equality holding if and only if one of Z₁, Z₂ equals 0 or Z₁ = aZ₂ for some constant a. A proof of the Cauchy-Schwarz inequality is given in Remark 1.4.1.

The correlation of X₁ and X₂, denoted by Corr(X₁, X₂), is defined whenever X₁ and X₂ are not constant and the variances of X₁ and X₂ are finite by

Corr(X₁, X₂) = Cov(X₁, X₂)/√(Var X₁ Var X₂). (A.11.18)

The correlation of X₁ and X₂ is the covariance of the standardized versions of X₁ and X₂. Applying (A.11.17) with Z₁ = X₁ − E(X₁) and Z₂ = X₂ − E(X₂) yields the correlation inequality,

|Corr(X₁, X₂)| ≤ 1, (A.11.19)

with equality holding if and only if (1) X₁ or X₂ is a constant or (2) X₂ is a linear function of X₁: X₂ = a + bX₁, b ≠ 0.
The correlation coefficient roughly measures the amount and sign of linear relationship between X₁ and X₂. It is −1 or 1 in the case of a perfect linear relationship, X₂ = a + bX₁ with b < 0 or b > 0, respectively. See also Section 1.4.

By expanding the square and applying (A.10.7) and (A.11.13), we obtain

Var(X₁ + ⋯ + X_n) = Σ_{i=1}^n Var X_i + 2 Σ_{i<j} Cov(X_i, X_j). (A.11.20)

This may be checked directly. If X₁ and X₂ are independent and integrable, then

E(X₁X₂) = E(X₁)E(X₂), (A.11.21)

or, in view of (A.11.13),

Cov(X₁, X₂) = 0. (A.11.22)

It is not true in general that X₁ and X₂ that satisfy (A.11.22) (i.e., are uncorrelated) need be independent. As a consequence of (A.11.20) and (A.11.22), we see that if X₁, ..., X_n are independent with finite variances, then

Var(X₁ + ⋯ + X_n) = Σ_{i=1}^n Var X_i. (A.11.23)

By (A.11.6), (A.11.14), and (A.11.18),

Corr(a + bX₁, c + dX₂) = Corr(X₁, X₂) when bd > 0. (A.11.24)

References
Gnedenko (1967) Chapter 5, Sections 27-28 and 30
Hoel, Port, and Stone (1971) Sections 4.2-4.5
Parzen (1960) Chapters 5 and 8
Pitman (1993) Section 6.4

A.12 MOMENT AND CUMULANT GENERATING FUNCTIONS

A.12.1 If E(e^{s₀|X|}) < ∞ for some s₀ > 0, then

M_X(s) = E(e^{sX})

is well defined for |s| ≤ s₀ and is called the moment generating function of X. By (A.10.6) and (A.10.11),

M_X(s) = Σ_{i=1}^∞ e^{sx_i} p_X(x_i) if X is discrete
= ∫_{−∞}^∞ e^{sx} p_X(x) dx if X is continuous. (A.12.2)

If M_X is well defined in a neighborhood {s : |s| < s₀} of zero, all moments of X are finite and

M_X(s) = Σ_{k=0}^∞ E(X^k) s^k/k!, |s| < s₀. (A.12.3)
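The expansion (A.12.3) says that M_X packs all the moments into one function. In the sketch below (ours, not the text's), M_X is computed from (A.12.2) for a small discrete distribution, and numerical derivatives at 0 recover E(X) and E(X²), anticipating A.12.4.

```python
import math

# A small discrete distribution; M_X is computed from (A.12.2).
support = {-1: 0.2, 0: 0.3, 2: 0.5}

def M(s):
    return sum(p * math.exp(s * x) for x, p in support.items())

EX  = sum(x * p for x, p in support.items())
EX2 = sum(x * x * p for x, p in support.items())

h = 1e-4
print((M(h) - M(-h)) / (2 * h), EX)             # M'(0)  is E(X)
print((M(h) - 2 * M(0) + M(-h)) / h**2, EX2)    # M''(0) is E(X^2)
```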
A.12.4 The moment generating function M_X has derivatives of all orders at s = 0, and

d^k M_X(s)/ds^k |_{s=0} = E(X^k). (A.12.5)

If M_X is well defined in some neighborhood of zero, M_X determines the distribution of X uniquely and is itself uniquely determined by the distribution of X.

If X₁, ..., X_n are independent random variables with moment generating functions M_{X₁}, ..., M_{X_n}, then X₁ + ⋯ + X_n has moment generating function given by

M_{X₁+⋯+X_n}(s) = Π_{i=1}^n M_{X_i}(s). (A.12.6)

This follows by induction from the definition and (A.11.21).

The function

K_X(s) = log M_X(s) (A.12.7)

is called the cumulant generating function of X. If M_X is well defined in a neighborhood of zero, K_X can be represented by the convergent Taylor expansion

K_X(s) = Σ_{j=0}^∞ (c_j/j!) s^j, where c_j = c_j(X) = d^j K_X(s)/ds^j |_{s=0} (A.12.8)

is called the jth cumulant of X, j ≥ 1. For j ≥ 2 and any constant a, c_j(X + a) = c_j(X). If X and Y are independent, then

c_j(X + Y) = c_j(X) + c_j(Y). (A.12.9)

The first cumulant c₁ is the mean μ of X; c₂ and c₃ equal the second and third central moments μ₂ and μ₃ of X, while c₄ = μ₄ − 3μ₂². The coefficients of skewness and kurtosis (see A.11.10) can be written as

γ₁ = c₃/c₂^{3/2} and γ₂ = c₄/c₂².

If X is normally distributed, c_j = 0 for j ≥ 3. For a generalization of the notion of moment generating function to random vectors, see Section B.5.

References
Hoel, Port, and Stone (1971) Chapter 8, Section 8.1
Parzen (1960) Chapter 5, Sections 2-3
Rao (1973) Section 2b.4

A.13 SOME CLASSICAL DISCRETE AND CONTINUOUS DISTRIBUTIONS

By definition, the probability distribution of a random variable or vector is just a probability measure on a suitable Euclidean space. In this section we introduce certain families of distributions, which arise frequently in probability and statistics, and list some of their properties.
Following the name of each distribution we give a shorthand notation that will sometimes be used, as will obvious abbreviations such as "binomial (n, θ)" for "the binomial distribution with parameters (n, θ)." The symbol p as usual stands for a frequency or density function. If anywhere below p is not specified explicitly for some value of x, it shall be assumed that p vanishes at that point. Similarly, if the value of the distribution function F is not specified outside some set, it is assumed to be zero to the "left" of the set and one to the "right" of the set.

Discrete Distributions

The binomial distribution with parameters n and θ: B(n, θ).

A.13.1

p(k) = C(n, k) θ^k (1 − θ)^{n−k}, k = 0, 1, ..., n. (A.13.1)

The parameter n can be any integer > 0, whereas θ may be any number in [0, 1].

A.13.2 If X is the total number of successes obtained in n Bernoulli trials with probability of success θ, then X has a B(n, θ) distribution (see (A.6.3)).

A.13.3 If X has a B(n, θ) distribution, then

E(X) = nθ, Var X = nθ(1 − θ). (A.13.3)

A.13.4 Higher-order moments may be computed from the moment generating function

M_X(t) = [θe^t + (1 − θ)]^n. (A.13.4)

This result may be derived by using (A.13.1) in conjunction with (A.12.2).

A.13.5 If X₁, X₂, ..., X_k are independent random variables with B(n₁, θ), B(n₂, θ), ..., B(n_k, θ) distributions, respectively, then X₁ + X₂ + ⋯ + X_k has a B(n₁ + ⋯ + n_k, θ) distribution. This result may be derived by using (A.12.6) together with the uniqueness property of A.12.4.

The hypergeometric distribution with parameters D, N, and n: H(D, N, n).

A.13.6

p(k) = C(D, k) C(N − D, n − k) / C(N, n) (A.13.6)

for k a natural number with max(0, n − (N − D)) ≤ k ≤ min(n, D). The parameters D and n may be any natural numbers that are less than or equal to the natural number N.

A.13.7 If X is the number of defectives (special objects) in a sample of size n taken without replacement from a population with D defectives and N − D nondefectives, then X has an H(D, N, n) distribution (see (A.6.10)). If the sample is taken with replacement, then X has a B(n, D/N) distribution.
A.13.8 If X has an H(D, N, n) distribution, then

E(X) = n(D/N), Var X = n(D/N)(1 − D/N)((N − n)/(N − 1)). (A.13.8)

This result may be derived from (A.13.7) by writing X = Σ_{j=1}^n I_j, where I_j = 1 if the jth object sampled is defective and 0 otherwise, and then applying formulae (A.10.7) and (A.11.20).

The Poisson distribution with parameter λ: P(λ).

A.13.9

p(k) = e^{−λ} λ^k / k!, k = 0, 1, 2, ... (A.13.9)

The parameter λ can be any positive number.

A.13.10 If X has a P(λ) distribution, then

E(X) = Var X = λ. (A.13.10)

A.13.11 The moment generating function of X is given by

M_X(t) = e^{λ(e^t − 1)}. (A.13.11)

This result may be derived in the same manner as the corresponding fact for the binomial distribution.

A.13.12 If X₁, X₂, ..., X_n are independent random variables with P(λ₁), P(λ₂), ..., P(λ_n) distributions, then X₁ + X₂ + ⋯ + X_n has the P(λ₁ + λ₂ + ⋯ + λ_n) distribution.

The multinomial distribution with parameters n, θ₁, ..., θ_q: M(n, θ₁, ..., θ_q).

A.13.13

p(k₁, ..., k_q) = (n!/(k₁! ⋯ k_q!)) θ₁^{k₁} ⋯ θ_q^{k_q} (A.13.13)

whenever the k_i are nonnegative integers such that Σ_{i=1}^q k_i = n. The parameter n is any natural number, while (θ₁, ..., θ_q) is any vector in Θ = {(θ₁, ..., θ_q) : θ_i ≥ 0, Σ_{i=1}^q θ_i = 1}.

A.13.14 If X = (X₁, ..., X_q)^T, where X_i is the number of times outcome ω_i occurs in n multinomial trials with probabilities (θ₁, ..., θ_q) (see (A.6.4)), then X has a M(n, θ₁, ..., θ_q) distribution.

A.13.15 If X has a M(n, θ₁, ..., θ_q) distribution, then

E(X_i) = nθ_i, Var X_i = nθ_i(1 − θ_i), 1 ≤ i ≤ q,
Cov(X_i, X_j) = −nθ_iθ_j, 1 ≤ i < j ≤ q. (A.13.15)

These results may be derived in the same manner as the corresponding facts for the binomial distribution; an easier way is to use the interpretation A.13.14, write each X_i as a sum of indicators, and apply (A.10.7) and (A.11.20).
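A.13.12 can be checked by simulation. In this sketch of ours, independent Poisson counts with parameters λ₁, λ₂, λ₃ are generated by inverting the d.f. (a simple method that is adequate for small λ), and the relative frequencies of their sum are compared with the P(λ₁ + λ₂ + λ₃) frequency function of (A.13.9).

```python
import math
import random

random.seed(4)

def poisson(lam):
    """Sample from P(lam) by inverting the d.f.; adequate for small lam."""
    u, k = random.random(), 0
    p = math.exp(-lam)      # P(X = 0)
    c = p                   # running value of the d.f.
    while u > c:
        k += 1
        p *= lam / k        # P(X = k) from P(X = k - 1)
        c += p
    return k

lams = (1.0, 2.5, 0.5)
s = [sum(poisson(l) for l in lams) for _ in range(100_000)]
total = sum(lams)
for k in range(5):
    freq = sum(v == k for v in s) / len(s)
    exact = math.exp(-total) * total**k / math.factorial(k)   # (A.13.9) with lam = total
    print(k, round(freq, 4), round(exact, 4))
```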