Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Bickel, Peter J.
Mathematical statistics: basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-850363-X (v. 1)
1. Mathematical statistics. I. Doksum, Kjell A. II. Title.

QA276.B47 2001
519.5 dc21        00-031377
Acquisition Editor: Kathleen Boothby Sestak Editor in Chief: Sally Yagan Assistant Vice President of Production and Manufacturing: David W. Riccardi Executive Managing Editor: Kathleen Schiaparelli Senior Managing Editor: Linda Mihatov Behrens Production Editor: Bob Walters Manufacturing Buyer: Alan Fischer Manufacturing Manager: Trudy Pisciotti Marketing Manager: Angela Battle Marketing Assistant: Vince Jansen Director of Marketing: John Tweeddale Editorial Assistant: Joanne Wendelken Art Director: Jayne Conte Cover Design: Jayne Conte
© 2001, 1977 by Prentice-Hall, Inc.
Upper Saddle River, New Jersey 07458

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
CONTENTS

PREFACE TO THE SECOND EDITION: VOLUME I
PREFACE TO THE FIRST EDITION

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
  1.1 Data, Models, Parameters, and Statistics
      1.1.1 Data and Models
      1.1.2 Parametrizations and Parameters
      1.1.3 Statistics as Functions on the Sample Space
      1.1.4 Examples, Regression Models
  1.2 Bayesian Models
  1.3 The Decision Theoretic Framework
      1.3.1 Components of the Decision Theory Framework
      1.3.2 Comparison of Decision Procedures
      1.3.3 Bayes and Minimax Criteria
  1.4 Prediction
  1.5 Sufficiency
  1.6 Exponential Families
      1.6.1 The One-Parameter Case
      1.6.2 The Multiparameter Case
      1.6.3 Building Exponential Families
      1.6.4 Properties of Exponential Families
      1.6.5 Conjugate Families of Prior Distributions
  1.7 Problems and Complements
  1.8 Notes
  1.9 References

2 METHODS OF ESTIMATION
  2.1 Basic Heuristics of Estimation
      2.1.1 Minimum Contrast Estimates; Estimating Equations
      2.1.2 The Plug-In and Extension Principles
  2.2 Minimum Contrast Estimates and Estimating Equations
      2.2.1 Least Squares and Weighted Least Squares
      2.2.2 Maximum Likelihood
  2.3 Maximum Likelihood in Multiparameter Exponential Families
  *2.4 Algorithmic Issues
      2.4.1 The Method of Bisection
      2.4.2 Coordinate Ascent
      2.4.3 The Newton-Raphson Algorithm
      2.4.4 The EM (Expectation/Maximization) Algorithm
  2.5 Problems and Complements
  2.6 Notes
  2.7 References

3 MEASURES OF PERFORMANCE
  3.1 Introduction
  3.2 Bayes Procedures
  3.3 Minimax Procedures
  *3.4 Unbiased Estimation and Risk Inequalities
      3.4.1 Unbiased Estimation, Survey Sampling
      3.4.2 The Information Inequality
  *3.5 Nondecision Theoretic Criteria
      3.5.1 Computation
      3.5.2 Interpretability
      3.5.3 Robustness
  3.6 Problems and Complements
  3.7 Notes
  3.8 References

4 TESTING AND CONFIDENCE REGIONS
  4.1 Introduction
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models
  4.4 Confidence Bounds, Intervals, and Regions
  4.5 The Duality Between Confidence Regions and Tests
  *4.6 Uniformly Most Accurate Confidence Bounds
  *4.7 Frequentist and Bayesian Formulations
  4.8 Prediction Intervals
  4.9 Likelihood Ratio Procedures
      4.9.1 Introduction
      4.9.2 Tests for the Mean of a Normal Distribution, Matched Pair Experiments
      4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations
      4.9.4 The Two-Sample Problem with Unequal Variances
      4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions
  4.10 Problems and Complements
  4.11 Notes
  4.12 References

5 ASYMPTOTIC APPROXIMATIONS
  5.1 Introduction: The Meaning and Uses of Asymptotics
  5.2 Consistency
      5.2.1 Plug-In Estimates and MLEs in Exponential Family Models
      5.2.2 Consistency of Minimum Contrast Estimates
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications
      5.3.1 The Delta Method for Moments
      5.3.2 The Delta Method for In Law Approximations
      5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
  5.4 Asymptotic Theory in One Dimension
      5.4.1 Estimation: The Multinomial Case
      *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates
      *5.4.3 Asymptotic Normality and Efficiency of the MLE
      *5.4.4 Testing
      *5.4.5 Confidence Bounds
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution
  5.6 Problems and Complements
  5.7 Notes
  5.8 References

6 INFERENCE IN THE MULTIPARAMETER CASE
  6.1 Inference for Gaussian Linear Models
      6.1.1 The Classical Gaussian Linear Model
      6.1.2 Estimation
      6.1.3 Tests and Confidence Intervals
  *6.2 Asymptotic Estimation Theory in p Dimensions
      6.2.1 Estimating Equations
      6.2.2 Asymptotic Normality and Efficiency of the MLE
      6.2.3 The Posterior Distribution in the Multiparameter Case
  *6.3 Large Sample Tests and Confidence Regions
      6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
      6.3.2 Wald's and Rao's Large Sample Tests
  *6.4 Large Sample Methods for Discrete Data
      6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's χ² Test
      6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables
      6.4.3 Logistic Regression for Binary Responses
  *6.5 Generalized Linear Models
  *6.6 Robustness Properties and Semiparametric Models
  6.7 Problems and Complements
  6.8 Notes
  6.9 References

A A REVIEW OF BASIC PROBABILITY THEORY
  A.1 The Basic Model
  A.2 Elementary Properties of Probability Models
  A.3 Discrete Probability Models
  A.4 Conditional Probability and Independence
  A.5 Compound Experiments
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement
  A.7 Probabilities on Euclidean Space
  A.8 Random Variables and Vectors: Transformations
  A.9 Independence of Random Variables and Vectors
  A.10 The Expectation of a Random Variable
  A.11 Moments
  A.12 Moment and Cumulant Generating Functions
  A.13 Some Classical Discrete and Continuous Distributions
  A.14 Modes of Convergence of Random Variables and Limit Theorems
  A.15 Further Limit Theorems and Inequalities
  A.16 Poisson Process
  A.17 Notes
  A.18 References

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS
  B.1 Conditioning by a Random Variable or Vector
      B.1.1 The Discrete Case
      B.1.2 Conditional Expectation for Discrete Variables
      B.1.3 Properties of Conditional Expected Values
      B.1.4 Continuous Variables
      B.1.5 Comments on the General Case
  B.2 Distribution Theory for Transformations of Random Vectors
      B.2.1 The Basic Framework
      B.2.2 The Gamma and Beta Distributions
  B.3 Distribution Theory for Samples from a Normal Population
      B.3.1 The χ², F, and t Distributions
      B.3.2 Orthogonal Transformations
  B.4 The Bivariate Normal Distribution
  B.5 Moments of Random Vectors and Matrices
      B.5.1 Basic Properties of Expectations
      B.5.2 Properties of Variance
  B.6 The Multivariate Normal Distribution
      B.6.1 Definition and Density
      B.6.2 Basic Properties. Conditional Distributions
  B.7 Convergence for Random Vectors: O_p and o_p Notation
  B.8 Multivariate Calculus
  B.9 Convexity and Inequalities
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory
      B.10.1 Symmetric Matrices
      B.10.2 Order on Symmetric Matrices
      B.10.3 Elementary Hilbert Space Theory
  B.11 Problems and Complements
  B.12 Notes
  B.13 References

C TABLES
  Table I    The Standard Normal Distribution
  Table I'   Auxiliary Table of the Standard Normal Distribution
  Table II   t Distribution Critical Values
  Table III  χ² Distribution Critical Values
  Table IV   F Distribution Critical Values

INDEX
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared, statistics has changed enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.

(2) The generation of enormous amounts of data: terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once been unthinkable. The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data. As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with an "infinite" number of parameters. (2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule. (3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo.
(4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious.

(5) The study of the interplay between numerical and statistical considerations. Despite advances in computing speed, some methods run quickly in real time. Others do not and some, though theoretically attractive, cannot be implemented in a human lifetime.

(6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories.

There have, of course, been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. These will not be dealt with in our work.

As a consequence our second edition, reflecting what we now teach our graduate students, is much changed from the first. Our one long book has grown to two volumes, each to be only a little shorter than the first edition. Volume I, which we present in 2000, covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing conclusions.

In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. However, our focus and order of presentation have changed. Volume I covers the material of Chapters 1-6 and Chapter 10 of the first edition with pieces of Chapters 7-10 and includes Appendix A on basic probability theory. However, Chapter 1 of the first edition now has become part of a larger Appendix B, which includes more advanced topics from probability theory such as the multivariate Gaussian distribution, weak convergence in Euclidean spaces, and probability inequalities, as well as more advanced topics in matrix theory and analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on R^d as well as an elementary introduction to Hilbert space theory. As in the first edition, we do not require measure theory but assume from the start that our models are what we call "regular." That is, we assume either a discrete probability whose support does not depend on the parameter set, or the absolutely continuous case with a density. Hilbert space theory is not needed, but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis. Appendix B is as self-contained as possible with proofs of most statements, problems, and references to the literature for proofs of the deepest results such as the spectral theorem. The reason for these additions are the changes in subject matter necessitated by the current areas of importance in the field.

Specifically, instead of beginning with parametrized models we include from the start non- and semiparametric models, then go to parameters and parametric models stressing the role of identifiability. From the beginning we stress function-valued parameters, such as the density, and function-valued statistics, such as the empirical distribution function. We also, from the start, include examples that are important in applications, such as regression experiments. There is more material on Bayesian models and analysis. Save for these changes of emphasis the other major new elements of Chapter 1, which parallels Chapter 2 of the first edition, are an extended discussion of prediction and an expanded introduction to k-parameter exponential families. These objects that are the building blocks of most modern models require concepts involving moments of random vectors and convexity that are given in Appendix B.

Chapter 2 of this edition parallels Chapter 3 of the first and deals with estimation. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter exponential families. Other novel features of this chapter include a detailed analysis, including proofs of convergence, of a standard but slow algorithm for computing MLEs in multiparameter exponential families and an introduction to the EM algorithm, one of the main ingredients of most modern algorithms for inference. Generalized linear models are introduced as examples.

Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions, including some optimality theory for estimation as well and elementary robustness considerations. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage.

Chapter 5 of the new edition is devoted to asymptotic approximations. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Also new is a section relating Bayesian and frequentist inference via the Bernstein-von Mises theorem. Robustness from an asymptotic theory point of view appears also.

Chapter 6 is devoted to inference in multivariate (multiparameter) models. Included are asymptotic normality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the more advanced topics of Volume II.

As in the first edition, problems play a critical role by elucidating and often substantially expanding the text. Almost all the previous ones have been kept with an approximately equal number of new ones added to correspond to our new topics and point of view. The conventions established on footnotes and notation in the first edition remain, if somewhat augmented.

Chapters 1-4 develop the basic principles and examples of statistics. Nevertheless, we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. Although we believe the material of Chapters 5 and 6 has now become fundamental, there is clearly much that could be omitted at a first reading that we also star. There are clear dependencies between starred sections that follow:

5.2 -> 5.4.4 -> 6.2 -> 6.3 -> 6.4 -> 6.5 -> 6.6

Volume II is expected to be forthcoming in 2003. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance. Semiparametric estimation and testing will be considered more generally, greatly extending the material in Chapter 8 of the first edition. The topic presently in Chapter 8, density estimation, will be studied in the context of nonparametric function estimation. Examples of application such as the Cox model in survival analysis, other transformation models, and the classical nonparametric k sample and independence problems will be included. We also expect to discuss classification and model selection using the elementary theory of empirical processes. The basic asymptotic tools that will be developed or presented, in part in the text and in part in appendices, are weak convergence for random processes, elementary empirical process theory, and the functional delta method. A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo. With the tools and concepts developed in this second volume students will be ready for advanced research in modern statistics.

For the first volume of the second edition we would like to add thanks to new colleagues, particularly Jianqing Fan, Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill, and the many students who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for proofreading that found not only typos but serious errors, and Prentice Hall for generous production support. Last and most important we would like to thank our wives, Nancy Kramer Bickel and Joan H. Fujimura, and our families for support, encouragement, and active participation in an enterprise that at times seemed endless, appeared gratifyingly ended in 1976 but has, with the field, taken on a new life.

Peter J. Bickel
bickel@stat.berkeley.edu

Kjell Doksum
doksum@stat.berkeley.edu
PREFACE TO THE FIRST EDITION

This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). Because the book is an introduction to statistics, we need probability theory and expect readers to have had a course at the level of, for instance, Hoel, Port, and Stone's Introduction to Probability Theory. Our appendix does give all the probability that is needed. However, the treatment is abridged with few proofs and no examples or problems.

We feel such an introduction should at least do the following:

(1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice.

(2) Give careful proofs of the major "elementary" results such as the Neyman-Pearson lemma, the Lehmann-Scheffe theorem, the information inequality, and the Gauss-Markoff theorem.

(3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates, and the structure of both Bayes and admissible solutions in decision theory. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated.

(4) Show how the ideas and results apply in a variety of important subfields such as Gaussian linear models, multinomial models, and nonparametric models.

Although there are several good books available for this purpose, we feel that none has quite the mix of coverage and depth desirable at this level. The work of Rao, Linear Statistical Inference and Its Applications, 2nd ed., covers most of the material we do and much more but at a more abstract level employing measure theory. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig, Introduction to Mathematical Statistics, 3rd ed. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior.

Our book contains more material than can be covered in two quarters. In the two-quarter courses for graduate students in mathematics, statistics, the physical sciences, and engineering that we have taught, we cover the core Chapters 2 to 7, which go from modeling through estimation and testing to linear models. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Finally, we select topics from Chapter 8 on discrete data and Chapter 9 on nonparametric models.

Chapter 1 covers probability theory rather than statistics. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. It may be integrated with the material of Chapters 2-7 as the course proceeds rather than being given at the start, or it may be included at the end of an introductory probability course that precedes the statistics course.

A special feature of the book is its many problems. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text.

Conventions: (i) In order to minimize the number of footnotes we have added a section of comments at the end of each chapter preceding the problem section. These comments are ordered by the section to which they pertain. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers, 1 for the first, 2 for the second, and so on. The comments contain digressions, reservations, and additional references. They need to be read only as the reader's curiosity is piqued.

(ii) Various notational conventions and abbreviations are used in the text. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text.

(iii) Basic notation for probabilistic objects such as random variables and vectors, densities, distribution functions, and moments is established in the appendix.

We would like to acknowledge our indebtedness to colleagues, students, and friends who helped us during the various stages (notes, preliminary edition, final draft) through which this book passed. E. L. Lehmann's wise advice has played a decisive role at many points. R. Pyke's careful reading of a next-to-final version caught a number of infelicities of style and content. Many careless mistakes and typographical errors in an earlier version were caught by D. Minassian who sent us an exhaustive and helpful listing. W. Carmichael, in proofreading the final version, caught more mistakes than both authors together. A serious error in Problem 2.2.5 was discovered by F. Scholz. Among many others who helped in the same way we would like to mention C. Chen, S. J. Chou, G. Drew, C. Gray, U. Gupta, P. X. Quang, and A. Samulon. Without Winston Chow's lovely plots Section 9.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day.

We would also like to thank the colleagues and friends who inspired and helped us to enter the field of statistics. The foundation of our statistical knowledge was obtained in the lucid, enthusiastic, and stimulating lectures of Joe Hodges and Chuck Bell. Later we were both very much influenced by Erich Lehmann whose ideas are strongly reflected in this book.

Peter J. Bickel
Kjell Doksum
Berkeley 1976
Mathematical Statistics Basic Ideas and Selected Topics Volume I Second Edition .
Chapter 1

STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA

1.1 DATA, MODELS, PARAMETERS AND STATISTICS

1.1.1 Data and Models

Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor.

Data can consist of:

(1) Vectors of scalars, measurements, and/or characters, for example, a single time series of measurements.

(2) Matrices of scalars and/or characters, for example, digitized pictures or, more routinely, measurements of covariates and response on a set of n individuals (see Example 1.1.4 and Sections 2.2.1 and 6.1).

(3) Arrays of scalars and/or characters as in contingency tables (see Chapter 6), or more generally multifactor multiresponse data on a number of individuals.

(4) All of the above and more, in particular, functions as in signal processing, trees as in evolutionary phylogenies, and so on.

The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically.

A detailed discussion of the appropriateness of the models we shall discuss in particular situations is beyond the scope of this book, but we will introduce general model diagnostic tools in Volume 2, Chapter 1. Moreover, we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. A generic source of trouble often called gross errors is discussed in greater detail in the section on robustness (Section 3.5.3). In any case all our models are generic and, as usual, "The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. Subject matter specialists usually have to be principal guides in model formulation. The goals of science and society, which statisticians share, are to draw useful information from data using everything that we know.
"Models of course, are never true but fortunately it is only necessary that they be useful." In this book we will study how, starting with tentative models:

(1) We can conceptualize the data structure and our goals more precisely. We begin this in the simple examples that follow and continue in Sections 1.2-1.5 and throughout the book.

(2) We can derive methods of extracting useful information from data and, in particular, give methods that assess the generalizability of experimental results. For instance, if we observe an effect in our data, to what extent can we expect the same effect more generally? Estimation, testing, confidence regions, and more general procedures will be discussed in Chapters 2-4.

(3) We can assess the effectiveness of the methods we propose. We begin this discussion with decision theory in Section 1.3 and continue with optimality principles in Chapters 3 and 4.

(4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. Goodness of fit tests, robustness, and diagnostics are discussed in Volume 2.

(5) We can be guided to alternative or more general descriptions that might fit better. Hierarchies of models are discussed throughout.

Here are some examples:

(a) We are faced with a population of N elements, for instance, a shipment of manufactured items. An unknown number Nθ of these elements are defective. It is too expensive to examine all of the items. So to get information, a sample of n is drawn without replacement and inspected. The data gathered are the number of defectives found in the sample.

(b) We want to study how a physical or economic feature, for example, height or income, is distributed in a large population. An exhaustive census is impossible so the study is based on measurements and a sample of n individuals drawn at random from the population. The population is so large that, for modeling purposes, we approximate the actual process of sampling without replacement by sampling with replacement.

(c) An experimenter makes n independent determinations of the value of a physical constant μ. His or her measurements are subject to random fluctuations (error) and the data can be thought of as μ plus some random errors.

(d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee, reducing pollution, treating a disease, producing energy, learning a maze, and so on. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. In this manner, we obtain one or more quantitative or qualitative measures of efficacy from each experiment. For instance, we can assign two drugs, A to m, and B to n, randomly selected patients and then measure temperature and blood pressure, have the patients rated qualitatively for improvement by physicians, and so on. Random variability here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.
We shall use these examples to arrive at our formulation of statistical models and to indicate some of the difficulties of constructing such models. First consider situation (a), which we refer to as:

Example 1.1.1. Sampling Inspection. The mathematical model suggested by the description is well defined. A random experiment has been performed. The sample space consists of the numbers 0, 1, ..., n corresponding to the number of defective items found. On this space we can define a random variable X given by X(k) = k, k = 0, 1, ..., n. If Nθ is the number of defective items in the population sampled, then by (A.13.6), X has an hypergeometric, H(Nθ, N, n), distribution,

P_θ[X = k] = C(Nθ, k) C(N - Nθ, n - k) / C(N, n),  max(n - N(1 - θ), 0) <= k <= min(Nθ, n),   (1.1.1)

where C(a, b) denotes the binomial coefficient "a choose b." The main difference that our model exhibits from the usual probability model is that Nθ is unknown and, in principle, can take on any value between 0 and N. So, although the sample space is well defined, we cannot specify the probability structure completely but rather only give a family {H(Nθ, N, n)} of probability distributions for X, any one of which could have generated the data actually observed.
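The hypergeometric family of Example 1.1.1 is easy to explore numerically. The following sketch uses Python with NumPy and SciPy (neither appears in the text); the values N = 1000, n = 20, θ = 0.10 are illustrative choices, not taken from the text. It tabulates the frequency function of X and compares it with the binomial frequency function that corresponds to sampling with replacement, the approximation discussed for large populations.

```python
import numpy as np
from scipy.stats import hypergeom, binom

# Illustrative values (not from the text): population size N, sample size n,
# fraction of defectives theta, so that D = N*theta items are defective.
N, n, theta = 1000, 20, 0.10
D = int(N * theta)

k = np.arange(0, n + 1)
# Hypergeometric frequency function of X = number of defectives in the sample.
# SciPy's convention: hypergeom(M=population size, n=number of defectives, N=draws).
p_hyper = hypergeom(N, D, n).pmf(k)
# Binomial approximation corresponding to sampling with replacement ("N large").
p_binom = binom(n, theta).pmf(k)

for kk, ph, pb in zip(k, p_hyper, p_binom):
    print(f"k={kk:2d}  hypergeometric={ph:.4f}  binomial={pb:.4f}")
```

For a shipment this large the two frequency functions agree closely, which is the numerical content of the remark that sampling with replacement can replace sampling without replacement when N is large.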
Example 1.1.2. Sample from a Population. One-Sample Models. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not. It can also be thought of as a limiting case in which N = infinity, so that sampling with replacement replaces sampling without. Formally, we observe X_1, ..., X_n, which are modeled as realizations of X_1, ..., X_n independent, identically distributed (i.i.d.) random variables with common unknown distribution function F. We often refer to such X_1, ..., X_n as a random sample from F, and also write that X_1, ..., X_n are i.i.d. as X with X ~ F, where "~" stands for "is distributed as." The model is fully described by the set F of distributions that we specify.

The same model also arises naturally in situation (c). Here we can write the n determinations of μ as

X_i = μ + ε_i,  1 <= i <= n,   (1.1.2)

where ε = (ε_1, ..., ε_n)^T is the vector of random errors. What should we assume about the distribution of ε, which together with μ completely specifies the joint distribution of X_1, ..., X_n? Of course, that depends on how the experiment is carried out. Given the description in (c), we postulate

(1) The value of the error committed on one determination does not affect the value of the error at other times. That is, ε_1, ..., ε_n are independent.

(2) The distribution of the error at one determination is the same as that at another. Thus, ε_1, ..., ε_n are identically distributed.

(3) The distribution of ε is independent of μ.

Equivalently X_1, ..., X_n are a random sample and, if we let G be the distribution function of ε_1 and F that of X_1, then

F(x) = G(x - μ)   (1.1.3)

and the model is alternatively specified by F, the set of F's we postulate, or by {(μ, G) : μ in R, G in G} where G is the set of all allowable error distributions that we postulate. Commonly considered G's are all distributions with center of symmetry 0, or alternatively all distributions with expectation 0. The classical default model is:

(4) The common distribution of the errors is N(0, σ²), where σ² is unknown. That is, the X_i are a sample from a N(μ, σ²) population or equivalently F = {Φ((· - μ)/σ) : μ in R, σ > 0} where Φ is the standard normal distribution.

This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations, for instance, heights of individuals or log incomes. It is important to remember that these are assumptions at best only approximately valid. All actual measurements are discrete rather than continuous. There are absolute bounds on most quantities: 100 ft high men are impossible. Heights are always nonnegative. The Gaussian distribution, whatever be μ and σ, will have none of this.

Example 1.1.3. Two-Sample Models. Now consider situation (d). Let x_1, ..., x_m and y_1, ..., y_n, respectively, be the responses of m subjects having a given disease given drug A and n other similarly diseased subjects given drug B. By convention, if drug A is a standard or placebo, we refer to the x's as control observations. A placebo is a substance such as water that is expected to have no effect on the disease and is used to correct for the well-documented placebo effect; that is, patients improve even if they only think they are being treated. We let the y's denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo. We call the y's treatment observations.

Natural initial assumptions here are:

(1) The x's and y's are realizations of X_1, ..., X_m, a sample from F, and Y_1, ..., Y_n, a sample from G, so that the model is specified by the set of possible (F, G) pairs.

To specify this set more closely the critical constant treatment effect assumption is often made.

(2) Suppose that if treatment A had been administered to a subject response x would have been obtained. Then if treatment B had been administered to the same subject instead of treatment A, response y = x + Δ would be obtained where Δ does not depend on x.

This implies that if F is the distribution of a control, then G(·) = F(· - Δ). We call this the shift model with parameter Δ. Often the final simplification is made:

(3) The control responses are normally distributed. Then if F is the N(μ, σ²) distribution and G is the N(μ + Δ, σ²) distribution, we have specified the Gaussian two-sample model with equal variances.
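A small simulation makes the one- and two-sample models concrete. The sketch below (Python with NumPy; the sample sizes and the values of μ, Δ and σ are arbitrary illustrative choices, not from the text) generates control responses from F = N(μ, σ²) and treatment responses from the shifted distribution G(·) = F(· - Δ), and computes the natural estimates of μ and Δ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (not from the text).
mu, delta, sigma = 10.0, 2.0, 1.5
m, n = 50, 60                      # numbers of control and treatment subjects

# Control responses: a sample from F = N(mu, sigma^2).
x = mu + sigma * rng.standard_normal(m)
# Treatment responses under the constant-treatment-effect (shift) assumption:
# a sample from G with G(.) = F(. - delta), i.e. N(mu + delta, sigma^2).
y = (mu + delta) + sigma * rng.standard_normal(n)

# Natural estimates of mu and of the treatment effect delta.
print("estimate of mu:   ", x.mean())
print("estimate of delta:", y.mean() - x.mean())
```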
On this sample space we have defined a random vector X = (Xl.6. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise. All the severely ill patients might.4. we observe X and the family P is that of all bypergeometric distributions with sample size n and population size N. have been assigned to B. we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated. our analyses.3 when F. Experiments in medicine and the social sciences often pose particular difficulties.1 Data. For instance. the data X(w). though correct for the model written down. P is the . equally trained observers with no knowledge of each other's findings. but not others. As our examples suggest. That is. G are assumed arbitrary. Parameters. the number of a particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed.'" . In Example 1.1. Since it is only X that we observe. in Example 1. in comparative experiments such as those of Example 1. It is often convenient to identify the random vector X with its realization.1. we can be reasonably secure about some aspects. if they are true. if they are false.1. P is referred to as the model. and Statistics 5 How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. Fortunately.1). in Example 1.1. This will be done in Sections 3. For instance.2. Statistical methods for models of this kind are given in Volume 2. if (1)(4) hold.1. we know how to combine our measurements to estimate JL in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4.3 and 6. for instance. ~(w) is referred to as the observations or data.5. This distribution is assumed to be a member of a family P of probability distributions on Rn.Section 1. there is tremendous variation in the degree of knowledge and control we have concerning experiments. Models. The advantage of piling on assumptions such as (I )(4) of Example 1.2. When w is the outcome of the experiment. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. However. Using our first three examples for illustrative purposes.1. In some applications we often have a tested theoretical model and the danger is small. we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients. In others. A review of necessary concepts and notation from probability theory are given in the appendices.1. the methods needed for its analysis are much the same as those appropriate for the situation of Example 1. we can ensure independence and identical distribution of the observations by using different.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease. The danger is that. we now define the elements of a statistical model. We are given a random experiment with sample space O. The number of defectives in the first example clearly has a hypergeometric distribution. we need only consider its probability distribution.2 is that. In this situation (and generally) it is important to randomize. 
Using our first three examples for illustrative purposes, we now define the elements of a statistical model. We are given a random experiment with sample space Ω. On this sample space we have defined a random vector X = (X_1, ..., X_n). When ω is the outcome of the experiment, X(ω) is referred to as the observations or data. It is often convenient to identify the random vector X with its realization, the data X(ω). Since it is only X that we observe, we need only consider its probability distribution. This distribution is assumed to be a member of a family P of probability distributions on R^n. P is referred to as the model. For instance, in Example 1.1.1, we observe X and the family P is that of all hypergeometric distributions with sample size n and population size N. In Example 1.1.2, if (1)-(4) hold, P is the family of all distributions according to which X_1, ..., X_n are independent and identically distributed with a common N(μ, σ²) distribution.

1.1.2 Parametrizations and Parameters

To describe P we use a parametrization, that is, a map, θ -> P_θ from a space of labels, the parameter space Θ, to P; or equivalently we write P = {P_θ : θ in Θ}. Thus, in Example 1.1.1 we take θ to be the fraction of defectives, Θ = {0, 1/N, ..., 1}, and P_θ the H(Nθ, N, n) distribution. In Example 1.1.2 with assumptions (1)-(4) we have Θ = R x R⁺ and, if θ = (μ, σ²), P_θ is the distribution on R^n with density ∏_{i=1}^n (1/σ) φ((x_i - μ)/σ), where φ is the standard normal density. If, still in this example, we only wish to make assumptions (1)-(3) with ε having expectation 0, we can take Θ = {(μ, G) : μ in R, G with density g such that ∫ x g(x) dx = 0} and p_(μ,G)(x) = ∏_{i=1}^n g(x_i - μ).

When we can take Θ to be a nice subset of Euclidean space and the maps θ -> P_θ are smooth, in senses to be made precise later, models P are called parametric. Models such as that of Example 1.1.2 with assumptions (1)-(3) are called semiparametric. Finally, models such as that of Example 1.1.3 with only (1) holding and F, G taken to be arbitrary are called nonparametric. The only truly nonparametric but useless model for X in R^n is to assume that its (joint) distribution can be anything. It's important to note that even nonparametric models make substantial assumptions: in Example 1.1.3 that X_1, ..., X_m are identically distributed as are Y_1, ..., Y_n, and that X_1, ..., X_m are independent of each other and of Y_1, ..., Y_n.

Note that there are many ways of choosing a parametrization in these and all other problems. We may take any one-to-one function of θ as a new parameter. For instance, in Example 1.1.1 we can use the number of defectives in the population, Nθ, as a parameter, and in Example 1.1.2, under assumptions (1)-(4), we may parametrize the model by the first and second moments of the normal distribution of the observations (i.e., by (μ, μ² + σ²)).

What parametrization we choose is usually suggested by the phenomenon we are modeling: θ is the fraction of defectives, μ is the unknown constant being measured. However, as we shall see later, the first parametrization we arrive at is not necessarily the one leading to the simplest analysis. Of even greater concern is the possibility that the parametrization is not one-to-one, that is, such that we can have θ_1 ≠ θ_2 and yet P_{θ_1} = P_{θ_2}. Such parametrizations are called unidentifiable. For instance, in (1.1.2) suppose that we permit G to be arbitrary. Then the parametrization is unidentifiable because, for example, μ = 0 and N(1, 1) errors lead to the same distribution of the observations as μ = 1 and N(0, 1) errors. If, on the other hand, we know we are measuring a positive quantity in this model, we can take Θ = {(μ, G) : μ in R⁺, G has (arbitrary) density g}. The critical problem with unidentifiable parametrizations is that even with "infinite amounts of data," that is, knowledge of the true P_θ, parts of θ remain unknowable. Thus, we will need to ensure that our parametrizations are identifiable, that is, θ_1 ≠ θ_2 implies P_{θ_1} ≠ P_{θ_2}.

Dual to the notion of a parametrization, a map from some Θ to P, is that of a parameter, formally a map, ν, from P to another space N. A parameter is a feature ν(P) of the distribution of X. For instance, in Example 1.1.1, the fraction of defectives θ can be thought of as the mean of X/n. In Example 1.1.3 with assumptions (1)-(2) we are interested in Δ, which can be thought of as the difference in the means of the two populations of responses. But given a parametrization θ -> P_θ, we can define θ : P -> Θ as the inverse of the map θ -> P_θ, from Θ to its range P, iff the latter map is one-to-one, that is, if P_{θ_1} = P_{θ_2} implies θ_1 = θ_2. Thus, θ is a parameter if and only if the parametrization is identifiable. More generally, a function q : Θ -> N can be identified with a parameter ν(P) iff P_{θ_1} = P_{θ_2} implies q(θ_1) = q(θ_2), and then ν(P_θ) = q(θ). Implicit in this description is the assumption that θ is a parameter in the sense we have just defined.

Here are two points to note:

(1) A parameter can have many representations. For instance, in Example 1.1.2 with assumptions (1)-(4) the parameter of interest μ = μ(P) can be characterized as the mean of P, or the median of P, or the midpoint of the interquantile range of P, or more generally as the center of symmetry of P, as long as P is the set of all Gaussian distributions.

(2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable). For instance, consider Example 1.1.2 again in which we assume the error ε to be Gaussian but with arbitrary mean Δ. Then P is parametrized by θ = (μ, Δ, σ²), where σ² is the variance of ε. As we have seen, this parametrization is unidentifiable and neither μ nor Δ are parameters in the sense we've defined. But σ² = Var(X_i) evidently is, and so is μ + Δ.

Sometimes the choice of P starts by the consideration of a particular parameter. For instance, our interest in studying a population of incomes may precisely be in the mean income. When we sample, say with replacement, and observe X_1, ..., X_n independent with common distribution, it is natural to write X_i = μ + ε_i, 1 <= i <= n, where μ denotes the mean income and, thus, E(ε_i) = 0. The (μ, G) parametrization of Example 1.1.2 is now well defined and identifiable by (1.1.3) and G = {G : ∫ x dG(x) = 0}. Similarly, in Example 1.1.3, instead of postulating a constant treatment effect Δ, we can start by making the difference of the means, δ = μ_Y - μ_X, the focus of the study. Then δ is identifiable whenever μ_X and μ_Y exist.

In addition to the parameters of interest, there are also usually nuisance parameters, which correspond to other unknown features of the distribution of X. For instance, in Example 1.1.2, if the errors are normally distributed with unknown variance σ², then σ² is a nuisance parameter. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter θ, which indexes the family P, that is, make θ -> P_θ into a parametrization of P.

1.1.3 Statistics as Functions on the Sample Space

Models and parametrizations are creations of the statistician, but the true values of parameters are secrets of nature. Our aim is to use the data inductively, to narrow down in useful ways our ideas of what the "true" P is. The link for us are things we can compute, statistics. Formally, a statistic T is a map from the sample space X to some space of values T, usually a Euclidean space. Informally, T(x) is what we can compute if we observe X = x. Thus, in Example 1.1.1, a statistic is the fraction defective in the sample, T(x) = x/n. In Example 1.1.2 a common estimate of μ is the statistic T(X_1, ..., X_n) = X̄ = n⁻¹ Σ_{i=1}^n X_i, and a common estimate of σ² is the statistic

s² = n⁻¹ Σ_{i=1}^n (X_i - X̄)²;

X̄ and s² are called the sample mean and sample variance. How we use statistics in estimation and other decision procedures is the subject of the next section.

For future reference we note that a statistic, just as a parameter, need not be real or Euclidean valued. For instance, a statistic we shall study extensively in Chapter 2 is the function-valued statistic F̂, called the empirical distribution function, which evaluated at x in R is

F̂(X_1, ..., X_n)(x) = n⁻¹ Σ_{i=1}^n 1(X_i <= x),

where (X_1, ..., X_n) are a sample from a probability P on R and 1(A) is the indicator of the event A. This statistic takes values in the set of all distribution functions on R. It estimates the function-valued parameter F defined by its evaluation at x in R, F(P)(x) = P[X_1 <= x].
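The statistics just introduced are simple to compute. The sketch below (Python with NumPy; the simulated data are illustrative only) evaluates the sample mean X̄, the sample variance s², and the empirical distribution function F̂ at a few points x.

```python
import numpy as np

rng = np.random.default_rng(2)
x_data = rng.normal(loc=0.0, scale=1.0, size=25)   # illustrative sample X_1, ..., X_n

# Sample mean and sample variance; the text divides by n, which is NumPy's default (ddof=0).
xbar = x_data.mean()
s2 = x_data.var(ddof=0)

def empirical_cdf(sample, x):
    """F_hat(x) = (1/n) * #{i : X_i <= x}, the empirical distribution function."""
    sample = np.sort(sample)
    return np.searchsorted(sample, x, side="right") / sample.size

print("sample mean     :", xbar)
print("sample variance :", s2)
for x in (-1.0, 0.0, 1.0):
    print(f"F_hat({x:+.1f}) =", empirical_cdf(x_data, x))
```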
Deciding which statistics are important is closely connected to deciding which parameters are important and, hence, can be related to model formulation as we saw earlier. For instance, consider situation (d) listed at the beginning of this section. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient, then our attention naturally focuses on estimating this constant. If, however, this difference depends on the patient in a complex manner (the effect of each drug is complex), we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure.

Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. Next this model, which now depends on the data, is used to decide what estimate of the measure of difference should be employed (cf., for example, Mandel, 1964). Data-based model selection can make it difficult to ascertain or even assign a meaning to the accuracy of estimates or the probability of reaching correct conclusions. These issues will be discussed further in Volume 2. In this volume we assume that the model has been selected prior to the current experiment. This selection is based on experience with previous similar experiments (cf. Lehmann, 1990).

There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. For instance, in situation (d) again, patients may be considered one at a time, sequentially, and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients. The experimenter may, for example, assign the drugs alternatively to every other patient in the beginning and then, after a while, assign the drug that seems to be working better to a higher proportion of patients. Moreover, the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other. Thus, the number of patients in the study (the sample size) is random. Problems such as these lie in the fields of sequential analysis and experimental design. They are not covered under our general model and will not be treated in this book. We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more information.

Notation. When dependence on θ has to be observed, we shall denote the distribution corresponding to any particular parameter value θ by P_θ. Expectations calculated under the assumption that X ~ P_θ will be written E_θ. Distribution functions will be denoted by F(·, θ), density and frequency functions by p(·, θ). However, these and other subscripts and arguments will be omitted where no confusion can arise.

Regular models. It will be convenient to assume(1) from now on that in any parametric model we consider either:

(1) All of the P_θ are continuous with densities p(x, θ); or

(2) All of the P_θ are discrete with frequency functions p(x, θ), and there exists a set {x_1, x_2, ...} that is independent of θ such that Σ_{i>=1} p(x_i, θ) = 1 for all θ.

Such models will be called regular parametric models. In the discrete case we will use both the terms frequency function and density for p(x, θ). See A.10.

1.1.4 Examples, Regression Models

We end this section with two further important examples indicating the wide scope of the notions we have introduced.

Example 1.1.4. Regression Models. In most studies we are interested in studying relations between responses and several other variables, not just treatment or control as in Example 1.1.3. This is the stage for the following. The distribution of the response Y_i for the ith subject or case in the study is postulated to depend on certain characteristics z_i of the ith subject. Thus, z_i is a d-dimensional vector that gives characteristics such as sex, age, height, weight, and so on of the ith subject in the study. For instance, in Example 1.1.3 we could take z to be the treatment label and write our observations as (A, x_1), ..., (A, x_m), (B, y_1), ..., (B, y_n). This is obviously overkill, but suppose that, in the study, drugs A and B are given at several dose levels. Then d = 2 and z can denote the pair (Treatment Label, Treatment Dose Level) for patient i.

We observe (z_1, Y_1), ..., (z_n, Y_n) where Y_1, ..., Y_n are independent. In general, z_i is a nonrandom vector of values called a covariate vector or a vector of explanatory variables, whereas Y_i is random and referred to as the response variable or dependent variable in the sense that its distribution depends on z_i. If we let f(y_i | z_i) denote the density of Y_i for a subject with covariate vector z_i, then the model is

(a)   p(y_1, ..., y_n) = ∏_{i=1}^n f(y_i | z_i).

If we let μ(z) denote the expected value of a response with given covariate vector z, then we can write

(b)   Y_i = μ(z_i) + ε_i,  i = 1, ..., n,

where ε_i = Y_i - E(Y_i). Here μ(z) is an unknown function from R^d to R that we are interested in. By varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. We usually need to postulate more. A common (but often violated) assumption is

(1) The ε_i are identically distributed with distribution F. That is, the effect of z on Y is through μ(z) only. In the two-sample models this is implied by the constant treatment effect assumption.

On the basis of subject matter knowledge and/or convenience it is usually postulated that

(2) μ(z) = g(β, z) where g is known except for a vector β = (β_1, ..., β_d)^T of unknowns. The most common choice of g is the linear form,

(3) g(β, z) = Σ_{j=1}^d β_j z_j = z^T β, so that (b) becomes

(b')   Y_i = z_i^T β + ε_i,  i = 1, ..., n.

This is the linear model. Often the following final assumption is made:

(4) The distribution F of (1) is N(0, σ²) with σ² unknown. Then we have the classical Gaussian linear model, which we can write in vector matrix form,

(c)   Y ~ N(Zβ, σ² J),

where Z (n x d) = (z_1^T, ..., z_n^T)^T and J is the n x n identity.

Clearly, Example 1.1.3(3) with the Gaussian two-sample model μ(A) = μ, μ(B) = μ + Δ is a special case of this model; see Problem 1.1.8. So is Example 1.1.2 with assumptions (1)-(4). In fact, by varying the assumptions we obtain parametric models as with (1), (3) and (4) above, semiparametric models as with (1) and (2) with F arbitrary, and nonparametric models if we drop (1) and simply treat the z_i as labels of the completely unknown distributions of Y_i. Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems.

Finally, we give an example in which the responses are dependent.

Example 1.1.5. Measurement Model with Autoregressive Errors. Let X_1, ..., X_n be the n determinations of a physical constant μ. Consider the model where

X_i = μ + e_i,  i = 1, ..., n,
e_i = β e_{i-1} + ε_i,  i = 1, ..., n,  e_0 = 0,

and assume the ε_i are independent, identically distributed with density f. Here the errors e_1, ..., e_n are dependent, as are the X's.

An example would be, say, the elapsed times X_1, ..., X_n spent above a fixed high level for a series of n consecutive wave records at a point on the seashore. Let μ = E(X_i) be the average time for an infinite series of records. It is plausible that e_i depends on e_{i-1} because long waves tend to be followed by long waves. A second example is consecutive measurements X_i of a constant μ made by the same observer who seeks to compensate for apparent errors. Of course, model (a) assumes much more, but it may be a reasonable first approximation in these situations.

To find the density p(x_1, ..., x_n) of X_1, ..., X_n, we start by finding the density of e_1, ..., e_n. Using conditional probability theory and e_i = β e_{i-1} + ε_i, we have

p(e_1) p(e_2 | e_1) p(e_3 | e_1, e_2) ... p(e_n | e_1, ..., e_{n-1})
= p(e_1) p(e_2 | e_1) p(e_3 | e_2) ... p(e_n | e_{n-1})
= f(e_1) f(e_2 - β e_1) ... f(e_n - β e_{n-1}).

Because ε_i = X_i - μ - β(X_{i-1} - μ), that is, X_i = μ(1 - β) + β X_{i-1} + ε_i for i >= 2, and X_1 = μ + ε_1, the model for X_1, ..., X_n is

p(x_1, ..., x_n) = f(x_1 - μ) ∏_{j=2}^n f(x_j - (1 - β)μ - β x_{j-1}).   (1.1.4)

The default assumption, at best an approximation for the wave example, is that f is the N(0, σ²) density. Then we have what is called the AR(1) Gaussian model. We include this example to illustrate that we need not be limited by independence. However, save for a brief discussion in Volume 2, the conceptual issues of stationarity, ergodicity, and the associated probability theory models and inference for dependent data are beyond the scope of this book.

Summary. In this section we introduced the first basic notions and formalism of mathematical statistics: vector observations X with unknown probability distributions P ranging over models P. The notions of parametrization and identifiability are introduced. The general definition of parameters and statistics is given and the connection between parameters and parametrizations elucidated. This is done in the context of a number of classical examples, the most important of which is the workhorse of statistics, the regression model. We view statistical models as useful tools for learning from the outcomes of experiments and studies. They are useful in understanding how the outcomes can be used to draw inferences that go beyond the particular experiment. Models are approximations to the mechanisms generating the observations. How useful a particular model is is a complex mix of how good the approximation is and how much insight it gives into drawing inferences.

1.2 BAYESIAN MODELS

Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data. There are situations in which most statisticians would agree that more can be said. For instance, in the inspection Example 1.1.1, it is possible that, in the past, we have had many shipments of size N that have subsequently been distributed. If the customers have provided accurate records of the number of defective items that they have found, we can construct a frequency distribution {π_0, ..., π_N} for the proportion θ of defectives in past shipments. That is, π_i is the frequency of shipments with i defective items, i = 0, ..., N. Now it is reasonable to suppose that the value of θ in the present shipment is the realization of a random variable θ with distribution given by

P[θ = i/N] = π_i,  i = 0, ..., N.   (1.2.1)

Our model is then specified by the joint distribution of the observed number X of defectives in the sample and the random variable θ. We know that, given θ = i/N, X has the hypergeometric distribution H(i, N, n). Thus,

P[X = k, θ = i/N] = π_i C(i, k) C(N - i, n - k) / C(N, n).   (1.2.2)

This is an example of a Bayesian model. There is a substantial number of statisticians who feel that it is always reasonable, and indeed necessary, to think of the true value of the parameter θ as being the realization of a random variable θ with a known distribution. This distribution does not always correspond to an experiment that is physically realizable but rather is thought of as a measure of the beliefs of the experimenter concerning the true value of θ before he or she takes any data. Thus, the resulting statistical inference becomes subjective. The theory of this school is expounded by L. J. Savage (1954), Raiffa and Schlaiffer (1961), Lindley (1965), De Groot (1969), and Berger (1985). An interesting discussion of a variety of points of view on these questions may be found in Savage et al. (1962).

There is an even greater range of viewpoints in the statistical community, from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of θ has an objective interpretation in terms of frequencies.(1) Our own point of view is that subjective elements, including the views of subject matter experts, are an essential element in all model building. However, insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). However, by giving θ a distribution purely as a theoretical tool to which no subjective significance is attached, we can obtain important and useful results and insights. We shall return to the Bayesian framework repeatedly in our discussion.

In this section we shall define and discuss the basic elements of Bayesian models. Suppose that we have a regular parametric model {P_θ : θ in Θ}. To get a Bayesian model we introduce a random vector θ, whose range is contained in Θ, with density or frequency function π. The function π represents our belief or information about the parameter θ before the experiment and is called the prior density or frequency function. We now think of P_θ as the conditional distribution of X given θ = θ. The joint distribution of (θ, X) is that of the outcome of a random experiment in which we first select θ = θ according to π and then, given θ = θ, select X according to P_θ. If both X and θ are continuous or both are discrete, then (θ, X) is appropriately continuous or discrete with density or frequency function

π(θ, x) = π(θ) p(x, θ).   (1.2.3)

Because we now think of p(x, θ) as a conditional density or frequency function given θ = θ, we will denote it by p(x | θ) for the remainder of this section. Equation (1.2.2) is an example of (1.2.3). In the "mixed" cases such as θ continuous, X discrete, the joint distribution is neither continuous nor discrete.

The most important feature of a Bayesian model is the conditional distribution of θ given X = x, which is called the posterior distribution of θ. Before the experiment is performed, the information or belief about the true value of the parameter is described by the prior distribution. After the value x has been obtained for X, the information about θ is described by the posterior distribution.

For a concrete illustration, let us turn again to Example 1.1.1. Suppose that N = 100 and that from past experience we believe that each item has probability .1 of being defective independently of the other members of the shipment. This would lead to the prior distribution

π_i = C(100, i) (0.1)^i (0.9)^{100-i}   (1.2.4)

for i = 0, 1, ..., 100. Before sampling any items the chance that a given shipment contains
001.<I> 1000 .30.. to calculate the posterior.1) '" I .. (1. 0. (i) The posterior distribution is discrete or continuous according as the prior distri bution is discrete or continuous.8) roo 1r(9)p(x 19) 1r(t)p(x I t)dt if 8 is continuous.52) 0. Example 1. X) given by (1.X..2.6) we argue loosely as follows: If be fore the drawing each item was defective with probability . (ii) If we denote the corresponding (posterior) frequency function or density by 1r(9 I x). Goals.Xn are indicators of n Bernoulli trials with probability of success () where 0 < 8 < 1. Thus.J C3 (1.2.. Here is an example. P[1000 > 20 I X = 10] P[IOOO .X > 10 I X = lOJ P [(1000 .X) . (A 15. 1r(t)p(x I t) if 8 is discrete. this will continue to be the case for the items left in the lot after the 19 sample items have been drawn.181(0.2..2. . 1r(9)9k (1 _ 9)nk 1r(Blx" .181(0...1100(0.1) (1.k dt (1. the number of defectives left after the drawing.1.2.. 1.II ~.6) To calculate the posterior probability given iu (1. Specifically. I 20 or more bad items is by the normal approximation with continuity correction. theu 1r(91 x) 1r(9)p(x I 9) 2.8) as posterior density of 0.9 ] . .2.8. In the cases where 8 and X are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of (0.2.4) can be used..9) I .1100(0.2. 1008 . . Therefore.9) > . . is iudependeut of X and has a B(81.10 '" .30. If we assume that 8 has a priori distribution with deusity 1r. and Performance Criteria Chapter 1 I .9)(0.2.7) In general.1) distribution. .xn )= Jo'1r(t)t k (lt)n.3).9) I . 14 Statistkal Models. we obtain by (1.<1>(0. This leads to P[1000 > 20 I X = lOJ '" 0.10).1)(0.5) 5 ) = 0.1)(0.1 and good with probability .j .. some variant of Bayes' rule (B. . P[IOOO > 201 = 10] P [ . Bernoulli Trials..9)(0.:.1. Now suppose that a sample of 19 has been drawn in which 10 defective items are found. ( 1.1 > . Suppose that Xl.9 independently of the other items.
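Returning to the inspection example, the posterior probability approximated above can also be checked exactly. With the B(100, 0.1) prior, the argument just given says that, after 10 defectives are found in the sample of 19, the number of defectives among the 81 unsampled items is B(81, 0.1). A one-line sketch (scipy assumed available) confirms the value near 0.30:

```python
from scipy.stats import binom

# P[100*theta >= 20 | X = 10] = P[at least 10 defectives among the 81 unsampled items]
post_prob = 1 - binom.cdf(9, 81, 0.1)
print(round(post_prob, 3))   # roughly 0.30, agreeing with the normal approximation
```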
Then we can choose r and s in the (3(r. which corresponds to using the beta distribution with r = . . 0 A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family. 8) distribution given B = () (Problem 1. .8) distribution.(8 I X" . One such class is the twoparameter beta family.. Now we may either have some information about the proportion of geniuses in similar cities of the country or we may merely have prejudices that we are willing to express in the fonn of a prior distribution on B. we need a class of distributions that concentrate on the interval (0.2.2 Bayesian Models 15 for 0 < () < 1.Section 1.ll) in (1.s) distribution. This class of distributions has the remarkable property that the resulting posterior distributions arc again beta distributions. The result might be a density such as the one marked with a solid line in Figure B. for instance. 1).·) is the beta function. r. must (see (B. and the posterior distribution of B givenl:X i =kis{3(k+r. where k ~ l:~ I Xj. .2 indicates. For instance..2.10) The proportionality constant c. To choose a prior 11'. which depends on k. .nk+s). We may want to assume that B has a density with maximum value at o such as that drawn with a dotted line in Figure B. and s only. introduce the notions of prior and posterior distributions and give Bayes rule.k + s) where B(·. To get infonnation we take a sample of n individuals from the city. Or else we may think that 1[(8) concentrates its mass near a small number.2. s) density (B. we might take B to be uniformly distributed on (0.16. say 0.05 and its variance is very small. Another bigger conjugate family is that of finite mixtures of beta distributionssee Problem 1. Specifically.2. 'i = 1.2.2. which has a B( n. x n ). nonVshaped bimodal distributions are not pennitted. n . so that the mean is r/(r + s) = 0.15. We also by example introduce the notion of a conjugate family of distributions. Such families are called conjugaU? Evidently the beta family is conjugate to the binomial.9) we obtain 1 L: L7 (}k+rl (1 _ 8)nk+s1 c (1.2. We return to conjugate families in Section 1.2. Suppose.2.n. As Figure B.2.. L~ 1 Xi' We also obtain the same posterior density if B has prior density 1r and 1 Xi. = 1. We present an elementary discussion of Bayesian models. we are interested in the proportion () of "geniuses" (IQ > 160) in a particular city.11» be B(k + T. (A.2. Xi = 0 or 1. we only observe We can thus write 1[(8 I k) fo". If n is small compared to the size of the city. the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions though by no means all.1).6. k = I Xi· Note that the posterior density depends on the data only through the total number of successes. Summary. upon substituting the (3(r..13) leads us to assume that the number X of geniuses observed has approximately a 8(n.05. If we were interested in some proportion about which we have no information or belief.<.9).
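As a numerical illustration of the conjugacy just described, the sketch below (assuming scipy is available) computes the β(k + r, n − k + s) posterior for Bernoulli trials with a β(r, s) prior; the prior here has mean r/(r + s) = 0.05 as in the discussion above, and the data k, n are arbitrary illustrative values.

```python
from scipy.stats import beta

r, s = 1, 19        # illustrative prior with E(theta) = r / (r + s) = 0.05
n, k = 100, 3       # n trials with k = sum of the X_i successes (illustrative data)

posterior = beta(k + r, n - k + s)          # beta(k + r, n - k + s) posterior
print(posterior.mean(), posterior.interval(0.95))
```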
the expected value of Y given z. i we can try to estimate the function 1'0. If J. if there are k different brands. Given a statistical model. There are many possible choices of estimates. We may wish to produce "best guesses" of the values of important parameters.1. the information we want to draw from data can be put in various forms depending on the purposes of our analysis. in Example 1.lo is the critical matter density in the universe so that J. Making detenninations of "specialness" corresponds to testing significance. as in Example 1. drug dose) T that can be used for prediction of a variable of interest Y. Unfortunately {t(z) is unknown. the fraction defective B in Example 1. one of which will be announced as more consistent with the data than others.1.l < J..3. Intuitively.4. We may have other goals as illustrated by the next two examples. For instance. Thus. if we believe I'(z) = g((3. in Example 1.2.l > J. then depending on one's philosophy one could take either P's corresponding to J. Example 1. (age.3 THE DECISION THEORETIC FRAMEWORK I .1 contractual agreement between shipper and receiver may penalize the return of "good" shipments. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not pennitted). as it's usually called. Po or Pg is special and the general testing problem is really one of discriminating between Po and Po. "hypothesis" or "nonspecialness" (or alternative). for instance.16 Statistical Models.e.1.1. the receiver wants to discriminate and may be able to attach monetary Costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment.1. state which is supported by the data: "specialness" or. placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to pennit the marketing of drugs that do no good. These are estimation problems. Ranking. A very important class of situations arises when. Prediction. 1 < i < n. Zi) and then plug our estimate of (3 into g." In testing problems we.l < Jio or those corresponding to J. and Performance Criteria Chapter 1 1. we have a vector z. with (J < (Jo. For instance. In other situations certain P are "special" and we may primarily wish to know whether the data support "specialness" or not. Thus. i.3.l > JkO correspond to an eternal alternation of Big Bangs and expansions.1 do we use the observed fraction of defectives .lo as special. there are k! possible rankings or actions. z) we can estimate (3 from our observations Y. As the second example suggests.lo means the universe is expanding forever and J. say a 50yearold male patient's response to the level of a drug. shipments." (} > Bo.L in Example 1. if we have observations (Zil Yi). and as we shall see fonnally later. For instance. of g((3. say.3. Note that we really want to estimate the function Ji(')~ our results will guide the selection of doses of drug for future patients. 0 In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function. P's that correspond to no treatment effect (i. Goals. sex. there are many problems of this type in which it's unclear which oftwo disjoint sets of P's.1 or the physical constant J.2. In Example 1.1.1. A consumer organization preparing (say) a report on air conditioners tests samples of several brands. at a first cut. However. say. 
a reasonable prediction rule for an unseen Y (response of a new patient) is the function {t(z). 0 Example 1. such as. whereas the receiver does not wish to keep "bad.
(2.1. in Example 1. on what criteria of performance we use. That is. A = {O. Thus. 3. in estimation we care how far off we are. 2.. 1. A new component is an action space A of actions or decisions or claims that we can contemplate making. A = {( 1. in testing whether we are right or wrong. (3. By convention. Ranking.. 3).Section 1. or combine them in some way? In Example 1.1. ik) of {I. 2).1.6. .3. (2. or p. Here quite naturally A = {Permntations (i I . whatever our choice of procedure we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we're doing. there are 3! = 6 possible rankings. in Example 1. it is natural to take A = R though smaller spaces may serve equally well. Here only two actions are contemplated: accepting or rejecting the "specialness" of P (or in more usual language the hypothesis H : P E Po in which we identify Po with the set of "special" P's). Estimation. taking action 1 would mean deciding that D. and so on. accuracy.1. We usnally take P to be parametrized. Thus. The answer will depend on the model and. 3). . ~ 1 • • • l I} in Example 1.1. and reliability of statistical procedures. in Example 1. 1). 3. Action space. A = {a. Here are action spaces for our examples. once a study is carried out we would probably want not only to estimate ~ but also know how reliable our estimate is. I} with 1 corresponding to rejection of H. is large.1. or the median. (3) provide assessments of risk. # 0. Testing. 2. Intuitively.1. a large a 2 will force a large m l n to give us a good chance of correctly deciding that the treatment effect is there.2.2 lO estimate J1 do we use the mean of the measurements. (4) provide guidance in the choice of procedures for analyzing outcomes of experiments. Thus. . 2).1. In any case. we need a priori estimates of how well even the best procedure can do. (2) point to what the different possible actions are. in Example 1. . On the other hand.1 Components of the Decision Theory Framework As in Section 1. If we are estimating a real parameter such as the fraction () of defectives. (3. I)}. P = {P. : 0 E 8). k}}.. for instance. if we have three air conditioners. we would want a posteriori estimates of perfomlance. In designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter. we begin with a statistical model with an observation vector X whose distribution P ranges over a set P.. (1.3. in ranking what mistakes we've made.3 The Decision Theoretic Framework 17 X/n as our estimate or ignore the data and use hislOrical infonnation on past shipments.1. . 1. 1. X = ~ L~l Xi. most significantly. For instance. These examples motivate the decision theoretic framework: We need to (I) clarify the objectives of a study. defined as any value such that half the Xi are at least as large and half no bigger? The same type of question arises in all examples. even if.3 even with the simplest Gaussian model it is intuitively clear and will be made precise later that.
and Performance Criteria Chapter 1 Prediction. say. 1'() is the parameter of interest.ai. a) 1(0. If we use a(·) as a predictor and the new z has marginal distribution Q then it is natural to consider. l(P. and z = (Treatment. a) = 0. Here A is much larger. asymmetric loss functions can also be of importance. say. . ~ 2. which penalizes only overestimation and by the same amount arises naturally with lower confidence bounds as discussed in Example 1. A = {a . r " I' 2:)a. Estimation. a) = J (I'(z) . Other choices that are. [v(P) .3. the probability distribution producing the data. 1 d} = supremum distance. ' .i = We can also consider function valued parameters. For instance. Closely related to the latter is what we shall call confidence interval loss.a) ~ (q(O) . d'}. is the nonnegative loss incurred by the statistician if he or she takes action a and the true "state of Nature. a is a function from Z to R} with a(z) representing the prediction we would make if the new unobserved Y had covariate value z.18 Statistical Models. examples of loss functions are 1(0.a)2). Evidently Y could itself range over an arbitrary space Y and then R would be replaced by Y in the definition of a(·).a)2 (or I(O. or 1(0. If. If v ~ (V1. Loss function. "does not respond" and "responds.1). a) I d . and z E Z... l( P.a(z))2dQ(z).a)2.qd(lJ)) and a ~ (a""" ad) are vectors. a) if P is parametrized. less computationally convenient but perhaps more realistically penalize large errors less are Absolute Value Loss: l(P..)2 = Vj [ squared Euclidean distance/d absolute distance/d 1. a) = 1 otherwise. as the name suggests. a).V. Although estimation loss functions are typically symmetric in v and a. then a(B. . the expected squared error if a is used. is P. Goals. The interpretation of I(P. Sex)T. a) 1(0.: la. . if Y = 0 or 1 corresponds to.. a) = min {(v( P) . [f Y is real. This loss expresses the notion that all errors within the limits ±d are tolerable and outside these limits equally intolerable. as we shall see (Section 5. M) would be our prediction of response or no response for a male given treatment B. For instance.3. I(P.. Quadratic Loss: I(P.3. As we shall see. sometimes can genuinely be quantified in economic terms. In estimating a real valued parameter v(P) or q(6') if P is parametrized the most commonly used loss function is. Far more important than the choice of action space is the choice of loss function defined as a function I : P X A ~ R+.." that is. .al < d. = max{laj  vjl. and truncated quadraticloss: I(P.2. although loss functions. a) = (v(P) . Q is the empirical distribution of the Zj in . r I I(P. For instance. in the prediction example 1. ." respectively.Vd) = (q1(0). they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient. a) = [v( P) . a) = l(v < a).
a(zn) jT and the vector parameter (I'( z. respectively. For instance.1'(zn))T Testing. we implicitly discussed two estimates or decision rules: 61 (x) = sample mean x and 02(X) = X = sample median. y) = o if Ix::: 111 (J <c (1.. Y). relative to the standard deviation a. . (Zn. Then the appropriate loss function is 1.. that is. a Estimation... Here we mean close to zero relative to the variability in the experiment. is a or not..). and to decide Ll '# a if our estimate is not close to zero..). this leads to the commonly considered I(P. the decision is wrong and the loss is taken to equal one.3 with X and Y distributed as N(I' + ~.. Y n).. ]=1 which is just n. in Example 00 defectives results in a penalty of s dollars whereas every defective item sold results in an r dollar replacement cost.3 we will show how to obtain an estimate (j of a from the data. If we take action a when the parameter is in we have made the correct decision and the loss is zero.). The decision rule can now be written o(x. We define a decision rule or procedure to be any function from the sample space taking its values in A.1 loss function can be written as ed.. I) /(8. Testing.2) I if Ix ~ 111 (J >c .y is close to zero. in the measurement model. if we are asking whether the treatment effect p'!1'ameter 6. We ask whether the parameter () is in the subset 6 0 or subset 8 1 of e. . the statistician takes action o(x).1) Decision procedures. We next give a representation of the process Whereby the statistician uses the data to arrive at a decision.))2. then a reasonable rule is to decide Ll = 0 if our estimate x . . a) = 1 otherwise (The decision is wrong). other economic loss functions may be appropriate. In Example 1. .1.1 times the squared Euclidean distance between the prediction vector (a(z.3. For the problem of estimating the constant IJ.2) and N(I'. This 0 . In Section 4. .a) = I n n.2).. Of course.. L(I'(Zj) _0(Z. 0. (1. Using 0 means that if X = x is observed.3 The Decision Theoretic Framework 19 the training set (Z 1.0) sif8<80 Oif8 > 80 rN8.9. where {So. The data is a point X = x in the outcome or sample space X. is a partition of (or equivalently if P E Po or P E P. .3.1 suppose returning a shipment with °< 1(8.1.I loss: /(8. I) 1(8.Section 1. Otherwise. e ea. a) ~ 0 if 8 E e a (The decision is correct) l(0.
where c is a positive constant called the critical value. How do we choose c? We need the next concept of the decision theoretic framework, the risk function:

The risk function. We do not know the value of the loss l(P, δ(x)) because P is unknown. Moreover, we typically want procedures to have good properties not at just one particular x, but for a range of plausible x's. Thus, we turn to the average or mean loss over the sample space. That is, we regard l(P, δ(X)) as a random variable and introduce the risk function

R(P, δ) = E_P[l(P, δ(X))]

as the measure of the performance of the decision rule δ(x). Thus, R maps P, or Θ if P is parametrized, to R+. Here P (or θ) is the true state of nature, δ is the decision rule, l is the loss function, and X = x is the outcome of the experiment; if δ is the procedure used, R(·, δ) is our a priori measure of the performance of δ. We illustrate computation of R and its a priori use in some examples.

Estimation. Suppose ν = ν(P) is the real parameter we wish to estimate and ν̂(X) is our estimator (our decision rule). If we use quadratic loss, our risk function is called the mean squared error (MSE) of ν̂ and is given by

MSE(ν̂) = R(P, ν̂) = E_P(ν̂(X) − ν(P))²   (1.3.3)

where for simplicity dependence on P is suppressed in MSE. The MSE depends on the variance of ν̂ and on what is called the bias of ν̂,

Bias(ν̂) = E(ν̂) − ν,

which can be thought of as the "long-run average error" of ν̂. A useful result is

Proposition 1.3.1. MSE(ν̂) = (Bias ν̂)² + Var(ν̂).

Proof. Write the error as (ν̂ − ν) = [ν̂ − E(ν̂)] + [E(ν̂) − ν]. If we expand the square of the right-hand side keeping the brackets intact and take the expected value, the cross term will be zero because E[ν̂ − E(ν̂)] = 0. The other two terms are Var(ν̂) and (Bias ν̂)². (If one side is infinite, so is the other and the result is trivially true.) □

We next illustrate the computation and the a priori and a posteriori use of the risk function.

Example 1.3.3. Estimation of µ. (Continued). Suppose X1, ..., Xn are i.i.d. measurements of µ with N(0, σ²) errors, and assume quadratic loss. If we use the mean X̄ = (1/n) Σ Xi as our estimate of µ, then Bias(X̄) = 0 and Var(X̄) = (1/n²) Σ Var(Xi) = σ²/n.
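Proposition 1.3.1 is easy to see numerically. The following Monte Carlo sketch (illustrative values, not from the text) estimates bias, variance, and MSE for the sample mean and the sample median under Gaussian errors; for each estimator the two sides of the identity agree up to simulation error, and for the mean both are close to σ²/n.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, B = 5.0, 2.0, 25, 100_000      # illustrative values; sigma^2/n = 0.16

samples = rng.normal(mu, sigma, size=(B, n))
for name, est in [("mean", samples.mean(axis=1)),
                  ("median", np.median(samples, axis=1))]:
    bias = est.mean() - mu
    var = est.var()
    mse = ((est - mu) ** 2).mean()
    print(name, "bias^2 + var =", bias**2 + var, " MSE =", mse)
```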
the £.1". is possible. that the E:i are i.3. X) = EIX . 1) and R(I".. say. whereas if we have a random sample of measurements X" X 2. itself subject to random error. by Proposition 1. areN(O. .5) This harder calculation already suggests why quadratic loss is really favored.a 2 then by (A 13.4) which doesn't depend on J.2.X) = ".t. Suppose that the precision of the measuring instrument cr2 is known and equal to crJ or where realistically it is known to be < crJ. analytic.6). ( N(o. then for quadratic loss. write for a median of {aI.1"1 ~ Eill 2 ) where £.fii V.X) =. computational difficulties arise even with quadratic loss as soon as we think of estimates other than X. If.S. for instance by 8 2 = ~ L~ l(X i .4) can be used for an a priori estimate of the risk of X. If we have no data for area A.23). 0 ~ = a We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1. for instance. in general. Let flo denote the mean of a certain measurement included in the U. For instance. an estimate we can justify later. a natural guess for fl would be flo. n (1. Then R(I".3.3. If we have no idea of the value of 0. Suppose that instead of quadratic loss we used the more natural(l) absolute value loss.X)2.fii/a)[ ~ a .1 is useful for evaluating the performance of competing estimators. = X. age or income. X) = a' (P) / n still. or numerical and/or Monte Carlo computation.a 2 .. but for absolute value loss only approximate.. .2 and 0. with mean 0 and variance a 2 (P).6 through a .3.. . or approximated asymptotically.8 can only be made on the basis of additional knowledge about demography or the economy.3. .2)1"0 + (O.1.2 . . Then (1. We shall derive them in Section 1. we may wantto combine tLo and X = n 1 L~ 1 Xi into an estimator. Example 1. of course. E(i .d. as we assumed.8)X.i.4.1 2 MSE(X) = R(I". as we discussed in Example 1. planning is not possible but having taken n measurements we can then estimate 0. census.3 The Decision Theoretic Framework 21 and. The a posteriori estimate of risk (j2/n is. a'.. Ii ~ (0.. If we want to be guaranteed MSE(X) < £2 we can do it by taking at least no = <:TolE measurements. Next suppose we are interested in the mean fl of the same measurement for a certain area of the United States. . R( P. or na 21(n . an}).1). (. 1 X n from area A. The choice of the weights 0.2.Section 1. If we only assume.3.Xn ) (and we. In fact.. 00 (1. if X rnedian(X 1 .fii 1 af2 00 Itl<p(t)dt = .1")' ~ E(~) can only be evaluated numerically (see Problem 1.3.
Figure 1. if we use as our criteria the maximum (over p. Y) = which in the case of 0 .. If I' is close to 1'0. respectively.3. We easily find Bias(ii) Yar(ii) 0. iiJ + 0. Y) = IJ. and Performance Criteria Chapter 1 i formal Bayesian analysis using a nonnal prior to illustrate a way of bringing in additional knowledge.4). = 0. R(i'>. the risk is = 0 and i'> i 0 can only take on the R(i'>. then X is optimal (Example 3.8)'Yar(X) = (.n.2(1'0 . Goals.6) = 1(i'>.0)P[6(X.3. Here we compare the performances of Ii and X as estimators of Jl using MSE.81' .1 gives the graphs of MSE(ii) and MSE(X) as functions of 1'. The mean squared errdrs of X and ji. X) = 0" In of X with the minimum relative risk inf{MSE(ii)IMSE(X). the risk R(I'. Because we do not know the value of fL. The two MSE curves cross at I' ~ 1'0 ± 30'1 . I. In the general case X and denote the outcome and parameter space.1')' + (.1') (0. 22 Statistical Models. I)P[6(X.) ofthe MSE (called the minimax criteria). I' E R} being 0.3. D Testing. Y) = 01 if i'> i O.6) P[6(X. thus.2) for deciding between i'> two values 0 and 1. Y) = I] if i'> = 0 P[6(X. However. neither estimator can be proclaimed as being better than the other. Figure 1.64)0" In MSE(ii) = . A test " " e e e . using MSE.64 when I' = 1'0.I.1.21'0 R(I'. and we are to decide whether E 80 or E 8 where 8 = 8 0 U 8 8 0 n 8 . The test rule (1.I' = 0.04(1'0 . MSE(ii) MSE(X) i .64)0" In.3. Ii) of ii is smaller than the risk R(I'.1 loss is 01 + lei'>.
say .1) and tests 15k of the fonn.Section 1.[X < k].[X < k[.3. This is not the only approach to testing." in Example 1. If (say) X represents the amount owed in the sample and v is the unknown total amount owed. the focus is on first providing a small bound. whereas if <5(X) = 0 and we decide () E 80 when in fact 8 E 8 b we call the error a Type II error.05 or . Thus. 6(X) = IIX E C]. usually . I(P. we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding ~ =1= 0 when ~ = 0). and then trying to minimize the probability of a Type II error.3. we call the error committed a Type I error.01 or less. This corresponds to an a priori bound on the risk of a on v(X) viewed as a decision procedure with action space R and loss function.6) E(6(X)) ~ P(6(X) ~ I) if8 E 8 0 (1. the loss function (1. "Reject the shipment if and only if X > k.1. For instance.8) for all possible distrihutions P of X. For instance. an accounting finn examining accounts receivable for a finn on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed.7) Confidence Bounds and Intervals Decision theory enables us to think clearly about an important hybrid of testing and estimation.3. and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding Do = 0 when Do ¥ 0).3 The Decision Theoretic Framework 23 function is a decision rule <5(X) that equals 1 On a set C C X called the critical region and equals 0 on the complement of C. on the probability of Type I error. a > v(P) I. For instance. If <5(X) = 1 and we decide 8 E 8 1 when in fact E 80.05.6) Probability of Type [error R(8. [n the NeymanPearson framework of statistical hypothesis testing. (1. where 1 denotes the indicator function. Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter v.a) upper coofidence houod on v.IX > k] + rN8P. a < v(P) .18). the risk of <5(X) is e R(8.1 lead to (Prohlem 1.a (1.3. Finding good test functions corresponds to finding critical regions with smaIl probabilities of error.8) sP.3. R(8. 8 > 80 . Here a is small.6) P(6(X) = 0) if 8 E 8 1 Probability of Type II error. it is natural to seek v(X) such that P[v(X) > vi > 1. confidence bounds and intervals (and more generally regions).o) 0. that is. 8 < 80 rN8P. Such a v is called a (1 . in the treatments A and B example.
and the risk of all possible decision procedures can be computed and plotted. A has three points. Ii) = E(Ii(X) v(P))+.3. We shall illustrate some of the relationships between these ideas using the following simple example in which has two members. as we shall see in Chapter 4. we could drill for oil. in fact it is important to get close to the truthknowing that at most 00 doJlars are owed is of no use. c ~1 .2 . replace it. The same issue arises when we are interested in a confidence interval ~(X) l v(X) 1for v defined by the requirement that • P[v(X) < v(P) < Ii(X)] > 1 . I I (Oil) (No oil) Bj B2 0 12 10 I 5 6 \ . general criteria for selecting "optimal" procedures. a2. and Performance Criteria Chapter 1 1 . The 0. a 2 certain location either contains oil or does not.. a) (Drill) aj (Sell) a2 (Partial rights) a3 . For instance. though upper bounding is the primary goal. sell the location. it is customary to first fix a in (1. It is clear. For instance = I(P.24 Statistical Models. an asymmetric estimation type loss function. Typically. and so on. The decision theoretic framework accommodates by adding a component reflecting this. we could leave the component in.1.a<v(P). In the context of the foregoing examples.3. Suppose that three possible actions. though. which we represent by Bl and B . . Suppose the following loss function is decided on TABLE 1. thai this fonnulation is inadequate because by taking D 00 we can achieve risk O.1 nature makes it resemble a testing loss function and. or wait and see. Goals. and a3.3. We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model. or sell partial rights. administer drugs. for some constant c > O. where x+ = xl(x > 0).8) and then see what one can do to control (say) R( P. We next tum to the final topic of this section. we could operate. The loss function I(B. or repair it.5. What is missing is the fact that.: Comparison of Decision Procedures In this section we introduce a variety of concepts used in the comparison of decision procedures.a ·1 for all PEP. Suppose we have two possible states of nature. a component in a piece of equipment either works or does not work. are available. We shall go into this further in Chapter 4.a)  a~v(P) . rather than this Lagrangian form.a>v(P) . 1.3. e Example 1. aJ. . a patient either has a certain disease or does not. the connection is close.
Thus, if there is oil and we drill, the loss is zero, whereas if there is no oil and we drill, the loss is 12, and so on. Next, an experiment is conducted to obtain information about θ resulting in the random variable X with possible values coded as 0, 1, and frequency function p(x, θ) given by the following table.

TABLE 1.3.2. The frequency function p(x, θ)

                     Rock formation x
                     0       1
θ1 (Oil)             0.3     0.7
θ2 (No oil)          0.6     0.4

Here X may represent a certain geological formation; when there is oil, it is known that formation 0 occurs with frequency 0.3 and formation 1 with frequency 0.7, whereas if there is no oil, formations 0 and 1 occur with frequencies 0.6 and 0.4. We list all possible decision rules in the following table.

TABLE 1.3.3. Possible decision rules δi(x)

   i       1    2    3    4    5    6    7    8    9
 x = 0     a1   a1   a1   a2   a2   a2   a3   a3   a3
 x = 1     a1   a2   a3   a1   a2   a3   a1   a2   a3

Here δ1 represents "Take action a1 regardless of the value of X," δ2 corresponds to "Take action a1 if X = 0, take action a2 if X = 1," and so on. The risk of δ at θ is

R(θ, δ) = E[l(θ, δ(X))] = l(θ, a1)P[δ(X) = a1] + l(θ, a2)P[δ(X) = a2] + l(θ, a3)P[δ(X) = a3].

For instance,

R(θ1, δ2) = 0(0.3) + 10(0.7) = 7
R(θ2, δ2) = 12(0.6) + 1(0.4) = 7.6.

If Θ is finite and has k members, we can represent the whole risk function of a procedure δ by a point in k-dimensional Euclidean space, (R(θ1, δ), ..., R(θk, δ)), and if k = 2 we can plot the set of all such points obtained by varying δ. The risk points (R(θ1, δi), R(θ2, δi)) are given in Table 1.3.4 and graphed in Figure 1.3.2.

TABLE 1.3.4. Risk points (R(θ1, δi), R(θ2, δi))

   i           1     2     3     4     5    6     7     8     9
 R(θ1, δi)     0     7     3.5   3     10   6.5   1.5   8.5   5
 R(θ2, δi)     12    7.6   9.6   5.4   1    3     8.4   4     6

It remains to pick out the rules that are "good" or "best." Criteria for doing this will be introduced in the next subsection. □
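The risk points above can be reproduced mechanically. The following sketch (written for this illustration, not part of the text) enumerates the nine rules and computes R(θ, δi) directly from the loss table and the frequency function p(x, θ):

```python
from itertools import product

loss = {("oil", "drill"): 0,   ("oil", "sell"): 10,   ("oil", "partial"): 5,
        ("no_oil", "drill"): 12, ("no_oil", "sell"): 1, ("no_oil", "partial"): 6}
p_x = {"oil": {0: 0.3, 1: 0.7}, "no_oil": {0: 0.6, 1: 0.4}}
actions = ["drill", "sell", "partial"]   # a1, a2, a3

# A rule assigns an action to each of x = 0 and x = 1; there are 3 x 3 = 9 rules.
for i, (a0, a1) in enumerate(product(actions, repeat=2), start=1):
    risks = {th: p_x[th][0] * loss[(th, a0)] + p_x[th][1] * loss[(th, a1)]
             for th in ("oil", "no_oil")}
    print(f"delta_{i}: ({risks['oil']}, {risks['no_oil']})")
```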
Figure 1.3.2. The risk points (R(θ1, δi), R(θ2, δi)), i = 1, ..., 9.

1.3.3 Bayes and Minimax Criteria

The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing. We say that a procedure δ improves a procedure δ′ if, and only if, R(θ, δ) ≤ R(θ, δ′) for all θ with strict inequality for some θ. It is easy to see that there is typically no rule δ that improves all others. For instance, in estimating θ ∈ R when X ~ N(θ, σ²), if we ignore the data and use the estimate θ̂ = 0, we obtain MSE(θ̂) = θ². The absurd rule "δ*(X) = 0" cannot be improved on at the value θ = 0 because E_0(δ(X))² = 0 if and only if δ(X) = 0. Similarly, consider δ4 and δ6 in our example: R(θ1, δ4) < R(θ1, δ6) but R(θ2, δ4) > R(θ2, δ6), so neither improves the other. The problem of selecting good decision procedures has been attacked in a variety of ways.

(1) Narrow classes of procedures have been proposed using criteria such as considerations of symmetry, unbiasedness (for estimates and tests), or level of significance (for tests). Researchers have then sought procedures that improve all others within the class. Symmetry (or invariance) restrictions are discussed in Ferguson (1967). Extensions of unbiasedness ideas may be found in Lehmann (1997, Section 1.5). We shall pursue this approach further in Chapter 3.

(2) A second major approach has been to compare risk functions by global criteria rather than on a pointwise basis.
9).4 8 4. therefore.6 4 4. we need not stop at this point. if () is discrete with frequency function 'Tr(e).7 6..) maxi R(0" Oil. r(o. = 0. r(o") = minr(o) reo) = EoR(O.o(x))].3 The Decision Theoretic Framework 27 ria rather than on a pointwise basis. TABLE 1.02 8.0) Table 1.2R(0. Then we treat the parameter as a random variable () with possible values ()1. Recall that in the Bayesian model () is the realization of a random variable or vector () and that Pe is the conditional distribution of X given 0 ~ O. If there is a rule <5*.2. . This quantity which we shall call the Bayes risk of <5 and denote r( <5) is then. and reo) = J R(O.3.92 5.5.20) in Appendix B. To illustrate.8. . O(X)) I () = ()]. it has smaller Bayes risk.9 8. such that • see that rule 05 is the unique Bayes rule then it is called a Bayes rule. 0)11(0). + 0.Section 1.3. and only if. the only reasonable computational method. 0) isjust E[I(O.I. .5 7 7. If we adopt the Bayesian point of view.6 12 2 7. We shall discuss the Bayes and minimax criteria.0). the expected loss.. r( 09) specified by (1.o)] = E[I(O. R(0" Oi)) I 9. (1.3.o)1I(O)dO. ()2 and frequency function 1I(OIl The Bayes risk of <5 is.8 6 In the Bayesian framework 0 is preferable to <5' if.38 9.5 gives r( 0. which attains the minimum Bayes risk l that is.10) r( 0) ~ 0. Note that the Bayes approach leads us to compare procedures on the basis of. suppose that in the oil drilling example an expert thinks the chance of finding oil is . . Bayes: The Bayesian point of view leads to a natural global criterion.6 3 8. In this framework R(O.3.2.8R(0. 11(0. . if we use <5 and () = ().).5 9 5. to Section 3. From Table 1.) ~ 0.8 10 6 3.9) The second preceding identity is a consequence of the double expectation theorem (B. but can proceed to calculate what we expect to lose on the average as () varies.2.3. (1. The method of computing Bayes procedures by listing all available <5 and their Bayes risk is impracticable in general.3. given by reo) = E[R(O. Bayes and maximum risks oftbe procedures of Table 1.3.3.4 5 2. We postpone the consideration of posterior analysis.5 we for our prior.48 7..
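These Bayes and maximum risks are easy to recompute. The sketch below (illustrative code, not from the text, using the risk points of Table 1.3.4) evaluates r(δ) = 0.2 R(θ1, δ) + 0.8 R(θ2, δ) and max over θ of R(θ, δ) for each rule; anticipating the randomized rules discussed below, it also finds the mixing weight λ on δ4 and δ6 that equalizes the two risks.

```python
# Risk points (R(theta1, delta_i), R(theta2, delta_i)) from Table 1.3.4
risk = {1: (0, 12), 2: (7, 7.6), 3: (3.5, 9.6), 4: (3, 5.4), 5: (10, 1),
        6: (6.5, 3), 7: (1.5, 8.4), 8: (8.5, 4), 9: (5, 6)}
prior = (0.2, 0.8)   # pi(theta1) = 0.2, pi(theta2) = 0.8

bayes = {i: prior[0] * r1 + prior[1] * r2 for i, (r1, r2) in risk.items()}
maxes = {i: max(r) for i, r in risk.items()}
print("Bayes rule:", min(bayes, key=bayes.get), bayes)     # delta_5, Bayes risk 2.8
print("Minimax rule:", min(maxes, key=maxes.get), maxes)   # delta_4, maximum risk 5.4

# A fair-coin mixture of delta_4 and delta_6 has risks ((3 + 6.5)/2, (5.4 + 3)/2) = (4.75, 4.2);
# the lambda that makes the two risks equal does slightly better still.
r4, r6 = risk[4], risk[6]
lam = (r6[1] - r6[0]) / ((r4[0] - r6[0]) - (r4[1] - r6[1]))
print("lambda ~", round(lam, 2), " equalized risk ~",
      round(lam * r4[0] + (1 - lam) * r6[0], 2))            # about 0.59 and 4.42
```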
Nature's intentions and degree of foreknowledge are not that clear and most statisticiaqs find the minimax principle too conservative to employ as a general rule. J) A procedure 0*. and Performance Criteria Chapter 1 if (J is continuous with density rr(8).28 Statistical Models. suppose that. From the listing ofmax(R(O" J). in Example 1. if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used. the sel of all decision procedures. who picks a decision procedure point 8 E J from V. 6).3. 8)]. 2 The maximum risk 4. supR(O.20 if 0 = O .3. e 4. J). = infsupR(O. The maximum risk of 0* is the upper pure value of the game. Player II then pays Player I. . which has .J') . a randomized decision procedure can be thought of as a random experiment whose outcomes are members of V. . J'). we prefer 0" to 0"'. I Randomized decision rules: In general.J). The criterion comes from the general theory of twoperson zero sum games of von Neumann. This criterion of optimality is very conservative.75 if 0 = 0. The principle would be compelling. Nature's choosing a 8. Our expected risk would be. Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk. For instance.4. But this is just Bayes comparison where 1f places equal probability on f}r and ()2. Of course.5. It is then natural to compare procedures using the simple average ~ [R( fh 10) + R( fh. To illustrate computation of the minimax rule we tum to Table 1.4. 4. For instance. < sup R(O.3. sup R(O. but only ac. we toss a fair coin and use 04 if the coin lands heads and 06 otherwise.5 we might feel that both values ofthe risk were equally important. which makes the risk as large as possible. if V is the class of all decision procedures (nonrandomized).(2) We briefly indicate "the game of decision theory. . Such comparisons make sense even if we do not interpret 1r as a prior density or frequency. Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to emplOy. R(02. is called minimax (minimizes the maximum risk). This is. Nevertheless l in many cases the principle can lead to very reasonable procedures.75 is strictly less than that of 04. if and only if. in Example 1. R(O. It aims to give maximum protection against the worst that can happen." Nature (Player I) picks a independently of the statistician (Player II). J)) we see that J 4 is minimax with a maximum risk of 5. Goals. a weight function for averaging the values of the function R( B. For simplicity we shall discuss only randomized .
2.Ji)' r2 = tAiR(02. i = 1 . . J).).i)r2 = c... (2) The tangent is the line connecting two unonrandomized" risk points Ji . This is thai line with slope .J) ~ L. (1.11) Similarly we can define.). minimizes r(d) anwng all randomized procedures.(0 d = i = 1 .3.12) r(J) = LAiEIR(IJ.10). i = 1. That is.J. If the randomized procedure l5 selects l5i with probability Ai.(02 ).1). given a prior 7r on q e. l5q of nOn randomized procedures. R(O .3. 1 q.3.A)R(02.3..3.A)R(O"Jj ). For instance.J): J E V'} where V* is the set of all procedures. then all rules having Bayes risk c correspond to points in S that lie on the line irl + (1 . All points of S that are on the tangent are Bayes.3 The Decision Theoretic Framework 29 procedures that select among a finite set (h 1 • • • . AR(02.. By (1. TWo cases arise: (I) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule. 9 (Figure 2 1..R(O. We will then indicate how much of what we learn carries over to the general case.13) defines a family of parallel lines with slope i/(1 .. this point is (10. which is the risk point of the Bayes rule Js (see Figure 1. (1.14) .5. Ai >0.R(02. A point (Tl. i=l (1.13) intersects S.Ji )] i=l A randomized Bayes procedure d. S is the convex hull of the risk points (R(0" Ji ). As in Example 1. when ~ = 0.3.3. . we then define q R(O. Jj . ~Ai=1}. If . .3). Finding the Bayes rule corresponds to finding the smallest c for which the line (1. A randomized minimax procedure minimizes maxa R(8.d) anwng all randomized procedures. = ~A. the Bayes risk of l5 (1. including randomiZed ones. We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1.3).3.r). T2) on this line can be written AR(O"Ji ) + (1. . we represent the risk of any procedure J by the vector (R( 01 .13) As c varies. J)) and consider the risk sel 2 S = {(RrO"J)..3. Jj ). Ji )).3.r2):r. (1.3.5.Section 1.i/ (1 . 0 < i < 1. Ji ) + (1 .'Y) that is tangent to S al the lower boundary of S.J..R(OI.\. S= {(rI. R(O . Ei==l Al = 1.
3. . by (1. oil.3. R(B .11) corresponds to the values Oi with probability>. (1.15) Each one of these rules.>').. The convex hull S of the risk poiots (R(B!. (1.3. i = 1. and Performance Criteria Chapter 1 r2 I 10 5 Q(c') o o 5 10 Figure 1. Because changing the prior 1r corresponds to changing the slope .e. = r2. (\)). To locate the risk point of the nilhimax rule consider the family of squares.30 Statistical Models. In our example. the first square that touches S).3. thus. We can choose two nonrandomized Bayes rules from this class. the set B of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of S (Le. See Figure 1. . where 0 < >. 0 < >. 2 The point where the square Q( c') defined by (1.3. ranges from 0 to 1.3. as >.9.13).16) touches S is the risk point of the minimax rule.3.)') of the line given by (1. It is the set of risk points of minimax rules because any point with smaller maximum risk would belong to Q(c) n S with c < c* contradicting the choice of c*.• all points on the lowet boundary of S that have as tangents the y axis or lines with nonpositive slopes). OJ with probability (1 .3.. namely Oi (take'\ = 1) and OJ (take>./(1 . Then Q( c*) n S is either a point or a horizontal or vertical line segment. and. = 0).3. < 1. is Bayes against 7I". Goals. < 1. Let c' be the srr!~llest c for which Q(c) n S i 0 (i.16) whose diagonal is the line r. the first point of contact between the squares and S is the ..
(a) For any prior there is always a nonrandomized Bayes procedure.\ the solution of From Table 1. there is no (x. To gain some insight into the class of all admissible procedures (randomized and nOnrandomized) we again use the risk set. under some conditions.) and R( 8" 04) = 5. A rule <5 with risk point (TIl T2) is admissible..3. If e is finite. Naturally. the minimax rule is given by (1.\). if there is a randomized one. (b) The set B of risk points of Bayes procedures consists of risk points on the lower boundary of S whose tangent hyperplanes have normals pointing into the positive quadrant.e. The following features exhibited by the risk set by Example 1. all rules that are not inadmissible are called admissible.3.6 ~ R( 8" J.59. if and only if.4 we can see. agrees with the set of risk points of Bayes procedures. . Thus. (c) If e is finite(4) and minimax procedures exist. . (3) Randomized Bayes procedures are mixtures of nonrandomized ones in the Sense of (1. Using Table 1. 1 . l Ok}. (e) If a Bayes prior has 1f(Oi) to 1f is admissible. thus.14). From the figure it is clear that such points must be on the lower left boundary. 0.\ + 6.4'\ + 3(1 . j = 6 and . which yields . all admissible procedures are either Bayes procedures or limits of . for instance.3 The Decision Theoretic Framework 31 intersection between TI = T2 and the line connecting the two points corresponding to <54 and <56 .. A decision rule <5 is said to be inadmissible if there exists another rule <5' such that <5' improves <5. (d) All admissible procedures are Bayes procedures.3. for instance).3.5(1 . > a for all i. we can define the risk set in general as s ~ {( R(8 .4.\ ~ 0. e = {fh. {(x. 1967. this equation becomes 3. . or equivalently. if and only if.V) : x < TI. the set of all lower left boundary points of S corresponds to the class of admissible rules and..\) = 5.. then any Bayes procedure corresponding If e is not finite there are typically admissible procedures that are not Bayes.)).4 < 7.14) with i = 4. R(8" 0)) : 0 E V'} where V" is the set of all randomized decision procedures. However.0).5 can be shown to hold generally (see Ferguson. that <5 2 is inadmissible because 64 improves it (i. R( 81 . Y < T2} has only (TI 1 T2) in common with S. they are Bayes procedures.. In fact. There is another important concept that we want to discuss in the context of the risk set.Section 1.3. y) in S such that x < Tl and y < T2. 04) = 3 < 7 = R( 81 .
testing. Here are some further examples of the kind of situation that prompts our study in this section.Yj'. These remarkable results. Z is the information that we have and Y the quantity to be predicted. ]967. . Section 3. One reasonable measure of "distance" is (g(Z) . are due essentially to Waldo They are useful because the property of being Bayes is ea·der to analyze than admissibility. decision rule. We introduce the decision theoretic foundation of statistics inclUding the notions of action space.4 PREDICTION . Other theorems are available characterizing larger but more manageable classes of procedures. The frame we shaH fit them into is the following. which include the admissible rules. We want to find a function 9 defined on the range of Z such that g(Z) (the predictor) is "close" to Y. and prediction. For more information on these topics. The basic global comparison criteria Bayes and minimax are presented as well as a discussion of optimality by restriction and notions of admissibility. For example. and risk through various examples including estimation.g(Z») = E[g(Z) ~ YI 2 or its square root yE(g(Z) .. Using this information. We assume that we know the joint probability distribution of a random vector (or variable) Z and a random variable Y. and Performance Criteria Chapter 1 '. he wants to predict the firstyear grade point averages of entering freshmen on the basis of their College Board scores. we tum to the mean squared prediction error (MSPE) t>2(Y. I II .32 . A government expert wants to predict the amount of heating oil needed next winter. The joint distribution of Z and Y can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal. ranking. A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio. We stress that looking at randomized procedures is essential for these conclusions. Bayes procedures (in various senses).I 'jlI . Summary. which is the squared prediction error when g(Z) is used to predict Y. I The prediction Example 1. A meteorologist wants to estimate the amount of rainfall in the coming spring. An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson. loss function. The basic biasvariance decomposition of mean square error is presented.y)2. at least in their original fonn. In terms of our preceding discussion. we referto Blackwell and Girshick (1954) and Ferguson (1967). • 1. A college admissions officer has available the College Board scores at entrance and firstyear grade point averages of freshman classes for a period of several years. confidence bounds. The MSPE is the measure traditionally used in the . Next we must specify what close means. although it usually turns out that all admissible procedures of interest are indeed nonrandomized. at least when procedures with the same risk function are identified. Since Y is not known. Z would be the College Board score of an entering freshman and Y his or her firstyear grade point average.2 presented important situations in which a vector z of 00variates can be used to predict an unseen response Y.4). Similar problems abound in every field. in the college admissions situation. Goals. Statistical Models.3.
E(Y .3) If we now take expectations of both sides and employ the double expectation theorem (B. see Problem 1.Section 1. ljZ is any random vector and Y any random variable. In this section we bjZj .4.4 Prediction 33 mathematical theory of prediction whose deeper results (see. Let (1. we can conclude that Theorem 1.4. given a vector Z. (1. we have E[(Y . for example.g(Z))' I Z = z] = E[(Y .g(Z))' (1.c)2 as a function of c.4. then either E(Y g(Z))' = 00 for every function g or E(Y . see Example 1. in which Z is a constant.g(z))' IZ = z] = E[(Y I'(z))' I Z = z] + [g(z) I'(z)]'.4.711).YI) (Problems 1.) = 0 makes the cross product term vanish. when EY' < 00. 0 Now we can solve the problem of finding the best MSPE predictor of Y. we can find the 9 that minimizes E(Y .1) follows because E(Y p.C)2 has a unique minimum at c = J.L = E(Y).5 and Section 3. The class Q of possible predictors 9 may be the nonparametric class QN P of all 9 : R d Jo R or it may be to some subset of this class.4.g(z))' I Z = z].4.25. or equivalently. See Remark 1.(Y. We see that E(Y . EY' < 00 Y . (1.L and the lemma follows.c)2 is either oofor all c or is minimized uniquely by c In fact.l.1 assures us that E[(Y .4.6. Just how widely applicable the notions of this section are will become apparent in Remark 1.4.2) I'(z) = E(Y I Z = z).3. and by expanding (1.4. Lemma 1.4. consider QNP and the class QL of linear predictors of the form a + We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information.c)' < 00 for all c.4. Proof. In this situation all predictors are constant and the best one is that number Co that minimizes B(Y . 1.20).1. exists.c) implies that p. L1=1 Lemma 1. Because g(z) is a constant. EY' = J.1. E(Y . By the substitution theorem for conditional expectations (B.4.g(Z))2.I'(Z))' < E(Y .2 where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss.4. The method that we employ to prove our elementary theorems does generalize to other measures of distance than 6. g(Z)) such as the mean absolute error E(lg( Z) . that is. 1957) presuppose it. Grenander and Rosenblatt.16).1) < 00 if and only if E(Y .c = (Y 1') + (I' .4) .c)' ~ Var Y + (c _ 1')'.
4. E{h(Z)<1 E{ E[h(Z)< I Z]} E{h(Z)E[Y I'(Z) I Z]} = 0 because E[Y I'(Z) I Z] = I'(Z) I'(Z) = O.1.4.4. then we can write Proposition 1.E(V)J Let € = Y .E(U)] = O. = I'(Z).4. To show (a).8) or equivalently unless Y is a function oJZ. when E(Y') < 00. then by the iterated expectation theorem.4.4. and Performance Criteria Chapter 1 for every 9 with strict inequality holding unless g(Z) best MSPE predictor.5) An important special case of (1. = O. (1. Property (1.4. Equivalently U and V are uncorrelated if either EV[U E(U)] = 0 or EUIV . Vat Y = E(Var(Y I Z» + Var(E(Y I Z).E(Y I z)]' I z). Write Var(Y I z) for the variance of the condition distribution of Y given Z = z.4. I'(Z) is the unique E(Y . then (1. which will prove of importance in estimation theory. Suppose that Var Y < 00.g(Z)' = E(Y .Il(Z) denote the random prediction error.4. 0 Note that Proposition 1. . That is. that is.I'(Z))' + E(g(Z) 1'(Z»2 ( 1.34 Statistical Models.5) follows from (a) because (a) implies that the cross product term in the expansionof E ([Y 1'(z)J + [I'(z) g(z)IP vanishes. we can derive the following theorem. As a consequence of (1.2.6) is linked to a notion that we now define: Two random variables U and V with EIUVI < 00 are said to be uncorrelated if EIV . and recall (B.4.E(v)IIU . (1. If E(IYI) < 00 but Z and Y are otherwise arbitrary. 1. Proof. Theorem 1.6) and that (1.4.6) which is generally valid because if one side is infinite. Var(Y I z) = E([Y . then (a) f is uncorrelated with every function ofZ (b) I'(Z) and < are uncorrelated (c) Var(Y) = Var I'(Z) + Var <. Properties (b) and (c) follow from (a).6). Goals. let h(Z) be any function of Z.1(c) is equivalent to (1. so is the other.4. then Var(E(Y I Z)) < Var Y.7) IJVar Y • < 00 strict inequality ooids unless 1 Y = E(Y I Z) (1.5) becomes.5) is obtained by taking g(z) ~ E(Y) for all z.20). Infact.4.
Within any given month the capacity status does not change.E(Y I Z = z))2 p (z.05 0. Equality in (1.025 0.10 I 0.8) is true. half.50 z\y 1 0 0. An assembly line operates either at full. The assertion (1. 2. 2. y) = 0. also. In this case if Yi represents the number of failures on day i and Z the state of the assembly line. p(z. I z) = E(Y I Z). But this predictor is also the right one.20.E(Y I Z))' ~ 0 By (A.25 0. or 3 shutdowns due to mechanical failure.4.4. The first is direct. We want to predict the number of failures for a given day knowing the state of the assembly line for the month.05 0. the average number of failures per day in a given month. i=l 3 E (Y I Z = ~) ~ 2.45. Each day there can be 0. if we are trying to guess.y) pz(z) 0.4.15 3 0.10. y) = p (Z = Z 1 Y = y) of the number of shutdowns Y and the capacity state Z of the line for a randomly chosen day.10 0.lI.Section 1.E(Y I Z))' =L x 3 ~)y y=o .9) this can hold if.10 0. These fractional figures are not too meaningful as predictors of the natural number values of Y. E(Var(Y I Z)) ~ E(Y . We find E(Y I Z = 1) = L iF[Y = i I Z = 1] ~ 2. or 3 failures among all days.1 2.7) follows immediately from (1.45 ~ 1 The MSPE of the best predictor can be calculated in two ways. . E(Y .25 0.30 py(y) I 0. and only if.6).25 2 0.7) can hold if. 1. The following table gives the frequency function p( z. The row sums of the entries pz (z) (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state. and only if. (1. or quarter capacity.10 0.4.'" 1 Y.4 Prediction 35 Proof.05 0.885. the best predictor is E (30. I. as we reasonably might. E (Y I Z = ±) = 1.25 1 0. whereas the column sums py (y) yield the frequency of 0. o Example 1.1.15 I 0.:.30 I 0.025 0.4.
which, summed over z and y, gives 0.885.

The second way is to use (1.4.6), writing

E(Y - E(Y | Z))² = Var Y - Var(E(Y | Z)) = E(Y²) - E[(E(Y | Z))²] = Σ_y y² p_Y(y) - Σ_z [E(Y | Z = z)]² p_Z(z) = 0.885

as before. □

Example 1.4.2. The Bivariate Normal Distribution. If (Z, Y) has a N(μ_Z, μ_Y, σ_Z², σ_Y², ρ) distribution, Theorem B.4.2 tells us that the conditional distribution of Y given Z = z is N(μ_Y + ρ(σ_Y/σ_Z)(z - μ_Z), σ_Y²(1 - ρ²)). Therefore, the best predictor of Y on the basis of Z is the linear function

μ(Z) = μ_Y + ρ(σ_Y/σ_Z)(Z - μ_Z)   (1.4.9)

and its MSPE is

E((Y - E(Y | Z = z))² | Z = z) = σ_Y²(1 - ρ²),

which is independent of z. Therefore,

E(Y - E(Y | Z))² = σ_Y²(1 - ρ²).   (1.4.10)

The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution. If ρ > 0, the predictor is a monotone increasing function of Z, indicating that large (small) values of Y tend to be associated with large (small) values of Z. Similarly, ρ < 0 indicates that large values of Z tend to go with small values of Y, and we have negative dependence. If ρ = 0, the best predictor is just the constant μ_Y, as we would expect in the case of independence.

One minus the ratio of the MSPE of the best predictor of Y given Z to Var Y, which is the MSPE of the best constant predictor, can reasonably be thought of as a measure of dependence: the larger this quantity, the more dependent Z and Y are. Because of (1.4.6) we can also write

Var μ(Z)/Var Y = 1 - E(Y - E(Y | Z))²/Var Y,   (1.4.11)

and in the bivariate normal case this quantity is just ρ². Thus, for this family of distributions the sign of the correlation coefficient gives the type of dependence between Z and Y, whereas its magnitude measures the degree of such dependence.

The line y = μ_Y + ρ(σ_Y/σ_Z)(z - μ_Z), which corresponds to the best predictor of Y given Z in the bivariate normal model, is usually called the regression (line) of Y on Z. □

Regression toward the mean. The term regression was coined by Francis Galton and is based on the following observation. Suppose Y and Z are bivariate normal random variables with the same mean μ, variance σ², and positive correlation ρ.
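A quick Monte Carlo check of (1.4.9)-(1.4.11) may be helpful; the sketch below uses arbitrary illustrative parameter values. The conditional-mean predictor attains MSPE σ_Y²(1 - ρ²), and the ratio Var μ(Z)/Var Y is approximately ρ².

```python
import numpy as np

rng = np.random.default_rng(0)
mu_z, mu_y, sd_z, sd_y, rho = 1.0, 2.0, 1.5, 2.0, 0.6        # arbitrary illustrative values
cov = [[sd_z**2, rho * sd_z * sd_y], [rho * sd_z * sd_y, sd_y**2]]
z, y = rng.multivariate_normal([mu_z, mu_y], cov, size=200_000).T

best = mu_y + rho * (sd_y / sd_z) * (z - mu_z)               # mu(Z), equation (1.4.9)
print(np.mean((y - best) ** 2), sd_y**2 * (1 - rho**2))      # MSPE, cf. (1.4.10): both about 2.56
print(np.var(best) / np.var(y), rho**2)                      # cf. (1.4.11): both about 0.36
```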
distribution (Section B. or the average height of sbns whose fathers are the height Z.Section 1. (Zn. 0 Example 1." This is compensated for by "progression" toward the mean among the sons of shorter fathers and there is no paradox. variance cr2.11). Then the predicted height of the son. In Galton's case. We write Mee = ' ~! _ PZy ElY /lO(Z))' ~ Var l"o(Z) Var Y Var Y . Note that in practice. the MCC equals the square of the .COV(Zd. Theorem B.4. Yn ) from the population. We shall see how to do this in Chapter 2.. YI ).12) with MSPE ElY l"o(Z)]' ~ E{EIY l"o(zlI' I Z} = E(uYYlz) ~ Uyy  EyzEziEzy. I"Y) T. . . Y) is unavailable and the regression line is estimated on the basis of a sample (Zl. Thus.. The variability of the predicted value about 11. where the last identity follows from (1.. these were the heights of a randomly selected father (Z) and his son (Y) from a large human population. so the MSPE of !lo(Z) is smaller than the MSPE of the constant predictor J1y. One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with Y. Y» T T ~ E yZ and Uyy = Var(Y).4 Prediction 37 tt. The Multivariate Normal Distribution.6).. y)T has a (d + !) multivariate normal. consequently. coefficient ofdetennination or population Rsquared. I"Y = E(Y). Ezy O"yy E zz is the d x d variancecovariance matrix Var(Z) . Thus. should.4. The quadratic fonn EyzEZ"iEzy is positive except when the joint nonnal distribution is degenerate. By (1.. (1 . be less than that of the actual heights and indeed Var((! . E).6.p)tt + pZ. in particular in Galton's studies. uYYlz) where. Let Z = (Zl. the best predictor E(Y I Z) ofY is the linear function (1. . . the distribution of (Z.6) in which I' (I"~.4..8. ttd)T and suppose that (ZT.8 = EziEzy anduYYlz ~ uyyEyzEziEzy. tall fathers tend to have shorter sons.. E zy = (COV(Z" Y). . there is "regression toward the mean.4. . " Zd)T be a d x 1 covariate vector with mean JLz = (ttl.p)1" + pZ) = p'(T'. is closer to the population mean of heights tt than is the height of the father. N d +1 (I'. This quantity is called the multiple correlation coefficient (MCC). . and positive correlation p.3. . usual correlation coefficient p = UZy / crfyCT lz when d = 1.5 states that the conditional distribution of Y given Z = z is N(I"Y +(zl'zf.
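The best predictor (1.4.12) and its MSPE σ_YY - Σ_YZ Σ_ZZ⁻¹ Σ_ZY involve only the mean vector and covariance matrix, so they are simple to compute. The sketch below uses an arbitrary illustrative parametrization (it is not Galton's data nor the height data discussed next) and also evaluates the squared multiple correlation coefficient.

```python
import numpy as np

# Illustrative parameters for (Z1, Z2, Y) jointly normal (not the text's height data)
mu_z, mu_y = np.array([1.0, 2.0]), 0.5
sig_zz = np.array([[2.0, 0.5],
                   [0.5, 1.0]])               # Var(Z)
sig_zy = np.array([0.8, 0.3])                 # Cov(Z, Y)
sig_yy = 1.5                                  # Var(Y)

beta = np.linalg.solve(sig_zz, sig_zy)        # Sigma_ZZ^{-1} Sigma_ZY

def mu0(z):                                   # best predictor, cf. (1.4.12)
    return mu_y + (np.asarray(z) - mu_z) @ beta

mspe = sig_yy - sig_zy @ beta                 # sigma_YY - Sigma_YZ Sigma_ZZ^{-1} Sigma_ZY
mcc_sq = (sig_zy @ beta) / sig_yy             # squared multiple correlation coefficient
print(mu0([1.5, 2.5]), mspe, mcc_sq)
```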
29W· Then the strength of association between a girl's height and those of her mother and father. Suppose that E(Z') and E(Y') are finite and Z and Y are not constant.:.bZ)' is uniquely minintized by b = ba. are P~.3%. .335.l Proof.1 and 2.39 l.)T. Then the unique best zero intercept linear predictor i~ obtained by taking E(ZY) b=ba = E(Z') . let Y and Z = (Zl. and E(Y _ boZ)' = E(Y') _ [E(ZY)]' E(Z') (1. y)T is unknown. E(Y . If we are willing to sacrifice absolute excellence. l. The problem of finding the best MSPE predictor is solved by Theorem 1.y = .~ ~:~~). whereas the unique best linear predictor is ILL (Z) = al + bl Z where b = Cov(Z. yn)T See Sections 2. respectively.zy ~ (407. (Z~. Suppose(l) that (ZT l Y) T is trivariate nonnal with Var(Y) = 6.2.bZ)' = E(Y') + E(Z')(b  Therefore. In practice. E(Y ..4.4.9% and 39. .209. respectively. We first do the onedimensional case. Y. and Performance Criteria Chapter 1 For example. P~.3. knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33.y = .393. What is the best (zero intercept) linear predictor of Y in the sense of minimizing MSPE? The answer is given by: Theorem 1.ba) + Zba]}' to get ba )' .zz ~ (. Y) a I VarZ.1. The best linear predictor. Let us call any random variable of the form a + bZ a linear predictor and any such variable with a = 0 a zero intercept linear predictor.bZ}' = E(Y)  b E(Z) 1· = (Y  [Z(b . Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of Z and Y in order to calcujate E(Y I Z) and that the best predictor may be a complicated function of Z. Z2 = father's height).38 Statistical Models. The natural class to begin with is that of linear combinations of components of Z. and parents.E(Z')b~.4. when the distribution of (ZT. The percentage reductions knowing the father's and both parent's heights are 20. Goals. the hnear predictor l'a(Z) and its MSPE will be estimated using a 0 sample (Zi. p~y = . We expand {Y . In words...13) .5%. Z2)T be the heights in inches of a IOyearold girl and her parents (Zl = mother's height.. we can avoid both objections by looking for a predictor that is best within a class of simple predictors.
a .4.1 the best linear predictor and best predictor differ (see Figure 1.5).E(Z)IIZ . E(Y .boZ)' = 0. .l7) in the appendix. Let Po denote = . I 1. = I'Y + (Z  J.I'd Z )]' = 1 05 ElY I'(Z)j' .bE(Z). Substituting this value of a in E(Y . if the best predictor is linear. then a = a. E[Y .4. Note that R(a.b(Z .E(Z))]'.bZ)2 is uniquely minimized by taking a ~ E(Y) .4. whiclt corresponds to Y = boZo We could similarly obtain (A. This is in accordance with our evaluation of E(Y I Z) in Example 1. it must coincide with the best linear predictor.b.I'l(Z)1' depends on the joint distribution P of only through the expectation J.E(Y)]) = EziEzy.E(Y)) . That is.E(ZWW ' E([Z .E(Z)f[Z .4. This is because E(Y .l).t and covariance E of X.bZ) + (E(Y) . (1. Theorem 1. If EY' and (E([Z . Best Multivariate Linear Predictor. In that example nothing is lost by using linear prediction. On the other hand.I1. and only if. 0 Note that if E(Y I Z) is of the form a + bZ.4.4.E(Z) and Y .4. E(Y .a . Therefore..bZ)2 we see that the b we seek minimizes E[(Y .1). by (1.Section 1. 0 Remark 1.E(Y) to conclude that b.bZ)' ~ Var(Y .boZ)' > 0 is equivalent to the CauchySchwarz inequality with equality holding if. whatever be b.E(Z)J[Y .4.4 Prediction 39 To prove the second assertion of the theorem note that by (lA. b) X ~ (ZT.2. From (1. A loss of about 5% is incurred by using the best linear predictor.13) we obtain tlte proof of the CauchySchwarz inequality (A.14) Yl T Ep[Y . in Example 1.bE(Z) .4.1. and b = b1 • because. then the Unique best MSPE predictor is I'dZ) Proof.tzf [3. E(Y . OUf linear predictor is of the fonn d I'I(Z) = a + LbjZj j=l = a+ ZTb [3 = (E([Z . We can now apply the result on zero intercept linear predictors to the variables Z .Z)'.a .a)'.16) directly by calculating E(Y . is the unique minimizing value.a.E(Z)J))1 exist.
4.3.4. Ro(a. thus. ~ Y . b) is minimized by (1.05 + 1. .4 by extending the proof of Theorem 1. See Problem 1. E). distribution and let Ro(a.I'(Z) and each of Zo. E(Zj[Y . j = 0.40 Statistical Models. b) is miuimized by (1. b) = Ro(a.4. . Zd are uncorrelated.50 0.4.4. p~y = Corr2 (y. Y). R(a.I'I(Z)]'.3 to d > 1. . our new proof shows how secondmoment results sometimes can be established by "connecting" them to the nannal distribution. However. and R(a. The three dots give the best predictor.4. 0 Remark 1.17 for an overall measure of the strength of this relationship.. case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the correlation between Y and the best linear predictor of Y. We could also have established Theorem 1. Because P and Po have the same /L and E.25 0. By Example 1. We want to express Q: and f3 in terms of moments of (Z.4. . I"(Z) = E(Y I Z) = a + ZT 13 for unknown Q: E R and f3 E Rd.00 z Figure 1. that is. ..1(a). that is.3.19.75 1. Set Zo = 1.4.4...4.4. b). A third approach using calculus is given in Problem 1.4. 0 Remark 1. Goals. .d. Suppose the mooel for I'(Z) is linear. . The line represents the best linear predictor y = 1. 0 Remark 1. Thus. 2 • • 1 o o 0. the MCC gives the strength of the linear relationship between Z and Y.4. N(/L.45z. b) = Epo IY . In the general.15) . and Performance Criteria Chapter 1 y .14). not necessarily nonnal. . By Proposition 1. (1. the multivariate uormal..(a + ZT 13)]) = 0.1.4. .2.14).I'LCZ)).
g(Z»): 9 E g).Thus. Using the distance D.5. It is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear. the optimal MSPE predictor is the conditional expected value of Y given Z. g(Z) and h(Z) are said to be orthogonal if at least one has expected value zero and E[g(Z)h(Z)] ~ O.I'(Z) is orthogonal to I'(Z) and to I'dZ).3.8) defined by r(5) = E[I(II.. . We consider situations in which the goal is to predict the (perhaps in the future) value of a random variable Y. The optimal MSPE predictor in the multivariate normal distribution is presented. we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation. I'dZ» = ". then 90(Z) is called the projection of Y on the space 9 of functions of Z and we write go(Z) = . Because the multivariate normal model is a linear model. and it is shown that if we want to predict Y on the basis of information contained in a random vector Z.(Y. Consider the Bayesian model of Section 1. When the class 9 of possible predictors 9 with Elg(Z)1 space as defined in Section B.5 SUFFICIENCY Once we have postulated a statistical model. the optimal MSPE predictor E(6 I X) is the Bayes procedure for squared error loss.(Y 19NP).4.15) for a and f3 gives (1. Thus. we can conclude that I'(Z) = . and projection 1r notation.Section 1.16) is the Pythagorean identity.17) ""'(Y.5)'. The notion of mean squared prediction error (MSPE) is introduced. We return to this in Section 3. then by recording or taking into account only the value of T(X) we have a reduction of the data.4.2 and the Bayes risk (1. Remark 1.4.5(X»].14) (Problem 1..16) (1. but as we have seen in Sec~ tion 1.23)..5) = (9 . If T assigns the same value to different sample points. this gives a new derivation of (1.2.(Y I 9).3. Moreover.4. With these concepts the results of this section are linked to the general Hilbert space results of Section 8.4. Remark 1. I'(Z» + ""'(Y.5 Sufficiency 41 Solving (1.I 0 and there is a 90 E 9 such that 0 < oc form a Hilbert go = arg inf{". we see that r(5) ~ MSPE for squared error loss 1(9.4.10.(Y 1ge) ~ "(I'(Z) 19d (1. We begin by fonnalizing what we mean by "a reduction of the data" X EX. can also be a set of functions.4.' (l'e(Z).. Note that (1. 1.12)..4. I'dZ) = . usually R or Rk. I'(Z» Y . If we identify II with Y and X with Z.6.. T(X) = X loses information about the Xi as soon as n > 1. The range of T is any space of objects T.4.1. Recall that a statistic is any function of the observations generically denoted by T(X) or T. o Summary.
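When, as is typical in practice, the joint distribution of (Z, Y) is unknown, the best linear predictor is estimated by plugging sample moments into the formulas of Theorem 1.4.3. A minimal sketch follows; the simulated data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
Z = rng.normal(size=(n, 2))                               # illustrative covariates
Y = 1.0 + Z @ np.array([2.0, -1.0]) + rng.normal(size=n)  # illustrative responses

# Plug-in estimates of b = Var(Z)^{-1} Cov(Z, Y) and a = E(Y) - E(Z)'b
b_hat = np.linalg.solve(np.cov(Z, rowvar=False), np.cov(Z.T, Y)[:2, 2])
a_hat = Y.mean() - Z.mean(axis=0) @ b_hat
fitted = a_hat + Z @ b_hat

# Estimated squared multiple correlation: correlation of Y with the fitted linear predictor
print(a_hat, b_hat, np.corrcoef(Y, fitted)[0, 1] ** 2)
```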
) is conditionally distributed as (X. (Xl. Although the sufficient statistics we have obtained are "natural. .3. where 0 is unknown. Thus.'" . . • X n ) (X(I) .1. the sample X = (Xl. we need only record one. By (A. Then X = (Xl. = tis U(O. However. Thus. By Example B.) given Xl + X.5). P[X I = XI.8)n' (1.5. Suppose there is no dependence between the quality of the items produced and let Xi = 1 if the ith item is good and 0 otherwise.) and that of Xlt/(X l + X. 16. . t) and Y = t .1) where Xi is 0 or 1 and t = L:~ I Xi. X n ) into the same number.. Example 1.l.) and Xl +X. once the value of a sufficient statistic T is known. o Example 1.l.0. + X.9. A machine produces n items in succession.Xn ) where Xi = 1 if the ith item sampled is defective and Xi = 0 otherwise. X. X 2 the time between the arrival of the first and second customers.1 we had sampled the manufactured items in order. T = L~~l Xi. 1).'" . One way of making the notion "a statistic whose use involves no loss of infonnation" precise is the following.. The most trivial example of a sufficient statistic is T(X) = X because by any interpretation the conditional distribution of X given T(X) = X is point mass at X. is sufficient. . suppose that in Example 1. We could then represent the data by a vector X = (Xl.1. Y) where X is uniform on (0. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information.) are the same and we can conclude that given Xl + X. whatever be 8. when Xl + X.. Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) Let Xl be the time of arrival of the first customer. It follows that.4). For instance. We prove that T = X I + X 2 is sufficient for O. t) distribution.2. the conditional distribution of X 0 given T = L:~ I Xi = t does not involve O.' • . . the conditional distribution of XI/(X l + X. Begin by noting that according to TheoremB. ~ t. the conditional distribution of Xl = [XI/(X l + X. Thus. Using our discussion in Section B. whatever be 8. X(n))' loses information about the labels of the Xi.42 Statistical Models. Xl and X 2 are independent and identically distributed exponential random variables with parameter O. .2. .. in the context of a model P = {Pe : (J E e}. Therefore.. = t.Xn = xnl = 8'(1. recording at each stage whether the examined item was defective or not. A statistic T(X) is called sufficient for PEP or the parameter if the conditional distribution of X given T(X) = t does not involve O. . We give a decision theory interpretation that follows. " . Xl has aU(O. 0 In both of the foregoing examples considerable reduction has been achieved. it is intuitively clear that if we are interested in the proportion 0 of defective items nothing is lost in this situation by recording and using only T. Each item produced is good with probability 0 and defective with probability 1." it is important to notice that there are many others e. By (A. XI/(X l +X. . X.5.'" . Instead of keeping track of several numbers. The total number of defective items observed. . 1) whatever be t. are independent and the first of these statistics has a uniform distribution on (0..l we see that given Xl + X2 = t. is a statistic that maps many different values of (Xl.)](X I + X.l. Goals.X. given that P is valid..5.Xn ) does not contain any further infonnation about 0 or equivalently P. T is a sufficient statistic for O. and Performance Criteria Chapter 1 Even T(X J.Xn ) is the record of n Bernoulli trials with probability 8.
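The defining property, that the conditional distribution of X given T does not involve θ, can be seen directly in a small simulation of Example 1.5.1. In the sketch below (the values of n, t, and the two success probabilities are arbitrary illustrations), the arrangements of successes among samples with Σ Xᵢ = t are, up to simulation error, equally likely whatever θ is.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
n, t = 4, 2                                               # illustrative trial count and value of T

def conditional_freqs(theta, reps=200_000):
    X = rng.binomial(1, theta, size=(reps, n))
    kept = X[X.sum(axis=1) == t]                          # condition on T = t
    counts = Counter(map(tuple, kept))
    total = sum(counts.values())
    return {k: round(v / total, 3) for k, v in counts.items()}

for theta in (0.3, 0.7):
    print(theta, conditional_freqs(theta))
# Each of the 6 arrangements with two successes has conditional frequency close to 1/6 for both thetas.
```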
that will do the same job. Being told that the number of successes in five trials is three is the same as knowing that the difference between the number of successes and the number of failures is one. More generally, if T₁ and T₂ are any two statistics such that T₁(x) = T₁(y) if and only if T₂(x) = T₂(y), then T₁ and T₂ provide the same information and achieve the same reduction of the data. Such statistics are called equivalent.

In general, checking sufficiency directly is difficult because we need to compute the conditional distribution. Fortunately, a simple necessary and sufficient criterion for a statistic to be sufficient is available. This result was proved in various forms by Fisher, Neyman, and Halmos and Savage. It is often referred to as the factorization theorem for sufficient statistics.

Theorem 1.5.1. In a regular model, a statistic T(X) with range T is sufficient for θ if, and only if, there exist a function g(t, θ) defined for t in T and θ in Θ and a function h defined on X such that

p(x, θ) = g(T(x), θ) h(x)   (1.5.2)

for all x ∈ X, θ ∈ Θ.

Proof. We shall give the proof in the discrete case. The complete result is established, for instance, by Lehmann (1997, Section 2.6). Let (x₁, x₂, ...) be the set of possible realizations of X and let tᵢ = T(xᵢ). Then T is discrete and Σᵢ Pθ[T = tᵢ] = 1 for every θ. To prove the sufficiency of (1.5.2), we need only show that Pθ[X = xⱼ | T = tᵢ] is independent of θ on each of the sets Sᵢ = {θ : Pθ[T = tᵢ] > 0}, for every i and j. Now, if (1.5.2) holds,

Pθ[T = tᵢ] = Σ_{x : T(x) = tᵢ} p(x, θ) = g(tᵢ, θ) Σ_{x : T(x) = tᵢ} h(x).   (1.5.3)

By our definition of conditional probability in the discrete case, for θ ∈ Sᵢ,

Pθ[X = xⱼ | T = tᵢ] = Pθ[X = xⱼ, T = tᵢ] / Pθ[T = tᵢ],

where, by (1.5.2),

Pθ[X = xⱼ, T = tᵢ] = p(xⱼ, θ) = g(tᵢ, θ) h(xⱼ) if T(xⱼ) = tᵢ, and = 0 if T(xⱼ) ≠ tᵢ.   (1.5.4)

Applying (1.5.3) we arrive at

Pθ[X = xⱼ | T = tᵢ] = h(xⱼ) / Σ_{x : T(x) = tᵢ} h(x) if T(xⱼ) = tᵢ, and = 0 if T(xⱼ) ≠ tᵢ,   (1.5.5)

which does not involve θ. Conversely, suppose that T is sufficient and
let g(ti' 0) = porT = tiL h(x) = P[X = xIT(X) = til Then (1.O)=onexP[OLXil i=l (1.Xn ) = L~ IXi is sufficient.Xn ) is given by (see (A. .2 (continued).. 0 o Example 1.Xn. 1=1 (!. 0) by (B.JL)'} ~=l I n I! 2 .. We may apply Theorem 1.. . O)h(x) (1. . . X(n) is a sufficient statistic for O. 0 Example 1.. Common sense indicates that to get information about B. n P(Xl.5. both of which are unknown.4)). I x n ) = 1 if all the Xi are > 0..1. A whole class of distributions.8) = eneStift > 0. . Conversely.S..9) can be rewritten as " 1 Xn .9) if every Xi is an integer between 1 and B and p( X 11 • (1. .5.16.X.1 to conclude that T(X 1 . > 0.. Estimating the Size of a Population.5. and Performance Criteria Chapter 1 Therefore.' (L x~ 1=1 1 n n 2JL LXi))]. Takeg(t. 1. The probability distribution of X "is given by (1.5.'). Consider a population with () members labeled consecutively from I to B. which admits simple sufficient statistics and to which this example belongs. .' L(Xi . Goals. By Theorem 1.xn }.6) p(x. 1 X n be independent and identically distributed random variables each having a normal distribution with mean {l and variance (j2.5.. The population is sampled with replacement and n members of the population are observed and their labels Xl. Then the density of (X" .7) o Example 1. T = T(x)] = g(T(x). l X n are the interarrival times for n customers.n /' exp{ . T is sufficient. If X 1. ~ Po[X = x...5. 10 fact.. and P(XI' ..~. are introduced in the next section. _ _ _1 ..'I.2. }][exp{ . X n ). O} = 0 otherwise. . we need only keeep track of X(n) = max(X .... and h(xl •. . .5. Expression (1.Il) .2. •. if T is sufficient. we can show that X(n) is sufficient.'t n /'[exp{ .'" .X n JI) = 0 otherwise..Xn ) is given by I.[27f. .5. ' .5.3).10) P(Xl. . then the joint density of (X" .3.Xn are recorded. Let Xl.O) where x(n) = onl{x(n) < O}. . . [27"..4.5. and both functions = 0 otherwise.. = max(xl.44 Statistical Models.8) if all the Xi are > 0. Let 0 = (JL.5..
2a I3.' + 213 2a + 213. with JLi following the linear regresssion model ("'J .9) and _ (y. L~l x~) and () only and T(X 1 . Then pix. I)) .~ I: xn By the factorization theorem. Here is an example.~0)}(2"r~n exp{ . respectively.5..5 Sufficiency 45 I Evidently P{Xl 1 • •• . 1 2 2 .X)']. Then 6 = {{31.1. (1. [1/(n i=l n n 1)1 I:(Xi . for any decision procedure o(x). 0') for allO.4 with d = 2.Zi)'} exp {Er.l) distributed. .5. Suppose X" . An equivalent sufficient statistic in this situation that is frequently used IS S(X" . 0 X Example 1. that Y I . .12) t ofT(X) and a Example 1.Section 1.6.• Yn are independent.o) = R(O. . By randomized we mean that o'(T(X») can be generated from the value random mechanism not depending on B. EYi 2 . X n ) = [(lin) where I: Xi. . . I)) = exp{nI)(x .. Ezi 1'i) is sufficient for 6. if T(X) is sufficient. Thus. R(O.5. X n ) = (I: Xi.5. The first and second components of this vector are called the sample mean and the sample variance. we can. i= 1 = (lin) L~ 1 Xi. X is sufficient.( 2?Ta ')~ exp p {E(fJ. find a randomized decision rule o'(T(X)) depending only on T(X) that does as well as O(X) in the sense of having the same risk function. a 2)T is identifiable (Problem 1. in Example 1. T = (EYi . Yi N(Jli.1 we can conclude that n n Xi. {32. a 2 ). Specifically. we construct a rule O'(X) with the sarne risk = mean squared error as o(X) as follows: Conditionally. o Sufficiency and decision theory Sufficiency can be given a clear operational interpretation in the decision theoretic setting.. Suppose. 2:>.} + EY. Let o(X) = Xl' Using only X.) i=l i=l is sufficient for B.1. X n are independent identically N(I).··· .xnJ)) is itself a function of (L~ upon applying Theorem 1.5. that is. where we assume diat the given constants {Zi} are not all identical.EZiY.. a<.
2.. it is Bayes sufficient for every 11. then T(X) = (L~ I Xi. where k In this situation we call T Bayes sufficient.5. = 2:7 Definition. the distribution of 6(X) does not depend on O.12) follows along the lines of the preceding example: Given T(X) = t. . 0) as p(x. 0 The proof of (1. and Performance Criteria Chapter 1 given X = t. T = I Xi was shown to be sufficient. Minimal sufficiency For any model there are many sufficient statistics: Thus. the same as the posterior distribution given T(X) = L~l Xi = k.5.46 Statistical Models.O) = g(S(x). we find E(T') ~ E[E(T'IX)] ~ E(X) ~ I" ~ E(X .6(X»)IT]} = R(O.1 (continued). R(O.2. o Sufficiency aud Bayes models There is a natural notion of sufficiency of a statistic T in the Bayesian context where in addition to the model P = {Po : 0 E e} we postulate a prior distribution 11 for e. Using Section B. . 6'(X) and 6(X) have the same mean squared error. n~l) distribution. Equivalently.6') ~ E{E[R(O. Now draw 6' randomly from this conditional distribution.4..Xn is a N(p.. But T(X) provides a greater reduction of the data. O)h(x) E:  . .1 (Bernoulli trials) we saw that the posterior distribution given X = x is 1 Xi. In this Bernoulli trials case. ).6). Then by the factorization theorem we can write p(x. Thus. (Kolmogorov). L~ I xl) and S(X) = (X" . (T2) sample n > 2. This result and a partial converse is the subject of Problem 1. Let SeX) be any other sufficient statistic. Example 1.. T(X) is Bayes sufficient for IT if the posterior distribution of 8 given X = x is the same as the posterior (conditional) distribution of () given T(X) = T(x) for all x. we can find a transformation r such that T(X) = r(S(X)).. X n ) are hoth sufficient.l and (1. in that. if Xl. Theorem }. by the double expectation theorem. We define the statistic T(X) to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic S(X). ) Var(T') = E[Var(T'IX)] + VarIE(T'IX)]  ~ nl n + 1 n ~ 1 = Var(X .5. IfT(X) is sufficient for 0. choose T" = 15"' (X) from the normal N(t.6'(T))IT]} ~ E{E[R(O. Goals. This 6' (T(X)) will have the same risk as 6' (X) because.14. 0 and X are independent given T(X). In Example 1.5.6).
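The construction just described is easy to simulate. In the sketch below (the values of μ, σ, and n are arbitrary illustrations, with σ² the population variance), δ(X) = X₁ and the randomized rule δ*(T) obtained by drawing from the conditional distribution N(X̄, σ²(n - 1)/n) have essentially the same mean squared error, as the argument above shows they must.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000                    # arbitrary illustrative values
X = rng.normal(mu, sigma, size=(reps, n))

delta = X[:, 0]                                               # delta(X) = X_1
xbar = X.mean(axis=1)                                         # sufficient statistic T = X-bar
delta_star = rng.normal(xbar, sigma * np.sqrt((n - 1) / n))   # draw from the conditional law of X_1 given X-bar

print(np.mean((delta - mu) ** 2), np.mean((delta_star - mu) ** 2))   # both approximately sigma^2 = 4
```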
the likelihood function (1.) = (/". However.1).11) is determined by the twodimensional sufficient statistic T Set 0 = (TI. the "likelihood" or "plausibility" of various 8. L x (') determines (iI. the statistic L takes on the value Lx.)}.5.) ~ (L Xi.Section 1. ~ 2/3 and 0.5. 20' 20 I t. L Xf) i=l i=l n n = (0 10 0.(t.8) for the posterior distribution can then be remembered as Posterior ex: (Prior) x (Likelihood) where the sign ex denotes proportionality as functions of 8. Thus.4 (continued). t2) because. it gives.5 Sufficiency 47 Combining this with (1. if we set 0. L x (()) gives the probability of observing the point x. Thus. (T') example. we find T = r(S(x)) ~ (log[2ng(s(x). for a given observed x. In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around x. if X = x. t.O). For any two fixed (h and fh.O E e. In the discrete case. 2/3)/g(S(x). The likelihood function o The preceding example shows how we can use p(x. take the log of both sides of this equation and solve for T.o)nT ~g(S(x).T. we find OT(1 . }exp ( I . as a function of (). when we think of Lx(B) as a function of (). ~ 1/3. (T'). for a given 8.O)h(x) forall O. In this N(/". the ratio of both sides of the foregoing gives In particular. ()) x E X}. We define the likelihood function L for a given observed data vector x as Lx(O) = p(x. for example. 1/3)]}/21og 2. Lx is a map from the sample space X to the class T of functions {() t p(x. = 2 log Lx(O.5.) n/' exp {nO. then Lx(O) = (27rO. ()) for different values of () and the factorization theorem to establish that a sufficient statistic is minimally sufficient. Now.1) nlog27r . 20. Example 1. T is minimally sufficient.5.. It is a statistic whose values are functions. The formula (1.
For instance. (X( I)' . If T(X) is sufficient for 0. . if a 2 = 1 is postulated..5.4.X)2) is sufficient.13.5. a 2 is assumed unknown. See Problem ((x:). Consider an experiment with observation vector X = (Xl •. or for the parameter O.X. The factorization theorem states that T(X) is sufficient for () if and only if there exist functions g( t. Let p(X. Summary. 0) = g(T(X). _' . X is sufficient.1. hence. 1 X n are a random sample.5. 0 In fact. . We define a statistic T(X) to be Bayes sufficient for a prior 7r if the posterior distribution of f:} given X = x is the same as the posterior distribution of 0 given T(X) = T(x) for all X. OJ denote the frequency function or density of X.17). and Scheffe. (X. B ul the ranks are needed if we want to look for possible dependencies in the observations as in Example 1. the ranks.. Ax is the function valued statistic that at (J takes on the value p x.. Suppose that X has distribution in the class P ~ {Po : 0 E e}. where R i ~ Lj~t I(X. . Thus. 1 X n ).).. We show the following result: If T(X) is sufficient for 0. Let Ax OJ c (x: p(x.48 Statistical Models.5.Xn .. L is minimal sufficient. or if T(X) ~ (X~l)' .. but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding S(X)see Problem 1.5. The likelihood function is defined for a given data vector of .. By arguing as in Example 1.5. the residuals. 0 the likelihood ratio of () to Bo. 1) and Lx (1.O) > for all B. . If.1 (continued) we can show that T and. We say that a statistic T (X) is sufficient for PEP. 1) (Problem 1. itself sufficient. S(X) = (R" . l:~ I (Xi .5.Oo) > OJ = Lx~6o)' Thus. hence. The "irrelevant" part of the data We can always rewrite the original X as (T(X). 0) and heX) such that p(X. it is Bayes sufficient for O. SeX)~ where SeX) is a statistic needed to uniquely detennine x once we know the sufficient statistic T(x). If P specifies that X 11 •.X). we can find a randomized decision rule J'(T(X» depending only on the value of t = T(X) and not on 0 such that J and J' have identical risk functions. A sufficient statistic T(X) is minimally sufficient for () if for any other sufficient statistic SeX) we can find a ttansformation r such that T(X) = r(S(X).5. Suppose there exists 00 such that (x: p(x. but if in fact a 2 I 1 all information about a 2 is contained in the residuals..12 for a proof of this theorem of Dynkin. then for any decision procedure J(X).. . < X. as in the Example 1.X Cn )' the order statistics. . O)h(X). Lehmann. if the conditional distribution of X given T(X) = t does not involve O. L is a statistic that is equivalent to ((1' i 2) and. . and Performance Criteria Chapter 1 with a similar expression for t 1 in terms of Lx (O. Thus. • X( n» is sufficient. Goals. ifT(X) = X we can take SeX) ~ (Xl . in Example 1. SeX) becomes irrelevant (ancillary) for inference if T(X) is known but only if P is valid. 1. a statistic closely related to L solves the minimal sufficiency problem in general.Rn ). Then Ax is minimal sufficient.
exp{xlogBB}. B( 0) on e. B). 1.o I x. Note that the functions 1}. and T are not unique.6. if there exist realvalued functions fJ(B). B. many other common features of these families were discovered and they have become important in much of the modern theory of statistics.Section 1.2"" }. Probability models with these common features include normal.B(B)} with g(T(x). More generally. 1. x.B) and h(x) with itself in the factorization theorem. Subsequently. They will reappear in several connections in this book. Ax(B) is a minimally sufficient statistic. B E' sufficient for B. The Poisson Distribution.1 The OneParameter Case The family of distributions of a model {Pe : B E 8}. by the factorization theorem.B) = eXe. for x E {O. then. Let Po be the Poisson distribution with unknown mean IJ. Pitman. and if there is a value B E such that o e e. IfT(X) is {x: p(x. B) of the Pe may be written p(x. realvalued functions T and h on Rq. B) > O} c {x: p(x. Here are some examples. p(x.1. 1. gamma.6. beta. = 1 . Then. B>O. such that the density (frequency) functions p(x.1) where x E X c Rq.6 EXPONENTIAL FAMILIES The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. In a oneparameter exponential family the random variable T(X) is sufficient for B.B(B)} (1. Poisson.6. Example 1. We return to these models in Chapter 2.B) ~ h(x)exp{'7(B)T(x) . these families form the basis for an important class of models called generalized linear models. This is clear because we need only identify exp{'7(B)T(x) . We shall refer to T as a natural sufficient statistic of the family.6. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman. and Dannois through investigations of this property(l). and multinomial regression models used to relate a response variable Y to a set of predictor variables.B o) > O}. binomial. (1. is said to be a oneparameter exponentialfami/y. the likelihood ratio Ax(B) = Lx(B) Lx(Bo) depends on X through T(X) only.6 Exponential Families 49 observations X to be the function of B defined by Lx(B) = p(X. BEe.2) .
50 . T(x) ~ x.~z.T(x)=x. q = 1.1](O)=log(I~0).B(O)=nlog(I0).h(x) = .O) f(z. If {pJ=».3) I' .~O'.1](0) = .0)]. and Performance Criteria Chapter 1 Therefore. h(x) = (2rr)1 exp { .6) .Z)OI) Z)'O'J} (2rrO)1 exp { (2rr)1 exp {  ~ [z' + (y  ~z' } exp { .1](0) = logO. .. for x E {O.Xm are independent and identically distributed with common distribution Pe.3. 0 Example 1.} .6.y. Example 1.5) o = 2.. .6.6. o The families of distributions obtained by sampling from oneparameter exponential families are themselves oneparameter exponential families.~O'(y  z)' logO}.. is the family of distributions of X = (Xl' .(y I z) = <p(zW1<p((y . Suppose X has a B(n. Z and W are f(x. B(O) = logO. n) < 0 < 1. Then > 0. . This is a oneparameter exponential family distribution with q = 2. B) are the corresponding density (frequency) functions. I I! .O) = f(z)f. 1 (1. we have p(x. Suppose X = (Z. fonn a oneparameter exponential family as in (1. suppose Xl. Here is an example where q (1. . + OW. .T(x) = (y  z)'. y)T where Y = Z independent N(O. 1.1).O) IIh(x. where the p. 0) distribution. 1).6.) i=1 m B(8)] ( 1. 0 Then.4) Therefore. the Pe form a oneparameter exponential family with ~~I .h(X)=( : ) . B(O) = 0.. Specifically.' x. the family of distributions of X is a oneparameter exponential family with q=I. . Goals. .2. 1 X m ) considered as a random vector in Rmq and p(x. . The Binomial Family.6.6.6. Statistical Models.O) = ( : ) 0'(1_ Or" ( : ) ex p [xlog(1 ~ 0) + nlog(l. .. p(x. 0 E e.) exp[~(O)T(x. (1.
where x = (x₁, ..., x_m). If we use the superscript m to denote the corresponding T, η, B, and h, then

q^(m) = mq,  η^(m)(θ) = η(θ),  B^(m)(θ) = mB(θ),  T^(m)(x) = Σ_{i=1}^m T(xᵢ),  h^(m)(x) = Π_{i=1}^m h(xᵢ).   (1.6.7)

Note that the natural sufficient statistic T^(m) is one-dimensional whatever be m. For example, if X = (X₁, ..., X_m) is a vector of independent and identically distributed P(θ) random variables and P^(m) is the family of distributions of X, then the P^(m) form a one-parameter exponential family with natural sufficient statistic T^(m)(x) = Σ_{i=1}^m xᵢ.

In our first Example 1.6.1 the sufficient statistic T^(m)(X₁, ..., X_m) = Σ_{i=1}^m Xᵢ is distributed as P(mθ). This family of Poisson distributions is one-parameter exponential whatever be m.

Some other important examples are summarized in the following table.

TABLE 1.6.1

  Family of distributions      η(θ)          T(x)
  N(μ, σ²), σ² fixed           μ/σ²          x
  N(μ, σ²), μ fixed            -1/(2σ²)      (x - μ)²
  Γ(p, λ), p fixed             -λ            x
  Γ(p, λ), λ fixed             p - 1         log x
  β(r, s), s fixed             r - 1         log x
  β(r, s), r fixed             s - 1         log(1 - x)

The statistic T^(m)(X₁, ..., X_m) corresponding to the one-parameter exponential family of distributions of a sample from any of the foregoing is just Σ_{i=1}^m T(Xᵢ). We leave the proof of these assertions to the reader.

In the discrete case we can establish the following general result.

Theorem 1.6.1. Let {Pθ} be a one-parameter exponential family of discrete distributions with corresponding functions T, η, B, and h; then the family of distributions of the statistic T(X) is a one-parameter exponential family of discrete distributions whose frequency functions may be written

h*(t) exp{η(θ)t - B(θ)}   (1.6.8)

for suitable h*.
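As a numerical illustration of (1.6.7), for an i.i.d. Poisson sample the natural sufficient statistic Σᵢ Xᵢ is itself Poisson with mean mθ, in agreement with the remark above and with Theorem 1.6.1. The sketch below uses arbitrary illustrative values of m and θ.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(1)
m, theta = 5, 2.3                                             # illustrative sample size and mean
T = rng.poisson(theta, size=(100_000, m)).sum(axis=1)         # natural sufficient statistic for each sample

t = np.arange(T.max() + 1)
emp = np.bincount(T, minlength=t.size) / T.size               # empirical frequency function of T
pois = np.array([exp(-m * theta) * (m * theta) ** k / factorial(k) for k in t])
print(np.max(np.abs(emp - pois)))                             # small: T behaves like a P(m*theta) variable
```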
the result follows.6.6.8) h(x) exp[~(8)T(x) .A(1})1 for s in some neighborhood 0[0.1. Then as we show in Section 1.1}) = (1(x!)exp{1}xexp[1}]}. We obtain an important and useful reparametrization of the exponential family (1. If X is distributed according to (1.9) with 1} ranging over E is called the canonical oneparameter exponential family generated by T and h. By definition.1) by letting the model be indexed by 1} rather than 8. (continued). 0 A similar theorem holds in the continuous case if the distributions ofT(X) are them Canonical exponential families.. }. E is either an interval or all of R and the class of models (1. If 8 E e. where 1} = log 8.2.9) with 1} E E contains the class of models with 8 E 8. Theorem 1. x E {O.6.2. 00 00 exp{A(1})} andE = R.. P9[T(x) = tl L {x:T(x)=t} p(x.52 Statistical Models. the momentgenerating function o[T(X) exists and is given by M(s) ~ exp[A(s + 1}) .8) exp[~(8)t .9) and ~ is an Ihlerior point of E. . The exponential family then has the form q(x. x=o x=o o Here is a useful result. . Let E be the collection of all 1} such that A(~) is finite. Goals. Example 1. = L(e"' (x!) = L(e")X (x! = exp(e").6. if q is definable.1. and Performance Criteria Chapter 1 Proof. The Poisson family in canonical form is q(x. then A(1}) must be finite..6. Jh(x)exp[1}T(x)]dx in the continuous case and the integral is replaced by a sum in the discrete case.6.1}) = h(x)exp[1}T(x) .9) where A(1}) = logJ .A(~)I.2.B(8)]{ Ifwe let h'(t) ~ L:{"T(x)~t} h(x).6. The model given by (1.6. L {x:T{x)=t} h(x)}.6. x E X c Rq (1. £ is called the natural parameter space and T is called the natural sufficient statistic..B(8)] L {x:T(x)=t} ( 1. selves continuous.
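For the Poisson family in canonical form, A(η) = e^η, and the moment-generating function formula of Theorem 1.6.2 can be checked by simulation. In the sketch below, η₀ and s are arbitrary illustrative values; the last line uses the familiar fact that the Poisson mean and variance both equal θ = e^{η₀}.

```python
import numpy as np

rng = np.random.default_rng(5)
eta0, s = 0.4, 0.25                                            # arbitrary canonical parameter and argument
A = np.exp                                                     # A(eta) = exp(eta) for the canonical Poisson family

X = rng.poisson(np.exp(eta0), size=500_000)                    # theta = exp(eta0); T(X) = X
print(np.mean(np.exp(s * X)), np.exp(A(eta0 + s) - A(eta0)))   # empirical MGF vs Theorem 1.6.2
print(X.mean(), X.var(), np.exp(eta0))                         # mean and variance of T(X) are both exp(eta0)
```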
6. nlog0 J. 0 = n log 82 and A(1]) 1 xl has mean nit} = = nlog(21]). _ . More generally. It is used to model the density of "time until failure" for certain types of equipment. Koopman.. . if there exist realvalued functions 171.17k and B of 6.O) = h(x) exp[L 1]j(O)Tj (x) .2 The Multiparameter Case Our discussion of the "natural form" suggests that oneparameter exponential families are naturally indexed by a onedimensional real parameter fJ and admit a onedimensional sufficient statistic T(x).. 02 ~ 1/21]. which is naturally indexed by a kdimensional parameter and admit a kdimensional sufficient statistic.8) ~ (x/8 2)exp(_x 2/28 2). h on Rq such that the density (frequency) functions of the Po may be written as. X n is a sample from a population with density p(x. Proof.8) = (il(x.j8 2))exp(i=l n n I>U282) ~=1 = (il xi)exp[202 LX.8 > O. and Dannois were led in their investigations to the following family of distributions. 1 Tk. B(O) the natural sufficient statistic E~ 2n(j2 and variance nlt}2 4n04 .6. is said to be a kparameter exponential family. Direct computation of these moments is more complicated. and realvalued functions T 1.A(1])]dx h(x)exp[(s + 1])T(x)  A(s + 1])Jdx = . exp[A(s + 1]) . Var(T(X)) ~ A"(1]).10) .4 Suppose X 1> ••• . Now p(x.. We compute M(s) = = E(exp(sT(X))) {exp[A(s ~ + 1])  A(1])]} J'" J J'" J h(x)exp[(s + 1])T(x) . Therefore. A family of distrib~tioos {PO: 0 E 8}.<·or .6 Exponential Families 53 Moreover.8(0)].12). x E X j=1 c Rq. Pitman.Section 1. c R k . 0 Here is a typical application of this result. The rest of the theorem follows from the momentenerating property of M(s) (see Section A. E(T(X)) ~ A'(1]). 2 i=1 i=l n 1 n Here 1] = 1/20 2. . We give the proof in the continuous case. (1. 1. x> 0.6. This is known as the Rayleigh distribution. ' e k p(x.A(1])] because the last factor. Example 1. being the integral of a density. is one.
q(x. .4)...Tk(x)f and. J1 < 00. then the preceding discussion leads us to the natural sufficient statistic m m (LXi. Again. letting the model be Thus. It will be referred to as a natural sufficient statistic of the family.3.6.x .') population.LX. Suppose lhat P.(x) = x'... .1. .'). = N(I". (]"2 > O}.. .5.Ot = fl. .x E X c Rq where T(x) = (T1 (x).. the canonical kparameter exponential indexed by 71 = (fJI. which corresponds to a twOparameter exponential family with q = I. The Normal Family.'). . the vector T(X) = (T. + log(27r"'».O) =exp[". we define the natural parameter space as t: = ('l E R k : 00 < A('l) < oo}. The density of Po may be written as e ~ {(I".I' 54 Statistical Models. X m ) from a N(I".A('l)}. 2 (".. . Goals. i=l i=1 which we obtained in the previous section (Example 1. .(X) •.fJk)T rather than family generated by T and h is e.2(".2..11) (]"2.') : 00 < f1 x2 1 J12 2 p(x. Then the distributions of X form a kparameter exponential family with natural sufficient statistic m m TCm)(x) = (LTl(Xi).LTk(Xi )) t=1 Example 1.' .'" . suppose X = (Xl. +log(27r" »)].Xm ) where the Xi are independent and identically distributed and their common distribution ranges over a k~parameter exponential family given by (1.6...6. 8 2 = and I" 1 2' Tl(x) = x.. I ! In the discrete case. . 1)2(0) = 2 " B(O) " 1"' 1 " T. 0 Again it will be convenient to consider the "biggest" families. In either case. in the conlinuous case. i=l . and Performance Criteria Chapter 1 By Theorem 1. . h(x) = I..5.'l) = h(x)exp{TT(x)'l... A(71) is defined in the same way except integrals over Rq are replaced by sums. . . (1. .10). If we observe a sample X = (X" .TdX))T is sufficient.
/(7'. Now we can write the likelihood as k qo(x. Suppose a<. k ~ 2.~. From Exam~ pIe 1..~3): ~l E R. is not and all c.. A(1/) ~ ~[(~U2~. ..5. k}] with canonical parameter 0. j=1 k This is a kparameter canonical exponential family generated by TIl"" Tk and h(x) = II~ I l[Xi E {I.(x»). Multinomial Trials.. . A(1/) = 4n[~:+m'~5+z~1~'+2Iog(rr/~3)].5..)j. 0) ~ exp{L . In this example. .'. .~2)]./(7'.Section 1.~l= (3. ..id. .. ~l ~ /..T.k. < l. 0) for 1 = (1. . the density of Y = (Yb .. . Let T. E R k . ~. 2..~·(x) . .6 Exponential Families 55 (x. we can achieve this by the reparametrization k Aj = eO] /2:~ e Oj ."k. and AJ ~ P(X i ~ j)._l)(x)1/nlog(l where + Le"')) j=1 .. x') = (T. A E A.5.) + log(rr/..k.(x). and rewriting kl q(x.'12 ~ (3. Linear Regression. .j j=l = 1.. in Examples 1. Then p(x. .. In this example. TT(x) = T(Y) ~ (EY"EY. However 0. where m.= {(~1. 0 + el) ~ qo(x.. h(x) = I andt: ~ R x R. ~3 and t: = {(~1.7.6.)T.n log j=l L exp(. .5.~3 ~ 1/2(7'.. This can be identifiable because qo(x.1. E R. < OJ.kj. I yn)T can be put in canonical form with k = 3.4 and 1. Y n are independent.'I2./Ak) = ". and t: = Rk .EziY. L71 Aj = I}. as X and the sample space of each Xi is the k categories {I.5 that Y I . with J.ti = f31 + (32Zi. . where A is the simplex {A E R k : 0 < A. remedied by considering IV ~j = log(A.. ..~3 < OJ.6. A) = I1:~1 AJ'(X). Example 1. 1/) =exp{T'f. i = 1. .j = 1. Example 1. 0. . = 1/2(7'.~. = n'Ezr Example 1.6. It will often be more convenient to work with unrestricted parameters. a 2 )./(7'. . We observe the outcomes of n independent trials where each trial can end up in one of k possible categories. I <j < k 1.. n. Yi rv N(Jli.(x) = L:~ I l[X i = j]. We write the outcome vector as X = (Xl. (N(/" (7') continued}. .Xn)'r where the Xi are i.~2): ~l E R.6...
If the Ai are unrestricted. this. if c Re and 1/( 0) = BkxeO C R k . are identifiable. Goals.6. 1 <j < k .·.. .).d. M(T) sponding to and ~ MexkT + hex" it is easy to see that the family generated by M(T(X» and h is the subfamily of P corre 1/(0) = MTO.7 and X = (X t.. h(y) = 07 I ( ~. Here 'Ii = log 1\. .2.6. More10g(P'I[X = jI!P'IIX ~ k]).1.6. and Performance Criteria Chapter 1 1 parameter canonical exponential family generated by T (k 1) {I. 1 < i < n. then all models for X are exponential families because they are submodels of the multinomial trials model. I < k. 0 < Ai < 1. then the resulting submodel of P above is a submodel of the exponential family generated by BTT(X) and h. " ..6. See Problem 1. < .Y n ) ~ Y. 0) ~ q(x. be independent binomial. " Example 1.13) . 17). ) 1(0 < However. let Xt =integers from 0 to ni generated Vi < n. as X. X n ) T where the Xi are i. 11 E £ C R k } is an exponential family defined by p(x. is an npararneter canonical exponential family with Yi by T(Yt . where (J E e c R1. ..6. . < X n be specified levels and (1. 0 1. Logistic Regression. and 1] is a map from e to a subset of R k • Thus. 1 < i < n.6.. k}] with canonical parameter TJ and £ = Rk~ 1.56 Note that q(x.8. Here is an example of affine transformations of 6 and T.' A(1/) = L:7 1 ni 10g(l + e'·)..17 e for details.3 Building Exponential Families Submodels A submodel of a kparameter canonical exponential family {q(x. . 1]) is a k and h(x) = rr~ 1 l[xi E over. 1/(0») (1.6. Ai). Similarly.i. if X is discrete Affine transformations from Rk to RI defined by UP is the canonical family generated by T kx 1 and hand M is the affine transformation I' . from Example 1. is unchanged. Let Y.12) taking on k values as in Example 1. B(n. the parameters 'Ii = Note that the model for X Statistical Models.
PIX < x])) 8. p(x.6. i = 1. + 8.)T . ~ (1.i/a. which is less than k = n when n > 3.x)W'.1)T. Suppose that YI ~ ..a 2 ).6 Exponential Families 57 This is a linear transformation TJ(8) = B nxz 8 corresponding to B nx2 = (1..10. Example 1.Section 1. . Y.i. log(P[X < xl!(1 .. + 8.. L:~ 1 Xl yi)T and h with A(8. + 8. .9. with B = p" we can write where T 1 = ~~ I Xi. . This is a 0 In Example 1. }Ii N(J1..6..13) holds. . .6. are called curved exponential families provided they do not form a canonical exponential family in the 8 parametrization.5 a 2nparamctercanonical exponential family model with fJi = P. LocationScale Regression.!t is assumed that each animal has a random toxicity threshold X such that death results if and only if a substance level on or above X is applied.8. .x and (1. Assume also: (a) No interaction between animals (independence) in relation to drug effects (b) The distribution of X in the animal population is logistic.i.). suppose the ratio IJlI/a.. Example 1. . T 2 = L:~ 1 X. and l1n+i = 1/2a. The Yi represent the number of animals dying out of ni when exposed to level Xi of the substance.8... Yn. i=I This model is sometimes applied in experiments to determine the toxicity of a substance. N(Jl.. generated by . n.14) 8. so it is not a curved family. Then. x).Xn i. ranges over (0. .'.00)... is a known constant '\0 > O.6.. o Curved exponential families Exponential families (1.. y. 'fJ1(B) curved exponential family with l = 1. which is called the coefficient of variation or signaltonoise ratio.. > O.Xi)).xn)T. . this is by Example 1. If each Iii ranges over R and each a. '.'V T(Y) = (Y" .6. then this is the twoparameter canonical exponential family generated by lYIY = (L~l. Gaussian with Fixed SignaltoNoise Ratio. However.x = (Xl.6. that is..6. In the nonnal case with Xl.. SetM = B T . Then (and only then). 8) in the 8 parametrization is a canonical exponential family. E R. . = '\501 and 'fJ2(0) = ~'\5B2. Y n are independent.1. PIX < xl ~ [1 + exp{ (8. where 1 is (1.d. I ii. the 8 parametrization has dimension 2.12) with the range of 1J( 8) restricted to a subset of dimension l with I < k . a.8. .) = L ni log(1 + exp(8.
Let Yj . with an exponential family density Then Y = (Y1. we define I I . For 8 = (8 1 . say.i.12) is not an exponential family model. yn)T is modeled by the exponential family generated by T(Y) ~. and = Ee sTT . sampling. for unknown parameters 8 1 E R. be independent.. 1 Tj(Y.6.6.. Thus. 1989.8.(0).10). Bickel. Goals.5..) depend on the value Zi of some covariate. and Performance Criteria Chapter 1 and h(Y) = 1. Ii.13) exhibits Y.8. We extend the statement of Theorem 1. Carroll and Ruppert.8 note that (1. 1 Bj(O). 1 < j < n. Section 15. Even more is true. Examples 1. Recall from Section B. . and B(O) = ~.6. 83 > 0 (e. . E R. M(s) as the momentgenerating function. TJ'(Y).4 Properties of Exponential Families Theorem 1.E Yj c Rq. Next suppose that (JLi.12.1. We return to curved exponential family models in Section 2.). then p(y._I11.5 that for any random vector Tkxl.1 generalizes directly to kparameter families as does its continuous analogue. 8.g.) = (Yj. but a curved exponential family model with 0 1=3.3.(O)Tj'(Y) for some 11.) and 1 h'(YJ)' with pararneter1/(O).2.6. 1988. 1978.6.10 and 1. n.. In Example 1. the map 1/(0) is Because L:~ 11]i(6)Yi + L:~ i17n+i(6)Y? cannot be written in the fonn ~. respectively. ~.6.58 Statistical Models. a.6. as being distributed according to a twoparameter family generated by Tj(Y.6 are heteroscedastic and homoscedastic models.d. 1. 0) ~ q(y. and Snedecor and Cochran. Sections 2.1/(O)) as defined in (6. Supermode1s We have already noted that the exponential family structure is preserved under i.xjYj ) and we can apply the supennode! approach to reach the same conclusion as before. Models in which the variance Var(}'i) depends on i are called heteroscedastic whereas models in which Var(Yi ) does not depend on i are called homoscedastic.
(with 00 pennitted on either side).8". v(x).6.6.6.1 give a classical result in Example 1.6. h(x) > 0.6. T.3. 1O)llkxk.6.6. Under the conditions ofTheorem 1.a)'72) < aA('7l) + (1 . 8". Proof of Theorem 1. for any u(x).T(x)).1 and Theorem 1. + (1 . 8A 8A T" = A('7o) f/J.r(T) = IICov(7~.. Fiually (c) 0 The formulae of Corollary 1.6.A('70) = II 8". then T(X) has under 110 a moment M(s) ~ exp{A('7o + s) .A('7o)} vaLidfor aLL s such that 110 + 5 E E. We prove (b) first. h) with corresponding natural parameter space E and function A(11).a)'72 E is proved in exactly the same way as Theorem 1. u(x) ~ exp(a'7. . Theorem 1.6.9.3(c). ('70)..6 Expbnential Families 59 V.1. ~ = 1 . Substitute ~ ~ a.6.2. By the Holder inequality (B.6. Suppose '70' '71 E t: and 0 < a < 1. A(U'71 Which is (b). Then (a) E is convex (b) A: E ) R is convex (c) If E has nonempty interior in R k generating function M given by and'TJo E E. Let P be a canonicaL kparameter exponential famiLy generated by (T.5. A where A( '70) = (8".r'7o T(X) . vex) = exp«1 .3. Corollary 1. J exp('7 TT (x))h(x)dx > 0 for all '7 we conclude from (1.15) is finite. .a)A('72) Because (1.4).a)'7rT(x)) and take logs of both sides to obtain.Section 1.15) t: the righthand side of (1. S > 0 with ~ + ~ = 1.3 V.15) that a'7. J u(x)v(x)h(x)dx < (J ur(x)h(x)dx)~(Jv'(x)h(x)dx)~. t: and (a) follows. Since 110 is an interior point this set of5 includes a baLL about O.6. ('70) II· The corollary follows immediately from Theorem B. If '7" '72 E + (1 . ('70)) .a. .
"I 11 J Evidently every kparameter exponential family is also k'dimensional with k' > k.7 we can see that the multinomial family is of rank at most k .7). It is intuitively clear that k . for all 9 because 0 < p«X. . Our discussion suggests a link between rank and identifiability of the 'TJ parameterization. P7)[L.(Tj(X» = P>.lx ~ j] = AJ ~ e"'ILe a . Sintilarly. Suppose P = (q(x. 9" 9. ajTj(X) = ak+d < 1 unless all aj are O.6.4. An exponential family is of rank k iff the generating statistic T is kdimensional and 1.7. . (ii) 7) is a parameter (identifiable).x. + 9. However. II ! I Ii . Note that PO(A) = 0 or Po (A) < 1 for some 0 iff the corresponding statement holds i. we are writing the oneparameter binomial family corresponding to Yl as a twoparameter family with generating statistic (Y1 . using the a: parametrization. such that h(x) > O. (continued).6. . there is a minimal dimension.~~) < 00 for all x.. Here.1 is in fact its rank and this is seen in Theorem 1. o The rank of an exponential family I . if we consider Y with n'> 2 and Xl < X n the family as we have seen remains of rank < 2 and is in fact of rank 2. However. k A(a) ~ nlog(Le"') j=l and k E>.~. But the rank of the family is 1 and 8 1 and 82 are not identifiable.6.8. ifn ~ 1. in Example 1. Then the following are equivalent. (iii) Var7)(T) is positive definite.(9) = 9. T. Formally.1. p:x. Theorem 1.6. We establish the connection and other fundamental relationships in Theorem 1. (i) P is a/rank k. and ry. 2' Going back to Example 1.4. Goads. Tk(X) are linearly independent with positive probability.4 that follows. and Performance Criteria Chapter 1 Example 1.6.60 Statistical Models. (X)". Xl Y1 ). 7) E f} is a canonical exponential/amily generated by (TkXI ' h) with natural parameter space e such that E is open. f=l .6. .
hence. III. (iv) '" (v) = (tii) Properties (iv) and (v) are equivalent to the statements holding for every Q defined as previously for arbitrary 1)0' 1).6.1)o)TT. 0 Corollary 1. Taking logs we obtain (TJI .Section 1. We give a detailed proof for k = 1.6. because E is open.A(TJI)}h(x) ~ exp{TJ2T(x) . F!~ .2 and.A(TJ. Conversely. Then (a) P may be uniquely parametrized by "'(1)) E1)T(X) where". 1)) is a strictly concavefunction of1) On 1'. . all 1) ~ (iii) = a T Var1)(T)a = Var1)(aT T) = 0 for some a of 0. ~ (ii) ~ _~ (i) (ii) = P1). We.(ii) = (iii). . A"(1]O) = 0 for some 1]0 implies c... :. all 1) = (~i) II.. by our remarks in the discussion of rank.2.4 hold and P is of rank k.) is false.. 1)0) E n· Q is the exponential family (oneparameter) generated by (1)1 . ~ P1)o some 1). Apply the case k = 1 to Q to get ~ (ii) =~ (i)..6 Exponential Families 61 (iv) 1) ~ A(1)) is 11 onto (v) A is strictly convex on E. Let ~ () denote "(. of 1)0· Let Q = (P1)o+o(1).T(x) .'t·· . ~ (i) =~ (iii) ~ (i) = P1)[aT T = cJ ~ 1 for some a of 0. = Proof. by Theorem 1. Note that.6. This is just a restatement of (iv) and (v) of the theorem.6.A(TJ2))h(x). have (i) . . with probability 1. A'(TJ) is strictly monotone increasing and 11. thus.(v). ".3. (b) logq(x. Proof. Thus.1)o) : 1)0 + c(1). This is equivalent to Var"(T) ~ 0 '*~ (iii) {=} f"V(ii) There exist T}I =I= 1J2 such that F Tll = Pm' Equivalently exp{r/. = Proof ofthe general case sketched I." Then ~(i) > 1 is then sketched with '* P"[a. Now (iii) =} A"(TJ) > 0 by Theorem 1.T ~ a2] = 1 for al of O...TJ2)T(X) = A(TJ2) ..) with probability 1 =~(i). Suppose tlult the conditions of Theorem 1.' . The proof for k details left to a problem. 0 . A' is constant. (iii) = (iv) and the same discussion shows that (iii) . '" '.~> . for all T}. which that T implies that A"(TJ) = 0 for all TJ and. A is defined on all ofE. ranges over A(I'). hence.
The relation in (a) is sometimes evident and the μ parametrization is close to the initial parametrization of classical P. Thus, the B(n, θ) family is parametrized by E(X), where X is the Bernoulli trial, and the N(μ, σ₀²) family by E(X). For {N(μ, σ²)}, E(X, X²) = (μ, σ² + μ²), which is obviously a 1-1 function of (μ, σ²). However, the relation in (a) may be far from obvious (see the problems). The corollary will prove very important in estimation theory. See Section 2.3.

We close the present discussion of exponential families with the following example.

Example 1.6.11. The p-Variate Gaussian Family. An important exponential family is based on the multivariate Gaussian distributions of Section B.6. Recall that Y_{p×1} has a p-variate Gaussian distribution, N_p(μ, Σ), with mean μ_{p×1} and positive definite variance-covariance matrix Σ_{p×p}, iff its density is

f(Y, μ, Σ) = |det Σ|^{-1/2} (2π)^{-p/2} exp{ -(1/2)(Y - μ)^T Σ^{-1}(Y - μ) }.    (1.6.16)

Rewriting the exponent we obtain

log f(Y, μ, Σ) = -(1/2) Y^T Σ^{-1} Y + (Σ^{-1}μ)^T Y - (1/2)(log|det Σ| + μ^T Σ^{-1} μ + p log 2π).    (1.6.17)

The first two terms on the right in (1.6.17) can be rewritten

-( Σ_{1≤i<j≤p} a_{ij} Y_i Y_j + (1/2) Σ_{i=1}^p a_{ii} Y_i² ) + Σ_{i=1}^p ( Σ_{j=1}^p a_{ij} μ_j ) Y_i

where Σ^{-1} = ‖a_{ij}‖, revealing that this is a k = p(p+3)/2 parameter exponential family with statistics (Y₁, …, Y_p, (Y_i Y_j)_{1≤i≤j≤p}), h(Y) ≡ 1, θ = (μ, Σ), and

B(θ) = (1/2)(log|det Σ| + μ^T Σ^{-1} μ),

where we identify the second element of T, a p × p symmetric matrix, with its distinct p(p+1)/2 entries. Thus, by our supermodel discussion, if Y₁, …, Y_n are i.i.d. N_p(μ, Σ), then Y = (Y₁^T, …, Y_n^T)^T follows the k = p(p+3)/2 parameter exponential family with T = (Σ_i Y_{i1}, …, Σ_i Y_{ip}, (Σ_i Y_{ij} Y_{il})_{1≤j≤l≤p}). It may be shown (Problem 1.6.29) that these statistics generate this family, that the natural parameter space is open, and that the rank of the family is indeed p(p+3)/2, so that Theorem 1.6.4 applies. □

1.6.5 Conjugate Families of Prior Distributions

In Section 1.2 we considered beta prior distributions for the probability of success in n Bernoulli trials. This is a special case of conjugate families of priors, families to which the posterior after sampling also belongs. Suppose X₁, …, X_n is a sample from the k-parameter exponential family (1.6.10),

p(x | θ) = ∏_{i=1}^n h(x_i) exp{ Σ_{j=1}^k η_j(θ) Σ_{i=1}^n T_j(x_i) - nB(θ) }    (1.6.18)

where θ ∈ Θ and where, as we always do in the Bayesian context, we write p(x | θ) for p(x, θ). Let

ω(t₁, …, t_{k+1}) = ∫…∫ exp{ Σ_{j=1}^k η_j(θ) t_j - t_{k+1} B(θ) } dθ₁ … dθ_k    (1.6.19)

with the integrals replaced by sums in the discrete case, and let

Ω = {(t₁, …, t_{k+1}) : 0 < ω(t₁, …, t_{k+1}) < ∞}.

We assume that Ω is nonempty (see Problem 1.6.36). Here t₁, …, t_{k+1} play the role of "parameters," while θ is treated as the variable of interest.

Proposition 1.6.1. If p(x | θ) is given by (1.6.18), then the (k+1)-parameter exponential family

π_t(θ) = exp{ Σ_{j=1}^k η_j(θ) t_j - t_{k+1} B(θ) - log ω(t) },  t = (t₁, …, t_{k+1})^T ∈ Ω,    (1.6.20)

is a conjugate prior to p(x | θ).

Proof. By (1.6.18) and (1.6.20),

π(θ | x) ∝ p(x | θ) π_t(θ) ∝ exp{ Σ_{j=1}^k η_j(θ)( Σ_{i=1}^n T_j(x_i) + t_j ) - (t_{k+1} + n) B(θ) }    (1.6.21)

where ∝ indicates that the two sides are proportional as functions of θ. Because two probability densities that are proportional must be equal, π(θ | x) is the member of the exponential family (1.6.20) given by the last expression in (1.6.21), and our assertion follows. □

Remark 1.6.1. (1.6.21) is an updating formula in the sense that as data x₁, …, x_n become available, the parameter t of the prior distribution is updated to s = t + a, where a = (Σ_{i=1}^n T₁(x_i), …, Σ_{i=1}^n T_k(x_i), n)^T; that is,

s = (s₁, …, s_{k+1})^T = ( t₁ + Σ_{i=1}^n T₁(x_i), …, t_k + Σ_{i=1}^n T_k(x_i), t_{k+1} + n )^T.

It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way. □

Example 1.6.12. Suppose X₁, …, X_n is a N(θ, σ₀²) sample, where σ₀² is known and θ is unknown. To choose a prior distribution for θ, we consider the conjugate family of the model defined by (1.6.18). For n = 1,

p(x | θ) ∝ exp{ (θ/σ₀²) x - θ²/(2σ₀²) }.    (1.6.22)

This is a one-parameter exponential family with η(θ) = θ/σ₀², B(θ) = θ²/(2σ₀²), T(x) = x. The conjugate two-parameter exponential family given by (1.6.20) has density

π_t(θ) ∝ exp{ (1/σ₀²)( t₁θ - t₂θ²/2 ) }.    (1.6.23)

Upon completing the square, we see that π_t(θ) is defined only for t₂ > 0 and all t₁, and is the N(t₁/t₂, σ₀²/t₂) density. Our conjugate family, therefore,

consists of all N(η₀, τ²) distributions where η₀ varies freely and τ² is positive.    (1.6.24)

If we start with a N(η₀, τ₀²) prior density, that is, (1.6.23) with

t₁ = η₀ σ₀²/τ₀²,  t₂ = σ₀²/τ₀²,    (1.6.25)

and we observe Σ_{i=1}^n X_i = s, then, by (1.6.21), the parameter t is updated to t₁(s) = η₀ σ₀²/τ₀² + s and t₂(n) = σ₀²/τ₀² + n, so the posterior is a normal density with mean

μ(s, n) = ( σ₀²/τ₀² + n )^{-1} [ s + η₀ σ₀²/τ₀² ]    (1.6.26)

and variance τ²(n) = σ₀² / ( σ₀²/τ₀² + n ). Note that we can rewrite (1.6.26) intuitively as

μ(s, n) = w₁(n) η₀ + w₂(n) x̄    (1.6.27)

where x̄ = s/n and

w₁(n) = τ²(n)/τ₀²,  w₂(n) = n τ²(n)/σ₀² = 1 - w₁(n):    (1.6.28)

the posterior mean is a weighted average of the prior mean and the sample mean, with the weight on x̄ increasing with n. These formulae can be generalized to the case X_i i.i.d. N_p(θ, Σ₀), 1 ≤ i ≤ n, Σ₀ known, θ ~ N_p(η₀, τ₀² I), where η₀ varies over R^p, τ₀ is scalar with τ₀ > 0 and I is the p × p identity matrix (Problem 1.6.37). Moreover, it can be shown (Problem 1.6.30) that the N_p(λ, Γ) family with λ ∈ R^p and Γ symmetric positive definite is a conjugate family to N_p(θ, Σ₀), but a richer one than we've defined in (1.6.20) except for p = 1, because N_p(λ, Γ) is a p(p+3)/2 rather than a p + 1 parameter family. □

Discussion. Note that the uniform U({1, …, θ}) model of Example 1.5.3 is not covered by this theory. The natural sufficient statistic max(X₁, …, X_n) is one-dimensional whatever be the sample size, but it is not of the form Σ_{i=1}^n T(X_i), and the family of distributions in this example and the family U(0, θ) are not exponential. Despite the existence of classes of examples such as these, a theory has been built up, starting with Koopman, Pitman, and Darmois, that indicates that under suitable regularity conditions families of distributions which admit k-dimensional sufficient statistics for all sample sizes must be k-parameter exponential families. Problem 1.6.10 is a special result of this type. Some interesting results and a survey of the literature may be found in Brown (1986).

In the one-dimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. It is easy to see that one can construct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1.6.20). See Problems 1.6.31 and 1.6.32.

Summary. {P_θ : θ ∈ Θ}, Θ ⊂ R^k, is a k-parameter exponential family of distributions if there are real-valued functions η₁, …, η_k and B on Θ, and real-valued functions T₁, …, T_k, h on R^q, such that the density (frequency) function of P_θ can be written as

p(x, θ) = h(x) exp{ Σ_{j=1}^k η_j(θ) T_j(x) - B(θ) },  x ∈ X ⊂ R^q.    (1.6.29)

T(X) = (T₁(X), …, T_k(X)) is called the natural sufficient statistic of the family. The canonical k-parameter exponential family generated by T and h is

q(x, η) = h(x) exp{ T^T(x) η - A(η) }

where A(η) = log ∫…∫ h(x) exp{ T^T(x) η } dx in the continuous case, with the integrals replaced by sums in the discrete case. The set E = {η ∈ R^k : A(η) < ∞} is called the natural parameter space. The set E is convex and the map A : E → R is convex. If E has nonempty interior in R^k and η₀ is an interior point of E, then T(X) has, for X ~ P_{η₀}, the moment-generating function

M(s) = exp{ A(η₀ + s) - A(η₀) }

for all s such that η₀ + s is in E. Moreover, E_{η₀}[T(X)] = Ȧ(η₀) and Var_{η₀}[T(X)] = Ä(η₀), where Ȧ and Ä denote the gradient and Hessian of A.
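For readers who want to see the updating rule of Example 1.6.12 in action, the following Python sketch (an editorial illustration, not part of the original text; the prior parameters, σ₀ and the simulated data are arbitrary choices) computes the posterior mean and variance from the updated parameters t₁(s), t₂(n) and checks the weighted-average form (1.6.27)-(1.6.28).

```python
import numpy as np

# Illustrative check of Example 1.6.12: N(theta, sigma0^2) data, N(eta0, tau0^2) prior.
# All numerical values below are arbitrary, chosen only for the demonstration.
rng = np.random.default_rng(0)
sigma0, eta0, tau0 = 2.0, 1.0, 3.0
theta_true = 2.5
n = 20
x = rng.normal(theta_true, sigma0, size=n)
s = x.sum()

# Updated conjugate parameters t1(s), t2(n) as in (1.6.25)-(1.6.26).
t1 = eta0 * sigma0**2 / tau0**2 + s
t2 = sigma0**2 / tau0**2 + n
post_mean = t1 / t2
post_var = sigma0**2 / t2

# The same mean written as the weighted average (1.6.27)-(1.6.28).
w2 = n * tau0**2 / (sigma0**2 + n * tau0**2)   # weight on the sample mean
w1 = 1 - w2                                    # weight on the prior mean
assert np.isclose(post_mean, w1 * eta0 + w2 * x.mean())
print(post_mean, post_var)
```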
An exponential family is said to be of rank k if T is k-dimensional and 1, T₁, …, T_k are linearly independent with positive P_θ probability for some θ ∈ Θ. If P is a canonical exponential family with E open, then the following are equivalent: (i) P is of rank k; (ii) η is identifiable; (iii) Var_η(T) is positive definite; (iv) the map η → Ȧ(η) is 1-1 on E; (v) A is strictly convex on E.

A family F of prior distributions for a parameter vector θ is called a conjugate family of priors to p(x | θ) if the posterior distribution of θ given x is a member of F. The (k+1)-parameter exponential family

π_t(θ) = exp{ Σ_{j=1}^k η_j(θ) t_j - B(θ) t_{k+1} - log ω }

where

ω = ∫…∫ exp{ Σ_{j=1}^k η_j(θ) t_j - B(θ) t_{k+1} } dθ,  t = (t₁, …, t_{k+1}) ∈ Ω = {(t₁, …, t_{k+1}) ∈ R^{k+1} : 0 < ω < ∞},

is conjugate to the exponential family p(x | θ) defined in (1.6.29).

1.7 PROBLEMS AND COMPLEMENTS

Problems for Section 1.1

1. Give a formal statement of the following models identifying the probability laws of the data and the parameter space. State whether the model in question is parametric or nonparametric.

(a) A geologist measures the diameters of a large number n of pebbles in an old stream bed. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean μ and variance σ². He wishes to use his observations to obtain some information about μ and σ² but has in advance no knowledge of the magnitudes of the two parameters.

(b) A measuring instrument is being used to obtain n independent determinations of a physical constant μ. Suppose that the measuring instrument is known to be biased to the positive side by 0.1 units. Assume that the errors are otherwise identically distributed normal random variables with known variance.
i=l (c) X and Y are independent N (I' I .. An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest. Which of the following parametrizations are identifiable? (Prove or disprove. Each . i = 1. .) < Fy(t) for every t.Xpb .1.1(c)..7 Problems and Complements 67 (c) In part (b) suppose that the amount of bias is positive but unknown. 4.. .2 ) and Po is the distribution of Xll. . j = 1... = o.ap ) : Lai = O}. v.l(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case.. B = (1'1. (e) The parametrization of Problem l. Q p ) and (A I . (a) Let U be any random variable and V be any other nonnegative random variable. and Po is the distribution of X (b) Same as (a) with Q = (X ll . ~ N(l'ij..2) where fLij = v + ai + Aj.2). . . .. are sampled at random from a very large population. restricted to p = (all' ... .2) and N (1'2.1. e (e) Same as (d) with (Qt.. . 0. 1 Ab) restricted to the sets where l:f I (Xi = 0 and l:~~1 A. (d) Xi. . respectively.Section 1.1(d).t. then X is (b) As in Problem 1. (If F x and F y are distribution functions such that Fx(t) said to be stochastically larger than Y. 2... .. . 1'2) and we observe Y X. Show that Fu+v(t) < Fu(t) for every t. (b) The parametrization of Problem 1. ..... Once laid.··. Can you perceive any difficulties in making statements about f1.1 describe formaHy the foHowing model. . . .Xp are independent with X~ ruN(ai + v. 0.p. Ab. Are the following parametrizations identifiable? (Prove or disprove. .) (a) The parametrization of Problem 1. 3..) (a) Xl.. = (al. Q p ) {(al.. each egg has an unknown chance p of hatching and the hatChing of one egg is independent of the hatching of the others.2 ).for this model? (d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean A.Xp ).. bare independeht with Xi. .1. Two groups of nl and n2 individuals. ."" a p . At.
Let Y and Po is the distribution of Y. Hint: IfY . . Mk. if and only if.0. + mk + r = n 1. (0).· . is the distribution of X e = (0.. + Vk + P = 1. and Performance Criteria Chapter 1 ":.c) = PrY > c .Lk VI n".0.Nk = nk. • .. .. 68 Statistical Models. whenXis unifonnon {O.1). V k mk P r where J. Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour. 0 = (1'.O}. }.. (b)P.t) fj>r all t.··· ..  . 0 < V < 1 are unknown.1... Consider the two sample models of Examples 1. M k = mk] I! n. I. J. I . What assumptions underlie the simplification? 6. Let N i be the number dropping out and M i the number graduating during year i. .) (a) p.I nl .Li < + . 0 < Vi < 1. P8[N I .t).k + VI + .Li = 71"(1 M)iIJ. then P(Y < t + c) = P( Y < t .2). . = n11MI = mI. The following model is proposed. . Vi = (1 11")(1 ~ v)iI V for i = 1. 7. Show that Y . It is known that the drug either has no effect or lowers blood pressure. =X if X > 1. (e) Suppose X ~ N(I'.9). VI. + fl.p(0... I nl.' I ..L. 2..1.:". o < J1 < 1. the density or frequency function p of Y satisfies p( c + t) ~ p( c . The simplification J.. 5.... Vk) is unknown.k is proposed where 0 < 71" < I.. mk·r. ..p(0. + nk + ml + .c has the same distribution as Y + c.2.. In each of k subsequent years the number of students graduating and of students dropping out is recorded.1.LI. nk·mI . ml . 0'2). Both Y and p are said to be symmetric about c. e = = 1 if X < 1 and Y {J... . . but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown. The number n of graduate students entering a certain department is recorded. i = 1.. • "j :1 member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour. .t) = 1 . . . 1 < i < k and (J = (J..LI ni + . ...9 and they oecur with frequencies p(0.0'2) (d) Suppose the possible control responses in an experiment are 0..2.P(Y < c . .. .3(2) and 1.... (a) What are the assumptions underlying this model'? (b) (J is very difficult to estimate here if k is large. Suppose the effect of a treatment is to increase the control response by a fixed amount (J..0). Which of the following models are regular? (Prove or disprove. is the distribution of X when X is unifonn on (0... k.c bas the same distribution as Y + c.. Goals. . Let Po be the distribution of a treatment response..4(1). I ! fl.. 0 < J.. 8.
.N = 0. Yn denote the survival times of two groups of patients receiving treatments A and B.{3p) are identifiable iff Zl. (]"2) independent. > 0. Show that if we assume that o(x) + x is strictly increasing. Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci. N} are identifiable. (b) In part (a). . 11. r(j) = = PlY = j.tl) implies that o(x) tl. In Example 1. i . eY and (c) Suppose a scale model holds for X.Znj?' (a) Show that ({31. o(x) ~ 21' + tl . C)."" Yn ). (Yl.. X m and Yi. fJp) are not identifiable if n pardffieters is larger than the number of observations.. suppose X has a distribution F that is not necessarily normal.Xn ). 0 'Lf 'Lf op(j) = I. . t > O. .. C(·) = F(· . For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1. according to the distribution of X. = (Zlj...Section 1. N.. Y = T I Y > j].tl) does not imply the constant treatment effect assumption.. ... The Lelunann 1\voSample Model..i. j = 0. . I} and N is known. eX o..tl and o(x) ~ 21' + tl. Y.2x and X ~ N(J1. . Let X < 1'. e) : p(j) > 0. The Scale Model. or equivalently. Hint: Consider "hazard rates" for Y min(T. C). (b) Deduce that (th. then C(·) = F(. log y' satisfy a shift model? = Xc. {r(j) : j ~ 0.. . . e(j) > 0.. . . Sx(t) = .'" . . the two cases o(x) . N}.Xn are observed i. Sbow that {p(j) : j = 0. Therefore. Does X' Y' = yc satisfy a scale model? Does log X'. log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl.. e) vary freely over F ~ {(I'.d. p(j). .6. I(T < C)) = j] PIC = j] PIT where T.3 let Xl. o (a) Show that in this case. 10. . that is."'). and (I'. Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0. . .7 Problems and Complements 69 (a) Show that if Y ~ X + o(X).zp..tl). . That is. C(t} = F(tjo). then satisfy a scale model with parameter e.. = 9. Let c > 0 be a constant. if the number of = (min(T. 12. C are independent. ci '" N(O. Suppose X I. ... e(j).1. .2x yield the same distribution for the data (Xl. are not collinear (linearly independent). ."') and o(x) is continuous.. . . then C(·) ~ F(. 1 < i < n. 0 < j < N.
(1. t > 0. " T k are unobservable and i. ~ a5.(t). The most common choice of 9 is the linear form g({3. Zj) + cj for some appropriate €j. we have the Lehmann model Sy(t) = si. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t). are called the survival functions.7.h. Goals.(t). Show that hy(t) = c. = = (.) Show that Sy(t) = S'.I ~I .(t) if and only if Sy(t) = Sg. The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(.2) is equivalent to Sy (t I z) = sf" (t). respectively.z)}. as T with survival function So. Show that if So is continuous. thus. then Yi* = g({3. = exp{g(.{a(t). Also note that P(X > t) ~ Sort). k = a and b.1) Equivalently.).2) and that Fo(t) = P(T < t) is known and strictly increasing. and Performance CriteriCl Chapter 1 t Ii . Tk > t all occur. z) zT.G(t). then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll.l3. .2.. z)) (1.i.. Set C. hy(t) = c. 12. where TI". So (T) has a U(O. 13.ll) with scale parameter 5. Specify the distribution of €i. Hint: See Problem 1.. A proportional hazard model. .. Sy(t) = Sg.d. log So(T) has an exponential distribution. I' 70 StCitistical Models. F(t) and Sy(t) ~ P(Y > t) = 1 . (b) By extending (bjn) from the rationals to 5 E (0. .. Moreover. Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t). Survival beyond time t is modeled to occur if the events T} > t. 1 P(X > t) =1 (. h(t I Zi) = f(t I Zi)jSy(t I Zi)..l3. I (b) Assume (1. t > 0... (c) Under the assumptions of (b) above. 7.ho(t) is called the Cox proportional hazard model.l3.(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y. show that there is an increasing function Q'(t) such that ifYt = Q'(Y.12. t > 0. . Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy.1.2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 .7. Then h.00). Hint: By Problem 8. .  . I (3p)T of unknowns. I) distribution.... (c) Suppose that T and Y have densities fort) andg(t).7. For treatments A and E. .(t) with C.) Show that (1.
14. the parameter of interest can be characterl ized as the median v = F..p(t)). 15. have a N(O.11 and 1.p and ~ are identifi· able.000 is the minimum monthly salary for state workers.. Problems for Section 1. have a N(O.4 1 0. Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x. wbere the model {(F. an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0. 1f(O2 ) = ~. (a) Suppose Zl' Z..d.L . what parameters are? Hint: Ca) PIXI < tl = if>(.G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R..8 0.2 1.l (0. When F is not symmetric. Let Xl. still identifiable? If not.1.X n ).6 Let 1f be the prior frequency function of 0 defined by 1f(I1.5) or mean I' = xdF(x) = Fl(u)du... .i. Show that both .v arbitrarily large. (b) Suppose X" .i. x> c x<c 0.12. J1 and v are regarded as centers of the distribution F.1. . Consider a parameter space consisting of two points fh and ()2. Are.p and C. Ca) Find the posterior frequency function 1f(0 I x). Merging Opinions. (b) Suppose Zl and Z.1. '1/1' > 0.2 0.(x/c)e.2 unknown.7 Problems and Complements 71 Hint: See Problems 1. In Example 1. Yn be i. F. ." . Show how to choose () to make J.2) distribution with . the median is preferred in this case.Xm be i.Section 1. J1 may be very much pulled in the direction of the longer tail of the density. Generally. and suppose that for given ().. 1) distribution. .. . Find the median v and the mean J1 for the values of () where the mean exists. 'I/1(±oo) = ±oo.2 with assumptions (t){4). Find the . . Observe that it depends only on l:~ I Xi· I 0). G. Here is an example in which the mean is extreme and the median is not. where () > 0 and c = 2.O) 1 . and Zl and Z{ are independent random variables. and for this reason. Examples are the distribution of income and the distribution of wealth. ) = ~. X n are independent with frequency fnnction p(x posterior1r(() I Xl.d. Yl ..
Assume X I'V p(x) = l:.'" 1rajI Show that J + a. and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 . Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O.=l1r(B i )p(x I 8i ). I XI . X n be distributed as where XI. Consider an experiment in which. Suppose that for given 8 = 0.'" .Xn = X n when 1r(B) = 1. . X n are independent with the same distribution as X. Then P.. . Goals. < 0 < 1. " ••. 2.. is used in the fonnula for p(x)? 2. 4.. c(a). what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta. 7r (d) Give the values of P(B n = 2 and 100. X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is . 71"1 (0 2 ) = .. (b) Find the posterior density of 8 when 71"( OJ = 302 . . . IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree. the outcome X has density p(x OJ ~ (2x/O'). " J = [2:.25. (J. Let X I. (3) Find the posterior density of 8 when 71"( 0) ~ 1.. 3. . .... (3) Suppose 8 has prior frequency. 1. 3. Compare these B's for n = 2 and 100. 0 < x < O. 4'2'4 (b) Relative to (a). Let 71" denote a prior density for 8. ':. For this convergence..0 < B < 1.8). . does it matter which prior. l3(r. This is called the geometric distribution (9(0)).0 < 0 < 1. . . .[X ~ k] ~ (1.)=1. ) ~ . .72 Statistical Models.2. k = 0. Unllormon {'13} . ~ = (h I L~l Xi ~ = . . Find the posteriordensityof8 given Xl = Xl. 0 (c) Find E(81 x) for the two priors in (a) and (b).. where a> 1 and c(a) . (d) Suppose Xl. 7l" or 11"1. }.m+I.2.5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. "X n ) = c(n 'n+a m) ' J=m.Xn are natural numbers between 1 and Band e= {I.75. Show that the probability of this set tends to zero as n t 00.. for given (J = B. ... 1r(J)= 'u .O)kO. 'IT ..
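The posterior computations asked for in the preceding problems can be checked by brute force. The following sketch is an editorial illustration only: it uses a beta prior with a binomial likelihood and arbitrary numbers, and relies on numpy/scipy. It normalizes prior × likelihood on a grid and compares the result with the closed-form beta posterior; the same grid method applies to the geometric and other likelihoods appearing in these problems.

```python
import numpy as np
from scipy import stats

# Minimal sketch: posterior for a success probability theta with a beta(r, s) prior,
# computed by normalizing prior x likelihood on a grid, then compared with the
# closed-form beta(r + k, s + n - k) posterior. All values are arbitrary.
r, s = 2.0, 3.0          # beta(r, s) prior
n, k = 10, 7             # n Bernoulli trials, k successes observed
theta = np.linspace(1e-4, 1 - 1e-4, 2001)
dtheta = theta[1] - theta[0]

unnormalized = stats.beta.pdf(theta, r, s) * theta**k * (1 - theta)**(n - k)
posterior = unnormalized / (unnormalized.sum() * dtheta)

closed_form = stats.beta.pdf(theta, r + k, s + n - k)
print(np.max(np.abs(posterior - closed_form)))   # small when the grid is fine
```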
are independent standard exponential.... Show that a conjugate family of distributions for the Poisson family is the gamma family. Justify the following approximation to the posterior distribution where q. . then the posterior distribution of D given X = k is that of k + Z where Z has a B(N . 10. X n+ k ) be a sample from a population with density f(x I 0). ZI. 7. } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is .. is the standard normal distribution function and n _ T 2 x + n+r+s' a J."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl.. Let (XI. a regular model and integrable as a function of O. .Section 1. .. .• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e. .1= n+r+s _ ..7 Problems . .. Xn) = Xl = m for all 1 as n + 00 whatever be a. .and Complements 73 wherern = max(xl. 6. where VI. Next use the central limit theorem and Slutsky's theorem.• X n = Xn. b) denote the posterior distribution.) = n+r+s Hint: Let!3( a. then !3( a.1..b> 1." . (b) Suppose that max(xl.. Va' W t .0 E Let 9 have prior density 1r.. (a) Show that the family of priors E. .. D = NO has a B(N. Show rigorously using (1. 11'0) distribution. b) is the distribution of (aV loW)[1 + (aV IbW)]t.c(b. Suppose Xl. . S. XI = XI. I Xn+k) given .2."'tF·t'. Show in Example 1.2. Interpret this result.. .1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •.1.n. f3(r.8) that if in Example 1. X n = Xn is that of (Y. Show that ?T( m I Xl..J:n). .f) = [~.2.... Show that the conditional distribution of (6.Xn is a sample with Xi '"'' p(x I (}). n. . where {i E A and N E {l. 11'0) distribution.. Xn) + 5. W. .'" .. s). X n+ Il ..··.2..(1. where E~ 1 Xi = k. f(x I f). Assume that A = {x : p( x I 8) > O} does not involve O.1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta. In Example 1. If a and b are integers. 9.
X n .)T. 9= (OJ... Suppose p(x I 0) is the density of i. . . Find the posterior distribution 7f(() I x) and show that if>' is an integer. unconditionally. (b) Discuss the behavior of the two predictive distributions as 15. 0 > O.i. Q'j > 0. . compute the predictive and n + 00.\ > 0. The posterior predictive distribution is the conditional distribution of X n +l given Xl... 14.. Letp(x I OJ =exp{(xO)}. T6) densities.J u. X and 5' are independent with X ~ N(I"./If (Ito. . x> 0. 13. N. po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { . posterior predictive distribution.d. . The Dirichlet distribution is a conjugate prior for the multinomial. Goals.O. Q'r)T. (.. . j=1 '\' • = 1. N(I".. and Performance Criteria Chapter 1 + nand ({. (a) If f and 7t are the N(O. Next use Bayes rule..~l .. The Dirichlet distribution. (N. .") ~ . 8 2 ) is such that vn(p. . 1 X n . .. X n are i.. = 1..Xn. and () = a.~vO}. "5) and N(Oo. where Xi known. (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8..~X) '" tnI. j=1 • . (12). 01 )=1 j=l Uj < 1. given (x.Xn + l arei. 0 > O. 11. .d.) pix I 0) ~ ({" . L. .6 ""' 1r."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x.. Xl .. LO. 0<0. Note that. Let N = (N l .. x n ). < 1. X 1.) u/' 0 < cx) L". In a Bayesian model whereX l . Show that if Xl.. ..2) and we formally put 7t(I".i. . given x.d. 'V .O the posterior density 7t( 0 Ix).. . a = (a1. (c) Find the posterior distribution of 0". 0> 0 a otherwise.. . . I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}. . .. D(a). . O(t + v) has a X~+n distribution. I(x I (J).J t • I X.2 is (called) the precision of the distribution of Xi. Here HinJ: Given I" and "..) be multinomial M(n.1 < j < T.74 where N' ~ N Statistical Models. 52) of J1..i. Find I 12. the predictive distribution is the marginal distribution of X n + l . . 82 I fL. v > 0. . •• • . then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I.9). <0<x and let7t(O) = 2exp{20}. vB has a XX distribution. i).. has density r("" a) IT • fa(u) ~ n' r(.
5. Find the Bayes rule for case 2.5 and (ii) l' = 0. O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x. . 159 of Table 1. 159 for the preceding case (a). (c) Find the minimax rule among bt. I. (J) given by O\x a (I . respectively and suppose . 1957) °< °= a 0. a3.. 0. The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O.. (d) Suppose that 0 has prior 1C(OIl (a). (b)p=lq=. Problems for Section 1. (J2. n r ).3. See Example 1..3. a) is given by 0. Let the actions corresponding to deciding whether the loss function is given by (from Lehmann.q) and let when at. (b) Find the minimax rule among {b 1 ...1. .p) (I . Compute and plot the risk points (a)p=q= . a2.3. Suppose that in Example 1. then the posteriof7r(0 where n = (nt l · · · . = 11'. or 0 > a be penoted by I.3.1..1. 8 = 0 or 8 > 0 for some parameter 8.. 1C(O. I b"9 }. Suppose the possible states of nature are (Jl.1.) = 0. Find the Bayes rule when (i) 3. (c) Find the minimax rule among the randomized rules.Section 1. .3 IN ~ n) is V( a + n).3. = 0. . .5.5. . (d) Suppose 0 has prior 1C(01) ~ 1'. the possible actions are al.159 be the decision rules of Table 1.7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a). a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St.3. 1. and the loss function l((J. .. 1C(02) l' = 0.
.d.(X) 0 b b+c 0 c b+c b 0 where b and c are positive. Goals. < 00.0)). Show that the strata sample sizes that minimize MSE(!i2) are given by (1..=l iii = N. Within the jth stratum we have a sample of i. Stratified sampling. .0)) +b<f>( y'n(s . (a) Compute the biases.7.i.Xnjj. 0 be chosen to make iiI unbiased? I (b) Neyman allocation. How should nj.g. I . 1 < j < S. Weassurne that the s samples from different strata are independent.j = 1. random variables Xlj"". b<f>(y'ns) + b<I>(y'nr). (.) c<f>( y'n(r . 1) distribution function. (b) Plot the risk function when b = c = 1. and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J".. . 1 (l)r=s=l. are known.8. 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s. and <I> is the N(O. Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0.s..) r=2s=1. Suppose X is aN(B. and MSEs of iiI and 'j1z. J". n = 1 and .I L LXij. I I I.j = I. < J' < s.andastratumsamplemeanXj. are known (estimates will be used in a latet chapter). .3) . i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4. geographic locations or age groups). variances. 11 . iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj. E. Assume that 0 < a. . 1 < j < S. c<I>(y'n(sO))+b<I>(y'n(rO)).. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e.76 StatistiCCII Models... We want to estimate the mean J1. 0<0 0=0 0 >0 R(O. where <f> = 1 <I>.
. (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0.. show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift). .7 Problems and Complements 77 Hint: You may use a Lagrange multiplier. Hint: By Problem 1..0 ~ + 2'. . .7. and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii). .. p = . (iii) F is normal.'.b.. (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X).2p.3. 5..1.3. respectively. P(X ~ b) = 1 . . Let XI.2.2. N(o. Suppose that n is odd. < ° P < 1.d. where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~. (d) Find EIX .15..3.45. I)..13. .I 2:. X n be a sample from a population with values 0.i. Hint: Use Problem 1.0 + '.. Hint: See Problem B. > k) (ii) F is uniform. Use a numerical integration package.) with nk given by (1. n ~ 1. 75. ~ a) = P(X = c) = P.=1 Pj(O'j where a.2.4.2. Also find EIX .=1 pp7J" aV..3) is N.bl for the situation in (i). Let X and X denote the sample mean and median. .. Suppose that X I. We want to estimate "the" median lJ of F.5.Section 1. U(O. . Next note that the distribution of X involves Bernoulli and multinomial trials. ° = 15. (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i. l X n .... that X is the median of the sample. Each value has probability . ~ ~  ~ 6.0.2'. 1). and = 1. ali X '"'' F.b.9. (b) Evaluate RR when n = 1.0 .5(0 + I) and S ~ B(n. X n are i.5. b = '.20. plot RR for p = . [f the parameters of interest are the population mean and median of Xi .40. a = . and that n is odd. set () = 0 without loss of generality.bl when n = compare it to EIX . ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c. ~ ~ ~ . Hint: See Problem B. . ~ 7. The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = .25.5. > 0. ..bl.p)..= 2:.5. (c) Same as (b) except when n ~ 1.3. Let Xb and X b denote the sample mean and the sample median of the sample XI b.5.
suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i. I . defined by (J(6. . . e 8 ~ (.L)4 = 3(12. In Problem 1. Find MSE(6) and MSE(P).  . ..(1(6'. then expand (Xi .1'] .X)2 has a X~I distribution. for all 6' E 8 1. . ~ 2(n _ 1)1. then this definition coincides with the definition of an unbiased estimate of e. 9. i1 . = c L~ 1 (Xi . It is known that in the state where the company is located. Let Xl. and Performance Criteria Chapter 1 :! .3(a) with b = c = 1 and n = I. . . then a test function is unbiased in this sense if. (i) Show that MSE(S2) (ii) Let 0'6 c~ .3) that .X)2 is an unbiased estimator of u 2 • Hint: Write (Xi .4.[X .2 ~~ I (Xi . satisfies > sup{{J(6.a)2. (b) Suppose Xi _ N(I'. being lefthanded). 8. Which one of the rules is the better one from the Bayes point of view? 11.. the powerfunction. 6' E e. A person in charge of ordering equipment needs to estimate and uses I . If the true 6 is 60 .0): 6 E eo}. 0) (J(6'.) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B. 0 < a 2 < 00. Show that the value of c that minimizes M S E(c/.J. Goals. ' .3. I (b) Show that if we use the 0 1 loss function in testing.g...1)1 E~ 1 (Xi . Let () denote the proportion of people working in a company who have a certain characteristic (e. . I I 78 Statistical Models. I . You may use the fact that E(X i .Xn be a sample from a population with variance a 2 . II.0(X))) for all 6.').0) = E. 10.. (a) Show that if 6 is real and 1(6.X)2 keeping the square brackets intact.10) + (. (a) Show that s:? = (n ..X)2.8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company. a) = (6 .3. 10% have the characteristic.(1(6.. and only if.1'])2. A decision rule 15 is said to be unbiased if E.for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100.X)' = ([Xi .0(X))) < E. (a)r = 8 =1 (b)r= ~s =1.2)(.(o(X)).
Show that J agrees with rp in the sense that.7 Problems and Complements 79 12.Section 1. (b) find the minimum relative risk of P. Ok) as a function of B. but if 0 < <p(x) < 1.~ is unbiased 13. b.3. then.. s = r = I. Show that if J 1 and J 2 are two randomized. B = . Consider the following randomized test J: Observe U. and k o ~ 3. In Example 1.4. If U = u.3. 15.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known. given 0 < a < 1. Po[o(X) ~ 11 ~ 1 . 16. Suppose that the set of decision procedures is finite. show that if c ::.) + (1 . Furthersuppose that l(Bo. there is a randomized procedure 03 such that R(B. Show that the procedure O(X) au is admissible." (a) Show that the risk is given by (1. 0.1) and let Ok be the decision rule "reject the shipment iff X > k.3.) for all B.4..1.' and 0 = II' . (a) find the value of Wo that minimizes M SE(Mw). 14. wc dccide El. procedures. and possible actions at and a2.1.1'0 I· 17.7).ao) ~ O. and then JT. we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise. In Example 1. 18. (c) Same as (b) except k = 2. Consider a decision problem with the possible states of nature 81 and 82 . (h) If N ~ 10. . In Problem 1. 0 < u < 1. . o Suppose that U . For Example 1. Define the nonrandomized test Ju . A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1. plot R(O.Polo(X) ~ 01 = Eo(<p(X))..wo to X. 0. z > 0. a) is . If X ~ x and <p(x) = 0 we decide 90.. (B) = 0 for some event B implies that Po (B) = 0 for all B E El. use the test J u . by 1 if<p(X) if <p(X) >u < u. Suppose that Po.UfO. 03) = aR(B. if <p(x) = 1. Convexity ofthe risk set. 1) and is independent of X.a )R(B. Suppose the loss function feB. consider the estimator < MSE(X). Compare 02 and 03' 19. consider the loss function (1. Your answer should Mw If n.1.3.3. find the set of I' where MSE(ji) depend on n.3. The interpretation of <p is the following.
Let X be a random variable with probability function p(x ! B) O\x 0. 2. (b) Give and plot the risk set S.1. 5. . its MSPE.9.) the Bayes decision rule? Problems for Section 1. In Example 1. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn.80 Statistical Models. 0 0. Goals. 0. 7. Give the minimax rule among the nonrandomized decision rules. 3. 6. (b) Compute the MSPEs of the predictors in (a). In Problem B. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y. !. and the ratio of its MSPE to that of the best and best linear predictors.(0.4 I 0.(0. Give the minimax rule among the randomized decision rules. Let U I . al a2 0 3 2 I 0. U2 be independent standard normal random variables and set Z Y = U I..) = 0.. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly. Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error. (c) Suppose 0 has the prior distribution defined by . An urn contains four red and four black balls. What is 1.Is Z of any value in predicting Y? = Ur + U?. Give an example in which Z can be used to predict Y perfectly.8 0. !. Four balls are drawn at random without replacement. (a) Find the best predictor of Y given Z.. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z).2 0. and Performance Criteria Chapter 1 B\a 0. the best linear predictor.4 = 0. 4.4. and the best zero intercept linear predictor.7 find the best predictors ofY given X and of X given Y and calculate their MSPEs.l.1 calculate explicitly the best zero intercept linear predictor.6 (a) Compute and plot the risk points of the nonrandomized decision rules. The midpoint of the interval of such c is called the conventionally defined median or simply just the median.
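Several of the Section 1.4 problems compare the best predictor E(Y | Z) with the best linear predictor. The following simulation sketch (a made-up model chosen so that the dependence of Y on Z is purely nonlinear; it is not taken from any specific problem) illustrates how large the gap in MSPE can be.

```python
import numpy as np

# Simulation sketch comparing the best predictor E(Y | Z) with the best linear
# predictor in an illustrative model where the relation is purely nonlinear.
rng = np.random.default_rng(1)
n = 200_000
Z = rng.normal(size=n)
Y = Z**2 + rng.normal(scale=0.5, size=n)

# Best predictor: mu(Z) = E(Y | Z) = Z**2 in this model.
mspe_best = np.mean((Y - Z**2)**2)

# Best linear predictor: a + b*Z with b = Cov(Z, Y)/Var(Z), a = E(Y) - b*E(Z).
b = np.cov(Z, Y)[0, 1] / np.var(Z)
a = Y.mean() - b * Z.mean()
mspe_linear = np.mean((Y - (a + b * Z))**2)

print(mspe_best, mspe_linear)   # the linear predictor does much worse here
```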
° 14. p) distribution and Z". z > 0. Y'. Suppose that Z has a density p.2 ) distrihution. If Y and Z are any two random variables. 9. that is. Let ZI. p( z) is nonincreasing for z > c. and otherwise. which is symmetric about c. Y') have a N(p" J. Suppose that Z. Y)I < Ipl. Suppose that Z has a density p. Show that c is a median of Z.7 Problems and Complements 81 Hint: If c ElY . Y" are N(v. 12. Let Zl and Zz be independent and have exponential distributions with density Ae\z. Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» .L. Y" be the corresponding genetic and environmental components Z = ZJ + Z". Z".cl) = oQ[lc .  = ElY cl + (c  co){P[Y > co] .t.Section 1. + z) = p(c . (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13. 7 2 ) variables independent of each other and of (ZI.z) for 11. (b) Suppose (Z. 0.1"1/0] where Q(t) = 2['I'(t) + t<I>(t)].YI > s. Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) . a 2. exhibit a best predictor of Y given Z for mean absolute prediction error. Show that if (Z. Yare measurements on such a variable for a randomly selected father and son. 10. Define Z = Z2 and Y = Zl + Z l Z2.col < eo. (a) Show that E(IY . Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error. y l ). Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor. (a) Show that P[lZ  tl < s] is maximized as a function oftfor each s > ° by t = c. (a) Show that the relation between Z and Y is weaker than that between Z' and y J . Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables. that is.el) as a function of c.PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8. Y = Y' + Y". which is symmetric about c and which is unimodal. Let Y have a N(I". ICor(Z. a 2. Y) has a bivariate normal distribution. (b) Show directly that I" minimizes E(IY . p( c all z. where (ZI .
82 Statisflcal Models. a blood cell or viral load measurement) is obtained. IS.S from infection until detection. > 0. and Performance Criteria Chapter 1 . 1995. and is diagnosed with a certain disease.4. 1905. ~~y = 1 . >. 2 'T ' . g(Z)) 9 where g(Z) stands for any predictor. Predicting the past from the present.) ~ 1/>. mvanant un d ersuchh. Consider a subject who walks into a clinic today. . At the same time t a diagnostic indicator Zo of the severity of the disease (e.3.) ~ 1/>''.15. y> O./L)/<J and Y ~ /3Yo/<J./LdZ) be the linear prediction error. 1). and Doksurn and Samarov. We are interested in the time Yo = t . Show that. Show that p~y linear predictors. and Var( Z. ..4. at time t.4. .) = E( Z. (c) Let '£ = Y . It will be convenient to rescale the problem by introducing Z = (Zo . that is. Hint: Recall that E( Z. hat1s.exp{ >. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z.g(Z)) where £ is the set of 17./L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y). Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo./L£(Z)) ~ maxgEL Corr'(Y. IS. .y}. = Corr'(Y..I • ! . I. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y. Show that Var(/L(Z))/Var(Y) = Corr'(Y. then'fJh(Z)Y =1JZY.ElY . (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y. ./L(Z)) = max Corr'(Y.1] . Let S be the unknown date in the past when the sUbject was infected. €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18. . Yo > O. (See Pearson.4.) (a) Show that 1]~y > piy.g. Goals. on estimation of 1]~y. and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease. 16.: (0 The best linear MSPE predictor of Y based on Z = z. Hint: See Problem 1. Let /L(z) = E(Y I Z ~ z). .) = Var( Z.(y) = >. where piy is the population multiple correlation coefficient of Remark 1. in the linear model of Remark 1. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1. j3 > 0.
s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y). solving for (a. b) equal to zero. see Berman. wbere Y j . (b) Show that (a) is equivalent to (1. s(Y)] < 00.~ [y  (z . N(z . s(Y» given Z = z. and checking convexity. 2000).. (a) Show that ifCov[r(Y).z). and Nonnand and Doksum. (In practice. c.. 1990. .z) . I Z]' Z}. (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo.(>' .7 and 1. 6. Y2) using (a). Let Y be the number of heads showing when X fair coins are tossed.4.I exp { .>. including the "prior" 71". need to be estimated from cohort studies. all the unknowns..4. Hint: Use Bayes rule. Establish 1. (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>.. (c) The optimal predictor of Y given X = x. .4. Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2.>')1 2 } Y> ° where c = 4>(z . b). (b) The MSPE of the optimal predictor of Y based on X. (c) Show that if Z is real. 20.14 by setting the derivatives of R(a. then = E{Cov[r(Y). This density is called the truncated (at zero) normal.). Find (a) The mean and variance of Y. density.Section 1.9. s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}. 21.s(Y)] Covlr(Y). where X is the number of spots showing when a fair die is rolled. + b2 Z 2 + W. Z2 and W are independent with finite variances.g(Zo)l· Hint: See Problems 1. 19. (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo . Let Y be a vector and let r(Y) and s(Y) be real valued. Cov[r(Y).>.7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density . Find Corr(Y}. where Z}. x = 1. 7r(Y I z) ~ (27r) .4.6) when r = s.. 1). . Write Cov[r(Y).
z) = I. Problems for Section 1. 0 > 0.x > 0. Find the 22. Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z).15) yields (1. 2.') and W ~ N(po. z) = z(1 .. suppose that Zl and Z.5 1. are N(p. z) and c is the constant that makes Po a density.Xn is a sample from a population with one of the following densities.z).z)lw(y. Find Po(Z) when (i) w(y. 0 > 0..g(z») is called weighted squared prediction error. (a) p(x. 23.3.e' < (Y . we say that there is a 50% overlap between Y 1 and Yz. Oax"l exp( Ox").2eY + e' < 2(Y' + e'). 00 (b) Suppose that given Z = z. (a) Let w(y. and = 0 otherwise. This is thebeta. 1.. 25. Then [y . This is known as the Weibull density. < 00 for all e. .4. population where e > O. . Show that EY' < 00 if and only if E(Y . a > O. 0 < z < I. 3. 24. (c) p(x. n > 2. 0) = Ox'. \ (b) Establish the same result using the factorization theorem.14).6(O. . 0) = < X < 1. Zz and ~V have the same variance (T2.e)' Hint: Whatever be Y and c.. > a. p(e). .84 Statistical Models. optimal predictor of Yz given (Yi. 1 2 y' . Let Xi = 1 if the ith item drawn is bad. (a) Show directly that 1 z:::: 1 Xi is sufficient for 8. 0) = Oa' Ix('H). and suppose that Z has the beta. and (ii) w(y. Y ~ B(n. z) ~ ep(y. 1). Verify that solving (1. .e)' = Y' . . where Po(Y.0> O. z "5).z) be a positive realvalued function.6(r. if b1 = b2 and Zl. 1 X n be a sample from a Poisson. Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad.density. show that the MSPE of the optimal predictor is . . density. Let Xl..p~y). In Example 1. Zz). a > O.4.z). In this case what is Corr(Y. .g(z)]'lw(y.. x  .~(1 . Goals. s).l . Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem. g(Z)) < for some 9 and that Po is a density.4.z) = 6w (y. Assume that Eow (Y. 0 (b) p(x. Suppose Xl. and Performance Criteria Chapter 1 (e) In the preceding model (d). Y z )? (f) In model (d).
This is known as the Pareto density. In each case, find a real-valued sufficient statistic for θ, a fixed.

4. (a) Show that T₁ and T₂ are equivalent statistics if, and only if, we can write T₂ = H(T₁) for some 1-1 transformation H of the range of T₁ into the range of T₂. Which of the following statistics are equivalent? (Prove or disprove.)

(b) ∏_{i=1}^n X_i and Σ_{i=1}^n log X_i,  X_i > 0

(c) Σ_{i=1}^n X_i and Σ_{i=1}^n log X_i,  X_i > 0

(d) (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) and (Σ_{i=1}^n X_i, Σ_{i=1}^n (X_i − X̄)²)

(e) (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i³) and (Σ_{i=1}^n X_i, Σ_{i=1}^n (X_i − X̄)³).
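As a numerical aside on part (d), the two pairs of statistics determine each other through the identity Σ(Xᵢ − X̄)² = ΣXᵢ² − (ΣXᵢ)²/n; the short check below (arbitrary simulated data, purely illustrative) confirms this.

```python
import numpy as np

# Illustration for part (d): (sum X_i, sum X_i^2) and (sum X_i, sum (X_i - Xbar)^2)
# determine each other via  sum (X_i - Xbar)^2 = sum X_i^2 - (sum X_i)^2 / n.
rng = np.random.default_rng(2)
x = rng.normal(size=15)
n = len(x)
s1, s2 = x.sum(), (x**2).sum()
centered = ((x - x.mean())**2).sum()
assert np.isclose(centered, s2 - s1**2 / n)
print(s1, s2, centered)
```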
5. Let θ = (θ₁, θ₂) be a bivariate parameter. Suppose that T₁(X) is sufficient for θ₁ whenever θ₂ is fixed and known, whereas T₂(X) is sufficient for θ₂ whenever θ₁ is fixed and known. Assume that θ₁, θ₂ vary independently, θ₁ ∈ Θ₁, θ₂ ∈ Θ₂, and that the set S = {x : p(x, θ) > 0} does not depend on θ.

(a) Show that if T₁ and T₂ do not depend on θ₂ and θ₁, respectively, then (T₁(X), T₂(X)) is sufficient for θ.

(b) Exhibit an example in which (T₁(X), T₂(X)) is sufficient for θ, T₁(X) is sufficient for θ₁ whenever θ₂ is fixed and known, but T₂(X) is not sufficient for θ₂ when θ₁ is fixed and known.

6. Let X take on the specified values v₁, …, v_k with probabilities θ₁, …, θ_k, respectively. Suppose that X₁, …, Xₙ are independently and identically distributed as X. Suppose that θ = (θ₁, …, θ_k) is unknown and may range over the set Θ = {(θ₁, …, θ_k) : θᵢ > 0, 1 ≤ i ≤ k, Σ_{i=1}^k θᵢ = 1}. Let N_j be the number of Xᵢ which equal v_j.

(a) What is the distribution of (N₁, …, N_k)?

(b) Show that N = (N₁, …, N_{k−1}) is sufficient for θ.

7. Let X₁, …, Xₙ be a sample from a population with density p(x, θ) given by

p(x, θ) = (1/σ) exp{ −(x − μ)/σ } if x ≥ μ, and p(x, θ) = 0 otherwise.

Here θ = (μ, σ) with −∞ < μ < ∞, σ > 0.

(a) Show that min(X₁, …, Xₙ) is sufficient for μ when σ is fixed.

(b) Find a one-dimensional sufficient statistic for σ when μ is fixed.

(c) Exhibit a two-dimensional sufficient statistic for θ.

8. Let X₁, …, Xₙ be a sample from some continuous distribution F with density f, which is unknown. Treating f as a parameter, show that the order statistics X₍₁₎, …, X₍ₙ₎ (cf. Problem B.2.8) are sufficient for f.
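The following sketch (illustrative; the exponential sample is arbitrary) relates to Problems 8 and 13: it forms the order statistics and uses them to build the empirical distribution function, the natural "estimate of shape" when f is completely unknown.

```python
import numpy as np

# Sketch for Problems 8 and 13: the order statistics carry all the information in
# the sample when f is unknown, and the empirical distribution function built from
# them estimates F. The data are simulated here purely for illustration.
rng = np.random.default_rng(3)
x = np.sort(rng.standard_exponential(50))      # order statistics X_(1) <= ... <= X_(n)

def F_hat(t, order_stats=x):
    """Empirical distribution function: fraction of observations <= t."""
    return np.searchsorted(order_stats, t, side="right") / len(order_stats)

print(F_hat(1.0), 1 - np.exp(-1.0))            # compare with the true F at t = 1
```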
,
"
I
!
9. Let Xl, ... ,Xn be a sample from a population with density
j,(x)
a(O)h(x) if 0, < x < 0,
o othetwise
where h(x)
> 0,0= (0,,0,)
with
00
< 0, < 0, <
00,
and a(O)
=
[J:"
h(X)dXr'
is assumed to exist. Find a twodimensional sufficient statistic for this problem and apply your result to the U[()l, ()2] family of distributions. 10. Suppose Xl" .. , X n are U.d. with density I(x, 8) = ~elx61. Show that (X{I),"" X(n», the order statistics, are minimal sufficient. Hint: t,Lx(O) =  E~ ,sgn(Xi  0), 0 't {X"" . , X n }, which determines X(I),
. " , X(n)'
11. Let X 1 ,X2, ... ,Xn be a sample from the unifonn, U(O,B). distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for O.
12. Dynkin, Lehmann, Scheffe's Theorem. Let P = {Po : () E e} where Po is discrete concentrated on X = {x" x," .. }. Let p(x, 0) p.[X = xl Lx(O) > on Show that f:xx(~~) is minimial sufficient. Hint: Apply the factorization theorem.
=
=
°
x,
13. Suppose that X = (XlI" _, X n ) is a sample from a population with continuous distribution function F(x). If F(x) is N(j1., ,,'), T(X) = (X, ,,'). where,,2 = n l E(Xi 1')2, is sufficient, and S(X) ~ (XCI)"" ,Xin»' where XCi) = (X(i)  1')/'" is "irrelevant" (ancillary) for (IL, a 2 ). However, S(X) is exactly what is needed to estimate the "shape" of F(x) when F(x) is unknown. The shape of F is represented hy the equivalence class F = {F((·  a)/b) : b > 0, a E R}. Thus a distribution G has the same shape as F iff G E F. For instance, one "estimator" of this shape is the scaled empirical distribution function F,(x) jln, x(j) < x < x(i+1)' j = 1, . .. ,nl
~
0, x
< XCI)
> x(n)
1, x
~
Show that for fixed x, F,((x  x)/,,) converges in prohahility to F(x). Here we are using F to represent :F because every member of:F can be obtained from F.
I
I ,
'I i,
14. Kolmogorov's Theorem. We are given a regular model with e finite.
(a) Suppose that a statistic T(X) has the property that for any prior distribution on 9, the posterior distrihution of 9 depends on x only through T(x). Show that T(X) is sufficient.
(b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the posterior distribution depends on x only through T(x).
Hint: Apply the factorization theorem.
15. Let X₁, …, Xₙ be a sample from f(x − θ), θ ∈ R. Show that the order statistics are minimal sufficient when f is the Cauchy density f(t) = 1/[π(1 + t²)].
(i) p, TI,
0", T
are arbitrary:
00
< p, TI < 00, a <
(J,
T.
(ii)
(J
=T
= TJ
and p, TI, (7 are arbitrary. and p,
0", T
(iii) p
are arbitrary.
17. In Example 1.5.4. express tl as a function of Lx(O, 1) and Lx(l, I). Problems to Sectinn 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose X I, ... , X n is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of X fonns a oneparameter exponential family. Identify 'TI, B, T, and h. 3. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability nf success O. Then P, IX = k] = (I  0)'0, k ~ 0 0 1,2, ... This is called thc geometric distribution (9 (0». (a) Show that the family of geometric distributions is a oneparameter exponential family with T(x) ~ x. (b) Deduce from Theorem 1.6.1 that if X lo '" oXn is a sample from 9(0), then the distributions of L~ 1 Xi fonn a oneparameter exponential family. (c) Show that E~
1
Xi in part (b) has a negative binomial distribution with parameters
(noO)definedbyP,[L:71Xi = kJ =
(n+~I
)
(10)'on,k~0,1,2o'"
(The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success 0.) Hint: By Theorem 1.6.1, P,[L:7 1 Xi = kJ = c.(1  o)'on. 0 < 0 < 1. If
=
' " CkW'
I
= c(;'',)::, 0 lw n
k=O
L..J
< W < I, then
4. Which of the following families of distributions are exponential families? (Prove or
disprove.) (a) The U(O, 0) fumily
88
(b) p(.", 0)
Statistical Models, Goals, and Performance Criteria
Chapter 1
(c)p(x,O)
= {exp[2Iog0+log(2x)]}I[XE = ~,xE {O.I +0, ... ,0.9+0j = 2(x +
0)/(1 + 20), 0
(0,0)1
(d) The N(O, 02 ) family, 0 > 0
(e)p(x,O)
< x < 1,0> 0
(f) p(x,9) is the conditional frequency function of a binomial, B(n,O), variable X, given that X > O.
5. Show that the following families of distributions are twoparameter exponential families and identify the functions 1], B, T, and h. (a) The beta family. (b) The gamma family. 6. Let X have the Dirichlet distribution, D( a), of Problem 1.2.15. Show the distribution of X form an rparameter exponential family and identify fJl B, T, and h.
7. Let X = ((XI, Y I ), ... , (X no Y n » be a sample from a bIvariate nonnal population.
Show that the distributions of X form a fiveparameter exponential family and identify 'TJ, B, T, and h.
8. Show that the family of distributions of Example 1.5.3 is not a one parameter eX(Xloential family. Hint: If it were. there would be a set A such that p(x, 0) > on A for all O.
°
9. Prove the analogue of Theorem 1.6.1 for discrete kparameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (XI, X 2) is a sample of size 2 from f(·, 0), then XI + X2 is sufficient for B. Show that f(·, B) corresponds to a onearameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x" X2) such that log f(x" 0) + log f(X2, 0) = g(xI + X2, 0) + h(XI, X2). Fix 00 and let r(x, 0) = log f(x, 0)  log f(x, 00), q(x, 0) = g(x,O)  g(x,Oo). Then, q(xI + X2,0) = r(xI,O) +r(x2,0), and hence, [r(x" 0) r(O, 0)1 + [r(x2, 0)  r(O, 0») = r(xi + X2, 0)  r(O, 0). 11. Use Theorems 1.6.2 and 1.6.3 to obtain momentgenerating functions for the sufficient statistics when sampling from the following distributions. (a) normal, () ~ (ll,a 2 )
(b) gamma. r(p, >.), 0
= >., p fixed
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3)
(0 gamma. r(p, >'). ()
= (p, >.).
 
,

Section 1. 7
Problems and Complements
89
12. Show directly using the definition of the rank of an ex}X)nential family that the multinomialdistribution,M(n;OI, ... ,Ok),O < OJ < 1,1 <j < k,I:~oIOj = 1, is of rank k1. 13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any (k ~ I)dimensional hyperplane. 14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of &. Show that if k = 1 and &0 oJ 0 and A, A are defined on all of &, then Theorem 1.6.3 continues to hold. 15. Let P = {P. : 0 E e} where p. is discrete and concentrated on X = {x" X2, ... }, and let p( x, 0) = p. IX = x I. Show that if P is a (discrete) canonical ex ponential family generated bi, (T, h) and &0 oJ 0, then T is minimal sufficient. Hint: ~;j'Lx('l) = Tj(X)  E'lTj(X). Use Problem 1.5.12.
16. Life testing. Let Xl,.'" X n be independently distributed with exponential density (20)l e x/2. for x > 0, and let the ordered X's be denoted by Y, < Y2 < '" < YnIt is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r aparticles have been emitted. Show that
(i) The joint distribution of Y1 , •.. , Yr is an exponential family with density
n! [ (20), (n _ r)! exp (ii) The distribution of II:: I Y;
(iii) Let
1
I::l Yi + (n 20
r)Yr]
' 0  Y,  ...  Yr·
<
<
<
+ (n 
r)Yrl/O is X2 with 2r degrees of freedom.
denote the time required until the first, second,... event occurs in a Poisson process with parameter 1/20' (see A.I6). Then Z, = YI/O', Z2 = (Y2 Yr)/O', Z3 = (Y3  Y 2)/0', ... are independently distributed as X2 with 2 degrees of freedom, and the joint density of Y1 , ••. , Yr is an exponential family with density
Yi, Yz , ...
The distribution of Yr/B' is again XZ with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burnedout tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
I ,
90
Statistical Models, Goals, and Performance Criteria Chapter 1
1)(Y; l~~l)/e (I = 1", .. ,') are independently distributed as X2 with 2 degrees of freedom, and [L~ 1 Yi + (n  7")Yr]/B = [(ii): The random variables Zi ~ (n  i
+
L::~l Z,.l
17. Suppose that (TkXl' h) generate a canonical exponential family P with parameter k 1Jkxl and E = R . Let
(a) Show that Q is the exponential family generated by IlL T and h exp{ cTT}. where IlL is the projection matrix of Tonto L = {'I : 'I = BO + c). (b) Show that ifP has full rank k and B is of rank I, then Q has full rank l. Hint: If B is of rank I, you may assume
18. Suppose Y1, ... 1 Y n are independent with Yi '" N(131 + {32Zi, (12), where Zl,'" , Zn are covariate values not all equaL (See Example 1.6.6.) Show that the family has rank 3.
Give the mean vector and the variance matrix of T.
19. Logistic Regression. We observe (Zll Y1 ), ... , (zn, Y n ) where the Y1 , .. _ , Y n are independent, Yi "' B(TIi, Ad The success probability Ai depends on the characteristics Zi of the ith subject, for example, on the covariate vector Zi = (age, height, blood pressure)T. The function I(u) ~ log[u/(l  u)] is called the logil function. In the logistic linear re(3 where (3 = ((31, ... ,/3d ) T and Zi is d x 1. gression model it is assumed that I (Ai) = Show that Y = (Y1 , ... , yn)T follow an exponential model with rank d iff Zl, ... , Zd are
zT
not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). 20. (a) In part IT of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by ('11 'Io)TT and that ~(ii) =~(i). (b) Fill in the details of part III of the proof of Theorem 1.6.4. 21. Find JJ.('I) ~ EryT(X) for the gamma,
qa, A), distribution, where e = (a, A).
I
22. Let X I, . _ . ,Xn be a sample from the k·parameter exponential family distribution (1.6.10). Let T = (L:~ 1 1 (Xi ), ... , L:~ 1Tk(X,») and let T
I
S
~
((ryl(O), ... ,ryk(O»): e E 8).
Show that if S contains a subset of k + 1 vectors Vo, .. _, Vk+l so that Vi  Vo, 1 < i are not collinear (linearly independent), then T is minimally sufficient for 8.
< k.
I .' jl,
"
23. Using (1.6.20). find a conjugate family of distributions for the gamma and beta families. (a) With one parameter fixed. (b) With both parameters free.
:
I
Section 1.7
Problems and Complements
91
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter 0 = (O!, O ) where O! = E,(X), 0, ~ l/(Var oX) (cf. Problem 1.2.12). 2 25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with (72 known. Find a conjugate family of prior distributions for (131,132) T. 26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15. 27. Let P denote the canonical exponential family genrated by T and h. For any TJo E £, set ho(x) = q(x, '10) where q is given by (1.6.9). Show that P is also the canonical exponential family generated by T and h o.
28. Exponential/amities are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by h(f)
~ E(logf(X)) =
l:IIOgf(X)I!(X)dx.
This quantity arises naturally in infonnation in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S ~ {x: f(x) > OJ. (a) Show that the canonical kparameter exponential family density
f(x, 'I)
= exp
• ryjrj(x) 1/0 + I:
j:=1
A('I)
, XES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is
f(x)dx
~ 1,
Is
f(x)rj(x)
~ aj,
1 < j < k,
where '17o, .•.• '17k are chosen so that f satisfies the constraints. Hint: You may usc Lagrange multipliers. Maximize the integrand. (b) Find the maximum entropy densities when rj(x) = x j and (i) S ~ (0,00), k = 1, at > 0; (ii) S = R, k = 2, at E R, a, > 0; (iii) S = R, k = 3, a) E R, a, > 0, a3 E R. 29. As in Example 1.6.11, suppose that Y 1, ...• Y n are Li.d. Np(f.L. E) where f.L varies freely in RP and E ranges freely over the class of all p x p symmetric positive definite matrices. Show that the distribution of Y = (Y ... , Yn ) is the p(p + 3)/2 canonical " exponential family generated by h = 1 and the p(p + 3)/2 statistics
n n
Tj
=
LYii>
i=l
1 <j <Pi
Tjl =
LJ'ijJ'iI.
i=l
1 <j< l<p
where Y i = (Yi!, ... , Yip). Show that <: is open and that this family is of rank pcp + 3)/2. Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = pcp + 3)/2 statistics Tj(Y) ~ Yj, 1 < j < p, and Tj,(Y) = YjYi, 1 <j < I < p,
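For small p the rank claim is also easy to check empirically: the covariance matrix of the p(p + 3)/2 statistics should be nonsingular, so no nontrivial linear combination of them is degenerate. A minimal sketch for p = 2 with hypothetical simulated data (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(9)
p, n = 2, 50_000
Y = rng.normal(size=(n, p))                      # i.i.d. N_p(0, I) draws

# the m = p(p + 3)/2 = 5 statistics T_j and T_jl for p = 2
T = np.column_stack([Y[:, 0], Y[:, 1], Y[:, 0] ** 2, Y[:, 0] * Y[:, 1], Y[:, 1] ** 2])
cov_T = np.cov(T, rowvar=False)                  # sample covariance of the sufficient statistics
print(np.linalg.matrix_rank(cov_T))              # 5: full rank, as the problem asserts
```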
generate Np(μ, Σ). As Σ ranges over all p × p symmetric positive definite matrices, so does Σ^{-1}. Next establish that for symmetric matrices M,

∫ exp{−u^T M u} du < ∞ iff M is positive definite,

by using the spectral decomposition (see B.10.1.2)

M = Σ_{j=1}^p λj ej ej^T for e1, ..., ep orthogonal, λj ∈ R.

To show that the family has full rank m, use induction on p to show that if Z1, ..., Zp are i.i.d. N(0, 1) and if B_{p×p} = (bjl) is symmetric, then

P(Σ_{j=1}^p aj Zj + Σ_{j,l} bjl Zj Zl = c) = P(a^T Z + Z^T B Z = c) = 0

unless a = 0, B = 0, c = 0. Next recall (Appendix B.6) that since Y ~ Np(μ, Σ), then Y = SZ + μ for some nonsingular p × p matrix S.
30. Show that if X1, ..., Xn are i.i.d. Np(θ, Σ0) given θ where Σ0 is known, then the Np(λ, Γ) family is conjugate to Np(θ, Σ0), where λ varies freely in R^p and Γ ranges over all p × p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(μj, τj) : 1 ≤ j ≤ k} be a given collection of pairs with μj ∈ R, τj > 0. Let (μ, τ) be a random pair with λj = P((μ, τ) = (μj, τj)), 0 < λj < 1, Σ_{j=1}^k λj = 1. Let θ be a random variable whose conditional distribution given (μ, τ) = (μj, τj) is normal, N(μj, τj²). Consider the model X = θ + ε, where θ and ε are independent and ε ~ N(0, σ0²), σ0² known. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λj φ_{τj}(θ − μj)   (1.7.4)

where φ_τ denotes the N(0, τ²) density. Also note that (X | θ) has the N(θ, σ0²) distribution.

(a) Find the posterior π(θ | x) and write it in the form

π(θ | x) = Σ_{j=1}^k P((μ, τ) = (μj, τj) | x) π(θ | (μj, τj), x) = Σ_{j=1}^k λj(x) φ_{τj(x)}(θ − μj(x))
for appropriate λj(x), τj(x) and μj(x). This shows that (1.7.4) defines a conjugate prior for the N(θ, σ0²) distribution.

(b) Let Xi = θ + εi, 1 ≤ i ≤ n, where θ is as previously and ε1, ..., εn are i.i.d. N(0, σ0²). Find the posterior π(θ | x1, ..., xn), and show that it belongs to the class (1.7.4). Hint: Consider the sufficient statistic for p(x | θ).
32. A Hierarchical Binomial-Beta Model. Let {(rj, sj) : 1 ≤ j ≤ k} be a given collection of pairs with rj > 0, sj > 0, let (R, S) be a random pair with P(R = rj, S = sj) = λj, 0 < λj < 1, Σ_{j=1}^k λj = 1, and let θ be a random variable whose conditional density π(θ, r, s) given R = r, S = s is beta, β(r, s). Consider the model in which (X | θ) has the binomial, B(n, θ), distribution. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λj π(θ, rj, sj).   (1.7.5)

Find the posterior

π(θ | x) = Σ_{j=1}^k P(R = rj, S = sj | x) π(θ | (rj, sj), x)

and show that it can be written in the form Σ_j λj(x) π(θ, rj(x), sj(x)) for appropriate λj(x), rj(x) and sj(x). This shows that (1.7.5) defines a class of conjugate priors for the B(n, θ) distribution.
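The posterior asked for here is easy to compute numerically: the mixing weights update to λj(x) proportional to λj times the beta-binomial marginal of x under component j, and each beta component updates its parameters to (x + rj, n − x + sj). A minimal sketch for a hypothetical two-component prior (scipy assumed):

```python
import numpy as np
from scipy.special import betaln

# hypothetical mixture-of-betas prior and one binomial observation
lam = np.array([0.3, 0.7])
r = np.array([2.0, 8.0])
s = np.array([8.0, 2.0])
n, x = 20, 14

# log marginal of x under each beta component (the binomial coefficient cancels in the weights)
log_m = betaln(x + r, n - x + s) - betaln(r, s)
w = lam * np.exp(log_m - log_m.max())
lam_post = w / w.sum()                      # posterior mixing weights lambda_j(x)
r_post, s_post = x + r, n - x + s           # updated beta parameters r_j(x), s_j(x)
print(lam_post, r_post, s_post)
```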
33. Let p(x, η) be a one-parameter canonical exponential family generated by T(x) = x and h(x), x ∈ X ⊂ R, and let ψ(x) be a nonconstant, nondecreasing function. Show that E_η ψ(X) is strictly increasing in η.
Hint:

Cov_η(ψ(X), X) = (1/2) E{(X − X′)[ψ(X) − ψ(X′)]}

where X and X′ are independent, identically distributed as X (see A.11.12).
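The monotonicity asserted here is easy to see numerically in a concrete case such as the Poisson family, which is canonical with T(x) = x and η = log λ. A hypothetical check with a bounded nondecreasing ψ (numpy assumed):

```python
import numpy as np

def E_psi(eta, kmax=200):
    lam = np.exp(eta)                       # Poisson(lambda) in canonical form: eta = log(lambda)
    k = np.arange(kmax)
    log_fact = np.cumsum(np.concatenate(([0.0], np.log(np.arange(1, kmax)))))
    p = np.exp(k * np.log(lam) - lam - log_fact)   # Poisson probabilities (truncated at kmax)
    psi = np.minimum(k, 3)                  # a nonconstant, nondecreasing psi
    return np.sum(psi * p)

etas = np.linspace(-2, 2, 9)
vals = [E_psi(e) for e in etas]
print(np.all(np.diff(vals) > 0))            # True: E_eta psi(X) increases in eta
```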
34. Let (X1, ..., Xn) be a stationary Markov chain with two states 0 and 1. That is,

P[Xi = εi | X1 = ε1, ..., X_{i−1} = ε_{i−1}] = P[Xi = εi | X_{i−1} = ε_{i−1}] = p_{ε_{i−1} εi},

where

( p00  p01 )
( p10  p11 )

is the matrix of transition probabilities. Suppose further that

(i) p00 = p11 = p, so that p10 = p01 = 1 − p;

(ii) P[X1 = 0] = P[X1 = 1] = 1/2.
(a) Show that if 0 < p < 1 is unknown this is a full rank, one-parameter exponential family with T = N00 + N11, where Nij is the number of transitions from i to j. For example, 01011 has N01 = 2, N11 = 1, N00 = 0, N10 = 1.

(b) Show that E(T) = (n − 1)p (by the method of indicators or otherwise).
35. A Conjugate Prior for the Two-Sample Problem. Suppose that X1, ..., Xn and Y1, ..., Yn are independent N(μ1, σ²) and N(μ2, σ²) samples, respectively. Consider the prior π for which for some r > 0, k > 0, rσ^{-2} has a χ²_k distribution and, given σ², μ1 and μ2 are independent with N(ξ1, σ²/k1) and N(ξ2, σ²/k2) distributions, respectively, where ξj ∈ R, kj > 0, j = 1, 2. Show that π is a conjugate prior.
36. The inverse Gaussian density, IG(μ, λ), is

f(x, μ, λ) = [λ/2π]^{1/2} x^{−3/2} exp{−λ(x − μ)²/2μ²x}, x > 0, μ > 0, λ > 0.

(a) Show that this is an exponential family generated by T(X) = −(1/2)(X, X^{−1})^T and h(x) = (2π)^{−1/2} x^{−3/2}.

(b) Show that the canonical parameters η1, η2 are given by η1 = μ^{−2}λ, η2 = λ, and that

A(η1, η2) = −[(1/2) log(η2) + √(η1 η2)], E = [0, ∞) × (0, ∞).

(c) Find the moment-generating function of T and show that E(X) = μ, Var(X) = μ³λ^{−1}, E(X^{−1}) = μ^{−1} + λ^{−1}, Var(X^{−1}) = (λμ)^{−1} + 2λ^{−2}.

(d) Suppose μ = μ0 is known. Show that the gamma family, Γ(α, β), is a conjugate prior.

(e) Suppose that λ = λ0 is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to μ. That is, Ω defined in (1.6.19) is empty.

(f) Suppose that μ and λ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, Ω defined in (1.6.19) is empty.

37. Let X1, ..., Xn be i.i.d. as X ~ Np(θ, Σ0) where Σ0 is known. Show that the conjugate prior generated by (1.6.20) is the Np(η0, τ0² I) family, where η0 varies freely in R^p, τ0² > 0 and I is the p × p identity matrix.
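The moment formulas in part (c) of Problem 36 are easy to spot-check by simulation, since numpy's Wald generator samples the IG(μ, λ) law. A hypothetical check:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, lam = 2.0, 5.0
x = rng.wald(mean=mu, scale=lam, size=200_000)    # inverse Gaussian IG(mu, lambda) draws

print(x.mean(), mu)                               # E(X) = mu
print(x.var(), mu ** 3 / lam)                     # Var(X) = mu^3 / lambda
print((1 / x).mean(), 1 / mu + 1 / lam)           # E(1/X) = 1/mu + 1/lambda
```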
38. Let Xi = (Zi, Yi)^T be i.i.d. as X = (Z, Y)^T, 1 ≤ i ≤ n, where X has the density of Example 1.6.3. Write the density of X1, ..., Xn as a canonical exponential family and identify T, h, A, and E. Find the expected value and variance of the sufficient statistic.
39. Suppose that Y1, ..., Yn are independent, Yi ~ N(μi, σ²), n ≥ 4.

(a) Write the distribution of Y1, ..., Yn in canonical exponential family form. Identify T, h, η, A, and E.

(b) Next suppose that μi depends on the value zi of some covariate and consider the submodel defined by the map η : (θ1, θ2, θ3)^T → (μ^T, σ²)^T where η is determined by

μi = exp{θ1 + θ2 zi}, z1 < z2 < ... < zn; σ² = θ3
where θ1 ∈ R, θ2 ∈ R, θ3 > 0. This model is sometimes used when μi is restricted to be positive. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 3.
40. Suppose Y1, ..., Yn are independent exponentially, E(λi), distributed survival times, n ≥ 3.

(a) Write the distribution of Y1, ..., Yn in canonical exponential family form. Identify T, h, η, A, and E.

(b) Recall that μi = E(Yi) = λi^{−1}. Suppose μi depends on the value zi of a covariate. Because μi > 0, μi is sometimes modeled as

μi = exp{θ1 + θ2 zi}, i = 1, ..., n

where not all the z's are equal. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 2.
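This curved model can be fit numerically by maximizing the likelihood directly over (θ1, θ2). A minimal sketch with hypothetical simulated data (scipy assumed):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n = 100
z = np.linspace(0.0, 2.0, n)
theta_true = np.array([0.5, -0.8])
mu = np.exp(theta_true[0] + theta_true[1] * z)
y = rng.exponential(scale=mu)                     # Y_i ~ E(lambda_i) with mean mu_i

def negloglik(theta):
    m = np.exp(theta[0] + theta[1] * z)           # mu_i = exp(theta_1 + theta_2 z_i)
    return np.sum(np.log(m) + y / m)              # minus the log likelihood, up to a constant

fit = minimize(negloglik, x0=np.zeros(2))
print(fit.x)                                      # close to theta_true
```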
1.8
NOTES
Note for Section 1.1
(1) For the measure theoretically minded we can assume more generally that the P_θ are all dominated by a σ-finite measure μ and that p(x, θ) denotes dP_θ/dμ, the Radon-Nikodym derivative.
Notes for Section 1.3

(1) More natural in the sense of measuring the Euclidean distance between the estimate θ̂ and the "truth" θ. Squared error gives much more weight to those θ̂ that are far away from θ than those close to θ.

(2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r.
Note for Section 1.4

(1) Source: Hodges, Jr., J. L., D. Krech, and R. S. Crutchfield. Statlab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975.
Notes for Section 1.6

(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particles; see Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory; see Cover and Thomas (1991).

(2) The restriction that x ∈ R^q and that these families be discrete or continuous is artificial. In general if μ is a σ-finite measure on the sample space X, p(x, θ) as given by (1.6.1)
can be taken to be the density of X with respect to μ; see Lehmann (1997). This permits consideration of data such as images, positions, spheres (e.g., the Earth), and so on.

Note for Section 1.7

(1) u^T M u > 0 for all p × 1 vectors u ≠ 0.

1.9 REFERENCES

BERGER, J., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.

BERMAN, S., "A Stochastic Model for the Distribution of HIV Latency Time Based on T4 Counts," Biometrika, 77, 733-741 (1990).

BICKEL, P., "Using Residuals Robustly I: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist., 6, 266-291 (1978).

BLACKWELL, D., AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York: Wiley, 1954.

BOX, G. E. P., "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion)," J. Royal Statist. Soc. A, 143, 383-430 (1979).

BROWN, L., Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, IMS Lecture Notes - Monograph Series, Hayward, 1986.

CARROLL, R., AND D. RUPPERT, Transformation and Weighting in Regression New York: Chapman and Hall, 1988.

COVER, T., AND J. THOMAS, Elements of Information Theory New York: Wiley, 1991.

DE GROOT, M., Optimal Statistical Decisions New York: McGraw-Hill, 1969.

DOKSUM, K. A., AND A. SAMAROV, "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression," Ann. Statist., 23, 1443-1473 (1995).

FERGUSON, T., Mathematical Statistics New York: Academic Press, 1967.

FEYNMAN, R., The Feynman Lectures on Physics, Ch. 40, Statistical Mechanics of Physics (R. Feynman, R. Leighton, and M. Sands, Eds.) Reading, MA: Addison-Wesley, 1963.

GRENANDER, U., AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York: Wiley, 1957.

HODGES, J. L., JR., D. KRECH, AND R. S. CRUTCHFIELD, Statlab: An Empirical Introduction to Statistics New York: McGraw-Hill, 1975.

KENDALL, M. G., AND A. STUART, The Advanced Theory of Statistics, Vols. I and II New York: Hafner Publishing Co., 1961, 1966.

LEHMANN, E. L., "A Theory of Some Multiple Decision Problems," Ann. Math. Statist., 1-25 and 547-572 (1957).

LEHMANN, E. L., "Model Specification: The Views of Fisher and Neyman, and Later Developments," Statist. Science, 5, 160-168 (1990).

LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.

MANDEL, J., The Statistical Analysis of Experimental Data New York: J. Wiley & Sons, 1964.

NORMAND, S-L., AND K. A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Editors: S. Ahmed and N. Reid, Lecture Notes in Statistics New York: Springer, 2000.

PEARSON, K., "On the General Theory of Skew Correlation and Non-linear Regression," Proc. Roy. Soc. London, 71, 303 (1905). (Draper's Research Memoirs, Dulau & Co., Biometrics Series II.)

RAIFFA, H., AND R. SCHLAIFFER, Applied Statistical Decision Theory Boston: Division of Research, Graduate School of Business Administration, Harvard University, 1961.

SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.

SAVAGE, L. J., ET AL., The Foundation of Statistical Inference London: Methuen & Co., 1962.

SNEDECOR, G., AND W. COCHRAN, Statistical Methods, 8th Ed. Ames, IA: Iowa State University Press, 1989.

WETHERILL, G. B., AND K. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
Chapter 2

METHODS OF ESTIMATION

2.1 BASIC HEURISTICS OF ESTIMATION

2.1.1 Minimum Contrast Estimates; Estimating Equations

Our basic framework is as before, X ∈ X, X ~ P ∈ P, usually parametrized as P = {P_θ : θ ∈ Θ}. In this parametric case, how do we select reasonable estimates for θ itself? That is, how do we find a function θ̂(X) of the vector observation X that in some sense "is close" to the unknown θ? The fundamental heuristic is typically the following. We consider a function that we shall call a contrast function

ρ : X × Θ → R

and define

D(θ0, θ) = E_{θ0} ρ(X, θ).

As a function of θ, D(θ0, θ) measures the (population) discrepancy between θ and the true value θ0 of the parameter. In order for ρ to be a contrast function we require that D(θ0, θ) be uniquely minimized for θ = θ0. That is, if P_{θ0} were true and we knew D(θ0, θ) as a function of θ, we could obtain θ0 as the minimizer. Of course, we don't know the truth so this is inoperable, but ρ(X, θ) is an estimate of D(θ0, θ), though only in a very weak sense (unbiasedness). So it is natural to consider θ̂(X) minimizing ρ(X, θ). This is the most general form of the minimum contrast estimate we shall consider in the next section.

Now suppose Θ is Euclidean ⊂ R^d, the true θ0 is an interior point of Θ, and θ → D(θ0, θ) is smooth. Then we expect

∇_θ D(θ0, θ) |_{θ = θ0} = 0   (2.1.1)

where ∇ denotes the gradient. Arguing heuristically again we are led to estimates θ̂ that solve

∇_θ ρ(X, θ) = 0.   (2.1.2)

The equations (2.1.2) define a special form of estimating equations.
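A minimal numerical sketch of this pair of ideas, using the squared-error contrast ρ(X, θ) = Σ (Xi − θ)² for a one-dimensional θ: the minimum contrast estimate and the root of the estimating equation Σ (Xi − θ) = 0 are both the sample mean. The data below are hypothetical (numpy assumed).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)       # hypothetical observations

def rho(theta):                                    # squared-error contrast rho(X, theta)
    return np.sum((x - theta) ** 2)

grid = np.linspace(x.min(), x.max(), 10001)
theta_min_contrast = grid[np.argmin([rho(t) for t in grid])]   # minimum contrast estimate
theta_estimating_eq = x.mean()                     # root of the estimating equation sum(x_i - theta) = 0
print(theta_min_contrast, theta_estimating_eq)     # essentially equal
```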
I In the important linear case.1. i=l 3 I (2. .. Yn are independent.1L1 2 = L[Yi .('l/Jl.)J i=1 ~ 2 (2. z.5) Strictly speaking P is not fully defined here and this is a point we shall explore later.»)'. If.I1) =0 is an estimating equation estimate.1.zd = LZij!3j j=1 andzi = (Zil. (1).g({3. Here the data are X = {(Zi. ~ Then we say 8 solving (2.7) i=l J . suppose we postulate that the Ei of Example 1. A naiural(l) function p(X.. The estimate (3 is called the least squares estimate. g({3. Consider the parametric version of the regression model of Example 1.z. i=l (2. where the function 9 js known.z.1. we take n p(X. But.4 are i.z)l: 1{31 ~ oo} 'I = 00 I (Problem 2..3) = 0 has 8 0 as its unique solution for all 8 0 ~ E e.16). (3) = !Y . (3) n n<T.i + L[g({3o.8) ~ E I1 .!I [" '1 L • . Example 2. "'. J which is indeed minimized at {3 = {3o and uniquely so if and only if the parametrization is identifiable.4 R d. ua). :i = ~8g~ ~ L. .6) . z).1.1. {3 E R d . W . then {3 satisfies the equation (2. z.100 More generally.2) or equivalently the system of estimating eq!1ations.. . Then {3 parametrizes the model and we can compute (see Problem 2.1.8/3 ({3.1.1. ~ ~8g~ L. (2. I D({3o. W(X. (3) exists if g({3.g({3.) .Zid)T 'j • r . Zn»T That is.). there is a substantial overlap between the two classes of estimates. for convenience. ZI). d g({3. . suppose we are given a function W : X and define Methods of Estimation Chapter 2 X R d .' . N(O.)g(l3. l'i) : 1 < i < n} where Yi.1. z) is differentiable in {3. IL( z) = (g({3. (1) Suppose V (8 0 . Wd)T V(l1 o. Least Squares.4) w(X.1. {3) to consider is the squared Euclidean distance between the vector Y of observed Yi and the vector expectation of Y. . (3) E{3. ..4 with I'(z) = g({3.10).d. further.)Y. z. .1. An estimate (3 that minimizes p(X. z.1. Evidently.8/3 ({3.i. g({3. Here is an example to be pursued later. z) is continuous and ! I lim{lg(l3.p(X.
2. Thus. This very important example is pursued further in Section 2. suppose is 1 . We return to the remark that this estimating method is well defined even if the Ci are not i. lid) as the estimate of q( 8)..i. To apply the method of moments to the problem of estimating 9. Method a/Moments (MOM). . if we want to estimate a Rkvalued function q(8) of 9. we obtain a MOM estimate of q( 8) by expressing q( 8) as a function of any of the first d moments 1'1. we need to be able to express 9 as a continuous function 9 of the first d moments.d.9) where Zv IIZijllnxd is the design matrix. 1 < i < n}. 8 E R d and 8 is identifiable. I'd). ~ . Suppose Xl. I'd( 8) are the first d moments of the population we are sampling from. /1j converges in probability to flj(fJ). say q(8) = h(I'I.(8).8) the normal equations. which can be judged on its merits whatever the true P governing X is. N(O. .. Least squares. 0 Example 2... provides a first example of both minimum contrast and estimating equation methods. The motivation of this simplest estimating equation example is the law of large numbers: For X "' Po. d > k.1 from R d to Rd.. In fact. Suppose that 1'1 (8).Xi t=l .d. Here is another basic estimating equation example. we assume the existence of Define the jth sample moment f1j by. The method of moments prescribes that we estimate 9 by the solution of p.i. as X ~ P8.Section 2.1. .. L.1 <j < d if it exists.2 and Chapter = 6. and then using h(p!. .. .  n n. 1 1j ~_l". Thus.. . .1.Xn are i. . I'd of X.1 Basic Heuristics of Estimation 101 the system becomes (2. 1 1 < j < d. = 1'. . thus. u5). These equations are commonly written in matrix fonn (2. once defined we have a method of computing a statistic fj from the data X = {(Zi 1 Xi). .. .1. More generally..
. . then setting (2. X n be Ltd. This algorithm and others will be discussed more extensively in Section 2. i = 1. neither minimum contrast estimates nor estimating equation solutions can be obtained in closed fonn.~ (I'l/a)'. A>O. two other basic heuristics particularly applicable in the i. Example 2.X 2 • In this example.)]xO1exp{Ax}.3. Pi is the proportion of men in the population in the ith job category and Njn is the sample proportion in this category.1 0) . . consider a study in which the survival time X is modeled to have a gamma distribution. . x>O. . . .6. for instance.•. We can..(X")1 dxd J isquickandM is nonsingular with high probability is the NewtonRaphson algorithm..·)  l~t.Pk are completely unknown. It is defined by initializing with eo. Here k = 5. but their respective probabilities PI. l Vk of the population being sampled are known. . There are many algorithms for optimization and root finding that can be employed. 1'1 for () gives = E(X) ~ "/ A.>0.10.. Suppose we observe multinomial trials in which the values VI. as X and Ni = number of indices j such that X j = Vi. 4. case.. . 2. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments. 3.d. Vi = i. Frequency Plugin(2) and Extension.4 and in Chapter 6. . 2. If we let Xl. ..(1 + ")/A 2 Solving . f(u. A = X/a' where a 2 = fl.11). ~ iii ~ (X/a)'. and 1" ~ E(X') = . In this case f) ~ ('" A). I. in particular Problem 6. Here is some job category data (Mosteller.2 and 0=2 = n 1 EX1. . As an illustration consider a population of men whose occupations fall in one of five different job categories. I. then the natural estimate of Pi = P[X = Vi] suggested by the law of large numbers is Njn..5. An algorithm for estimating equations frequently used when computationofM(X. express () as a function of IJ" and fl3 = E(X 3 ) and obtain a method of moment estimator based on /11 and fi3 (Problem 2. .2 The PlugIn and Extension Principles We can view the method of moments as an example of what we call the plugin (or substitution) and extension principles.102 Methods of Estimation Chapter 2 For instance.1. '\).. 0 Algorithmic issues We note that. the proportion of sample values equal to Vi. in general.i.1. A = Jl1/a 2 .. the method of moment estimator is not unique..·) fli =D'I'(X.1. 1968). with density [A"/r(. or 5.
Pk) with Pi ~ PIX = 'Vi]. That is.... Pk) of the population proportions.}}.»i=1 for Danish men whose fathers were in category 3. . that is. . P3 PI = 0 = (1.0)... can be identified with a parameter v : P ~ R. .. Equivalently.. 1 < think of this model as P = {all probability distributions Pan {VI. and Then q(p) (p.Pk by the observable sample frequencies Nt/n.31 5 95 0. If we use the frequency substitution principle." . Suppose .1 Basic Heuristics of Estimation 103 Job Category l I Ni Pi ~ 23 0. We would be interested in estimating q(P"". Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type. P3) given by (2. Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles.53 = 0. 1 Nk/n.0)2. v(P) = (P4 + Ps) and the frequency plugin principle simply says to replace P = (PI. whereas categories 2 and 3 correspond to whitecollar jobs.Xn . Now suppose that the proportions PI' . 1 ()d) and that we want to estimate a component of 8 or more generally a function q(8).. together with the estimates Pi = NiJn. (2. let P dennte p ~ (P". we are led to suppose that there are three types of individuals whose frequencies are given by the socalled HardyWeinberg proportions 2 .pl .44 .09. v. .03 2 84 0..12). 1~!.. The frequency plugin principle simply proposes to replace the unknown population frequencies PI. Pk do not vary freely but are continuous functions of some ddimensional parameter 8 = ((h.Ni =708 2:::o. the estimate is which in our case is 0.1. HardyWeinberg Equilibrium.. Example 2..0 < 0 < 1. suppose that in the previous job category table.0. .Pk) = (~l . i < k.P2. + P3).1. .. . in v(P) by 0 P ).(P2 + P3). For instance.Section 2. .Ps) = (P4 + Ps) . the multinomial empirical distribution of Xl. Next consider the marc general problem of estimating a continuous function q(Pl' . .. P2 ~ 20(1. . . the difference in the proportions of bluecollar and whitecollar workers.1. .11) to estimate q(Pll··. then (NIl N 2 . categories 4 and 5 correspond to bluecollar jobs.12 3 289 0..4.. .12) If N i is the number of individuals of type i in the sample of size n. If we assume the three different genotypes are identifiable.41 4 217 0. . ..13 n2::. .1. lPk). use (2. N 3 ) has a multinomial distribution with parameters (n.
we can use the principle we have introduced and estimate by J N l In.. ./P3 and. .(IJ).16) .E have an estimate P of PEP such that PEP and v : P 4 T is a parameter. the frequency of one of the alleles. as X p.. (2. thus. then v(P) is the plugin estimate of v. . (2.1.104 Methods of Estimation Chapter 2 we want to estimate fJ.).. . 1  V In general. Nk) . . a natural estimate of P and v(P) is a plugin estimate of v(P) in this nonparametric context For instance.15) ~ ".. case if P is the space of all distributions of X and Xl. Note.1. .. the representation (2. We can think of the extension principle alternatively as follows.1. If PI. In particular.1. consider va(P) = [Fl (a) + Fu 1(a )1. T(Xl. .4.1. Fu'(a) = sup{x.1. where a E (0. q(lJ) with h defined and continuous on = h(p.d.1. Because f) = ~. .Pk(IJ»). . however. X n are Ll. "·1 is.13) and estimate (2.14) As we saw in the HardyWeinberg case. Po ~ R given by v(PO) = q(O).. F(x) < a}.13) defines an extension of v from Po to P via v.Pk. Then (2. Let be a submodel of P... P > R where v(P) . ( . . in the Li.13) Given h we can apply the extension principle to estimate q( 8) as. 0 write 0 = 1 .h(p) and v(P) = v(P) for P E Po.Pk are continuous functions of 8. F(x) > a}. E A) n. the empirical distribution P of X given by    f'V 1 n P[X E Al = I(X. I) and ! F'(a) " = inf{x. by the law of large numbers. The plugin and extension principles can be abstractly stated as follows: Plugin principle. We shall consider in Chapters 3 (Example 3. If w..14) are not unique. . .. (2. if X is real and F(x) = P(X < x) is the distribution function (dJ. I: t=l (2. we can usually express q(8) as a continuous function of PI.Xn ) =h N. . suppose that we want to estimate a continuous Rlvalued function q of e. that is.. that we can also Nsfn is also a plausible estimate of O.4) and 5 how to choose among such estimates.d. . Now q(lJ) can be identified if IJ is identifiable by a parameter v .'" .
1. e e Remark 2. they are mainly applied in the i. The plugin and extension principles are used when Pe. If v: P ~ T is an extension of v in the sense that v(P) = viP) on Po.17) where F is the empirical dJ. then the plugin estimate of the jth moment v(P) = f. because when P is not symmetric. Here x!. Let viP) be the mean of X and let P be the class of distributions of X = 0 + < where B E R and the distribution of E ranges over the class of distributions with mean zero. ~ ~ . the sample median V2(P) does not converge in probability to Ep(X). but only VI(P) = X is a sensible estimate of v(P). A natural estimate is the ath sample quantile . and v are continuous. is a continuous map from = [0.i.) h(p) ~ L vip.' Nl = h N () ~ ~ = v(P) and P is the empirical distribution. = Lvi n i= 1 k .d. P E Po.2. 1/. Pe as given by the HardyWeinberg p(O). II to P.12) and to more general method of moment estimates (Problem 2. (P) is called the population (2. i=1 k /1j ~ = ~ LXI 1=1 I n .. then v(P) is an extension (and plugin) estimate of viP). As stated. ~ I ~ ~ Extension principle.1.13).tj = E(Xj) in this nonparametric . v(P8 ) = q(8) = h(p(8)) is a continuous map from to Rand D( P) = h(p) is a continuous map from P to R.1.Section 2. In this case both VI(P) = Ep(X) and V2(P) = "median of P" satisfy v(P) = v(P).1 Basic Heuristics of Estimation 105 V 1..1 I:~ 1 Xl. (P) is the ath population quantile Xa.1. For instance.1. ~ For a second example.1. case (Problem 2.4. For instance.d. With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plugin estimates for multinomial trials because I'j(8) where =L i=l k vfPi(8) = h(p(8» = viP.i.1. This reasoning extends to the general i. if X is real and P is the class of distributions with EIXlj < 00. Suppose Po is a submodel of P and P is an element of P but not necessarily Po and suppose v: Po ~ T is a p~eter. these principles are general. is called the sample median. let Po be the class of distributions of X = B + E where B E R and the distribution of E ranges over the class of symmetric distributions with mean zero. Here x ~ = median. P 'Ie Po. However. in the multinomial examples 2.1. casebut see Problem 2. The plugin and extension principles must be calibrated with the target parameter. context is the jth sample moment v(P) = xjdF(x) ~ 0.1. then Va.3 and 2. = v(P). Remark 2.14. .
To estimate the population variance B( 1 .X). and there are best extensions.l [iX. a 2 ) sample as in Example 1.1. For instance. In Example 1.Ll). . C' X n are ij. . .) = ~ (0 + I).106 Methods of Estimation Chapter 2 Here are three further simple examples illustrating reasonable and unreasonable MOM estimates. X n is a N(f1. The method of moments estimates of f1 and a 2 are X and 2 0 a. Discussion. there are often several method of moments estimates for the same q(9). estimation of B real with quadratic loss and Bayes priors lead to procedures that are data weighted averages of (J values rather than minimizers of functions p( (J. " . X(l . then B is both the population mean and the population variance. If the model fits. This minimal property is discussed in Section 5. Algorithms for their computation will be introduced in Section 2. as we shall see in Chapter 3. Because f.d. . these are the frequency plugin (substitution) estimates (see Problem 2. these estimates are likely to be close to the value estimated (consistency)..1. Thus. valuable as preliminary estimates in algorithms that search for more efficient estimates. •i .1. 2. Unfortunately. the plugin principle is justified.2 with assumptions (1)(4) holding. (b) If the sample size is large. Because we are dealing with (unrestricted) Bernoulli trials..5. 0 = 21' . minimax. Suppose that Xl. optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions. It does turn out that there are "best" frequency plugin estimates. X.I and 2X . 0 Example 2. U {I. . because Po = P(X = 0) = exp{ O}.LI (8) = (J the method of moments leads to the natural estimate of 8.4. . we may arrive at different types of estimates than those discussed in this section. = 0]. they are often difficult to compute.8) we are led by the first moment to the estimate. we find I' = E. those obtained by the method of maximum likelihood..3 where X" . as we shall see in Section 2. a saving grace becomes apparent in Chapters 5 and 6. .4.1. Example 2. Moreover. a frequency plugin estimate of 0 is Iogpo. Plugin is not the optimal way to go for the Bayes. (X. • .4. Remark 2. ". The method of moments can lead to either the sample mean or the sample variance. Estimating the Size of a Population (continued). [ [ . When we consider optimality principles. therefore. For example.6.. )" I I • . What are the good points of the method of moments and frequency plugin? (a) They generally lead to procedures that are easy to compute and are. I .7. Suppose X I..1 because in this model B is always at least as large as X (n)' 0 As we have seen. We will make a selection among such procedures in Chapter 3. . the frequency of successes. a special type of minimum contrast and estimating equation method.1.. if we are sampling from a Poisson population with parameter B. However.2. or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. This is clearly a foolish estimate if X (n) = max Xi> 2X .3. . 'I I[ i['l i . Example 2. for large amounts of data. See Section 2. O}. where Po is n.5.1 is a method of moments estimate of B. X). X n are the indicators of a set of Bernoulli trials with probability of success fJ.
ll. z) = ZT{3.Zi)j2.2 2.g((3. li) : I < i < n} with li independent and E(Yi ) = g((3. When P is the empirical probability distribution P E defined by Pe(A) ~ n.Section 2. a least squares estimate of {3 is a minimizer of p(X.) = q(O) and call v(P) a plugin estimator of q(6). It is of great importance in many areas of statistics such as the analysis of variance and regression theory. A minimum contrast estimator is a minimizer of p(X. In the multinomial case the frequency plugin estimators are empirical PIEs based on v(P) = (Ph .2.lJ) ~ E'o"p(X.. . where Pj is the probability of the jth category. For the model {Pe : () E e} a contrast p is a function from X x H to R such that the discrepancy D(lJo. and the contrast estimating equations are V Op(X. I < i < n.1 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS Least Squares and Weighted Least Squares Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement. 1 < j < k. when g({3. In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6. For data {(Zi. 0 E e. then is called the empirical PIE.O) = O. Suppose X .. . 0). Method of moment estimates are empirical PIEs based on v(P) = (f. If P is an estimate of P with PEP. ~ ~ ~ ~ 2.ld)T where f.1 L:~ 1 I[X i E AI. If P = PO. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters.. Zi). f..Pk)... . ~ ~ ~ ~ ~ v we find a parameter v such that v(p. D(P) is called the extensionplug·in estimate of v(P).lj = E( XJ\ 1 < j < d. An extension ii of v from Po to P is a parameter satisfying v(P) = v(P).(3) ~ L[li . the associated estimating equations are called the normal equations and are given by Z1 Y = ZtZD{3. where 9 is a known function and {3 E Rd is a vector of unknown regression coefficients. P E Po.. is parametric and a vector q( 8) is to be estimated. For this contrast.. . where ZD = IIziJ·llnxd is called the design matrix. 6). P. 0 E e c Rd is uniquely minimized at the true value 8 = 8 0 of the parameter.2 Minimum Contrast Estimates and Estimating Equations 107 Summary. The plugin estimate (PIE) for a vector parameter v = v(P) is obtained by setting fJ = v( P) where P is an estimate of P. The general principles are shown to be related to each other.. Let Po and P be two statistical models for X with Po c P.
If we consider this model. and 13 ~ (g(l3.4 with <j simply defined as Y.C=ha:::p::.2. " . < i < n. I) are i.4) (2. j. (2.g(l3...4}{2. " (2. The estimates continue to be reasonable under the GaussMarkov assumptions..5) (2.).z). a 2 and H unknown.E:::'::'.::2 : In Example 2. as in (a) of Example 1. Note that the joint distribution H of (C}..2.) + <" I < i < n n (2. then 13 has an interpretation as a parameter on p. 1 < i < n. we can compute still Dp(l3o. Z could be educationallevel and Y income. z. 13) = LIY.2. 13 = I3(P) is the miniml2er of E(Y .g(l3.z. (Zi.1..2.z. Yi).2) Are least squares estimates still reasonable? For P in the semiparametric model P.1.E(Y. Zn»)T is 1. is modeled as a sample from a joint distribution.d. ..Y) such thatE(Y I Z = z) = g(j3.2.»)2. 13 E R d }.2. (Z" Y.g(l3.2.d.) . . z.) =0.:. fj) because (2. 1 < i < j < n 1 > 0\ . This is frequently the case for studies in the social and biological sciences. NCO. that is..g(l3.f3d and is often applied in situations in which specification of the model beyond (2.) = 0. That is. E«.z. . This follows "I . (3)  E p p(X. For instance. . The contrast p(X. . I' .6) is difficult Sometimes z can be viewed as the realization of a population variable Z. where Ci = g(l3.zt}.::'hc.o=nC.) = E(Y.:1.) + L[g(l3 o. as (Z.zn)f is 11. ... .::. The least squares method of estimation applies only to the parameters {31' . E«. aJ) and (3 ranges over R d or an open subset.. 1 < j < n.(z. Z)f. that is.2.2.). 1_' .. the model is semiparametric with {3. . i=l (2. .3) continues to be valid. Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ. Y) ~ PEP = {Alljnintdistributions of(Z.g(l3.6) Var( €i) = u 2 COV(fi. = 0. • . .=. .2).i. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj.)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3.) = g(l3.108_~ ~_ _~ ~ ~cMc.3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3.:0::d=':Of. .) i.2. j..m::a. z.1. I Zj = Zj). .i.13) n n i=1  L Varp«.En) is any distribution satisfying the GaussMarkov assumptions. which satisfies (but is not fully specified by) (2.1 we considered the nonlinear (and linear) Gaussian model Po given by Y. I<i<n.
Yi). a further modeling step is often taken and it is assumed that {Zll . y)T has a nondegenerate multivariate Gaussian distribution Nd+1(Il. = py . E(Y I Z = z) can be written as d p(z) where = (30 + L(3jZj j=1 (2. Fan and Gijbels (1996).zo). j=1 (2.Section 2.2. (2.. Yi) are a sample from a (d + 1)dimensional distribution and the covariates that are the coordinates of Z are continuous.1 the most commonly used 9 in these models is g({3.L{3jPj.2 Minimum Contrast Estimates and Estimating Equations 109 from Theorem 1. J for Zo an interior point of the domain. (2. . n} we write this model in matrix fonn as Y = ZD(3 + €    where Zv = IIZij 11 is the design matrix. (2) If as we discussed earlier..2.6).2. and Seber and Wild (1989).I to each of the n pairs (Zil Yi). As we noted in Example 2. z) = zT (3. In this case we recognize the LSE {3 as simply being the usual plugin estimate (3(P). as we have seen in Section lA.2. where P is the empirical distribution assigning mass n. Sections 6. we are in a situation in which it is plausible to assume that (Zi. We can then treat p(zo)  nnknown (30 and identify (zo) with (3j to give an approximate (d 1 and Zj as before and linear model with Zo = t!.1). Zd. .1. For nonlinear cases we can use numerical methods to solve the estimating equations (2..1. E1=1 g~ (zo)zo as an + 1) dimensional d p(z) = L(3jZj.9) ..7).2. we can approximate p(z) by p(z) ~ p(zo) +L j=1 d a: a (zo)(z .2.2.41.4. see Rnppert and Wand (1994).. The linear model is often the default model for a number of reasons: (I) If the range of the z's is relatively small and p(z) is smooth.4. In that case.4). See also Problem 2.8) d f30 = («(3" . in conjnnction with (2. is called the linear (multiple) regression modeL For the data {(Zi.. i = 1. j=O This type of approximation is the basis for nonlinear regression analysis based on local polynomials. 2:).7) (2. . . which.3 and 6. We continue our discussion for this important special case for which explicit fonnulae and theory have been derived.2.5.5) and (2. I)/. {3d).1. and Volnme n.
Example 2. LZi(Yi . .1. {3. Y is the amount of phosphorus found in com plants grown for 38 days in the different samples of soil. the solution of the normal equations can be given "explicitly" by (2. For this reason. Estimation of {3 in the linear regression model. The nonnal equations are n i=l n ~(Yi . exists and is unique and satisfies the Donnal equations (2. 1 < i < n. whose solution is .2.2. In that case. I I. 1 64 4 71 5 54 9 81 11 76 13 23 77 93 23 95 28 109 The points (Zi' Yi) and an estimate of the line 131 + 132z are plotted in Figure 2. L 1 that. i=l (2. given Zi = 1 < i < n.11) When the Zi'S are not all equal. In the measurement model in which Yi is the detennination of a constant Ih.ecor and Cochran.) ~ 0. il. ( (J2 2 ) distribution where = ayy Therefore. the relationship between z and y can be approximated well by a linear equation y = 131 + (32Z provided z is restricted to a reasonably small interval. the least squares estimate. g(z.2. ~ = (lin) L:~ 1Yi = ii.no Furthennore.2.1. Y is random with a distribution P(y I z). If we run several experiments with the same z using plants and soils that are as nearly identical as possible. i I . .8). For certain chemicals and plants. The parametrization is identifiable if and only ifZ D is of rank d or equivalently if Z'bZD is affulI rank d.12) . if the parametrization {3 ) ZD{3 is identifiable.2.(3zz. .2. d = 1. see Problem 2. we have a Gaussian linear regression model for Yi.2.il.il. il. Zi. we assume that for a given z. Zi 1'.10) Here are some examples. Following are the results of an experiment to which a regression model can be applied (Sned. Nine samples of soil were treated with different amounts Z of phosphorus.) = O.I 'i.ill .2. necessarily. p.25.il2Zi) = 0.) = 1 and the normal equation is L:~ 1(Yi . g(z. 0 Ii. 139). we get the solutions (2. 1967. il. we will find that the values of y will not be the same. We want to estimate f31 and {32.) = il" a~. We have already argued in Examp}e 2. is independent of Z and has a N(O. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil.1 '. < Methods of Estimation Chapter 2 =y  I'(Z) Eyz:Ez~Ezy.1. Example 2.
4..1. . suppose we select p realvalued functions of z. €6 is the residual for (Z6. . Here {31 = 61. Geometrically.2. . . (zn. Zn. . 9p( z). and fi = (lin) I:~ 1 Yi· The line y = /31 + f32z is known as the sample regression line or line of best fit of Yl.13) where Z ~ (lin) I:~ 1 Zi.2. .Yi) and a line Y ~ a + bz vertically by di = Iy. .. n} and sample regression line for the phosphorus data.3. . The regression line for the phosphorus data is given in Figure 2.132 z ~ ~ (2. 91. .42.2. . Y6).. n.. Yn). This connection to prediction explains the use of vertical distance! in regression. i = 1.1..1..Section 2. P > d and postulate that 1'( z) is a linear combination of 91 (z).(/31 + (32Zi)] are called the residuals of the fit. The linear regression model is considerably more general than appears at first sight. The vertical distances €i = fYi .Yn on ZI.9p..(Z). . then the regression line minimizes the sum of the squared distances to the n points (ZI' Yl). . i = 1.. The line Y = /31 + (32Z is an estimate of the best linear MSPE predictor Ul + b1Z of Theorem 1. that is p I'(z) ~ 2:)..(a + bzi)l.. &atter plot {(Zi.. . . if we measure the distance between a point (Zi.9. For instance. j=1 Then we are still dealing with a linear regression model because we can define WpXI . and 131 = fi . 0 ~ ~ ~ ~ ~ Remark 2. .2 Minimum Contrast Estimates and Estimating Equations 111 y 50 • o 10 20 x 30 Figure 22. Yi).58 and 132 = 1....
5). we may be able to characterize the dependence of Var( ci) on Zi at least up to a multiplicative constant. Zi) + Ei .24 for more on polynomial regression. _ Yi .2... iI L Vi[Yi i=l n ({3.zd + ti.g«(3. Example 2.. We return to this in Volume II.. 1 <i < n Yi n 1 . + /3zzi)J2 (2. and (32 that minimize ~ ~ "I. but the Wi are known weights.". Weighted Linear Regression.. Zi2 = Zi. The weighted least squares estimate of {3 is now the value f3. 1<i <n Wij = 9j(Zi).2 and many similar situations it may not be reasonable to assume that the variances of the errors Ci are the same for all levels Zi of the covariate variable.. Zi) = {3l + fhZi.16) as a function of {3.2. which for given Yi = yd.2. That is.3. Zil = 1.. Weighted least squares. gp( z)) T as our covariate and consider the linear model Yi = where L j=l p ()jWij + Ci.. .8.15) . Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic). . However.wi  _ g((3... we arrive at quadratic regression...filii minimizes ~  I i i L[1Ii . .. .)]2 i=l i=l ~ n (2. I . [Yi .. and g({3. Thus.= g({3. then = WW2/Wi = .14) where (J2 is unknown as before.2.g(. Zi) = (2.2.z..Zi)]2 = L ~.5) fails.2. 1 n.2.. if d = 1 and we take gj (z) = zj....'l are sufficient for the Yi and that Var(Ed. . i = 1. z. Note that the variables .wi. For instance.wi. In Example 2.II 112 Methods of Estimation Chapter 2 (91 (z). The method of least squares may not be appropriate because (2. Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter. < n.2. 0 < j < 2. .wi .wi) g((3..2. . Zi)/.. if we setg«(3. fi = <d.wi 1 < . We need to find the values {3l and /32 of (3. • I' "0 r .2. . Yi = 80 + 8 1Zi + 82 + Cisee Problem 2. I and the Y i satisfy the assumption (2. we can write (2. Consider the case in which d = 2.17) .= y.
Y*) V.1 leading to (2.16) for g((3..2.2. a__ Cov(Z"'.2.4H2. wn ) and ZD = IIZijllnxd is the design matrix.2.B.2. . Let (Z*.2. By following the steps of Exarnple 2. Zi) = z.B that minimizes (2.)] ~Ui. the . using Theorem 1.2.6). Yn) and probability distribution given by PI(Z". .Y") ~(Zi.2.~ UiYi .1) and (2. That is.2 Minimum Contrast Estimates and Estimating Equations 113 where Vi = l/Wi_ This problem may be solved by setting up analogues to the normal equations (2. Y*) denote a pair of discrete random variables with possible values (z" Yl). .19) and (2.(Li~] Ui Z i)2 n 2 n (2. Var(Z") and  _ L:~l UiZiYi . . . Thus.2. then its MSPE is ElY" ~ 1l1(Z")f ~ :L UdYi i=l n «(3] + (32 Zi)1 2 It follows that the problem of minimizing (2.7).(3. ~ . we find (Problem 2.~ UiZi· n n i"~l I" n n F"l This computation suggests. 0 Next consider finding the.1.1 7) is equivalent to finding the best linear MSPE predictor of y . as we make precise in Problem 2.Y.28) that the model Y = ZD. If Itl(Z*) = {31 given by + {32Z* denotes a linear predictor of y* based on Z .2.26. . we may allow for correlation between the errors {Ei}. z) = zT. More generally.4. When ZD has rank d and Wi > 0. . that weighted least squares estimates are also plugin estimates.20).2. Moreover. i ~ I..13 minimizing the least squares contrast in this transformed model is given by (2.27) that f3 satisfy the weighted least squares normal equations ~ ~ where W = diag(wI.4 as follows. ..B+€ can be transformed to one satisfying (2. when g(.3. suppose Var(€) = a 2W for some invertible matrix W nxn.. we can write Remark 2.1..n. (Zn. Then it can be shown (Problem 2.B.2.Section 2.. We can also use the results on prediction in Section 1. (3 and for general d.(L:~1 UlYi)(L:~l uizd Li~] Ui Zi .(32E(Z') ~ .n where Ui n = vi/Lvi..2.. . . 1 < i < n.8).2. i i=l = 1..18) ~ ~ ~I" (31 = E(Y') ...
.
.
.
.
27) that are not maxima or only local maxima.. and 3 and occurring in the HardyWeinberg proportions p(I.) = ·j1 e'"p. 0) = 20'(1. X3 = 1. 8' 80. Example 2. In general.2.7. Nevertheless. the dual point of view of (2. 1..2. (2. represents the expected number of arrivals in an hour or. O)p(l. Here X takes on values {O. 2.2.2. 2n (2.0)' < ° for all B E (0.. the MLE does not exist if 112 + 2n3 = O.0). I). then I • Lx(O) ~ p(l. is zero. .. Evidently.(1.2. Similarly. ~ maximizes Lx(B). Then the same calculation shows that if 2nl + nz and n2 + 2n3 are both positive.28) If 2n. x.22) and (2.2. n2. p(2. let nJ.0) The likelihood equation is 8 80 Ix (0) ~ = 5 1 0 10 =0. X2 = 2.118 Methods of Estimation Chapter 2 which again enables us to analyze the behavior of B using known properties of sums of independent random variables. p(X. Because .2. A is an unknown positive constant and we wish to estimate A using X.1. which has the unique solution B = ~.6. . 0)p(2. } with probabilities. respectively. + n.x=O. situations with f) well defined but (2. p(3.27) doesn't make sense.. + n.OJ'". Here are two simple examples with (j real.2. . Consider a popUlation with three kinds of individuals labeled 1. and as we have seen in Example 2. . where ). Let X denote the number of customers arriving at a service counter during n hours. 2 and 3.2. there may be solutions of (2.2.. 0) ~ 20(1 .>. and n3 denote the number of {Xl. If we make the usual simplifying assumption that the arrivals form a Poisson process.O) = 0'.ny l. the rate of arrival. If we observe a sample of three individuals and obtain Xl = 1.lx(0) = 5 1 0' . . so the MLE does not exist because = (0. then X has a Poisson distribution with parameter nA.27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section. x n } equal to 1.4). which is maximized by 0 = 0.29) '(j .l.5. .. 1). the maximum likelihood estimate exists and is given by 8(x) = 2n. ]n practice. equivalently. the likelihood is {1 . OJ = (1  0)' where 0 < () < 1 (see Example 2. 0 e Example 2.
trials in which each trial can produce a result in one of k categories.2.6.8. and the equation becomes (BkIB j ) n· OJ = .32) to find I. . ~ ~ . 31=1 By (2.1. p(x.. e.2. = x/no If x is positive. consider an experiment with n Li. .31 ) To obtain the MLE (J we consider l as a function of 8 1 . J = I.kl. Let Xi = j if the ith trial produces a result in the jth category. let B = P(X. 8' (2. .. and k j=1 k Ix(fJ) = LnjlogB j . this estimate is the MLE of ). 8) = 0 if any of the 8j are zero. Example 2.8B " 1=13 k k =0. and must satisfy the likelihood equa~ons ~ 8B1x(fJ) = 8B 3 8 8 " n. 0 To apply the likelihood equation successfully we need to know when a solution is an MLE. As in Example 1.logB. = j) be the probability J of the jth category. for an experiment in which we observe nj ~ 2:7 1 I[X. A similar condition applies for vector parameters. 80..2 Minimum Contrast Estimates and Estimating Equations 119 The likelihood equation is which has the unique solution). Then p(x. j=d (2. = jl.32) We first consider the case with all the nj positive. ~ n .. (see (2.n. If x = 0 the MLE does not exist. A sufficient condition..... familiar from calculus.6..\ I 0. Multinomial Trials. fJE6= {fJ:Bj >O'LOj ~ I}. .ekl with kl Ok ~ 1. the maximum is approached as .2.30) for all This is the condition we applied in Example 2. thus.2.lx(B) <0. the MLE must have all OJ > 0. j = 1.2.2. and let N j = L:~ 1 l[Xi = j] be the number of observations in the jth category. However. this is well known to be equivalent to ~ e. .32). fJ) = TI7~1 B7'.2.d.2.o. 8B. 8Bk/8Bj (2.7.LBj j=1 • (2. is that l be concave in If l is twice differentiable.30)). L. We assume that n > k ~ 1.. . Then. k..Section 2. = L.
we can consider minimizing L:i.2.). As we have IIV. N(lll 0..(Xi . ~ g«(3.120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8).30.202 L.33) It follows that in this nj > 0. Next suppose that nj = 0 for some j.' . 1 Yn ..lV. and 0..IV.. zn) 03W. these estimates viewed as an algorithm applied to the set of data X make sense much more generally.  seen and shall see more in Section 6. i = 1. we check the concavity of lx(O): let 1 1 <j < k .1 L:~ . Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O).2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood. g«(3. Using the concavity argument.2. wi(5).1. k.) = gi«(3). 1 < i < n. z.d. where Var(Yi) does not depend on i..28. Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix.. In Section 2. n. Zi). Then (} with OJ = njjn. See Problem 2. Example 2. .1..9.. 'ffi f. is still the unique MLE of fJ. .11(a)). are known functions and {3 is a parameter to be estimated from the independent observations YI .Zi)]2. Zi)J[l:} . 0.1). where E(V..34) g«(3.2 both unknown..2.gi«(3) 1 .g«(3.3.1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV. Thus. . ((g((3. Then Ix «(3) log IT ~'P (V.. YnJT. then <r <k I.1 holds and X = (Y" . X n are Li.2. we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n.2. o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2. . The 0 < OJ < 1. . gi. j = 1. Zi)) 00 ao n log 2 I"" 2 "2 (21Too) .. zi)1 . It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3. Suppose that Xl. as maximum likelihood estimates for f3 when Y is distributed as lV. Suppose the model Po of Example 2.2. least squares estimates are maximum likelihood for the particular model Po.. OJ > ~ 0 case. . i=l g«(3.X)2 (Problem 2. see Problem 2. 1 < j < k.2 ) with J1. (2. More generally..6.2. Summary. ~ g((3. . version of this example will be considered in the exponential family case in Section 2.. This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • . . .
3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil.3. b). we define 8 1n .3. That is. z=1=1 ~ 2.1 and 1. Let &e = be the boundary of where denotes the closure of in [00.00. in the N(O" O ) case. Formally.2. In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d. Suppose 8 c RP is an open set.3. See Problem 2. Suppose also that e t R where e c RP is open and 1 is (2.1.3.oo}}. (a. if X N(B I . This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI.zid. as k . Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}. which are appropriate when Var(Yj) depends on i or the Y's are correlated. though the results of Theorems 1.1 ). .ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e.5.I ) all tend to De as m ~ 00. are given. (a.. In general. (m. oo]P.6. For instance. (h). m). We start with a useful general framework and lemma. Extensions to weighted least squares. Concavity also plays a crucial role in the analysis of algorithms in the next section. where II denotes the Euclidean norm. Proof. These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence. . In Section 2.2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. = RxR+ and ee e.3 and 1. (}"2) distribution. b).3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly.b) . For instance. e e I"V e De= {(a. In the case of independent response variables Yi that are modeled to have a N(9i({3). (m. 2 e Lemma 2.1.4 and Corollaries 1.6.a < b< oo} U {(a. b) : aER. including all points with ±oo as a coordinate.a= ±oo. m.00. 8).6.2 and other exponential family properties also playa role..1) lim{l(li) : Ii ~ ~ De} = 00.Section 2. or 8 1nk diverges with 181nk I .6. m.3. Suppose we are given a function 1: continuous.bE {a. it is shown that the MLEs coincide with the LSEs.. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2. for a sequence {8 m } of points from open.1 only. (m. . Properties that derive solely fTOm concavity are given in Propositon 2. a8 is the set of points outside of 8 that can be obtained as limits of points in e.
We can now prove the following. is open..3.)) > ~ (fx(Od +lx (0.12. ~ Proof.3.3. then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT.'1) with T(x) = 0. if lx('1) logp(x. (ii) The family is of rank k.3. = . E.8 and 2. if to doesn't satisfy (2. a contradiction. and 0. We give the proof for the continuous case. lI(x) =  exists. /fCT is the convex suppon ofthe distribution ofT (X). which implies existence of 17 by Lemma 2. .3) I I' (b) Conversely. Applications of this theorem are given in Problems 2.6.to. no solution. we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) .1. Furthennore.3. are distinct maximizers. Suppose the cOlulitions o/Theorem 2.3) has i 1 !. " We. Suppose X ~ (PII . ' L n . i . Then. [ffunher {x (II) e = e e t 88.'1 . h) and that (i) The natural parameter space.1. and is a solution to the equation (2. Let x be the observed data vector and set to = T(x).1. II). then the MLE doesn't exist and (2.3. Suppose P is the canonical exponential/amity generated by (T. By Lemma 2. U m = Jl~::rr' Am = . From (B. ! " .3. II E j. open C RP.)) 0 Themem 2. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data. then Ix lx(O. '10) for some reference '10 E [ (see Problem 1... (a) If to E R k satisfies!!) Ifc ! 'I" 0 (2. is unique. " .9) we know that II lx(lI) is continuous on e.00.3. Proofof Theorem 2. Write 11 m = Am U m .2). . If 0.  (hO! + 0.1.3.27). Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1. Corollary 2. We show that if {11 m } has no subsequence converging to a point in E.1 hold.1.2) then the MLE Tj exists. then lx(11 m ) . thus. I I.). .3.122 Methods of Estimation Chapter 2 Proposition 2. Existence and Uniqueness o/the MLE ij. II) is strictly concave and 1.(11) ~ 00 as densities p(x.1.3. Without loss of generality we can suppose h(x) = pix. with corresponding logp(x. then the MLE8(x) exists and is unique.3.3.
.3. lIu m ll 00.2) fails. Po[uTT(X) > 61 > O. Then AU ¢ £ by assumption. I Xl) and 1. thus. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0.5. Por n = 1.* P'IlcTT = 0] = I.se density is a general phenomenon. CT = R )( R+ FOT n > 2.1. This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0. The Gaussian Model. T(X) has a density and. for every d i= 0.6. Evidently. CT = C~ and the MLE always exists.3. if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows. So we have Case 2: Amk t A. Nonexistence: if (2. which is impossible.3. fOT all 1].1. both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets.3. 0 ° '* '* PrOD/a/Corollary 2. Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists. 0 (L:7 L:7 CJ. It is unique and satisfies x (2. So In either case limm.3.i.4.1 follow.3.d.3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1. 0 Example 2.3). iff. So. Theorem 2.3.1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2. Suppose X"".3) by TheoTem 1.k Lx( 11m. The equivalence of (2.3. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist. If ij exists then E'I T ~ 0 E'I(cTT) ~ 0.3. this is the exponential family generated by T(X) = 1 Xi. contradicting the assumption that the family is of rank: k. .6. N(p" ( 2 ). um~. u mk t u. that is. By (B. Write Eo for E110 and Po for P11o . (1"2 > O.Xn are i. .9.2.Section 2.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it. existence of MLEs when T has a continuous ca.J = 00.2) and COTollary 2. Case 1: Amk t 111]=11. I' E R. t u. Then. Suppose the conditions ofTheorem 2. In fact. As we observed in Example 1. Then because for some 6 > 0.
It is easy to see that ifn > 2.2. Example 2.). = = L L i i bJ _ I . The likelihood equations are equivalent to (problem 2.3. I < j < k.3. We follow the notation of Example 1. lit ~ .9).1. Multinomial Trials. and one of the two sums is nonempty because c oF 0. Thus. A nontrivial application of Theorem 2.13 and the next example). h(x) = XI.i. by Problem 2._t)T.2) holds. where Tj(X) L~ I I(Xi = j). We assume n > k .= r(jJ) A log X (2. o Remark 2. > O. we see that (2. Because the resulting value of t is possible if 0 < tjO < n.>. the maximum likelihood principle in many cases selects the «best" estimate among them...3). For instance.(X) = r~.3. They are determined by Aj = Tj In.1). >.2 that (2..4) and (2. From Theorem 1.I which generates the family is T(k_l) = (Tl>"" T.4. x > 0.3.3.I. When method of moments and frequency substitution estimates are not unique. with i'. I < j < k iff 0 < Tj < n.4. the best estimate of 8.3.ln.1. the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2.4) (2. then and the result follows from Corollary 2. Here is an example. I <t< k . where 0 < Aj PIX = j] < I.1 in the second.6. with density 9p.. Thus. To see this note that Tj > 0.3.3.5) ~=X A where log X ~ L:~ t1ogXi. .124 Methods of Estimation Chapter 2 Proof.2(a). In Example 3.d. I < j < k. We conclude from Theorem 2..2 follows.3 we know that E. .3.. P > 0.3.n.3.5) have a unique solution with probability 1.3.. How to find such nonexplicit solutions is discussed in Section 2..4 and 2.2. if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO . using (2.3.. LXi).6.2(b» r' log .3. in a certain sense. Example 2.1.1 that in this caseMLEs of"'j = 10g(AjIAk).T(X) = A(.1. 0 = If T is discrete MLEs need not exist. thus.3. The TwoParameter Gamma Family..6. I <j < k. The boundary of a convex set necessarily has volume 0 (Problem 2. I < j < k.. but only Os is a MLE. if T has a continuous case density PT(t). This is a rank 2 canonical exponential family generated by T   = (L log Xi.3.)e>"XxPI.7. liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2.Xn are i.1.I and verify using Theorem 2. exist iff all Tj > O. Suppose Xl.3. The statistic of rank k .4 we will see that 83 is. Thasadensity. in the HardyWeinberg examples 2.
Returning to the multinomial Example 2.3.3, note that if any T_j = 0 or n the MLE does not exist in the canonical (open) parametrization. In Example 2.2.8 we saw that with the closed parameter set {λ_j : λ_j ≥ 0, Σ_{j=1}^k λ_j = 1} the MLEs λ̂_j = T_j/n, 1 ≤ j ≤ k, exist and are unique; the difficulty arises because, when we put the multinomial in canonical exponential family form, our parameter set is open. Similarly, in the Hardy-Weinberg Example 2.1.4, if 2N_1 + N_2 = 0, for example, the MLE does not exist when Θ = (0, 1), whereas if Θ = [0, 1] it does exist and is unique. Alternatively, the argument of Example 2.3.3 applied with c_j = 1(i = j), 1 ≤ i ≤ k − 1, yields a contradiction to (2.3.2) whenever some T_j = 0; the remaining case, T_k = 0, gives a contradiction with c = (1, ..., 1)^T.

In some applications, for example the bivariate normal case (Problem 2.3.13), the following corollary to Theorem 2.3.1 is useful.

Corollary 2.3.2. Consider the exponential family

p(x, θ) = h(x) exp{ Σ_{j=1}^k c_j(θ) T_j(x) − B(θ) },  θ ∈ Θ, x ∈ X.

Let C° denote the interior of the range of (c_1(θ), ..., c_k(θ))^T and let x be the observed data. If the likelihood equations have a solution θ̂(x) with (c_1(θ̂), ..., c_k(θ̂)) ∈ C°, then it is the unique MLE of θ.

This can be applied to determine existence in cases for which (2.3.3) does not have a closed-form solution; alternatively we can appeal to Theorem 2.3.1 directly (Problem 2.3.10).

When P is not an exponential family, both existence and unicity of MLEs become more problematic. However, the following result can be useful. Let Q = {P_θ : θ ∈ Θ}, Θ open ⊂ R^m, be a curved exponential family

p(x, θ) = exp{ c^T(θ) T(x) − A(c(θ)) } h(x),  m < k,   (2.3.6)

where c : Θ → E ⊂ R^k has a differential ċ(θ) = (∂c_i/∂θ_j)(θ), m × k, on Θ. Here E is the natural parameter space of the exponential family P generated by (T, h).

Theorem 2.3.3. If P above satisfies the conditions of Theorem 2.3.1, c(Θ) is closed in E, and T(x) = t_0 satisfies (2.3.2) so that the MLE η̂ in P exists, then so does the MLE θ̂ in Q, and θ̂ satisfies the likelihood equation

ċ^T(θ̂)( t_0 − Ȧ(c(θ̂)) ) = 0.   (2.3.7)

Note that c(θ̂) ∈ c(Θ) and is in general not η̂.

Remark 2.3.2. Unfortunately strict concavity of l_x is not inherited by curved exponential families, and unicity can be lost; take c not one-to-one, for instance.
The proof of Theorem 2.3.3 is sketched in the problems.

Example 2.3.5. Gaussian with Fixed Signal-to-Noise Ratio. Suppose X_1, ..., X_n are i.i.d. N(μ, σ²) with μ/σ = λ_0 > 0 known and μ > 0. Then p(x, μ) is a curved exponential family of the form (2.3.6) with

c_1(μ) = λ_0²/μ,  c_2(μ) = −λ_0²/2μ²,

corresponding to η_1 = μ/σ², η_2 = −1/2σ². Evidently

c(Θ) = {(η_1, η_2) : η_1 > 0, η_2 = −η_1²/2λ_0²},

which is closed in E = {(η_1, η_2) : η_1 ∈ R, η_2 < 0}. As a consequence of Theorems 2.3.2 and 2.3.3, we can conclude that an MLE μ̂ always exists and satisfies (2.3.7) if n ≥ 2. Equation (2.3.7), with μ̂_2 = n^{-1} Σ X_i², simplifies to

μ² + λ_0² X̄ μ − λ_0² μ̂_2 = 0,

whose roots are

μ̂_± = (1/2)[ −λ_0² X̄ ± (λ_0^4 X̄² + 4 λ_0² μ̂_2)^{1/2} ].

Note that μ̂_+ μ̂_− = −λ_0² μ̂_2 < 0, which implies μ̂_+ > 0. Because μ > 0, the solution we seek is μ̂_+. □
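As a quick check of Example 2.3.5, the following sketch (not part of the text; Python assumed) computes μ̂_+ from the quadratic above and compares it with a grid maximization of the constrained likelihood.

```python
# Sketch: MLE of mu when sigma = mu/lambda0 is known, Example 2.3.5.
import numpy as np

def mu_hat_plus(x, lam0):
    x = np.asarray(x, dtype=float)
    xbar, mu2 = x.mean(), np.mean(x ** 2)
    disc = lam0 ** 4 * xbar ** 2 + 4.0 * lam0 ** 2 * mu2
    return 0.5 * (-lam0 ** 2 * xbar + np.sqrt(disc))   # positive root of the quadratic

def loglik(mu, x, lam0):
    sigma2 = (mu / lam0) ** 2
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

rng = np.random.default_rng(0)
lam0, mu_true = 2.0, 3.0
x = rng.normal(mu_true, mu_true / lam0, size=100)
mu_mle = mu_hat_plus(x, lam0)
grid = np.linspace(0.1, 10.0, 2000)
mu_grid = grid[np.argmax([loglik(m, x, lam0) for m in grid])]
print(mu_mle, mu_grid)   # the two values should agree to grid resolution
```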
Example 2.3.6. Location-Scale Regression. Suppose that Y_{j1}, ..., Y_{jm}, j = 1, ..., n, are n independent random samples, where Y_{jl} ~ N(μ_j, σ_j²), l = 1, ..., m. Using the Gaussian exponential family structure of Example 2.3.2, we see that the distribution of {Y_{jl} : j = 1, ..., n, l = 1, ..., m} is a 2n-parameter canonical exponential family with

η_j = μ_j/σ_j²,  η_{n+j} = −1/2σ_j²,  j = 1, ..., n,

generated by h(Y) = 1 and

T(Y) = ( Σ_{l=1}^m Y_{1l}, ..., Σ_{l=1}^m Y_{nl}, Σ_{l=1}^m Y_{1l}², ..., Σ_{l=1}^m Y_{nl}² )^T.

Next suppose that

μ_i = θ_1 + θ_2 z_i,  σ_i² = θ_3(θ_1 + θ_2 z_i)²,  i = 1, ..., n,

where z_1 < ... < z_n are given constants. This is a curved exponential family of the form (2.3.6) with parameter θ = (θ_1, θ_2, θ_3)^T. Let E be the canonical parameter set for the full model and let Θ = {θ : θ_1 ∈ R, θ_2 ∈ R, θ_3 > 0}. If m ≥ 2, the full 2n-parameter model satisfies the conditions of Theorem 2.3.1, c(Θ) is closed in E, and we can conclude that for m ≥ 2 an MLE θ̂ of θ exists and θ̂ satisfies (2.3.7). □

Summary. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with E open (Theorem 2.3.1 and Corollary 2.3.1). These results lead to a necessary condition for existence of the MLE in curved exponential families, but without a guarantee of unicity or sufficiency.

2.4 ALGORITHMIC ISSUES

As we have seen, even in the context of canonical multiparameter exponential families, MLEs may not be given explicitly by formulae but only implicitly as the solutions of systems of nonlinear equations. In fact, even in the classical regression model with design matrix Z_D of full rank d, the formula for β̂ is easy to write down symbolically but not easy to evaluate if d is at all large, because forming Z_D^T Z_D requires on the order of nd² operations (each of its d(d + 1)/2 terms takes n operations) and then, if implemented as usual, on the order of d³ operations to invert. The packages that produce least squares estimates do not in fact use that formula directly.

It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. However, in this section, we will discuss three algorithms of a type used in different statistical contexts, both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all, at various times, entrust ourselves. We begin with the bisection and coordinate ascent methods, which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2.3.1. Finally, the basic property making Theorem 2.3.1 work, strict concavity, is isolated and shown to apply to a broader class of models.

2.4.1 The Method of Bisection

The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in k-parameter exponential families. Given f continuous on (a, b), f strictly increasing, with f(a+) < 0 < f(b−), then, by the intermediate value theorem, there exists a unique x* ∈ (a, b) such that f(x*) = 0. Here, in pseudocode, is the bisection algorithm to find x*.

Given tolerance ε > 0 for |x_final − x*|: find x_0 < x_1 with f(x_0) < 0 < f(x_1) by taking |x_0|, |x_1| large enough. Initialize x⁻_old = x_0, x⁺_old = x_1.
(1) If |x⁺_old − x⁻_old| < 2ε, set x_final = (x⁺_old + x⁻_old)/2 and return x_final.
(2) Else, x_new = (x⁺_old + x⁻_old)/2.
(3) If f(x_new) = 0, x_final = x_new.
(4) If f(x_new) < 0, x⁻_old = x_new.
(5) If f(x_new) > 0, x⁺_old = x_new. Go to (1).
End

Lemma 2.4.1. The bisection algorithm stops at a solution x_final such that |x_final − x*| < ε.

Proof. If x_m is the mth iterate of x_new, then x⁻_m < x* < x⁺_m for all m and |x_{m+1} − x*| ≤ |x⁺_m − x⁻_m| ≤ 2^{-m}|x_1 − x_0|. Therefore, |x_final − x*| < ε once m ≥ log_2(|x_1 − x_0|/ε). □

If desired one could evidently also arrange it so that, in addition, |f(x_final)| < ε. From this lemma we can deduce the following.

Theorem 2.4.1. Let p(x | η) be a one-parameter canonical exponential family generated by (T, h), satisfying the conditions of Theorem 2.3.1, and suppose T = t_0 ∈ C_T°, the interior (a, b) of the convex support of P_T. Then the MLE η̂, which exists and is unique by Theorem 2.3.1, may be found (to tolerance ε) by the method of bisection applied to f(η) = E_η T(X) − t_0.

Proof. f'(η) = Var_η T(X) > 0 for all η (Section 1.6), so that f is strictly increasing and continuous, and f necessarily takes both negative and positive values because t_0 lies in the interior of the convex support; thus the bisection method applies. □

Example 2.4.1. The Shape Parameter Gamma Family. Let X_1, ..., X_n be i.i.d. with the Γ(θ, 1) density f(x, θ) = x^{θ−1} e^{-x}/Γ(θ), x > 0, θ > 0. Because T(X) = Σ_1^n log X_i has a density for all n, the MLE always exists. It solves the equation

Γ'(θ)/Γ(θ) = T(X)/n,   (2.4.1)

which by Theorem 2.4.1 can be evaluated by bisection. This example points to another hidden difficulty: the function Γ(θ) = ∫_0^∞ x^{θ−1} e^{-x} dx needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method. However, it is in fact available to high precision in standard packages such as NAG or MATLAB; indeed, bisection itself is a defined function in some packages. □
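The pseudocode above translates directly into the following sketch (not part of the text; Python with SciPy's digamma assumed), applied here to the shape-parameter equation (2.4.1).

```python
# Sketch: bisection for an increasing continuous f with f(x0) < 0 < f(x1),
# applied to digamma(theta) = mean(log X_i), i.e. equation (2.4.1).
import numpy as np
from scipy.special import digamma

def bisection(f, x0, x1, eps):
    lo, hi = x0, x1
    while hi - lo >= 2 * eps:
        mid = 0.5 * (lo + hi)
        val = f(mid)
        if val == 0:
            return mid
        elif val < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def shape_mle(x, eps=1e-10):
    target = np.mean(np.log(x))
    f = lambda theta: digamma(theta) - target   # strictly increasing in theta
    lo, hi = 1e-8, 1.0
    while f(lo) > 0:                             # expand the bracket if necessary
        lo /= 2.0
    while f(hi) < 0:
        hi *= 2.0
    return bisection(f, lo, hi, eps)
```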
2.4.2 Coordinate Ascent

The problem we consider is to solve numerically, for a canonical k-parameter exponential family,

E_η(T(X)) = Ȧ(η) = t_0,

when the MLE η̂ = η̂(t_0) exists.

The case k = 1: see Theorem 2.4.1.

The general case: Initialize η̂⁰ = (η̂_1⁰, ..., η̂_k⁰). Solve ∂A/∂η_1(η_1, η̂_2⁰, ..., η̂_k⁰) = t_1 for η_1 and set η̂^{01} = (η̂_1¹, η̂_2⁰, ..., η̂_k⁰); next solve ∂A/∂η_2(η̂_1¹, η_2, η̂_3⁰, ..., η̂_k⁰) = t_2 for η_2 and set η̂^{02} = (η̂_1¹, η̂_2¹, η̂_3⁰, ..., η̂_k⁰); and so on, finally obtaining η̂^{0k} ≡ η̂^{(1)} = (η̂_1¹, ..., η̂_k¹). Repeat the cycle, getting η̂^{(r)} in cycle r, and stop, possibly in midcycle, as soon as successive values change by less than a prescribed amount.

Notes: (1) In practice we would again set a tolerance, say ε, for each of the η̂'s in cycle j and stop possibly in midcycle as soon as successive values agree to within ε. (2) Each equation to be solved is the likelihood equation of a one-parameter exponential family: its left-hand side is the expectation of T_l(X) in the model with all parameters save η_l assumed known, so each step can be carried out by the method of bisection (Theorem 2.4.1). The resulting algorithm is slow, but, as we shall see, it always converges to η̂.

Theorem 2.4.2. Suppose the conditions of Theorem 2.3.1 hold and t_0 ∈ C_T°, so that the MLE η̂ exists and is unique. Then η̂^{(r)} → η̂ as r → ∞.

Proof. Let l(η) = t_0^T η − A(η) + log h(x), the log likelihood. We give a series of steps.
(1) l(η̂^{ij}) is nondecreasing in j for i fixed and in i, because each step of the algorithm is a maximization of l over one coordinate.
(2) The sequence (η̂^{i1}, ..., η̂^{ik}) has a convergent subsequence in Ē × ... × Ē with limit (η̄^1, ..., η̄^k), say. If l(η̂^{ij}) → −∞ for some j, the monotonicity in (1) would be contradicted; hence lim_i l(η̂^{ij}) = λ (say) exists and is > −∞.
(3) l(η̄^j) = λ for all j because the sequence of likelihoods is monotone.
(4) (∂l/∂η_j)(η̄^j) = 0 because (∂l/∂η_j)(η̂^{ij}) = 0 for every i.
(5) η̄^j and η̄^{j+1} differ in only one coordinate, for which η̄^{j+1} maximizes l.
(6) By (4) and (5), η̄^1 = ... = η̄^k ≡ η̄ and (∂l/∂η_j)(η̄) = 0, 1 ≤ j ≤ k; here we use the strict concavity of l. But η̄ ∈ E, so η̄ satisfies the likelihood equations and, because the MLE is unique, η̄ = η̂.
To complete the proof, notice that if η̂^{(r_k)} is any subsequence of η̂^{(r)} that converges to η' (say), then l(η') = λ and, hence, η' = η̂. By a standard argument it follows that η̂^{(r)} → η̂ as r → ∞. □

It is natural to ask what happens if, in fact, the MLE η̂ doesn't exist, that is, if t_0 ∉ C_T°. Fortunately, in these cases the algorithm, as it should, refuses to converge (in η space!); see the problems.

Example 2.4.2. The Two-Parameter Gamma Family (continued). We use the notation of Example 2.3.4. For n ≥ 2 we know the MLE exists. We can initialize with the method of moments estimate, λ̂^{(0)} = X̄/σ̂² with σ̂² = n^{-1} Σ(X_i − X̄)², then use bisection to solve

Γ'(p^{(1)})/Γ(p^{(1)}) = (log X)bar + log λ̂^{(0)}

for p^{(1)}, set λ̂^{(1)} = p^{(1)}/X̄, and iterate. This two-dimensional problem is essentially no harder than the one-dimensional problem of Example 2.4.1, because the equation leading to λ_new given p_old, namely (2.3.5), is computationally explicit and simple. □
..1 illustrates the j process. The graph shows log likelihood contours.Section 2. iterate and proceed.1..4. values of (B 1 .B2 )T where the log likelihood is constant. . At each stage with one coordinate fixed.. e ~ 3 2 I o o 1 2 3 Figure 2.4. find that member of the family of contours to which the vertical (or horizontal) line is tangent. that is. r = k. See also Problem 2. Next consider the setting of Proposition 2. .7.4.3. Figure 2. If 8(x) exists and Ix is differentiable. the log likelihood for 8 E open C RP.1 are not close to sphericaL It can be speeded up at the cost of further computation by Newton's method. and Holland (1975).4. is strictly concave.4. Change other coordinates accordingly. = dr = 1.. .10. .. Solve g~: (Bt. The coordinate ascent algorithm can be slow if the contours in Figure 2.p. .B~) = 0 by the method of j bisection in B to get OJ for j = 1. 1 B..4. Then it is easy to see that Theorem 2. each of whose members can be evaluated easily.2 has a generalization with cycles of length r. Feinberg. the method extends straightforwardly. and Problems 2.1 in which Ix(O). which we now sketch.92.. The coordinate ascent algorithm.4 Algorithmic Issues 131 just discussed has d 1 = . BJ+l' . 1' B .4. for instance. A special case of this is the famous DemingStephan proportional fitting of contingency tables algorithmsee Bishop.
2.4.3 The Newton-Raphson Algorithm

An algorithm that, in general, can be shown to be faster than coordinate ascent, when it converges, is the Newton-Raphson method. Here is the method: if η̂_old is the current value of the algorithm, then

η̂_new = η̂_old − Ä^{-1}(η̂_old)( Ȧ(η̂_old) − t_0 ).   (2.4.2)

The rationale here is simple. If η̂_old is close to the root η̂ of Ȧ(η) = t_0, then by expanding Ȧ(η̂) around η̂_old we obtain

t_0 = Ȧ(η̂) ≈ Ȧ(η̂_old) + Ä(η̂_old)(η̂ − η̂_old),

and η̂_new is the solution for η of the approximating equation given by the right- and left-hand sides. If η̂_old is close enough to η̂, this method is known to converge to η̂ at a faster rate than coordinate ascent; see Dahlquist, Bjork, and Anderson (1974). This method requires computation of the inverse of the Hessian, which may counterbalance its advantage in speed of convergence when it does converge. A hybrid of the two methods that always converges and shares the increased speed of the Newton-Raphson method is given in the problems.

The Newton-Raphson algorithm has the property that for large n, η̂_new after only one step behaves approximately like the MLE. We return to this property in Chapter 6.

Newton's method also extends to the framework of Proposition 2.3.1. In this case, if l(θ) denotes the log likelihood, the argument that led to (2.4.2) gives

θ̂_new = θ̂_old − [ l̈(θ̂_old) ]^{-1} l̇(θ̂_old).   (2.4.3)

When likelihoods are nonconcave, methods such as bisection, coordinate ascent, and Newton-Raphson are still employed, though there is a distinct possibility of nonconvergence or of convergence to a local rather than global maximum. A one-dimensional problem in which such difficulties arise is given in the problems.

Example 2.4.3. Let X_1, ..., X_n be a sample from the logistic distribution with d.f.

F(x, θ) = [1 + exp{−(x − θ)}]^{-1},  x ∈ R, θ ∈ R.

The density is

f(x, θ) = exp{−(x − θ)} / [1 + exp{−(x − θ)}]².

We find

l'(θ) = n − 2 Σ_{i=1}^n exp{−(X_i − θ)} F(X_i, θ),  l''(θ) = −2 Σ_{i=1}^n f(X_i, θ) < 0,

so the log likelihood is strictly concave. In this case, the Newton-Raphson method can be implemented by taking θ̂_old = X̄. □
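The following sketch (not part of the text; Python assumed) implements the Newton-Raphson iteration of Example 2.4.3, using l'(θ) and l''(θ) as computed above and the starting point θ̂_old = X̄.

```python
# Sketch: Newton-Raphson for the logistic location model.
# l'(theta) = sum(2F - 1) and l''(theta) = -2*sum(F(1-F)), with F the logistic d.f.
import numpy as np

def F(x, theta):
    return 1.0 / (1.0 + np.exp(-(x - theta)))

def logistic_mle(x, tol=1e-10, max_iter=100):
    x = np.asarray(x, dtype=float)
    theta = x.mean()                          # theta_old = xbar
    for _ in range(max_iter):
        p = F(x, theta)
        score = np.sum(2.0 * p - 1.0)         # l'(theta)
        hessian = -2.0 * np.sum(p * (1.0 - p))  # l''(theta) < 0
        theta_new = theta - score / hessian
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```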
Many examples and important issues and methods connected with these algorithms are discussed, for instance, in Chapter 6 of Dahlquist, Bjork, and Anderson (1974).

2.4.4 The EM (Expectation/Maximization) Algorithm

There are many models that have the following structure. There are ideal observations, X ~ P_θ with density p(x, θ), θ ∈ Θ ⊂ R^d. Their log likelihood l_{p,x}(θ) is "easy" to maximize: say there is a closed-form MLE, or at least l_{p,x}(θ) is concave in θ. What is observed, however, is not X but S = S(X) ~ Q_θ with density q(s, θ), and the log likelihood l_{q,s}(θ) = log q(s, θ) is difficult to compute and difficult to maximize. A fruitful way of thinking of such problems is in terms of S as representing part of X; the rest of X is "missing" and its "reconstruction" is part of the process of estimating θ by maximum likelihood. Yet the EM algorithm, with an appropriate starting point, leads us to an MLE if it exists in both cases. The algorithm was formalized with many examples in Dempster, Laird, and Rubin (1977), though an earlier general form goes back to Baum, Petrie, Soules, and Weiss (1970). We give a few examples of situations of the foregoing type in which it is used, and its main properties. For detailed discussion we refer to Little and Rubin (1987) and MacLachlan and Krishnan (1997). A prototypical example follows.

Example 2.4.4. Lumped Hardy-Weinberg Data. As in the Hardy-Weinberg examples of Sections 2.1 and 2.2, let X_i = (ε_{i1}, ε_{i2}, ε_{i3}), 1 ≤ i ≤ n, be a sample from a population in Hardy-Weinberg equilibrium for a two-allele locus, where

P_θ[X_i = (1, 0, 0)] = θ²,  P_θ[X_i = (0, 1, 0)] = 2θ(1 − θ),  P_θ[X_i = (0, 0, 1)] = (1 − θ)²,  0 < θ < 1.

The log likelihood of X = (X_1, ..., X_n) is easy to maximize. Suppose, however, that what is observed is S = S(X) where

S_i = X_i, 1 ≤ i ≤ m;  S_i = (ε_{i1} + ε_{i2}, ε_{i3}), m + 1 ≤ i ≤ n.   (2.4.4)

This could happen if, for some individuals, the homozygotes of one type (ε_{i1} = 1) could not be distinguished from the heterozygotes (ε_{i2} = 1). The log likelihood of S now is

l_{q,S}(θ) = Σ_{i=1}^m [ 2ε_{i1} log θ + ε_{i2} log 2θ(1 − θ) + 2ε_{i3} log(1 − θ) ]
           + Σ_{i=m+1}^n [ (ε_{i1} + ε_{i2}) log(1 − (1 − θ)²) + 2ε_{i3} log(1 − θ) ],   (2.4.5)

a function that is of curved exponential family form. It does turn out that in this simplest case an explicit maximum likelihood solution is still possible, but the computation is clearly not as simple as in the original Hardy-Weinberg canonical exponential family example. □
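For concreteness, here is a small sketch (not part of the text; Python assumed) of the observed-data log likelihood (2.4.5) written in terms of counts; it can be maximized directly on a grid, which is one way to see that the answer, while computable, is no longer available by inspection.

```python
# Sketch: the observed-data log likelihood (2.4.5) as a function of theta.
# n1, n2, n3 are genotype counts among the m fully observed cases; n12_lumped
# counts lumped cases with eps_i1 + eps_i2 = 1 and n3_lumped those with eps_i3 = 1.
import numpy as np

def lumped_hw_loglik(theta, n1, n2, n3, n12_lumped, n3_lumped):
    return (2 * n1 * np.log(theta)
            + n2 * np.log(2 * theta * (1 - theta))
            + 2 * n3 * np.log(1 - theta)
            + n12_lumped * np.log(1 - (1 - theta) ** 2)
            + 2 * n3_lumped * np.log(1 - theta))

# Example of a direct grid maximization (counts are illustrative only):
grid = np.linspace(1e-4, 1 - 1e-4, 100000)
theta_hat = grid[np.argmax(lumped_hw_loglik(grid, 10, 20, 30, 15, 25))]
```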
Here is another important example.

Example 2.4.5. Mixture of Gaussians. Suppose S_1, ..., S_n is a sample from a population P whose density is modeled as a mixture of two Gaussian densities,

p(s, θ) = (1 − λ) φ_{σ1}(s − μ_1) + λ φ_{σ2}(s − μ_2),

where θ = (λ, μ_1, μ_2, σ_1², σ_2²), 0 < λ < 1, μ_1, μ_2 ∈ R, σ_1², σ_2² > 0, and φ_σ(s) = σ^{-1} φ(s/σ) with φ the standard normal density. This five-parameter model is very rich, permitting up to two modes and scales. It is not obvious that this falls under our scheme, but let Δ_1, ..., Δ_n be independent, identically distributed with

P_θ[Δ_i = 1] = λ = 1 − P_θ[Δ_i = 0],   (2.4.6)

and suppose that, given Δ = (Δ_1, ..., Δ_n), the S_i are independent with

L_θ(S_i | Δ) = L_θ(S_i | Δ_i) = N( Δ_i μ_2 + (1 − Δ_i) μ_1, Δ_i σ_2² + (1 − Δ_i) σ_1² ).

That is, Δ_i tells us whether to sample from N(μ_1, σ_1²) or N(μ_2, σ_2²). It is easy to see (Problem 2.4.11) that, under θ, S then has the marginal distribution given previously, so we can think of S as S(X) with X = ((S_1, Δ_1), ..., (S_n, Δ_n)). The log likelihood of S can have a number of local maxima and can tend to ∞ as θ tends to the boundary of the parameter space (Problem 2.4.11). Although MLEs do not exist in these models, a local maximum close to the true θ_0 turns out to be a good "proxy" for the nonexistent MLE, and the EM algorithm can lead to such a local maximum. □

As we shall see, in important situations, including the examples we have given, the M step is easy and the E step doable.

The EM Algorithm. Here is the algorithm. Let

J(θ | θ_0) = E_{θ_0}( log [ p(X, θ)/p(X, θ_0) ] | S(X) = s ),   (2.4.7)

where we suppress dependence on s. Initialize with θ_old = θ_0. The first (E) step of the algorithm is to compute J(θ | θ_old) for as many values of θ as needed; if this is difficult, the EM algorithm is probably not suitable. The second (M) step is to maximize J(θ | θ_old) as a function of θ; again, if this step is difficult, EM is not particularly appropriate. Then we set θ_new = arg max J(θ | θ_old), reset θ_old = θ_new, and repeat the process.

The rationale behind the algorithm lies in the following formulas, which we give for θ real and which can be justified easily in the case that X is finite (Problem 2.4.12):

q(s, θ)/q(s, θ_0) = E_{θ_0}( p(X, θ)/p(X, θ_0) | S(X) = s )   (2.4.8)

and

∂/∂θ log q(s, θ) |_{θ=θ_0} = E_{θ_0}( ∂/∂θ log p(X, θ) |_{θ=θ_0} | S(X) = s )   (2.4.9)

for all θ_0 (under suitable regularity conditions). Note that (2.4.9) follows from (2.4.8) by taking logs in (2.4.8), differentiating, and exchanging E_{θ_0} and differentiation with respect to θ at θ_0. Differentiating J(θ | θ_0) and using (2.4.9) we obtain

∂J(θ | θ_0)/∂θ |_{θ=θ_0} = ∂/∂θ log q(s, θ) |_{θ=θ_0},   (2.4.10)

and it follows that, formally, a fixed point θ̄ of the algorithm satisfies the likelihood equation

∂/∂θ log q(s, θ̄) = 0.   (2.4.11)
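Here is a sketch (not part of the text; Python assumed) of one E-step/M-step pass for the mixture model of Example 2.4.5: the E step computes P_{θ_old}(Δ_i = 1 | S_i), and the M step, maximizing J(θ | θ_old), reduces to weighted means and variances.

```python
# Sketch: one EM pass for a two-component Gaussian mixture with weight lam on
# the (mu2, var2) component, matching the convention P(Delta_i = 1) = lam above.
import numpy as np

def normal_pdf(s, mu, var):
    return np.exp(-(s - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_step(s, lam, mu1, var1, mu2, var2):
    # E step: posterior probabilities w_i = P(Delta_i = 1 | S_i) under theta_old
    num = lam * normal_pdf(s, mu2, var2)
    w = num / ((1 - lam) * normal_pdf(s, mu1, var1) + num)
    # M step: weighted maximum likelihood updates
    lam_new = w.mean()
    mu1_new = np.sum((1 - w) * s) / np.sum(1 - w)
    mu2_new = np.sum(w * s) / np.sum(w)
    var1_new = np.sum((1 - w) * (s - mu1_new) ** 2) / np.sum(1 - w)
    var2_new = np.sum(w * (s - mu2_new) ** 2) / np.sum(w)
    return lam_new, mu1_new, var1_new, mu2_new, var2_new

def em(s, theta0, n_iter=200):
    s = np.asarray(s, dtype=float)
    theta = theta0
    for _ in range(n_iter):
        theta = em_step(s, *theta)
    return theta
```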
(2.0 ) I S(X) ~ 8 . Suppose {Po : () E e} is a canonical exponential family generated by (T.4.4. We give the proof in the discrete case. Then (2. q S. Lemma 2.1.00Id) by Shannon's ineqnality.17) o The most important and revealing special case of this lemma follows. uold 0 r s.14) whete r(· j ·.1. Fat x EX.Onew) {r(X I 8 .10) [)J(O I 00 ) DO it follows that a fixed point 0 of the algorithm satisfies the likelihood equation.11)  D IOgq(8. On the other hand. However.4.Onew) E'Old { log r(X I 8. 0 0 = 0old' = Onew.3. formally. Theorem 2.0) = O.o log . O n e w ) } log ( " ) ~ J(Onew I 0old) . (2.0) } ~ log q(s. hence.4.O)r(x I 8.13) iff the conditional distribution of X given S(X) forOnew asfor 00ld and Onew maximizes J(O I 0old)' =8 is the same Proof.4.4 Algorithmic Issues 135 to 0 at 00 . I S(X) ~ s > 0 } (2.0) (2. 0old)' Equality holds in (2. (2..2. Id log (X I . Let SeX) be any statistic.0) is the conditional frequency function of X given S(X) J(O I 00 ) If 00 = s.3.4.4. Lemma 2. 00ld are as defined earlier and S(X) (2.(X 18.O) { " ( X I 8. S (x) ~ 8 p(x. J(Onew I 0old) > J(Oold I 0old) = 0 by definition of Onew. ~ E '0 (~I ogp(X . DJ(O I 00 ) [)O and. Because.h) satisfying the conditions a/Theorem 2.1.13) q(8.Section 2.16) ° q(s. then . the result holds whenever the quantities in J(O I (0 ) can be defined in a reasonable fashion. 0) I S(X) = 8 ) DO (2. 0) ~ q(s." ) I S(X) ~ 8 .12) = s.4.E. r(Xj8.4. uold Now.lfOnew.4. Onew) > q(s. 0 ) +E.4.15) q(s. In the discrete case we appeal to the product rule. DO The main reason the algorithm behaves well follows.
(a) The EM algorithm consists of the alternation

Ȧ(θ_new) = E_{θ_old}( T(X) | S(X) = s ),   (2.4.18)

θ_old = θ_new.   (2.4.19)

If a solution of (2.4.18) exists, it is necessarily unique.

(b) If the sequence of iterates {θ̂_m} so obtained is bounded and the equation

Ȧ(θ) = E_θ( T(X) | S(X) = s )   (2.4.20)

has a unique solution, then it converges to a limit θ*, which is necessarily a local maximum of q(s, θ).

Proof. In this case,

J(θ | θ_0) = E_{θ_0}{ (θ − θ_0)^T T(X) − (A(θ) − A(θ_0)) | S(X) = s }
           = (θ − θ_0)^T E_{θ_0}( T(X) | S(X) = s ) − (A(θ) − A(θ_0)).   (2.4.21)

Because A is strictly convex, θ ↦ J(θ | θ_0) is strictly concave with derivative E_{θ_0}(T(X) | S(X) = s) − Ȧ(θ), and part (a) follows. Part (b) is more difficult; a proof due to Wu (1983) is sketched in Problem 2.4.16. □

Example 2.4.4 (continued). In this example, X is distributed according to the exponential family

p(x, θ) = exp{ η (2N_{1n}(x) + N_{2n}(x)) − A(η) } h(x),

where η = log(θ/(1 − θ)), N_{jn}(x) = Σ_{i=1}^n ε_{ij}(x_i), 1 ≤ j ≤ 3, h(x) = 2^{N_{2n}(x)}, A(η) = 2n log(1 + e^η), and Ȧ(η) = 2nθ. Under the assumption that the process that causes lumping is independent of the values of the ε_{ij}, we have, for m + 1 ≤ i ≤ n,

P_θ[ε_{ij} = 1 | ε_{i1} + ε_{i2} = 0] = 0,  1 ≤ j ≤ 2,
P_θ[ε_{i1} = 1 | ε_{i1} + ε_{i2} = 1] = θ²/[θ² + 2θ(1 − θ)] = θ/(2 − θ),
P_θ[ε_{i2} = 1 | ε_{i1} + ε_{i2} = 1] = 2(1 − θ)/(2 − θ).

Thus, with N_{jm} = Σ_{i=1}^m ε_{ij} and M_n = Σ_{i=m+1}^n (ε_{i1} + ε_{i2}), we obtain, after some simplification,

E_θ( 2N_{1n} + N_{2n} | S ) = 2N_{1m} + N_{2m} + 2M_n/(2 − θ).

Because Ȧ(η) = 2nθ, the EM iteration (2.4.18) is

θ̂_new = (2N_{1m} + N_{2m})/2n + M_n / [ n(2 − θ̂_old) ].   (2.4.26)

It may be shown directly (Problem 2.4.12) that if 2N_{1m} + N_{2m} > 0 and M_n > 0, then θ̂_m converges to the unique root in (0, 1) of the fixed-point (quadratic) equation

2nθ(2 − θ) = (2N_{1m} + N_{2m})(2 − θ) + 2M_n,

which is indeed the MLE of θ when S is observed. □
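The iteration (2.4.26) is a one-line fixed-point update; the following sketch (not part of the text; the counts in the usage comment are made up for illustration) runs it to a fixed number of iterations.

```python
# Sketch: the EM fixed-point iteration (2.4.26) for lumped Hardy-Weinberg data.
# n1m, n2m are N_1m, N_2m; mn is M_n; n is the total sample size.
def hw_em(n1m, n2m, mn, n, theta0=0.5, n_iter=100):
    theta = theta0
    for _ in range(n_iter):
        theta = (2 * n1m + n2m) / (2 * n) + mn / (n * (2 - theta))
    return theta

# Illustration: m = 60 fully classified individuals with genotype counts 20, 25, 15,
# and 30 of the 40 lumped individuals showing eps_i1 + eps_i2 = 1, out of n = 100:
# theta_hat = hw_em(n1m=20, n2m=25, mn=30, n=100)
```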
Example 2.4.6. Let (Z_1, Y_1), ..., (Z_n, Y_n) be i.i.d. as (Z, Y) ~ N(μ_1, μ_2, σ_1², σ_2², ρ), where θ = (μ_1, μ_2, σ_1², σ_2², ρ). Suppose that some of the Z_i and some of the Y_i are missing as follows: for 1 ≤ i ≤ n_1 we observe both Z_i and Y_i, for n_1 + 1 ≤ i ≤ n_2 we observe only Z_i, and for n_2 + 1 ≤ i ≤ n we observe only Y_i. The observed data are

S = {(Z_i, Y_i) : 1 ≤ i ≤ n_1} ∪ {Z_i : n_1 + 1 ≤ i ≤ n_2} ∪ {Y_i : n_2 + 1 ≤ i ≤ n}.

In this case a set of sufficient statistics for the complete data is

T_1 = Z̄,  T_2 = Ȳ,  T_3 = n^{-1} Σ Z_i²,  T_4 = n^{-1} Σ Y_i²,  T_5 = n^{-1} Σ Z_i Y_i.

To compute E_θ(T | S = s), we note that for the cases with Z_i and/or Y_i observed, the conditional expected values equal their observed values. For the other cases we use the properties of the bivariate normal distribution (Appendix B.4 and Section 1.4) to conclude

E_θ(Y_i | Z_i) = μ_2 + ρσ_2(Z_i − μ_1)/σ_1,
E_θ(Y_i² | Z_i) = (1 − ρ²)σ_2² + [μ_2 + ρσ_2(Z_i − μ_1)/σ_1]²,
E_θ(Z_i Y_i | Z_i) = [μ_2 + ρσ_2(Z_i − μ_1)/σ_1] Z_i,

with the corresponding Z on Y regression equations when conditioning on Y_i. This completes the E-step. For the M-step we compute the complete-data MLE at the imputed statistics and find that the M-step produces

μ_{1,new} = T_1(θ_old),  μ_{2,new} = T_2(θ_old),
σ²_{1,new} = T_3(θ_old) − T_1²(θ_old),  σ²_{2,new} = T_4(θ_old) − T_2²(θ_old),
ρ_new = [T_5(θ_old) − T_1(θ_old)T_2(θ_old)] / {[T_3(θ_old) − T_1²(θ_old)][T_4(θ_old) − T_2²(θ_old)]}^{1/2},

where T_j(θ) denotes T_j with missing values replaced by the values computed in the E-step, and T_j = T_j(θ_old), j = 1, ..., 5. We take θ_old = θ_MOM, where θ_MOM is the method of moments estimate (μ̂_1, μ̂_2, σ̂_1², σ̂_2², r) of θ based on the observed data (Problem 2.1.8). Now the process is repeated with θ_MOM replaced by θ̂_new. □
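A compact sketch (not part of the text; Python assumed, with hypothetical argument names) of one EM iteration for Example 2.4.6: the E step imputes the conditional moments above, and the M step evaluates the complete-data MLE at the imputed sufficient statistics.

```python
# Sketch: one EM iteration for bivariate normal data with values missing in one
# coordinate. zy: fully observed (z, y) pairs; z_only, y_only: partially observed cases.
import numpy as np

def em_step(theta, zy, z_only, y_only):
    mu1, mu2, s1, s2, rho = theta            # s1, s2 are sigma_1^2, sigma_2^2
    n = len(zy) + len(z_only) + len(y_only)
    T = np.zeros(5)                          # running sums for T_1, ..., T_5
    for z, y in zy:
        T += [z, y, z * z, y * y, z * y]
    for z in z_only:                         # E step: moments of Y given Z = z
        ey = mu2 + rho * np.sqrt(s2 / s1) * (z - mu1)
        ey2 = (1 - rho ** 2) * s2 + ey ** 2
        T += [z, ey, z * z, ey2, z * ey]
    for y in y_only:                         # E step: moments of Z given Y = y
        ez = mu1 + rho * np.sqrt(s1 / s2) * (y - mu2)
        ez2 = (1 - rho ** 2) * s1 + ez ** 2
        T += [ez, y, ez2, y * y, ez * y]
    T /= n
    # M step: complete-data MLE evaluated at the imputed sufficient statistics
    mu1n, mu2n = T[0], T[1]
    s1n, s2n = T[2] - mu1n ** 2, T[3] - mu2n ** 2
    rhon = (T[4] - mu1n * mu2n) / np.sqrt(s1n * s2n)
    return mu1n, mu2n, s1n, s2n, rhon
```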
in general.1 with respective probabilities PI.2.0. Remark 2.. Summary.3 and the problems.0'2) based on the first two moments.4. E"IJ(O I 00)] is the KullbackLeiblerdivergence (2. Consider a population made up of three different types of individuals occurring in the HardyWeinberg proportions 02.1. . ~' ". (d) Find the method of moments estimate of the probability P(X1 will last at least a month. (a) Find the method of moments estimate of >.d. involves imputing missing values.1 1. distributions.0) and (I . Find the method of moments estimates of a = (0'1. in Section 2.i' I' I i. . X n have a beta.4. Note that if S(X) = X. &(>. use this algorithm as a building block for the general coordinate ascent algorithm.4.. . Now the process is repeated with B MOM replaced by ~ ~ ~ ~ 0new. j. B)/p(X. 0) is minimized. X n assumed to be independent and identically distributed with exponential.0'2) distribution. Important variants of and alternatives to this algorithm. (3( 0'1.B o)].4 we derive and discuss the important EM algorithm and its basic properties. .4. including the NewtonRaphson method.6. show that T 3 is a method of moment estimate of ().2. what is a frequency substitution estimate of the odds ratio B/(l. based on the second moment. 2B(1 . 2. where 0 < B < 1. We then.B)? . then J(B i Bo) is log[p(X.P3 given by the HardyWeinberg proportions. (c) Suppose X takes the values 1. based on the first moment.23). based On the first two moments..). By considering the first moment of X.138 Methods of Estimation Chapter 2 where T j (B) denotes T j with missing values replaced by the values computed in the Estep and T j = Tj(B o1d )' j = 1.4. Suppose that Li.2. . (c) Combine your answers to (a) and (b) to get a method of moment estimate of >..B)2. Finally in Section 2. j : > I) that one system 3... in the context of Example 2.. (a) Show that T 3 = N1/n + N 2 /2n is a frequency substitution estimate of e. (b) Find the method of moments estimate of>. respectively. .P2. ! I 2. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all oneparameter canonical exponential families with E open (when it exists).. 0 Because the Estep. (b) Using the estimate of (a). are discussed and introduced in Section 2. Also note that.. Consider n systems with failure times X I . which as a function of () is maximized where the contrast logp(X.5 PROBLEMS AND COMPLEMENTS Problems fnr Sectinn 2. which yields with certainty the MLEs in kparameter canonical exponential families with E open when it exists. the EM algorithm is often called multiple imputation. X I.
. ~ Hint: Express F in terms of p and F in terms of P ~() X = No.. < .... < X(n) be the order statistics of a sample Xl. of Xi n ~ =X. Show that these estimates coincide. ~  7.2. . Let (ZI.12.. X n . empirical substitution estimates coincides with frequency substitution estimates. (b) Exhibit method of moments estimates for VaroX = 8(1 . X 1l be the indicators of n Bernoulli trials with probability of success 8. . The jth cumulant '0' of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant Cj' Give the first three sample cumulants.. Give the details of this correspondence. Yi) such that Zi < sand Yi < t n .Section 2. . Yn ) be a set of independent and identically distributed random vectors with common distribution function F. ~ ~ ~ (a) Show that in the finite discrete case. t) is the bivariate empirical distribution function FCs. (See Problem B. Hint: Consi~r (N I . (b) Show that in the continuous case X ~ rv F means that X (c) Show that the empirical substitution estimate of the jth moment JLj is the jth sample moment JLj' Hinr: Write mj ~ f== xjdF(x) ormj = Ep(Xj) where XF. The natural estimate of F(s.8. (c) Argue that in this case all frequency substitution estimates of q(8) must agree with q(X). (Z2.. (a) Show that X is a method of moments estimate of 8.F(t. 8.) There is a Onetoone correspondence between the empirical distribution function ~ F and the order statistics in the sense that.. N 2 = n(F(t2J . Let Xl.t ) = Number of vectors (Zi. (d) FortI tk. < ~ ~ Nk+1 = n(l  F(tk)). 4.)). .2. .Xn be a sample from a population with distribution function F and frequency function or density p. Let X I.. ...5.findthejointfrequencyfunctionofF(tl).5 Problems and Complements 139 Hint: See Problem 8. = Xi with probability lin. we know the order statistics. . . . Nk+I) where N I ~ nF(t l ). The empirical distribution function F is defined by F(x) = [No.F(tk).. given the order statistics we may construct F and given P. Y2 ). 6.8)ln first using only the first moment and then using only the second moment of the population. . . of Xi < xl/n.. 5. t). See A. . (Zn. . If q(8) can be written in the fonn q(8) ~ s(F) for sOme function s of F we define the empirical substitution principle estimate of q( 8) to be s( F). Let X(l) < . . YI )..... which we define by ~ F ~( s..
2). Suppose X has possible values VI. There exists a compact set K such that for f3 in the complement of K.".4..1. 9. Suppose X = (X"". .81 tends to 00.!'r«(J)) I . k=l n  n .. as the corresponding characteristics of the distribution F.2 with X iiI and /is. . X n be LLd. (J E R d. I . z) I tends to 00 as \. 10. Zn and Y11 . {3) > c. as X ~ P(J.ZY. ~ i . . In Example 2. {3) is continuous on K. the result follows. .. Since p(X. z) is continuous in {3 and that Ig({3. Vi).. .. . . respectively. (b) Define the sample product moment of order (i.L.. : J • .' where Z. . The All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates.. . j). . Show that the least squares estimate exists. f(a. Hint: Set c = p(X. (See Problem 2. Y are the sample means of the sample correlation coefficient is given by Z11 •. Let X".Xn ) where the X. . . with (J identifiable.. L I. the sample covariance.Yn . ..140 Methods of Estimation Chapter 2 (a) Show that F(. (b) Construct an estimate of a using the estimate of part (a) and the equation a .1. I) = ".) is the distribution function of a probability P on R2 assigning mass lin to each point (Zi. " . Vk and that q(9) can be written as ec q«(J) = h(!'I«(J)"" . 1".Y) = Z)(Yk . Hint: See Problem B.). 11. Show that the sample product moment of order (i. suppose that g({3.IU9) that 1 < T < L . (c) Use the empirical substitution principle to COnstruct an estimate of cr using the relation E(IX. 0). are independent N"(O.j2.2. In Example 2. and so on. (a) Find an estimate of a 2 based on the second mOment..• .) Note that it follows from (A."ZkYk . = #.j) is given by ~ The sample covariance is given by n l n L (Zk k=l . p(X.. the sampIe correlation. >. find the method of moments estimate based on i 12.17.
i.6. . Let g. = h(PI(O). See Problem 1... A). . Suppose X 1.5. 1'(1.Pr(O)) . Vk and that q(O) for some Rkvalued function h.6.9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X)). where U.p(x. 1 < j < k. . in estimate. as X rv P(J.p:::'. Let 911 .d.d.l Lgj(Xi ). 1'(0..l and (72 are fixed. fir) can be written as a frequency plugin estimate. can you give a method of moments estimate of {3? .x (iv) Gamma. (a) Use E(X..0 > 0 14.. find the method of moments estimates (i) Beta. ~ (X.) to give a method of moments estimate of p.:::m:::.:::nt:::' ~__ 141 for some R k valued function h. . . In the fOllowing cases.. Iii = n.10). r. j i=l = 1. . Hint: Use Corollary 1. (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1. with (J E R d and (J identifiable. . to give a method of moments estimate of (72.. Suppose that X has possible values VI. Use E(U[). ... A). rep.O) ~ (x/O')exp(x'/20'). p fixed (v) Inverse Gaussian. X n are i..ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill .. . When the data are not i.Section 2. (b) Suppose P ~ po and I' = b are fixed.  1/' Po) / . General method of moment estimates(!).5 Problems and Com". .1. (c) If J. Show that the method of moments estimate q = h(j1!.1) (iii) Raleigh.. Consider the Gaussian AR(I) model of Example 1. > 0.(X) ~ Tj(X).36. 0)..• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments.1. 0 ~ (p.i. IG(p.0) (ii) Beta. 13. .6..
8. respectively. SI.) ... we define the empirical or sample moment to be j ~ .• l Yn are taken at times t 1. j > 0.. Multivariate method a/moments. ...... . Let 8" 8" and 83 denote the probabilities of S. I. b. Readings Y1 ..z.... + '2P4 + 2PS ..Y. n 1  Problems for Sectinn 2.q."" X iq ). .. 1 6 and let Pl ~ N j / n...S 1 . and IF. .. Y) and B = (ai.! .. Y) and (ai. I ... ' l X q ).ZY ~ . I • . 1 I ...z. >.. . k > 0.. II.. . k> Q mjkrs L. ..~b... and (J3.  1... n. P3 + "2PS + '2P6 PI are frequency plugin estimates of OJ. i = 1._ . I I .142 Methods of Estimation Chapter 2 15. FF. The HardyWeinberg model specifies that the six genotypes have probabilities L7=1 I Genotype Genotype Probability SS 2 II 8~ 3 FF 8j 4 SI 28.z. S = 1. of observations.. t n on the position of the object... = n 1 L:Z.•• Om) can be expressed as a function of the moments. Show that method of moments estimators of the parameters b1 and al in the best linear predictor are estimate ~ 8 of B is obtained L:Z.. where (Z.)] + [g(l3o. For independent identically distributed Xi = (XiI. I.g(I3....6). HardyWeinberg with six genotypes. 5 SF 28. bI) are as in Theorem 1. SF.. 17.. and F at one locus resulting in six genotypes labeled SS.. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S. 16.. where OJ = 1. Show that <j < i ! .. Let X = (Z. .bIZ.. .. 83 6 IF 28.1 " Xir Xk J. _ (Z)' ' a. Establish (2. .1"... 11P2 + "2P4 + '2P6 .. ()2. 1 q..J is' _ n n..83 8t Let N j be the number of plants of genotype j in a sample of n independent plants.1.. ... = Y . The reading Yi . An Object of unit mass is placed in a force field of unknown constant intensity 8..4.g(I3.g(l3o. I .3..  t=1 liB = (0 1 .z. the method of moments l by replacing mjkrs by mjkrs... . )... 1". let the moments be mjkrs = E(XtX~).. Hint: [y. 1 .)] = [Y.. Q.2 1... and F.)]. . For a vector X = (Xl..
.. nl + n2. .t of the best (MSPE) predictor of Yn + I ? 4.2 may be derived from Theorem 1. Show that the fonnulae of Example 2. (zn. . . 9... .. 3.)' 1+ B5 6.. Hint: Write the lines in the fonn (z. .. _.BJ .z. . in fact..I is to be taken at time Zn+l.. . . A new observation Ynl. . " 7 5. .). B) = Bc'x('+JJ.B.We suppose the oand be uncorrelated with constant variance. Y.. B > O." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi.5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l . . Find the least squares estimate of 0:.<) ~ . (a) f(x. (J2) variables. if we consider the distribution assigning mass l/n to each of the points (zJ. Yl).. 7.2.4)(2.Yi). What is the least squares estimate based on YI . < O. 13). Zn and that the linear regression model holds. 13). g(zn. (exponential density) (b) f(x. i = 1.(zn. . . Find the least squares estimates for the model Yi = 8 1 (2. . nl and Yi = 82 + ICi. .2. Suppose Yi ICl". . Yn).. (Pareto density) . Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants.. B > O.4. to have mean 2. Suppose that observations YI . .1. ~p ~(yj}) ~ . and 13 ranges over R d 8. all lie on aline.13 E R d } is closed. where €nlln2 are independent N(O.Section 2. (a) Let Y I .y. (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1. ... Yn have been taken at times Zl. i + 1. Find the line that minimizes the sum of the squared perpendicular distance to the same points. c cOnstant> 0. X n denote a sample from a population with one of the following densities Or frequency functions. Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi.. x > 0.6) under the restrictions BJ > 0. Let X I.3.5) provided that 9 is differentiable with respect to (3" 1 < i < d. . Find the least squares estimates of ()l and B . The regression line minimizes the sum of the squared vertical distances from the points (Zl. x > c. 2 10.. Find the MLE of 8..2. B) = Be'x. . . Show that the least squares estimate is always defined and satisfies the equations (2.4. Find the LSE of 8.. B. . i = 1. the range {g(zr. Hint: The quantity to be minimized is I:7J (y..n.Yn)...
then {e(w) : wEn} is a partition of e.. X n be a sample from a U[0 0 + I distribution. Suppose that Xl. Hint: Let e(w) = {O E e : q(O) = w}. 00 = I 0" exp{ (x .:r > 0. say e(w)... x> 0. c constant> 0. Define ry = h(8) and let f(x.  under onetoone transformations). X n • n > 2. ry) denote the density or frequency function of X in terms of T} (Le.144 Methods of Estimation Chapter 2 (c) f(".. = X and 8 2 ~ n.t and a 2 are both known to be nonnegative but otherwise unspecified. they are equivariant 16. x > 0. I i' 13. be independently and identically distributed with density f(x.l exp{ OX C } . . (Wei bull density) 11.5. Let X I. e c Let q be a map from e onto n. be a family of models for X E X C Rd. Suppose that T(X) is sufficient for 0 and ~at O(X) is an MLE of O. I I I (b) LetP = {PO: 0 E e}. 12. X n . iJ( VB. density) (e) fix. 0 > O. q(9) is an MLE of w = q(O). .2.2 are unknown. 0) = cO". Hint: Use the factorization theorem (Theorem 1. . I < k < p. 0 E e and let 0 denote the MLE of O. Show that depends on X through T(X) only provided that 0 is unique. ncR'. WEn . > O. . I). (72 Ii (a) Show that if J1 and 0.0"2) 15. . J1 E R. reparametrize the model using 1]). then the unique MLEs are (b) Suppose J. I • :: I Suppose that h is a onetoone function from e onto h(e).1).  e . 0 <x < I. Let X" .P > I.a)l rather than 0. (We write U[a. n > 2.r(c+<). (b) Find the maximum likelihood estimate of PelX l > tl for t > 1".0"2 (a) Find maximum likelihood estimates of J. c constant> 0. 0) = VBXv'O'. 0 > O.. for each wEn there is 0 E El such that w ~ q( 0).Pe. Show that if 0 is a MLE of 0.5 show that no maximum likelihood estimate of e = (1".. then i  WMLE = arg sup sup{Lx(O) : 0 E e(w)}. . Show that any T such that X(n) < T < X(1) + is a maximum likelihood estimate of O.II ". MLEs are unaffected by reparametrization. 0> O.O) where 0 ~ (1". bl to make pia) ~ p(b) = (b . 0) = (x/0 2) exp{ _x 2/20 2 }.. . . Show that the MLE of ry is h(O) (i. (a) Let X .0"2). Hint: You may use Problem 2.1. 0) = Ocx c . Pi od maximum likelihood estimates of J1 and a 2 .X)" > 0. (Rayleigh density) (f) f(x. the MLE of w is by definition ~ W.t and a 2 .I")/O") . (beta. (Pareto density) (d) fix.) ! ! !' ! 14.l L:~ 1 (Xi . Because q is onto n. If n exists.: I I \ . is a sample from a N(/l' ( 2 ) distribution. < I" < 00.16(b). and o belongs to only one member of this partition. 0 > O. x > 1". ~ I in Example 2. . Thus..e.
Xn be independently distributed with Xi having a N( Oi.5 Problems and Complements 145 Now show that WMLE ~ W ~ q(O). 16(i). Y n IS _ B(Y) = L." Y: . Yn which are independent.[X ~ k] ~ Bk .  . If time is measured in discrete periods. . .. . find the MLE of B. . Suppose that we only record the time of failure. and have commOn frequency function. (KieferWolfowitz) Suppose (Xl. where 0 < 0 < 1.... (b) Solve the problem of part (a) for n = 2 when it is known that B1 < B. (a) The observations are indicators of Bernoulli trials with probability of success 8.. Thus. in a sequence of binomial trials with probability of success fJ. and Brunk (1972).B) = 1.B). identically distributed. 21.. but e . . Bartholomew. Censored Geometric Waiting Times. X n ) is a sample from a population with density f(x. f(k. We want to estimate B and Var. we observe YI . We want to estimate O. 19..2.r f(r + I. In the "life testing" problem 1.6. A general solution of this and related problems may be found in the book by Barlow. X 2 = the number of failures between the first and second successes.[X < r] ~ 1 LB k=l k  1 (1_ B) = B'..B) =Bk .n .. < 00. . 17.X I = B(I.Section 2. 0 < a 2 < oo}. if failure occurs on or before time r and otherwise just note that the item has lived at least (r + 1) periods...) Let M = number of indices i such that Y i = r + 1. (a) Find maximum likelihood estimates of the fJ i under the assumption that these quantities vary freely. Show that maximum likelihood estimates do not exist.1 (I_B). B) = 100' cp 9 (x I') + 0' 10 cp(x 1') 1 where cp is the standard normal density and B = (1'. M 18.P... 1 <i < n.0") E = {(I'. and so on. . k ~ 1. Show that the maximum likelihood estimate of 0 based on Y1 . .O") : 00 < fl. •• . . 1) distribution. a model that is often used for the time X to failure of an item is P..~1 ' Li~1 Y. k=I. Bremner. Let Xl. (We denote by "r + 1" survival for at least (r + 1) periods.1 (1 . 20.B). Derive maximum likelihood estimates in the following models. (b) The observations are Xl = the number of failures before the first success. .
.146 Methods of Estimation Chapter 2 that snp.Y)' i=1 j=1 /(rn + n).I' fl. respectively. Show that the MLE of = (fl. Ii equals one of the numbers 22.0 2.5 0..X)' + L(Y.2 3. J2 J2 o o o I 3. 1 <k< p}. II I I :' 24. n). . i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution).tl. (1') populations.2 4.a 2 ) = Sllp~.6).8 14. Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer. . the following data were obtained (from S.Xm and YI .(Zi) + 'i. where [t] is the largest integer that is < t.0 11.. Tool life data Peed 1 Speed 1 1 Life 54.5 86. (1') where e n ii' = L(Xi .x n . .9 3. N. and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise.Xj for i =F j and that n > 2. Polynomial Regression.4)(2.ji..5 I' I I . .8 3.1. = fl.z!.4 o o o o o o o o o o o o o Life 20. 7 .'. Xl.J.. .0"2) if.0 I . " Yn be two independent samples from N(J. Set zl ~ .. Weisberg. .0 0.p(x.0 5.up(X. 1i(b. TABLE 2. 1 1 1 o o . distribution. x) as a function of b. Hint: Coosider the mtio L(b + 1. . and only if. Suppose Y. Let Xl. Assume that Xi =I=.2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3. where'i satisfy (2.< J.2 4.a 2 ) and NU".1 2. (1') is If = (X. 1985)..jp): 0 <j.2..8 2.2.5 66. Y.6.' wherej E J andJisasubsetof {UI . Suppose X has a hypergeometric.8 0. 23.t. x)/ L(b. and assume that zt'·· ..
in Problem 2.(P)Z of Y is defined as the minimizer of t = zT {3. Set Y = Wly.4)(2.2. That is. Y)Z') and E(v(Z. Show that (3.l be a square root matrix of W. Show that the follow ZDf3 is identifiable. y). Derive the weighted least squares nonnal equations (2.y)!(z. . .6.900)/300.3.Section 2. E{v(Z.ZD is of rank d.y) defined . Consider the model (2.19). the second model provides a better approximation. (b) Let P be the empirical probability .1.0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models. (12 unknown.Z)]'). Show that p.. ZI ~ (feed rate .2. = (cutting speed .Y = Z D{3+€ satisfy the linear regression mooel (2. ~. Y) have joint probability P with joint density ! (z. (a) Show tha!.(P) and p. (2. 25. Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix. This will be discussed in Volume II.2. z) ing are equivalent. (a) Let (Z'. Y)Y') are finite. of Example 2.y)!(z. .  27. 28.y)dzdy.1). Use these estimated coefficients to compute the values of the contrast function (2..Y') have density v(z.2.4)(2. z) ~ Z D{3. Two models are contemplated (a) Y ~ 130 + pIZI + p..1).(b l + b. Let (Z. Y')/Var Z' and pI(P) = E(Y*) 11. (a) The parameterization f3 (b) ZD is of rank d.6) with g({3.y)/c where c = I Iv(z. weighted least squares estimates are plugin estimates.(P) coincide with PI and 13.z..6) with g({3. let v(z. (2.(P) = Cov(Z'. z.2. The best linear weighted mean squared prediction error predictor PI (P) + p.. Let W. + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +.13)/6. y) > 0 be a weight funciton such that E(v(Z. l/Var(Y I Z = z). Being larger.1.5) for (a) and (b). 26.2.5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life. . . Consider the model Y = Z Df3 + € where € has covariance matrix (12 W.(P)E(Z·).1 (see (B.6)).8 and let v(z.2. Both of these models are approximations to the true mechanism generating the data. this has to be balanced against greater variability in the estimated coefficients.2. ZD = Wz ZD and € = WZ€. However. Y)[Y . (c) Zj.
0) = II j=q+l k 0.093 1.971 0.046 2.834 3.968 2..676 5. suppose some of the nj are zero. .6) W..968 9.093 5. Let ei = (€i + {HI )/2.+l I Y .. where El" . Hint: Suppose without loss of generality that ni 0.'.075 4.393 3.ZD. .457 2..ZD.054 1..d. .091 3. . k. <J > 0.392 4.081 2.20).131 5.716 7.1.916 6.229 4. i = 1.058 3. j = 1..i.. I Yi is (/1 + Vi).360 1.397 4.918 2. . = (Y  29. .249 1.6) is given by (2. J.156 5. 2..020 8..455 9. \ (e) Find the weighted least squares estimate of p.019 6. Consider the model Yi = 11 + ei.689 4.480 5.182 0.196 2. n.476 3.097 1.300 6.788 2. Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay. That is.856 2. Yn are independent with Yi unifonnly distributed on [/1i .1. 4 (b) Show that Y is a multivariate method of moments estimate of p. different from Y? I I TABLE 2.611 4.599 0.038 3. where 1'.071 0. ' I (f) The following data give the elapsed times YI . Is ji..jf3j for given covariate values {z. Show that the MLE of . = L.148 ~ Methods of Estimation Chapter 2 (b) Show that if Z  D has rank d. n. in this model the optimal " MSPE predictor of the future Yi+ 1 given the past YI . (See Problem 2.274 5..2.039 9..564 1. which vanishes if 8 j = 0 for any j = q + 1.155 4.453 1.). .8. Suppose YI ..114 1.j}.053 4.858 3. k. with mean zero and variance a 2 • The ei are called moving average errors.~=l Z.100 3. 31.nk > 0.391 0. nq+1 > p(x.908 1. = nq = 0. /1i + a}...) ~ ~ (I' + 1'. .) (c) Find a matrix A such that enxl = Anx(n+l)ECn+l)xl' (d) Find the covariance matrix W of e..€n+l are i.a. _'.6) T 1 (Y . In the multinomial Example 2. i = 1. ~ ~ ~Show that the MLE of OJ is 0 with OJ = nj In.723 1.(Y ..379 2. (a) Show that E(1'.6)T(y .870 30. Use a weighted least squares c~mputer routine to compute the weighted least squares estimate Ii of /1. .. The data (courtesy S.2.921 2.582 2. . . Then = n2 = . .860 5.17.. then the  f3 that minimizes ZD..ZD. Chon) shonld be read row by row.5.958 10.703 4. Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore.064 5.665 2.511 3. .666 3. 1'. .301 2.669 7. .
1.. be the Cauchy density.6. . 1). Hint: See Example 1.5 Problems and Complements 149 (f3I' . a)T is obtained by finding 131. . Let x. .x. the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l). . + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x ..2.. ./Jr ~ ~ and iii.{3p. Yn ordered from smallest to largest.i.>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}.17). . = I' for each i. .Ili I and then setting (j = n~I L~ I IYi . ~ 33.i. with density g(x . Y( n) denotes YI . .. y)T where Y = Z + v"XW. (3p. and x. XHL i& at each point x'. the sample median yis defined as ~ [Y(. be i.. be i. . as (Z. let Xl and X.O) Hint: See Problem 2. 34..I'. Find the MLE of A and give its mean and variance. Z and W are independent N(O.J.7 with Y having the empirical distribution F. . Let g(x) ~ 1/".l1.. 35. ..tLil and then setting a = max t IYi .(3p that minimizes the least absolute deviation contrast function L~ I IYi ..) + Y(r+l)] where r = ~n. (See (2. . Let X. " .. Show that the sample median ii is the minimizer of L~ 1 IYi . A > 0.)... .. (a) Show that the MLE of ((31. ~ f3I.28Id(F. Give the MLE when .4.I/a}. = L Ix. Let B = arg max Lx (8) be "the" MLE. then the MLE exists and is unique. . The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj). . F)(x) *F denotes convolution.{:i.:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x..& at each Xi.fin are called (b) If n is odd. If n is even. Hint: Use Problem 1. Show that XHL is a plugin estimate of BHL. be the observations and set II = ~ (Xl . Suppose YI .Section 2.8).32(b)... .(1 + x'j. a) is obtained by finding (31./3p that minimizes the maximum absolute value conrrast function maxi IYi .. Yn are independent with Vi having the Laplace density 1 2"exp {[Y. where Iii = ~ L:j:=1 ZtjPj· 32.. An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass . .Jitl.d.d. 8 E R. (a) Show Ihat if Ill[ < 1.) Suppose 1'. i < j. These least absolute deviation estimates (LADEs)...  Illi < 1. x E R.3. ..d. where Jii = L:J=l Zij{3j.
Assume :1. 3.. (a. 37. . Define ry = h(B) and let p' (x.+~l Wj have P(rnAl) and P(mA2) distributions. . Show that (a) The likelihood is symmetric about ~ x. Let (XI. = I ! I!: Ii. then h"(y) > 0 for some nonzero y. i = 1. Problem 35 can be generalized as follows (Dhannadhikari and JoagDev. . B) is and that the KullbackLiebler divergence between p(x. > h(y) . . Bo) is tn( B. Let X ~ P" B E 8. 51.B)g(i . Assume that 5 = L:~ I Xi has a Poisson. Let Vj and W j denote the number of hits of type 1 and 2 on day j. 1985).ryt» denote the KullbackLeibler divergence between p( x. ryo) and p' (x. BI ) (p' (x. and 5. On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). B) and pix.j'. .. where x E Rand (} E R. such that for every y E (a. and 8 2 are independent. ryl Show that o n ».t> . 1 n + m.+~l Vi and 8 2 = E. 2.B )'/ (7'. . Let Xi denote the number of hits at a certain Web site on day i.. . . ~ (XI + x. Suppose h is a II function from 8 Onto = h(8). 38. B) denote their joint density.. . and positive everywhere. 36.. Also assume that S. Let 9 be a probability density on R satisfying the following three conditions: I. N(B.d. . .i. n. Ii I.. Sl. distribution. 'I i 39. Let K(Bo. symmetric about 0.)/2 and t> = (XI . 9 is twice continuously differentiable everywhere except perhaps at O.f)) in the likelihood equation. Suppose X I.. o Ii !n 'I .. X. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . Show that the entropy of p(x.B).. B) = g(xB).B I ) (K'(ryo. j = n + I.. Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x . P(nA). (b) Either () = x or () is not unique.150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. The likelihood function is given by Lx(B) g(xI . ~ Let B ~ arg max Lx (B) be "the" MLE. ~ (d) Use (c) to show that if t> E (a.B) g(x + t> . . Find the MLEs of Al and A2 hased on 5.b).h(y) .B)g(x. If we write h = logg. B ) and p( X.. where Al +.\2 = A.Xn are i.) be a random samplefrom the distribution with density f(x. (7') and let p(x. 9 is continuous. b). then B is not unique.0). that 8 1 E. Let Xl and X2 be the observed values of Xl and X z and write j. h I (ry») denote the density or frequency function of X for the ry parametrization.x2)/2.h(y . ry) = p(x. then the MLE is not unique. a < b.
[3.~e p = 1. X n be a sample from the generalized Laplace distribution with density 1 O + 0. = L X. Variation in the population is modeled on the log scale by using the model logY. = log a . and 1'.I[X. . [3. 0). where OJ > O. 1'). on a population of a large number of organisms.O) ~ a {l+exp[[3(tJ1. h(z. 1 En are uncorrelated with mean 0 and variance estimating equations (2. Give the least squares where El" •. . and /1. a > 0. . 1).) Show that statistics. [3. < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case. a.  J1.~t square estimating equations (2.1. = LX.7) for a. I' E R. (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism.1. > OJ and T. [3 > 0. Aj) + Ei. . 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0.. 1959. o + 0.. . exp{x/O. n > 4. (tn.)/o]}" (b) Suppose we have observations (t I . n.)/o)} (72. n where A = (a. 0 > O. .I[X.5 Problems and Complements 151 40. . .5.. . j = 1.j ~ x <0 1.0). 6. A) = g(z. .1.. /3. 42... Suppose Xl. 1 X n satisfy the autoregressive model of Example 1. i = 1.olog{1 + exp[[3(t. . x > 0.}. + C.Section 2. give the lea.. Seber and Wild. 41. i = 1.2.1'. .7) for estimating a... . Yn). Let Xl. 1'. j 1 1 cxp{ x/Oj}.. T.p. For the ca. yd 1 ••• .c n are uncorrelated with mean zero and variance (72. and g(t. An example of a neural net model is Vi = L j=l p h(Zij. (. where 0 (a) Show that a solution to this equation is of the form y (a. a get. . The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards. and fj. [3.
3.l..y.l.. I .3.152 Methods of Estimation Chapter 2 (aJ If /' is known.3 1. There exists a compact set K c e such that 1(8) < c for all () not in K. Let = {(01. _ C2' 2. 1 Xl <i< n.::. C2' y' 1 1 for (a) Show that the density of X = (Xl.p).) 1 II(Z'i /1)2 (b) If j3 is known. (b) Show that the likelihood equations are equivalent to (2. Xn· Ip (x. .. t·. Is this also the MLE of /1? Problems for Section 2. Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p.. Give details of the proof or Corollary 2.d. X n be i..J.. . 3. gamma.) Then find the weighted least square estimate of f.)I(c.). I < j < 6. Yn) is not a sequence of 1's followed by all O's or the reverse. . . n > 2. r(A..'" £n)T of autoregression errors. (One way to do this is to find a matrix A such that enxl = Anxn€nx 1.0..3.4) and (2. .1 > _£l.5).. the bound is sharp and is attained only if Yi x. .3. .(8 . a. 0 for Xi < _£I. write Xi = j if the ith plant has genotype j.. Suppose Y I . find the covariance matrix W of the vector € = (€1. . fi) = a + fix.i... '12 = A and where r denotes the gamma function. = If C2 > 0. . = 1] = p(x"a. L x. = L(CI + C.fi) = 1log p pry.l  L: ft)(Xi /.Xi)Y' < L(C1 + c. . < . Let Xl. Prove Lenama 2. .02): 01 > 0.. Consider the HardyWeinberg model with the six genotypes given in Problem 2. Hint: Let C = 1(0). + 8. show that the MLE of fi is jj =  2. > 0. In a sample of n independent plants.x. + 0. . " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4. This set K will have a point where the max is attained. S. (X. = OJ. < Show that the MLE of a.. . Hint: I . Yn are independent PlY.1..x.0 .. fi exists iff (Yl . n i=l n i=l n i=1 n i=l Cl LYi + C.. Under what conditions on (Xl 1 .. < I} and let 03 = 1. + C1 > 0). • II ~.15.
9. .. . Use Corollary 2.3. Let XI..10 with that the MLE exists and is unique.lkIl :Ij >0.n.d.O.5 Problems and Complements 153 n 6.3. . [(Ai). .Section 2.3.1 <j < ac k1. . (3) Show that.. Suppose Y has an exponential.z=7 11. MLEs of ryj exist iffallTj > 0.) ~ Ail = exp{a + ilz. the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i .. if n > 2. < Zn. e E RP. fe(x) wherec. .. (O. and assume forw wll > 0 so that w is strictly convex..1).3. J. show 7. Hint: The kpoints (0.0.. . Hint: If BC has positive volume. u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm. . 0:0 = 1..0).. _. < Show that the MLE of (a.6. ..3. distribution where Iti = E(Y. Let Y I .ill T exists and is unique. 10. Zj < . Then {'Ij} has a subsequence that converges to a point '10 E f. 8..0 < ZI < . a(i) + a. with density. .. the MLE 8 exists but is not unique if n is even. ~fo (X u I. and Zi is the income of the person whose duration time is ti. < Zn Zn. convex set {(Ij. 9.1} 0 OJ = (b) Give an algorithm such that starting at iP = 0. . .fl.0). n > 2.. C! > 0. 12... Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 .A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00. . . the MLE 8 exists and is unique. ji(i) + ji. But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations.1 (a) ~ ~ c(a) exp{ Ix . (b) Show that if a = 1.'~ve a unique solution (. X n be Ll.}. 0 < Zl < . Show that the boundary of a convex C set in Rk has volume 0.t). (0. Yn denote the duration times of n independent visits to a Web site. • It) w' ( Xi It) .8) .' . Prove Theorem 2. then it must contain a sphere and the center of the sphere is an interior point by (B.. w( ±oo) = 00.Xn E RP be i. In the heterogenous regression Example 1. 1 <j< k1....40.Ld..1 to show that in the multinomial Example 2.en.6. . See also Problem 1.l E R. " (a) Show that if Cl: > ~ 1. > 3. Let Xl.n) are the vertices of the :Ij <n}.
. Note: You may use without proof (see Appendix B. P coincide with the method of moments estimates of Problem 2.9). See.b)_(ao. show that the estimates of J11.:. pal0'2 + J11J.2)/M'''2] iI I'.) has a density you may assume that > 0. ag. ~J ' . (a) Show that the MLEs of a'f.4 1. 0'1 Problems for Section 2.J1..n log a is strictly convex in (a.1.) I . /12.J1.2). b successively. O'~..d.(Yi .'2). "i . complete the Estep by finding E(Zl I l'i) and E(ZiYi I Yd· (b) In Example 2. provided that n > 3. .2..4.J1.b) . and p when J1.3. b) = x if either ao = 0 or 00 or bo = ±oo. Yn ) be a sample from a N(/11.1 and J1. J12.bo) D(a.4. .6. L:. 1985. il' i. EM for bivariate data.. b) ~ I w( aXi . Let (X10 Yd. and p = [~(Xi .8. E" . ar 1 a~. . IPl < 1.6. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model i: y ~ ZDJ3 + <. 2. Y.1)".. . > 0. En i. (i) If a strictly convex function has a minimum.2)2. 7if)F 8 08 D 802 vb2 2 2 > (a'D)2 ' then D'IS strictIy convex..154 Methods of Estimation Chapter 2 (c) Show that for the logistic distribution Fo(x) [1 + exp{ X}]I. rank(ZD) = k. (I't') IfEPD 8a2 a2 D > 0 an d > 0. (Xn . . O'~ + J1i.4. respectively.3. W is strictly convex and give the likelihood equations for f. then the coordinate ascent algorithm doesn't converge to a member of t:. it is unique. Chapter 10.2 are assumed to be known are = (lin) L:. Hint: (b) Because (XI. 8aob :1 13.i. O'~ + J1~. . O'?.. . Show that if T is minimal and t: is open and the MLE doesn't exist. 3.:. b) and lim(a. p) population. (b) If n > 5 and J11 and /12 are unknown. I (Xi .l and CT. for example.)(Yi . b = : and consider varying a.. (See Example 2. (Check that you are describing the GaussSeidel iterative method for solving a system of linear equations.J1. Apply Corollary 2. N(O. Golnb and Van Loan. {t2. verify the Mstep by showing that E(Zi I Yi). (b) Reparametrize by a = .. = (lin) L:. . EeT = (J11.) Hint: (a) Thefunction D( a.' ! (a) In the bivariate nonnal Example 2.
. j = 0.and Msteps of the EM algorithm for this problem. given X>l. and given II = j.B)n} _ nB'(I. Y'i)._x(1. ~2 = log C\) + (b) Deduce that T is minimal sufficient.[lt ~ OJ. 1]. and that the (b) Use (2.(I. B E [0. Y (~ A 2:7 I(Yi  Y)'  O"~)/ (0". where 8 = 80l d and 8 1 estimate of 8. 1') E (0. 1 < i < n. as the first approximation to the maximum likelihood .O"~).Section 2. IX > 1) = \ X ) 1 (1 6)" . Because it is known that X > 1. Yd : 1 < i family with T afi I a? known. the model often used for X is that it has the conditional distribution of a B( n. B= (A.B)n[n . I' >.2B)x + nB'] _. thus.4. i ~ 1'.. 1 and (a) Show that X . B) variable. also the trait). (c) Give explicitly the maximum likelihood estimates of Jt and 5. X is the number of members who have the trait. is distributed according to an exponential ryl < n} = i£(1 I) uzuy. 2 (uJ L }iIi + k 2: Y. Suppose the Ii in Problem 4 are not observed. . . For families in which one member has the disease.[1 . it is desired to estimate the proportion 8 that has the genetic trait. (a) Justify the following crude estimates of Jt and A. Suppose that in a family of n members in which one has the disease (and. Do you see any problems with >. Hint: Use Bayes rule.x = 1.P.[lt ~ 1] = A ~ 1 . 2:Ji) .I ~ + (1  B)n] .? (b) Give as explicitly as possible the E.B)[I.5 Problems and Complements 155 4.Ii). 6. when they exist. 8new .n.B)"]'[(1 ...(1 _ B)n]{x nB . be independent and identically distributed according to P6.(1 ..3) to show that the NewtonRaphson algorithm gives ~ _ 81 ~ 8 ~ _ ~ _ B(I. Y1 '" N(fLl a}). I n ) 8"(10)"" (a) Show that P(X = x MLE exists and is unique.{(Ii.. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it.1) x Rwhere P. Let (Ii.
N++ c N a +c N+ bc Pabc = . c]' Show that this holds iff PIU ~ a. (a) Deduce that depending on where bisection is started the sequence of iterates may converge to one or the other of the local maxima (b) Make a similar study of the NewtonRaphson method in this case. X Methods of Estimation Chapter 2 = 2. where X = (U. 7. X 3 = a. 9.b.. the sequence (11ml ijm+l) has a convergent subse quence. = 1. PIU = a.. 1 <c < C and La. .X 2 .9)')1 Suppose X.A(iJ(A)).4.X3 be independent observations from the Cauchy distribution about f(x.c)} and "+" indicates summation over the ind~x.v ~ b I W = c] = P[U = a I W = c]P[V = b I W ~ i. + Vbc where 00 < /l. b1 C. W).N+bc where N abc = #{i : Xi = (a. ~ 0. find (}l of (b) above using (} =   x/n as a preliminary estimate.Na+c.2. V. (h) Show that the family of distributions obtained by letting 1'. X n be i.156 (c) [f n = 5. v < 00. .c Pabc = l. Let Xl. (a) Suppose for aU a. V = b. N +bc given by < N ++c for all a. (c) Show that the MLEs exist iff 0 < N a+c .1 generated by N++c.i. .1 (1 + (x e. Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between p and a. Let Xl. Let and iJnew where ).1) + C(A + B2) = C(A + B1) . . W = c] 1 < a < A. Hint: Apply the argument of the proof of Theorem 2.e. b1 c and then are .b. X.riJ(A) . Define TjD as before. N . hence.9) = ". Show that the sequence defined by this algorithm converges to the MLE if it exists.4. (1) 10gPabc = /lac = Pabe. iff U and V are independent given W. Consider the following algorithm under the conditions of Theorem 2.d. * maximizes = iJ(A') t. .2 noting that the sequence of iterates {rymJ is bounded and. 8. v vary freely is an exponential family of rank (C . n N++ c ++c  . 1 < b < B.
Nl+co (c) The model implies Pabc = P+bcPa±c/P++c and use the likelihood equations. (a) Show that S in Example 2.a>!b".1) + (A . 10.b'.I) ~ AB + AC + BC .(x) :1:::: 1(S(x) ~ = s). Suppose X is as in Problem 9.I) + (B . v.3 + (A . Justify formula (2.N++c/A. Let f. (a) Show that this is an exponential family of rank A + B + C . Initialize: pd: O) = N a ++ N_jo/>+ N++ c abc nnn Pabc d2) dl) Nab+ Pabc dO) n n Pab+ d1) dl) dO) Pabc d3) N a + c Pabc Pa+c N+bc Pabc d 2) Pabc n P+bc d2)· Reinitialize with ~t~.4.4.(A + B + C). N±bc . Hint: P..5 has the specified mixtnre of Gaussian distribution.c.=_ L ellalblp~~~'CI a'.Section 2.1)(B . (b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model. 12.I)(C . (b) Give explicitly the E.:.8).and Msteps of the EM algorithm in this case. 11. f vary freely. c" and "a. but now (2) logPabc = /lac + Vbc + 'Yab where J1. c" parameters. Show that the algorithm converges to the MLE if it exists and di· verges otheIWise.c' obtained by fixing the "b.N++ c / B.a) 1 2 .I)(C . = fo(x  9) where fo(x) 3'P(x) + 3'P(x .[X = x I SeX) = s] = 13.5 Problems and Complements 157 Hint: (b) Consider N a+ c . Hint: Note that because {p~~~} belongs to the model so do all subsequent iterates and that ~~~ is the MLE for the exponential family Pabc = ellauplO) _=..
4. where E) . . Noles for Section 2.. Establish part (b) of Theorem 2. Bm+d} has a subsequence converging to (0" JJ*) and.5.6 . This condition is called missing at random. Suppose that for 1 < i < m we observe both Zi and Yi and for m + 1 < i < n. For instance. Hint: Show that {(19 m.4. N(O. in Example 2. For X . the process determining whether X j is missing is independent of X j . I I' .. • . necessarily 0* is the global maximizer.1 (1) "Natural" now was not so natural in the eighteenth century when the least squares principle was introduced by Legendre and Gauss..4.4."" En. That is. (2) The frequency plugin estimates are sometimes called Fisher consistent.d. ~ .6. If /12 = 1. . the probability that Vi is missing may depend on Zi. ~ 16. Show for 11 = 1 that bisection may lead to a local maximum of the likelihood. . En a an 2. al = a2 = 1 and p = 0.158 Methods of Estimation Chapter 2 and r. Establish the last claim in part (2) of the proof of Theorem 2..i. (J*) and necessarily g. = {(Zil Yi) Yi :i = 1. R. 14. . Hint: Show that {( 8m . suppose Yi is missing iff Vi < 2. Verify the fonnula given in Example 2. In Example 2. find the probability that E(Y.3. Complete the E.. given X . If Vi represents the seriousness of a disease. EM and Regression.and Msteps of the EM algorithm for estimating (ftl. consider the model I I: ! I I = J31 + J32 Z i + Ei I are i. this assumption may not be satisfied.n}.~ go. 18.6. That is. given Zi. Hint: Use the canonical nature of the family and openness of E. thus. For example. I Z. NOTES . {31 1 a~. . Bm + I)} has a subsequence converging to (B* .2. .4. 2 )." . suppose all subjects with Yi > 2 drop out of the study. N(ftl. 15. if a is sufficiently large.d. ! . .3 for the actual MLE in that example.p is the N(O} '1) density. the "missingness" of Vi is independent of Yi. Then using the Estep to impute values for the missing Y's would greatly unclerpredict the actual V's because all the V's in the imputation would have Y < 2. These considerations lead essentially to maximum likelihood estimates. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected. Ii . {32).5. and independent of El. A. Zn are ii. we observe only}li.) underpredicts Y. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986). Zl. Limitations a/the EM Algorithm.. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component X j of the data vector X is missing given the rest of the data vector is not a function of X j .. but not on Yi. a 2 .{Xj }. 17.
D. A. La. "On the Mathematical Foundations of Theoretical Statistics. A. C. 1997. FEINBERG. Elements oflnfonnation Theory New York: Wiley. S.7 REFERENCES BARLOW. G. AND N.g. AND J. Note for Section 2.. and MacKinlay. T. La. J. W. NJ: Princeton University Press. VAN LOAN. W. COVER. LAIRD. Sciences.. Statist. BJORK. BREMNER. 39. FISHER. A. 41. 164171 (1970)." The American Statistician.7 References 159 Notes for Section 2. BAUM." J.. "Y. HOLLAND.• AND I.. GIlBELS." Journal Wash." reprinted in Contributions to MatheltUltical Statistics (by R.3 (I) Recall that in an exponential family. Soc. R. MA: MIT Press. see Cover and Thomas (1991). RUBIN. AND K. H. F. HABERMAN. 1975. FAN. . 199200 (1985). "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains:' Am!. Statistical Inference Under Order Restrictions New York: Wiley. E. Fisher 1950) New York: J.2. M. G. AND A. The Analysis ofFrequency Data Chicago: University of Chicago Press.. R. GoLUB. BRUNK. M. L. THOMAS. ANDERSON. Matrix Computations Baltimore: John Hopkins University Press. WEISS. Wiley and Sons. Local Polynomial Modelling and Its Applications London: Chapman and Hall. DEMPSTER. a multivariate version of minimum contrasts estimates are often called generalized method of moment estimates.2 (I) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964). "The Meaning of Least in Least Squares. Numerical Analysis New York: Prentice Hall. c. P[T(X) E Al o for all or for no PEP. M.. 54. 1985. 138 (1977). "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm. M. Discrete Multivariate Analysis: Theory and Practice Cambridge. Appendix A. DAHLQUIST. A. S. E. Acad. BISHOP. A. AND D. AND N. G. S. 1997). AND C.. Statist. T. M. Math. 0.. The Econometrics ofFinancial Markets Princeton. 1991.. Note for Section 2. 2433 (1964). A.. M. 1. B. MACKINLAY. DHARMADHlKARI... PETRIE. "Y. J. CAMPBELL. (2) For further properties of KullbackLeibler divergence.Section 2. SOULES. 1972.5 (1) In the econometrics literature (e. AND P. EiSENHART.. "Examples of Nonunique Maximum Likelihood Estimators.1. Roy. for any A.1974. t 996.loAGDEV. J. B. 1974. BARTHOLOMEW. Campbell. 2. 1922. ANOH. E.
KR[SHNAN. "Association and Estimation in Contingency Tables. 10.. 1%7. Statist.." Bell System Tech. In. 1997. • . Journal. J. 128 (1968). WAND. AND M. A. Statistical Methods. N. AND T. 13461370 (1994). "On the Shannon Theory of Information Transmission in the Case of Continuous Signals. 27. MOSTELLER. IA: Iowa State University Press. I • I . Assoc.1..160 Methods of Estimation Chapter 2 KOLMOGOROV. E. "Multivariate Locally Weighted Least Squares Regression."/RE Trans! Inform. F. I 1 . i I.. 1989. 102108 (1956). 6th ed. A. 63." Ann. F. MACLACHLAN. S. Nonlinear Regression New York: Wiley. D. C. 1986. RUBIN. 11. 1985.. P. Botany. E. B. "A Flexible Growth Function for Empirical Use. The History ofStatistics Cambridge. "A Mathematical Theory of Communication. Applied linear Regression.. 290300 ( 1959). Exp. New York: Wiley. WU. The EM Algorithm and Extensions New York: Wiley.. SHANNON... Theory. C. 22. J. RUPPERT. R.. SNEDECOR. G. AND D. RICHARDS. SEBER.." J. J. 2nd ed... LIlTLE.·sis with Missing Data New York: J. G. MA: Harvard University Press. WILD. G. "On the Convergence Properties of the EM Algorithm. Amer. STIGLER." 1.95103 (1983) ." Ann. 1987.. A. WEISBERG. S. Ames.623656 (1948). Statistical Anal. . Statist. W. E.379243.. Wiley. AND W. J. AND c. Statisr. COCHRAN.
However. we study.3.2 BAYES PROCEDURES Recall from Section 1. However.4.3 that if we specify a parametric model P = {Po: E 8}.1 INTRODUCTION Here we develop the theme of Section 1. which is how to appraise and select among decision procedures. AND OPTIMAL PROCEDURES 3. 6) : e . (3. 6 2 ) for all 0 or vice versa. 3. Strict comparison of 6.1) 161 . actual implementation is limited.a). We also discuss other desiderata that strongly compete with decision theoretic optimality. We think of R(. in particular computational simplicity and robustness. OUf examples are primarily estimation of a real parameter. We return to these themes in Chapter 6. In Section 3. ) < R(B.+ R+ by ° R(O. loss function l(O. after similarly discussing testing and confidence bounds.3 we show how the important Bayes and minimax criteria can in principle be implemented.6 .6) =ER(O. by introducing a Bayes prior density (say) 1r for comparison becomes unambiguous by considering the scalar Bayes risk. in the context of estimation. the relation of the two major decision theoretic principles to the 000decision theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness.6) = EI(6. then for data X '" Po and any decision procedure J randomized or not we can define its risk function.6(X). action space A. and 62 on the basis of the risks alone is not well defined unless R(O. In Sections 3.2.6(X)). NOTIONS OF OPTIMALITY.Chapter 3 MEASURES OF PERFORMANCE. in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case. 6) as measuring a priori the performance of 6 for this model. o r(1T.2 and 3. R(·.6) = E o{(O.
2. In the continuous case with real valued and prior density 1T.2. 'f " . as usual. we could identify the Bayes rules 0.4) !. This exercise is interesting and important even if we do not view 1r as reflecting an implicitly believed in prior distribution on ().2.(0)1 .. and 1r may express that we care more about the values of the risk in some rather than other regions of e.2. 0) 2 and 1T2(0) = 1. We first consider the problem of estimating q(O) with quadratic loss.3 we showed how in an example.. if computable will c plays behave reasonably even if our knowledge is only roughly right. 6) ~ E(q(O) .4) the parametrization plays a crucial role here. oj = ".(0) a special role ("equal weight") though (Problem 3.2. It is in fact clear that prior and loss function cannot be separated out clearly either.2) the Bayes risk of the problem.6).5. if 1[' is a density and e c R r(". (3.2. 6) Bayes rule 6* is given by = a? 6'(X) = E[q(O) I XJ. Thus. For testing problems the hypothesis is often treated as more important than the alternative.(O)dfI p(x I O).2.2. 0) and "I (0) is equivalent to considering 1 (0. (0. j I 1': II . Here is an example. We don't pursue them further except in Problem 3. Clearly.5) This procedure is called the Bayes estimate for squared error loss.162 Measures of Performance Chapter 3 where (0.5). B denotes mean height of people in meters.2... Suppose () is a random variable (or vector) with (prior) frequency function or density 11"(0). we just need to replace the integrals by sums.(O)dfI (3.6) In the discrete case. We may have vague prior notions such as "IBI > 5 is physically implausible" if. it is plausible that 6. and instead tum to construction of Bayes procedure.(O)dO (3. 1 1  .r= .. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). In view of formulae (1. . Recall also that we can define R(.) ~ inf{r(". (0..3).4.6) = J R(0.3) In this section we shall show systematically how to construct Bayes rules.6): 6 E VJ (3. This is just the prohlem of finding the best mean squared prediction error (MSPE) predictor of q(O) given X (see Remark 1. X) is given the joint distribution specified by (1. we find that either r(1I". such that (3. 00 for all 6 or the Using our results on MSPE prediction. for instance. and that in Section 1. x _ !oo= q(O)p(x I 0).8) for the posterior density and frequency functions. e "(). If 1r is then thought of as a weight function roughly reflecting our knowledge. 0) = (q(O) using a nonrandomized decision rule 6.6(X)j2. we can give the Bayes estimate a more explicit form. 1(0. Our problem is to find the function 6 of X that minimizes r(". considering I. After all.
" Xn. 0 We now tum to the problem of finding Bayes rules for gelleral action spaces A and loss functions l. J') ~ 0 as n + 00.2 Bayes Procedures 163 Example 3. and X with weights inversely proportional to the Bayes risks of these two estimates. Suppose that we want to estimate the mean B of a nonnal distribution with known variance a 2 on the basis of a sample Xl. Fonnula (3.2.7) reveals the Bayes estimate in the proper case to be a weighted average MID + (1 .00 with. X is approximately a Bayes estimate for anyone of these prior distributions in the sense that Ir( 7r. Intuitively. If we choose the conjugate prior N( '1]0. /2) as in Example 1. r) . But X is the limit of such estimates as prior knowledge becomes "vague" (7 . a) I X = x). This action need not exist nor be unique if it does exist. see Section 5.7) Its Bayes risk (the MSPE of the predictor) is just r(7r. .) minimizes the conditional MSPE E«(Y .5.2. for each x.2. we fonn the posterior risk r(a I x) = E(l(e.1.1). a 2 / n. For more on this. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior.1. . J' )]/r( 7r. . In fact. Thus. that is.E(e I X))' I X)] E[(J'/(l+~)]=nja 2 + 1/72 ' 1 2 n n/ No finite choice of '1]0 and 7 2 will lead to X as a Bayes estimate. Applying the same idea in the general Bayes decision problem.2.w)X of the estimate to be used when there are no observations. In fact. However.4. To begin with we consider only nonrandomized rules.r( 7r.Section 3.a)' I X ~ x) as a function of the action a. X is the estimate that (3. J') E(e . Such priors with f 7r(e) ~ 00 or 2: 7r(0) ~ 00 are called impropa The resulting Bayes procedures are also called improper. E(Y I X) is the best predictor because E(Y I X ~ .2. if X = :x and we use action a. we obtain the posterior distribution The Bayes estimate is just the mean of the posterior distribution J'(X) ~ 1]0 l/T' ] _ [ n/(J2 ] [n/(J' + l/T' + X n/(J' + l/T' (3. 1]0 fixed). tends to a as n _ 00. we should. the Bayes estimate corresponding to the prior density N(1]O' 7 2 ) differs little from X for n large. take that action a = J*(x) that makes r(a I x) as small as possible. This quantity r(a I x) is what we expect to lose. '1]0. if we substitute the prior "density" 7r( 0) = 1 (Prohlem 3. Because the Bayes risk of X. If we look at the proof of Theorem 1.E(e I X))' = E[E«e .6.6) yields. we see that the key idea is to consider what we should do given X = x.12.
Bayes Procedures Whfn and A Are Finite. and the resnlt follows from (3.9) " "I " • But. ?f(O. !i n j.2. :i " " Proof. Example 3.70 and we conclude that 0'(1) = a.8. (3. and let the loss incurred when (}i is true and action aj is taken be given by j e e . As in the proof of Theorem 1.67 5.Op}.o(X)) I X)].3.2. if 0" is the Bayes rule. ~ a. E[I(O. let Wij > 0 be given constants. 0'(0) Similarly. az.5) with priOf ?f(0. az has the smallest posterior risk and..! " j. (3.1.) = 0. The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all corrpeting procedures. .o) = E[I(O. I x) > r(o'(x) I x) = E[I(O. Therefore. a q }. 6* = 05 as we found previously.·.8). 10) 8 10. and aa are r(at 10) r(a.2. consider the oildrilling example (Example 1. Suppose we observe x = O.2. 11) = 5.164 Measures of Performance Chapter 3 Proposition 3.8) Then 0* is a Bayes rule. r(a. [I) = 3. Therefore. " .2. I 0)  + 91(0" ad r(a. by (3. the posterior risks of the actions aI. More generally consider the following class of situations.2.o(X))] = E[E(I(O. . r(al 11) = 8. .o'(X)) I X].o(X)) I X] > E[I(O.9). E[I(O.) = 0.35..· .. A = {ao.2.4.8) Thus.o'(X)) IX = x]. Then the posterior distribution of o is by (1.74.89. o As a first illustration. Let = {Oo. r(a. we obtain for any 0 r(?f.1. Suppose that there exists a/unction 8*(x) such that r(o'(x) [x) = inf{r(a I x) : a E A}.2..o(X)) I X = x] = r(o(x) Therefore.2.
Section 3.2
Bayes Procedures
165
Let 1r(0) be a prior distribution assigning mass 1ri to Oi, so that 1ri > 0, i = 0, ... ,p, and Ef 0 1ri = 1. Suppose, moreover, that X has density or frequency function p(x I 8) for each O. Then, by (1.2.8), the posterior probabilities are
and, thus,
raj (
x I) =
EiWipTiP(X I Oi) . Ei 1fiP(X I Oi)
(3.2.10)
The optimal action 6* (x) has
r(o'(x) I x)
=
O<rSq
min r(oj
I x).
Here are two interesting specializations. (a) Classification: Suppose that p
= q, we identify aj with OJ, j
1, i f j
=
0, ... ,p, and let
Wii
O.
This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,
r(Oi
Ix) = Pl8 f ei I X = xl
and minimizing r(Oi I x) is equivalent to the reasonable procedure of maximizing the posterior probability,
PI8 = e I X = i
xl
=
1fiP(X lei) Ej1fjp(x I OJ)
(b) Testing: Suppose p = q = 1, 1ro = 1r, 1rl = 1  'Jr, 0 < 1r < 1, ao corresponds to deciding 0 = 00 and al to deciding 0 = 01 • This is a special case of the testing fonnulation of Section 1.3 with 8 0 = {eo} and 8 1 = {ed. The Bayes rule is then to
decide e decide e
= 01 if (1 1f)p(x I 0, ) > 1fp(x I eo) = eo if (1  1f)p(x IeIl < 1fp(x Ieo)
and decide either ao or al if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing .between ao and al if equality occurs. As we let 'Jr vary between zero and one. we obtain what is called the class of NeymanPearson tests, which provides the solution to the problem of minimizing P (type II error) given P (type I error) < Ct. This is treated further in Chapter 4. D
166
Measures of Performance
Chapter 3
To complete our illustration of the utility of Proposition 3.2. L we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation ofthe Probability ofSuccess in n Bernoulli Trials. Suppose that we wish to estimate () using X I, ... , X n , the indicators of n Bernoulli trials with
probability of success
e, We shall consider the loss function I given by
(0  a)2 1(0, a) = 0(1 _ 0)' 0 < 0 < I, a real.
(3.2.11)
"
" " ,

d
This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for () close to zero, this l((), a) is close to the relative squared error (fJ  a)2 JO, lt makes X have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5. By sufficiency we need only consider the number of successes, S. Suppose now that we have a prior distribution. Then, if aU terms on the righthand side are finite,
rea I k)
(3.2.12)
;,
I
, " " I
,

I
Minimizing this parabola in a, we find our Bayes procedure is given by
J'(k) = E(1/(1  0) I S = k) E(I/O(I  0) IS  k)
;
(3.2.13)

i! , ,.
provided the denominator is not zero. For convenience let us now take as prior density the density br " (0) of the bela distribution i3( r, s). In Example 1. 2.1 we showed that this leads to a i3(k +r,n +s  k) posteriordistributionforOif S = k. Ifl < k < nI and n > 2, then all quantities in (3.2.12) and (3.2.13) are finite, and
.
,
";1 
J'(k)
Jo'(I/(1  O))bk+r,n_k+,(O)dO
J~(1/0(1 O))bk+r,n_k+,(O)dO
B(k+r,nk+sI) B(k + r  I, n  k + s  I) k+rl n+s+r2'
(3.2.14)
where we are using the notation B.2.11 of Appendix B. If k = 0, it is easy to see that a = is the only a that makes r(a I k) < 00. Thus, J'(O) = O. Similarly, we get J'(n) = 1. If we assume a un!form prior density, (r = s = 1), we see that the Bayes procedure is the usual estimate, X. This is not the case for quadratic loss (see Problem 3.2.2). 0
°
"Real" computation of Bayes procedures
The closed fonos of (3.2.6) and (3.2.10) make the compulation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that 0 ~ (0" . .. , Op) has a hierarchically defined prior density,
"(0,, O ,... ,Op) = 2
"1 (Od"2(02
I( 1 ) ... "p(Op I Op_ d.
(3.2.15)
I
Section 3.2
Bayes Procedures
167
Here is an example. Example 3.2.4. The random effects model we shall study in Volume II has
(3.2.16)
where the €ij are i.i.d. N(O, (J:) and Jl, and the vector ~ = (AI, .. ' ,AI) is independent of {€i; ; 1 < i < I, 1 < j < J} with LI." ... ,LI.[ i.i.d. N(O,cri), 1 < j < J, Jl, '"'" N(Jl,O l (J~). Here the Xi)" can be thought of as measurements on individual i and Ai is an "individual" effect. If we now put a prior distribution on (Jl,,(J~,(J~) making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by () = (J.L, (J~, (J~, AI,. '. ,AI) with the Xii I () independent N(f.l + Ll. i . cr~). Then p(x I 0) = lli,; 'Pu.(Xi;  f.l Ll. i ) and
11'(0)
=
11'}(f.l)11'2(cr~)11'3(cri)
II 'PU" (LI.,)
i=l
I
(3.2.17)
where <Pu denotes the N(O, (J2) density. ]n such a context a loss function frequently will single out some single coordinate Os (e.g., LI.} in 3.2.17) and to compute r(a I x) we will need the posterior distribution of ill I X. But this is obtainable from the posterior distribution of (J given X = x only by integrating out OJ, j t= s, and if p is large this is intractable. ]n recent years socalled Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable 0 and the use of Bayesian methods has spread. We return to the topic in Volume II.
Linear Bayes estimates
When the problem of computing r( 1r, 6) and 611" is daunting, an alternative is to consider  of procedures for which r( 'R', 6) is easy to compute and then to look for 011" E 'D  a class 'D that minimizes r( 1f 16) for 6 E 'D. An example is linear Bayes estimates where, in the case of squared error loss [q( 0)  a]2, the problem is equivalent to minimizing the mean squared prediction error among functions of the form a + L;=l bjXj _ If in (1.4.14) we identify q(8) with Y and X with Z, the solution is

8(X)
= Eq(8) + IX 
E(X)]T{3
where f3 is as defined in Section 1.4. For example, if in the 11l0del (3.2.16), (3.2.17) we set q(fJ) = Ll 1 , we can find the linear Bayes estimate of Ll 1 hy using 1.4.6 and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of LI.} is
(3.2.18)
where E(Ll.I) given model
= 0,
X = (XIJ, ... ,XIJ)T, P.
= E(X)
and/l
=
ExlcExA,. For the
168
Var(X I ;)
Measures of Performance
Chapter 3
=E
Var(X,)
I B) +
Var E(XIj
I BJ = E«T~) +
(T~ + (T~,
E Cov(X,j , Xlk I B)
+ Cov(E(X'j I B), E(Xlk I BJ)
0+ Cov(1' + II'!, I' + tl.,)
= (T~ +
(T~,
Cov(tl.
Xlj)
=E
Cov(X,j , tl.,
!: ,,
I',
" From these calculations we find and Norberg (1986).
I B) + Cov(E(XIj I B), E(tl. l I 0» = 0 + (T~, =
(Tt·
f3
and OL(X). We leave the details to Problem 3.2.10.
'I
Linear Bayes procedures are useful in actuarial science, for example, Biihlmann (1970)
if
ii "
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier. the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is (the usually improper) prior 71"(0) ;,::; c. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than
modes and that again accounts in part for the popularity of maximum likelihood. An important property of the MLE is equivariance: An estimating method M producing the estimate (j M is said to be equivariant with respect to reparametrization if for every onetoDne function h from e to fl = h (e). the estimate of w '" h(B) is W = h(OM); that is, M
~
I
= h(BM ). In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss. then the Bayes procedure (}B = E(8 I X) is not equivariant
(h(O»
M
~

~
~
for nonlinear transfonnations because
E(h(O) I X)
op h(E(O I X))
1 ,
,
,,
for nonlinear h (e.g., Problem 3.2.3). The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is
i
re(a I x)
= LIB  a]'7f(O I x).
'Ee
(3.2.19)
i
If we set w = h(O) for h onetoone onto fl = heel, then w has prior >.(w) and in the w parametrization, the posterior Bayes risk is
= 7f(h
1
(w))
I
,
r,,(alx)
= L[waj2>,(wlx)
wE"
L[h(B)  a)2 7f (B I x).
(3.2.20)
'Ee
Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, r,,(a I x) op Te(h 1(a) Ix).
Section 3.2
Bayes Procedures
169
Loss functions of the form l((}, a) = Q(Po, Pa) are necessarily equivariant. The KullbackLeibler divergence K«(},a), (J,a E e, is an example of such a loss function. It satisfies Ko(w, a) = Ke(O, h 1 (a)), thus, with this loss function,
ro(a I x)
~
re(h1(a) I x).
See Problem 2.2.38. In the discrete case using K means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.l 0». In the N(O, case the K L (KullbackLeibler) loss K( a) is ~n(a  0)' (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families
"J)
e,
K(1], a) = L[ryj  ajJE1]Tj
j=1
•
+ A(1]) 
A(a).
Moreover, if we can find the KL loss Bayes estimate 1]BKL of the canonical parameter 7] and if 1] = c( 9) :  t E is onetoone, then the K L loss Bayes estimate of 9 in the
e
general exponential family is 9 BKL = C 1(1]BKd. For instance, in Example 3.2.1 where J.l is the mean of a nonnal distribution and the prior is nonnal, we found the squared error Bayes estimate jiB = wf/o + (lw)X, where 1]0 is the prior mean and w is a weight. Because the K L loss is equivalent to squared error for the canonical parameter p, then if w = h(p), WBKL = h('iiUKL), where 'iiBKL =
~
wTJo + (1  w)X.
Bayes procedures based on the KullbackLeibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume 11.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1%9; Lindley, 1965; Savage, 1954) who argue on nonnative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior 1r. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally as we consider all possible priors that the class 'Do of Bayes procedures and their limits is complete in the sense that for any 6 E V there is a 60 E V o such that R(0, 60 ) < R(0, 6) for all O. Summary. We show how Bayes procedures can be obtained for certain problems by COmputing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures require sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
170
Measures of Performance
Chapter 3
3.3
MINIMAX PROCEDURES
In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worstcase analysis; the true () is one that is as "hard" as possible. That is, 6 1 is better than 82 from a minimax point of view if sUPe R(0,6 1 ) < sUPe R(B, 82 ) and is said to be minimax if
o·
supR(O, 0')
.
= infsupR(O,o).
,
.
,
I·
Here (J and 8 are taken to range over 8 and V = {all possible decision procedures (possibly randomized)} while P = {p. : 0 E e}. It is fruitful to consider proper subclasses of V and subsets of P. but we postpone this discussion. The nature of this criterion and its relation to Bayesian optimality is clarified by considering a socalled zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set V of all randomized decision procedures whereas Nature has at her disposal all prior distributions 1r on 8. For the basic game, 5 picks 0 without N's knowledge, N picks 1f without 5's knowledge and then all is revealed and S pays N
r(Jr,o)
, ,
,
=
J
R(O,o)dJr(O)
where the notation f R(O, o)dJr(0) stands for f R(0, o)Jr(O)dII in the continuous case and L, R(Oj, o)Jr(O j) in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice 0 of 5 before picking 1r and 5 knows the rules of the game. Then
Ii ,
N naturally picks 1f,; such that
r(Jr"o)
that is, 1f,; is leastfavorable against such that
~supr(Jr,o),
o. Knowing the rules of the game S naturally picks 0*
•
(3.3.1)
r( Jr,., 0·) = sup r(Jr, 0') = inf sup r(Jr, 0).
We claim that 0* is minimax. To see this we note first that,
.
,
.
(3.3.2)
for allJr, o. On the other hand, if R(0" 0) = suP. R(0, 0), then if Jr, is point mass at 0" r( Jr" 0) = R(O" 0) and we conclude that supr(Jr,o) = sup R(O, 0)
,
•
•
(3.3.3)
I
,
•
•
Section 3.3
Minimax Procedures
171
and our claim follows. II: S is told the choke 7r of N before picking 6 and N knows the rules of the game. Then S naturally picks 6 1r such that
That is, r5 rr is a Bayes procedure for
11".
Then N should pick
7r*
such that (3.3.4)
For obvious reasons, 1f* is called a least favorable (to S) prior distribution. As we shall see by example, altbough the rigbthand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be umque. The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both
e and D are finite,
?5
11"
then:
(a)
v=supinfr(1I",6), v=infsupr(1I",6)
rr
?5
are both assumed by (say)
7r*
(least favorable), 6* minimax, respectively. Further,
") =v v=r1f,u (
.
(3.3.5)
v and v are called the lower and upper values of the basic game. When v (saY), v is called the value of the game.
=
v
=
v
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification ~ {eo} and = {eIl (Example 3.2.2) but is too reSlrictive in ilS and testing when assumption for the great majority of inference problems. A generalization due to Wald and Karlinsee Karlin (1 959)states that the conclusions of the theorem remain valid if and D are compact subsets of Euclidean spaces. There are more farreaching generalizations but, as we shall see later, without some form of compactness of and/or D, although equality of v and v holds quite generally, existence of least favorable priors and/or minimax procedures may fail.
eo
e,
e
e
The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on and D and are easy to prove.
e
Proposition 3.3.1. Suppose 6**. 7r** can be found such that
U
£** =
r ()1r.',
1f **
= 1rJ••
(3.3.6)
172
Measures of Performance
Chapter 3
that is, 0** is Bayes against 11"** and 11"** is least favorable against 0**. Then v R( 11"** ,0**). That is, 11"** is least favorable and J*'" is minimax.
To utilize this result we need a characterization of 11"8. This is given by
v
=
Proposition 3.3.2.11"8 is leastfavorable against
°iff
1r,jO: R(O,b) = supR(O',b)) = 1.
.'
(3,3,7)
That is, n a assigns probability only to points () at which the function R(·, 0) is maximal.
Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that 11"8 may not be unique. In particular, if R(O, 0) = constant. the rule has constant risk, then all 11" are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
Proof of Proposition 3.3.1. Note first that we always have
v<v
because, trivially,
i~fr(1r,b)
(3.3.8)
< r(1r,b')
(3.3,9)
for aIln, 5'. Hence,
v = sup in,fr(1r,b) < supr(1r,b')
•
(3,3.10)
•
for all 0' and v
<
infa, sUP1r 1'(11", (/) =
v. On the other hand. by hypothesis,
sup1'(1r,6**) > v.
v> inf1'(11"*'",6)
,
= 1'(1I"*",0*'") =
.
(3.3.11)
Combining (3.3.8) and (3.3.11) we conclude that
:r
v
as advertised.
= i~f 1'(11"**,0) = 1'(11"**,6**) = s~p1'(1I",0*"')
= V
(3.3.12)
" 'I
,.;
~l
o
Proofof Proposition 3.3.2. 1r is least favorable for b iff E.R(8,6) =
,.,
f r(O,b)d1r(O) =s~pr( ..,6).
•
(3.3.13)
But by (3.3.3),
supr(..,b) = supR(O, 6),
(3.3,14)
•
i ,
Because E.R(8, 6)
= sUP. R(O, b), (3.3,13) is possible iff (3.3.7) holds.
o
Putting the two propositions together we have the following.
•
Section 3.3
Minimax Procedures
173
11"*
Theorem 3.3.2. Suppose 0* has sUPo R((}, 0*) = 1" < 00. If there exists a prior that 0* is Bayes for 11"* and tr" {(} : R( (}, 0") = r} = 1, then 0'" is minimax.
such
Example 3.3.1. Minimax estimation in the Binomial Case. Suppose S has a B(n,B) distribution and X = Sjn,as in Example 3.2.3. Let I(B, a) ~ (Ba)'jB(IB),O < B < 1. For this loss function,
R(B X) = E(X _B)' , B(IB)

=
B(IB) = ~ nB(IB) n'

and X does have constant risk. Moreover, we have seen in Example 3.2.3 that X is Bayes, when 8 is U(Ol 1). By Theorem 3.3.2 we conclude that X is minimax and, by Proposition 3.3.2, the uniform distribution least favorable. For the usual quadratic loss neither of these assertions holds. The minimax estimate is
o's=S+hln = .,fii X+ I 1 () n+.,fii .,fii+l .,fii+1 2
This estimate does have constant risk and is Bayes against a (J( y'ri/2, vn/2) prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as n t 00 of the ratio of the risks of 0* and X is > 1 for every () =f ~. At B = the ratio tends to 1. Details are left to Problem 3.3.4. 0
!
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity v > 0 for n seconds or, if it is not working, does not answer. Because of the general "noise" level in space the signals received on Earth vary randomly whether the satellite is sending ~r not. The mean voltage per second of the signal for each of the n seconds is recorded. Denote the mean voltage of the signal received through the ith second less expected mean voltage due to noise by Xi. We assume that the Xi are independently and identically distributed as N(p" 0'2) where p, = v, if the satellite functions, and otherwise. The variance 0'2 of the "noise" is assumed known. Our problem is to decide whether "J.l = 0" or"p, = v." We view this as a decision problem with 1 loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk? A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability 1r to and 1  1r to v, use 0  1 loss, and set L(x, 0, v) = p( x Iv) j p(x I 0), then the Bayes test decides I' = v if
°
°°
L(x,O,v)
and decides p,
=
=
exp
°
" {
2"EXi   , a 2a
nv'} >
17l'
if L(x,O,v) <:17l' 7l'
174
Measures of Performance
Chapter 3
This test is equivalent to deciding f.t
= v (Problem 3.3.1) if. and only if,
"yn
1 ;;;:EXi>t,
T=
, , ,
where,
t
If we call this test d,r.
=
" [log " + 2 nv'] ;;;: vyn a
17r
2
'
R(O,J. )
1 ~ <I>(t)
<I>
= <1>( t)
R(v,o,)
(t  vf)
To get a minimax test we must have R(O, 61r ) = R( V, 6ft), which is equivalent to
v.,fii t = t  "'
h •
or
"
','
I ,~
~.
, ,
•
v.,fii t= . 20
•
Because this value of t corresponds to 7r = the intuitive test, which decides JJ only ifT > ~[Eo(T) + Ev(T)J, is indeed minimax.
!.
= v if and
0
'I
•
If is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem
e
3.3.2.
Theorem 3,3,3, Let 0' be a rule such that sup.R(O,o') = r < 00, let {"d denote a sequence of pn'or distributions such that 'lrk;{8 : R(B,o*) = r} = 1, and let Tk = inffJ r( ?rie, J), where r( 7fkl 0) denotes the Bayes risk wrt 'lrk;. If
Tk Task 00,
(3.3.i5)
then J* is minimax. Proof Because r( "k, 0') = r
supR(B, 0') = rk + 0(1)
•
where 0(1)
~
0 as k
~ 00.
But hy (3.3.13) for any competitor 0
supR(O,o) > E.,(R(B,o)) > rk ~supR(O,o') 0(1).
•
•
(3.3.16)
,
'.
If we let k _ suP. R(B, 0').
00
the lefthand side of (3.3.16) is unchanged, whereas the right tends to
0
j
1
•
Section 3.3
M'mimax Procedures
175
Example 3.3.3. Normal Mean. We now show that X is minimax in Example 3.2.1. Identify 1fk with theN(1Jo, 7 2 ) prior where k = 7 2 . Then
whereas the Bayes risk of the Bayes rule of Example 3.2.1 is
i~frk(J) ~ (,,'/n) +7' n
Because
(0
7
2
0
2
=
00,
n  (,,'/n) +7'
a2
1
0
2
n·
2
In)1 « (T2 In) + 7 2 )
>
0 as T 2

we can conclude that
X is minimax.
0
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose XI,'" ,Xn arei.i.d, FE:F
Then X is minimax for estimating B(F) EF(Xt} with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let 1fk be a prior distribution on :F constructed as foUows:(J)
(i)
=
".{F: VarF(XIl # M}
= O.
(ii) ".{ F : F
# N(I', M) for some I'} = O.
(iii) F is chosen by first choosing I' = 6(F) from a N(O, k) distribution and then taking F = N(6(F),M).
Evidently, the Bayes risk is now the same as in Example 3.3.3 with 0"2 evidently,
= M.
Because.
) VarF(X, ) max R(F, X = max :F :F n
Theorem 3.3.3 applies and the result follows. Minimax procedures and symmetry
M
n
o
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" 0, There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and CaseUa (1998), for instance. We shall discuss this approach somewhat, by example. in Chapters 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading. Summary. We introduce the minimax principle in the contex.t of the theory of games. Using this framework we connect minimaxity and Bayes metbods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
This approach has early on been applied to parametric families Va.1 Unbiased Estimation. there is a least favorable prior 1[" and a minimax rule 8* such that J* is the Bayes rule for n* and 1r* maximizes the Bayes risk of J* over all priors. then in estimating a linear function of the /3J. (72) = . computational ease.d. such as 5(X) = q(Oo). 5') for aile. Ii I More specifically. I I . if Y is postulated as following a linear regression model with E(Y) = zT (3 as in Section 2.I . N (Il. v equals the Bayes risk of the Bayes rule 8* for the prior 1r*. estimates that ignore the data.2. Bayes and minimaxity. When v = ii. Obviously. In the nonBayesian framework. all 5 E Va. .• •••.6.L and (72 when XI. 5) > R(0. in many cases. it is natural to consider the computationally simple class of linear estimates. We show that Bayes rules with constant risk. ruling out. and so on. An alternative approach is to specify a proper subclass of procedures. looking for the procedure 0. Moreover. for which it is possible to characterize and. 3. according to these criteria. . compute procedures (in particular estimates) that are best in the class of all procedures.4.. or more generally with constant risk over the support of some prior. the solution is given in Section 3. 0'5) model. the game is said to have a value v. We introduced.4 UNBIASED ESTIMATION AND RISK INEQUALITIES 3. S(Y) = L:~ 1 diYi .. for instance." R(0.3. This notion has intuitive appeal. Survey Sampling In the previous two sections we have considered two decision theoretic optimality principles. we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule 8 and N selects a prior 1['.. in Section 1. This result is extended to rules that are limits of Bayes rules with constant risk and we use it to show that x is a minimax rule for squared error loss in the N (0 . then the game of S versus N has a value 11. D. This approach coupled with the principle of unbiasedness we now introduce leads to the famous GaussMarkov theorem proved in Section 6.1. Von Neumann's Theorem states that if e and D are both finite. are minimax. An estimate such that Biase (8) 0 is called unbiased. and then see if within the Do we can find 8* E Do that is best according to the "gold standard. The lower (upper) value v(v) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk. A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called leas! favorable.. symmetry. The most famous unbiased estimates are the familiar estimates of f. . on other grounds. When Do is the class of linear procedures and I is quadratic Joss.2.1 . which can't be beat for 8 = 80 but can obviously be arbitrarily terrible. 176 Measures of Performance Chapter 3 . for example. we can also take this point of view with humbler aims. the notion of bias of an estimate O(X) of a parameter q(O) in a model P {Po: 0 E e} as = Biaso(5) = Eo5(X)  q(O). I X n are d. E Do that minimizes the Bayes risk with respect to a prior 1r among all J E Do. Do C D. II .
.XN} 1 (3. . .4. UMVU (uniformly minimum variance unbiased). ~ N L.4. X N ) as parameter .14) and has = iv L:fl (3. reflecting the probable correlation between ('ttl. Unbiased Estimates in Survey Sampling.il) Clearly for each b. UN) and (Xl... This leads to the model with x = (Xl. Suppose we wish to sample from a finite population. Ui is the last census income corresponding to = it E~ 1 'Ui.4 Unbiased Estimation and Risk Inequalities 177 given by (see Example 1.1 ) ~ 1 nl L (Xi ~ ooc n ~ X) 2 .6) where b is a prespecified positive constant..Xn denote the incomes of a sample of n families drawn at random without replacement.3..3) = 0 otherwise.4.. and u =X  b(U . .. It is easy to see that the natural estimate X ~ L:~ 1 Xi is unbiased (Problem 3.4.4.. X N for the unknown current family incomes and correspondingly UI.' . . to determine the average value of a variable (say) monthly family income during a time between two censuses and suppose that we have available a list of families in the unit with family incomes at the last census. .4) where 2 CT. . these are both UMVU. Unbiased estimates playa particularly important role in survey sampling.5) This method of sampling does not use the information contained in 'ttl.. . X N ). a census unit. .(Xi 1=1 I" N ~2 x) .4. . Write Xl. .< 2 ~ ~ (3.. ..8) Jt =X .3 and Problem 1. (3.2.···.. . for instance. (~) If{aj. . (3. [J = ~ I:~ 1 Ui· XR is also unbiased. If ..1. is to estimate by a regression estimate ~ XR Xi.2) 1 Because for unbiased estimates mean square error and variance coincide we call an unbiased estimate O*(X) of q(O) that has minimum MSE among all unbiased estimat~s for all 0.3. We let Xl.Section 3... (3.. .4. . UN for the known last census incomes..an}C{XI. Example 3.4. As we shall see shortly for X and in Volume 2 for . We want to estimate the parameter X = Xj. UN' One way to do this. We ignore difficulties such as families moving..
q(e) is biased for q(e) unless q is linear. .4). (i) Typically unbiased estimates do not existsee Bickel and Lehmann (1969) and Problem 3.3. .4. X M } of random size M such that E(M) = n (Problem 3.. " 1 ".X)!Var(U) (Problem 3...19).4... estimate than X and the best choice of b is bopt The value of bopt is unknown but can be estimated by = The resulting estimate is no longer unbiased but behaves well for large samplessee Problem 5. . To see this write ~ 1 ' " Xj XHT ~ N LJ 7'" l(Xj E S).. . for instance..4. If B is unbiased for e. ". ' ...I. M (3.Xi XHT=LJN i=l 7fJ. (iii) Unbiased_estimates do not <:.'I .2Gand minimax estimates often are. ~ . . An alternative approach to using the Uj is to not sample all units with the same prob ability.7) II ! where Ji is defined by Xi = xJ. I j .11. However. I . They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later. N toss a coin with probability 7fj of landing heads and select Xj if the coin lands heads. Unbiasedness is also used in stratified sampling theory (see Problem 1.j . 1. Specifically let 0 < 7fl. Xl!Var(U). . 0 Discussion. :I . .18. outside of sampling.bey the attractive equivariance property. 111" N < 1 with 1 1fj = n. . The result is a sample S = {Xl. For each unit 1. . The HorvitzThompson estimate then stays unbiased.3. .j' I ': Because 1rj = P[Xj E Sj by construction unbiasedness follows.i . . (ii) Bayes estimates are necessarily biasedsee Problem 3. I II. j=1 3 N ! I: . It is possible to avoid the undesirable random sample size of these schemes and yet have specified 1rj.4. 178 Measures of Performance Chapter 3 i .. Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems.1 the correlation of Ui and Xi is positive and b < 2Cov(U.4. this will be a beller cov(U. This makes it more likely for big incomes to be induded and is intuitively desirable. . the unbiasedness principle has largely fallen out of favor for a number of reasons. Ifthc 'lrj are not all equal. II it >: . X is not unbiased but the following estimate known as the HorvitzThompson estimate is: Ef ..15).~'! I . A natural choice of 1rj is ~n.
B) for II to hold.8) is finite. Simpler assumptions can be fonnulated using Lebesgue integration theory.O E 81iJO log p( x.2. e e. 167.p) where i = (Y . For all x E A. for integration over Rq. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous~case theorems given later. p. 0 or equivalently Vare(Bn)/MSEe(Bn) t 1 as n t 00. the variance a 2 = Var(ei) is estimated by the unbiased estimate 8 2 = iTe/(n . We make two regularity assumptions on the family {Pe : BEe}. That is. What is needed are simple sufficient conditions on p(x.4.OJdx] = J T(xJ%oP(x..4. (II) 1fT is any statistic such that E 8 (ITf) < 00 for all 0 E then the operations of integration and differentiation by B can be interchanged in J T( x )p(x.4.OJdx (3. We suppose throughout that we have a regular parametric model and further that is an open subset of the line. OJ exists and is finite. and we can interchange differentiation and integration in p(x.4.8) is assumed to hold if T(xJ = I for all x. Then 11 holdS provided that for all T such that E. B)dx. has some decision theoretic applications. suppose I holds. This preference of 8 2 over the MLE 52 = iT e/n is in accord with optimal behavior when both the number of observations and number of parameters are large.Section 3. For instance. %0 [j T(X)P(x. Note that in particular (3. Assumption II is practically useless as written.2 The Information Inequality The oneparameter case We will develop a lower bound for the variance of a statistic. B) is a density. 3.( lTD < 00 J . (I) The set A ~ {x : p(x.4.4 Unbiased Estimation and Risk Inequalities 179 Nevertheless.   . From this point on we will suppose p(x. For instance. Finally.O) > O} docs not depend on O. f3 is the least squares estimate. e.9. and appears in the asymptotic optimality theory of Section 5. See Problem 3. Some classical conditiolls may be found in Apostol (1974).4.ZDf3). good estimates in large samples ~ 1 ~ are approximately unbiased. in the linear regression model Y = ZDf3 + e of Section 2. The arguments will be based on asymptotic versions of the important inequalities in the next subsection. which can be used to show that an estimate is UMVU. We expect that jBiase(Bn)I/Vart (B n ) . The lower bound is interesting in its own right. unbiased estimates are still in favor when it comes to estimating residual variances. B)dx.8) whenever the righthand side of (3. and p is the number of coefficients in f3. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. as we shall see in Chapters 5 and 6.
(3.n and 1(0) = Var (~n_l .. It is not hard to check (using Laplace transform theory) that a oneparameter exponential family quite generally satisfies Assumptions I and II. where a 2 is known. 0) ~ 1(0) = E. 0))' = J(:Ologp(x. For instance.10) and. t. O)dx = O.. 1 X n is a sample from a Poisson PCB) population.i J T(x) [:OP(X.6..4. 1 X n is a sample from a N(B.180 for all Measures of Performance Chapter 3 e. Suppose Xl. Similarly. suppose XI. the integrals . . which is denoted by I(B) and given by Proposition 3. Suppose that 1 and II hold and that E &Ologp(X.4. Ii ! I h(x) exp{ ~(O)T(x) . I J J & logp(x 0 ) = ~n 1 . then I and II hold. I 1(0) 1 = Var (~ !Ogp(X.4.O) & < 00. 1 and 11 are satisfied for samples from gamma and beta distributions with one parameter fixed.11 ) .0 Xi) 1 = nO =  0' n O· o .  Proof. the Fisher information number.B(O)} is an exponential family and TJ(B) has a nonvanishing continuous derivative on 8.O)]j P(x.4. .1.I [I ·1 . (~ logp(X. (3.O)} p(x. O)dx ~ :0 J p(x.9) Note that 0 < 1(0) < Lemma 3.1.4. 0))' p(x. O)dx. o Example 3. Then (see Table 1. thus. (J2) population.O)] dxand J T(x) [:oP(X.4.O)). 00. If I holds it is possible to define an important characteristic of the family {Po}.&0 '0 {[:oP(X. O)dx :Op(x. .1) ~(O) = 0la' and 1 and 11 are satisfied. I' .2. Then Xi ..O)] dx " I'  are continuous functions(3) of O. Then (3. /fp(x.
4. 1/. Var (t. Suppose that X = (XI.4.1. 10gp(X. 0 E e. nI.4.'(0)]'  I(e) . ej.4.O)dX= J T(x) UOIOgp(x..O). CoroUary 3.1.4.(T(X)) > [1/. Suppose that I and II hold and 0 < 1(0) < 00. Denote E.'(0) = J T(X):Op(x.1 hold and T is an unbiased estimate ofB. .1.I 1.' (e) = Cov (:e log p(X. then I(e) = nI.. Here's another important special case. (3.Section 3. Proposition 3.4.l4) and Lemma 3.I1.16) to the random variables 8/8010gp(X. (3. Let [. Then Var.13) By (A. 0 The lower bound given in the information inequality depends on T(X) through 1/.4 Unbiased Estimation and Risk Inequalities 181 Here is the main result of this section. 1/.(0). Suppose the conditions of Theorem 3. e) and T(X). Theorem 3.12) Proof.4.17) . 1/.O))P(x.4.Xu) is a sample from a population with density f(x.'(~)I)'.(T(X)) < 00 for all O. > [1/. Using I and II we obtain. e)) = I(e).O)dX.4.1.(T(X) > I(O) 1 (3. . 0 (3. T(X)) .(T(X)) by 1/. and that the conditions of Theorem 3.4. If we consider the class of unbiased estimates of q(B) = B. (3. by Lemma 3. Let T(X) be all)' statistic such that Var. We get (3.(0).14) Now let ns apply the correlation (CauchySchwarz) inequality (A. (Information Inequality).1 hold.(0) = E(feiogf(X"e»)'. Then for all 0.4.4.2.(0) is differentiable and Var (T(X)) . (e) and Var.4. we obtain a universal lower bound given by the following.16) The number 1/I(B) is often referred to as the information or CramerRoo lower bound for the variance of an unbiased estimate of 'tjJ(B).15) The theorem follows because.
As we previously remarked, $I_1(\theta)$ is often referred to as the information contained in one observation. We have just shown that the information $I(\theta)$ in a sample of size $n$ is $nI_1(\theta)$.

Next we note how we can apply the information inequality to the problem of unbiased estimation. If the family $\{P_\theta\}$ satisfies I and II and if there exists an unbiased estimate $T^*$ of $\psi(\theta)$ which achieves the lower bound of Theorem 3.4.1 for every $\theta$, then $T^*$ is UMVU as an estimate of $\psi(\theta)$. $\qquad$ (3.4.18)

Example 3.4.2. (Continued). For a sample from a $P(\theta)$ distribution, the MLE is $\hat\theta = \bar X$. Because $\bar X$ is unbiased with $\mathrm{Var}(\bar X) = \theta/n$ and $I(\theta) = n/\theta$, the lower bound of Corollary 3.4.1 is achieved, the condition in (3.4.18) holds, and $\bar X$ is a UMVU estimate of $\theta$. $\Box$

Example 3.4.3. Suppose $X_1,\dots,X_n$ is a sample from a normal distribution with unknown mean $\theta$ and known variance $\sigma^2$. The conditions of the information inequality are satisfied and, if $\varphi$ denotes the $N(0,1)$ density, then $f(x,\theta) = \sigma^{-1}\varphi((x-\theta)/\sigma)$ and

$$I(\theta) = \mathrm{Var}\left[\frac{\partial}{\partial\theta}\log p(X,\theta)\right] = \sum_{i=1}^n\mathrm{Var}\left[\frac{\partial}{\partial\theta}\log f(X_i,\theta)\right] = \mathrm{Var}\left[\frac{1}{\sigma^2}\sum_{i=1}^n(X_i-\theta)\right] = \frac{n}{\sigma^2},$$

a consequence of Lemma 3.4.1. Now $\mathrm{Var}(\bar X) = \sigma^2/n$. Because $\bar X$ is unbiased, by Corollary 3.4.1 we see that $\bar X$ is a UMVU estimate of $\theta$. Note that because $\bar X$ is UMVU whatever may be $\sigma^2$, we have in fact proved that $\bar X$ is UMVU even if $\sigma^2$ is unknown. $\Box$

We can similarly show (Problem 3.4.1) that if $X_1,\dots,X_n$ are the indicators of $n$ Bernoulli trials with probability of success $\theta$, then $\bar X$ is a UMVU estimate of $\theta$.

These are situations in which $\bar X$ follows a one-parameter exponential family. This is no accident.

Theorem 3.4.2. Suppose that the family $\{P_\theta : \theta \in \Theta\}$ satisfies assumptions I and II and there exists an unbiased estimate $T^*$ of $\psi(\theta)$ such that $\mathrm{Var}_\theta[T^*(X)] = [\psi'(\theta)]^2/I(\theta)$ for all $\theta \in \Theta$. Then $\{P_\theta\}$ is a one-parameter exponential family with density or frequency function of the form

$$p(x,\theta) = h(x)\exp[\eta(\theta)T^*(x) - B(\theta)]. \qquad (3.4.19)$$

Conversely, if $\{P_\theta\}$ is a one-parameter exponential family of the form (1.6.1) with natural sufficient statistic $T(X)$ and $\eta(\theta)$ has a continuous nonvanishing derivative on $\Theta$, then $T(X)$ achieves the information inequality bound and is a UMVU estimate of $E_\theta(T(X))$.
4.A (B) .4.4. then (3. In the HardyWeinberg model of Examples 2.6). B) = a. Our argument is essentially that of Wijsman (1973). (3. .B)) + nz)[log B . Then ~ ( 80 logp X. The passage from (3.4.24) [(B) = Vare(T(X) .20) with Pg probability 1 for each B. it is necessary.4.4 and 2. If A. the information bound is [A"(B))2/A"(B) A"(B) Vare(T(X») so that T(X) achieves the information bound as an estimate of EeT(X).21 ) Upon integrating both sides of (3.4.20) with respect to B we get (3.19) is highly technical. Here is the argument.20) guarantees Pe(Ae) ~ 1 and assumption 1 guarantees P. p(x. j hence. 2 A ** = A'" and the result follows. By (3.4.4.20) to (3.6. X2 E A"''''.4.B) so that = T(X) . B .2..2.10g(1 .4.1) we assume without loss of generality (Problem 3.4.23) for all B1 .4. there exist functions al ((J) and a2 (B) such that :B 10gp(X.16) we know that T'" achieves the lower bound for all (J if. and only if. Suppose without loss of generality that T(xI) of T(X2) for Xl. continuous in B.4.B) = a. B ..4 Unbiased Estimation and Risk Inequalities 183 Proof We start with the first assertion. By solving for a" a2 in (3..4. (B)T'(X) + a2(B) (3.B) I + 2nlog(1  BJ} . But now if x is such that = 1. From this equality of random variables we shall show that Pe[X E A'I = 1 for all B where A' = {x: :Blogp(x.19).A'(B» = VareT(X) = A"(B).p(B) = A'(B) and.4.23) must hold for all B.14) and the conditions for equality in the correlation inequality (A. (3.6.25) But .4. denotes the set of x for which (3. Note that if AU = nmAem .(A") = 1 for all B'.(B)T' (x) +a2(B) for all BEe}.3) that we have the canonical case with ~(B) = Band B( Bj = A(B) = logf h(x)exp{BT(x))dx. B) = aj(B)T'(x) + a2(B) (3. then (3.Section 3.1. J 2 Pe. Conversely in the exponential family case (1. Let B .2 and. 0 Example 3. " and both sides are continuous in B.22) for j = 1..ll. Thus.(Ae) = 1 for all B' (Problem 3. we see that all a2 are linear combinations of {} log p( Xj. However. :e Iogp(x. be a denumerable dense subset of 8. thus.4. (3.4. B) IdB. B) = 2"' exp{(2nj 2"' exp{(2n. + n2) 10gB + (2n3 + n3) Iog(1 .20) hold.
3.4.. () (ell" .(e) exist. we will find a lower bound on the variance of an estimator . we have iJ' aB' logp (X..25) we obtain a' 10gp(X. Na). The variance of B can be computed directly using the moments of the multinomial distribution of (NI • N z . B) 2 1 a = p(x.ed)' In particular. B) to canonical form by setting t) = 10g[B((1 .4.27) and integrate both sides with respect to p(x.4. . Even worse.4. .IOgp(X.4. B) satisfies in addition to I and II: p(·.'" .B)(2n. or by transforming p(x.26) holds. Sharpenings of the information inequality are available but don't help in general.4. We find (Problem 3. as Theorem 3. in the U(O. ~ ~ This T coincides with the MLE () of Example 2.B)] and then using Theorem 1.4. . assumptions I and II are satisfied and UMVU estimates of 1/. I(B) = Eo aB' It turns out that this identity also holds outside exponential families. '.B) ~ A "( B).2.25). that I and II fail to hold. Proof We need only check that I a aB' log p(x.2 implies that T = (2N 1 + N z )/2n is UMVU for estimating E(T) ~ (2n)1[2nB' + 2nB(1 . in many situations. e).4.4.B)] ~ B. B)  2 (a log p(x.24). (Continued). • The multiparameter case We will extend the information lower bound to the case of several parameters.2. B) example. Theorem 3. .p' (B) I'( I (B).4. although UMVU estimates exist.2. It often happens. (3. See Volume II. Then (3.B) is twice differentiable and interchange between integration and differentiation is pennitted. B) )' aB (3. for instance.6. Extensions to models in which B is multidimensional are considered next. I I' f .2 suggests. A third metho:! would be to use Var(B) = I(I(B) and formula (3. For a sample from a P( B) distribution o EB(::. DiscRSsion.26) Proposition 3.7) Var(B) ~ B(I .6. .. Because this is an exponential family. B). Example 3. but the variance of the best estimate is not equal to the bound [. B) aB' p(x.4.184 Measures of Performance Chapter 3 where we have used the identity (2nl + HZ) + (2n3 + nz) = 211. . 0 0 . By (3. " 'oi .B)) which equals I(B). 0 ~ Note that by differentiating (3. =e'E(~Xi) = . Suppose p('..
.Section 3.... (c) If.31) and 1(0) (b) If Xl.0) = er.4.4.Xnf' has information matrix 1(0) = EO (aO:.. 0 = (I". (3. = Var('... . We assume that e is an open subset of Rd and that {p(x. iJO logp(X..32) Proof. . er 2 ). . then X = (Xl.4. .4. j = 1. (0) iJiJ lt2(0) ~ E [aer 2 iJl" logp(x.4. Suppose X = 1 case and are left to the problems. er 2).. .Ok IOgp(X.d.4.2 ] ~ er. .. B) : 0 E 8} is a regular parametric model with conditions I and II satisfied when differentiation is with respect OJ.30) iJ Ijd O) = Covo ( eO logp(X. . a ). as X.4. Let p( x.28) where (3.29) Proposition 3. .( x 1") 2 2a 2 logp(x 0) .Xn are U.O») . nh (O) where h is the information matrix ofX. (3. ~ N(I". 1 <j< d. pC B) is twice differentiable and double integration and differentiation under the integral sign can be interchanged. Under the conditions in the opening paragraph.O) ] iJ2 ] l22(0) = E [ (iJer2)21ogp(x.2 = er.4.5.4 Unbiased Estimation and Risk Inequalities 185 of fh when the parameters 82 . 1 <k< d. (a) B=T 1 (3.4 E(x 1") = 0 ~ I. The arguments follow the d Example 3. 0)] = E[er. Then 1 = log(211') 2 1 2 1 2 loger .4 /2. . 0) j k That is.? ologp(X. 0). III (0) = E [~:2 logp(x. d. in addition. The (Fisher) information matrix is defined as (3.Bd are unknown. 0). 6) denote the density or frequency function of X where X E X C Rq.
6. We will use the prediction inequality Var(Y) > Var(I'L(Z».4.4. The conditions I. where I'L(Z) denotes the optimal MSPE linear predictor of Y. J)d assumed unknown.38) where Lzy = EO(TVOlogp(X. Then VarO(T(X» > Lz~ [1(0) Lzy (3.(O).(0) exists and (3.4/2 . (continued).4. Assume the conditions of the .4. that is. Example 3. 1/. . 0) .6. Canonical kParameter Exponential Family.13).' = explLTj(x)9j j=1 . Then Theorem 3. I'L(Z) = I'Y + (Z l'z)TLz~ Lzy· Now set Y (3.36) Proof. We claim that each of Tj(X) is a UMVU . Suppose k . Here are some consequences of this result. (3. A(O).'(0) = V1/..4. .4.4.O») = VOEO(T(X» and the last equality follows 0 from the argument in (3.6. = ..!pening paragraph hold and suppose that the matrix [(0) is nonsingular Then/or all 0.4.1. Let 1/. in this case [(0) ! .. UMVU Estimates in Canonical Exponential Families. By (3.3. Z = V 0 logp(x.( 0) be the d x 1 vector of partial derivatives.2 ( 0 0 ) .A.6 hold.4.O) = T(X) . then [(0) = VarOT(X).(0) = EOT(X) and let ".33) o Example 3.A(O)}h(x) (3.30) and Corollary 1. II are easily checked and because VOlogp(x. 0).. ! p(x.. [(0) = VarOT(X) = (3..4.4.34) () E e open.186 Measures of Performance Chapter 3 Thus.37) = T(X). Suppose the conditions of Example 3. .4.35) o ~ Next suppose 8 1 = T is an estimate of (Jl with 82.
39) where.A(O)} whereTT(x) = (T!(x).p(0)11(0) ~ (1.41) because .. then Nj/n is UMVU for >'j.4. (3..(1.) = Var(Tj(X».. kI A(O) ~ 1£ log Note that 1+ Le j=l Oj 80 A (0) J 8 = 1 "k 11 ne~ I 0 +LJl=le l = 1£>'. Thus.0.. is >.6 with Xl.Xn)T.(X)) 82 80.d. i =j:. the lower bound on the variance of an unbiased estimator of 1/Jj(O) = E(n1Tj(X) = >. ..Section 3. k. . a'i X and Aj = P(X = j).. .. . .. (1 + E7 / eel) = n>'j(1. . we let j = 1.. .i.. j=1 X = (XI. 0 = (0 1 .0).4.4 Unbiased Estimation and Risk Inequalities 187 estimate of EOTj(X). .A.4. Example 3.4. = nE(T. This is a different claim than TJ(X) is UMVU for EOTj(X) if Oi.(X) = I)IXi = j]...40) J We claim that in this case . j = 1.j. But f. AI..7.A(0) j ne ej = (1 + E7 eel _ eOj ) . Multinomial Trials.T(O) = ~:~ I (3.p(OJrI(O). by Theorem 3.4.Tk_I(X)). . we transformed the multinomial model M(n. .A(O) is just VarOT! (X). n T..4 8'A rl(O)= ( 8080 t )1 kx k .>'j)/n.p(0) is the first row of 1(0) and. But because Nj/n is unbiased and has 0 variance >'j(1 . " Ok_il T. hence. .6.O) = exp{TT(x)O . . X n i.3. are known. Ak) to the canonical form p(x. In the multinomial Example 1.. . without loss of generality. j..>'. . .)/1£. We have already computed in Proposition 3.4. To see our claim note that in our case (3.. .
Also note that ~ o unbiased '* VarOO > r'(O).21.4.3 are 0 explored in the problems. .d. 3. Suppose that the conditions afTheorem 3. They are dealt with extensively 'in books on numerical analysis such as Dahlquist. T(X) is the UMVU estimate of its expectation.4.4. (72) then X is UMVU for J1 and ~ is UMVU for p." . Asymptotic analogues oftbese inequalities are sharp and lead to the notion and construction of efficient estimates. . The Normal Case. We establish analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal. Then J (3. " il 'I.8.3 hold and • is a ddimensional statistic. even if the loss function and model are well specified. 188 MeasureS of Performance Chapter 3 j Example 3.i.X)2 is UMVU for 0 2 . features other than the risk function are also of importance in selection of a procedure.42) where A > B means aT(A .4.3 whose proof is left to Problem 3. • Theorem 3. ...pd(OW = (~(O)) dxd . ~ In Chapters 5 and 6 we show that in smoothly parametrized models.B)a > Ofor all . reasonable estimates are asymptotically unbiased. and robustness to model departures. X n are i. .. If Xl.2 + (J2. . We study the important application of the unbiasedness principle in survey sampling.4.4. Using inequalities from prediction theory. .. We derive the information inequa!ity in oneparameter models and show how it can be used to establish that in a canonical exponential family.42) are d x d matrices.4. But it does not follow that n~1 L:(Xi . N(IL. 1 j Here is an important extension of Theorem 3.4. .5.5 NON DECISION THEORETIC CRITERIA In practice. Let 1/J(O) ~EO(T(X))dXl and "'(0) = (1/Jl(O).4. Summary. These and other examples and the implications of Theorem 3.4.1 Computation Speed of computation and numerical stability issues have been discussed briefly in Section 2."xl' Note that both sides of (3. 3.. interpretability of the procedure. we show how the infomlation inequality can be extended to the multiparameter case. LX. The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure.
Björk, and Anderson (1974). We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory.

Closed form versus iteratively computed estimates

At one level closed form is clearly preferable. For instance, a method of moments estimate of $(\lambda, p)$ in Example 2.1.2 is given in closed form in terms of $\bar X$ and the empirical variance $\hat\sigma^2$ (Problem 2.1.2) and is clearly easier to compute than the MLE. On the other hand, consider the Gaussian linear model of Section 2.2. Then least squares estimates are given in closed form by equation (2.1.10). The closed form here is deceptive because inversion of a $d \times d$ matrix takes on the order of $d^3$ operations when done in the usual way and can be numerically unstable. It is in fact faster and better to solve the normal equations by, say, Gaussian elimination for the particular $Z_D^T Y$.

Faster versus slower algorithms

Consider estimation of the MLE $\hat\theta$ in a general canonical exponential family as in Section 2.3. It may be shown that, with the method of bisection, if we seek to take enough steps $J$ so that $|\hat\theta^{(J)} - \hat\theta| \le \epsilon$, $\epsilon < 1$, then $J$ is of the order of $\log\frac{1}{\epsilon}$ (Problem 3.5.17). On the other hand, the Newton–Raphson method, in which the $j$th iterate is

$$\hat\theta^{(j)} = \hat\theta^{(j-1)} - \left[\ddot A(\hat\theta^{(j-1)})\right]^{-1}\left(\dot A(\hat\theta^{(j-1)}) - T(X)\right),$$

takes on the order of $\log\log\frac{1}{\epsilon}$ steps (Problem 3.5.18), at least if started close enough to $\hat\theta$. The improvement in speed may however be spurious since $\ddot A^{-1}$ is costly to compute if $d$ is large—though the same trick as in computing least squares estimates can be used.

The interplay between estimated variance and computation

As we have seen in special cases in Examples 3.4.2 and 3.4.3, estimates of parameters based on samples of size $n$ have standard deviations of order $n^{-1/2}$. It follows that striving for numerical accuracy of order smaller than $n^{-1/2}$ is wasteful. Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the constants involved. With ever faster computers a difference at this level is irrelevant. But it reappears when the data sets are big and the number of parameters large.

3.5.2 Interpretability

Suppose that in the normal $N(\mu, \sigma^2)$ Example 2.1.5 we are interested in the parameter $\mu/\sigma$. This parameter, the signal-to-noise ratio, has a clear interpretation for this population of measurements. Its maximum likelihood estimate $\bar X/\hat\sigma$ continues to have the same intuitive interpretation as an estimate of $\mu/\sigma$ even if the data are a sample from a distribution with mean $\mu$ and variance $\sigma^2$ other than the normal.
they may be interested in total consumption of a commodity such as coffee. Similarly. suppose that if n measurements X· (Xi. what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). would be an adequate approximation to the distribution of X* (i. We return to this in Section 5.5. 1976) and Doksum (1975). the parameter v that has half of the population prices on either side (fonnally. We will consider situations (b) and (c). and both the mean tt and median v qualify. but there are a few • • = . 1 X~) could be taken without gross errors then p.')1/2 = l. 1975b. . E p. (c) We have a qualitative idea of what the parameter is. amoog others. we could suppose X* rv p.3 Robustness Finally.. P(X > v) > ~).. v is any value such that P(X < v) > ~. Alternatively. But () is still the target in which we are interested. anomalous values that arise because of human error (often in recording) or instrument malfunction. However. but there are several parameters that satisfy this qualitative notion. .)(p/>. which as we shall see later (Section 5. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied.I1'. we may be interested in the center of a population. For instance. but we do not necessarily observe X*.\)' distribution.e. suppose we initially postulate a model in which the data are a sample from a gamma. However. Then E(X)/lVar(X) ~ (p/>. we observe not X· but X = (XI •..4. See Problem 3. the HardyWeinberg parameter () has a clear biological interpretation and is the parameter for the experiment described in Example 2. We can now use the MLE iF2. say () = N p" where N is the population size and tt is the expected consumption of a randomly drawn individuaL (b) We imagine that the random variable X* produced by the random experiment we are interested in has a distribution that follows a ''true'' parametric model with an interpretable parameter B. the form of this estimate is complex and if the model is incorrect it no longer is an appropriate estimate of E( X) / [Var( X)] 1/2. 9(Pl. Gross error models Most measurement and recording processes are subject to gross errors.5. 3. that is. The actual observation X is X· contaminated with "gross errOfs"see the following discussion. if gross errors occur.190 Measures of Performance Chapter 3 mean Jl and variance 0'2 other than the normal. However. E P*). X n ) where most of the Xi = X:. we turn to robustness. This idea has been developed by Bickel and Lehmann (l975a. economists often work with median housing prices. We consider three situations (a) The problem dictates the parameter..5. This is an issue easy to point to in practice but remarkably difficult to formalize appropriately..4) is for n large a more precise estimate than X/a if this model is correct.13. To be a bit formal. On the other hand.1. For instance.
the gross errors. in situation (c).h(x). with common density f(x .2) for some h such that h(x) = h(x) for all x}. X is the best estimate in a variety of senses.l. A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the Ci still i. Consider the onesample symmetric location model P defined by ~ ~ i=l""l n . identically distributed.5. we do not need the symmetry assumption.)~ 'P C) + .5. .2) Here h is the density of the gross errors and . However.d.Xn are i. X n ) knowing that B(Xi. Unfortunately. where f satisfies (3. for example. P.. . . Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of X· .Section 3. iff X" .A Y.) are ij. F. it is possible to have PC""f. and definitions of insensitivity to gross errors.remains identifiable. This corresponds to. if we drop the symmetry assumption. •.1).l. Y. make sense for fixed n. That is.1'). . .5. the assumption that h is itself symmetric about 0 seems patently untenable for gross errors. (3.\ is the probability of making a gross error. The advantage of this formulation is that jJ.J) for all such P. we encounter one of the basic difficulties in formulating robustness in situation (b). That is. but with common distribution function F and density f of the form f(x) ~ (1 . with probability ..2).l. = {f (. .18).f. Xi Xt with probability 1 .. We next define and examine the sensitivity curve in the context of the Gaussian location model. Without h symmetric the quantity jJ. two notions. X~) is a good estimate.d. .i. the sensitivity curve and the breakdown point. Informally B(X1 . Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6.j) E P.d.5. .i. Example 1. X n ) will continue to be a good or at least reasonable estimate if its value is not greatly affected by the Xi I Xt.. Formal definitions require model specification.5.1 ) where the errors are independent.j = PC""f. Further assumptions that are commonly made are that h has a particular fonn. In our new formulation it is the xt that obey (3. The breakdown point will be discussed in Volume II. so it is unclear what we are estimating. Again informally we shall call such procedures robust. We return to these issues in Chapter 6. has density h(y .) for 1'1 of 1'2 (Problem 3. p(". and symmetric about 0 with common density f and d.. Then the gross error model issemiparametric. specification of the gross error mechanism..5 Nondecision Theoretic Criteria ~ 191 wild values. If the error distribution is normal. and then more generally.. ISi'1 or 1'2 our goal? On the other hand. Now suppose we want to estimate B(P*) and use B(X 1 .is not a parameter.1. it is the center of symmetry of p(Po.1') and (Xt.5. However.1') : f satisfies (3. where Y. (3.2. h = ~(7 <p (:(7) where K ::» 1 or more generally that h is an unknown density symmetric about O.
. P. SC(x. (}(X 1 .5. We start by defining the sensitivity curve for general plugin estimates. See Section 2.. X"_l )J.5. X =n (Xl+ ..14. is appropriate for the symmetric location model. and it splits the sample into two equal halves.4).. See Problem 3. Now fix Xl!'" ... that is. Are there estimates that are less sensitive? A classical estimate of location based on the order statistics is the sample median X defined by ~ ~ X X(k+l) !(X(k) ifn~2k+1 + X(k+l)) ifn = 2k where X (I)' . that is. • . x) . .X. . where Xl. . . not its location.O(Xl' .2. 1') = 0. The empirical plugin estimate of 0 is 0 = O(P) where P is the empirical probability distribution.L. ..f) E P.. (}(PU. . in particular.16). X(n) are the order statistics.J.od Problem 2. How sensitive is it to the presence of gross errors among XI.··.1 for all p(/l. I . (2.. .. .. . (i) It is the empirical plugin estimate of the population median v (Problem 3. Often this is done by fixing Xl..L = E(X).32." . . X n ) . . shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas. .1.Xnl so that their mean has the ideal value zero.2. Then ~ ~ O.17). 1 Xnl represents an observed sample of size nl from P and X represents an observation that (potentially) comes from a distribution different from P. . See (2. . ~ ) SC( X. At this point we ask: Suppose that an estimate T(X 1 .I.1.Xn ? An interesting way of studying this due to Tukey (1972) and Hampel (1974) is the sensitivity curve defined as follows for plugin estimates (which are well defined for all sample sizes n)..L = (}(X l 1'..1. Xnl as an "ideal" sample of size n . We are interested in the shape of the sensitivity curve. Because the estimators we consider are location invariant. therefore. In our examples we shall. 0) = n[O(xl. Thus. . the sample mean is arbitrarily sensitive to gross errOfa large gross error can throw the mean off entirely. . . Suppose that X ~ P and that 0 ~ O(P) is a parameter..l.j)) = J.. we take I' ~ 0 without loss of generality. +X"_I+X) n = x. X n ) = B(F). X n ordered from smallest to largest.1 for which the estimator () gives us the right value of the parameter and then we see what the introduction of a potentially deviant nth observation X does to the value of ~ We return to the location problem with e equal to the mean J. has the plugin property. . X" 1').192 The sensitivity curve Measures of Performance Chapter 3 . where F is the empirical d.. Xl. This is equivalent to shifting the Be vertically to make its value at x = 0 equal to zero. The sensitivity curoe of () is defined as ~ ~ ~ ~ ~ ~ ~ ~ ~ ... . The sample median can be motivated as an estimate of location on various grounds. .od because E(X. . .
(ii) In the symmetric location model (3.5.1), $\nu$ coincides with $\mu$ and $\tilde x$ is an empirical plug-in estimate of $\mu$.

(iii) The sample median is the MLE when we assume the common density $f(x)$ of the errors $\{\epsilon_i\}$ in (3.5.1) is the Laplace (double exponential) density

$$f(x) = \frac{1}{2\tau}\exp\{-|x|/\tau\},$$

a density having substantially heavier tails than the normal. See Problems 2.2.32 and 3.5.9.

The sensitivity curve of the median is as follows. If, say, $n = 2k+1$ is odd and the median of $x_1,\dots,x_{n-1}$ is $\frac{1}{2}(x_{(k)} + x_{(k+1)}) = 0$, we obtain

$$SC(x,\widetilde x) = \begin{cases} n\,x_{(k)} & \text{for } x < x_{(k)} \\ n\,x & \text{for } x_{(k)} \le x \le x_{(k+1)} \\ n\,x_{(k+1)} & \text{for } x > x_{(k+1)} \end{cases}$$

where $x_{(1)} < \dots < x_{(n-1)}$ are the ordered $x_1,\dots,x_{n-1}$.

Figure 3.5.1. The sensitivity curves of the mean and median.

Although the median behaves well when gross errors are expected, its performance at the normal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of $\bar X$. The sensitivity curve in Figure 3.5.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when $x$ is near $\mu$.
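The "about 57% larger" figure can be checked by a small simulation. The sketch below is illustrative only (it assumes NumPy and is not part of the text); it compares the Monte Carlo variances of the sample mean and sample median for normal samples, whose ratio is approximately $\pi/2 \approx 1.57$ for large $n$.

```python
# A minimal sketch (assuming NumPy): at the normal model the sample median
# has variance roughly pi/2 times that of the sample mean.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 101, 50_000
samples = rng.standard_normal((reps, n))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

print("Var(median)/Var(mean):", var_median / var_mean)   # roughly pi/2
print("pi/2                 :", np.pi / 2)
```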
A class of estimates providing such intermediate behavior and including both the mean and the median has been known since the eighteenth century. The estimates can be justified on plug-in grounds (see Problem 3.5.5). Let $0 \le \alpha < \frac{1}{2}$. We define the $\alpha$ trimmed mean by

$$\bar X_\alpha = \frac{X_{([n\alpha]+1)} + \dots + X_{(n-[n\alpha])}}{n - 2[n\alpha]} \qquad (3.5.3)$$

where $[n\alpha]$ is the largest integer $\le n\alpha$ and $X_{(1)} < \dots < X_{(n)}$ are the ordered observations. That is, we throw out the "outer" $[n\alpha]$ observations on either side and take the average of the rest. Note that if $\alpha = 0$, $\bar X_\alpha = \bar X$, whereas as $\alpha \uparrow \frac{1}{2}$, $\bar X_\alpha \to \widetilde X$. If $[n\alpha] = [(n-1)\alpha]$ and the trimmed mean of $x_1,\dots,x_{n-1}$ is zero, the sensitivity curve of an $\alpha$ trimmed mean is sketched in Figure 3.5.2. (The middle portion is the line $y = x(1 - 2[n\alpha]/n)^{-1}$.)

Figure 3.5.2. The sensitivity curve of the trimmed mean.

Intuitively we expect that if there are no gross errors, that is, $f = \varphi$, the mean is better than any trimmed mean with $\alpha > 0$ including the median. This can be verified in terms of asymptotic variances (MSEs)—see Problem 5.4.1. However, the sensitivity curve calculation points to an equally intuitive conclusion. If $f$ is symmetric about 0 but has "heavier tails" (see Problem 3.5.8) than the Gaussian density, for example, the Laplace density, $f(x) = \frac{1}{2}e^{-|x|}$, or even more strikingly the Cauchy, $f(x) = 1/\pi(1+x^2)$, then the trimmed means for $\alpha > 0$ and even the median can be much better than the mean, infinitely better in the case of the Cauchy—see Problem 5.4.1. For more sophisticated arguments see Huber (1981).

Which $\alpha$ should we choose in the trimmed mean? There seems to be no simple answer. The range $0.10 \le \alpha \le 0.20$ seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the normal distribution. See Andrews, Bickel, Hampel, Huber, Rogers, and Tukey (1972). There has also been some research into procedures for which $\alpha$ is chosen using the observations. For a discussion of these and other forms of "adaptation," see Jaeckel (1971), Huber (1972), and Hogg (1974).
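The sensitivity curves sketched in Figures 3.5.1 and 3.5.2 can also be computed directly from the definition $SC(x,\hat\theta) = n[\hat\theta(x_1,\dots,x_{n-1},x) - \hat\theta(x_1,\dots,x_{n-1})]$. The following sketch is not from the text; it assumes NumPy, and the helper names are hypothetical.

```python
# A minimal sketch (assuming NumPy) of the sensitivity curves of the mean,
# median and alpha trimmed mean: SC(x) = n*[estimate(x_1..x_{n-1}, x) - estimate(x_1..x_{n-1})].
import numpy as np

def sens_curve(estimate, ideal, xs):
    n = len(ideal) + 1
    base = estimate(ideal)
    return np.array([n * (estimate(np.append(ideal, x)) - base) for x in xs])

def trimmed_mean(v, alpha=0.10):
    v = np.sort(v)
    k = int(len(v) * alpha)              # [n*alpha] observations trimmed per side
    return v[k:len(v) - k].mean()

rng = np.random.default_rng(3)
ideal = rng.standard_normal(19)
ideal -= ideal.mean()                    # "ideal" sample of size n-1 with mean 0
xs = np.linspace(-10, 10, 201)

sc_mean = sens_curve(np.mean, ideal, xs)       # unbounded: the line y = x
sc_median = sens_curve(np.median, ideal, xs)   # bounded outside the middle order statistics
sc_trim = sens_curve(trimmed_mean, ideal, xs)  # linear near 0, bounded for |x| large
```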
Gross errors or outlying data points affect estimates in a variety of situations. We next consider two estimates of the spread in the population as well as estimates of quantiles; other examples will be given in the problems.

Example 3.5.1. Spread. Let $\theta(P) = \mathrm{Var}(X) = \sigma^2$ denote the variance in a population and let $X_1,\dots,X_n$ denote a sample from that population. If we are interested in the spread of the values in a population, then the variance $\sigma^2$ or the standard deviation $\sigma$ is typically used, and $\hat\sigma_n^2 = n^{-1}\sum_{i=1}^n(X_i - \bar X)^2$ is the empirical plug-in estimate of $\sigma^2$. It is clear that $\hat\sigma_n^2$ is very sensitive to large outlying $|x|$ values. To simplify our expression we shift the horizontal axis so that $\sum_{i=1}^{n-1}x_i = 0$, and write $\bar x_n = n^{-1}\sum_{i=1}^n x_i = n^{-1}x$. Then

$$SC(x,\hat\sigma^2) = \frac{n-1}{n}\,x^2 - \hat\sigma^2_{n-1} \approx x^2 - \hat\sigma^2_{n-1}, \qquad (3.5.4)$$

where the approximation is valid for $x$ fixed, $n \to \infty$ (Problem 3.5.10). $\Box$
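Example 3.5.1 says that a single gross error of size $x$ moves the plug-in variance by roughly $x^2/n$, so $SC(x,\hat\sigma^2)$ grows quadratically. A quick illustrative check (assuming NumPy; not part of the text):

```python
# A minimal sketch (assuming NumPy): SC(x, sigma2hat) grows like x**2.
import numpy as np

rng = np.random.default_rng(4)
clean = rng.standard_normal(99)
clean -= clean.mean()                        # ideal sample of size n-1, mean zero

def sigma2_hat(v):
    return ((v - v.mean()) ** 2).mean()      # empirical plug-in estimate of the variance

for x in (0.0, 3.0, 10.0, 30.0):
    sc = 100 * (sigma2_hat(np.append(clean, x)) - sigma2_hat(clean))
    print(f"x = {x:5.1f}   SC(x, sigma2hat) = {sc:9.2f}   x**2 = {x**2:7.1f}")
```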
Example 3.5.2. Quantiles and the IQR. Let $\theta(P) = x_\alpha$ denote an $\alpha$th quantile of the distribution of $X$, $0 < \alpha < 1$. That is, $x_\alpha$ is any value such that $P(X \le x_\alpha) \ge \alpha$ and $P(X \ge x_\alpha) \ge 1 - \alpha$, so that $x_\alpha$ has $100\alpha$ percent of the values in the population on its left. $x_{.5} = \nu$ is the median, and $x_{.75}$ and $x_{.25}$ are called the upper and lower quartiles. Let $\hat x_\alpha$, $0 < \alpha < 1$, denote the $\alpha$th sample quantile (see 2.1.16). If $n\alpha$ is an integer, say $k$, the $\alpha$th sample quantile is $\hat x_\alpha = \frac{1}{2}[x_{(k)} + x_{(k+1)}]$, and at sample size $n-1$, $\hat x_\alpha = x_{(k)}$, where $x_{(1)} < \dots < x_{(n-1)}$ are the ordered $x_1,\dots,x_{n-1}$. Then, for $2 \le k \le n-1$,

$$SC(x,\hat x_\alpha) = \begin{cases} \frac{n}{2}[x_{(k-1)} - x_{(k)}] & x < x_{(k-1)} \\ \frac{n}{2}[x - x_{(k)}] & x_{(k-1)} \le x \le x_{(k+1)} \\ \frac{n}{2}[x_{(k+1)} - x_{(k)}] & x > x_{(k+1)} \end{cases} \qquad (3.5.5)$$

Clearly, $\hat x_\alpha$ is not sensitive to outlying $x$'s.

If we are interested in the spread of the values in a population, a fairly common quick and simple alternative to the variance is the IQR (interquartile range) defined as $\tau = x_{.75} - x_{.25}$. The IQR is often calibrated so that it equals $\sigma$ in the $N(\mu,\sigma^2)$ model. Because $\tau = 2 \times (.674)\sigma$, the scale measure used is $\hat\sigma = 0.742(\hat x_{.75} - \hat x_{.25})$. Next consider the sample IQR $\hat\tau = \hat x_{.75} - \hat x_{.25}$. Then we can write $SC(x,\hat\tau) = SC(x,\hat x_{.75}) - SC(x,\hat x_{.25})$, and the sample IQR is robust with respect to outlying gross errors $x$. $\Box$

Remark 3.5.1. Let $\hat F_{n-1}$ denote the empirical distribution based on $x_1,\dots,x_{n-1}$ and write $SC(x,\hat\theta) = SC(x,\hat\theta,\hat F_{n-1})$. The sensitivity of the parameter $\theta(F)$ to $x$ can be measured by the influence function, which is defined by

$$IF(x;\theta,F) = \lim_{\epsilon \downarrow 0}\frac{\theta((1-\epsilon)F + \epsilon\Delta_x) - \theta(F)}{\epsilon} = \frac{d}{d\epsilon}\,\theta((1-\epsilon)F + \epsilon\Delta_x)\Big|_{\epsilon=0},$$

where $\Delta_x$ is the distribution function of point mass at $x$ ($\Delta_x(t) = 1[t \ge x]$). It is easy to see (Problem 3.5.15) that $SC(x,\hat\theta,\hat F_{n-1}) \approx IF(x;\theta,F)$ for large $n$. The influence function plays an important role in functional expansions of estimates. We will return to the influence function in Volume II.

Discussion. Other aspects of robustness, in particular the breakdown point, have been studied extensively and a number of procedures proposed and implemented. Unfortunately these procedures tend to be extremely demanding computationally, although this difficulty appears to be being overcome lately. An exposition of this point of view and some of the earlier procedures proposed is in Hampel, Ronchetti, Rousseeuw, and Stahel (1986).

Summary. We discuss briefly nondecision theoretic considerations for selecting procedures, including interpretability and computability. Most of the section focuses on robustness, discussing the difficult issues of identifiability. The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean, median, trimmed mean, and other procedures.
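As a final illustration of the calibration $\hat\sigma = 0.742(\hat x_{.75} - \hat x_{.25})$ in Example 3.5.2, the sketch below (assuming NumPy; not part of the text, and the function name is hypothetical) shows that this IQR-based scale estimate is essentially unchanged by one wild observation, while the sample standard deviation is not.

```python
# A minimal sketch (assuming NumPy): 0.742 * IQR estimates sigma for normal data
# and, unlike the sample standard deviation, is barely moved by a gross error.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=500)   # sigma = 2
x_bad = np.append(x, 1000.0)                    # one wild value

def iqr_scale(v):
    q75, q25 = np.percentile(v, [75, 25])
    return 0.742 * (q75 - q25)

print("sd, clean        :", x.std(ddof=1))
print("sd, contaminated :", x_bad.std(ddof=1))        # blown up by the outlier
print("iqr scale, clean        :", iqr_scale(x))
print("iqr scale, contaminated :", iqr_scale(x_bad))  # essentially unchanged
```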
E = R. 0) the Bayes rule is where fo(x.2. we found that (S + 1)/(n + 2) is the Bayes rule. (c) In Example 3. under what condition on S does the Bayes rule exist and what is the Bayes rule? 5.1.2 with unifonn prior on the probabilility of success O. respectively. . a(O) > 0. 71') = R(7f) /r( 71'. 3.4. That is. change the loss function to 1(0. Consider the relative risk e( J.3). Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior.4. J). ~ ~ 4. Find the Bayes risk r(7f. is preferred to 0. Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin.r 7f(0)p(x I O)dO.2.Xn be the indicators of n Bernoulli trials with success probability O.2.a)' jBQ(1. where R( 71') is the Bayes risk. Show that (h) Let 1(0. (X I 0 = 0) ~ p(x I 0). 0) = (0 .6 PROBLEMS AND COMPLEMENTS Problems for Section 3. Compute the limit of e( J. where OB is the Bayes estimate of O. Let X I. 0) is the quadratic loss (0 .2.0)' and that the prinf7r(O) is the beta. Check whether q(OB) = E(q(IJ) I x). Suppose 1(0. 71') as ..6 Problems and Complements 197 3.O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite. In some studies (see Section 6.. then the improper Bayes rule for squared error loss is 6"* (x) = X. which is called the odds ratio (for success).O)P. = (0 o)'lw(O) for some weight function w(O) > 0.2 preceeding. (a) Show that the joint density of X and 0 is f(x. Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule.0). Show that if Xl. s). .2 o e 1.Section 3.. if 71' and 1 are changed to 0(0)71'(0) and 1(0. the Bayes rule does not change. . 1 X n is a N(B. give the MLE of the Bernoulli variance q(0) = 0(1 . density.3. lf we put a (improper) uniform prior on A.0) and give the Bayes estimate of q(O). In the Bernoulli Problem 3. Hint: See Problem 1. J) ofJ(x) = X in Example 3.0 E e. (72) sample and 71" is the improper prior 71"(0) = 1. 6.0)/0(0).O) = p(x I 0)71'(0) = c(x)7f(0 .. the parameter A = 01(1 . 2.24. Suppose IJ ~ 71'(0). x) where c(x) =. (3(r. In Problem 3.
. find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x). 7. . find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule. There are two possible actions: 0'5 a a with losses I( B.. 76) distribution... .. (. Find the Bayes decision rule. where CI. B). I j l . .a)' / nj~. 1). N{O.O'5). . B . Hint: If 0 ~ D(a).E). Suppose tbat N lo . Bioequivalence trials are used to test whether a generic drug is.. (Bj aj)2.2. to a close approximation. where ao = L:J=l Qj. i: agency specifies a number f > a such that if f) E (E.3. (a) Problem 1. (c) Problem 1. . Let q( 0) = L:. (e) a 2 + 00.. 8. E) and positive when f) 1998) is 1.exp {  2~' B' } . a r )T. 0) and I( B.. Note that ).. compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O. Assume that given f). Set a{::} Bioequivalent 1 {::} Not Bioequivalent .15. (c) We want to estimate the vector (B" . and that 0 has the Dirichlet distribution D( a). then the generic and brandname drugs are. On the basis of X = (X I.3.. and Cov{8j . bioequivalent. a = (a" .9j ) = aiO:'j/a5(aa + 1). a) = Lj~. a) = (q(B) .Xn ) we want to decide whether or not 0 E (e. do not derive them. (a) If I(B.\(B) = r .I(d). where is known. Let 0 = ftc . given 0 = B are multinomial M(n.19(c).) with loss function I( B. a) = [q(B)a]'. .(0) should be negative when 0 E (E.3.2(d)(i) and (ii).• N. (Use these results. c2 > 0 I.ft B be the difference in mean effect of the generic and namebrand drugs.. . Measures of Performance Chapter 3 (b) n . " X n of differences in the effect of generic and namebrand effects fora certain drug. One such function (Lindley. . defined in Problem 1. 00. and that 9 is random with aN(rlO. Suppose we have a sample Xl. • 9. by definition. where E(X) = O. Xl. I ~' . (b) Problem 1.) (b) When the loss function is I(B. .Xu are Li.  I . B = (B" .. Cr are given constants. . Var(Oj) = aj(ao .)T. E).=l CjUj .aj)/a5(ao + 1).\(B) = I(B. I) = difference in loss of acceptance and rejection of bioequivalence. (E.. ...B.d. B.O)  I(B.198 (a) T + 00. equivalent to a namebrand drug. A regulatory " ! I i· . then E(O)) = aj/ao. E). For the following problems.
(a) Show that a necessary condition for (xo. In Example 3.\(±€.) + ~}" . Yo) = inf g(xo. v) > ?r/(l ?r) is equivalent to T > t. (Xv. Hint: See Example 3. Yo) = sup g(x..2.) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l((). (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00"). . + = 0 and it 00"). find (a) the linear Bayes estimate of ~l. respectively.Yp).Yo) Xi {}g {}g Yj = 0.Yo) = {} (Xo.. Discuss the preceding decision rule for this "prior..2. .6 Problems and Complements 199 where 0 < r < 1.. For the model defined by (3.17). 0. Yo) to be a saddle point is that.1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3. Yo) is in the interior of S x T. RP. 2. (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3. Suppose 9 . Con 10. A point (xo. S x T ~ R. . y).1) < (T6(n) + c'){log(rg(~)+..Xm).6.. (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3.y = (YI.1.3 1. (b) the linear Bayes estimate of fl.. 1) are not constant. S T Suppose S and T are subsets of Rm." (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b). yo) is a saddle point of 9 if g(xo. {} (Xo.2 show that L(x.16) and (3. Any two functions with difference "\(8) are possible loss functions at a = 0 and I.3. representing x = (Xl. . 0) and l(().6.Section 3. ..2. Note that . and 9 is twice differentiable.
d.Xn ) .. B1 ) >  (l. Hint: See Problem 3.0).1 < i < m.j=O." 1 Xi ~ l}.0•• ). 5. o')IR(O.2. BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1. I(Bi. Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo. Bd.o•• ) =R(B1 . and show that this limit ( . d) = (a) Show that if I' is known to be 0 (!.= 1 .=1 CijXiYj with XE 8 m .thesimplex. 4.1(0. 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = . I • > 1 for 0 i ~...2. . Yo . • " = '!. B ) ~ p(X. n: I>l is minimax.l. • . y ESp. I j 3.d< p. Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g.j)=Wij>O.a)'. Let X" .. and that the mndel is regnlar.. iij. 2 J'(X1 . L:. B1 ) Ip(X.n12.b < m. N(I'.<7') and 1(<7'.n12). Suppose e = {Bo.n)/(n + . .. I}. Suppose i I(Bi. Yc Yd foraHI < i.andg(x.(X) = 1 if Lx (B o. Let S ~ 8(n. 1 <j. (b) Suppose Sm = (x: Xi > 0.c.200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo. f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta._ I)'. the conclusion of von Neumann's theorem holds.. and 1 . . (b) Show that limn_ooIR(O. Thus. fJ( .)w"..y) ~ L~l 12. Let Lx (Bo. f.Yo) > 0 8 8 8 X a 8 Xb 0.i. .:.n).i) =0. and o'(S) = (S + 1 2 . X n be i. i. (b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is.. o(S) = X = Sin. B ) and suppose that Lx (B o. &* is minimax. . a) ~ (0 . A = {O. the test rule 1511" given by o.. prior.a.. 0») equals 1 when B = ~.PIB = Bo].
. . respectively...Pi)' Pjqj <j < k. ... 8. (c) Use (B.. < im..Pi.X)' are inadmissible. prior.. Xl· (e) Show that if I' is unknown. . . .1... then is minimax for the loss function r: l(p. f(k. . . d = (d). 1). J. . ik). .. . (jl.d) where qj ~ 1 . d) = L(d. 9.. b = ttl..I'kf. See Volume II. . .\.I). / . o'(X) is also minimax and R(I'. Rk) where Rj is the rank of Xi. 1 = t j=l (di . .. hence. 1 <i< k. I' (X"". dk)T. Show that if = (I'I''''. jl~ < . .tj. that is.tn I(i.j.?.12..). .1)1 L(Xi . .k..l .2. Show that X has a Poisson (.a? 1.Section 3_6 Problems and Complements 201 (b) If I' ~ 0. N k ) has a multinomial.). Show that if (N). o(X 1. Xk)T. .o) for alII'.j..~~n X < Pi < < R(I'.29). Jlk. 6.. .. 0 1.. < /12 is a known set of values. > jm). X k be independent with means f.X)' is best among all rules of the form oc(X) = c L(Xi . Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l .. an d R a < R b. Let k .\.. .X)' and. . . ik is an arbitrary unknown permutation of 1. and iI. Write X  • "i)' 1(1'. a) = (.. Remark: Stein (1956) has shown that if k > 3. o(X) ~ n~l L(Xi . Hint: (a) Consider a gamma prior on 0 = 1/<7'. Permutations of I. See Problem 1.. .. ftk) = (Il?l'"'' J1. b. . ... distribution.k} I«i).3. . where (Ill. Let Xi beindependentN(I'i. j 7. = tj. ..a < b" = 'lb. .. ik) is smaller than that of (i'l"'" i~). ttl . Let Xl.. M (n. . . . LetA = {(iI.\ . .) .. . . . h were t·.\) distribution and 1(. that both the MLE and the estimate 8' = (n .j. Hint: Consider the gamma. " a. 1 < j < k. X is no longer unique minimax. show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible. X k ) = (R l . For instance.» = Show that the minimax rule is to take L l. i=l then o(X) = X is minimax.. PI.. 0') = (1.. Then X is nurnmax.P. 00.
a) is the posterior mean E(6 I X). n) be i. q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q.B)....2. . I: I ir'z ..I)!.Xo)T has a multinomial.. B(n... . of Xi < X • 1 + 1 o.15. .d. ... .. For a given x we want to estimate the proportion F(x) of the population to the left of x.....202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI. B). .. . 1 ... ..a) and let the prior be the uniform prior 1r(B" .2..4... 14.. Suppose that given 6 ~ B. . .1'»2 of these estimates is bounded for all nand 1'. Define OiflXI v'n < d < v'n d v'n d d X .i f X> . X has a binomial. 1). Pk. B j > 0.. J K(po.. . distribution... See also Problem 3. 10.BO)T. Let X" . See Problem 3.i. + l)j(n + k).=1 Show that the Bayes estimate is (Xi ..q)1r(B)dB. ! p(x) = J po(x)1r(B)dB. Hint: Consider the risk function of o~ No. Let Xi(i = 1. 13. Let K(po. M(n.3.1r) = Show that the marginal density of X. X n be independent N(I'. Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss. Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B.. ~ I I . X = (X"". (b) How does the risk of these estimates compare to that of X? 12.d with density defined in Problem 1. LB= j 01 1. distribution... Suppose that given 6 = B = (B . BO l) ~ (k . with unknown distribution F.. d X+ifX 11. Let the loss " function be the KullbackLeibler divergence lp(B. v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) .._.8. I: • .
.O)!.. Prove Proposition 3. Reparametrize the model by setting ry = h(O) and let q(x... Equivariance.X is called the mutual information between 0 and X. . if I(O. Let A = 3. Problems for Section 3. Fisher infonnation is not equivariant under increasing transformations of the parameter. Show that X is an UMVU estimate of O. 0) with 0 E e c R. ry) = p(x.Section 3. .2) with I' . . Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable. Show that (a) (b) cr5 = n.6 Problems and Complements 203 minimizes k( q.4. . then E(g(X) > g(E(X». K) . p(X) IO.x =.alai) < al(O.'? /1 as in (3. K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality. S. 0) and B(n. Let X .1 . Hint: k(q. X n be the indicators of n Bernoulli trials with success probability B..~). N(I".. . the Fisher information lower bound is equivariant. Jeffrey's "Prior. .aao + (1 . Jeffrey's priors are proportional to 1. 2. a) is convex and J'(X) = E( J(X) I I(X»..4. 0.4. . (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations. N(I"o. Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij.f [E.12) for the two parametrizations p(x. suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8).). Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient.1 E: 1 (Xi  J. a" (J. for any ao. 15. Show that in theN(O. Suppose X I. 0) cases. 4. p( x.4 1.k(p. R. and O~ (1 .tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible." A density proportional to VJp(O) is called Jeffrey's prior. then R(O. We shall say a loss function is convex. It is often improper. 7f) and that the minimum is I(J. then That is. a < a < 1. a. . X n are Ll. (b) Equivariance of the Fisher Information Bound. Give the Bayes rules for squared error in these three cases.d.1"0 known. 0) and q(x. Show that if 1(0. J') < R( (J. . ao) + (1 . Show that B q(ry) = Bp(h. ry). h1(ry)) denote the model in the new parametrization. (ry»).a )I( 0. . Let X I.. J). {log PB(X)}] K(O)d(J. respectively. that is.
.1.8. {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J.3....\ = a + bB is UMVU for estimating .Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate. Suppose (J is UMVU for estimating fJ. .p) is an unbiased estimate of (72 in the linear regression model of Section 2. 0) = Oc"I(x > 0). Show that assumption I implies that if A . . Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji. . ~ ~ ~ (a) Write the model for Yl give the sufficient statistic.13 E R.o. . Hint: Use the integral approximation to sums. . P. Does X achieve the infonnation .\ = a + bOo 11.ZDf3)/(n . For instance. Let F denote the class of densities with mean 0.1 and variance 0. 7. 0) for any set E. find the bias o f ao' 6. X n ) is a sample drawn without replacement from an unknown finite population {Xl. and . Suppose Yl .. distribution. . Ct. =f.4. Show that if (Xl. Show that . n. . . Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0. In Example 3. Show that a density that minimizes the Fisher infonnation over F is f(x. ( 2). a ~ i 12. Find lim n.. compute Var(O) using each ofthe three methods indicated.11(9) as n ~ give the limit of n times the lower bound on the variances of and (J. B(O.4.ZDf3)T(y .. a ~ (c) Suppose that Zi = log[i/(n + 1)1. i = 1.2 (0 > 0) that satisfy the conditions of the information inequality. Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered. Let a and b be constants. 9. X n be a sample from the beta. then = 1 forallO. Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1).4.{x : p(x. Show that 8 = (Y . 8. . Let X" . 204 Hint: See Problem 3. .5(b). . .P. 14.. .Ii . 2 ~ ~ 10. Establish the claims of Example 3. • 13. Y n in twoparameter canonical exponential form and (b) Let 0 = (a. then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· . Hint: Consider T(X) = X in Theorem 3. ..XN }..2. (a) Find the MLE of 1/0. .' . 00. I I . 1)..4.
Show that X k given by (3. Suppose UI. then E( M) ~ L 1r j=1 N J ~ n. (c) Explain how it is possible if Po is binomial.. . ..~).) Suppose the Uj can be relabeled into strata {xkd.3.l. (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl.4.6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U. P[o(X) = 91 = 1. _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk. Define ~ {Xkl... B(n. K. 8).. XK. (b) Deduce that if p. :/i.6 Problems and Complements 205 (b) The variance of X is given by (3. if and only if. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~. UN are as in Example 3. that ~ is a Bayes estimate for O. distribution. = 1. : 0 E for (J such that E((J2) < 00. B(n. Suppose X is distributed accordihg to {p. Show that is not unbiasedly estimable.4). 18. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population. . = N(O. 1 < i < h.} and X =K  1".4.511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ .1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n. G 19.• . 'L~ 1 h = N. k = 1. 20. Let 7fk = ~ and suppose 7fk = 1 < k < K. Let X have a binomial.X)/Var(U). More generally only polynomials of degree n in p are unbiasedly estimable.. X is not II Bayes estimate for any prior 1r. Stratified Sampling..4.~ for all k. . Show that if M is the expected sample size. 7.4... .". 16. . .p). 17. k=l K ~1rkXk. ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss.1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o.Xkl. even for sampling without replacement in each stratum.Section 3. (See also Problem 1. 15. .
give and plot the sensitivity curve of the median. 1 2. that is. Regularity Conditions are Needed for the Information Inequality..4. 4. 1 X n1 C.75' and the IQR. B) is differentiable fat anB > x. 22. If a = 0. for all X!. Show that the sample median X is an empirical plugin estimate of the population median v. Show that.. T . Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3.25. give and plot the sensitivity curves of the lower quartile X. J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6. . use (3. B») ~ °and the information bound is infinite. (iii) 2X is unbiased for . is an empirical plugin estimate of ~ Here case. Yet show eand has finite variance. for all adx 1. 5.4.9)' 21. and we can thus define moments of 8/8B log p(x. An estimate J(X) is said to be shift or translation equivariant if.4. the upper quartile X. Var(a T 6) > aT (¢(9)II(9).5) to plot the sensitivity curve of the 1. B).. Note that logp(x. If n IQR. however. with probability I for each B. Let X ~ U(O.206 Hint: Given E(ii(X) 19) ~ 9.5. Hint: It is equivalent to show that. . Show that the a trimmed mean XCII.l)a is an integer. Prove Theorem 3. B) be the uniform distribution on (0./(9)a [¢ T (9)a]T II (9)[¢ T (9)a]. If a = 0. = 2k is even.25 and (n ."" F (ij) Vat (:0 logp(X. B).3.25 and net =k 3. E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) . • Problems for Section 35 I is an integer.
Its properties are similar to those of the trimmed mean.30. . J is an unbiased estimate of 11.6 Problems and Complements 207 It is antisymmetric if for all Xl. . .3.Section 3... .. xH L is translation equivariant and antisymmetric. median. F(x . trimmed mean with a = 1/4.. For x > .. and xiflxl < k kifx > k kifx<k. is symmetrically distributed about O.5 and for (j is.(a) Show that X. The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00. X. and the HodgesLehmann estimate. .). (b) Show that   .) ~ 8. X a arc translation equivariant and antisymmetric. .03. (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1.JL) where JL is unknown and Xi . Xu. (b) Suppose Xl.6. In . 7. i < j. The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj). (7= moo 1 IX.1. .1.03 (these are expected values of four N(O. k t 0 to the median. I)order statistics). It has the advantage that there is no trimming proportion Q that needs to be subjectively specified. X n is a sample from a population with dJ. '& is an estimate of scale. plot the sensitivity curves of the mean.5. Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite.. <i< n Show that (a) k = 00 corresponds to X. One reasonable choice for k is k = 1. X are unbiased estimates of the center of symmetry of a symmetric distribution. then (i.e.  XI/0..30. .67. Deduce that X.. (See Problem 3.
This problem may be done on the computer. For the ideal sample of Problem 3.) are two densities with medians v zero and identical scale parameters 7.x}2.5. = O.5." .25. ..6). P(IXI > 3) and P(IXI > 4) for the nonnal. X 12.d. (d) Xk is translation equivariant and antisymmetric (see Problem 3. Suppose L:~i Xi ~ . Use a fixed known 0"0 in place of Ci. we say that g(. with density fo((. The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \.. In the case of the Cauchy density.~ !~ t . In what follows adjust f and 9 to have v = 0 and 'T = 1. The (student) tratio is defined as 1= v'n(x I'o)/s.11. •• . Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00.. Xk) is a finite constant.Xn arei. thenXk is the MLEofB when Xl.75 .X. where S2 = (n . and is fixed. . (b) Find thetml probabilities P(IXI > 2).) has heavier tails than f() if g(x) is above f(x) for Ixllarge.7(a). [. (e) If k < 00. . then limlxj>00 SC(x. Let 1'0 = 0 and choose the ideal sample Xl. Location Parameters. If f(·) and g(.• 13. plot the sensitivity curve of (a) iTn . 00. thus. iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00. Let JJo be a hypothesized mean for a certain population. 11. . and Cauchy distributions. 9. Show that SC(x. with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O. (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10. n is fixed. 1 Xnl to have sample mean zero. (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr.. the standard deviation does not exist.i.. ii.O)/ao) where fo(x) for Ixl for Ixl <k > k. and "" "'" (b) the Iratio of Problem 3.208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0. we will use the IQR scale parameter T = X.5. .1)1 L:~ 1(Xi . 00. Laplace. . Let X be a random variable with continuous distribution function F.
(e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval. a. F(t) ~ P(X < t) > P(Y < t) antisymmctric. An estimate ()n is said to be shift and scale equivariant if for all xl.a +bxn ) ~ a + b8n (XI. 8. Let Y denote a random variable with continuous distribution function G.8. .  vl/O.. . (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :. 8. compare SC(x.:.(F) = !(xa + XI"). 8) = IF~ (x.) That is. Fn _ I ). the ~ = bdSC(x.1: (a) Show that SC(x.5) are location parameters. i/(F)] is the value of some location parameter. ~ ~ 8n (a +bxI. Also note that H(x) is symmetric about zero. then ()(F) = c. i/(F)] and. median v. ([v(F).t )) = 0. any point in [v(F). SC(a + bx. and sample trimmed mean are shift and scale equivariant.. (jx. If 0 is scale and shift equivariant. (a) Show that if F is symmetric about c and () is a location parameter. F) lim n _ oo SC(x. Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F).(F): 0 <" < 1/2}. v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v. . then for a E R.5.xn ).F). ~ se is shift invariant and scale equivariant. ~ (b) Write the Be as SC(x. n (b) In the following cases. 8.. let H(x) be the distribution function whose inverse is H'(a) = ![xax...].. .. ~ = d> 0. x n _. b > 0. In this case we write X " Y. and ordcr preserving. it is called a location parameter. a + bx. b > 0. . 0 < " < 1.._. .) 14.xn . sample median. Show that if () is shift and location equivariant.). then v(F) and i/( F) are location parameters. Hint: For the second part. 8) and ~ IF(x._.. i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F).8. .(k) is a (d) For 0 < " < 1. Show that J1... In Remark 3.. X is said to be stochastically smaller than Y if = G(t) for all t E R. ~ ~ 15.O. xnd. ~ (a) Show that the sample mean.6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :. c E R. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y. · . let v" = v. if F is also strictly increasing.. and note thatH(xi/(F» < F(x) < H(xv(F». .5.t. (b) Show that the mean Ji.Xn_l) to show its dependence on Xnl (Xl.67 and tPk is defined in Problem 3.5. where T is the median of the distributioo of IX location parameter. and trimmed population mean JiOl (see Problem 3. c + dO.Section 3.
Show that in the bisection method.Bilarge enough. 3. The NewtonRaphson method in this case is .4. ~ I(x  = X n . 18. Because priority of discovery is now given to the French mathematician M. show that (a) If h is a density that is symmetric about zero.0 > 0 (depending on t/J) such that if lOCO) .BI < C1(1(i. Frechet. where B is the class of Borel sets. (b) ~ ~ .field generated by SA.5.1/.' > 0. • • . B E B. F)! ~ 0 in the cases (i). ~ ~ 17.OJ then l(I(i) . (ii) O(F) = (iii) e(F) "7. • . also true of the method of coordinate ascent. (ii). (b) If no assumptions are made about h.7 NOTES Note for Section 3. A. · .1) Hint: (a) Try t/J(x) = Alogx with A> 1. ) (a) Show by example that for suitable t/J and 1(1(0) . . I I Notes for Section 3. IlFfdF(x).rdF(l:). in order to be certain that the Jth iterate (j J is within e of the desired () such that 'IjJ( fJ) = 0.8) . we shall • I' . (1) The result of Theorem 3. . is not identifiable. In the gross error model (3. and (iii) preceding? ~ 16.1 is commonly known as the Cramer~Rao inequality.. I : . . Let d = 1 and suppose that 'I/J is twice continuously differentiable. . C < 00.4 I .t is identifiable.. We define S as the . 81" < 0. consequently.2). . then fJ. This is.210 (i) Measures of Performance Chapter 3 O(F) ~ 1'1' ~ J .I F(x. then J. (b) Show that there exists. Assume that F is strictly increasing. (e) Does n j [SC(x. 0.3 (1) A technical problem is to give the class S of subsets of:F for which we can assign probability (the measurable sets).B ~ {F E :F : Pp(A) E B}. and we seek the unique solution (j of 'IjJ(8) = O. we in general must take on the order of log ~ steps. {BU)} do not converge.
follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement.
(2) Note that this inequality is true but uninteresting if I(θ) = ∞ (and ψ'(θ) is finite) or if Var_θ(T(X)) = ∞.
(3) The continuity of the first integral ensures that
(∂/∂θ) ∫∫ T(x) p(x, θ) dx dλ = ∫∫ T(x) (∂/∂θ) p(x, θ) dx dλ
for all θ, whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in the second integral.
(4) The finiteness of Var_θ(T(X)) and I(θ) imply that ψ'(θ) is finite by the covariance interpretation given in (3.4.8).

3.8 REFERENCES

ANDREWS, D. F., P. J. BICKEL, F. R. HAMPEL, P. J. HUBER, W. H. ROGERS, AND J. W. TUKEY, Robust Estimates of Location: Survey and Advances. Princeton, NJ: Princeton University Press, 1972.

APOSTOL, T. M., Mathematical Analysis, 2nd ed. Reading, MA: Addison-Wesley, 1974.

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis. New York: Springer, 1985.

BERNARDO, J. M., AND A. F. M. SMITH, Bayesian Theory. New York: Wiley, 1994.

BICKEL, P. J., AND E. L. LEHMANN, "Unbiased Estimation in Convex Families," Ann. Math. Statist., 40, 1523-1535 (1969).

BICKEL, P. J., AND E. L. LEHMANN, "Descriptive Statistics for Nonparametric Models. I. Introduction," Ann. Statist., 3, 1038-1044 (1975a).

BICKEL, P. J., AND E. L. LEHMANN, "Descriptive Statistics for Nonparametric Models. II. Location," Ann. Statist., 3, 1045-1069 (1975b).

BICKEL, P. J., AND E. L. LEHMANN, "Descriptive Statistics for Nonparametric Models. III. Dispersion," Ann. Statist., 4, 1139-1158 (1976).

BUHLMANN, H., Mathematical Methods in Risk Theory. Heidelberg: Springer-Verlag, 1970.

DAHLQUIST, G., A. BJORK, AND N. ANDERSON, Numerical Analysis. New York: Prentice-Hall, 1974.

DE GROOT, M. H., Optimal Statistical Decisions. New York: McGraw-Hill, 1970.

DOKSUM, K. A., "Measures of Location and Asymmetry," Scand. J. Statist., 2, 11-22 (1975).

DOWE, D. L., R. A. BAXTER, J. J. OLIVER, AND C. S. WALLACE, "Point Estimation Using the Kullback-Leibler Loss Function and MML," in Proceedings of the Second Pacific Asian Conference on Knowledge Discovery and Data Mining. Melbourne: Springer-Verlag, 1998.
HAMPEL, F. R., "The Influence Curve and Its Role in Robust Estimation," J. Amer. Statist. Assoc., 69, 383-393 (1974).

HAMPEL, F. R., E. M. RONCHETTI, P. J. ROUSSEEUW, AND W. A. STAHEL, Robust Statistics: The Approach Based on Influence Functions. New York: J. Wiley & Sons, 1986.

HANSEN, M. H., AND B. YU, "Model Selection and the Principle of Minimum Description Length," J. Amer. Statist. Assoc. (2000).

HOGG, R. V., "Adaptive Robust Procedures," J. Amer. Statist. Assoc., 69, 909-927 (1974).

HUBER, P. J., "Robust Statistics: A Review," Ann. Math. Statist., 43, 1041-1067 (1972).

HUBER, P. J., Robust Statistics. New York: J. Wiley & Sons, 1981.

JAECKEL, L. A., "Robust Estimates of Location," Ann. Math. Statist., 42, 1020-1034 (1971).

JEFFREYS, H., Theory of Probability. London: Oxford University Press, 1948.

KARLIN, S., Mathematical Methods and Theory in Games, Programming, and Economics. Reading, MA: Addison-Wesley, 1959.

LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

LEHMANN, E. L., AND G. CASELLA, Theory of Point Estimation, 2nd ed. New York: Springer, 1998.

LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference. Cambridge: Cambridge University Press, 1965.

LINDLEY, D. V., "Decision Analysis and Bioequivalence Trials," Statistical Science, 13, 136-141 (1998).

NORBERG, R., "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification," Scand. Actuarial J., 204-222 (1986).

RISSANEN, J., "Stochastic Complexity (With Discussions)," J. Royal Statist. Soc. B, 49, 223-239 (1987).

SAVAGE, L. J., The Foundations of Statistics. New York: J. Wiley & Sons, 1954.

SHIBATA, R., "Bootstrap Estimate of Kullback-Leibler Information for Model Selection," Statistica Sinica, 7, 375-394 (1997).

STEIN, C., "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution," Proc. Third Berkeley Symposium on Math. Statist. and Probability, 1, 197-206, University of California Press (1956).

TUKEY, J. W., Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.

WALLACE, C. S., AND P. R. FREEMAN, "Estimation and Inference by Compact Coding (With Discussions)," J. Royal Statist. Soc. B, 49, 240-251 (1987).

WIJSMAN, R. A., "On the Attainment of the Cramer-Rao Lower Bound," Ann. Statist., 1, 538-542 (1973).
Chapter 4

TESTING AND CONFIDENCE REGIONS: BASIC THEORY

4.1 INTRODUCTION

In Sections 1.3, 3.2, and 3.3 we defined the testing problem abstractly, treating it as a decision theory problem in which we are to decide whether P belongs to P0 or P1 or, parametrically, whether θ belongs to Θ0 or Θ1 if P_j = {P_θ : θ in Θ_j}, where P0, P1 or Θ0, Θ1 are a partition of the model P or, respectively, the parameter space Θ, corresponding, respectively, to answering "no" or "yes" to the questions posed below.

This framework is natural if, as is often the case, we are trying to get a yes or no answer to important questions in science, medicine, public policy, and indeed most human activities, and we have data providing some evidence one way or the other. Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial, perform a survey, or more generally construct an experiment that yields data X in X, a subset of R^q, modeled by us as having distribution P_θ, θ in Θ. As we have seen in earlier examples, the questions are sometimes simple and the type of data to be gathered is under our control. Usually, the situation is less simple. The design of the experiment may not be under our control, what is an appropriate stochastic model for the data may be questionable, and what Θ0 and Θ1 correspond to in terms of the stochastic model may be unclear. Here are two examples that illustrate these issues.

Example 4.1.1. Sex Bias in Graduate Admissions at Berkeley. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. They initially tabulated Nm1, Nf1, the numbers of admitted male and female applicants, and the corresponding numbers Nm0, Nf0 of denied applicants. If n is the total number of applicants, it might be tempting to model (Nm1, Nf1, Nm0, Nf0) by a multinomial, M(n, pm1, pf1, pm0, pf0), distribution. But this model is suspect because in fact we are looking at the population of all applicants here, not a sample. Accepting this model provisionally, what does the
hypothesis of no sex bias correspond to? It is natural to translate it into

P[Admit | Male] = pm1/(pm1 + pm0) = P[Admit | Female] = pf1/(pf1 + pf0).

But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability

(pm1 + pf1)/(pm1 + pf1 + pm0 + pf0).

In fact, admissions are performed at the departmental level and rates of admission differ significantly from department to department. If departments "use different coins," then the data are naturally decomposed into N = (Nm1d, Nm0d, Nf1d, Nf0d, d = 1, ..., D), where Nm1d is the number of male admits to department d, and so on. Our multinomial assumption now becomes N ~ M(pm1d, pm0d, pf1d, pf0d, d = 1, ..., D). In these terms the hypothesis of "no bias" can now be translated into:

H: pm1d/(pm1d + pm0d) = pf1d/(pf1d + pf0d), d = 1, ..., D.

This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate. In fact, the same data can lead to opposite conclusions regarding these hypotheses, a phenomenon called Simpson's paradox, as is discussed in a paper by Bickel, Hammel, and O'Connell (1975). The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis.

Example 4.1.2. Mendel's Peas. In one of his famous experiments laying the foundation of the quantitative theory of genetics, Mendel crossed peas heterozygous for a trait with two alleles, one of which was dominant. The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). That is, if there were n dominant offspring (seeds), the natural model, if the inheritance ratio can be arbitrary, is to assume that N_AA, the number of homozygous dominants, has a binomial (n, p) distribution. The hypothesis of dominant inheritance corresponds to H : p = 1/3 with the alternative K : p is not 1/3. It was noted by Fisher, as reported in Jeffreys (1961), that in this experiment the observed fraction m/n was much closer to 1/3 than might be expected under the hypothesis that N_AA has a binomial, B(n, 1/3), distribution. That is, if m denotes the observed count,

P[ |N_AA/n - 1/3| <= |m/n - 1/3| ] is approximately 7 x 10^{-5}.

Fisher conjectured that rather than believing that such a very extraordinary event occurred, it is more likely that the numbers were made to "agree with theory" by an overzealous assistant. That is, either N_AA cannot really be thought of as stochastic or any stochastic
1.3.!. B) distribution. If the theory is false. See Remark 4. and we are then led to the natural 0 . we call 8 0 and H simple. and then base our decision on the observed sample X = (X J.11. say. see.3 recover from the disease with the old drug. Moreover. (2) If we let () be the probability that a patient to whom the new drug is administered recovers and the population of (present and future) patients is thought of as infinite. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. and accept H otherwise. 11 and K is composite. a) = 0 if BE 8 a and 1 otherwise. in the e . where 00 is the probability of recovery usiog the old drug. is point mass at . That a treatment has no effect is easier to specify than what its effect is. It is convenient to distinguish between two structural possibilities for 8 0 and 8 1 : If 8 0 consists of only one point. (1 . K : () > Bo Ifwe allow for the possibility that the new drug is less effective than the old. it's not clear what P should be as in the preceding Mendel example. The set of distributions corresponding to one answer. administer the new drug. What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug.(1) As we have stated earlier.1 loss l(B. (}o] and eo is composite. then S has a B(n. we reject H if S exceeds or equals some integer. Suppose that we know from past experience that a fixed proportion Bo = 0. The same conventions apply to 8] and K. = Example 4.1. suppose we observe S = EXi . If we suppose the new drug is at least as effective as the old. I} or critical region C {x: Ii(x) = I}.. n 3' 0 What the second of these examples suggests is often the case. In this example with 80 = {()o} it is reasonable to reject IJ if S is "much" larger than what would be expected by chance if H is true and the value of B is eo. then 8 0 = [0 .1. for instance. . p). In science generally a theory typically closely specifies the type of distribution P of the data X as. Now 8 0 = {Oo} and H is simple. 8 1 is the interval ((}o. recall that a decision procedure in the case of a test is described by a test function Ii: x ~ {D. say k. In situations such as this one .3. When 8 0 contains more than one point.. That is. To investigate this question we would have to perform a random experiment. we shall simplify notation and write H : () = eo. the set of points for which we reject. then = [00 .Xn ). These considerations lead to the asymmetric formulation that saying P E Po (e E 8 0 ) corresponds to acceptance of the hypothesis H : P E Po and P E PI corresponds to rejection sometimes written as K : P E PJ . Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug. We illustrate these ideas in the following example. where Xi is 1 if the ith patient recovers and 0 otherwise. the number of recoveries among the n randomly selected patients who have been administered the new drug. . Most simply we would sample n patients. Thus. for instance. It will turn out that in most cases the solution to testing problem~ with 80 simple also solves the composite 8 0 problem.€)51 + cB( nlP). where 1 ~ E is the probability that the assistant fudged the data and 6!. say 8 0 . 8 0 and H are called compOSite. our discussion of constant treatment effect in Example 1. Thus. 0 E 8 0 . 
acceptance and rejection can be thought of as actions a = 0 or 1.1 Introduction 215 model needs to pennit distributions other than B( n.Section 4. P = Po. is better defined than the alternative answer 8 1 .
(5 > k) < k). is much better defined than its complement and/or the distribution of statistics T under eo is easy to compute.. as we shall see in Sections 4. We do not find this persuasive. ~ : I ! j . 1 i . our critical region C is {X : S rule is Ok(X) = I{S > k} with > k} and the test function or PI ~ probability of type I error = Pe. if H is false. We call T a test statistic. One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. testing techniques are used in searching for regions of the genome that resemble other regions that are known to have significant biological activity. it again reason~bly leads to a Neyman Pearson fonnulation..3. 0 PH = probability of type II error ~ P.2.e.. if H is true. and large. ! : .2. For instance.1 and4.3 this asymmetry appears reasonable. when H is false.T would then be a test statistic in our sense. Note that a test statistic generates a family of possible tests as c varies. matches at one position are independent of matches at other positions) and the probability of a match is ~.. The value c that completes our specification is referred to as the critical value of the test.216 Testing and Confidence Regions Chapter 4 tenninology of Section 1.1. .1. but if this view is accepted. By convention this is chosen to be the type I error and that in tum detennines what we call H and what we call K..1 . We will discuss the fundamental issue of how to choose T in Sections 4. then the probability of exceeding the threshold (type I) error is smaller than Q". how reasonable is this point of view? In the medical setting of Example 4. The Neyman Pearson Framework The Neyman Pearson approach rests On the idea that. asymmetry is often also imposed because one of eo. e 1 . generally in science.that has in fact occurred. . It has also been argued that.1. and later chapters.2 and 4. • . 0 > 00 .3. of the two errors. Thresholds (critical values) are set so that if the matches occur at random (i. . As we noted in Examples 4.) We select a number c and our test is to calculate T(x) and then reject H ifT(x) > c and accept H otherwise. In most problems it turns out that the tests that arise naturally have the kind of structure we have just described. . There is a statistic T that "tends" to be small. announcing that a new phenomenon has been observed when in fact nothing has happened (the socalled null hypothesis) is more serious than missing something new. No one really believes that H is true and possible types of alternatives are vaguely known at best. : I I j ! 1 I . suggesting what test statistics it is best to use.. (Other authors consider test statistics T that tend to be small. one can be thought of as more important. To detennine significant values of these statistics a (more complicated) version of the following is done. Given this position. We now tum to the prevalent point of view on how to choose c. 1. In that case rejecting the hypothesis at level a is interpreted as a measure of the weight of evidence we attach to the falsity of H. 4.3. (5 The constant k that determines the critical region is called the critical value. The Neyman Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then. . but computation under H is easy.
we find from binomial tables the level 0.05 critical value 6, and the test has size α(6) = P_{0.3}(S >= 6) = 0.0473.

There is an important class of situations in which the Neyman-Pearson framework is inappropriate, even though there are just two actions. It is too limited in any situation in which, even nominally, we can attach numbers to the two losses that are not equal and/or depend on θ, such as the quality control example of Section 1.1. In that case, in the Bayesian framework with a prior distribution on the parameter, the approach of Example 3.2.2(b) is the one to take in all cases with Θ0 and Θ1 simple (see the problems of Section 3.2).

Here are the elements of the Neyman-Pearson story. Begin by specifying a small number α > 0 such that probabilities of type I error greater than α are undesirable. Then restrict attention to tests that in fact have the probability of rejection less than or equal to α for all θ in Θ0. Such tests are said to have level (of significance) α, and we speak of rejecting H at level α. The values α = 0.01 and 0.05 are commonly used in practice. Because a test of level α is also of level α' > α, it is convenient to give a name to the smallest level of significance of a test. This quantity is called the size of the test and is the maximum probability of type I error. That is, if we have a test statistic T and use critical value c, our test has size α(c) given by

α(c) = sup{ P_θ[T(X) >= c] : θ in Θ0 }.    (4.1.1)

Now α(c) is nonincreasing in c and typically α(c) increases to 1 as c decreases to -∞ and α(c) decreases to 0 as c increases to ∞. Thus, if 0 < α < 1, there exists a unique smallest c for which α(c) <= α. This is the critical value we shall use, if our test statistic is T and we want level α. It is referred to as the level α critical value.

Once the level or critical value is fixed, the probabilities of type II error as θ ranges over Θ1 are determined; by convention it is 1 - P[type II error] that is usually considered.

Definition 4.1.1. The power of a test against the alternative θ is the probability of rejecting H when θ is true.

That is, the power is 1 minus the probability of type II error. It can be thought of as the probability that the test will "detect" that the alternative θ holds. The power is a function of θ on Θ1. Both the power and the probability of type I error are contained in the power function, which is defined for all θ in Θ by

β(θ) = β(θ, δ) = P_θ[Rejection] = P_θ[δ(X) = 1] = P_θ[T(X) >= c].

If θ is in Θ0, β(θ, δ) is just the probability of type I error, whereas if θ is in Θ1, β(θ, δ) is the power against θ.

In Example 4.1.3 with δ_k(X) = 1{S >= k},

P_I = probability of type I error = P_{θ0}(S >= k),
P_II = probability of type II error = P_θ(S < k), θ > θ0,

and the power function is

β(θ, δ_k) = P_θ(S >= k) = sum from j = k to n of (n choose j) θ^j (1 - θ)^{n-j}.

A plot of this function for n = 10, θ0 = 0.3, k = 6 is given in Figure 4.1.1.
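The critical value, size, and power values quoted for this example are easy to check numerically. The following is a minimal sketch, not part of the original text; it assumes the SciPy library is available and uses only standard scipy.stats calls.

```python
# Size and power of the one-sided binomial test of Example 4.1.3:
# reject H: theta = 0.3 in favor of K: theta > 0.3 when S >= k, S ~ B(10, theta).
from scipy.stats import binom

n, theta0, alpha = 10, 0.3, 0.05

# Smallest k with P_{theta0}(S >= k) <= alpha; binom.sf(k - 1, n, p) = P(S >= k).
k = next(k for k in range(n + 2) if binom.sf(k - 1, n, theta0) <= alpha)
size = binom.sf(k - 1, n, theta0)
print(k, size)                                  # 6 and 0.0473...

# Power function beta(theta) = P_theta(S >= k) at a few alternatives.
for theta in (0.3, 0.4, 0.5, 0.6):
    print(theta, binom.sf(k - 1, n, theta))     # at theta = 0.5 the power is about 0.377
```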
. a 67% improvement.. From Figure 4.5.1 it appears that the power function is increasing (a proof will be given in Section 4.5.3770. .. and H is the hypothesis that the drug has no effect or is detrimental.0 0 ]. 0) family of distributions. .3 versus K : B> 0.3)..8 06 04 0.1. The power is plotted as a function of 0. then the drug effect is measured by p.~==/++_++1'+ 0 o 0. for each of a group of n randomly selected patients.8 0._~:. k ~ 6 and the size is 0.3 to (it > 0. _l X n are nonnally distributed with mean p. If we assume XI.. I I I I I 1 . Power function of the level 0. This problem arises when we want to compare two treatments or a treatment and control (nothing) and both treatments are administered to the same subject.[T(X) > k].:. .05 test will detect an improvement of the recovery fate from 0.6 0. :1 • • Note that in this example the power at () = B1 > 0. j 1 .7 0. and variance (T2. . We return to this question in Section 4. For instance. suppose we want to see if a drug induces sleep.05 onesided test c5k of H : () = 0. whereas K is the alternative that it has some positive effect. . I j a(k) = sup{Pe[T(X) > k]: 0 E eo} = Pe.0473.3 is the probability that the level 0.3.1.t < 0 versus K : J. ! • i.3 04 05 0.3. OneSided Tests for the Mean ofa Normal Distribution with Known Variance. What is needed to improve on this situation is a larger sample size n. Suppose that X = (X I. That is. We might.0 Figure 4.9 1. I i j . Xn ) is a sample from N (/'.t > O. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. record sleeping time without the drug (or after the administration of a placebo) and then after some time administer the drug and record sleeping time again.2 o~31:.1.:"::':::'.) We want to test H : J.2 0.. (The (T2 unknown case is treated in Section 4.. 0 Remark 4.2 is known. When (}1 is 0. Example 4. Let Xi be the difference between the time slept after administration of the drug and time slept without administration of the drug by the ith patient.0 0. this probability is only .2) popnlation with .218 Testing and Confidence Regions Chapter 4 1.1. It follows that the level and size of the test are unchanged if instead of 80 = {Oo} we used eo = [0.1.1 0.4.3 for the B(lO.
.1.. and we have available a good estimate (j of (j.( c) = C\' or e = z(a) where z(a) = z(1 . It is convenient to replace X by the test statistic T(X) = . . p ~ P[AA]. Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward. 1) distribution. The task of finding a critical value is greatly simplified if £.eo) = IN~A . In Example 4. T(X(lI).3. The smallest c for whieh ~(c) < C\' is obtained by setting q:.co .3. That is..p) because ~(z) (4.1.5).5.o versus p. £0. This minimum distance principle is essentially what underlies Examples 4. it is natural to reject H for large values of X. (12) observations with both parameters unknown (the t tests of Example 4. has a closed form and is tabled.y) : YES}.~ (c .2.   = . ~ estimates(jandd(~. eo).~(z).2) = 1.(jo) = (~ (jo)+ where y+ = Y l(y > 0).a) is the (1.. In all of these cases. In Example 4.a) is an integer (Problem 4.nX/ (J. = P. Given a test statistic T(X) we need to determine critical values and eventually the power of the resulting tests. in any case.(T(X)) doesn't depend on 0 for 0 E 8 0 . then the test that rejects iff T(X) > T«B+l)(la)).I') ~ ~ (c + V.1 and Example 4.a) quantile oftheN(O.3. Because (J(jL) a(e) ~ is increasing.1.o if we have H(p.1. where d is the Euclidean (or some equivalent) distance and d(x. where T(l) < .1 Introduction 219 Because X tends to be larger under K than under H. But it occurs also in more interesting situations such as testing p.1. has level a if £0 is continuous and (B + 1)(1 . . which generates the same family of critical regions. the common distribution of T(X) under (j E 8 0 . #..1. S) inf{d(x..9). N~A is the MLE of p and d(N~A. < T(B+l) are the ordered T(X). if we generate i.Section 4.1.• T(X(B)).i. Rejecting for large values of this statistic is equivalent to rejecting for large values of X. T(X(B)) from £0.V. T(X(l)).P. (Tn) for (j E 8 0 is usually invariance The key feature of situations in which under the action of a group of transformations_ See Lehmann (1997) and Volume II for discussions of this property. .d. However.. sup{(J(p) : p < OJ ~ (J(O) = ~(c). The power function of the test with critical value c is p " [Vii (X !") > e _ Vii!"] (J (J 1..co =.2 and 4. critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods. it is clear that a reasonable test statistic is d((j. o The Heuristics of Test Construction When hypotheses are expressed in terms of an estimable parameter H : (j E eo c RP.~I. This occurs if 8 0 is simple as in Example 4.1.
Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward.

Example 4.1.5. Goodness of Fit Tests. Let X1, ..., Xn be i.i.d. as X ~ F, where F is continuous. Consider the problem of testing H : F = F0 versus K : F not equal to F0. Let F̂ denote the empirical distribution and consider the sup distance between the hypothesis F0 and the plug-in estimate of F, the empirical distribution function F̂; that is,

D_n = sup over x of |F̂(x) - F0(x)|,

which is called the Kolmogorov statistic. It can be shown (Problem 4.1.7) that D_n can be written as

D_n = max over 1 <= i <= n of max{ i/n - F0(x_(i)), F0(x_(i)) - (i - 1)/n },

where x_(1) < ... < x_(n) is the ordered observed sample, that is, the order statistics. This statistic has the following distribution-free property:

Proposition 4.1.1. The distribution of D_n under H is the same for all continuous F0. In particular, P_{F0}(D_n <= d) = P_U(D_n <= d), where U denotes the U(0, 1) distribution.

Proof. Set U_i = F0(X_i), 1 <= i <= n. Then U_1, ..., U_n are i.i.d. U(0, 1). Also

F̂(x) = n^{-1} sum over i of 1{X_i <= x} = n^{-1} sum over i of 1{F0(X_i) <= F0(x)} = Û(F0(x)),

where Û denotes the empirical distribution function of U_1, ..., U_n. As x ranges over R, u = F0(x) ranges over (0, 1); thus,

D_n = sup over 0 < u < 1 of |Û(u) - u|,

and the result follows.

The distribution of D_n has been thoroughly studied for finite and large n. In particular, for n >= 80, close approximations to the size α critical values k_α are h_n(1.628), h_n(1.358), and h_n(1.224) for α = .01, .05, and .10, respectively, where h_n(t) = t/(√n + 0.12 + 0.11/√n). Note that the hypothesis here is simple, so that for any one of these hypotheses F = F0 the distribution can be simulated (or exhibited in closed form). What is remarkable is that it is independent of which F0 we consider.

Example 4.1.6. Goodness of Fit to the Gaussian Family. Suppose X1, ..., Xn are i.i.d. F and the hypothesis is H : F = Φ((· - μ)/σ) for some μ, σ², which is evidently composite. We can proceed as in Example 4.1.5, rewriting the hypothesis as H : F(μ + σx) = Φ(x) for all x, where μ = E_F(X1) and σ² = Var_F(X1).
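Before continuing with Example 4.1.6, here is a short sketch, not part of the original text, of how the Kolmogorov statistic of Example 4.1.5 and a simulated critical value might be computed. It assumes NumPy and SciPy are available; because of the distribution-free property just established, the simulation can always be done with U(0, 1) samples, whatever the hypothesized F0.

```python
# Kolmogorov statistic D_n = max_i max{ i/n - F0(x_(i)), F0(x_(i)) - (i-1)/n }
# and a size-alpha critical value simulated under H.
import numpy as np
from scipy.stats import norm

def kolmogorov_D(x, F0):
    u = np.sort(F0(x))
    n = len(u)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

rng = np.random.default_rng(0)
n, B, alpha = 50, 2000, 0.05
null_Ds = np.sort([kolmogorov_D(rng.uniform(size=n), lambda u: u) for _ in range(B)])
crit = null_Ds[int(np.ceil((1 - alpha) * B)) - 1]   # approximate size-alpha critical value

x = rng.normal(size=n)                               # data to be tested against F0 = N(0, 1)
print(kolmogorov_D(x, norm.cdf), crit)               # reject H if the first number exceeds the second
```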
. 1.3. 1 < i < n. from N(O.1. Experimenter] may be satisfied to reject the hypothesis H using a test with size a = 0. H is not rejected. if the experimenter's critical value corresponds to a test of size less than the p~value. If we observe X = x = (Xl. 1).. the pvalue is <l'( T(x)) = <l' (. It is then possible that experimenter I rejects the hypothesis H. This quantity is a statistic that is defined as the smallest level of significance 0' at which an experimenter using T would reject on the basis ofthe observed outcome x.(3) 0 The pValue: The Test Statistic as Evidence o Different individuals faced with the same testing problem may have different criteria of size.d. . Now the Monte Carlo critical value is the I(B + 1)(1 . if X = x. this difficulty may be overcome by reporting the outcome of the experiment in tenus of the observed size or pvalue or significance probability of the test. observations Zi.(iix u > z(a) or upon applying <l' to both sides if. Tn has the same distribution £0 under H. Tn sup IF'(X x x + iix) .01. . ~n) doesn't depend on fl.<l'(x)1 sup IG(x) . We do this B times independently.d. whereas experimenter II insists on using 0' = 0. T nB . That is. and only if.) Thus..<l'(x)1 where G is the empirical distribution of (L~l"'" Lln ) with Lli (Xi . N(O. where Z I. thereby obtaining Tn1 . But.2.. · · . Zn arc i. Tn!.. under H.05.i. .. the joint distribution of (Dq . . for instance. a satisfies T(x) = .Section 4. then computing the Tn corresponding to those Zi.X) .' . H is rejected... a > <l'( T(x)).Z) / (~ 2::7 ! (Zi  if) • . . Example 4. (Sec Section 8.4. If the two experimenters can agree on a common test statistic T. whatever be 11 and (12. otherwise. (4. . where X and 0'2 are the MLEs of J1 and we obtain the statistic F(X + ax) Applying the sup distance again.1 Introduction 221 (12. .. I). and the critical value may be obtained by simulating ij.4) Considered as a statistic the pvalue is <l'( y"nX /u). whereas experimenter II accepts H on the basis of the same outcome x of an experiment. Consider. {12 and is that of ~  (Zi ..X)/ii. T nB . .a) + IJth order statistic among Tn. and only if. Therefore. . we would reject H if. In). I < i < n.
. then if H is simple Fisher (1958) proposed using l j :i.222 Testing and Confidence Regions Cnapter 4 In general. a(Tr ). But the size of a test with critical value c is just a(c) and a(c) is decreasing in c.1.5) .1. so that type II error considerations are unclear.6). Proposition 4. distribution (Problem 4. 1985). H is rejected if T > Xla where Xl_a is the 1 .1. Thus.. Thus. to quote Fisher (1958).2.. This is in agreement with (4. . a(8) = p.1.. We will show that we can express the pvalue simply in tenns of the function a(·) defined in (4. •• 1. Similarly in Example 4. if r experimenters use continuous test statistics T 1 .g. we would reject H if. n(1 . ''The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p. that is. • i~! <:1 im _ . and only if. 1). and the pvalue is a( s) where s is the observed value of X. Thus. the smallest a for which we would reject corresponds to the largest c for which we would reject and is just a(T(x)). I T r to produce p .Oo)} > 5.ath quantile of the X~n distribution. !: The pvalue is used extensively in situations of the type we described earlier. aCT) has a uniform.1.3. More generally.). • T = ~2 I: loga(Tj) j=l ~ ~ r (4. (4. Various melhods of combining the data from different experiments in this way are discussed by van Zwet and Osterhoff (1967).1. For example. Then if we use critical value c.4).1. .<I> ( [nOo(1 ~ 0 )1' 2 s~l~nOo) 0 . the largest critical value c for which we would reject is c = T(x). The pvalue is a(T(X)).. .. let X be a q dimensional random vector. The statistic T has a chisquare distribution with 2n degrees of freedom (Problem 4. ~ H fJ. The pvalue can be thought of as a standardized version of our original statistic. T(x) > c. The normal approximation is used for the pvalue also.1. when H is well defined..(S > s) '" 1. . these kinds of issues are currently being discussed under the rubric of datafusion and metaanalysis (e.1). U(O.• see f [' Hedges and Olkin. Suppose that we observe X = x. It is possible to use pvalues to combine the evidence relating to a given hypothesis H provided by several different independent experiments producing different kinds of data. In this context. for miu{ nOo. Thus. We have proved the following. a(T) is on the unit interval and when H is simple and T has a continuous distribution. ' I values a(T. I.6) I • to test H.5).. 80). but K is not.
that is.Section 4. test statistics. In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by L(x. Such a test and the corresponding test statistic are called most poweiful (MP). The statistic L is reasonable for testing H versuS K with large values of L favoring K over H. test functions. OIl > 0. Typically a test statistic is not given but must be chosen on the basis of its perfonnance.00)ln~5 [0./00)5[(1. the highest possible power. In the NeymanPearson framework. OIl = p (x.2 Choosing a Test Statistic The NeymanPearson lemma 223 The preceding paragraph gives an example in which the hypothesis specifies a distribution completely. 0.)1 5 [(1. In Sections 3.0. by convention. Summary.2. 00 .2 and 3. In this section we will consider the problem of finding the level a test that ha<.) which is large when S true.Oo)]n. subject to this restriction. We introduce the basic concepts and terminology of testing statistical hypotheses and give the NeymanPearson framework.3). a given test statistic T. we try to maximize the probability (power) of rejecting H when K is true. where eo and e 1 are disjoint subsets of the parameter space 6. type II error. under H. power function. significance level. we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter () belongs to 80 or e 1. (null) hypothesis H and alternative (hypothesis) K. equals 0 when both numerator and denominator vanish. This is an instance of testing goodness offit. that is. 00. we specify a small number 0: and conStruct tests that have at most probability (significance level) 0: of rejecting H (deciding K) when H is true. L(x. power.01 )/(1. (0. 0) 0 p(x. in the binomial example (4.2 CHOOSING A TEST STATISTIC: THE NEYMANPEARSON LEMMA We have seen how a hypothesistesting problem is defined and how perfonnance of a given test b. then. The statistic L takes on the value 00 when p(x. and S tends to be large when K : () = 01 > 00 is . size. For instance.Otl/(1 . (I . critical regions. We start with the problem of testing a simple hypothesis H : () = ()o versus a simple alternative K : 0 = 01. 0) is the density or frequency function of the random vector X. We introduce the basic concepts of simple and composite hypotheses. 00 ) = 0. and. I) = EXi is large.00)/00 (1 . and pvalue. OIl where p(x. is measured in the NeymanPearson theory. type I error. (4. p(x. In particular.1. we test whether the distribution of X is different from a specified Fo. or equivalently. 4. 1) distribution. o:(Td has a U(O.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk.
B.'P(X)] > O. Note that a > 0 implies k < 00 and.'P(X)] [:~~:::l. likelihood ratio tests are unbeatable no matter what the size a is.Bo. (b) For each 0 < a < 1 there exists an MP size 0' likelihood ratio test provided that randomization is permitted. (c) If <p is an MP level 0: test.2.'P(X)[L(X. then it must be a level a likelihood ratio test. Bo.'P(X)] . (NeymanPearson Lemma).. B .P(S > 5)I!P(S = 5) = .p is a level a test. i = 0.kJ}. (a) Let E i denote E8. 0 < <p(x) < 1. BIl o . If L(x. .2) that 'Pk is a Bayes rule with k ~ 7l' /(1 .1].1. k] .0262 if S = 5.224 Testing and Confidence Regions Chapter 4 We call iflk a likelihood ratio or NeymanPearsoll (NP) test ifunction) if for some 0 k < OJ we can write the test function Yk as 'Pk () = X < 1 o if L( x. EO'PdX) We want to show E. 1.1).7l').kEo['Pk(X) . Bo) = O.Bo. where 1= EO{CPk(X)[L(X. Proof. thns. o Because L(x. Such randomized tests are not used in practice. Because we want results valid for all possible test sizes Cl' in [0. there exists k such that (4. 'P(x) = 1 if S > 5. Finally.2) forB = BoandB = B. B.) For instance.1. Theorem 4. If 0 < 'P(x) < 1 for the observation vector x. we choose 'P(x) = 0 if S < 5. that is.'P(X)] = Eo['Pk{X) . which are tests that may take values in (0.2.3). then <{Jk is MP in the class oflevel a tests.) .1) if equality occurs.2. (a) If a > 0 and I{Jk is a size a likelihood ratio test. and because 0 < 'P(x) < 1.2.. then ~ . B . then I > O. Bo) = O} = I + II (say).05 .) . lOY some x. where 7l' denotes the prior probability of {Bo}. BI )  . 'Pk is MP for level E'c'PdX). 'Pk(X) = 1 if p(x.('Pk(X) .3 with n = 10 and Bo = 0. Eo'P(X) < a.. (See also Section 1. Note (Section 3.k] + E'['Pk(X) .3) 1 : > O. i' ~ E.2. we consider randomized tests '1'.['Pk(X) .'P(X)] 1{P(X. (4.'P(X)] a. and suppose r.3. They are only used to show that with randomization. To this end consider (4.Bd >k <k with 'Pdx) any value in (0.05 in Example 4.3. the interpretation is that we toss a coin with probability of heads cp(x) and reject H iff the coin shows heads. using (4. and 'P(x) = [0. we have shown that I j I i E.4) 1 j 7 . We show that in addition to being Bayes optimal.'P(X)] > kEo['PdX) .k is < 0 or > 0 according as 'Pk(X) is 0 or 1. if want size a ~ . It follows that II > O.[CPk(X) .
2.4) we need tohave'P(x) ~ 'Pk(X) = I when L(x. See Problem 4. 0. 0. 0..fit [ logL(X.1T)p(x.) + 1Tp(X. Part (a) of the lemma can.2. Then the posterior probability of (). 60 .1. is 1T(O. The same argument works for x E {x : p(x.5) If 0.0.Po[L(X. Example 4. 0.Oo.10).pk is MP size a.O.) = k j.k] on the set {x : L(x. 00 . I x) = (1 1T)p(X. If a = 1. where v is a known signal. or 00 according as 1T(01 Ix) is larger than or smaller than 1/2. Let 1T denote the prior probability of 00 so that (I 1T) is the prior probability of (it.) > k] Po[L(X.fit. define a. X n ) is a sample of n N(j. Here is an example illustrating calculation of the most powerful level a test <{Jk. denotes the Bayes procedure of Exarnple 3.v)=exp 2LX'2 ..) (1 .2) holds forO = 0.Oo. OJ! = ooJ ~ 0. k = 00 makes 'Pk MP size a. (c) Let x E {x: p(x. { (7 i=l 2(7 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. .) < k.2. 0 It follows from the NeymanPearson lemma that an MP test has power at least as large as its level. 00 ) > and 0 = 00 .2 Choosing a Test Statistic: The NeymanPearson lemma " 225 =:cc (b) If a ~ 0. 0.v) + nv ] 2(72 x ..) (1 . 0. = v. tben. then 'Pk is MP size a. 'Pk(X) = 1 and !.) > then to have equality in (4. 00 . Next consider 0 < Q: < 1.O. . OJ! > k] < a and Po[T.2.1.u 2 ) random variables with (72 known and we test H : 11 = a versus K : 11 = v.3.7.) . then there exists k < 00 such that Po[L(X.) + 1T' (4. 00 . for 0 < a < 1. 0. 00 .2. then E9. 00 . 0. we conc1udefrom (4.) = k] = 0. Therefore. 00 .1T)L(x.OI)' Proof.5) thatthis 0. Let Pi denote Po" i = 0. = 'Pk· Moreover.L... also be easily argued from this Bayes property of 'Pk (Problem 4.2 where X = (XI.I. 0.Section 4. Remark 4. 00 ) (11T)L(x. 2 (7 T(X) ~. OJ. Because Po[L(X. 'P(X) > a with equality iffp(" 60 ) = P(·. 'Pk(X) =a . 00 . It follows that (4.2. If not. OJ! > k] > If Po[L(X. 00 . that is.(X.2(b).) > k and have 'P(x) = 'Pk(X) ~ 0 when L(x.. OJ Corollary 4. decides 0.O. Now 'Pk is MP size a. 0.2./f 'P is an MP level a test. We found nv2} V n L(X.2.1. when 1T = k/(k + 1). k = 0 makes E. Consider Example 3.2.2.
But T is the test statistic we proposed in Example 4.6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > O. Particularly important is the case Eo = E 1 when "Q large" is equivalent to "F = (ttl . large..0..1. The following important example illustrates.2.I.2." The function F is known as the Fisher discriminant function. O)T and Eo = I. E j ). for the two popnlations are known.. E j ). If this is not the case. It is used in a classification context in which 9 0 . . The power of this test is. Eo are known. 0 An interesting feature of the preceding example is that the test defined by (4. • • .226 Testing and Confidence Regions Chapter 4 is also optimal for this problem. Suppose X ~ N(Pj../iil (J)). and only if. 6.9).' Rejecting H for L large is equivalent to rejecting for I 1 .1. We return to this in Volume II. if Eo #.2. .2. . > 0 and E l = Eo. By the NeymanPearsoo lemma this is the largest power available with a level Ct test. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Flmction. say. . T>z(la) (4. Example 4. then this test rule is to reject H if Xl is large. (JI correspond to two known populations and we desire to classify a new observation X as belonging to one or the other.90 or . itl' However if. among other things. 0.2. Note that in general the test statistic L depends intrinsically on ito.6) has probability of type I error 0:. The likelihood ratio test for H : (J = 6 0 versus K : (J = (h is based on . From our discussion there we know that for any specified Ct.. In this example we bave assnmed that (Jo and (J. This is the smallest possible n for any size Ct test. then. itl = ito + )"6. however..8). then we solve <1>( z(a) + (v. that the UMP test phenomenon is largely a feature of onedimensional parameter problems.2.4. a UMP (for all ). if ito. If ~o = (1.95). they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population CQvanances. l. the test that rejects if. ).Jto)E01X large. <I>(z(a) + (v. if we want the probability of detecting a signal v to be at least a preassigned value j3 (say. .7) where c ~ z(1 . 0.2. Thus. We will discuss the phenomenon further in the next section. by (4./ii/(J)) = 13 for n and find that we need to take n = ((J Iv j2[z(la) + z(I3)]'.2).a)[~6Eo' ~oJ! (Problem 4. I I . Such a test is called uniformly most powerful (UMP). (Jj = (Pj.) test exists and is given by: Reject if (4. this is no longer the case (Problem 4. j = 0.
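Returning for a moment to the signal-detection setting of Example 4.2.1, the sample size formula n = (σ/v)²[z(1 - α) + z(β)]² derived there is immediate to evaluate. The following is a minimal sketch, not part of the original text; it assumes SciPy is available, and the signal-to-noise ratio used in the call is hypothetical.

```python
# Smallest n for which the level-alpha test of H: mu = 0 vs K: mu = v
# (sigma known) has power at least beta at the signal v:
# n = (sigma / v)^2 * (z(1 - alpha) + z(beta))^2, rounded up.
import math
from scipy.stats import norm

def sample_size(v, sigma, alpha=0.05, beta=0.90):
    z = norm.ppf
    return math.ceil((sigma / v) ** 2 * (z(1 - alpha) + z(beta)) ** 2)

print(sample_size(v=0.5, sigma=1.0))   # hypothetical signal-to-noise ratio v/sigma = 0.5
```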
Ok)' The simple hypothesis would correspond to the theory that the expected proportion of offspring of types 1. . N k ) has a multinomial M(n.. if a match in a genetic breeding experiment can result in k types. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : (j = ()o versus the simple alternative K : 0 = ()1. Before we give the example. We note the connection of the MP test to the Bayes procedure of Section 3.. . which states that the size 0:' SLR test is uniquely most powerful (MP) in the class of level a tests. .1. . 0" . Suppose that (N1 . 4. . Example 4..1) test !. then (N" . r lUI nt···· nk· ".u k nn.)N' i=l to Here is an interesting special case: Suppose OjO integer I with 1 < I < k > 0 for all}.2) where .. .Section 4.nk. For instance... Ok = OkO. . (}l. _ (nl. there is sometimes a simple alternative theory K : (}l = (}ll.1.3.. and N i is the number of offspring of type i. Nk) . is established.OkO' Usually the alternative to H is composite. However.. 1 (}k) distribution with frequency function. . A level a test 'P' is wtiformly most powerful (UMP) for H : 0 E versus K : 0 E 8 1 if eo (3(0.···. where nl..'P) for all 0 E 8" for any other level 0:' (4.2 that UMP tests for onedimensional parameter problems exist."" (}k = Oklo In this case..3. .p...3.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 227 Summary.M( n. .. . nk are integers summing to n. here is the general definition of UMP: Definition 4. With such data we often want to test a simple hypothesis H : (}I = Ow. 0) P n! nn. 0 < € < 1 and for some fixed (4. . Testing for a Multinomial Vector. . Two examples in which the MP test does not depend on 0 1 are given. 'P') > (3(0.2 for deciding between 00 and Ol. . ..3 UNIFORMLY MOST POWERFUL TESTS ANO MONOTONE LIKELIHOOD RATIO MODELS We saw in the two Gaussian examples of Section 4.. k are given by Ow. This phenomenon is not restricted to the Gaussian case as the next example illustrates. Such tests are said to be UMP (unifonnly most powerful). n offspring are observed. the likelihood ratio L 's L~ rr(:.3. The NeymanPearson lemma.
Because dt does not depend on ()l. e c R...3. we conclude that the MP lest rejects H. If 1J(O) is strictly increasing in () E e. 0) ~.. = h(x)exp{ry(O)T(x) ~ B(O)}. this test is UMP fortesting H versus K : 0 E 8 1 = {/:/ : /:/ ... Suppose {P. then L(x. B( n. X ifT(x) < t ° (4.2.o)n. p(x. Example 4. Bernoulli case. Note that because l can be any of the integers 1 . we have seen three models where. .(X) = '" > 0.2 (Example 4. J Definition 4. is UMP level".: 0 E e}.1. there is a statistic T such that the test with critical region {x : T(x) > c} is UMP. it is UMP at level a = EOodt(X) for testing H : e = eo versus K : B > Bo in fact. 0 .228 Testmg and Confidence Regions Chapter 4 That is. In this i. Example 4.1) MLR in s.3. the power function (3(/:/) = E.1. Consider the problem of testing H : 0 = 00 versus K: 0 ~ 01 with 00 < 81. Thus. . under the alternative.6.d.2) with 0 < f < I}..00). If {P. The family of models {P.OW and the model is by (4.. if and only if.. . 0 Typically the MP test of H : () = eo versus K : () = (}1 depends on (h and the test is not UMP. N 1 < c.o)n[O/(1 . for". : 0 E e} with e c R is said to be a monotone likelihood ratio (MLR) family if for (it < O the distributions POl and P0 2 are distinct and 2 the ratio p(x. Moreover. equals the likelihood ratio test 'Ph(t) and is MP. set s then = l:~ 1 Xi. Example 4. = (I .3. k. e c R. Ow) under H. However. .1) if T(x) = t. 00 . . Oil = h(T(x)) for some increasing function h. 0 form with T(x) . . = . 02)/p(X. in the case of a real parameter. for testing H : 8 < /:/0 versus K:O>/:/I' . Then L = pnN1£N1 = pTl(E/p)N1. (2) If E'o6. This is part of a general phenomena we now describe. we get radically different best tests depending on which Oi we assume to be (ho under H. .(x) any value in (0.O) = 0'(1 .(X) is increasing in 0. is an MLRfamily in T(x).. is an MLRfamily in T(x).3. (1) For each t E (0. is of the form (4. :.3.3 continned).3) with 6. type I is less frequent than under H and the conditional probabilities of the other types given that type I has not occurred are the same under K as they are under H.3. Critical values for level a are easily determined because Nl .3. = P( N 1 < c).nx/u and ry(!") Define the NeymanPearson (NP) test function 6 ( ) _ 1 ifT(x) > t . then this family is MLR. Because f. 6.2. < 1 implies that p > E. Theorem 4. . Oil is an increasing function ofT(x).nu)!".2. Consider the oneparameter exponential family mode! o p(x.i. : 8 E e}.1 is of this (. where u is known. . then 6.
Let S = l:~ I (X. distribution. Testing Precision. distribution. If a is a value taken on by the distribution of X.l of the measurements Xl. we now show that the test O· with reject H if.o. where h(a) is the ath quantile of the hypergeometric.o. then by (1). recall that we have seen that e5 t maximizes the power for testing II : (J = (Jo versus K : (J > (Jo among the class of tests with level <> ~ Eo. we test H : u > Uo l versus K : u < uo. To show (2). Because the class of tests with level 0' for H : (J < (Jo is contained in the class of tests with level 0' for H : (J = (Jo.1 by noting that for any (it < (J2' e5 t is MP at level Eo. X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives.(X).(bl1) x.bo > n. If the distributionfunction Fo ofT(X) under X"" POo is continuous and ift(1a) isasolution of Fo(t) = 1 . (8 < s(a)) = a.. 1 bo(bol) (blX+l)(Nbl) (box+1)(Nbo) (Nbln+x+l) (Nbon+x+l)' . Because the most serious error is to judge the precision adequate when it is not. Then. _1')2.Xn is a sample from a N(I1. Thus. e c p(x. if N0 1 = b1 < bo and 0 < x < b1 . .. where 11 is a known standard. X < h(a). Suppose {Po: 0 E Example 4.. For simplicity suppose that bo > n. the critical constant 8(0') is u5xn(a).Section 4. d t is UMP for H : (J < (Jo versus K : (J > (Jo. and because dt maximizes the power over this larger class.. where xn(a) is the ath quantile of the X. then e}. . 0 Example 4. H.Xn . R. is UMP level a. and specifies an 0' such that the probability of rejecting H (keeping a bad lot) is at most 0'. where b = N(J. we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard. and only if. If 0 < 00 . Suppose X!.3.5.(X) for testing H: 0 = 01 versus J( : 0 = 0. For instance. (X) < <> and b t is of level 0: for H : (J :S (Jo. N .l.l.3. then the test/hat rejects H if and only ifT(r) > t(1 .( NOo.3 Uniformly Most Po~rful Tests and Monotone likelihood Ratio Models 229 and Corollary 4. the alternative K as (J < (Jo. Corollary 4.U 2 ) population. . 0) = exp {~8 ~ IOg(27r0'2)} .<>.3. Suppose tha~ as in Example l.<» is lfMP level afor testing H: (J < (Jo versus K : (J > (Jo. n). Ifwe write ~= Uo t i''''l (Xi _1')2 000 we see that Sju5 has a X~ distribution. and we are interested in the precision u.1.l) yields e L( 0 0 )=b. 0 Proof (I) follows from b l = iPh(t) The following useful result follows immediately. If the inspector making the test considers lots with bo = N(Jo defectives or more unsatisfactory. N.2. Eoo. she formulates the hypothesis H as > (Jo.. Quality Control. 20' 2  This is a oneparameter exponential family and is MLR in T = So The UMP level 0' test rejects H if and only if 8 < s(a) where s(a) is such that Pa . where U O represents the minimum tolerable precision. 0.l.4. . is an MLRfamily in T(r). (l.
1.4 we might be uninterested in values of p.1. This is a subset of the alternative on which we are willing to tolerate low power. The critical values for the hypergeometric distribution are available on statistical calculators and software.. (0. This continuity of the power shows that not too much significance can be attached to acceptance of H. 0) continuous in 0.(}o.a. On the other hand. L is decreasing in x and the hypergeometric model is an MLR family in T( x) = r.3. Thus.:(x:. In our example this means that in addition to the indifference region and level a. the appropriate n is obtained by solving i' . In both these cases. ..4 because (3(p. H and K are of the fonn H : () < ()o and K : () > ()o.. i. By CoroUary 4.. ~) for some small ~ > 0 because such improvements are negligible.O'"O. as seen in Figure 4.1.'. This is possible for arbitrary /3 < 1 only by making the sample size n large enough. that is. this is a general phenomenon in MLR family models with p( x.. Note that a small signaltonoise ratio ~/ a will require a large sample size n. That is..x) <I L(x. However. This equation is equiValent to = {3 z(a) + .230 NotethatL(x. ". This is not serious in practice if we have an indifference region..x) (N .2). if all points in the alternative are of equal significance: We can find > 00 sufficiently close to 00 so that {3( 0) is arbitrarily close to {3(00) = a. I' . we want guaranteed power as well as an upper bound on the probability of type I error. we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small. Off the indifference region. In our nonnal example 4. in (O . It follows that 8* is UMP level Q.c:O". ° ° . not possible for all parameters in the alternative 8. forO <:1' < b1 1.+. .Oo. we specify {3 close to I and would like to have (3(!') > (3 for aU !' > t. we want the probability of correctly detecting an alternative K to be large.L. Thus.(b o .) is increasing. and the powers are continuous increasing functions with limOlO.' . Therefore. this is.1..OIl box (Nn+I)(blx) .n + I) .Il = (b l . in general. we would also like large power (3(0) when () E 8 1 ..ntI") for sample size n. {3(0) ~ a. For such the probability of falsely accepting H is almost 1 .1. I i (3(t) ~ 'I>(z(a) + ..ntI" = z({3) whose solution is il . 0 Power and Sample Size In the NeymanPearson framework we choose the test whose size is small. 1 I .I. In Example 4.1 and formula (4. ~) would be our indifference region.B 1 ) =0 forb l Testing and Confidence Regions Chapter 4 < X <n.
(h = 00 + Dt. = 0.05. we can have very great power for alternatives very close to O. Suppose 8 is a vector.) = (3 for n and find the approximate solution no+ 1 . and 0. to achieve approximate size a. Now let .05)2{1.1. we next show how to find the sample size that will "approximately" achieve desired power {3 for the size 0: test in the binomial example. This problem arises particularly in goodnessoffit tests (see Example 4.3 continued). that is. As a further example and precursor to Section 5.3 requires approximately 163 observations to have probability . where (3(0. > O. Formula (4. and only if. In Example 4. we solve (3(00 ) = Po" (S > s) for s using (4. Again using the nonnal approximation. Often there is a function q(B) such that H and K can be formulated as H : q(O) <: qo and K : q(O) > qo.1.0. ifn is very large and/or a is small. The power achievable (exactly. Such hypotheses are often rejected even though for practical purposes "the fit is good enough.4) and find the approximate critical value So ~ nOo 1 + 2 + z(l.35. when we test the hypothesis that a very large sample comes from a particular distribution. Bd.2) shows that.5). 0 ° Our discussion can be generalized.7) + 1. the size .Oi) [nOo(1 . if Oi = . we find (3(0) = PotS > so) = <I> ( [nO(l ~ 0)11/ 2 Now consider the indifference region (80 .4. test for = . Dt. 1. (3 = .3(0.90 of detecting the 17% increase in 8 from 0.3 to 0.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 231 Dual to the problem of not having enough power is that of having too much.6 (Example 4. Our discussion uses the classical normal approximation to the binomial distribution.Section 4." The reason is that n is so large that unimportant small discrepancies are picked up. 00 = 0.35(0. far from the hypothesis.80 ) .3. Thus.35. There are various ways of dealing with this problem.4 this would mean rejecting H if. They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is 0:.1.3. Example 4.86. We solve For instance.35 and n = 163 is 0. It is natural to associate statistical significance with practical significance so that a very low pvalue is interpreted as evidence that the alternative that holds is physically significant. First..645 x 0.4.282 x 0.1.00 )] 1/2 .4.90.55)}2 = 162. using the SPLUS package) for the level .05 binomial test of H : 8 = 0. we fleed n ~ (0.
2. 00). In general. a). J.< (10 among all tests depending on <.(0) = o} aild 9.2 is (}'2 = ~ E~ 1 (Xi .9.1 by taking q( 0) equal to the noncentrality parameter governing the distribution of the statistic under the alternative. Then the MLE of 0.£ is unknown.(Tn ) is an MLR family.232 Testing and Confidence Regions Chapter 4 q. then rejecting for large values of Tn is UMP among all tests based on Tn.0) > ° I(O. I}. > qo be a value such that we want to have power fJ(O) at least fJ when q(O) > q. is the a percentile of X~l' It is evident from the argument of Example 4..l)l(O.(0) so that 9 0 = {O : ).4) j j 1 .3.. 0 = Complete Families of Tests The NeymanPearson framework is based on using the 01 loss function. It may also happen that the distribution of Tn as () ranges over 9.4. a reasonable class of loss functions are those that satisfy : 1 i j 1 1(0. for instance. first let Co be the smallest number c such that Then let n be the smallest integer such that P" IT > col > fJ where 00 is such that q( 00 ) = qo and 0.1) 1(0.1. Example 4.= (10 versus K : 0..l' independent of IJ.+ 00. Suppose that in the Gaussian model of Example 4. is detennined by a onedimensional parameter ). We have seen in Example 4. I . To achieve level a and power at least (3. = {O : ). we may consider l(O. .. Suppose that j3(O) depends on () only through q( (J) and is a continuous increasing function of q( 0). .3.< (10 and rejecting H if Tn is small. that are not 01... The theory we have developed demonstrates that if C.3 that this test is UMP for H : (1 > (10 versus K : 0. The set {O : qo < q(O) < ql} is our indjfference region.(0) > O} and Co (Tn) = C'(') (Tn) for all O. the critical value for testing H : 0. and also increases to 1 for fixed () E 6 1 as n . Implicit in this calculation is the assumption that POl [T > col is an increasing function ofn. Reducing the problem to choosing among such tests comes from invariance consideration that we do not enter into until Volume II.O)<O forB < 00 forB>B o. a E A = {O. when testing H : 0 < 00 versus K : B > Bo. Although H : 0.2 only. 0) = (B . This procedure can be applied. Testing Precision Continued. for 9..= 00 is now composite. = (tm. 0 E 9.Bo).Xf as in Example 2. 0 E 9.3.7. to the F test of the linear model in Section 6. we illustrate what can happen with a simple example. the distribution of Tn ni:T 2/ (15 is X. (4. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(O.5 that a particular test statistic can have a fixed distribution £0 under the hypothesis. Thus. For each n suppose we have a level a test for H versus I< based on a suitable test statistic T. .3. For instance. is such that q( Oil = q. However.
5) holds for all 8.(X) = E"I"(X) > O. a) satisfies (4.(I"(X))) < 0 for 8 > 8 0 . 0 < " < 1. Suppose {Po: 0 E e}. 1") = (1(0. = R(O. locally most powerful (LMP) tests in some cases can be found.o) < R(O. the risk of any procedure can be matched or improved by an NP test. is complete.I") ~ = EO{I"(X)I(O. we show that for MLR models.Section 4. when UMP tests do not exist. it isn't worthwhile to look outside of complete classes.4 CONFIDENCE BOUNDS. hence. (4.3.(x)) .3. E.4). Proof. If E.I"(X) 0 for allO then O=(X) clearly satisfies (4. e 4.E.3.(X) = ". Thus.12) and. Intervals. INTERVALS.o.3. Theorem 4. In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from H.3. 1") for all 0 E e.I"(X) for 0 < 00 .(2) a v R(O.(X) be such that. then the class of tests of the form (4. 1) 1(0. hence. a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing ()o versus 8 1 is an increasing function of a statistic T( x) for every ()o < ()1. O))(E. Thus.4 Confidence Bounds. Finally.(Io. (4.(X)) = 1. For MLR models. if the model is correct and loss function is appropriate.O) + [1(0. is similarly UMP for H : 0 > 8 0 versus K : 0 < 00 (Problem 4. 1) + [1 1"(X)]I(O.R(O. and Regions 233 if for any decision rule 'P The class D of decision procedures is said to be there exists E such that complete(I).0.(o. In the following the decision procedures arc test functions. We also show how. the class of NP tests is complete in the sense that for loss functions other than the 01 loss function.3) with Eo. For () real.(X) > 1. for some 00 • E"o.E.O)]I"(X)} Let o.6) But 1 . The risk function of any test rule 'P is R(O.1 and.3.O)} E.{1(8. is an MLR family in T( x) and suppose the loss function 1(0. 0 Summary.E. Now 0.3. e c R. is UMP for H : 0 < 00 versus K: 8 > 8 0 by Theorem 4.3.5) That is. (4. AND REGIONS We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter {J i" a .5). then any procedure not in the complete class can be matched or improved at all () by one in the complete class.3. 1) 1(O. the test that rejects H : 8 < 8 0 for large values of T(x) is UMP for K : 8 > 8 0 .2.) .o. We consider models {Po : () E e} for which there exist tests that are most powerful for every () in a composite alternative t (UMP tests).
member of a specified set Θ₀. Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability 1 − α.

As an illustration consider Example 4.4, where X₁, ..., Xₙ are i.i.d. N(μ, σ²) with σ² known. Suppose that μ represents the mean increase in sleep among patients administered a drug. Then we can use the experimental outcome X = (X₁, ..., Xₙ) to establish a lower bound μ(X) for μ with a prescribed probability (1 − α) of being correct. In the N(μ, σ²) example this means finding a statistic μ(X) such that P(μ(X) ≤ μ) = 1 − α. This is achieved by writing

P(√n(X̄ − μ)/σ ≤ z(1 − α)) = 1 − α

and solving the inequality inside the probability for μ, which gives

P(X̄ − σz(1 − α)/√n ≤ μ) = 1 − α.

This gives μ(X) = X̄ − σz(1 − α)/√n, a lower bound with P(μ(X) ≤ μ) = 1 − α. We say that μ(X) is a lower confidence bound with confidence level 1 − α. Similarly, we may be interested in an upper bound on the parameter, that is, a statistic μ̄(X) such that P(μ̄(X) ≥ μ) = 1 − α; a solution is μ̄(X) = X̄ + σz(1 − α)/√n. Here μ̄(X) is called an upper level (1 − α) confidence bound for μ. Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find a such that the probability that the interval [X̄ − a, X̄ + a] contains μ is 1 − α. In our example this is achieved by noting that

P(X̄ − σz(1 − ½α)/√n ≤ μ ≤ X̄ + σz(1 − ½α)/√n) = 1 − α.

We say that [μ₋(X), μ₊(X)] is a level (1 − α) confidence interval for μ, where

μ±(X) = X̄ ± σz(1 − ½α)/√n.

In general, if ν = ν(P) is a parameter and X ∈ Rᵠ, X ~ P, P ∈ 𝒫, we look for a statistic ν(X) that satisfies P(ν(X) ≤ ν) ≥ 1 − α for a prescribed (1 − α) such as .95 or some other desired level of confidence. In the non-Bayesian framework, where ν is a constant, it may not be possible for a bound or interval to achieve exactly probability (1 − α); in this case, we settle for a probability at least (1 − α).
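For readers who want to compute these quantities, the following sketch (not part of the original text) evaluates the lower bound, the upper bound, and the two-sided interval in the known-σ normal model. It assumes the Python NumPy/SciPy environment, and the data values, σ, and α are hypothetical.

```python
# Sketch: level (1 - alpha) lower bound, upper bound, and two-sided interval for mu
# in the N(mu, sigma^2) model with sigma known.  Data values are hypothetical.
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.8, 1.5, 0.3, 1.1, 0.9])     # hypothetical sleep-gain measurements
sigma, alpha = 1.0, 0.05
n, xbar = len(x), x.mean()

lcb = xbar - sigma * norm.ppf(1 - alpha) / np.sqrt(n)      # lower confidence bound
ucb = xbar + sigma * norm.ppf(1 - alpha) / np.sqrt(n)      # upper confidence bound
half = sigma * norm.ppf(1 - alpha / 2) / np.sqrt(n)
interval = (xbar - half, xbar + half)                      # level (1 - alpha) interval
```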
1) = T(I') . In this process Z(I') is called a pivot.0)% confidence intervolfor v if.0) or a 100(1 . For the normal measurement problem we have just discussed the probability of coverage is independent of P and equals the confidence coefficient.X) n i=l 2 .. the random interval [v( X). Moreover.! that Z(iL)1 "jVI(n .4. Example 4.O!) lower confidence bound for v if for every PEP. P[v(X) = v] > 10. Intervals. which has a X~l distribution. by Theorem B.Section 4. v(X) is a level (I . has a N(o. 1) distribution and is..o. independent of V = (n . Now we tum to the (72 unknown case and propose the pivot T(J.L.1.4.e. the confidence level is clearly not unique because any number (I . and assume initially that (72 is known.. ii( Xl] formed by a pair of statistics v(X). has aN(O. we will need the distribution of Now Z(I') = Jii(X 1')/". In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level.4 Confidence Bounds.~o) for 1'.!o) < Z(I') < z (1. Similarly.3. P E PI} (i. The quantities on the left are called the probabilities of coverage and (1 . the minimum probability of coverage).0) is.L) obtained by replacing (7 in Z(J.a) upper confidence bound for v if for every PEP. In the preceding discussion we used the fact tbat Z (1') = Jii(X . In general. finding confidence intervals (or bounds) often involves finding appropriate pivots.3. and Regions 235 Definition 4.a) is called a confidence level. lhat is. P[v(X) < vi > I .L) by its estimate s.1. I) distribu~on to obtain a confidence interval for I' by solving z (1 .0) will be a confidence level if (I . Let X t . where S 2 = 1 nI L (X. . " X n be a sample from a N(J. D(X) is called a level (1 . For a given bound or interval.l)s2 lIT'.o. Note that in the case of intervals this is just inf{ PI!c(X) < v < v(X).0') < (J . We conclude from the definition of the (Student) t distribution in Section B. A statistic v(X) is called a level (1 . for all PEP. P[v(X) < v < v(X)] > I .1')1". The (Student) t Interval and Bounds.3. (72) population.
a).11 whatever be Il and Tj.fii are natural lower and upper confidence bounds with confidence coefficients (1 . " .005. By Theorem B . we have assumed that Xl.1..n < J. for very skew distributions such as the X2 with few degrees of freedom. and if al + a2 = a.~a)/.3. or Tables I and II..3.~a)) = 1.355 and IX . we enter Table II to find that the probability that a 'TnI variable exceeds 3.~a) = 3.st n_l (1 ~.nJ = 1" X ± sc/. . we see that as n + 00 the Tn_l distribution converges in law to the standard normal distribution. or very heavytailed distributions such as the Cauchy.l (I .Q). Thus.)/.Xn is a sample from a N(p" a 2 ) population. Hence. Then (72. 0 . if n = 9 and a = 0. • " ~...236 has the t distribution 7. Confidence Intervals and Bounds for the Vartance of a Normal Distribution.1 (1 . then P(X("I) < V(<7 2) < x(l.) confidence interval of the type [X .4. ...1..l)s2/X(1.355 is .1 )s2/<7 2 has a X. .4. the confidence coefficient of (4.) /. For the usual values of a. To calculate the coefficients t n. . In this case the interval (4.1) The shortest level (I . . Up to this point.1) in nonGaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5.. .4.1 distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal.st n _ 1 (I . Suppose that X 1. .. Example 4. the interval will have probability (1 .a. N (p" a 2 ). It turns out that the distribution of the pivot T(J. .a)..1) has confidence coefficient close to 1 .l (I .!a) and tndl .fii.2.l)s2/X("I)1 is a confidence interval with confidence coefficient (1 . .1 distribution and can be used as a pivot. t n .1.. X . thus.355s/3.stn_ 1 (I . Solving the inequality inside the probability for Jt. By solving the inequality inside the probability for a 2 we find that [(n .Q. For instance.)/ v'n]. X n are i. . we use a calculator.. (4.a) /.fii and X + stn. computer software.d. .1 (p) by the standard normal quantile z(p) for n > 120."2)' (n . X +stn _ 1 (1 Similarly.1.2) f..l < X + st n_ 1 (1.a) in the limit as n + 00.. ~.355s/31 is the desired level 0.1 (I . (4. distribution. • r. .n is. we can reasonably replace t n .7.~a) < T(J..12).3. we find P [X .4.~a)/. If we assume a 2 < 00. if we let Xn l(p) denote the pth quantile of the X~1 distribution. See Figure 5.l) is fairly close to the Tn .1) can be much larger than 1 .. .i. Testing and Confidence Regions Chapter 4 1 ."2)) = 1".99 confidence intervaL From the results of Section B. Let tk (p) denote the pth quantile of the P (t n..01.7 (see Problem 8. The properties of confidence intervals such as (4. V( <7 2 ) = (n .a. On the other hand. X + 3.l) < t n_ 1 (1. i '1 !.
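The t interval for μ and the chi-square interval for σ² just derived can be computed directly from quantile routines; the following sketch (not part of the original text) assumes the Python SciPy library and uses a hypothetical sample of size n = 9 with α = .01, so that the t quantile reproduces the value 3.355 cited above.

```python
# Sketch: the (Student) t interval for mu and the equal-tailed chi-square interval for
# sigma^2 when both parameters are unknown.  Data are hypothetical; n = 9, alpha = .01.
import numpy as np
from scipy.stats import t, chi2

x = np.array([61.2, 58.7, 63.1, 59.4, 60.8, 62.0, 57.9, 61.5, 60.3])
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)
s, alpha = np.sqrt(s2), 0.01

tq = t.ppf(1 - alpha / 2, n - 1)                 # = 3.355 for n = 9, alpha = .01
mu_interval = (xbar - tq * s / np.sqrt(n), xbar + tq * s / np.sqrt(n))

# Pivot (n-1)s^2/sigma^2 ~ chi^2_{n-1} gives the interval for sigma^2:
var_interval = ((n - 1) * s2 / chi2.ppf(1 - alpha / 2, n - 1),
                (n - 1) * s2 / chi2.ppf(alpha / 2, n - 1))
```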
The length of this interval is random. There is a unique choice of α₁ and α₂ that uniformly minimizes expected length among all intervals of this type; it may be shown that, for n large, taking α₁ = α₂ = ½α is not far from optimal (Tate and Klett, 1959). The pivot V(σ²) similarly yields the respective lower and upper confidence bounds (n − 1)s²/x(1 − α) and (n − 1)s²/x(α).

However, if we drop the normality assumption, the situation changes. In contrast to Example 4.4.1, the confidence interval and bounds for σ² do not have confidence coefficient 1 − α even in the limit as n → ∞. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution. In Problem 4.4.16 we give an interval with correct limiting coverage probability.

The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in n Bernoulli Trials. If X₁, ..., Xₙ are the indicators of n Bernoulli trials with probability of success θ, then X̄ is the MLE of θ. There is no natural "exact" pivot based on X̄ and θ. However, by the De Moivre-Laplace theorem, √n(X̄ − θ)/√(θ(1 − θ)) has approximately a N(0, 1) distribution. If we use this function as an "approximate" pivot and let ≈ denote "approximate equality," we can write

P[−z(1 − ½α) ≤ √n(X̄ − θ)/√(θ(1 − θ)) ≤ z(1 − ½α)] ≈ 1 − α.

Let kα = z(1 − ½α) and observe that this is equivalent to

P[(X̄ − θ)² ≤ k²α θ(1 − θ)/n] ≈ 1 − α, that is, P[g(θ, X̄) ≤ 0] ≈ 1 − α,

where

g(θ, X̄) = (1 + k²α/n)θ² − (2X̄ + k²α/n)θ + X̄².

For fixed 0 < X̄ < 1, g(θ, X̄) is a quadratic polynomial with two real roots. In terms of S = nX̄, they are(1)

θ̄(X) = {S + ½k²α + kα[S(n − S)/n + k²α/4]^½}/(n + k²α)      (4.4.3)

θ(X) = {S + ½k²α − kα[S(n − S)/n + k²α/4]^½}/(n + k²α).      (4.4.4)

Because the coefficient of θ² in g(θ, X̄) is greater than zero,

{θ : g(θ, X̄) ≤ 0} = {θ : θ(X) ≤ θ ≤ θ̄(X)}.
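The roots (4.4.3) and (4.4.4) are straightforward to evaluate; the following sketch (not part of the original text) assumes the Python SciPy library, and the values of S and n are hypothetical.

```python
# Sketch: the approximate interval obtained by solving g(theta, Xbar) <= 0 for theta,
# i.e., the two roots (4.4.3)-(4.4.4).  S and n are hypothetical illustrations.
import numpy as np
from scipy.stats import norm

def approx_binom_interval(S, n, alpha=0.05):
    k = norm.ppf(1 - alpha / 2)
    half = k * np.sqrt(S * (n - S) / n + k ** 2 / 4)
    lower = (S + k ** 2 / 2 - half) / (n + k ** 2)
    upper = (S + k ** 2 / 2 + half) / (n + k ** 2)
    return lower, upper

print(approx_binom_interval(S=40, n=100))
```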
and uses the preceding model.96) 2 = 9.5) (4. to bound I above by 1 0 choose . we choose n so that ka. Cai.2a) interval are approximate upper and lower level (1 . (1 . He draws a sample of n potential customers.600. For instance. Another approximate pivot for this example is yIn(X . = length 0.4) has length 0.a) confidence bounds.S)ln] Now use the fact that + kzJ4}(n + k~)l (S .S)ln ~ to conclude that tn . calls willingness to buy success.975. we n . consider the market researcher whose interest is the proportion B of a population that will buy a product. ka.a) confidence interval for O.X).5) is used and it is only good when 8 is near 1/2.. say I.601. and Das Gupta (2000) for a discussion. Better results can be obtained if one has upper or lower bounds on 8 such as 8 < 00 < ~. of the interval is I ~ 2ko { JIS(n .5.96.O > 8 1 > ~. if the smaller of nO..4.~a = 0.~n)2 < S(n . note that the length.~a) ko 2 kcc i 1.96 0. A discussion is given in Brown.a) procedure developed in Section 4. .4.02 by choosing n so that Z = n~ ( 1.8) is at least 6.95. That is.7) See Brown. He can then detennine how many customers should be sampled so that (4. = T. and we can achieve the desired .02 and is a confidence interval with confidence coefficient approximately 0. We can similarly show that the endpoints of the level (1 .(1. This fonnula for the sample size is very crude because (4. Note that in this example we can detennine the sample size needed for desired accuracy.16. See Problem 4.4. and Das Gupta (2000). i o 1 [.02 ) 2 .02. For small n. .4. [n this case. 1 .: iI j .0)1 JX(1 . O(X)] is an approximate level (1 . This leads to the simple interval 1 I (4.6) Thus.238 Testing and Confidence Regions Chapter 4 so that [O(X).n 1 tn (4. = 0.4.(n + k~)~ = 10 . n(l . it is better to use the exact level (1 .4. i " . These inlervals and bounds are satisfactory in practice for the usual levels. To see this. or n = 9. Cai.4.
j==1 (4. That is.4.r} is said to be a level (1 . (T c' Tc ) are independent. Note that ifC(X) is a level (1 ~ 0') confidence region for 0..a). £(8. .) confidence interval for qj (0) and if the ) pairs (T l ' 1'.a). Confidence Regions of Higher Dimension We can extend the notion of a confidence interval for onedimensional functions q( B) to rdimensional vectors q( 0) = (q. By Problem B. ..a).). we can find confidence regions for q( 0) entirely contained in q( C(X)) with confidence level (1 ..0').8) . In this case. then q(C(X)) ~ {q(O) . x IT(1a. . 2nX/0 has a chi~square... ..4. qc( 0)) is at least (I .. and suppose we want a confidence interval for the population proportion P(X ? x) of subscribers that spend at least x hours per week on the Internet. if q( B) = 01 . . If q is not 1 . .(X) < q. For instance. this technique is typically wasteful.~a) < 0 <2nX/x (~a) where x(j3) denotes the 13th quantile of the X~n distribution. qc (0)).3.Section 44 Confidence Bounds. q(C(X)) is larger than the confidence set obtained by focusing on B1 alone.. Example 4. then exp{ x/O} edenote the lower < q(O) < exp{ x/i)} is a confidence interval for q( 0) with confidence coefficient (1 . We write this as P[q(O) E I(X)I > 1.0') confidence region for q(O).a Note that if I. X n is modeled as a sample from an exponential. Let Xl..). Here q(O) = 1 .(X). and Regions 239 Confidence Regions for Functions of Parameters We can define confidence regions for a function q(O) as random subsets of the range of q that cover the true value of q(O) with probability at least (1 . we will later give confidence regions C(X) for pairs 8 = (0 1 . Suppose q (X) and ii) (X) are realJ valued.a. ~.a) confidence region.4. (0). I is a level (I . ( 2 ) T.4.1 ). By using 2nX /0 as a pivot we find the (1 .X n denote the number of hours a sample of internet subscribers spend per week on the Internet. X~n' distribution. . j ~ I. then the rectangle I( X) Ic(X) has level = h (X) x . Suppose X I. Then the rdimensional random rectangle I(X) = {q(O). . (X) ~ [q. ii.a) confidence interval 2nX/x (I . 0 E C(X)} is a level (1.(O) < ii.F(x) = exp{ x/O}.. if the probability that it covers the unknown but fixed true (ql (0). Intervals. Let 0 and and upper boundaries of this interval. distribution. .... . ..1. .
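The exponential example above is easy to compute: 2nX̄/θ is a χ²₂ₙ pivot, and because q(θ) = exp{−x/θ} is increasing in θ, the interval for θ maps directly into one for q(θ). The sketch below is not part of the original text; it assumes the Python SciPy library, and the data and the threshold x₀ are hypothetical.

```python
# Sketch: chi-square pivot 2*n*Xbar/theta ~ chi^2_{2n} for the exponential model,
# giving intervals for theta and for q(theta) = P(X >= x0) = exp(-x0/theta).
import numpy as np
from scipy.stats import chi2

x = np.array([3.2, 7.5, 1.1, 4.8, 2.9, 6.4])     # hypothetical weekly hours
n, xbar, alpha, x0 = len(x), np.mean(x), 0.05, 5.0

theta_lo = 2 * n * xbar / chi2.ppf(1 - alpha / 2, 2 * n)
theta_hi = 2 * n * xbar / chi2.ppf(alpha / 2, 2 * n)

# q(theta) = exp(-x0/theta) is increasing in theta, so the endpoints map directly.
q_interval = (np.exp(-x0 / theta_lo), np.exp(-x0 / theta_hi))
```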
Let dO' be chosen such that PF(Dn(F) < do) = 1 .0:. Thus. if we choose a J = air.2).2. for each t E R. Suppose Xl.ia)' Xnl (lo) is a reasonable confidence interval for (72 with confidence coefficient (1. if we choose 0' j = 1 .. (7 2). An approach that works even if the I j are not independent is to use Bonferroni's inequality (A. F(t) + do})· We have shown ~ ~ I o P(C(X)(t) :) F(t) for all t E R) for all P E 'P =1. an rdimensional confidence rectangle is in this case automaticalll obtained from the onedimensional intervals. then I(X) has confidence level (1 . We assume that F is continuous.do}. then leX) has confidence level 1 . as X rv P.4..40) 2. From Example 4..Laj. . Suppose Xl> . X n are i.2.4..~a).i. and we are interested in the distribution function F(t) = P(X < t). From Example 4.240 Testing and Confidence Regions Chapter 4 Thus. See Problem 4. j=1 j=1 Thus.1) the distribution of . .. . tl continuous in t.7). Confidence Rectangle for the Parameters ofa Normal Distribution.1.5.d..a).4. we find that a simultaneous in t size 1. Example 4.F(t)1 tEn i " does not depend on F and is known (Example 4. That is. in which case (Proposition 4. is a confidence interval for J1 with confidence coefficient (1  I (X) _ [(nl)S2 (nl)S2] 2 Xnl (1 .(1 .a = set of P with P( 00. Example 4. .a) confidence rectangle for (/" .a) r. 0 The method of pivots can also be applied to oodimensional parameters such as F.LP[qj(O) '" Ij(X)1 > 1. that is.1.6.0 confidence region C(x)(·) is the confidence band which.~a)/rn !a).1 h(X) = X ± stn_l (1. Then by solving Dn(F) < do for F. r. ( 2 ) sample and we want a confidence rectangle for (Ii. .5). D n (F) is a pivot. Moreover. j = 1.. min{l. consists of the interval i J C(x)(t) = (max{O.4. . c c P[q(O) E I(X)] > 1.15. It is possible to show that the exact confidence coefficient is (1 . F(t) . . v(P) = F(·). h (X) X 12 (X) is a level (1 .. 1 X n is a N (M. According to this inequality.a.4. ~ ~'f Dn(F) = sup IF(t) .
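As a concrete version of the Bonferroni rectangle for (μ, σ²), the following sketch (not part of the original text) builds each side at level 1 − α/2 so that the rectangle has level at least 1 − α. It assumes the Python SciPy library, and the simulated data are hypothetical.

```python
# Sketch: Bonferroni confidence rectangle for (mu, sigma^2) from a normal sample;
# each side has level 1 - alpha/2, so the rectangle has level >= 1 - alpha.
import numpy as np
from scipy.stats import t, chi2

x = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=20)   # hypothetical data
n, xbar, s2, alpha = len(x), x.mean(), x.var(ddof=1), 0.05

tq = t.ppf(1 - alpha / 4, n - 1)
I1 = (xbar - tq * np.sqrt(s2 / n), xbar + tq * np.sqrt(s2 / n))      # interval for mu
I2 = ((n - 1) * s2 / chi2.ppf(1 - alpha / 4, n - 1),
      (n - 1) * s2 / chi2.ppf(alpha / 4, n - 1))                     # interval for sigma^2
# By Bonferroni's inequality, P((mu, sigma^2) in I1 x I2) >= 1 - alpha.
```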
19. X n are i.!(F) : F E C(X)) ~ J.! ~ /L(F) = f o tf(t)dt = exists.4. 1992) where such bounds are discussed and shown to be asymptotically strictly conservative. for a given hypothesis H.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least 1 . .0') lower confidence bound for /1. o 4.6. a 2 ) model with a 2 unknown. if J.4.18) arise in accounting practice (see Bickel. 1WoSided Tests for the Mean of a Normal Distribution. Suppose Xl. and we derive an exact confidence interval for the binomial parameter. We derive the (Student) t interval for I' in the N(Il. given by (4. subsets of the sample space with probability of accepting H at least 1 .!(F+) and sUP{J.4. Example 4.4 and 4.1..9)   because for C(X) as in Example 4. We define lower and upper confidence bounds (LCBs and DCBs). then Let F(t) and F+(t) be the lower and upper simultaneous confidence boundaries of Example 4. In a parametric model {Pe : BEe}. Then a (1 . We shall establish a duality between confidence regions and acceptance regions for families of hypotheses.. .d. J. and more generally confidence regions.i." for all PEP.6. In a nonparametric setting we derive a simultaneous confidence interval for the distribution function F( t) and the mean of a positive variable X.5 to give confidence regions for scalar or vector parameters in nonparametric models.0'.4. a level 1 . Suppose that an established theory postulates the value Po for a certain physical constant.!(F)   ~ oosee Problem 4.4. A scientist has reasons to believe that the theory is incorrect and measures the constant n times obtaining . Intervals for the case F supported on an interval (see Problem 4. Acceptance regions of statistical tests are. is /1.Section 4.7.4.0: when H is true. A Lower Confidence Boundfor the Mean of a Nonnegative Random Variable.0: confidence region for a parameter q(B) is a set C(x) depending only on the data x such that the probability under Pe that C(X) covers q( 8) is at least 1 .! ~ inf{J.4. We begin by illustrating the duality in the following example. By integration by parts. 0 Summary. confidence intervals.5 The Duality Between Confidence Regions and Tests 241 We can apply the notions studied in Examples 4.!(F) : F E C(X)} = J. For a nonparametric class P ~ {P} and parameter v ~ v(P). as X and that X has a density f( t) = F' (t). we similarly require P(C(X) :J v) > 1. which is zero for t < 0 and nonzero for t > O. Example 4.0: for all E e.5.4.
2(p) takes values in N = (0. . if we start out with (say) the level (1 .4. This test is called twosided because it rejects for both large and small values of the statistic T.5.00). (/" . t n .5. . it has power against parameter values on either side of flo.a) LCB X . consider v(P) = F.1 (1 .5. 00) X (0. all PEP. that is P[v E S(X)] > 1 .. i:. O{X.2) we obtain the confidence interval (4. We can base a size Q' test on the level (1 .a.~1 >tn_l(l~a) ootherwise. = 1. Evidently. For a function space example.l other than flo is a possible alternative.1) is used for every flo we see that we have.fLo)/ s. and in Example 4.6.I n . if and only if.~Ct). and only if. then it is reasonable to formulate the problem as that of testing H : fl = Jlo versus K : fl i.4. (4.1. where F is the distribution function of Xi' Here an example of N is the class of all continuous distribution functions. We accept H. by starting with the test (4.~a).2. generated a family of level a tests {J(X.2) These tests correspond to different hypotheses.00).2) takes values in N = (00 .Q) confidence interval (4.a) confidence region for v if the probability that S(X) contains v is at least (1 . Let S = S{X) be a map from X to subsets of N.4.a). X .4.!a) < T < I n. /l) ~ O.2 = . if and only if. (4.4..flO. o These are examples of a general phenomenon.00).. • j n .. as in Example 4.5. 1.5. We achieve a similar effect. in Example 4.a)s/ vn and define J'(X. X + Slnl (1  ~a) /vn]. generating a family of level a tests.~a)] = 0 the test is equivalently characterized by rejecting H when ITI > tnl (1 .1 (1. In contrast to the tests of Example 4..1) by finding the set of /l where J(X..1 (1 .5'[) If we let T = vn(X . in Example 4.242 Testing and Confidence Regions Chapter 4 measurements Xl . in fact. Consider the general framework where the random vector X takes values in the sample space X C Rq and X has distribution PEP. For instance. Because the same interval (4.1) we constructed for Jl as follows... /l) to equal 1 if. Jlo) being of size a only for the hypothesis H : Jl = flo· Conversely. /l = /l(P) takes values in N = (00.. the postulated value JLo is a member of the level (1 0' ) confidence interval [X  Slnl (1 ~a) /vn. then S is a (I .I n _ 1 (1.a. Xl!_ Knowledge of his instruments leads him to assume that the Xi are independent and identically distributed normal random variables with mean {L and variance a 2 . /l)} where lifvnIX. Let v = v{P) be a parameter that takes values in the set N. If any value of J.l (1.a)s/vn > /l. then our test accepts H.4. Because p"IITI = I n. .
Ct confidence region for v.(01 + 0"2). if 0"1 + 0"2 < 1. 0 We next give connections between confidence bounds. let .0".a has a solution O.(T) of Fo(T) = a with coefficient (1 . H may be rejected. Conversely.Ct.vo) = OJ is a subset of X with probability at least 1 . We have the following. For some specified Va. Then the acceptance regIOn A(vo) ~ (x: J(x.0" quantile of F oo • By the duality theorem.Ct). Moreover. By Corollary 4. We next apply the duality theorem to MLR families: Theorem 4. Formally. H may be accepted.Section 4. then ()u(T) is a lower confidence boundfor () with confidence coefficient 1 . Suppose we have a test 15(X. the acceptance region of the UMP size a test of H : 0 = 00 versus K : > ()a can be written e A(Oo) = (x: T(x) < too (1. By applying F o to hoth sides oft we find 8(t) = (O E e: Fo(t) < 1. Suppose X ~ Po where (Po: 0 E ej is MLR in T = T(X) and suppose that the distribution function Fo(t) ofT under Po is continuous in each of the variables t and 0 when the other is fixed. The proofs for the npper confid~nce hound and interval follow by the same type of argument.0" confidence region for v. is a level 0 test for HI/o. < to(la). acceptance regions. Let S(X) = (va EN. Duality Theorem. X E A(vo)). then the test that accepts HI/o if and only if Va is in S(X). If the equation Fo(t) = 1 . eo e Proof.a)}.3. if 8(t) = (O E e: t < to(1 . va) with level 0". then [QUI' oU2] is confidence interval for () with confidence coefficient 1 . if S(X) is a level 1 .aj. That i~. the power function Po(T > t) = 1 . for other specified Va. Fe (t) is decreasing in ().a)) where to o (1 .1. this is a random set contained in N with probapility at least 1 . 00). then 8(T) is a Ia confidence region forO. E is an upper confidence bound for 0 with any solution e.5 The Duality Between Confidence Regions and Tests 243 Next consider the testing framework where we test the hypothesis H = Hvo : v = Va for some specified value va. It follows that Fe (t) < 1 . let Pvo = (P : v(P) ~ Vo : va E V}. Consider the set of Va for which HI/o is accepted. Similarly.Fo(t) for a test with critical constant t is increasing in (). then PIX E A(vo)) > 1  a for all P E P vo if and only if S (X) is a 1 .G iff 0 > Iia(t) and 8(t) = Ilia. By Theorem 4.5.3.1.0" of containing the true value of v(P) whatever be P.Ct) is the 1. and pvalues for MLR families: Let t denote the observed value t = T(x) ofT(X) for the datum x.1.(t) in e.
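The duality can also be used computationally: given data x and a family of size α tests, the confidence region is simply the set of hypothesized values that are accepted. The sketch below is not part of the original text; it scans a grid (our own illustrative device), uses the two-sided t test of the earlier example, assumes the Python SciPy library, and uses hypothetical data.

```python
# Sketch of the duality: invert a family of size-alpha tests by collecting, over a
# grid, the hypothesized values mu0 that are accepted; this numerically recovers
# the two-sided t interval.  Data are hypothetical.
import numpy as np
from scipy.stats import t

x = np.array([2.1, 1.7, 2.9, 2.4, 1.5, 2.8])     # hypothetical measurements
n, xbar, s, alpha = len(x), x.mean(), x.std(ddof=1), 0.05

def accept(mu0):                                  # two-sided size-alpha t test of H: mu = mu0
    T = np.sqrt(n) * (xbar - mu0) / s
    return abs(T) <= t.ppf(1 - alpha / 2, n - 1)

grid = np.linspace(xbar - 3, xbar + 3, 2001)
region = grid[np.array([accept(m) for m in grid])]
print(region.min(), region.max())                 # approximates the t interval endpoints
```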
We illustrate these ideas using the example of testing H : fL = J. we seek reasonable exact level (1 . .2. vertical sections of C are the confidence regions B( t) whereas borizontal sections are the acceptance regions A" (v) = {t : J( t. A"(B) Set) Proof.a)] > a} = [~(t).v) : a(t.a) test of H. The result follows. We shall use some of the results derived in Example 4.. where S = Ef I Xi_ To analyze the structure of the region we need to examine k(fJ. Because D Fe{t) is a distribution function.5. Let T = X.B) > a} {B: a(t. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. • a(t. Figure 4.Fo(t) is decreasing in t.80 E (0. (ii) k(B. a). vol denote the pvalue of a test6(T.Xn ) = {B : S < k(B.10 when X I . B) ~ poeT > t) = 1 . Let Xl. To find a lower confidence bound for B our preceding discussion leads us to consider level a tests for H : 6 < 00..3. • . The pvalue is {t: a(t.p): It pI < <7Z (1.to(l. In general. for the given t. . In the (t.oo).1. 1 . We claim that (i) k(B.1). cr 2 ) with 0. v will be accepted. then C = {(t. I c= {(t.v) where. a) denote the critical constant of a level (1.1). and A"(B) = T(A(B)) = {T(x) : x Corollary 4. B) pnints. 00 ) denote the pvalue for the UMP size Q' test of H : () let = eo versus K : () > eo.B) = (oo. Let k(Bo. v) =O} gives the pairs (t. . B) plane. '1 Example 4. Under the conditions a/Theorem 4.a)ifBiBo. We have seen in the proof of Theorem 4. We call C the set of compatible (t. E A(B)). v) = O}.1 244 Testing and Confidence Regions Chapter 4 1 o:(t.1 that 1 .a) is nondecreasing in B. Then the set . . .d. and an acceptance set A" (po) for this example.Fo(t).3. v) <a} = {(t.a) confidence region is given by C(X t.4.1.a)~k(Bo. : : !.1.I}. a) . _1 X n be the indicators of n binomial trials with probability of success 6. and for the given v. N (IL..2 known. t is in the acceptance region. a confidence region S( to).v) : J(t. For a E (0.1 shows the set C.5. vo) = 1 IT > cJ of H : v ~ Vo based on a statistic T = T(X) with observed value t = T(x).1 I.5. .Fo(t) is increasing in O.~) upper and lower confidence bounds and confidence intervals for B.~a)/ v'n}. . I X n are U.. The corresponding level (1 . let a( t.
(ii). Q).[S > j] < Q for all 0 < 00 O(S) = inf{O: k(O. if 0 > 00 .Q) = n+ 1. Therefore. and (iv) we see that.Section 4. [S > j] > Q. Q) as 0 tOo. To prove (i) note that it was shown in Theorem 4. [S > j] < Q. (iii). whereas A'" (110) is the acceptance region for Hp. The assertion (ii) is a consequence of the following remarks. Po.5 The Duality Between Confidence Regions and Tests 245 Figure 4. (iv) k(O.L = {to in the normal model. I] if S ifS >0 =0 . The claims (iii) and (iv) are left as exercises. If fJo is a discontinuity point 0 k(O.Q)] > Po.[S > k(02. From (i). hence.Q) = S + I}.Q) I] > Po. let j be the limit of k(O. Clearly. and. Therefore.1 (i) that PetS > j] is nondecreasing in () for fixed j. The shaded region is the compatibility set C for the twosided test of Hp. Q). [S > j] = Q and j = k(Oo. a) increases by exactly 1 at its points of discontinuity. it is also nonincreasing in j for fixed e.o : J. Then P.[S > k(O"Q) I] > Q. then C(X) ={ (O(S).[S > k(02.3.1. Q) would imply that Q > Po. e < e and 1 2 k(O" Q) > k(02.I] [0. P. Po.o' (iii) k(fJ. if we define a contradiction. On the other hand.Q) ~ I andk(I. S(to) is a confidence interval for 11 for a given value to of T.5.
and θ(S) is the desired level (1 − α) LCB for θ. When S = 0 we set θ(S) = 0; when S > 0, we find θ(S) as the unique solution of the equation

P_θ[X ≥ S] = Σ_{r=S}^{n} (n choose r) θ^r (1 − θ)^{n−r} = α.

Similarly, if we define

θ̄(S) = sup{θ : j(θ, α) = S − 1},

where j(θ, α) is the analogous critical constant for the lower tail, then θ̄(S) is a level (1 − α) UCB for θ. When S = n, θ̄(S) = 1; when S < n, we find θ̄(S) as the unique solution of

P_θ[X ≤ S] = Σ_{r=0}^{S} (n choose r) θ^r (1 − θ)^{n−r} = α.

Putting the bounds θ(S), θ̄(S) together we get the confidence interval [θ(S), θ̄(S)] of level (1 − 2α). These intervals can be obtained from computer packages that use algorithms based on the preceding considerations. As might be expected, if n is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3. Figure 4.5.2 portrays the situation.

[Figure 4.5.2. Plot of k(θ, .16) for n = 2.]
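The exact bounds defined by these tail equations can be computed without iteration by using the standard beta-quantile representation of the binomial tail (a representation not used in the text itself). The sketch below is not part of the original text; it assumes the Python SciPy library, and S and n are hypothetical.

```python
# Sketch: exact level (1 - alpha) lower and upper confidence bounds for theta,
# obtained by solving the binomial tail equations; the solutions can be written
# as beta quantiles (Clopper-Pearson form).  S and n are illustrative.
from scipy.stats import beta

def exact_bounds(S, n, alpha=0.05):
    lower = 0.0 if S == 0 else beta.ppf(alpha, S, n - S + 1)       # solves P_theta(X >= S) = alpha
    upper = 1.0 if S == n else beta.ppf(1 - alpha, S + 1, n - S)   # solves P_theta(X <= S) = alpha
    return lower, upper                                            # [lower, upper] has level 1 - 2*alpha

print(exact_bounds(S=7, n=20, alpha=0.05))
```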
Example 4.5. Using this interval and (4.5 The Duality Between Confidence Regions and Tests 247 Applications of Confidence Intervals to Comparisons and Selections We have seen that confidence intervals lead naturally to twosided tests. but if H is rejected. If we decide B < 0. 2.5. we test H : B = 0 versus K : 8 i.13.!a). suppose B is the expected difference in blood pressure when two treatments. N(/l.3) 3.8 < B • or B > 80 is an example of a threeo decision problem and is a special case of the decision problems in Section 1. are given to high blood pressure patients. Thus.4.O.2 ) with u 2 known. o o Decide B < 80 if! is entirely to the left of B . This event has probability Similarly. Suppose Xl. the probability of the wrong decision is at most ~Q.3) we obtain the following three decision rule based on T = J1i(X  1'0)/": Do not reject H : /' ~ I'c if ITI < z(1 . Make no judgment as 1O whether 0 < 80 or 8 > B if I contains B . and 3. then the set S(x) of Vo where . then we select A as the better treatment. the twosided test can be regarded as the first step in the decision procedure where if H is not rejected. If J(x.Section 4. we can control the probabilities of a wrong selection by setting the 0' of the parent test or confidence interval. we usually want to know whether H : () > B or H : B < 80 .5.~Q). We explore the connection between tests of statistical hypotheses and confidence regions.. Therefore. I X n are i. The problem of deciding whether B = 80 .d.2.B .3. In Section 4. Summary. 0. v = va. vo) is a level 0' test of H . To see this consider first the case 8 > 00 . A and B. by using this kind of procedure in a comparison or selection problem. If H is rejected. when !t < !to.i. We can use the twosided tests and confidence intervals introduced in later chapters in similar fashions. Decide I' < 1'0 ifT < z(l. Decide I' > I'c ifT > z(1 . Because we do not know whether A or B is to be preferred. we decide whether this is because () is smaller or larger than Bo. However. Here we consider the simple solution suggested by the level (1 . and vice versa..Jii for !t.!a).~a).. o (4.Q) confidence interval X ± uz(1 .4 we considered the level (1 . .~Q) / . o o For instance. For this threedecision rule. it is natural to carry the comparison of A and B further by asking whether 8 < 0 or B > O. Then the wrong decision "'1 < !to" is made when T < z(1 .0') confidence interval!: 1. we make no claims of significance. the probability of falsely claiming significance of either 8 < 80 or 0 > 80 is bounded above by ~Q. twosided tests seem incomplete in the sense that if H : B = B is rejected in favor of o H : () i. and o Decide 8 > 80 if I is entirely to the right of B .
.n(X . where 80 is a specified value. we say that the bound with the smaller probability of being far below () is more accurate.1 (Examples 3.1 t! Example 4.Ilo)/er > z(1 .1) for all competitors are called uniformly most accurate as are upper confidence bounds satisfying (4. P.2 and 4. Definition 4. where X(1) < X(2) < .1 continued). twosided tests.if.~'(X) < B'] < P. the following is true.2) Lower confidence bounds e* satisfying (4. ~). less than 80 .3.. optimality of the tests translates into accuracy of the bounds.Q)er/. unifonnly most accurate in the N(/L.Q) con· fidence region for v.Q) LCB B if. which reveals that (4. We also give a connection between confidence intervals. I' > 1'0 rejects H when . 0"2) random variables with (72 known. j i I .6. for 0') UCB e is more accurate than a competitor e . < X(n) denotes the ordered Xl. B(n.2) for all competitors.. and only if.0') lower confidence bounds for (J.1. (X) = X .[B(X) < B']. We next show that for a certain notion of accuracy of confidence bounds. which is connected to the power of the associated onesided tests.Q) UCB for B.I (X) is /L more accurate than .z(1 . Note that 0* is a unifonnly most accurate level (1 . 0 I i.6.' .0) confidence region for va. I' i' .. e. .6. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution.. they are both very likely to fall below the true (J. Thus. for X E X C Rq.6.6. (72) model. If (J and (J* are two competing level (1 . . va) ~ 0 is a level (1 . . If S(x) is a level (1 .1) is nothing more than a comparison of (X)wer functions. A level (1.6...n.5. This is a consequence of the following theorem.Xn ) is a sample of a N (/L. Formally. Using Problem 4. v = Va when lIa E 8(:1') is a level 0: test. Ii n II r . in fact.248 Testing and Confidence Regions Chapter 4 0("'. we find that a competing lower confidence bound is 1'2(X) = X(k). random variable S. for any fixed B and all B' < B. P.a) LCB 0* of (J is said to be more accurate than a competing level (1 .. or larger than eo. Which lower bound is more accurate? It does tum out that .Q for a binomial. (4. Suppose X = (X" .Xn and k is defined by P(S > k) = 1 .2.IB' (X) I • • < B'] < P.Q).u2(X) and is.. . and only if. 4. (4. and the threedecision problem of deciding whether a parameter 1 l e is eo. .6. A level a test of H : /L = /La vs K . a level (1 any fixed B and all B' > B.a) LCB for if and only if 0* is a unifonnly most accurate level (1 . then the lest that accepts H .6. The dual lower confidence bound is 1'.1) Similarly.[B(X) < B']. But we also want the bounds to be close to (J.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account.
for any other level (1 . 1 X n be the times to failure of n pieces of equipment where we assume that the Xi are independent £(A) variables. Also.6. we find that j. then for all 0 where a+ = a. [O(X) > 001 < Pe. Boundsforthe Probability ofEarly Failure ofEquipment. We define q* to be a uniformly most accurate level (1 .Because O'(X. Pe[f < q(O')1 < Pe[q < q(O')] whenever q((}') < q((}). Most accurate upper confidence bounds are defined similarly. 00 ) ~ 0 if. However.•. such that for each (Jo the associated test whose critical function o*(x. and 0 otherwise.2 and the result follows.a) Lea for q( 0) if. Identify 00 with {}/ and 01 with (J in the statement of Definition 4. . Defined 0(x. a real parameter.z(1 a)a/ JTi is uniformly most accurate.2). they have the smallest expected "distance" to 0: Corollary 4.e>. Then!l* is uniformly most accurate at level (1 . (Jo) is given by o'(x.0') LeB for (J. (O'(X.a) upper confidence bound q* for q( A) = 1 . [O'(X) > 001.1 to Example 4.Oo) 1 ifO'(x) > 00 ootherwise is UMP level a for H : (J = eo versus K : (J > 00 . and only if. 0 If we apply the result and Example 4.1. Then O(X.5. O(x) < 00 .6. We can extend the notion of accuracy to confidence bounds for realvalued functions of an arbitrary parameter.Oo)) < Eo.4.a). Let XI. for e 1 > (Jo we must have Ee. For instance (see Problem 4.1. and only if. Uniformly most accurate (UMA) bounds turn out to have related nice properties. We want a unifonnly most accurate level (1 . Example 4. Let O(X) be any other (1 .7 for the proof).6 Uniformly Most Accurate Confidence Bounds 249 Theorem 4.a) LCB q. the robustness considerations of Section 3.a) lower confidence bound.a) lower confidence boundfor O. X(k) does have the advantage that we don't have to know a or even the shape of the density f of Xi to apply it.a) LeB 00. if a > 0. . 00 ) is a level a test for H : 0 = 00 versus K : 0 > 00 . Suppose ~'(X) is UMA level (1 . Proof Let 0 be a competing level (1 .Section 4. 00 )) or Pe.5 favor X{k) (see Example 3. the probability of early failure of a piece of equipment. . 00 ) by o(x.2.2.t o .1.(o(X.6. Let f)* be a level (1 .6. 00 ) is UMP level Q' for H : (J = eo versus K : 0 > 00 .6.
Of course.6. Confidence intervals obtained from twosided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes. .. we can restrict attention to certain reasonable subclasses of level (1 . a uniformly most accurate level (1 . These topics are discussed in Lehmann (1997).a). there exist level (1 . the intervals developed in Example 4. Therefore.a) lower confidence bounds to fall below any value ()' below the true B. because q is strictly increasing in A. the confidence region corresponding to this test is (0.a) by the property that Pe[T. *) is a uniformly most accurate level (1 . There are. l . it follows that q( >. B'. However. That is. 1962.3) or equivalently if A X2n(1a) o < 2"'~ 1 X 2 L_n= where X2n(1.a) iliat has uniformly minimum length among all such intervals. they are less likely ilian oilier level (1 . ~ q(()') ~ t] for every ().a) UCB for the probability of early failure.a) intervals iliat has minimum expected length for all B. subject to ilie requirement that the confidence level is (1 . the interval must be at least as likely to cover the true value of q( B) as any other value.. By using the duality between onesided tests and confidence bounds we show that confidence bounds based on UMP level a tests are uniformly most accurate (UMA) level (1 . Neyman defines unbiased confidence intervals of level (1 .. there does not exist a member of the class of level (1 . Thus.\ *) where>. Pratt (1961) showed that in many of the classical problems of estimation . Considerations of accuracy lead us to ask that..1 have this property. To find'\* we invert the family of UMP level a tests of H : A ~ AO versus K : A < AO. Summary. If we turn to the expected length Ee(t .a) is the (1. in general.) as a measure of precision.5.a) UCB for A and. the confidence interval be as short as possible.a) in the sense that. however. By Problem 4. is random and it can be shown that in most situations there is no confidence interval of level (1 . the situation is still unsatisfactory because.T.a) quantile of the X§n distribution.1.6. some large sample results in iliis direction (see Wilks.6. the UMP test accepts H if LXi < X2n(1. The situation wiili confidence intervals is more complicated.8. 0 Discussion We have only considered confidence bounds. in the case of lower bounds. the lengili t . ~ q(B) ~ t] ~ Pe[T.a)/ 2Ao i=1 n (4. 374376). pp. In particular.a) intervals for which members with uniformly smallest expected length exist.a) confidence intervals that have uniformly minimum expected lengili among all level (1. * is by Theorem 4.250 Testing and Confidence Regions Chapter 4 We begin by finding a uniformly most accurate level (1.a) UCB'\* for A.T. as in the estimation problem.a) unbiased confidence intervals.
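The uniformly most accurate bound λ* obtained by inverting (4.6.3), and the induced bound for the probability of early failure, are simple to evaluate. The sketch below is not part of the original text; it assumes the Python SciPy library, the failure times are hypothetical, and the fixed inspection time t₀ (our own notation) stands in for whatever early-failure horizon is of interest.

```python
# Sketch: UMA level (1 - alpha) upper confidence bound for lambda in the exponential
# model, lambda* = chi^2_{2n}(1 - alpha) / (2 * sum(x_i)), and the induced upper bound
# for the early-failure probability 1 - exp(-lambda * t0).  Inputs are illustrative.
import numpy as np
from scipy.stats import chi2

x = np.array([120.0, 340.0, 85.0, 410.0, 230.0])   # hypothetical failure times
n, alpha, t0 = len(x), 0.05, 100.0

lam_star = chi2.ppf(1 - alpha, 2 * n) / (2 * x.sum())   # UMA UCB for lambda
q_star = 1 - np.exp(-lam_star * t0)                      # UCB for P(failure before t0)
```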
and that fL rv N(fLo. a E e c R.7 Frequentist and Bayesian Formulations 251 4. (j]. Let II( 'Ix) denote the posterior probability distribution of given X = x..a)% of the intervals would contain the true unknown parameter value.'" .a) lower and upper credible bounds for if they respectively satisfy II(fl.a)% confidence interval is that if we repeated an experiment indefinitely each time computing a 100(1 . a Definition 4.::: 1 . the posterior distribution of fL given Xl.a) credible region for e if II( C k Ix) ~ 1  a . then fl.i.::: k} is called a level (1 .d.7.a.. and {j are level (1 . .Xn is N(liB. then 100(1 .7. Suppose that given fL. Example 4.a)% confidence interval.1. II(a:s: t9lx) . a~). N(fL. Then. with fLo and 75 known.1.A)2 nro ao 1 Ji = liB + Zla Vn (1 + . In the Bayesian formulation of Sections 1. . A consequence of this approach is that once a numerical interval has been computed from experimental data. then Ck = {a: 7r(lx) .2 and 1. We next give such an example. given a.)2 nro 1 . 75). it is natural to consider the collec tion of that is "most likely" under the distribution II(alx). what are called level (1 '.2. and that a has the prior probability distribution II.a lower and upper credible bounds for fL are  fL = fLB  ao Zla Vn (1 + . Suppose that. If 7r( alx) is unimodal.6. :s: alx) ~ 1 .7 FREQUENTIST AND BAYESIAN FORMULATIONS We have so far focused on the frequentist formulation of confidence bounds and intervals where the data X E X c Rq are random while the parameters are fixed but unknown.4.a. no probability statement can be attached to this interval. X has distribution P e. with known. then Ck will be an interval of the form [fl.7. Let 7r('lx) denote the density of agiven X = x.Xn are i. a a Turning to Bayesian credible intervals and regions. with ~ a6) a6 fLB = ~ nx+l/1 ~t""o n ~ + I ~ ~2 . Thus.a) credible bounds and intervals are subsets of the parameter space which are given probability at least (1 .12.Section 4. Definition 4..1. the interpretation of a 100(1". Instead. Xl. from Example 1.a) by the posterior distribution of the parameter given the data.aB = n ~ + 1 I ~ It follows that the level 1 .3.
In the Bayesian framework we define bounds and intervals.n(1+ ~)2 nTo 1 • Compared to the frequentist interval X ± Zl~oO/. .12. the interpretations are different: In the frequentist confidence interval. ~b) density where a > 0. where t = 2:(Xi . the level (1 .d. 01 Summary.3. .. whereas in the Bayesian credible interval.2 . it is desirable to give an interval [Y. Compared to the frequentist bound (n 1) 8 2 / Xnl (a) of Example 4. Suppose that given 0./10)2. the probability of coverage is computed with X = x fixed and () random with probability distribution II (B I X = x).2. X n .4. Xl. however.8 PREDICTION INTERVALS In Section 1. we may want an interval for the . t] in which the treatment is likely to take effect.a) credible interval is similar to the frequentist interval except it is pulled in the direction /10 of the prior mean and it is a little narrower.n. In addition to point prediction of Y. Y] that contains the unknown value Y with prescribed probability (1 . N(/10. Let A = a. the probability of coverage is computed with the data X random and B fixed. the Bayesian interval tends to the frequentist interval. Let xa+n(a) denote the ath quantile of the X~+n distribution. In the case of a normal prior w( B) and normal model p( X I B). We shall analyze Bayesian credible regions further in Chapter 5. the interpretations of the intervals are different. Similarly.02) where /10 is known. a doctor administering a treatment with delayed effect will give patients a time interval [1::.. (t + b)A has a X~+n distribution. then .2.a) by the posterior distribution of the parameter () given the data :c.a) lower credible bound for A and ° is a level (1 . 4. D Example 4. /1 +1with /1 ± = /111 ± Zl'" 2 ~ 00 .. . the center liB of the Bayesian interval is pulled in the direction of /10.Xn are Li.2 . where /10 is a prior guess of the value of /1.. given Xl.a) upper credible bound for 0. See Example 1.7.6. However. is shifted in the direction of the reciprocal b/ a of the mean of W(A).2.252 Testing and Confidence Regions Chapter 4 while the level (1 . = x a+n (a) / (t + b) is a level (1 .4 we discussed situations in which we want to predict the value of a random variable Y. Then.• . For instance.2 and suppose A has the gamma f( ~a. that determine subsets of the parameter space that are assigned probability at least (1 .a) credible interval is [/1.. called level (1 a) credible bounds and intervals.4 for sources of such prior guesses.a). Note that as TO > 00.. by Problem 1. b > are known parameters.
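Both credible calculations above reduce to posterior quantiles in conjugate families. The sketch below is not part of the original text; it assumes the Python SciPy library, and the data, prior constants μ₀, τ₀, a, b, and α are hypothetical.

```python
# Sketch: level (1 - alpha) credible interval for mu under a N(mu0, tau0^2) prior with
# sigma0 known, and a lower credible bound for the precision lambda = 1/sigma^2 under a
# Gamma(a/2, b/2) prior, where (t + b)*lambda | x ~ chi^2_{a+n}, t = sum((x_i - mu0)^2).
import numpy as np
from scipy.stats import norm, chi2

x = np.array([1.4, 0.2, 0.9, 1.8, 0.5])            # hypothetical data
n, sigma0, mu0, tau0, alpha = len(x), 1.0, 0.0, 2.0, 0.05

post_var = 1.0 / (n / sigma0 ** 2 + 1.0 / tau0 ** 2)          # posterior variance of mu
post_mean = post_var * (n * x.mean() / sigma0 ** 2 + mu0 / tau0 ** 2)
z = norm.ppf(1 - alpha / 2)
mu_credible = (post_mean - z * np.sqrt(post_var), post_mean + z * np.sqrt(post_var))

a, b = 2.0, 1.0                                     # hypothetical prior constants
t_stat = np.sum((x - mu0) ** 2)
lam_lower = chi2.ppf(alpha, a + n) / (t_stat + b)   # level (1 - alpha) lower credible bound
```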
8 Prediction Intervals 253 future GPA of a student or a future value of a portfolio.. we find the (1 . In fact. X is the optimal estimator. [n. "~).3. In Example 3. Also note that the prediction interval is much wider than the confidence interval (4.a) prediction interval Y = X± l (1 ~Ct) ::.. we found that in the class of unbiased estimators.Y) = 0.3 and independent of X n +l by assumption.4. Y] based on data X such that P(Y ::.Ct) prediction interval as an interval [Y. Note that Y Y = X .1. 1) distribution and is independent of V = (n . Tp(Y) !a) . by the definition of the (Student) t distribution in Section B. fnl (1  ~a) for Vn. . We want a prediction interval for Y = X n +1.. which has a X. Thus. ~ We next use the prediction error Y . Moreover. It follows that Z (Y) p  Y vnY l+lcr has a N(O.1) Note that Tp(Y) acts as a prediction interval pivot in the same way that T(p. The (Student) t Prediction Interval.4. The problem of finding prediction intervals is similar to finding confidence intervals using a pivot: Example 4.. Y) ?: 1 Ct. . and can c~nclu~e that in the class of prediction unbiased predictors. Y ::. as X '" N(p" cr 2 ). has the t distribution.X) is independent of X by Theorem B. .8..Y to construct a pivot that can be used to give a prediction interval. let Xl. the optimal MSPE predictor is Y = X.1. it can be shown using the methods of .4. which is assumed to be also N(p" cr 2 ) and independent of Xl.i.3.+ 18tn l (1  (4.Section 4.) acts as a confidence interval pivot in Example 4.l + 1]cr2 ). TnI. Let Y = Y(X) denote a predictor based on X = (Xl. in this case.1). ::.Xn .1)8 2 /cr 2. . .X n +1 '" N(O.. 8 2 = (n .1.8. the optimal estimator when it exists is also the optimal predictor.Xn be i.. .1)1 L~(Xi . We define a level (1.8. As in Example 4.4. where MSE denotes the estimation theory mean squared error. Then Y and Y are independent and the mean squared prediction error (MSPE) of Y is Note that Y can be regarded as both a predictor of Y and as an estimate of p" and when we do so. We define a predictor Y* to be prediction unbiased for Y if E(Y* . By solving tnl Y. AISP E(Y) = MSE(Y) + cr 2 .d. It follows that.l distribution.
v) j(v lI)dH(u. .8. Q(. I x) of Xn+l is defined as the conditional distribution of X n + l given x = (XI.1) tends to zero in probability at the rate n!.4...8.~o:).8.1) is approximately correct for large n even if the sample comes from a nonnormal distribution. < X(n) denote the order statistics of Xl. UCk ) = v)dH(u. Suppose XI.. that is. (4.i. where F is a continuous distribution function with positive density f on (a.8. " X n. with a sum replacing the integral in the discrete case..d. . Let X(1) < . E(UCi») thus. as X rv F. by Problem B.i. uniform.0:) in the limit as n ) 00 for samples from nonGaussian distributions. i = 1.4. Let U(1) < . 00 :s: a < b :s: 00.0:) Bayesian prediction interval for Y = X n + l if .9.'. Un ordered. . We want a prediction interval for Y = X n+l rv F.d. Example 4..2) P(X(j) It follows that [X(j) . By ProblemB.Un+l are i. This interval is a distributionfree prediction interval.. = i/(n+ 1).1) is not (1 ...8.254 Testing and Confidence Regions Chapter 4 Chapter 5 that the width of the confidence interval (4.d.. .. Set Ui = F(Xi ). The posterior predictive distribution Q(. U(O..j is a level 0: = (n + 1 .2.xn ). then. p(X I e). where X n+l is independent of the data Xl'. Bayesian Predictive Distributions Xl' .2). . .12.2. Now [Y B' YB ] is said to be a level (1 . Ul . . n + 1. whereas the level of the prediction interval (4.v) = E(U(k») .. 1).' X n . See Problem 4.5 for a simpler proof of (4. Moreover. Here Xl"'" Xn are observable and Xn+l is to be predicted. 0 We next give a prediction interval that is valid from samples from any population with a continuous distribution.2j)/(n + 1) prediction interval for X n +l .2. . I x) has in the continuous case density e. then P(U(j) j :s: Un+l :s: UCk ») P(u:S: Un+l:S: v I U(j) = U.. the confidence level of (4.Xn are i. . . whereas the width of the prediction interval tends to 2(]z (1. X(k)] with k :s: Xn+l :s: XCk») = n + l' kj = n + 1 . " X n + l are Suppose that () is random with () rv 1'( and that given () = i. < UCn) be Ul .E(U(j)) where H is the joint distribution of UCj) and U(k). b)..' •.i.
note that given X = t. (J"5 known.3. .0 and 0 are uncorrelated and. N(B. we find that the interval (4. Summary. A sufficient statistic based on the observables Xl. (4.9 4.  0) + 0 I X = t} = N(liB.8. This is the same as the probability limit of the frequentist interval (4. Xn+l . The Bayesian formulation is based on the posterior predictive distribution which is the conditional distribution of the unobservable variable given the observable variables. T2).B)B I 0 = B} = O. In the case of a normal sample of size n + 1 with only n variables observable. X n + l and 0. and X ~ Bas n + 00. (J"5).c{(Xn +1 where. For a sample of size n + 1 from a continuous distribution we show how the order statistics can be used to give a distributionfree prediction interval.1. (J" B n(J" B 2) It follows that a level (1 . Thus.8.8. However. and 7r( B) is N( TJo .I . from Example 4. 4. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1.d.3) we compute its probability limit under the assumption that Xl " '" Xn are i.1 LIKELIHOOD RATIO PROCEDURES Introduction Up to this point.9. Thus. 1983).0:) Bayesian prediction interval for Y is [YB ' Yt] with Yf = liB ± Z (1. Because (J"~ + 0.. Xn+l . and it is enough to derive the marginal distribution of Y = X n +l from the joint distribution of X. (J"5). where X and X n + l are independent. Note that E[(Xn+l . .3) To consider the frequentist properties of the Bayesian prediction interval (4.0)0] = E{E(Xn+l .8.~o:) V(J"5 + a~.1 where (Xi I B) '" N(B .c{Xn+l I X = t} = . (J"5 + a~) ~2 = (J" B n I (.2 o + I' T2 ~ P. Xn is T = X = n 1 2:~=1 Xi .9 Likelihood Ratio Procedures 255 Example 4.8.1) . 0 The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box. by Theorem BA. even in . B = (2 / T 2) TJo + ( 2 / (J"0 X. Consider Example 3.. we construct the Student t prediction interval for the unobservable variable.~o:) (J"o as n + 00.0 and 0 are still uncorrelated and independent. (n(J"~ /(J"5) + 1.7.3) converges in probability to B±z (1 .i. the results and examples in this chapter deal mostly with oneparameter problems in which it sometimes is possible to find optimal procedures.2. T2 known. independent. The Bayesian prediction interval is derived for the normal model with a normal prior.0:).Section 4. . To obtain the predictive distribution.
In particular cases. e) : e E 8 l }.x) = p(x.9. e) : () sup{p(x. Because h(A(X)) is equivalent to A(X) .Xn ) has density or frequency function p(x. e) : e E 8l}islargecomparedtosup{p(x. We start with a generalization of the NeymanPearson statistic p(x . 8 1 = {e l }./Lo)/a.its (1 ./Lo.2. ed/p(x. then the observed sample is best explained by some E 8 1 . 1.e): E 8 0 }. Calculate the MLE e of e. the basic steps are always the same. by the uniqueness of the NP test (Theorem 4. Xn is a sample from a N(/L.1 that if /Ll > /Lo. p(x. In the cases we shall consider. In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. . where T = fo(X . Also note that L(x) coincides with the optimal test statistic p(x . The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6..256 Testing and Confidence Regions Chapter 4 the case in which is onedimensional. if /Ll < /Lo. Find a function h that is strictly increasing on the range of A such that h(A(X)) has a simple form and a tabled distribution under H .. . Because 6a (x) I. e) is a continuous function of e and eo is of smaller dimension than 8 = 8 0 U 8 1 so that the likelihood ratio equals the test statistic I e e 1 A(X) = sup{p(x. 4. For instance. e) : e E 8 0 } e e e Tests that reject H for large values of L(x) are called likelihood ratio tests. sup{p(x. The test statistic we want to consider is the likelihood ratio given by L(x) = sup{p(x.'Pa(x).~a). and for large samples. . if Xl. likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6. . So. recall from Section 2. 2. To see this. 3. e) and we wish to test H : E 8 0 vs K : E 8 1 ./LO. there is no UMP test for testing H : /L = /Lo vs K : /L I. Form A(X) = p(x.. eo) when 8 0 = {eo}. Xn).2 that we think of the likelihood function L(e. ed / p(x.2. z(a ).. optimal procedures may not exist. . we specify the size a likelihood ratio test through the test statistic h(A(X) and . We are going to derive likelihood ratio tests in several important testing problems. B')/p(x . Calculate the MLE eo of e where e may vary only over 8 0 . e) as a measure of how well e "explains" the given sample x = (Xl. the MP level a test 6a (X) rejects H for T > z(l .2. . Although the calculations differ from case to case. there can be no UMP test of H : /L = /Lo vs H : /L I.1(c».a )th quantile obtained from the table. e) E 8} : eE 8 0} (4. eo). a 2 ) population with a 2 known. On the other hand. the MP level a test 'Pa(X) rejects H if T ::::. Suppose that X = (Xl .1) whose computation is often simple. eo). 1). . . ifsup{p(x. note that it follows from Example 4. To see that this is a plausible statistic. and conversely. . Note that in general A(X) = max(L(x).
c 0 0 It is often approximately true (see Chapter 6) that c( 8) is independent of 8.5.a) confidence region C(x) = {8 : p(x. [ supe p(X. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age. the experiment proceeds as follows.Section 4. Let Xi denote the difference between the treated and control responses for the ith pair.9. 8) > (8)] = eo a. 4. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo. An important class of situations for which this model may be appropriate occurs in matched pair experiments. 8n e (4. with probability ~) and given the treatment.24. We shall obtain likelihood ratio tests for hypotheses of the form H : 81 = 8lD .82 ) where 81 is the parameter of interest and 82 is a nuisance parameter. After the matching. mileage of cars with and without a certain ingredient or adjustment.8) . bounds. To see how the process works we refer to the specific examples in Sections 4.2.9. In that case. diet. we measure the response of a subject when under treatment and when not under treatment. 8) ~ [c( 8)t l sup p(x. and other factors.2 Tests for the Mean of a Normal DistributionMatched Pair Experiments Suppose Xl.Xn form a sample from a N(JL. and so on. In the ith pair one patient is picked at random (i. we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. p (X . we can invert the family of size a likelihood ratio tests of the point hypothesis H : 8 = 80 and obtain the level (1 . Studies in which subjects serve as their own control can also be thought of as matched pair experiments. the difference Xi has a distribution that is . sales performance before and after a course in salesmanship. Response measurements are taken on the treated and control members of each pair. The family of such level a likelihood ratio tests obtained by varying 8lD can also be inverted and yield confidence regions for 8 1 . and soon.2) where sUPe denotes sup over 8 E e and the critical constant c( 8) satisfies p. while the second patient serves as control and receives a placebo. 0'2) population in which both JL and 0'2 are unknown. C (x) is just the set of all 8 whose likelihood is on or above some fixed value dependent on the data. If the treatment and placebo have the same effect.'" . Here are some examples..9. An example is discussed in Section 4. which are composite because 82 can vary freely.9. This section includes situations in which 8 = (8 1 .9 Likelihood Ratio Procedures 257 We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions.e.9. In order to reduce differences due to the extraneous factors. We can regard twins as being matched pairs. That is. For instance. We are interested in expected differences in responses due to the treatment effect.
6. The problem of finding the supremum of p(x. the test can be modified into a threedecision rule that decides whether there is a significant positive or negative effect.258 Testing and Confidence Regions Chapter 4 symmetric about zero.0}.=1 fJ. = O. The test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the nonnality assumption is not satisfied. = fJ. =Ie fJ. To this we have added the nonnality assumption. a~). where we think of fJ. B) was solved in Example 3. We think of fJ.L ( X i . Let fJ. This corresponds to the alternative "The treatment has some effect. We found that sup{p(x.X) . which has the immediate solution ao ~2 = .o. n i=l is the maximum likelihood estimate of B.a)= ~ ~2 (1 ~ 1 ~ n i=l 2 ) . = fJ. as discussed in Section 4. as representing the treatment effect.0) 2  a2 n] = 0.0 as an established standard for an old treatment.3. Our null hypothesis of no treatment effect is then H : fJ. The likelihood equation is oa 2 a logp(x.B): B E 8} = p(x.0 is known and then evaluating p(x. B) at (fJ. a 2 ) : fJ. = E(X 1 ) denote the mean difference between the response of the treated and control subjects. 8 0 = {(fJ" Under our assumptions. we test H : fJ.0 )2 . = fJ. However. B) 1[1~ ="2 a 4 L(Xi . . B) : B E 8 0 } boils down to finding the maximum likelihood estimate a~ of a 2 when fJ. TwoSided Tests We begin by considering K : fJ.5. where B=(x. Form of the TwoSided Tests Let B = (fJ" a 2 ). 'for the purpose of referring to the duality between testing and confidence procedures.L Xi 1 ~( n i=l . e). Finding sup{p(x.L X i ' .fJ.0.0." However. good or bad.
suppose n = 25 and we want a = 0. A and B. therefore. where 8 = (M .11 we argue that P"[Tn Z t] is increasing in 8. . the size a critical value is t n 1 (1 .\(x) is equivalent to log . which can be established by expanding both sides. Because 8 2 function of ITn 1 where = 1 + (x .Mo) .~a) and we can use calculators or software that gives quantiles of the t distribution. and only if. However.\(x) logp(x. The test statistic .1. the testing problem H : M ::.Section 4.3. Mo versus K : M > Mo (with Mo = 0) is suggested. To simplify the rule further we use the following equation. and only if. Then we would reject H if.1)1 I:(Xi .\(x). (Mo.  {~[(log271') + (log&5)]~} Our test rule.MO)2/&2.8) logp(x. ITnl Z 2. 8 Therefore. &5 gives the maximum of p(x.Mo)/a. Therefore. Therefore. if we are comparing a treatment and control. 8) for 8 E 8 0. the size a likelihood ratio test for H : M Z Mo versus K : M < Mo rejects H if. A proof is sketched in Problem 4. Thus. the relevant question is whether the treatment creates an improvement. The statistic Tn is equivalent to the likelihood ratio statistic A for this problem.4. are considered to be equal before the experiment is performed. In Problem 4. the test that rejects H for Tn Z t nl(1 .1).2. Similarly. &0)) ~ 2 {~[(log271') + (log&2)]~} ~ log(&5/&2). Because Tn has a T distribution under H (see Example 4.x)2 = n&2/(n . For instance. Mo .05. or Table III. to find the critical value. OneSided Tests The twosided formulation is natural if two treatments.9.1).064. the likelihood ratio tests reject for large values of ITn I. is of size a for H : M ::.a).9. (n . which thus equals log . rejects H for iarge values of (&5/&2). &5/&2 (&5/&2) is monotone increasing Tn = y'n(x .9 Likelihood Ratio Procedures 259 By Theorem 2.
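The two-sided and one-sided likelihood ratio (t) tests described here can be applied directly to a set of matched-pair differences. The sketch below is not part of the original text; it assumes the Python SciPy library, and the differences are hypothetical.

```python
# Sketch: size-alpha two-sided and one-sided t tests based on Tn = sqrt(n)(xbar - mu0)/s.
# The matched-pair differences are hypothetical.
import numpy as np
from scipy.stats import t

d = np.array([0.3, -0.1, 0.8, 1.2, 0.4, 0.9, -0.2, 0.6])   # hypothetical differences
mu0, alpha = 0.0, 0.05
n, dbar, s = len(d), d.mean(), d.std(ddof=1)
Tn = np.sqrt(n) * (dbar - mu0) / s

reject_two_sided = abs(Tn) >= t.ppf(1 - alpha / 2, n - 1)   # H: mu = mu0 vs K: mu != mu0
reject_upper = Tn >= t.ppf(1 - alpha, n - 1)                # H: mu <= mu0 vs K: mu > mu0
p_value_two_sided = 2 * (1 - t.cdf(abs(Tn), n - 1))
```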
Power Functions

To discuss the power of these tests, we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter δ. This distribution, denoted by 𝒯ₖ,δ, is by definition the distribution of Z/√(V/k), where Z and V are independent and have N(δ, 1) and χ²ₖ distributions, respectively. The density of Z/√(V/k) is given in the problems. To derive the distribution of Tₙ, note that from Section B.3 we know that √n(X̄ − μ)/σ and (n − 1)s²/σ² are independent and that (n − 1)s²/σ² has a χ²ₙ₋₁ distribution. Because E[√n(X̄ − μ₀)/σ] = √n(μ − μ₀)/σ, the ratio √n(X̄ − μ₀)/σ has a N(δ, 1) distribution with δ = √n(μ − μ₀)/σ. Writing

\[
T_n = \frac{\sqrt{n}(\bar{X} - \mu_0)/\sigma}{\sqrt{[(n-1)s^2/\sigma^2]/(n-1)}},
\]

we see that Tₙ has a 𝒯ₙ₋₁,δ distribution. Note that the distribution of Tₙ depends on θ = (μ, σ²) only through δ, and the power can be obtained from computer software or tables of the noncentral t distribution. The power functions of the one-sided tests are monotone in δ (see the problems), just as the power functions of the corresponding tests of Example 4.4.1 are monotone in √n μ/σ.

If we consider alternatives of the form (μ − μ₀) ≥ Δ, we can no longer control both probabilities of error by choosing the sample size. The reason is that, whatever be n, by making σ sufficiently large we can force the noncentrality parameter δ = √n(μ − μ₀)/σ as close to 0 as we please and, thus, bring the power arbitrarily close to α. We have met similar difficulties when discussing confidence intervals. We can control both probabilities of error by selecting the sample size n large, provided we consider alternatives of the form |δ| ≥ δ₁ > 0 in the two-sided case and δ ≥ δ₁ or δ ≤ −δ₁ in the one-sided cases; computer software will compute the required n. A Stein solution, in which we estimate σ from a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with |μ − μ₀| ≥ Δ, is possible (Lehmann, 1997, p. 260, Problem 17). With this solution, however, we may be required to take more observations than we can afford on the second stage.

Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

\[
C(X) = \Big\{\mu : \bar{X} - t_{n-1}(1 - \tfrac{1}{2}\alpha)\,\frac{s}{\sqrt{n}}
\le \mu \le \bar{X} + t_{n-1}(1 - \tfrac{1}{2}\alpha)\,\frac{s}{\sqrt{n}}\Big\}.
\]

We recognize C(X) as the confidence interval of Example 4.4.1. Similarly, the one-sided tests lead to the corresponding lower and upper confidence bounds of Section 4.4.
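Power and sample-size computations of the kind just described can be carried out with the noncentral t distribution. The sketch below is not from the text; the standardized alternative (μ − μ₀)/σ = 0.5, the level 0.05, and the target power 0.90 are arbitrary illustrative choices.

```python
# Power of the one-sided size-alpha t test of H: mu <= mu0, computed from the
# noncentral t distribution T_{n-1, delta} with delta = sqrt(n)*(mu - mu0)/sigma.
import numpy as np
from scipy import stats

def t_test_power(n, effect, alpha=0.05):
    """Power against the standardized alternative effect = (mu - mu0)/sigma."""
    delta = np.sqrt(n) * effect
    crit = stats.t.ppf(1 - alpha, df=n - 1)
    return stats.nct.sf(crit, df=n - 1, nc=delta)   # P(T_n >= crit) under the alternative

for n in (10, 20, 30, 50):
    print(n, round(t_test_power(n, 0.5), 3))

# Smallest n with power >= 0.90 against (mu - mu0)/sigma >= 0.5
n = 2
while t_test_power(n, 0.5) < 0.90:
    n += 1
print("required n:", n)
```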
Data Example

As an illustration of these procedures, consider the following data due to Cushny and Peebles (see Fisher, 1958, p. 121) giving the additional hours of sleep gained by 10 patients using drugs A and B, together with the difference B − A. This is a matched pair experiment with each subject serving as its own control.

Patient i    1     2     3     4     5    6    7    8    9    10
A          0.7  -1.6  -0.2  -1.2  -0.1  3.4  3.7  0.8  0.0  2.0
B          1.9   0.8   1.1   0.1  -0.1  4.4  5.5  1.6  4.6  3.4
B - A      1.2   2.4   1.3   1.3   0.0  1.0  1.8  0.8  4.6  1.4

If we denote the differences B − A as x's, then x̄ = 1.58, s² = 1.513, and |Tₙ| = 4.06. Because t₉(0.995) = 3.25, we conclude at the 1% level of significance that the two drugs are significantly different. The 0.99 confidence interval for the mean difference μ between treatments is [0.32, 2.84]. It suggests that not only are the drugs different but in fact B is better than A, because no hypothesis μ = μ′ < 0 is accepted at this level.
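The calculations in this example are easy to reproduce. The sketch below is not part of the text and uses the differences B − A as listed in the table above.

```python
# One-sample t analysis of the sleep-gain differences B - A.
import numpy as np
from scipy import stats

d = np.array([1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4])   # B - A
n = len(d)
xbar, s2 = d.mean(), d.var(ddof=1)
T_n = np.sqrt(n) * xbar / np.sqrt(s2)
print(round(xbar, 2), round(s2, 3), round(T_n, 2))                  # 1.58, 1.513, 4.06

half = stats.t.ppf(0.995, df=n - 1) * np.sqrt(s2 / n)               # 0.99 interval half-width
print(round(xbar - half, 2), round(xbar + half, 2))                 # approx [0.32, 2.84]

print(stats.ttest_1samp(d, popmean=0.0))                            # same test via scipy
```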
4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations

We often want to compare two populations with distributions F and G on the basis of two independent samples X₁, ..., Xₙ₁ and Y₁, ..., Yₙ₂, one from each population. For instance, suppose we wanted to test the effect of a certain drug on some biological variable (e.g., blood pressure). Then X₁, ..., Xₙ₁ could be blood pressure measurements on a sample of patients given a placebo, while Y₁, ..., Yₙ₂ are the measurements on a sample given the drug. In the control versus treatment example, this is the problem of determining whether the treatment has any effect. For quantitative measurements such as blood pressure, height, weight, length, volume, temperature, and so forth, it is usually assumed that X₁, ..., Xₙ₁ and Y₁, ..., Yₙ₂ are independent samples from N(μ₁, σ²) and N(μ₂, σ²) populations, respectively. The preceding assumptions were discussed in Chapter 1; a discussion of the consequences of their violation will be postponed to Chapters 5 and 6.

Tests

We first consider the problem of testing H : μ₁ = μ₂ versus K : μ₁ ≠ μ₂. Let θ = (μ₁, μ₂, σ²). Then Θ₀ = {θ : μ₁ = μ₂} and Θ₁ = {θ : μ₁ ≠ μ₂}. The log of the likelihood of (X, Y) = (X₁, ..., Xₙ₁, Y₁, ..., Yₙ₂) is

\[
-\frac{n}{2}\log(2\pi\sigma^2)
- \frac{1}{2\sigma^2}\Big[\sum_{i=1}^{n_1}(x_i - \mu_1)^2 + \sum_{j=1}^{n_2}(y_j - \mu_2)^2\Big],
\]

where n = n₁ + n₂. In the problems it is shown that the likelihood is maximized over Θ by the maximum likelihood estimate θ̂ = (X̄, Ȳ, σ̂²), where

\[
\hat{\sigma}^2 = \frac{1}{n}\Big[\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2\Big].
\]

When μ₁ = μ₂ = μ, our model reduces to the one-sample model of Section 4.9.2, and the maximum of p over Θ₀ is obtained for θ = (μ̂, μ̂, σ̂₀²), where

\[
\hat{\mu} = \frac{n_1\bar{X} + n_2\bar{Y}}{n}, \qquad
\hat{\sigma}_0^2 = \frac{1}{n}\Big[\sum_{i=1}^{n_1}(X_i - \hat{\mu})^2 + \sum_{j=1}^{n_2}(Y_j - \hat{\mu})^2\Big].
\]

If we use the identities

\[
\sum_{i=1}^{n_1}(X_i - \hat{\mu})^2 = \sum_{i=1}^{n_1}(X_i - \bar{X})^2 + n_1(\bar{X} - \hat{\mu})^2, \qquad
\sum_{j=1}^{n_2}(Y_j - \hat{\mu})^2 = \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2 + n_2(\bar{Y} - \hat{\mu})^2,
\]

obtained by writing [Xᵢ − μ̂] = [(Xᵢ − X̄) + (X̄ − μ̂)] and expanding, we find that the log likelihood ratio statistic (n/2) log(σ̂₀²/σ̂²) is equivalent to the test statistic |T|, where

\[
T = \sqrt{\frac{n_1 n_2}{n}}\,\frac{\bar{Y} - \bar{X}}{s}, \qquad
s^2 = \frac{1}{n-2}\Big[\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2\Big].
\]

To complete the specification of the size α likelihood ratio test we show that T has a 𝒯ₙ₋₂ distribution when μ₁ = μ₂. By the results of Section B.3, the four statistics X̄/σ, Ȳ/σ, (1/σ²)Σᵢ(Xᵢ − X̄)², and (1/σ²)Σⱼ(Yⱼ − Ȳ)²
are independent and distributed as N(μ₁/σ, 1/n₁), N(μ₂/σ, 1/n₂), χ²ₙ₁₋₁, and χ²ₙ₂₋₁, respectively. We conclude from this remark and the additive property of the χ² distribution that (n − 2)s²/σ² has a χ²ₙ₋₂ distribution and is independent of

\[
\sqrt{\frac{n_1 n_2}{n}}\,\frac{\bar{Y} - \bar{X}}{\sigma},
\]

which has a N(0, 1) distribution under H. Therefore, by definition, T has a 𝒯ₙ₋₂ distribution under H, and the two-sample t test rejects if, and only if,

\[
|T| \ge t_{n-2}(1 - \tfrac{1}{2}\alpha).
\]

As usual, corresponding to the two-sided test there are two one-sided tests, with critical regions

\[
T \ge t_{n-2}(1 - \alpha) \ \text{ for } H : \mu_2 \le \mu_1, \qquad
T \le -t_{n-2}(1 - \alpha) \ \text{ for } H : \mu_1 \le \mu_2.
\]

We can show that these tests are likelihood ratio tests for these hypotheses, and it is also true that these procedures are of size α for their respective hypotheses. As in the one-sample case, this follows from the fact that, if μ₁ ≠ μ₂, T has a noncentral t distribution with noncentrality parameter √(n₁n₂/n)(μ₂ − μ₁)/σ.

Confidence Intervals

To obtain confidence intervals for μ₂ − μ₁ we naturally look at likelihood ratio tests for the family of testing problems H : μ₂ − μ₁ = Δ versus K : μ₂ − μ₁ ≠ Δ. As for the special case Δ = 0, we find a simple equivalent statistic |T(Δ)|, where

\[
T(\Delta) = \sqrt{\frac{n_1 n_2}{n}}\,\frac{\bar{Y} - \bar{X} - \Delta}{s}.
\]

If μ₂ − μ₁ = Δ, then T(Δ) has a 𝒯ₙ₋₂ distribution, and inversion of the tests leads to the interval

\[
\bar{Y} - \bar{X} \pm t_{n-2}(1 - \tfrac{1}{2}\alpha)\, s\,\sqrt{\frac{n}{n_1 n_2}} \qquad (4.9.3)
\]

for μ₂ − μ₁. Similarly, the one-sided tests lead to the upper and lower endpoints of the interval as level 1 − ½α likelihood confidence bounds.
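A quick way to convince oneself of the null distribution claim is simulation. The sketch below is not from the text; the sample sizes, common variance, and number of replications are arbitrary.

```python
# Monte Carlo check that T = sqrt(n1*n2/n)*(Ybar - Xbar)/s has the T_{n-2}
# distribution when mu1 = mu2, by estimating the size of the nominal 0.05 test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2 = 8, 12
n = n1 + n2

def two_sample_T(x, y):
    s2 = (np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)) / (n - 2)
    return np.sqrt(n1 * n2 / n) * (y.mean() - x.mean()) / np.sqrt(s2)

T_vals = np.array([two_sample_T(rng.normal(0, 2, n1), rng.normal(0, 2, n2))
                   for _ in range(20000)])                  # mu1 = mu2, sigma = 2
crit = stats.t.ppf(0.975, df=n - 2)
print("estimated size:", np.mean(np.abs(T_vals) >= crit))   # should be close to 0.05
```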
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results, in terms of logarithms, were (from Hald, 1952, p. 472)

x (machine 1):  1.845   1.790   2.042
y (machine 2):  1.583   1.627   1.282

We test the hypothesis H of no difference in expected log permeability; H is rejected if, and only if, |T| ≥ tₙ₋₂(1 − ½α). Here ȳ − x̄ = −0.395, s² = 0.0264, and T = −2.977. Because t₄(0.975) = 2.776, we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is −0.395 ± 0.368. On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level (1 − α) confidence interval has probability at most ½α of making the wrong selection.
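The numbers quoted above can be verified directly. The following short check is not from the text and uses the pooled two-sample formulas with the sign convention ȳ − x̄.

```python
# Reproducing ybar - xbar, s^2, T, and the 0.95 interval for the permeability data.
import numpy as np
from scipy import stats

x = np.array([1.845, 1.790, 2.042])   # machine 1, log permeability
y = np.array([1.583, 1.627, 1.282])   # machine 2, log permeability
n1, n2 = len(x), len(y)
n = n1 + n2

s2 = (np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)) / (n - 2)
T = np.sqrt(n1 * n2 / n) * (y.mean() - x.mean()) / np.sqrt(s2)
half = stats.t.ppf(0.975, df=n - 2) * np.sqrt(s2 * n / (n1 * n2))
print(round(y.mean() - x.mean(), 3), round(s2, 4), round(T, 3), round(half, 3))
# approximately -0.395, 0.0264, -2.977, 0.368
```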
4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the X's and Y's have different variances. For instance, a treatment that increases mean response may also increase the variance of the responses. As a first step we may still want to compare mean responses for the X and Y populations. If normality holds, we are led to a model where X₁, ..., Xₙ₁ and Y₁, ..., Yₙ₂ are two independent N(μ₁, σ₁²) and N(μ₂, σ₂²) samples, respectively. This is the Behrens-Fisher problem.

Suppose first that σ₁² and σ₂² are known. The log likelihood, except for an additive constant, is

\[
-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n_1}(x_i - \mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{n_2}(y_j - \mu_2)^2.
\]

The MLEs of μ₁ and μ₂ are, respectively, μ̂₁ = x̄ and μ̂₂ = ȳ for (μ₁, μ₂) ∈ R × R. When μ₁ = μ₂ = μ, setting the derivative of the log likelihood equal to zero yields the MLE

\[
\hat{\mu} = \Big(\frac{n_1}{\sigma_1^2}\bar{x} + \frac{n_2}{\sigma_2^2}\bar{y}\Big)\Big/\Big(\frac{n_1}{\sigma_1^2} + \frac{n_2}{\sigma_2^2}\Big).
\]

By writing

\[
\sum_{i=1}^{n_1}(x_i - \hat{\mu})^2 = \sum_{i=1}^{n_1}(x_i - \bar{x})^2 + n_1(\hat{\mu} - \bar{x})^2, \qquad
\sum_{j=1}^{n_2}(y_j - \hat{\mu})^2 = \sum_{j=1}^{n_2}(y_j - \bar{y})^2 + n_2(\hat{\mu} - \bar{y})^2,
\]

we obtain

\[
\lambda(x, y) = \exp\Big\{\frac{n_1}{2\sigma_1^2}(\hat{\mu} - \bar{x})^2 + \frac{n_2}{2\sigma_2^2}(\hat{\mu} - \bar{y})^2\Big\}.
\]

Next we compute

\[
\hat{\mu} - \bar{x} = \frac{(n_2/\sigma_2^2)(\bar{y} - \bar{x})}{n_1/\sigma_1^2 + n_2/\sigma_2^2}, \qquad
\hat{\mu} - \bar{y} = \frac{(n_1/\sigma_1^2)(\bar{x} - \bar{y})}{n_1/\sigma_1^2 + n_2/\sigma_2^2}.
\]

Substituting these expressions into λ(x, y) gives log λ(x, y) = (ȳ − x̄)²/(2σ_D²). It follows that the likelihood ratio test is equivalent to the statistic |D|/σ_D, where D = Ȳ − X̄ and σ_D² is the variance of D, that is,

\[
\sigma_D^2 = \mathrm{Var}(\bar{X}) + \mathrm{Var}(\bar{Y}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.
\]

It is natural to try to use D/σ_D as a test statistic for the one-sided hypothesis H : μ₂ ≤ μ₁, and more generally (D − Δ)/σ_D, where Δ = μ₂ − μ₁, to generate confidence procedures; if μ₂ − μ₁ = Δ, then D/σ_D has a N(Δ/σ_D, 1) distribution. Because σ_D² is unknown when σ₁² and σ₂² are unknown, it must be estimated. An unbiased estimate is

\[
s_D^2 = \frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2},
\]

where s_X² and s_Y² are the two sample variances. For large n₁ and n₂, (D − Δ)/s_D has approximately a standard normal distribution by Slutsky's theorem and the central limit theorem. Unfortunately, for small and moderate n₁ and n₂, the distribution of (D − Δ)/s_D depends on σ₁²/σ₂² for fixed n₁, n₂. An approximation to the distribution of (D − Δ)/s_D due to Welch (1949) works well; it is described next.
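In practice the resulting test is computed as in the sketch below, which is not from the text; the data are simulated, and the degrees-of-freedom formula used is the Welch value given in the next paragraph, the same approximation that standard software applies.

```python
# Welch's approximate t test of H: mu1 = mu2 when the variances may differ.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=10)                    # hypothetical N(mu1, sigma1^2) sample
y = rng.normal(0.5, 3.0, size=25)                    # hypothetical N(mu2, sigma2^2) sample
n1, n2 = len(x), len(y)

sD2 = x.var(ddof=1) / n1 + y.var(ddof=1) / n2        # unbiased estimate of Var(Ybar - Xbar)
D = y.mean() - x.mean()
c = (x.var(ddof=1) / n1) / sD2
k = 1.0 / (c ** 2 / (n1 - 1) + (1 - c) ** 2 / (n2 - 1))   # Welch's approximate df
T_welch = D / np.sqrt(sD2)
pval = 2 * stats.t.sf(abs(T_welch), df=k)            # software handles non-integer df
print(round(k, 2), round(T_welch, 3), round(pval, 4))

# scipy's unequal-variance option uses the same approximation and should agree
print(stats.ttest_ind(y, x, equal_var=False))
```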
Let c = (s_X²/n₁)/s_D². Then Welch's approximation to the distribution of (D − Δ)/s_D is 𝒯ₖ, the t distribution with

\[
k = \Big[\frac{c^2}{n_1 - 1} + \frac{(1-c)^2}{n_2 - 1}\Big]^{-1}
\]

degrees of freedom. When k is not an integer, the critical value is obtained by linear interpolation in the t tables or by using computer software. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the Behrens-Fisher problem. Wang (1971) has shown the approximation to be very good for α = 0.05 and α = 0.01, the maximum error in size being bounded by 0.003. Note that Welch's solution works whether the variances are equal or not. The LR procedure derived in Section 4.9.3, which works well if the variances are equal or n₁ = n₂, can unfortunately be very misleading if σ₁² ≠ σ₂² and n₁ ≠ n₂; see Chapter 5.
4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions

If n subjects (persons, machines, fields, mice, etc.) are sampled from a population and two numerical characteristics are measured on each case, then we end up with a bivariate random sample (X₁, Y₁), ..., (Xₙ, Yₙ). The question "Are two random variables X and Y independent?" arises in many statistical studies. Some familiar examples are: X = test score on an English exam, Y = score on a mathematics exam; X = average cigarette consumption per day in grams, Y = age at death; X = percentage of fat in diet, Y = cholesterol level in blood; X = weight, Y = blood pressure. Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X, Y) have a joint bivariate normal distribution, N(μ₁, μ₂, σ₁², σ₂², ρ), with σ₁² > 0, σ₂² > 0. If we have a sample as before and assume the bivariate normal model for (X, Y), our problem of testing independence becomes that of testing H : ρ = 0.

Testing Independence, Confidence Intervals for ρ: Two-Sided Tests

The unrestricted maximum likelihood estimate θ̂ = (x̄, ȳ, σ̂₁², σ̂₂², ρ̂) was derived in the problems of Chapter 2:

\[
\hat{\sigma}_1^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2, \qquad
\hat{\sigma}_2^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2, \qquad
\hat{\rho} = \Big[\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})\Big]\Big/ n\hat{\sigma}_1\hat{\sigma}_2.
\]

Here ρ̂ is called the sample correlation coefficient and satisfies −1 ≤ ρ̂ ≤ 1. When ρ = 0 we have two independent samples, and θ̂₀ can be obtained by separately maximizing the likelihood of X₁, ..., Xₙ and that of Y₁, ..., Yₙ. We have θ̂₀ = (x̄, ȳ, σ̂₁², σ̂₂², 0), and the log of the likelihood ratio statistic becomes

\[
\log \lambda(x) = -\frac{n}{2}\log(1 - \hat{\rho}^2). \qquad (4.9.4)
\]

Thus, log λ(x) is an increasing function of ρ̂², and the likelihood ratio tests reject H for large values of |ρ̂|.

To obtain critical values we need the distribution of ρ̂ or an equivalent statistic under H. Now, the distribution of ρ̂ depends on θ only through ρ: if we set Uᵢ = (Xᵢ − μ₁)/σ₁ and Vᵢ = (Yᵢ − μ₂)/σ₂, then (U₁, V₁), ..., (Uₙ, Vₙ) is a sample from the N(0, 0, 1, 1, ρ) distribution, and ρ̂ computed from the (Uᵢ, Vᵢ) equals ρ̂ computed from the (Xᵢ, Yᵢ). If ρ = 0, then (see the problems of Appendix B)

\[
T_n = \frac{\sqrt{n-2}\;\hat{\rho}}{\sqrt{1 - \hat{\rho}^2}} \qquad (4.9.5)
\]

has a 𝒯ₙ₋₂ distribution. Because |Tₙ| is an increasing function of |ρ̂|, the two-sided likelihood ratio tests can be based on |Tₙ|, and the critical values can be obtained from Table II. There is no simple form for the distribution of ρ̂ (or Tₙ) when ρ ≠ 0; however, the distribution of ρ̂ is available in computer packages, and a normal approximation is given in Chapter 5. Qualitatively, the power function of the LR test is symmetric about ρ = 0 and increases continuously from α to 1 as |ρ| goes from 0 to 1. Therefore, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
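The test based on Tₙ, together with an approximate interval for ρ of the kind discussed in the next paragraphs, can be computed as in the sketch below. It is not from the text; the simulated data and the use of Fisher's z-transformation as the large-sample approximation are assumptions of the illustration.

```python
# Sample correlation, the equivalent statistic T_n = sqrt(n-2)*rho_hat/sqrt(1-rho_hat^2),
# and an approximate confidence interval for rho via Fisher's z-transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, rho = 30, 0.4
xy = rng.multivariate_normal([0, 0], [[1.0, rho], [rho, 1.0]], size=n)
x, y = xy[:, 0], xy[:, 1]

rho_hat = np.corrcoef(x, y)[0, 1]
T_n = np.sqrt(n - 2) * rho_hat / np.sqrt(1 - rho_hat ** 2)
pval = 2 * stats.t.sf(abs(T_n), df=n - 2)            # two-sided, T_{n-2} under H: rho = 0
print(round(rho_hat, 3), round(T_n, 3), round(pval, 4))
print(stats.pearsonr(x, y))                          # same two-sided p-value

def fisher_z_interval(r, n, level=0.95):
    """Approximate equal-tailed interval: arctanh(rho_hat) ~ N(arctanh(rho), 1/(n-3))."""
    half = stats.norm.ppf(0.5 + level / 2) / np.sqrt(n - 3)
    return np.tanh(np.arctanh(r) - half), np.tanh(np.arctanh(r) + half)

print(fisher_z_interval(rho_hat, n))
```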
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing the fat in a diet significantly increases the cholesterol level in the blood, we would test H : ρ = 0 versus K : ρ > 0, or H : ρ ≤ 0 versus K : ρ > 0. It can be shown that ρ̂ is equivalent to the likelihood ratio statistic for testing H : ρ ≤ 0 versus K : ρ > 0, and similarly that −ρ̂ corresponds to the likelihood ratio statistic for H : ρ ≥ 0 versus K : ρ < 0. Because P_ρ[ρ̂ ≥ c] is an increasing function of ρ for fixed c (see the problems), we obtain size α tests for each of these hypotheses by setting the critical value so that the probability of type I error is α when ρ = 0. The power functions of these tests are monotone.

Confidence Bounds and Intervals

Usually testing independence is not enough, and we want bounds and intervals for ρ giving us an indication of what departure from independence is present. To obtain lower confidence bounds, we can start by constructing size α likelihood ratio tests of H : ρ = ρ₀ versus K : ρ > ρ₀. These tests can be shown to be of the form "Accept if, and only if, ρ̂ ≤ c(ρ₀)," where P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = 1 − α. We obtain c(ρ) either from computer software or from the approximation of Chapter 5. Because c can be shown to be monotone increasing in ρ, inversion of this family of tests leads to level 1 − α lower confidence bounds. We can similarly obtain 1 − α upper confidence bounds and, by putting two level 1 − ½α bounds together, we obtain a commonly used confidence interval for ρ. These intervals do not correspond to the inversion of the size α LR tests of H : ρ = ρ₀ versus K : ρ ≠ ρ₀, but rather of the "equal-tailed" test that rejects if, and only if, ρ̂ ≥ d(ρ₀) or ρ̂ ≤ c(ρ₀), where P_{ρ₀}[ρ̂ ≥ d(ρ₀)] = P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = ½α. However, for large n the equal-tailed and LR confidence intervals approximately coincide.

Data Example

As an illustration, consider a bivariate sample of weights xᵢ of young rats at a certain age together with the weight increase yᵢ during the following week. We want to know whether there is a correlation between the initial weights and the weight increases, and we formulate the hypothesis H : ρ = 0. For these data, ρ̂ = 0.18 and Tₙ = 0.48. Thus, by using the two-sided test and referring to the tables, we find that there is no evidence of correlation: the p-value is well above conventional significance levels.

Summary

The likelihood ratio test statistic λ is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. We find the likelihood ratio tests and associated confidence procedures for four classical normal models:
. .X)I SD. ( 2) and we test the hypothesis that the mean difference J.XI.Xn denote the times in days to failure of n similar pieces of equipment.Ll' an and N(J. (2) Twosample experiments in which two independent samples are modeled as coming from N(J. (4) Bivariate sampling experiments in which we have two measurements X and Y on each case in a sample of n cases.Xn ) is an £(A) sample.L2' ( 2) populations. where p is the sample correlation coefficient. . e).. . .L :::. We test the hypothesis that X and Yare indepen dent and find that the likelihood ratio test is equivalent to the test based on IPl. 4. J. Approximate critical values are obtained using Welch's t distribution approximation.L. we use (Y . what is the pvalue? 2. (d) How large should n be so that the 6c specified in (b) has power 0. where SD is an estimate of the standard deviation of D = Y .. . .Section 4. . ~ versus K : e> ~.L2' a~) populations. X n ) and let 1 if Mn 2:: c = of e. respectively.10 Problems and Complements 269 (1) Matched pair experiments in which differences are modeled as N(J. .X. The likelihood ratio test is equivalent to the onesample (Student) t test.1 1.. . Let Mn = max(X 1 . We also find that the likelihood ratio statistic is equivalent to a t statistic with n . 0 otherwise. Let Xl.05? e :::.98 for (e) If in a sample of size n = 20. When a? and a~ are known.. X n are independently and identically distributed according to the uniform distribution U(O.L is zero. Mn e = ~? = 0.10 PROBLEMS AND COMPLEMENTS Problems for Section 4. (a) Compute the power function of 6c and show that it is a monotone increasing function (b) In testing H : exactly 0. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the twosample (Student) t test. When a? and a~ are unknown.48. the likelihood ratio test is equivalent to the test based on IY .Lo.. what choice of c would make 6c have size (c) Draw a rough graph of the power function of 6c specified in (b) when n = 20. respectively. Suppose that Xl. Assume the model where X = (Xl. . Consider the hypothesis H that the mean life II A = J. . (3) Twosample experiments in which two independent samples are modeled as coming from N(J.Ll' ( 2) and N(J.2 degrees of freedom.
where x(I .O) = (xIO')exp{x'/20'}.3). " i I 5. Let Xl. r. .. i I (b) Check that your test statistic has greater expected value under K than under H. .a)th quantile of the X~n distribution. Let o:(Tj ) denote the pvalue for 1j.r.a) is the (1 . give a normal approximation to the significance probability. (b) Show that the power function of your test is increasing in 8. . . (0) Show that..1. 1nz t Z i .<»1 VTi)J .270 Testing and Confidence Regions Chapter 4 (a) Use the result of Problem 8. 6. X> 0. then the pvalue <>(T) has a unifonn.. .. 1 using a I . (c) Give an approximate expression for the critical value if n is large and B not too close to 0 or 00. Show that if H is simple and the test statistic T has a c. I . .3. Hint: Approximate the critical region by IX > 1"0(1 + z(1 .4. j = 1. T ~ Hint: See Problem B.ontinuous distribution. . Tr are independent test statistics for the same simple H and that each Tj has a continuous distribution.Xn be a sample from a population with the Rayleigh density . Hint: See Problem B.. 1). f(x. j=l . (d) The following are days until failure of air monitors at a nuclear plant. is a size Q test.~llog <>(Tj ) has a X~r distribution. (Use the central limit theorem. distribution..0 > 0.. If 110 = 25. Assume that F o and F are continuous.2 L:.12..) 4. Let Xl.2. . under H. (a) Use the MLE X of 0 to construct a level <> test for H : 0 < 00 versus K : 0 > 00 .. Hint: Use the central limit theorem for the critical value. . . (b) Give an expression of the power in tenns of the X~n distribution. Draw a graph of the approximate power function.. 7. . X n be a 1'(0) sample. U(O.. (cj Use the central limit theorem to show that <l?[ (I"oz( <» II") + VTi(1"  1"0) I 1"] is an approximation to the power of the test in part (a). .05? 3. . Days until failure: 315040343237342316514150274627103037 Is H rejected at level <> ~ 0. j . Suppose that T 1 .. .3. Establish (4.4 to show that the test with critical region IX > I"ox( 1  <» 12n].. 0: (a) Construct a test of H : B = 1 versus K : B > 1 with approximate size complete sufficient statistic for this model.
1) and F(x) ~ (1 +exp( _"/7))1 where 7 = "13/. .2. 1) to (0.... .o sup. (b) When 'Ij.T(B+l) denote T. . X n be d.1.o T¢.Fo(x)1 > k o ) for a ~ 0. (a) Show that the statistic Tn of Example 4.1. Let T(l).p(Fo(x))lF(x) .p(F(x)IF(x) . and let a > O. = (Xi .Fo(x)IOdF(x).12(b).l) (Tn). 10..u. b > 0. (In practice these can be obtained by drawing B independent samples X(I). (a) For each of these statistics show that the distribution under H does not depend on F o.5. That is.5. (b) Use part (a) to conclude that LN(". 80 and x = 0. (a) Show that the power PF[D n > kal of the Kolmogorov test is bounded below by ~ sup Fpl!F(x).p(Fo(x))IF(x) . Let X I. if X. is chosen 00 so that 00 x 2 dF(x) = 1. . then T. 9. let . 1) variable on the computer and set X ~ F O 1 (U) as in Problem B.Fo(x)IO x ~ ~ sup.) Next let T(I).a)/b.T(X(l)). . .. Fo(x)1 > kol x ~ Hint: D n > !F(x) . T(B) ordered. Vw.Fo(x)IOdFo(x) .Fo• generate a U(O. B.. = ~ I. Express the Cramervon Mises statistic as a sum.o V¢.Fo(x) 10 U¢. n (c) Show that if F and Fo are continuous and FiFo.) Evaluate the bound Pp(lF(x) .Section 4.1. Define the statistics S¢. . T(B) be B independent Monte Carlo simulated values of T.Fo(x)1 foteach. . Here.J(u) = 1 and 0: = 2. In Example 4. then the power of the Kolmogorov test tends to 1 as n .5. ~ ll.. . . to get X with distribution . Hint: If H is true T(X).6 is invariant under location and scale.d.. Use the fact that T(X) is equally likely to be any particular order statistic..p(F(x»!F(x) . .o: is called the Cramervon Mises statistic. and 1. T(l)..00).. . j = 1..10. . Show that the test rejects H iff T > T(B+lm) has level a = m/(B + 1). 00.. (This is the logistic distribution with mean zero and variance 1. 1.X(B) from F o on the computer and computing TU) = T(XU)). Suppose that the distribution £0 of the statistic T = T(X) is continuous under H and that H is rejected for large values of T.5 using the nonnal approximation to the binomial distribution of nF(x) and the approximate critical value in Example 4...10 Problems and Complements 271 8. with distribution function F and consider H : F = Fo.)(Tn ) = LN(O. . . .o J J x . (b) Suppose Fo isN(O. T(X(B)) is a sample of size B + 1 from La.T.(X') = Tn(X).p(u) be a function from (0.
Expected pvalues. 2. . Whichever system you buy during the year.to on the basis of the N(p" . I X n from this population..2. .1. You want to buy one of two systems. ! . let N I • N 2 • and N 3 denote the number of Xj equal to I. The first system costs $106 .05. and 3 occuring in the HardyWeinberg proportions f(I. j 1 i :J . ! J (c) Suppose that for each a E (0. One has signalMtonoise ratio v/ao = 2. Consider Examples 3.nO /.2 I i i 1.. Show that EPV(O) = P(To > T). A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time. 01 ) is an increasing function of 2N1 + N. 5 about 14% of the time. respectively. Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1. the other $105 • One second of transmission on either system COsts $103 each.. Let To denote a random variable with distribution F o. Consider a population with three kinds of individuals labeled 1. which system is cheaper on the basis of a year's operation? 1 I Ii I il 2. 2N1 + N. > cJ = a.272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale. which is independent ofT.). Without loss of generality. the UMP test is of the form I {T > c).2. whereas the other a I ~ a .3. = /10 versus K : J. ! .) I 1 .) . . 00 . i • (a) Show that if the test has level a. For a sample Xl. X n . I 1 (b) Show that if c > 0 and a E (0.2 and 4. then the pvalue is U =I . Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t).0) = 20(1.L > J. Suppose that T has a continuous distribution Fe. (See Problem 4. take 80 = O.a)) = inf{t: Fo(t) > u}.. and 3.0) = (1..10. i i (b) Define the expected pvalue as EPV(O) = EeU. I). Let 0 < 00 < 0 1 < 1.Fe(F"l(l. 12.. f(3. If each time you test.. and only if. the other has v/aQ = 1.') sample Xl. (For a recent review of expected p values see Sackrowitz and SamuelCahn. Problems for Section 4./"0' Show that EPV(O) = if! ( . then the test that rejects H if.0)'.1) satisfy Pe. Let T = X/"o and 0 = /" . 1999.. Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 . > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • . f(2. (a) Show that L(x. Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test. 2. [2N1 + N.1.Fo(T). where if! denotes the standard normal distribution.. you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0.. where" is known. you intend to test the satellite 100 times. 3.O) = 0'.0).. the power is (3(0) = P(U where FO 1 (u)  < a) = 1.
= a. Bd > cJ = I . PdT < cJ is as small as possible.. Nk) ~ 4.... I. In Examle 4.. two 5's are obtained..0. Bk).0..1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4. and only if. . 9. [L(X.0196 test rejects if. then a.6) or (as in population 1) according to N(I. B" . Hint: Use Problem 4.o = (1. 7. Bo.2. . M(n..0. Bo.17).Pe. prove Theorem 4. Prove Corollary 4.2.2) that linear combinations of bivariate nonnal random variables are nonnaUy distributed.. (c) Using the fact that if(N" . ..2. (a) Show that if in testing H : f} such that = f}o versus K : f) = f)l there exists a critical value c Po.2. .10 Problems and Complements 273 four numbers are equally likely to occur (i.1.2. (X.~:":2:~~=C:="::===='. 6. A fonnulation of goodness of tests specifies that a test is best if the max. Find a statistic T( X. I.1. find the MP test for testing 10.4.7).1. Y) known to be distributed either (as in population 0) according to N(O. the gambler asks that he first be allowed to test his hypothesis by tossing the die n times. L . Hint: The MP test has power at least that of the test with test function J(x) 8.N. A newly discovered skull has cranial measurements (X.1. Y) and a critical value c such that if we use the classification rule. For 0 < a < I. [L(X.. with probability . derive the UMP test defined by (4.2. 5..e.Bd > cJ then the likelihood ratio test with critical value c is best in this sense.<l. H : (J = (Jo versus K : (J = (J. Y) belongs to population 1 if T > c. In Example 4. . 0.. (b) Find the test that is best in this sense for Example 4. and to population 0 ifT < c. of and ~o '" I. where 11 = I:7 1 ajf}j and a 2 = I:~ 1 f}i(ai 11)2.4 and recall (Proposition B. then the maximum of the two probabilities ofmisclassification porT > cJ. 1.. (a) What test statistic should he use if the only alternative he considers is that the die is fair? (b) Show that if n = 2 the most powerfullevel... Show that if randomization is pennitted.2. +akNk has approx.2. +. MPsized a likelihood ratio tests with 0 < a < 1 have power nondecreasing in the sample size.2.. if . . find an approximation to the critical value of the MP level a test for this problem..6) where all parameters are known.2. Upon being asked to play.imately a N(np" na 2 ) distribution.imum probability of error (of either type) is as small as possible.=  Section 4.
01 lest to have power at least 0.i . Consider the foregoing situation of Problem 4. the probability of your deciding to stay open is < 0. Find the sample size needed for a level 0. but if the arrival rate is > 15. (c) Suppose 1/>'0 ~ 12.Xn be the times in months until failure of n similar pieces of equipment. Here c is a known positive constant and A > 0 is the parameter of interest. 274 Problems for Section 4. (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2. How many days must you observe to ensure that the UMP test of Problem 4. . 0) = ( : ) 0'(1. A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and.<» is the (1 . 1965) is the one where Xl.1. Let Xl.95 at the alternative value 1/>'1 = 15.3. Use the normal approximation to the critical value and the probability of rejection. .Xn is a sample from a truncated binomial distribution with • I. the probability of your deciding to close is also < 0. . • • " 5.(IOn x = 1•. 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! ..) 3.&(>').. Show that if X I.. I i . . ..<»/2>'0 where X2n(1. Suppose that if 8 < 8 0 it is not worth keeping the counter open. (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00.1 achieves this? (Use the normal approximation..01.4.: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 .3. r p(x. hence. .<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function.. . In Example 4. . a model often used (see Barlow and Proschan. i 1 . that the Xi are a sample from a Poisson distribution with parameter 8. show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function.Xn is a sample from a Weibull distribution with density f(x. • I i • 4. A) = c AcxC1e>'x . . the expected number of arrivals per day. ]f the equipment is subject to wear. I i ... You want to ensure that if the arrival rate is < 10. x > O.01.3 Testing and Confidence Regions Chapter 4 1. > 1/>'0.0)"'/[1.• n. i' i .. (a) Show that L~ K: 1/>.3. Hint: Show that Xf . Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days.
and let YI .d. X 2. b). (Ii) Show that if we model the distribotion of Y as C(min{X I .O) = cBBx(l+BI.."" X n denote the incomes of n persons chosen at random from a certain population. (See Problem 4. In the goodnessoffit Example 4. For the purpose of modeling.Q)th quantile of the X~n distribution.8 > 80 . (a) Express mean income JL in terms of e.. ~ > O. where Xl_ a isthe (1 . Show how to find critical values.Fo(y)  }).6) is UMP for testing that the pvalues are uniformly distributed. . which is independent of XI.1... X 2 .. 7. . 6.•• of Li.  ..Section 4. survival times of a sample of patients receiving an experimental treatment. 1 ' Y > 0. . To test whether the new treatment is beneficial we test H : ~ < 1 versus K : .10 Problems and Complements 275 then 2.1.B) = Fi!(x). Assume that Fo has ~ density fo. then P(Y < y) = 1 e  .d. . . (b) NabeyaMiura Alternative.XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0.) It follows that Fisher's method for cQmbining pvalues (see 4.[1.O<B<1. Find the UMP test.~) = 1.6. against F(u)=u B.).2 to find the mean and variance of the optimal test statistic. (a) Lehmann Alte~p. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b).tive. survival times with distribution Fa.. In Problem 1. > 1.1. 00 < a < b < 00. A > O. Suppose that each Xi has the Pareto density f(x.. Y n be the Ll. random variable. and consider the alternative with distribution function F(x. 0 < B < 1. A > O. (b) Find the optimal test statistic for testing H : JL = JLo versus K .6. imagine a sequence X I. > po.1. Let N be a zerotruncated Poisson. y > 0. Hint: Use the results of Theorem 1.12. we derived the model G(y..5. .. (i) Show that if we model the distribution of Y as C(max{X I . suppose that Fo(x) has a nonzero density on some interval (a. . X N e>.:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ .. .6.Fo(y)l". X N }). Let Xl. then e. Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a. x> c where 8 > 1 and c > O. P(). 8. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo.
2. J J • 9. Hint: A Bayes test rejects (accepts) H if J . 'i . x > 0. 1 X n be i.' (cf.0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi .(O) 6 .1). What would you .. L(x.. Problems for Section 4.01.2..' " 2. Assume that F o has a density !o(Y).. n announce as your level (1 . I Hint: Consider the class of all Bayes tests of H : () = . . Problem 2.. /. Show that under the assumptions of Theorem 4. where the €i are independent normal random variables with mean 0 and known variance .0". 1 • .. . or Weibull. The numerator is an increasing function of T(x). . . 0.3.6t is UMP for testing H : 8 80 versus K : 8 < 80 . > Er • .' L(x. . Show that under the assumptions of Theorem 4.exp( _x B). Show that the test is not UMP. . i = 1.d. = . 0 = 0. I .3. e e at l · 6.X)2. j 1 To see whether the new treatment is beneficial. 0) e9Fo (Y) Fo(Y)... We want to test whether F is exponential. Show that under the assumptions of Theorem 4.0) UeB for '. Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1.0 e6 1 0~0.. Let Xi (f)/2)t~ + €i. with distribution function F(x). Let Xl. .4 1. we test H : {} < 0 versus K : {} > O..'? = 2.i. 00 p(x.{Od varies between 0 and 1. Oo)d.(O) L':x.1 and 01 loss... x > 0.(O) «) L > 1 i The lefthand side equals f6".52. 0.exp( x). B> O. the denominator decreasing. n.X)' = 16.(O)/ 16' I • 00 p(x. (a) Show how to construct level (1 .{Oo} = 1 . 1 .276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y. every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t.. 10. eo versus K : B = fh where I i I : 11.. Show that the UMP test is based on the statistic L~ I Fo(Y.). Using a pivot based on 1 (Xi .j "'" . Oo)d. F(x) = 1 .  1 j . Let Xl.3.2 the class of all Bayes tests is complete.. 1 X n be a sample from a normal population with unknown mean J.L and unknown variance (T2.O)d. F(x) = 1 .. 12. O)d.
.SOt no l (1.3 we know that 8 < O.n is obtained by taking at = Q2 = a/2 (assume 172 known).4.. Suppose that in Example 4.al). with X .a) interval of the fonn [ X .(al + (2)) confidence interval for q(8).. t.d. where c is chosen so that the confidence level under the assumed value of 17 2 is 1 ..n .IN(X  = I:[' lX. Use calculus.3).(2) UCB for q(8). . . .4..1) based on n = N observations has length at most l for some preassigned length l = 2d.4.1.Xn be as in Problem 4. then the shortest level n (1 .] . Show that if q(X) is a level (1 .a. Let Xl. (Define the interval arbitrarily if q > q. 0. where 8. I")/so. N(Ji" 17 2) and al + a2 < a. Stein's (1945) twostage procedure is the following. but we may otherwise choose the t! freely.IN.2. n. It follows that (1 . find a fixed length level (b) ItO < t i < I. X!)/~i 1tl ofB.1 Show that. minI ii. .4. . < a.l.q(X)] is a level (1 ..z(1 ./N. although N is random.) LCB and q(X) is a level (1.1.(2) " . (a) Justify the interval [ii. Show that if Xl... of the form Ji.4. Suppose we want to select a sample size N such that the interval (4.1.~a) /d]2 . .~a)!.X are ij.(X) = X .IN] . Then take N . X + z(1 . < 0..Section 4. Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2. 0.~a)/ .02..no further observations. Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a. distribution.a. What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4.7). [0. with N being the smallest integer greater than no and greater than or equal to [Sot". Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji. Compare yonr result to the n needed for (4.1)] if 8 given by (4.a) confidence interval for B. X+SOt no l (1.3).10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 . there is a shorter interval 7.c.) Hint: Use (A. has a 1. then [q(X).1. i = 1.01 [X . ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0.. . 5. 6.. what values should we use for the t! so as to make our interval as short as possible for given a? 3.1] if 8 > 0.
we may very likely be forced to take a prohibitively large number of observations. (a) Show that X interval for 8. it is necessary to take at least Z2 (1 (12/ d 2 observations. ±. 0) and X = 81n.6. if (J' is large. . (12) and N(1/. 11.. in order to have a level (1 . Hint: Set up an inequality for the length and solve for n.18) to show that sin 1( v'X)±z (1 . sn.  10. (The sticky point of this approach is that we have no control over N. Indicate what tables you wdtdd need to calculate the interval. (12.a) confidence interval for sin 1 (v'e).. i (b) What would be the minimum sample size in part (a) if (). use the result in part (a) to compute an approximate level 0.051 (c) Suppose that n (12 = 5.a) interval defined by (4.: " are known.l4. " 1 a) confidence .~ a) upper and lower bounds. Y n2 be two independent samples from N(IJ.a:) confidence interval of length at most 2d wheri 0'2 is known. = 0. and.3.) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi.3) are indeed approximate level (1 . and the fundamental monograph of Wald. Hence. Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment. By Theorem B. 1986.O)]l < z (1.. Show that the endpoints of the approximate level (1 .) (lr!d observations are necessary to achieve the aim of part (a). but we are sure that (12 < (It. 1"2. (J'2 jk) distribution. I ..001. X n1 and Yll .4. (1  ~a) / 2. (a) If all parameters are unknown. 0") distribution and is independent of sn" 8. X has a N(J1. d = i II: .. . exhibit a fixed length level (1 .a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a).0') for Jt of length at most 2d.!a)/ 4. (12 2 9. > z2 (1  is not known exactly. Suppose that it is known that 0 < ~.1.jn(X . 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 . is _ independent of X no ' Because N depends only on sno' given N = k. respectively.) I ~i . .0)/[0(1. 0.3. 4().1') has a N(o. (b) Exhibit a level (1 .') populations. ./3z (1.95 confidence interval for (J. Show that ~(). v. The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook..jn is an approximate level (1  • 12. T 2 I I' " I'.4. Let XI.fN(X .~a)]. . (c) If (12. (a) Show that in Problem 4. (a) Use (A. Let 8 ~ B(n.... Show that these two quadruples are each sufficient. find ML estimates of 11.O). 1947. . .a) confidence interval for (1/1').jn is an approximate level (b) If n = 100 and X = 0.. I 1 j Hint: [O(X) < OJ = [. Let 8 ~ B(n.
In the case where Xi is normal. Assuming that the sample is from a /V({L. . give a 95% confidence interval for the true proportion of cures 8 using (al (4.2 is known as the kurtosis coefficient. find (a) A level 0.1)8'/0') . See Problem 5. is replaced by its MOM estimate. it equals O. = 3. (a) Show that x( "d and x{1. Hint: Let t = tn~l (1 .9 confidence interval for {L.1) t z{1 . See A.5 is (1 _ ~a) 2.<. 15. Find the limit of the distribution ofn. Show that the confidence coefficient of the rectangle of Example 4.n} and use this distribution to find an approximate 1 .I). (In practice.4/0.10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0.3.2.1. K. Hint: Use Problem 8. 4) are independent.1 and. Slutsky's theorem.' Now use the central limit theorem.027 13.8). then BytheproofofTheoremB.1 is known. K.7).t ([en . and (b) (4. (c) Suppose Xi has a X~ distribution.4.' = 2:. (n _1)8'/0' ~ X~l = r(~(nl).9 confidence region for (It. and 10 000.)4 < 00 and that'" = Var[(X.30.J1. and X2 = X n _ I (1 . :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution.n(X . 10. (d) A level 0. (b) A level 0. 0).4.4.J1. .1) + V2( n .)'.3.3. XI 16.4 ) .9 confidence interval for {L x= + cr.3). cr 2 ) population.ia).) can be approximated by x(" 1) "" (n . 1) random variables.2. but assume that J1.) Hint: (n .D andn(X{L)2/ cr 2 '" = r (4.".). and the central limit theorem as given in Appendix A.4.I) + V2( n1) t z( "1) and x(1 .4 and the fact that X~ = r (k.J1.Q confidence interval for cr 2 .2. (c) A level 0. Suppose that 25 measurements on the breaking strength of a certain alloy yield 11. 100.4.Section 4.)/01' ~ (J1.4 = E(Xi .IUI.3. Now use the law of large numbers. Compute the (K.9 confidence interval for cr.3. . Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed.~Q).' 1 (Xi .". V(o') can be written as a sum of squares of n 1 independentN(O."') . If S "' 8(64.J1.)' .(n . Now the result follows from Theorem 8. 14. Xl = Xn_l na). Z. Hint: By B. Compare them to the approximate interval given in part (a). known) confidence intervals of part (b) when k = I. In Example 4. ~).
)F(x)[l. (b)ForO<o<b< I. U(O.4. we want a level (1 .6 with .6.4. 18.u) confidence interval for 1". In Example 4.u.F(x)]dx. That is.r fixed.i.9) and the upper boundary I" = 00. verify the lower bounary I" given by (4.05 and . distribution function.I Bn(F) = sup vnl1'(x) .. find a level (I .l). pl(O) < X <Fl(b)}.3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I . Show that for F continuous where U denotes the uniform. It follows that the binomial confidence intervals for B in Example 4. Suppose Xl. • i.1.4.d. < xl has a binomial distribotion and ~ vnl1'(x) . i 1 . . I (b) Using Example 4. 19.F(x)l.1 (b) V1'(x)[l. indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4.~u) by the value U a determined by Pu(An(U) < u) = 1.11 .I I 280 Testing and Confidence Regions Chapter 4 17. X n are i. (a) Show that I" • j j j = fa F(x)dx + f..F(x)1 is the approximate pivot given in Example 4.4.95.1'(x)] Show that for F continuous. " • j . Consider Example 4. . (c) For testing H o : F = Fo with F o continuous.. as X and that X has density f(t) = F' (t).a) confidence interval for F(x).3 for deriving a confidence interval for B = F(x).1. define . I' . !i .7.4. . In this case nF(l') = #[X. Assume that f(t) > 0 ifft E (a. (a) For 0 < a < b < I. b) for some 00 < a < 0 < b < 00. define An(F) = sup {vnl1'(X) .F(x)1 Typical choices of a and bare .F(x)1 )F(x)[1 .F(x)1 . . 1'1(0) < X < 1'.
..5.2). N( O . Hint: Use the results of Problems B. Yn2 be independent exponential E( 8) and £(. O ) is an increasing 2 function of + B~.. test given in part (a) has power 0. Show that if 8(X) is a level (1 .. for H : (12 > (15.90? 4. How small must an alternative before the size 0. .10 Problems and Complements 281 Problems for Section 4. Let Xl. X 2 be independent N( 01.3. "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function. (e) Suppose that n = 16. Hint: If 0> 00 ..1"2nl. with confidence coefficient 1 .12 and B. . Or .2 that the tests of H : . What value of c givessizea? (b) Using Problems B. is of level a for testing H : 0 > 00 . a (b) Show that the test with acceptance region [f(~a) for testing H : Il.~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant.a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 . lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 .1. . (a) Find c such that Oc of Problem 4. x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il. 1/ n J is the shortest such confidence interval... respectively.al) and (1 . Experience has shown that the exponential assumption is warranted.05. 5.13 show that the power (3(0 1 . (15 = 1. = 1 versns K : Il.a) UCB for 0. respectively.1 O? 2.) samples.) denotes the o. and let Il.· (a) If 1(0. (e) Similarly derive the level (1 . .2 = VCBs of Example 4..0. and consider the prob2 + 8~ > 0 when (}"2 is known. .(2) together. Let Xl.3. < XI? < f(1. (b) Derive the level (1. [8(X) < 0] :J [8(X) < 00 ].~a) I Xl is a confidence interval for Il. . Yf (1 . 3. of l. X2) = 1 if and only if Xl + Xi > c. Give a 90% confidence interval for the ratio ~ of mean life times.Section 4.. (a) Deduce from Problem 4. then the test that accepts. if and only if 8(X) > 00 .3.2n2 distribution. = 1 rejected at level a 0.Xn1 and YI .3. show that [Y f( ~a) I X. M n /0.a.2 are level 0.4.1 has size a for H : 0 (}"2 be < 00 . ~ 01). (d) Show that [Mn .\.4 and 8.2).5. = 0.a) LCB corresponding to Oc of parr (a).th quantile of the .5 1.
a. 'TJo) of the composite hypothesis H : ry = ryo. O. (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F. X. T) where 1] is a parameter of interest and T is a nuisance parameter.8) where () and fare unknown. lif (tllXi > OJ) ootherwise. respectively.IX(j) < 0 < X(k)] do not depend on 1 nr I. Show that the conclusions of (aHf) still hold. Let C(X) = (ry : o(X. 0.00 .. Thus. (c) Show that Ok(o) (X. but I(t) = I( t) for all t. and 1 is continuous and positive. (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing . (e) Show directly that P.  j 1 j . >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large.0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses. 1 Ct ~r (b) Find the family aftests corresponding to the level (1.Xn be a sample from a population with density f (t .2). Hint: (c) X. (b) The sign test of H versus K is given by. ry) = O}. 0. k . . l 7. N( 020g.. of Example 4.a) confidence region for the parameter ry and con versely that any level (1 . L:J ~ ( . og are independentN (0. X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 .2). .~n + ~z(l .4. . (a) Shnw that C(X) is a level (l . ). we have a location parameter family. I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~. (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?. .a)y'n.0:) LeB for 8 whatever be f satisfying our conditions. We are given for each possible value 1]0 of1] a level 0: test o(X. Let X l1 .[XU) (f) Suppose that a = 2(n') < OJ and P.0:) confidence interval for 11. Show that PIX(k) < 0 < X(nk+l)] ~ .1 when (72 is unknown. . 6. Suppose () = ('T}. I .282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2). . . O?.
Establish (iii) and (iv) of Example 4. 9.a) .5. 1) and q(O) the ray (X . that suP. Y. laifO>O <I>(z(1 .p) = = 0 ifiX . Po) is a size a test of H : p = Po. Then the level (1 . 11. Let 1] denote a parameter of interest. 12. Suppose X.2. 00) is = 0'.00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H. Let X ~ N(O.2 a) (a) Show that J(X. 1). ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry.3 and let [O(S). Show that the Student t interval (4. Let P (p.10 Problems and Complements 283 8.[q(X) < 0 2 ] = 1. Y ~ N(r/. 10.pYI < (1 1 otherwise.O ~ J(X. or). Thus. Yare independent and X ~ N(v. (b) Describe the confidence region obtained by inverting the family (J(X.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4. let T denote a nuisance parameter. the quantity ~ = 8(8) . That is.20) if 0 < 0 and. .2. Show that as 0 ranges from O(S) to 8(S).a). Y.a) = (b) Show that 0 if X < z(1 .Section 4. ' + P2 l' z(1 1 .5.a))2 if X > z(1 . 8( S)] be the exact level (1 . Note that the region is not necessarily an interval or ray.2a) confidence interval for 0 of Example 4. and let () = (ry.7.7. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'.z(l. hence. p)} as in Problem 4.a) confidence interval [ry(x).5.a.5.a). 7)).Y. This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line.1. a( S.5. Hint: You may use the result of Problem 4.z(1 . p. Let a(S. if 00 < O(S) (S is inconsistent with H : 0 = 00 ).1) is unbiased. (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X . 1). Define = vl1J. 0) ranges from a to a value no smaller than 1 . alll/.
Suppose we want a distributionfree confidence region for x p valid for all 0 < p < 1. X n be a sample from a population with continuous distribution F.. . and F+(x) be as in Examples 4. where ! 1 1 heal c: n(1  p) + zl_QVnp(1  pl.) Suppose that p is specified. Detennine the smallest value k = k(a) such that Jk(Q) has level a for H and show that for n large. iI . (t) Show that k and 1in part (e) can be approximated by h (~a) and h (1  ~a) where ! heal is given in part (b). That is. .F(x p)]. 1 .p) versus K' : P(X > 0) > (1 .1 and Fu 1. k+ ' i .05 could be the fifth percentile of the duration time for a certain disease. Let x p = ~ IFI (p) + Fi! I (P)]. Construct the interval using F. Show that Jk(X I x'. Let F.95 could be the 95th percentile of the salaries in a certain profession l or lOOx .a) confidence interval for x p whatever be F satisfying our conditions. F(x). .. (a) Show that testing H : x p < 0 versus I< : x p > 0 is equivalent to testing H' : P(X > 0) < (I . (See Section 3. it is distribution free. . (X(k)l X(n_I») is a level (1 . L.6 and 4.1» . .4. ! . I • (c) Let x· be a specified number with 0 < F(x') < 1. Show that P(X(k) < x p < X(nl)) = 1. .a) LeB for x p whatever be f satisfying our conditions. (e) Let S denote a B(n.a.. Then   P(P(x) < F(x) < P+(x)) for all x E (a..b) (a) Show that this statement is equivalent to = 1. Let Xl. " II [I . Thus. 14. That is.7.a = P(k < S < n I + 1) = pi(1. i j (d) Deduce that X(n_k(Q)+I) (XU) is the jth order statistic of the sample) is a level (1 .p) variable and choose k and I such that 1 . X n x*) is a level a test for testing H : x p < x* versus K : x p > x*. 0 < p < 1.284 Testing and Confidence Regions Chapter 4 13. Simultaneous Confidence Regions for Quantiles.5. Show that the interval in parts (e) and (0 can be derived from the pivot T(x ) p  . (b) The quantile sign test Ok of H versus K has critical region {x : L:~ 1 1 [Xi > 0] > k}. Confidence Regions for QlIalltiles.p). We can proceed as follows.. = ynIP(xp) . In Problem 13 preceding we gave a disstributionfree confidence interval for the pth quantile x p for p fixed. P(xp < x p < xp for all p E (0. (g) Let F(x) denote the empirical distribution. lOOx. il.a.4. k(a) .h(a). vF(xp) [1 F(xp)] " Hint: Note that F(x p ) = p.p)"j.. be the pth quantile of F.
xp]: 0 < l' < I}. That is. that is. . (b) Express x p and .j < F_x(x)] See also Example 4.u are the empirical distributions of U and 1 .x.. F f (. F~x) has the same distribution Show that if F x is continuous and H holds. then L i==l n IIXi < X + t.) < p} and. .4. L i==I n I IFx (Xi) . . Note the similarity to the interval in Problem 4.X I. Suppose X denotes the difference between responses after a subject has been given treatments A and B.p.5.1. where p ~ F(x).Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}. LetFx and F x be the empirical distributions based on the i.. where F u and F t .. TbeaitemativeisthatF_x(t) IF(t) forsomet E R. I). Give a distributionfree level (1 . X I. (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X. (c) Show how the statistic An(F) of Problem 4..X n . .10 Problems and Complements 285 where x" ~ SUp{. F(x) > pl. then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U ).(x)] < Fx(x)] = nF1_u(Fx(x))..Section 4.. 15.U with ~ U(O.d.1.r" ~ inf{x: a < x < b.z. n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X. Hint: Let t.r p in terms of the critical value of the Kolmogorov statistic and the order statistics.(x) = F ~(Fx(x» . We will write Fx for F when we need to distinguish it from the distribution F_ x of X.. X n and . .13(g) preceding.. Suppose that X has the continuous distribution F.T: a <. where A is a placebo.. ~ ~ (a) Consider the test statistic x . .i. Express the band in terms of critical values for An(F) and the order statistics. The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R. VF(p) = ![x p + xlpl. the desired confidence region is the band consisting of the collection of intervals {[.r < b.17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p .
i.1) empirical distributions.F1 _u). nFx(x) we set ~ 2.x (': + A(x)). (b) Consider the parameter tJ p ( F x. i .3. nFy(t) = 2. Also note that x = H. then D(Fx .:~ 1 1 [Fy(Y. i=I n 1 i t . where do:.F x l (l . (p). treatment A (placebo) responses and let Y1 .t::. for ~.. Then H is symmetric about zero.. is the nth quantile of the distribution of D(Fu l F 1 ..) < Fx(t)] = nFu(Fx(t)). he a location parameter as defined in Problem 3.x = 2VF(F(x)).) ~ < F x (")] ~ ~ = nFu(Fx(x)).5.17.. Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ ."..U ). Hint: Let A(x) = FyI (Fx(x)) . treatment B responses. .. Fenstad and Aaberge (1977). ~ ~ ~ F~x. the probability is (1 .'" 1 X n be Li.:~ I1IFx(X. We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y .) : :F VF ~ inf vF(p). Fx. let Xl. "" = Yp .. then  nFy(x + A(X)) = L: I[Y.. Hint: nFx(t) = 2.(F(x)) .1. i I f F.(x) ~ R.. . Properties of this and other bands are given by Doksum. Moreover. . It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H.a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(. where x p and YP are the pth quan tiles of F x and Fy. To test the hypothesis H that the two treatments are equally effective.Jt: 0 < p < 1) for the curve (5 p (Fx.l (p) ~ ~[Fxl(p) .p)] = ~[Fxl(p) + F~l:(p)].a) simultaneous confidence band for A(x) = F~l:(Fx(x)) . vt = O<p<l sup vt(p) O<p<l where [V. we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R.) < Fy(x p )] = nFv (Fx (t)) under H. F y ) has the same distribntion as D(Fu . Hint: Define H by H. The result now follows from the properties of B(·). . where F u and F v are independent U(O. F y ). It follows that if c. and by solving D( Fx. Let t (c) A Distdbution and ParameterFree Confidence Interval. Show that for given F E F.. Give a distributionfree level (1. vt(p)) is the band in part (b).) < do.286 Testing and Confidence Regions ~ Chapter 4 '1.) = D(Fu .x. Let 8(. .a) simultaneous confidence band 15. <aJ Show that if H holds..Fy ): 0 < p < 1}.. F y ) . Fx(t)l· .d..2A(x) ~ HI(F(x)) l 1 + vF(F(x)).) < Fx(x)) = nFv(Fx(x)).) is a location parameter} of all location parameter values at F. I I I 1 1 D(Fx...:~ II[Fx(X. = 1 i 16. < F y I (Fx(x))] i=I n = L: l[Fy (Y.d. As in Example 1. then D(Fx. where :F is the class of distribution functions with finite support.Fy ) = maxIFy(t) 'ER "" .F~x. 1 Yn be i.x p . we get a distributionfree level (1 .
F y ) < O(Fx . where is nnknown.(P). then Y' = Y. the probability is (1 .Fy).).6+] contains the shift parameter set {O(Fx. Q. . (d) Show that E(Y) .. Fv ). ) is a shift parameter} of the values of all shift parameters at (Fx .6.0:) confidence bound forO. ~ ~ = (e) A Distribution and ParameterFree Confidence Interval.. 0 < p < 1.6 1. :F x :F O(Fx. nFx ("') = nFu(Fx(x)).L.T.= minO<p<l 6. Show that for the model of Problem 4.Section 4. F y ).2uynz(1 i=! i=l all L i=l n ti is also a level (1.. Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('.4. 6+ maxO<p<1 6+(p). Now apply the axioms. F y. B(. F y ) and b _maxo<p<l 6p(Fx . ~) distribution.E(X). Fy.(x) = Fy(x thea D(Fx. Show that if 0(·. F x ) ~ a and YI > Y.) .. (a) Consider the model of Problem 4. where :F is the class of distri butions with finite support.).Xn is a sample from a r (p..(.. if I' = 1I>'. Let6. (b) Consider the unbiased estimate of O.6].2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8. Exhibit the UMA level (1 . O(Fx. = 22:7 I Xf/X2n( a) is . Fy ) . t Hint: Set Y' = X + t.4. 3. It follows that if we set F\~•.10 Problems and Complements 287 Moreover. .. and 8 are shift parameters. F y ) is in [6. 6.. Fy ).3.    Problems for Section 4.F y ' (p) = 6. F y ) E F x F. Fx +a) = O(Fx a.(X). t. Show that B = (2 L X.(Fx . x' 7 R.)I L ti .) 12:7 I if.2. then by solving D(F x. thea It a unifonnly most accurate level 1 .) ~ + t. (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T.a) UCB for O. Let do denote a size a critical value for D(Fu . we find a distributionfree level (1 . .(x)) .) < do: for Ll. T n n = (22:7 I X. F y ) > O(Fx " Fy).0: UCB for J. . 2. then B( Fx .('. Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976).0:) simultaneouS coofidence band for t. tt Hint: Both 0 and 8* are nonnally distributed.. F y.a) that the interval [8.) = F (p) . is called a shift parameter if O(Fx.) I i=l L tt .·) is a shift parameter. Suppose X I. ~ D(Fu •F v ). moreover X + 6 < Y' < X + 6. Let 6 = minO<1'<16p(Fx. Show that for given (Fx . Show that n p is known and 0 O' ~ (2 2~A X.
l2(b). density = B. Suppose that B' is a uniformly most accurate level (I . Poisson. and for (). J • 6. Hint: Use Examples 4. B). .B(n. U = (8 . Xl.4.2.6 for c fixed.1. satisfying the conditions of Problem 8. 0'].d. s) distribution with T and s integers. respectively. t)dsdt = J<>'= P. . ·'"1 1 . U(O.12 so that F. (B" 0') have joint densities.i.\.\ =.B). establish that the UMP test has acceptance region (4..[B < u < O]du.(O . .1 are well defined and strictly increasing. [B.' with "> ~. f3( T..W < B' < OJ for allB' l' B. 2.aj LCB such that I . s > 0.. 8. Suppose that given. I'..B)+. Show that if F(x) < G(x) for all x and E(U). distribution with r and s positive integers.. .Xn are Ll. n = 1. Prove Corollary 4.6. t > e. where Show that if (B.B) = J== J<>'= p( s.3 and 4. F1(tjdt. . 3. Hint: By Problem B.( 0' Hint: E.O] are two level (1 . • . then E(U) > E(V). Suppose [B'. then )" = sB(r(1 . Let U. Suppose that given B Pareto. (a) Show that if 8 has a beta. P(>. ~. Establish the following result due to Pratt (1961). • • • = t) is distributed as W( s.E(V) are finite. . .l. 1I"(t) Xl. /3( T.B')+. 5.2. (1: dU) pes. '" J.. 0).d. In Example 4. V be random variables with d. s).0 upper and lower confidence bounds for J1 in the model of Problem 4.B') < E.3. .a) upper and lower credible bounds for A. (0 ."" X n are i.W < BI = 7.6.X.C. (b) Suppose that given B = B. j . distribution and that B has beta. G corresponding to densities j.2" Hint: See Sections 8.4. X has a binomial. E(U) = . > 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I .1. s).3).3. .6 to V = (B . 1. i .2 and B..) and that.. OJ. g. where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t.2. .a) confidence intervals such that Po[8' < B' < 0'] < p. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for.6.Bj has the F distribution F 2r . Construct unifoffilly most accurate level 1 .a.B). Hint: Apply Problem 4. t) is the joint density of (8.7 1 . uniform.6. p. where s = 80+ 2n and W ~ X.288 Testing and Confidence Regions Chapter 4 4. e> 0.\ is distributed as V /050. Pa( c. Problems for Section 4.f:s F. then E. and that B has the = se' (t.3.
..Xn is observable and X n + 1 is to be predicted..s') withe' = max{c. 4..y. 'T).2 degrees of freedom.2. Hint: 7f(Ll 1x.. y) and that the joint density of. 'P) is the N(x . Let Xl. y) is proportional to p(8)p(x 1/11.1 )) distribution. as X.y) is 1f(r 180)1f(/11 1 r.r)p(y 1/12. r > O.y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2...x) is aN(x.i.rlm) density and 7f(/12 1r. Here Xl. Problems for Section 4.Section 4. Show thatthe posterior distribution 7f( t 1 x.x)' + E(Yj y)'. Hint: p( 0 I x.a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B.m} and + n. and Y1. where (75 is known. Show that (8 = S IM ~ Tn) ~ Pa(c'. r 1x. .r 1x. (b) Find level (1 . /11 and /12 are independent in the posterior distribution p(O X. 1f(/11 1r. T) = (111.. = PI ~ /12 and'T is 7f(Ll.l..a) upper and lower credible bounds for O. . . r) where 7f(Ll I x .1. Suppose 8 has the improper priof?r(O) = l/r. .0"5). Suppose that given 8 = (ttl' J.r). y) of is (Student) t with m + n .Xn + 1 be i.y. (b) Show that given r. . .0:) prediction interval for Xnt 1. ". Yn are two independent N (111 \ 'T) and N(112 1 'T) samples.x)1f(/1. y)... r In) density..y) = 7f(r I sO)7f(LlI x ..112.Xm. (a) Let So = E(x. Xl. N(J.. .0:) confidence interval for B. .d. . (a) Give a level (1 . y) is aN(y.1. r( TnI (c) Set s2 + n.10 Problems and Complements 289 (a)LetM = Sf max{Xj. respectively. . 1 r.y. (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll. Show fonnaUy that the posteriof?r(O 1 x. (d) Compare the level (1. = So I (m + n  2). (c) Give a level (1 . y) is obtained by integrating out Tin 7f(Ll..Xn ). In particular consider the credible bounds as n + 00.8 1.
'.x ) where B(·. X n such that P(Y < Y) > 1. N(Jl.11. give level 3.. and that (J has a beta. and a = . That is.0'5) with 0'5 known. Give a level (1 .(0 I x)dO. 0) distribution given (J = 8.9. 1 2.i. random variable. 00 < a < b (1 . .2. . )B(r+x+y. let U(l) < .. q(ylx)= ( .. In Example 4..5. Establish (4.. which is not observable.. n = 100.. B(n. Un + 1 ordered.a) prediction interval for X n + l ..Q) distribution free lower and upper prediction bounds for X n + 1 • < 00. X n+ l are i.. 0).a) lower and upper prediction bounds for X n + 1 . < u(n+l) denote Ul .' . b). where Xl.0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl.x . 1 X n are i.Xn + 1 be i.s+nx+my)/B(r+x.05. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I . . 0 > 0. . s).a).2n distribution. . X is a binomial.d.a (P(Y < Y) > I . i! Problems for Section 4. . Suppose that given (J = 0.d.Xn are observable and we want to predict Xn+l.2) hy using the observation that Un + l is equally likely to be any of the values U(I).8..nl. Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 .12.290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4. give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a. . Then the level of the frequentist interval is 95%.8.i..8. :[ = . ! " • Suppose Xl.. CJ2) with (J2 unknown. • .10. (a) If F is N(Jl.. ~o = 10.i. as X where X has the exponential distribution F(x I 0) =I .. suppose Xl. distribution.8). . as X . Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X . give level (I .'" . F. 5. 4.. .t for M = 5. . Suppose XI. Find the probability that the Bayesian interval covers the true mean j. A level (1 .15.) Hint: First show that .e...9.9 1. 0'5)' Take 0'5 ~ 7 2 = I. B( n.8. . . Present the results in a table and a graph.Xn are observable and X n + 1 is to be predicted. x > 0.3) by doing a frequentist computation of the probability of coverage. I x) is sometimes called the Polya q(y I x) = J p(y I 0). (This q(y distribution. Suppose that Y.s+n. has a B(m.d. distribution.• . (b) If F is N(p" bounds for X n + 1 . Let X have a binomial. (3(r.·) denotes the beta function. " u(n+l).. Let Xl.5.10..
Thus..V2nz(1 . (ii) CI ~ C2 = n logcl/C2.. I . (b) Use the nonnal approximatioh to check that CIn C'n n .Q). ° 3..(nx).log (ii' / (5)1 otherwise.Q.( x) ~ 0.Xn be a N(/L. and only if. let X 1l .. (c) Deduce that the critical values of the commonly used equaltailed test. Hint: log . if ii' /175 < 1 and = .C2n !' 1 as n !' 00.3.(x) = . 2 2 L. if Tn < and = (n/2) log(1 + T. ~2 . .~Q) also approximately satisfy (i) and (ii) of part (a).='7"nlog c l n /C2n CIn . JL > Po show that the onesided. onesample t test is the likelihood ratio test (fo~ Q <S ~).F(CI) ~ 1.(Xi .. log . >. 0'2) sample with both JL and 0'2 unknown. Show that (a) Likelihood ratio tests are of the form: Reject if. and only if.(x) = 0./(n .f. of the X~I distribution.X) aD· t=l n > C. where Tn is the t statistic. We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if. · . In Problems 24.) .3. 0' 4.Section 4. XnI (1 . where F is the d. (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B. = Xnl (1 .(x) Hint: Show that for x < !n.~Q) n + V2nz(l.. Xn_1 (~a). TwoSided Tests for Scale.. (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12. 2. OneSided Tests for Scale.(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy.. 0'2 < O'~ versus K : 0'2 > O'~. (i) F(c. Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise. = aD nO' 1".I)) for Tn > 0. In testing H : It < 110 versus K .~Q) approximately satisfy (i) and also (ii) in the sense that the ratio . We want to test H .. n Cj I L.n) and >.10 Problems and Complements 291 is an increasing function of (2x .
x y vi I I . 100.2' (12) samples. e 1 ) depends on X only throngh T. 190.z .292 Testing and Confidence Regions Chapter 4 5. 1 I 794 2012 1800 2477 576 3498 411 2092 897 1808 I Assume the twosample normal model with equal variances. 0'2 ~ ! corresponding to inversion of the (c) Compute a level 0. . 110.05 onesample t test.05. 0'2). Assume the onesample normal model.9. respectively.L. I Yn2 be two independent N(J. (a) Show that the MLE of 0 ~ (1'1. (a) Using the size 0: = 0. I'· (c) Find a level 0. . 114. 0 E e.. (c) Using the normal approximation <l>(z(a)+ nl nz/n(1'1 I'z) /(7) to the power. . (b) Consider the problem aftesting H : Jl. . can we conclude that the mean blood pressure in the population is significantly larger than IDO? (b) Compute a level 0.Xn1 and YI .. . . . Forage production is also measured in pounds per acre. (a) Find a level 0. .01 test to have power 0.P. i = 1. find the sample size n needed for the level 0. 7. Suppose X has density p(x.. Assume a Show that the likelihood ratio statistic is equivalent to the twosample t statistic T.. Let Xl. (X..95 confidence interval for p.O). ... The following blood pressures were obtained in a sample of size n = 5 from a certain population: 124. n.95 when nl = nz = ~n and (1'1 1'2)/17 ~ ~. 6. whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre.1'2. . Show that A(X..Xn are said to be serially correlated or to follow an autoregressive model if we can write ~ .95 confidence interval for equaltailed tests of Problem 4.4. (b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use ct = 0. and that T is sufficient for O.1 < J. (7 2) is Section 4. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall. 9.lIl (7 2) and N(Jl. 8.. Y. = 0 and €l.. The control measurements (x's) correspond to 0 pounds of mulch per acre.tl > {t2.l.90 confidence interval for the mean bloC<! pressure J. • Xi = (JXi where X o I 1 + €i. . where a' is as defined in < ~.90 confidence interval for a by using the pivot 8 2 ja z . eo..3. €n are independent N(O.9. The nonnally distributed random variables Xl. a 2 ) random variables..L2 versus K : j.
(T~). Stein). Fix 0 < a < and <>/[2(1 ..O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if. (Tn.[T > tl is an increasing function of 6..and twosided t tests. P. ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~. Hint: Let Z and V be independent and have N(o. Y2 )dY2. distribution. Yl. has level a and is strictly more 11.OXi_d 2} i= 1 < Xi < 00.IITI > tl is an increasing function of 161.. 1 {= xj(kl)ellx+(tVx/k'I'ldx.. X powerful whatever be B. P. samples from N(J1l. (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl. . ... has density !k . for each v > 0.. 7k. Then use py.. get the joint distribution of Yl = ZI vVlk and Y2 = V. (An example due to C. Xo = o. Yn2 be two independent N(P. (a) P. I).2) In exp{ (I/Z(72) 2)Xi .6.<» (1 . respectively. (Yl) = f PY"Y. . 0) by the following table. From the joint distribution of Z and V. Let e consist of the point I and the interval [0. The power functions of one. Consider the following model. / L:: i 10. (Yl. (b) P. Show that. II. 12. .Section 4. J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint.n.'" p(X. Condition on V and apply the double expectation theorem. Define the frequency functions p(x.. Show that the noncentral t distribution. XZ distributions respectively. 7k.10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl. X n1 .(t) = . . . i = 1. . 13. = 0.2. Let Xl.[Z > tVV/k1 is increasing in 6. and only if.<»1 < e < <>.6. The F Test for Equality of Scale. Then. with all parameters assumed unknown..O) for 00 = (Z.IIZI > tjvlk] is increasing in [61. Suppose that T has a noncentral t.
as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2. · . p) distribution. show that given Uz = Uz .37 384 2.64 298 2. . i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2.9.'. (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where . . 1. Hint: Use the transfonnations and Problem B.96 279 2. T has a noncentral Tnz distribution with noncentrality parameter p.4.4) and (4.). ! .1)]E(Y. 1 . and use Probl~1ll4. and only if.1' Sf = L U. note that . . • . V. (d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4.1.62 284 2.. Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model. where l _ .5) that 2 log '(X) V has a distribution.a/2) or F < f( n/2). and using the arguments of Problems B. where a. where f( t) is the tth quantile of the F nz .8.J J .~ I i " 294 Testing and Confidence Regions Chapter 4 . r> (c) Justify the twosided F test: Reject H if.9.7 and B. a?.. L Yj'. can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I . the continuous versiop of (B.. . Un = Un.4. si = L v.4.24) implies that this is also the unconditional distribution. 0. . . Finally. <. Un). vn 1 l i 16..7 to conclude that tribution as 8 12 /81 8 2 . 1. . F > /(1 .V.nl1 ar is of the fonn: Reject if. . . Sbow.~ I ..X)' (b) Show that (aUa~)F has an F nz obtained from the :F table.nt 1 distribution.1I(a). (a) Show that the LR test of H : af = a~ versus K : ai > and only if.'. using (4. i=2 i=2 n n S12 =L i=2 n U.4.68 250 2.. F ~ [(nl . .8. Because this conditional distribution does not depend on (U21 •. . • I " and (U" V. 0.. I ' LX. .71 240 2.' ! . Argue as in Proplem 4.61 310 2. then P[p > c] is an increasing function of p for fixed c. (11. (Un.4. . x y 254 2. that T is an increasing function of R.0 has the same distribution as R.1 . Yn) be a sampl~ from a bivariateN(O. 14. Let (XI. . . . 15.94 Using the likelihood ratio test for the bivariate nonnal model. i 0 in xi !:.1)/(n.0 has the same dis r j . distribution and that critical values can be . Consider the problem of testing H : p = 0 versus K : p i O. p) distribution.. !I " . Vn ) is a sample from a N(O. . YI ).Y)'/E(X. The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study. > C. Let R = S12/SIS" T = 2R/V1 R'.10.19 315 2. .9. p) distribution. . (1~. . (Xn .12 337 1.
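As an aside for readers who want to check the last problem numerically: the following minimal Python sketch (our own illustration, assuming numpy and scipy are available; the simulated data stand in for the cholesterol table, whose ten (x, y) pairs should be typed in from the printed table) computes the sample correlation r, the equivalent statistic T = sqrt(n-2) r / sqrt(1 - r^2), and compares |T| with the t_{n-2} critical value.

```python
import numpy as np
from scipy import stats

def correlation_t_test(x, y, alpha=0.10):
    """Test H: rho = 0 via T = sqrt(n-2) r / sqrt(1 - r^2), which has a t_{n-2}
    distribution under H in the bivariate normal model; reject when |T| is large."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    T = np.sqrt(n - 2) * r / np.sqrt(1 - r**2)
    crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return r, T, crit, abs(T) > crit

# Illustration on simulated bivariate normal data; replace these arrays by the
# ten (x, y) pairs from the table above to answer the question in the problem.
rng = np.random.default_rng(0)
x = rng.normal(300, 40, size=10)
y = 3.5 - 0.004 * x + rng.normal(0, 0.3, size=10)
print(correlation_t_test(x, y))
```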
2. G..9. Wiley & Sons.. E. AND J.11 Notes 295 17. 1974) to the critical value is <. i].3.Section 4. HAMMEL. S in 8(X) would be replaced by 5 + and 5 in 8(X) is replaced by 5 . AND F.01. ....9.[ii .11 NOTES Notes for Section 4.15 is used here. Wu.3). T6 . P.11. it also has confidence level (1 . Box. Consider the bioequivalence example in Problem 3. P. Stephens. the class of Bayes procedures is complete.01 + 0. Because the region contains C(X). 00. (b) Compare your solution to the Bayesian solution based on a continuous loss function rro = 0. 4.398404 (975).819 for" = 0. (a) Find the level" LR test for testing H : 8 E given in Problem 3.0. Tl .3 (1) Such a class is sometimes called essentially complete. 1983. Box.4 (I) If the continuity correction discussed in Section A.(t) = tl(. 187.2. T.o = 0. 1973.851.895 and t = 0. G.10. Data Analysis and Robustness. Mathematical Theory of Reliability New York: J. 00.035.[ii) where t = 1.05 and 0. if the parameter space is compact and loss functions are bounded. respectively. and r.. !. Consider the cases Ii. Editors New York: Academic Press. E. (3) A good approximation (Durbin.3) holds for some 8 if <p 't V. "Is there a sex bias in graduate admissions?" Science. R.I\ E. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete. Leonard..1. The term complete is then reserved for the class where strict inequality in (4. Rejection is more definitive. (2) We ignore at this time some reallife inadequacies of this experiment such as the placebo effect (see Example 1. Essentially. (2) In using 8(5) as a confidence hound we are using the region [8(5). i] versus K : 8 't [i. 4. Apology for Ecumenism in Statistics and Scientific lnjerence. see also Ferguson (1967). PROSCHAN. Notes for Section 4. F.0.~.12 1965. Nntes for Section 4. O'CONNELL.1 (1) The point of view usually taken in science is that of Karl Popper [1968]. . BICKEL. and C. W.). REFERENCES BARLOW.. Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding. (2) The theory of complete and essentially complete families is developed in Wald (1950).
BROWN, L., T. CAI, AND A. DASGUPTA, "Interval estimation for a binomial proportion" (2000).
DOKSUM, K. A., G. FENSTAD, AND R. AABERGE, "Plots and tests for symmetry," Biometrika, 64, 473-487 (1977).
DOKSUM, K. A., AND G. SIEVERS, "Plotting with confidence: Graphical comparisons of two populations," Biometrika, 63, 421-434 (1976).
DURBIN, J., "Distribution theory for tests based on the sample distribution function," Regional Conference Series in Applied Math., 9, SIAM, Philadelphia, Pennsylvania (1973).
FERGUSON, T. S., Mathematical Statistics: A Decision Theoretic Approach New York: Academic Press, 1967.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner Publishing Company, 1958.
HALD, A., Statistical Theory with Engineering Applications New York: Wiley, 1952.
HEDGES, L., AND I. OLKIN, Statistical Methods for Meta-Analysis Orlando, FL: Academic Press, 1985.
JEFFREYS, H., The Theory of Probability Oxford: Oxford University Press, 1961.
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
OOSTERHOFF, J., AND W. R. VAN ZWET, "On the combination of independent test statistics," Ann. Math. Statist., 38, 659-680 (1967).
POPPER, K., Conjectures and Refutations; the Growth of Scientific Knowledge New York: Harper and Row, 1968.
PRATT, J. W., "Length of confidence intervals," J. Amer. Statist. Assoc., 56, 549-567 (1961).
SACKROWITZ, H., AND E. SAMUEL-CAHN, "P values as random variables - expected P values," The American Statistician, 53, 326-331 (1999).
STEIN, C., "A two-sample test for a linear hypothesis whose power is independent of the variance," Ann. Math. Statist., 16, 243-258 (1945).
STEPHENS, M. A., "EDF statistics for goodness of fit," J. Amer. Statist. Assoc., 69, 730-737 (1974).
TATE, R. F., AND G. W. KLETT, "Optimal confidence intervals for the variance of a normal distribution," J. Amer. Statist. Assoc., 54, 674-682 (1959).
WALD, A., Sequential Analysis New York: Wiley, 1947.
WALD, A., Statistical Decision Functions New York: Wiley, 1950.
WANG, Y. Y., "Probabilities of the type I errors of the Welch tests," J. Amer. Statist. Assoc., 66, 605-608 (1971).
WELCH, B. L., "Further notes on Mrs. Aspin's tables," Biometrika, 36, 243-246 (1949).
WETHERILL, G. B., AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
WILKS, S. S., Mathematical Statistics New York: J. Wiley & Sons, 1962.
Chapter 5

ASYMPTOTIC APPROXIMATIONS

5.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS

Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific P by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample X1, ..., Xn from a distribution F, our setting for this section and most of this chapter. If we want to estimate μ(F) = E_F X1 and use X̄, we can write

E_F(X̄ − μ(F))² = σ²(F)/n, where σ²(F) = Var_F(X1).   (5.1.1)

This is a highly informative formula, telling us exactly how the MSE behaves as a function of n, and calculable for any F and all n by a single one-dimensional integration. To go one step further, consider med(X1, ..., Xn) as an estimate of the population median ν(F). If n is odd, n = 2k + 1, ν(F) = F⁻¹(1/2), and F has density f, we can write

E_F(med(X1, ..., Xn) − ν(F))² = ∫ (x − ν(F))² g_n(x) dx   (5.1.2)

where, from Problem (B.2.13),

g_n(x) = n (2k choose k) F^k(x)(1 − F(x))^k f(x).   (5.1.3)

Evaluation here requires only evaluation of F and a one-dimensional integration, but a different one for each n (Problem 5.1.1). However, the qualitative behavior of the risk as a function of n and simple parameters of F is not discernible easily from (5.1.2) and (5.1.3). Worse, consider evaluation of the power function of the one-sided t test of Chapter 4. If X1, ..., Xn are i.i.d. N(μ, σ²), we have seen in Section 4.9.2 that √n X̄/S has a noncentral t distribution with parameter μ/σ and n − 1 degrees of freedom. This distribution may be evaluated by a two-dimensional integral using classical functions.
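As a concrete aside (not part of the original text): the one-dimensional integral in (5.1.2)-(5.1.3) is easy to evaluate numerically. The minimal Python sketch below, assuming numpy and scipy and taking F = N(0, 1) as an arbitrary example, computes the MSE of the sample median for a few odd n; the products n × MSE illustrate the 1/n behavior that the asymptotics of this chapter will explain.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import comb

def median_mse(n, F=stats.norm()):
    """Exact MSE of med(X_1,...,X_n), n = 2k+1 odd, using the density g_n of (5.1.3)."""
    assert n % 2 == 1
    k = (n - 1) // 2
    nu = F.ppf(0.5)                       # population median nu(F)
    def integrand(x):
        gn = n * comb(2 * k, k) * F.cdf(x)**k * (1 - F.cdf(x))**k * F.pdf(x)
        return (x - nu)**2 * gn
    val, _ = quad(integrand, F.ppf(1e-10), F.ppf(1 - 1e-10))
    return val

for n in [5, 15, 45, 135]:
    mse = median_mse(n)
    # for F = N(0,1), n * MSE settles near pi/2 ~ 1.571 as n grows
    print(f"n={n:4d}   MSE={mse:.5f}   n*MSE={n*mse:.3f}")
```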
Approximately evaluate Rn(F) by _ 1 B i (5.. . is to approximate the risk function under study by a qualitatively simpler to understand and easier to compute function..n (Xl. j=l • By the law of large numbers as B j. where A= { ~ .Xn):~Xi<n_l (~2 (EX.. X n + EF(Xd or p £F( . Rn (F).i. but for the time being let's stick to this case as we have until now. or the sequence Asymptotic statements are always statements about the sequence. always refers to a sequence of statistics I. The classical examples are.)2) } . which we explore further in later chapters. Draw B independent "samples" of size n. . just as in numerical integration. .3. of distributions of statistics.I I !' 298 Asymptotic Approximations Chapter 5 (Problem 5. • . Xn)}n>l. Thus.Xnj ) . which occupies us for most of this chapter. Asymptotics. Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or. Monte Carlo is described as follows. X n as n + 00. more specifically. It seems impossible to determine explicitly what happens to the power function because the distribution of fiX / S requires the joint distribution of (X. .3. where X n = ~ :E~ of medians.2) and its qualitative properties are reasonably transparent. But suppose F is not Gaussian. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asyrnptotics briefly in Example 5. for instance the sequence of means {Xn }n>l. 1 < j < B from F using a random number generator and an explicit fonn for F.. {X 1j l ' •• .4) i RB =B LI(F. . . . is to use the Monte Carlo method. save for the possibility of a very unlikely event.. 10=1 f=l There are two complementary approaches to these difficulties. i . observations Xl. or it refers to the sequence of their distributions 1 Xi. The first.1.1. We shall see later that the scope of asymptotics is much greater.d. S) and in general this is only representable as an ndimensional integral..fiit !Xi. 00. fiB ~ R n (F). I j {Tn (X" . in this context. based on observing n i.o(X1). The other. we can approximate Rn(F) arbitrarily closely. X nj }. VarF(X 1 ». .fii(Xn  EF(Xd) + N(O. In its simplest fonn. .. _. .
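The Monte Carlo recipe (5.1.4) takes only a few lines to implement. Here is a minimal sketch in Python (our own illustration; the estimator, the squared-error loss, F = N(0, 1), and the values of B and n are arbitrary choices), approximating the risk of the sample median, the same quantity that exact one-dimensional integration gives via (5.1.2)-(5.1.3).

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_risk(estimator, n, B,
                     sample=lambda size, rng: rng.standard_normal(size),
                     loss=lambda est: est**2):
    """R_B of (5.1.4): average loss over B independent samples of size n drawn from F."""
    losses = np.empty(B)
    for b in range(B):
        x = sample(n, rng)                 # one "sample" X_1b, ..., X_nb from F
        losses[b] = loss(estimator(x))     # l(F, delta(X_1b, ..., X_nb))
    return losses.mean()

# Risk of the sample median under F = N(0, 1) and squared error (here mu(F) = nu(F) = 0).
for n in [11, 51, 201]:
    r_hat = monte_carlo_risk(np.median, n=n, B=20_000)
    print(f"n={n:4d}   Monte Carlo risk ~ {r_hat:.5f}   (compare pi/(2n) = {np.pi/(2*n):.5f})")
```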
01.f. the much PFlIx n  Because IXd < 1 implies that .6) for all £ > O. (5. Is n > 100 enough or does it have to be n > 100.5) That is.1.15...14. if EFXf < 00.1.1..1.1. if more delicate Hoeffding bound (B.. (5.1. ' .l) (5.4.1. which are available in the classical situations of (5. 1') < z] ~ ij>(z) (5. .7) where 41 is the standard nonnal d. Further qualitative features of these bounds and relations to approximation (5. the right sup PF x [r. say. X n ) but in practice we act as if they do because the T n (Xl. then P [vn:(X F n .14. say.1.8) Again we are faced with the questions of how good the approximation is for given n. 7). then (5.6) to fall.25 whereas (5.10) < 1 with . n = 400. the celebrated BerryEsseen bound (A.8) are given in Problem 5.6) does not tell us how large n has to be for the chance of the approximation ~ot holding to this degree (the lefthand side of (5. Similarly. (5. and P F • What we in principle prefer are bounds..' m (5. below .10) is .l1) states that if EFIXd' < 00.. We interpret this as saying that.' = 1 possible (Problem 5. (see A.9) when .. X n )) (in an appropriate sense).2 hand side of (5. . the central limit theorem tells us that if EFIXII < 00. £ = .VarF(Xd. For instance.1. 1'1 > €] < 2exp {~n€'}.9.1. For € = ./' is as above and . .Section 5.Xn ) or £F(Tn (Xl. PF[IX n _  ..1.1 Introduction: The Meaning and Uses of Asymptotics 299 In theory these limits say nothing about any particular Tn (Xl.3).(Xn .11) .1.. x. DOD? Similarly.1.2 . Xn is approximately equal to its expectation.' 1'1 > €] < .01. this reads (5.omes 11m'.6) gives IX11 < 1. As an approximation.1. the weak law of large numbers tells us that.6) and (5.1.2 is unknown be.9) As a bound this is typically far too conservative. . For instance. .1') < x] _ ij>(x) < CEFIXd' v'~ 3 1/' " "n (5. for n sufficiently large. X n ) we consider are closely related as functions of n so that we expect the limit to approximate Tn(Xl~"· . Thus.. if EFIXd < 00. by Chebychev's inequality..1.. The trouble is that for any specified degree of approximation.1.1.9) is .
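To get a feeling for how conservative the Chebyshev and Hoeffding-type bounds are relative to the exact probability and the normal approximation, here is a minimal Python sketch (our illustration; Bernoulli(1/2) variables, epsilon = 0.1, and the sample sizes are arbitrary choices, and scipy is assumed available). Note that Bernoulli variables satisfy |X1| <= 1, so the Hoeffding-type bound quoted above applies.

```python
import numpy as np
from scipy import stats

p, eps = 0.5, 0.1
sigma2 = p * (1 - p)

for n in [100, 400, 1600]:
    cheb  = min(1.0, sigma2 / (n * eps**2))      # Chebyshev bound
    hoeff = 2 * np.exp(-n * eps**2 / 2)          # Hoeffding-type bound for |X_1| <= 1
    # exact probability: Xbar = S/n with S ~ Binomial(n, p)
    S = stats.binom(n, p)
    exact = S.sf(np.ceil(n * (p + eps)) - 1) + S.cdf(np.floor(n * (p - eps)))
    clt = 2 * stats.norm.sf(eps * np.sqrt(n / sigma2))   # normal approximation from the CLT
    print(f"n={n:5d}  exact={exact:.4f}  CLT={clt:.4f}  "
          f"Chebyshev={cheb:.4f}  Hoeffding={hoeff:.4f}")
```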
1. The arguments apply to vectorvalued estimates of Euclidean parameters.1. and asymptotically normal. consistency is proved for the estimates of canonical parameters in exponential families.8) is typically much betler than (5.1. asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. I . Asymptotics has another important function beyond suggesting numerical approximations for specific nand F. For instance. as we have seen.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for oneparameter exponential families. Section 5.F) > N(O 1) . which is reasonable. The qualitative implications of results such as are very impor~ tant when we consider comparisons between competing procedures. Yet.1. good estimates On of parameters O(F) will behave like  Xn does in relation to Ji.1. It suggests that qualitatively the risk of X n as an estimate of Ji. We now turn to specifics. behaves like . If they are simple.' (0)( (1 / y'n)( v'27i') (Problem 5. Section 5. Practically one proceeds as follows: (a) Asymptotic approximations are derived.e(F)]) (1(e. i . As we shall see.t and 0"2 in a precise way. • c F (y'n[Bn . F) typically is the standard deviation (SD) of J1iOn or an approximation to this SD.11) is again much too consctvative generally. Although giving us some idea of how much (5.7) says that the behavior of the distribution of X n is for large n governed (approximately) only by j.2 deals with consistency of various estimates including maximum likelihood. In particular.2 and asymptotic normality via the delta method in Section 5.5) and quite generally that risk increases with (1 and decreases with n. "(0) > 0.1.3.12) where (T(O. d) = '(II' . even here they are not a very reliable guide. for any loss function of the form I(F. . Bounds for the goodness of approximations have been available for X n and its distribution to a much greater extent than for nonlinear statistics such as the median. : i . for all F in the n n . .11) suggests.1. As we mentioned. although the actual djstribution depends on Pp in a complicated way. 1 i . The estimates B will be consistent. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17  j • i. quite generally.di) where '(0) = 0. Consistency will be pursued in Section 5.8) differs from the truth. . (5. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo. (5. (b) Their validity for the given n and Ttl for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. • model. (5. B ~ O(F).300 Asymptotic Approximations Chapter 5 where C is a universal constant known to be < 33/4.' I I • If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good.I '1.(l) The approximation (5.
. asymptotics are methods of approximating risks.2) is called unifonn consistency. distributions.1) where I .1).5 we examine the asymptotic behavior of Bayes procedures.) forsuP8 P8 lI'in . X ~ p(P) ~ ~  p = E(XJ) and p(P) = X. For P this large it is not unifonnly consistent. in accordance with (A. and other statistical quantities that are not realistically computable in closed form.2.2) are preferable and we shall indicate some of qualitative interest when we can.i. is a consistent estimate of p(P).Xn from Po where 0 E and want to estimate a real or vector q(O).14. But. we talk of uniform cornistency over K. if...Xn ) is that as e n 8 E ~ e.1 CONSISTENCY PlugIn Estimates and MlEs in Exponential Family Models Suppose that we have a sample Xl. ' . by quantities that can be so computed. P where P is unknown but EplX11 < 00 then.7.) However.2. In practice.2. Means.. A stronger requirement is (5. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A. 5.2 Consistency 301 in exponential families among other results.14. 1denotes Euclidean distance. 00. but we shall use results we need from A. which is called consistency of qn and can be thought of as O'th order asymptotics. Section 5.2. If is replaced by a smaller set K. for . > O.7 without further discussion.1) and (B. 'in 1] q(8) for all 8.Section 5. e Example 5. case become increasingly valid as the sample size increases. Summary. The least we can ask of our estimate Qn(X I.1.14.2. We will recall relevant definitions from that appendix as we need them. The stronger statement (5. We also introduce Monte Carlo methods and discuss the interaction of asymptotics.7. and probability bounds. . for all (5. and 8.2) Bounds b(n. (See Problem 5. Finally in Section 5.2 5.2.. A. Monte Carlo. .1.d. .i. If Xl. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to 00.1). Most aSymptotic theory we consider leads to approximations that in the i.4 deals with optimality results for likelihoodbased procedures in onedimensional parameter models.lS.d.2. AlS. That is. remains central to all asymptotic theory. (5. .q(8)1 > 'I that yield (5.2. with all the caveats of Section 5. and B. 1 X n are i.2. by the WLLN. where P is the empirical distribution. The simplest example of consistency is that of the mean. . .
6 >0 O. Suppose that q : S + RP is continuous. with Xi E X 1 .2.l «) = inf{6 : w(q.6. . 0) is defined by . Suppose the modulus ofcontinuity of q.2.Pk) : 0 < Pi < 1. . . for all PEP.2.• X n be the indicators of binomial trials with P[X I = I] = p. 0 In fact. w. where Pi = PIXI = xi]' 1 < j < k.p'l < 6}. there exists 6 «) > 0 such that p. Theorem 5. which is ~ q(ji). p' E S.2.6)! Oas6! O. l iA) E S be the empirical distribution.pi > 6«)] But. If q is continuous w(q. · .l : [a.14.. implies Iq(p')q(p) I < <Then Pp[li1n .I l 302 instance.b] < R+bedefinedastheinver~eofw.·) is increasing in 0 and has the range [a. Then qn q(Pn) is a unifonnly consistent estimate of q(p).4) I i (5. and p = X = N In is a uniformly consistent estimate of p.b) say. Binomial Variance.. sup{ Pp[liln pi > 61 : pES} < kl4n6 2 (Problem 5. q(P) is consistent.ql > <] < Pp[IPn . 0 ~ To some extent the plugin method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plugin estimates. By the weak law of large numbers for all p. Then N = LXi has a B(n. 6) It easily follows (Problem 5.. Pn = (iiI..6) = sup{lq(p)  q(p')I: Ip . Letw. consider the plugin estimate p(Iji)ln of the variance of p. Because q is continuous and S is compact. w( q. the kdimensional simplex.. we can go further..p) distribution. and (Xl. it is uniformly continuous on S." is the following: I X n are LLd. Other moments of Xl can be consistently estimated in the same way.3) that > <} (5. Thus. .1) and the result follows.1.. for every < > 0.3) Evidently. Suppose thnt P = S = {(Pl. where q(p) = p(1p).. (5.2. 0 < p < 1. . Evidently. Pp [IPn  pi > 6] ~ .5) I A simple and important result for the case in which X!.2. Asymptotic Approximations Chapter 5 X is uniformly consistent over P because o Example 5.2.2. .1 < j < k. i . w(q.. xd is the range of Xl' Let N i = L~ 11(Xi = Xj) and Pj _ Njln. P = {P : EpX'f < M < oo}. = Proof. But further. Let X 1. in this case.LJ~IPJ = I}. w(q. then by Chebyshev's inequality. by A. Ip' pi < 6«).
then vIP) h(g).2 Consistency 303 Suppose Proposition 5. Questions of uniform consistency and consistency when P = { DistribuCorr(Uj. ' . EVj2 < 00. Let Xi = (Ui . and let q(O) = h(m(O)).. (ii) ij is consistent.i.. .1. then Ifh=m.2. variances. Theorem 5.2. and correlation coefficient are all consistent.p).2. If we let 8 = (JLI.V. )1 < oo}. conclude by Proposition 5..V2. We need only apply the general weak law of large numbers (for vectors) to conclude that (5.1 and Theorem 2.)1 < 1 } tions such that EUr < 00.7.) > 0. af > o. ai. Let TI. if h is continuous. where P is the empirical distribution. 1 are discussed in Problem 5.Section 5.3. (i) Plj [The MLE Ti exists] ~ 1.then which is well defined and continuous at all points of the range of m.2.1 that the empirical means. If Xl.9d) map X onto Y C R d Eolgj(X')1 < 00. Variances and Correlations. X n are a sample from PrJ E P. U implies !'. . We may.1 . Vi). 0 Here is a general consequence of Proposition 5.) I < that h(D n) Foreonsisteney of h(g) apply Proposition B.2.ar. 1 < j < d.v} = (u.2. let mj(O) '" E09j(X.3. 'Vi) is the statistic generating this 5parameter exponential family. .4. thus. = Proof.1: D n 00.d. where h: Y ~ RP. Suppose P is a canonical exponentialfamily of rank d generated by T. Suppose [ is open. Var(U. is consistent for v(P).. o Example 5. Let g(u.U 2.JL2.Jor all 0.2. [ and A(·) correspond to P as in Section 1. Then.1.6) if Ep[g(X. 1 < i < n be i. Let g (91. Jl2. 1 < j < d. 11. ag.6.p). Var(V.UV) so that E~ I g(Ui . More generally if v(P) = h(E p g(X .. N2(JLI.lpl < 1. )) and P = {P : Eplg(X .) > 0. is a consistent estimate of q(O). Then. .a~. h(D) for all continuous h.2.). !:.
II) = where.3. z=l 1 n i. j 1 Pn(X. D( 110 . i i • • ! . T(Xdl < 6} C CT' By the law of large numbers.2. . {t: It . 5. is a continuous function of a mean of Li. . II) 0 j =EII. Let Xl.2..2. (5. . . for example.2 Consistency of Minimum Contrast Estimates . vectors evidently used exponential family properties. n LT(Xi ) i=I 21 En. We showed in Theorem 2..3.L[p(Xi ..1 to Theorem 2.= 1 .. Po. see. Note that. (T(Xl)) must hy Theorem 2. is solved by 110. • .E1).1.7) occurs and (i) follows..1 that i)(Xj. The argument of the the previous subsection in which a minimum contrast estimate. But Tj.lI) . T(Xl).3. . . if 1)0 is true. Let 0 be a minimum contrast estimate that minimizes I . as usual.d. and the result follows from Proposition 5.. . Suppose 1 n PII sup{l. 1 i . '1 P1)'[n LT(Xi ) E C T) ~ 1..j. E1). which solves 1 n A(1)) = .. .8) I > O.9) n and Then (} is consistent. II) L p(X n..LT(Xi ). I I Proof Recall from Corollary 2. = 1 n p. I ! . I .. 0 Hence. the inverse AI: A(e) ~ e is continuous on 8.1 belong to the interior of the convex support hecause the equation A(1)) = to. By a classical result. BEe c Rd. Rudin (1987). i=l 1 n (5. By definition of the interior of the convex support there exists a ball 8..3.304 Asymptotic Approximations Chapter 5 .2.d. D .2. exists iff the event in (5.3.1I 0 ) foreverye I .7) I . 7 I . X n be Li.2.p(X" II) is uniquely minimized at 11 i=l for all 110 E 6. .D(1I0 . the MLE.  inf{D(II.lI o): 111110 1 > e} > D(1I0 .1I)11: II E 6} ~'O (5. n . i I .1 that On CT the map 17 A(1]) is II and continuous on E. where to = A(1)o) = E1)o T(X 1 ).. A more general argument is given in the following simple theorem whose conditions are hard to check. I J Theorem 5.2.Xn ) exists iff ~ L~ 1 T(X i ) = Tn belongs to the interior CT of the convex support of the distribution of Tn.
E n ~ [p(Xi .2.IJol > E] < PIJ [inf{ . An alternative condition that is readily seen to work more widely is the replacement of (5.2. EIJollogp(XI."'[p(X i .2.2.lJ o): IIJ IJol > oj. .IJ o)  D(IJo. 0) .IJO)) ' IIJ IJol > <} n i=l .11) implies that the righthand side of (5. Ie ¥ OJ] ~ Ojoroll j. [inf c e.1 we need only check that (5. is finite.2.D(IJo.2.2. Coronary 5.2.8) and (5.2.[max{[ ~ L:~ 1 (p(Xi .L(p(Xi.p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1. PIJ. PIJ. ifIJ is the MLE. {IJ" .D(IJo.IIIJ  IJol > c] (5.2.9) hold for p(x. IJ) .IJ) p(Xi. But for ( > 0 let o ~ ~ inf{D(IJ.8) can often failsee Problem 5. and (5. /fe e =   Proof Note that for some < > 0.13) By Shannon's Lemma 2.2. IJ)I : IJ K } PIJ !l o.D(IJo.2.2.11) implies that 1 n ~ 0 (5.2. IIJ . IJ) n ~ i=l p(X"IJ o)] .2.14) (ii) For some compact K PIJ.D(IJ o. .2. 0 Condition (5. IJ) . IIJ .. _ I) 1 n PIJ o [16 . (5.inf{D(IJo.Section 5. IJ j )) : 1 <j < d} > <] < d maxi PIJ.2 Consistency 305 proof Note that.9) follows from Shannon's lemma.8) by (i) For ail compact K sup In { c e. 2 0 (5.IJor > <} < 0] because the event in (5.2. [I ~ L:~ 1 (p(Xi .11) sup{l.5. IJj))1 > <I : 1 < j < d} ~ O. IJ) . A simple and important special case is given by the following. (5.L[p(Xi .10) By hypothesis. IJ) = logp(x. then..IJ)1 < 00 and the parameterization is identifiable. IJ). IJ)II ' IJ n i=l E e} > . IJj ) .D( IJ o.IJd ). IJ j ) . 1 n PIJo[inf{.8) follows from the WLLN and PIJ. o Then (5.[IJ ¥ OJ] = PIJ.10) tends to O.2. But because e is finite.12) which has probability tending to 0 by (5. for all 0 > 0. IJo)) .1.IJol > <} < 0] (5.8).
3.Il E(X Il).B(P)I > Ej : PEP} t 0 for all € > O.i. Summary. ! ! (ii) EIXd~ < 00 Let E(X1 ) = Il. that sup{PIiBn .1.1 that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk. ~l Theorem 5.d. A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems. . =suPx 1k<~)(x)1 <M < .3. X valued and for the moment take X = R.14) is in general difficult. then = h(ll) + L (j)( ) j=l h . As usual let Xl.3. D . sequence of estimates) consistency. m Mi) and assume (b) IIh(~)lloo j 1 .. Let h : R ~ R. in Section 5.306 Asymptotic Approximations Chapter 5 We shall see examples in which this modification works in the problems.) = <7 2 Eh(X) We have the following. + Rm (5. Sufficient conditions are explored in the problems. We then sketch the extension to functions of vector means. We introduce the minimal property we require of any estimate (strictly speak ing. If fin is an estimate of B(P). Unfortunately checking conditions such as (5. we require that On ~ B(P) as n ~ 00. .1 The Delta Method for Moments • • We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders.1) where • .3 FIRST. see Problem 5. When the observations are independent but not identically distributed. ~ 5. We denote the jth derivative of h by 00 . let Ilglioo = sup{ Ig( t) I : t E R) denote the sup nonn. .33. I ~.Xn be i. consistency of the MLE may fail if the number of parameters tends to infinity. 5. If (i) and (ii) hold.2.8) and (5. Unifonn consistency for l' requires more. We conclude by studying consistency of the MLE and more generally Me estimates in the case e finite and e Euclidean. VariX. and assume (i) (a) h is m times differentiable on R. We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families. J. ..2.. > 2.AND HIGHERORDER ASYMPTOTlCS: THE DelTA METHOD WITH APPLICATIONS We have argued.3.
..3. . (n .. tI.2. and the following lemma.'1 +. ... .'" ..Section 5. . C [~]! n(n . ik > 2. so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r. We give the proof of (5.ij } • sup IE(Xi . 21..1) .. Let I' = E(X. 'Zr J .4) for all j and (5..3. EIX I'li = E(X _1')' Proof..5. then (a) But E(Xi .. .3. Xij)1 = Elxdj il.i j appears at by Problem 5.3. (b) a unless each integer that appears among {i I . If EjX Ilj < 00. tr i/o>2 all k where tl . .3) (5. . Moreover. +'r=j J . rl l:: . :i1 + . The more difficult argument needed for (5.and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion. . .3. X ij ) least twice.1'1..1. . _ . + i r = j. t r 1 and [t] denotes the greatest integer n. .t" = t1 . Lemma 5.3. j > 2. then there are constants C j > 0 and D j > 0 such (5. (5. bounded by (d) < t.3 First. .5[jJ2J max {l:: { ..) ~ 0.3.2) where IX' that 1'1 < IX .3) and j odd is given in Problem 5.1 < k < r}}. ..3.[jj2] + 1) where Cj = 1 <r... for j < n/2. In!.3) for j even.3. The expression in (c) is..4) Note that for j eveo..
308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) . (d).3.3. Then Rm = G(n 2) and also E(X .4). apply Theorem 5.5) (b) if E(Xt) < G(n. (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00. I .3. i • r' Var h(X) = 02[h(11(J.llJ' + G(n3f2) (5. I . and (e) applied to (a) imply (5. EIX1 .3.+ O(n.3.2.3.3f') in (5. (n li/2] + 1) < nIJf'jj and (c).l) + 00 2n I' + G(n3f2).5) can be replaced by Proof For (5.l) + [h(llj'(1')}':. I (b) Ifllh(j11l= < 00. then G(n.3) for j even..5) follows. .. Corollary 5.ll j < 2j EIXd j and the lemma follows. I I I Corollary 5.2 ). if I' = O.1')2 = 0 2 In.JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 .3.l)h(J. 00 and 11hC 41 11= < then G(n.3. Because E(X . .5) apply Theorem 5. • . then h(')( ) 2 0 = h(J. n (b) Next.1. If the conditions of (b) hold.3.1..1')3 = G(n 2) by (5.l)E(X .6) can be 1 Eh 2 (X) = h'(J.l)h(1)(J. and EXt < replaced by G(n'). (5. Proof (a) Write 00. 1 iln .3f2 ) in (5.1 with m = 3.3.l) + {h(21(J. then I • I.l) + [h(1)]2(J. 1 < j < 3.1'11. if 1 1 <j (a) Ilh(J)II= < 00.J.4) for j odd and (5. give approximations to the bias of h(X) as an estimate of h(J.l)}E(X . using Corollary 5. In general by considering Xi .l)h(J.6. . I 1 . (5. respectively.3..3.3.3f2 ). 0 1 I ..3.3.l) and its variance and MSE. 0 The two most important corollaries of Theorem 5.6) n .1 with m = 4.1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J..1.1') + {h(2)(J. < 3 and EIXd 3 < 00.3.l) 6 + 2h(J. By Problem 5.
and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5. Bias.8) because h(ZI (t) = 4(r· 3 . by Corollary 5..3. 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5. the bias of h(X) defined by Eh(X) . We will compare E(h(X)) and Var h(X) with their approximations.10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 .3.h(/1) is G(n.3(1_ ')In + G(n. X n are i. where /1. (1Z 5. the MLE of ' is X .11) Example 5.l / Z) unless h(l)(/1) = O. T!Jus. and.4 ) exp( 2/t).3.  . A qualitatively simple explanation of this important phenonemon will be given in Theorem 5.3.t.d. Note an important qualitative feature revealed by these approximations.3. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX). when hit) = t(1 .3.3.3.1.3.Z. If X [.3. then we may be interested in the warranty failure probability (5..= E"X I ~ 11 A. . {h(l) (/1)h(ZI U')/13 (5.(h(X) .6). then heX) is the MLE of 1 .3. Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5.3.3.l ).exp( 2') c(') = h(/1).1) = 11.4) E(X .3.i.9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n.3 First.i.5).Z) (5.2. as the plugin estimate of the parameter h(Jl) then.1 with bounds on the remainders. . which is neglible compared to the standard deviation of h(X). by Corollary 5. as we nonnally would.(h(X)) E. To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5.3.2 ) 2e. Bias and Variance of the MLE of the Binomial VanOance.Section 5.. (5.3.Z ". If heX) is viewed./1)3 = ~~.7) If h(t) ~ 1 . which is G(n. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +. Thus.t) and X.1..h(/1)) h(2~(II):: + O(n.2 (Problem (5.1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours.exp( 2It). Example 5. for large n.
.p) .2p)p(l.5) yields E(h(X))  ~ p(1 . Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi).p)(I.3.p(1 .2. . asSume that h has continuous partial derivatives of order up to m.pl. (5. a Xd d} . . + id = m.p) {(I _ 2p)2 n Thus.2(1 . Next compute Varh(X)=p(I.. R'n p(1 ~ p) [(I . Let h : R d t R.P )} (n_l)2 n nl n Because 1'3 = p(l. ~ Varh(X) = (1.' .p) n . .2p)')} + R~.2p(1 .. I .3.2t. • . 0 < i.!c[2p(1 n p) .p)'} + R~ p(l .p(1 .10) is in a situation in which the approximation can be checked. Theorem 5. D 1 i I .{2(1.6p(l. the error of approximation is 1 1 + .E(X') ~ p .e x ) : i l + . " ! .3.p)J n p(1 ~ p) [I . ... .p).. .' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2. D " ~ I .P) {(1_2 P)2+ 2P(I. < m.3.p) .p) + .10) yields .p)(I.p)] n I " .2p) n n +21"(1 .310 Asymptotic Approximations Chapter 5 B(l. M2) ~ 2. and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i.3 ).5) is exact as it should be. (5. 1 and in this case (5.[Var(X) + (E(X»2] I nI ~ p(1 .2p)' . and will illustrate how accurate (5. II'. First calculate Eh(X) = E(X) . 1 < j < { Xl .2p).2p)2p(1 .amh i.j .3. • ~ O(n.p) = p ( l .9d(Xi )f. n n Because MII(t) = I .
1 Then.) Cov(Y".and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00.) + a. (5.) Var(Y.). Y 12 ) 3/2).~E(X. if EIY.5).3. Theorem 5. 1. under appropriate conditions (Problem 5. Lemma 5. B. for d = 2. The proof is outlined in Problem 5.)} + O(n (5. E xl < 00 and h is differentiable at (5. then Eh(Y) This is a consequence of Taylor's expansion in d variables. Similarly.I' < 00.3.3. The most interesting application.»)'var(Y'2)] +O(n') Approximations (5. .3..4. (7'(h)) where and (7' = VariX. and (5.) Cov(Y".14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d . (5.3.2 ).". EI Yl13 < 00 Eh(Y) h(J1.) )' Var(Y.14) + (g:.3.11. and J. Y12 ) (5.) + ~ {~~:~ (J1.3 First.8. = EY 1 .. ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi.3.3.h(!. and the appropriate generalization of Lemma 5. The results in the next subsection go much further and "explain" the fonn of the approximations we already have.3.3. h : R ~ R.3.). (J1..12) Var h(Y) ~ [(:. Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00.3.3).) + 2 g:'.'. then O(n.3. is to m = 3.12) can be replaced by O(n. + ~ ~:l (J1.C( v'n(h(X) .Section 5.))) ~ N(O.3. We get.) gx~ (J1.3.hUll)~. by (5. Suppose thot X = R..6). as for the case d = 1.13) Moreover. 0/ The result follows from the more generally usefullernma.~ (J1.x.2.(J1.1.) Var(Y.13).3..3 / 2 ) in (5.2 The Delta Method for In law Approximations = As usual we begin with d !. 5. (J1. 1 < j < d where Y iJ I = 9j(Xi )..15) Then .
g(ll(U)(v  u)1 < <Iv . then EV. Formally we expect that if Vn !:.• ' " and.16) Proof. Thus.3. V for some constant u. . . . . we expect (5.312 Asymptotic Approximations Chapter 5 (i) an(Un . PIIUn €  ul < iii ~ 1 • "'.15) "explains" Lemma 5. Therefore.a 2 ).3. an = n l / 2.N(O. (e) Using (e). j . by hypothesis. (5. The theorem follows from the central limit theorem letting Un X.u)!:. = I 0 . ~ EVj (although this need not be true.32 and B.. • • .ul Note that (i) (b) '* . ( 2 ). By definition of the derivative. _ F 7 . for every (e) > O. (ii) 9 : R Then > R is d(fferentiable at u with derivative 9(1) (u). an(Un .3. for every Ii > 0 (d) j. ·i Note that (5. j " But. 'j But (e) implies (f) from (b). from (a). hence.3. V .17) .8). ~: . I (g) I ~: .1. Consider 1 Vn = v'n(X !") !:. 0 .7. V and the result follows.i. V.u) !:. I .3.ul < Ii '* Ig(v)  g(u) . u = /}" j \ V . see Problems 5. for every € > 0 there exists a 8 > 0 such that (a) Iv . .N(O.
Using the central limit theorem. we find (Problem 5.3.TiS X where S 2 = 1 nl L (X.X" be i. and S2 _ n nl ( ~ X) 2) n~(X.s Tn =.3.3.)./2 ). then S t:.Section 5.18) In particular this implies not only that t n.3 we saw that the two sample t statistic Sn vn1n2 (Y X) ' n = = n s nl + n2 has a Tn2 distribution under H when the X's and Y's are normal withar = a~. (b) The TwoSample Case.i. ••• .1 (Ia) + Zla butthat the t n.9.1 (1 .O < A < 1.2 and Slutsky's theorem. v) = u!v.. In Example 4. "t" Statistics. al = Var(X. N n (0 Aal. For the proof note that by the central limit theorem.X) n 2 . j even = o(nJ/'). £ (5. Consider testing H: /11 = j.'" .'" 1 X n1 and Y1 . Now Slutsky's theorem yields (5. i=l If:F = {Gaussian distributions}.j odd.a) for Tn from the Tn . FE :F where EF(X I ) = '".and HigherOrder Asymptotics: The Delta Method with Applications 313 where Z ~ N(O.18) because Tn = Un!(sn!a) = g(Un .d..17) yields O(n J. Let Xl.t2 versus K : 112 > 111. EZJ > 0. else EZJ ~ = O. But if j is even. and the foregoing arguments. In general we claim that if F E F and H is true. Let Xl. A statistic for testing the hypothesis H : 11. Example 5.N(O. 1'2 = E(Y.28) that if nI/n _ A.3 First. we can obtain the critical value t n. 1). 1 2 a P by Theorem 5.+A)al + Aa~) . 1). sn!a)..a) critical value (or Zla) is approximately correct if H is true and F is not Gaussian.l distribution.). then Tn . where g(u.> 0 .2.3. (1  A)a~ .l (1.= 0 versuS K : 11. Then (5. Yn2 be two independent samples with 1'1 = E(X I ). (1 . VarF(Xd = a 2 < 00. (a) The OneSample Case. .3.3. Slutsky's theorem.) and a~ = Var(Y.
. the For the twosample t tests. 316. X~.5 2 2... when 0: = 0.3.. The simulations are repeated for different sample sizes and the observed significance levels are plotted. . 10. as indicated in the plot.02 .05. Chisquare data I 0. and the true distribution F is X~ with d > 10. 32.5 . Y . One sample: 10000 Simulations.I = u~ and nl = n2...2 shows that when t n _2(10:) critical value is a very good approximation even for small n and for X. where d is either 2..1 (0.3. Other distributions should also be tried. Figure 5.314 Asymptotic Approximations Chapter 5 It follows that if 111 = 11. each lime computing the value of the t statistics and then giving the proportion of times out of M that the t statistics exceed the critical values from the t table.. then the critical value t. approximations based on asymptotic results should be checked by Monte Carlo simulations.c:'::c~:____:_J 0. 0) for Sn is Monte Carlo Simulation As mentioned in Section 5. Here we use the XJ distribution because for small to moderate d it is quite different from the normal distribution. Figure 5.. Each plotted point represents the results of 10.1.. We illustrate such simulations for the preceding t tests by generating data from the X~ distribution M times independently. .000 onesample t tests using X~ data. i 2(1 approximately correct if H is true and the X's and Y's are not normal.5 1 1. 20. The X~ distribution is extremely skew.1 shows that for the onesample t test. i I . .95) approximation is only good for n > 10 2 . " ..2 or of a~.. ..'.. the asymptotic result gives a good approximation when n > 10 1..5 . .. or 50.5 3 Log10 sample size j i Figure 5. and in this case the t n .. oL.1.3. ur .
3. 1 (ri .2. af = a~.n2 and at I=. when both nl I=.a~ show that as long as nl = n2._' 0..0' H<I o 0.Xi have a symmetric distribution. then the twosample t tests with critical region 1 {S'n > t n .2 (1 .9.Section 5_3 First. or 50.3.n2 and at = a'). in this case. By Theoretn 5. Moreover.3.3..Xd. In this case Monte Carlo studies have shown that the test in Section 4. and a~ = 12af. }. as we see from the limiting law of Sn and Figure 5.hoi n s[h(l)(X)1 . For each simulation the two samples are the same size (the size indicated on the xaxis). _ y'ii:[h(X) .3.95) approximation is good for nl > 100.4 based on Welch's approximation works well.a~. ChiSquare Dala.ll)(!. y'ii:[h(X) . Y . the t n _2(O.12 0.cCc":'::~___:. 10. Each plotted point represents the results of 10. let h(X) be an estimate of h(J. scaled to have the same means. Other Monte Carlo runs (not shown) with I=. N(O. To test the hypothesis H : h(JL) = ho versus K : h(JL) > ho the natural test statistic is T. the t n 2(1 .h(JL)] ". 10000 Simulations.5 1 1. Equal Variances 0.(})) do not have approximate level 0.X = ~. in the onesample situation.000 twosample t tests.and HigherOrder Asymptotics: The Delta Method with Applications 315 1 This is because. However.0:) approximation is good when nl I=.L) where h is con tinuously differentiable at !'. and the data are X~ where d is one of 2.5 2 2.' 0. D Next.5 3 log10 sample size Figure 5.02 d'::'.o'[h(l)(JLJf). even when the X's and Y's have different X~ distributions.) i O. 2:7 ar Two sample. and Yi .
.£.~'ri i _ . Variance Stabilizing Transfonnations Example 5. 1) and in the second they are N(Ola 2) where a 2 takes on the values 1. £l __ __ 0.3 and Slutsky's theorem. called variance stabilizing.oj:::  6K . as indicated in the plot.. Tn . .316 Asymptotic Approximations Chapter 5 Two Sample.J~ f).3. The data in the first sample are N(O. If we take a sample from a member of one of these families.I _ __ 0 I +__ I :::. 2nd sample 2x bigger 0. Poisson. +2 + 25 J O'. gamma. o. I: . we see that here.10000 Simulations: Gaussian Data. Combining Theorem 5. '! 1 .0.. if H is true .02""~".".{x)() twosample t tests. such that Var heX) is approximately independent of the parameters indexing the family we are considering.' OIlS .6) and .3.... then the sample mean X will be approximately normally distributed with variance 0"2 In depending on the parameters indexing the family considered.6. and beta. +___ 9.3.5 Log10 (smaller sample size) l Figure 5. such as the binomial. too. " 0.3. 0.3. Unequal Variances. The xaxis denotes the size of the smaller of the two samples. In Appendices A and B we encounter several important families of distributions. N(O.3. For each simulation the two samples differ in size: The second sample is two times the size of the first. which are indexed by one or more parameters.. and 9.5 1 1..c:~c:_~___:J 0.12 . .4. 1) so that ZI_Q is the asymptotic critical value. We have seen that smooth transformations heX) are also approximately normally distributed. It turns out to be useful to know transformations h. From (5. Each plotted point represents the results of IO.
a variance stabilizing transformation h is such that Vi'(h('Y) .(A)') has approximately aN(O. . in the preceding P( A) case. 1/4) distribution. 0 One application of variance stabilizing transformations.Xn are an i.Section 5.3.3 First. where d is arbitrary.and HigherOrder Asymptotics: The Delta Method with Applications 317 (5.)C.. which varies freely. To have Varh(X) approximately constant in A.6) we find Var(X)' ~ 1/4n and Vi'((X) . Suppose. See Problems 5.13) we see that a first approximation to the variance of h( X) is a' [h(l) (/. Under general conditions (Bhattacharya and Rao.. Suppose further that Then again..3. A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models.. Some further examples of variance stabilizing transformations are given in the problems. Thus. which has as its solution h(A) = 2. .. See Example 5. h must satisfy the differential equation [h(ll(A)j2A = C > 0 for some arbitrary c > O. Also closely related but different are socaBed normalizing transformations. 538) one can improve on .16.3.5. 1n (X I. In this case (5.\ + d. c) (5. . Edgeworth Approximations The normal approximation to the distribution of X utilizes only the first two moments of X. is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. The notion of such transformations can be extended to the following situation.19) is an ordinary differential equation. If we require that h is increasing. by their definition. suppose that Xl. h(t) = Ii is a variance stabilizing transformation of X for the Poisson family of distributions..3..)] 2/ n . X n) is an estimate of a real parameter! indexing a family of distributions from which Xl. p. yX± r5 2z(1 . finding a variance stabilizing transfonnation is equivalent to finding a function h such that for all Jl and (J appropriate to our family. sample.hb)) + N(o.. I X n is a sample from a P(A) family.d..15 and 5.19) for all '/. Substituting in (5.3. In this case a'2 = A and Var(X) = A/n.3. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II.6.1 0 ) Vi' ' is an approximate 1. this leads to h(l)(A) = VC/J>.i.Ct confidence interval for J>. . Major examples are the generalized linear models of Section 6. .3. 1976. Thus. Such a function can usually be found if (J depends only on fJ.' . As an example. A > 0. Thus.
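A direct check of the variance-stabilizing property of the square-root transformation for Poisson data: whatever lambda is, Var(sqrt(X̄)) should be roughly 1/4n, while Var(X̄) = lambda/n changes with the parameter. The minimal Python sketch below (our illustration; the lambda values, n = 50, and the replication count are arbitrary) shows both.

```python
import numpy as np

rng = np.random.default_rng(4)
n, B = 50, 100_000

print("lambda    n*Var(Xbar)    4n*Var(sqrt(Xbar))")
for lam in [0.5, 2.0, 8.0, 32.0]:
    xbar = rng.poisson(lam, size=(B, n)).mean(axis=1)
    v_raw  = n * xbar.var()                  # close to lambda: depends on the parameter
    v_stab = 4 * n * np.sqrt(xbar).var()     # close to 1: roughly free of lambda
    print(f"{lam:6.1f}    {v_raw:10.3f}    {v_stab:14.3f}")
```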
0105 0.3.n)1 V2ri has approximately aN(O. i = 1.9943 0.0287 0. ii. Table 5.61 0.1. Hs(x) = xS  IOx 3 + 15x.75 0..ססoo 0.1 gives this approximation together with the exact distribution and the nonnal approximation when n = 10.9997 4.0254 0.2024 EA NA x 0. We can use Problem B. .95 3. I' 0. TABLE 5..20) is called the Edgeworth expansion for Fn .04 1.51 1. 0.3.1051 0. • " .ססOO I . xro I • I I .6000 0.35 0.!.5421 0.3.7792 5.5999 0. Suppose V rv X~.n)' _ 3 (2n)2 = 12 n 1 I .IOx 3 + 15x)] I I y'n(X n 9n ! I i !.0010 ~: EA NA x Exact .3x) + _(xS .38 0.8008 0.0250 0.15 0.5. Exact to' EA NA {.0481 1. "J 1 '~ • 'YIn = E(Vn? (2n)1 E(V .ססoo 0...0100 0.I .21) The expansion (5.6548 4.9999 1. Edgeworth Approximations to the X 2 Distribution.0208 ~1.7000 0.1) + 2 (x 3 .9750 0.2.8000 0.3513 0.9724 0.9950 0.318 Asymptotic Approximations Chapter 5 the normal approximation by utilizing the third and fourth moments. (5. " I. Example 5. P(Tn < x).4000 0.91 0.9900 0.38 0. It follows from the central limit theorem that Tn = (2::7 I Xi n)/V2ri = (V . 0 x Exacl 2.9996 1.0397 0.9876 0. According to Theorem 8. Fn(x) ~ <!>(x) .9097 o.40 0.9990 0.1964 1.9500 'll ./2 2 .4 to compute Xl. V has the same distribution as E~ 1 where the Xi are independent and Xi N(O..85 0 0.9029 0.72 .3006 0.66 0.0001 0 0.0655 O.95 :' 1. . Therefore.3.n.0500 ! ! .34 2.1254 0.II 0.0050 0.77 0.6999 0.79 0. ~ 2. Let F'1 denote the distribution of Tn = vn( X .0284 1. 1) distribution." • [3 n .3000 0. . where Tn is a standardized random variable.2000 0.4999 0.1.0553 0.9905 0. Edgeworth(2) and nonna! approximations EA and NA to the X~o distribution.9999 1.86 0.4415 0.. 4 0. To improve on this approximation.' •• .40 0.0005 0 0.15 0. H 3 (x) = x3  3x.9984 0. we need only compute lIn and 1'2n.9995 0.1000 0.0877 0. 1).3.3.9506 0.4000 0.ססoo 1.2706 0. 0. Then under some conditionsY) where Tn tends to zero at a rate faster than lin and H 2 • H 3 • and H s are Hermite polynomials defined by H 2 (x) ~ x 2  I.5000 0.JL) / a and let lIn and 1'211 denote the coefficient of skewness and kurtosis of Tn.0032 ·1.9684 0.<p(x) ' .
Example 5. Let central limit theorem T'f. and Yj = (Y .0 .i.Section 5.0 TO .22) 2 g(l)(u) = (2U. .1) jointly have the same asymptotic distribution as vn(U n . (ii) g. P = E(XY).I.2 >'2. a~ = Var Y.3.1. = Var(Xkyj) and Ak.2._p2). 4f.o}. . >'1.0 2 (5.1. (i) un(U n . .j. Lemma 5.J11)/a.5 that in the bivariate normal case the sample correlation coefficient r is the MLE of the population correlation coefficient p and that the likelihood ratio test of H : p = is based on Irl. where g(UI > U2. Let p2 = Cov 2(X. Using the central limit and Slutsky's theorems.3 and (B. = ai = 1.. U3) = Ui/U2U3.0.3.0.u) ~ N(O. aiD.2. vn(i7f . R d !' RP has a differential g~~d(U) at u.2 extends to the dvariate case.n1EY/) ° ~ ai ~ and u = (p. 0< Ey4 < 00.. 1. E).9) that vn(C . The proof follows from the arguments of the proof of Lemma 5.0.2 >'1.l = Cov(XkyJ. Because of the location and scale invariance of p and r. we can show (Problem 5.a~) : n 3 !' R.0. Yn ) be i. Then Proof.0. >'2. Let (X" Y.m.p).2 T20 . Ui/U~U3. (X n .ar..= (Xi . then by the 2 vn(U .2. and let r 2 = C2 /(j3<.6.3.0 >'1.9.2rJ >". 1). we can .3.2. UillL2U~) = (2p.1.1.3. as (X.5.ij~ where ar Recall from Section 4.3 First.J12) / a2 to conclude that without use the transformations Xi j loss of generality we may assume J11 = 1k2 = 0.d.n lEX.and HigherOrder Asymptotics: The Delta Method with Applications 319 The Multivariate Case Lemma 5.u) ~~ V dx1 forsome d xl vector ofconstants u. _p2. It follows from Lemma 5.3.6) that vn(r 2 N(O. Suppose {Un} are ddimensional random vectors and that for some sequence of constants {an} with an !' (Xl as n Jo (Xl.2. Y) where 0 < EX 4 < 00. xmyl).1) and vn(i7~ .2.0.. E Next we compute = T11 ./U2U3.).2 + p'A2.3.u) where Un = (n1EXiYi.J 2 2 + 4 2 4 2 Tn P 120 + P T 02 +2{ _2 p3Al.3. . Y)/(T~ai where = Var X. We can write r 2 = g(C.1.0 >'1. with  r?) is asymptotically nonnal.2.
. has been studied extensively and it has been shown (e.h(p)) is closely approximated by the NCO.p).a)% confidence interval of fixed length.mil £ ~ hey) = h(rn) and (b) so that . . (1 _ p2)2).3(h(r) . Suppose Y 1. .~a)/vn3} where tanh is the hyperbolic tangent. David.8. Theorem 5.3. . 1938) that £(vn . we see (Problem 5.3.h(p)J !:.3. .3[h(c) . .9) Refening to (5. ': I = Ilgki(rn)11 pxd. Then ~.5 (a) ! .3.Y) ~ N(/li. I Y n are independent identically distributed d vectors with ElY Ii 2 < 00. that is. c E (1.1'2.: ! + h(i)(rn)(y .3.P The approximation based on this transfonnation.3.24) I Proof. j f· ~.l) distribution.rn) + o(ly .hp )andhhasatotaldifferentialh(1)(rn) f • . .fii(r . (5. .3.1).1) is achieved hy choosing h(P)=!log(1+ P ). Argue as before using B. . (5.. N(O. ! f This expression provides approximations to the critical value of tests of H : p = O.g.rn) N(o..u 2.p) !:. . EY 1 = rn. .UI..3. Asymptotic Approximations Chapter 5 = 4p2(1_ p2)'.h(p)]).19).320 When (X.4. N(O. 2 1.fii(Y .10) that in the bivariate nonnal case a variance stahilizing transformation h(r) with . . and (Prohlem 5. and it provides the approximate 100(1 .h= (h i . then u5 .23) • . E) = . Var Y i = E and h : 0 ~ RP where 0 is an open subset ofRd.fii[h(r) . p=tanh{h(r)±z (1. per < c) '" <I>(vn . o Here is an extension of Theorem 5. it gives approximations to the power of these tests. which is called Fisher's z.
3. > 0 and EXt < 00. only that they be i. check the entries of Table IV against the last row.211 = 0. 00.:l ~ m k+m L X. 1 Xl has a x% distribution. which is labeled m = 00. if k = 5 and m = 60..3. Then. where Z ~ N(O.' = Var(X. Suppose for simplicity that k = m and k . Then according to Corollary 8.m distribution can be approximated by the distribution of Vlk.37] = P[(vlk) < 2.m  (11k) (11m) L.26) l. Suppose that Xl. when k = 10.1.k.15.05 quantiles are 2. when the number of degrees of freedom in the denominator is large. But E(Z') = Var(Z) = 1.d.21 for the distribution of Vlk. Suppose that n > 60 so that Table IV cannot be used for the distribution ofTk. the mean ofaxi variable is E(Z2).h(m)) = y'nh(l)(m)(Y  m) + op(I). Next we turn to the normal approximation to the distribution of Tk.1 in which the density of Vlk. Thus. EX.7. i=k+1 Using the (b) part of Slutsky's theorem. where V .m.m' To show this.m distribution.0 distribution and 2.3.1. . I).= density.l) distribution. (5. we can write.. For instance. xi and Normal Approximation to the Distribution of F Statistics. We write T k for Tk. > 0 is left to the problems.3. gives the quantiles of the distribution of V/k.i. we first note that (11m) Xl is the average ofm independent xi random variables. When k is fixed and m (or equivalently n = k + m) is large.k+m L::l Xl 2 (5. We do not require the Xi to be normal.i=k+l Xi "...Section 5.1.). D Example 5.and HigherOrder Asymptotics: The Delta Method with Applications 321 (c) jn(h(Y) .x%. is given as the :FIO . 14.7) implies that as m t 00.m. By Theorem B. To get an idea of the accuracy of this approximation. the :Fk. we can use Slutsky's theorem (A. See also Figure B.37 for the :F5 .3 First. Now the weak law of large numbers (A. as m t 00.0 < 2.25) has an :Fk. where k + m = n.~1. the:F statistic T _ k. . By Theorem B. This row.3.3... then P[T5 .Xn is a sample from a N(O. we conclude that for fixed k.9) to find an approximation to the distribution of Tk. L::+. with EX l = 0.:: .05 and the respective 0. The case m = )'k for some ).3. if .
l (: An interesting and important point (noted by Box.'. i = 1. Thus (Problem 5.3. We conclude that xl jer}. v) ~ ~.m} t 00. .8(d)). • • 1)/12) . _. the upper h.28) . Specifically.a) '" 1 + is asymptotically incorrect. When it can be eSlimated by the method of moments (Problem 5.2Var(Yll))' In particular if X. X n are a sample from PTJ E P and ij is defined as the MLE 1 . 2) distribution. 1)T.I. Suppose P is a canonical exponential family of rank d generated by T with [open.3. . Then if Xl.3 Asymptotic Normality of the Maximum likelihood Estimate in Exponential Families Our final application of the 8method follows. By Theorem 5.3. 1'.o) k) K (5.3. 1'. v'n(Tk . In general.»)T and ~ = Var(Yll)J. the F test for equality of variances (Problem 5.a). a'). if it exists and equal to c (some fixed value) otherwise..)/ad 2 . i. :.3. (5.k (t 1 (5.. In general (Problem 53.)T.m 1) can be ap proximated by a N(O. 4).7).8(c» one has to use the critical value .3. v) identity.m(1. which by (5. when Xi ~ N(o. 1953) is that unlike the t test.28) satisfies Zto '" 1 I I:. where J is the 2 x 2 v'n(Tk 1) ""N(0.k(Tk. . h(i)(u. I' Theorem 5.3.1)/12 j 2(m + k) mk Z. !1 .27) where 1 = (1.m 1) <) :'. and a~ = Var(X.8(a» does not have robustness of level. 322 Asymptotic Approximations Chapter 5 where Yi1 = and Yi2 = Xf+i/a2.5. . when rnin{k. ~ Ck. the distribution of'.' !k . j m"': k (/k. E(Y i ) ~ (1.29) is unknown. as k ~ 00. ~ N (0.). a'). .4. 0 ! j i · 1 where K = Var[(X. Equivalently Tk = hey) where Y i ~ (li" li..m = (1+ jK(:: z..m critical value fk. ifVar(Xf) of 2a'. k.3. :f. i. ! i l _ .k(tl)] '" 'fI() :'.1) "" N(O. I)T and h(u.m(1 .3. 1 5.m(l. P[Tk._o l or ~.). = (~. = E(X.3.k(T>.m • < tl P[) :'.a) .
Now Vii11.2. I'.. Ih our case.71) for any nnbiased estimator ij.1.  (TIl' .3 First. (i) follows from (5.2 that. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.l4..2 where 0" = T.T.3.ift = A(T/). are sufficient statistics in the canonical model. .Nd(O.) . = A by definition and.and HigherOrder Asymptotics: The Delta Method with Applications 323 (i) ii = 71 + ~ I:71 A •• • I (T/)(T(X.8.3.Section 5.3.32) Hence. 0'). T/2 By Theorem 5.4.A 1(71))· Proof.3.3.3.In(ij .4 with AI and m with A(T/). thus.2.5.3. of Theorems 5. The result is a consequence. Recall that A(T/) ~ VarT/(T) = 1(71) is the Fisher information. therefore.'l 1 (ii) LT/(Vii(iiT/)). = (5. Note that by B.38) on the variance matrix of y'n(ij .3. by Corollary 1. Thus. in our case. Thus. if T ~ I:7 I T(X. hence..31) .23).A(T/)) + oPT/ (n . = 1/20'.N(o.3.I(T/) (5. o Remark 5. We showed in the proof ofTheorem 5.EX. and.).". Example 5.i.24).d. where.3.3.30) But D A .1. PT/[T E A(E)] ~ 1 and. Then T 1 = X and 1 T2 = n.O. = 1/2. and 1).2.6.2 and 5. For (ii) simply note that. Let Xl>' .8.33) 71' 711 Here 711 = 1'/0'. as X witb X ~ Nil'. X n be i. (I" +0')] .11) eqnals the lower bound (3.4.3. PT/[ii = AI(T)I ~ L Identify h in Theorem 5.1. iiI = X/O". the asymptotic variance matrix /1(11) of . (5. (ii) follows from (5. by Example 2. (5.4.
i In this section we define and study asymptotic optimality for estimation. . The moment and in law approximations lead to the definition of variance stabilizing transfonnations for classical onedimensional exponential families. .3. il' . N = (No. We begin in Section 5. taking values {xo. Nk) where N j . .33) and Theorem 5.. ..L~ I l(Xi = Xj) is sufficient.7). Assume A : 0 ~ pix. 8 open C R (e.".3. Firstorder asymptotics provides approximations to the difference between a quantity tending to a limit and the limit. .Pk) where (5.1. .1 by studying approximations to moments and central moments of estimates. and PES. 5. These stochastic approximations lead to Gaussian approximations to the laws of important statistics. and confidence bounds.: CPo.4 and Problem 2. .i.4.d. the (k+ I)dimensional simplex (see Example 1. i 7 •• . and il' = T.. see Example 2. Thus.d. 0.15). .') N(O. Secondorder asyrnptotics provides approximations to the difference between the error and its firstorder approximation. I . Consistency is Othorder asymptotics. Finally. 0 .g.4. Eo) .6. testing.1) i . . sampling. Y n are i.4 to find (Problem 5. 5.4 ASYMPTOTIC THEORY IN ONE DIMENSION I: I " ! .s where Eo = diag(a 2 )2a 4 )...324 Asymptotic Approximations Chapter 5 Because X = T. we can use (5.. when we are dealing with onedimensional smooth parametric models. .3. which lead to a result on the asymptotic nonnality of the MLE in multiparameter exponential families. EY = J. . the difference between a consistent estimate and the parameter it estimates.)'. I I I· < P.d. Higherorder approximations to distributions (Edgeworth series) are discussed briefly. Summary. These "8 method" approximations based on Taylor's fonnula and elementary results about moments of means of Ll.1. under Li. Xk} only so that P is defined by p ... Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean. is twice differentiable for 0 < j < k.L. In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families. .. .P(Xk' 0)) : 0 E 8}.3. Following Fisher (1958)'p) we develop the theory first for the case that X""" X n are i. and h is smooth.(T. . for instance. as Y.1 Estimation: The Multinomial Case . 0).. .. and so on. .d.i. We focus first on estimation of O.26) vn(X /".h(JL) where Y 1. We consider onedimensional parametric submodels of S defined by P = {(p(xo. variables are explained in tenns of similar stochastic approximations to h(Y) . 0). J J . stochastic approximations in the case of vector statistics and parameters are developed. . I i I . < 1.
Many such h exist if k 2. if A also holds.4 Asymptotic Theory in One Dimension 325 Note that A implies that k [(X I . Moreover.11)) of (J where h: S satisfies ~ R h(p(8» = 8 for all 8 E e > (5.1.4) g.4.p(Xk.2(0.1.2). Then we have the following theorem.4.4. Consider Example where p(8) ~ (p(xo. 88 (Xl> 8) and =0 (5..4.7) where .l (Xl. h) with eqUality if and only if. (0) 80(Xj. 8) is similarly bounded and well defined with (5.. < k.4.1.11).8).2 (8. for instance.4.:) (see (2. Next suppose we are given a plugin estimator h (r. fJl E. bounded random variable (5. . (5.8) logp(X I .Jor all 8.3) Furthermore (See'ion 3.2) is twice differentiable and g~ (X I) 8) is a well~defined. pC') =I 1 m . h) is given by (5. (5.4.8)1(X I =Xj) )=00 (5.4.6) 1. Assume H : h is differentiable. > rl(O) M a(p(8)) P... 0 <J (5.4.0).5) As usual we call 1(8) the Fisher infonnation.Section 5.4.8) ~ I)ogp(Xj.9) .4. Under H. Theorem 5. 8))T. .4.8) .
4. Noting that the covariance of the right' anp lefthand sides is a(8)...326 Asymptotic Approximations Chapter 5 Proof.8) ) +op(I). but also its asymptotic variance is 0'(8.6). whil. we s~~ iliat equality in (5.4. • • "'i.12).h)Var.13). I (x j. (5. h). h k &h &pj(p(8)) (N p(xj.15) i • . which implies (5..4. h) = I .4.9). I h 2 (5. 2: • j=O &l a:(p(8))(I(XI = Xj) .p(xj.8) with equality iff. By (5.. using the correlation inequality (A.4. we obtain 2: a:(p(8)) &0(xj.4. (5.10) Thus. &8(X.8)).2 noting that N) vn (h (. 0) = • &h fJp (5.4.10).8) = 1 j=o PJ or equivalently. : h • &h &Pj (p(8)) (N . for some a(8) i' 0 and some b(8) with prob~bility 1.l6) as in the proof of the information inequality (3.11) • (&h (p(O)) ) ' p(xj. 0)] p(Xj.8)) = a(8) &8 (X" 8) +b(O) &h Pl (5.4.. by (5.p(x). (5.4. we obtain &l 1 <0 . kh &h &Pj (p(8))(I(Xi = Xj) .8) ) ~ .8) gives a'(8)I'(8) = 1. Apply Theorem 5.I n.8) j=O 'PJ Note that by differentiating (5. (8. ( = 0 '(8.p(xj. by noting ~(Xj.4. h(p(8)) ) ~ vn Note that.4.h)I8) (5.16) o .4. 8). using the definition of N j .8)&Pj 2: • &h a:(p(0))p(x. ~ir common variance is a'(8)I(8) = 0' (0.12) [t.4.14) .II.3. not only is vn{h (.13) I I · . Taking expectations we get b(8) = O.')  h(p(8)) } asymptotically normal with mean 0./: .
Example 5. Xl. 5.4. : E Ell.4.2 Asymptotic Normality of Minimum Contrast and MEstimates o e o We begin with an asymptotic normality theorem for minimum contrast estimates.0 (5..o(p(X"O)  p(X"lJo )) .18) and k h(p) = [A]I LT(xj)pj )". p) and HardyWeinberg models can both be put into this framework with canonical parameters such as B = log ( G) in the first case. As in Theorem 5.. Suppose p(x. Write P = {P. . 0 E e..::7 We shall see in Section 5.). Let p: X x ~ R where e D(O. Note that because n N T = n .Section 5A Asymptotic Theory in One Dim"e::n::'::io::n ~_ __'3::2:.1.A(O)}h(x) where h(x) = l(x E {xo..3) L.0)=0.2.i.0)) ~N (0.xd). the MLE of B where It is defined implicitly ~ by: h(p) is the value of O.. is a canonical oneparameter exponential family (supported on {xo.(vn(B .4.d. In the next two subsections we consider some more general situations.8) is..5 applies to the MLE 0 and ~ e is open. achieved by if = It (r. 0) and (ii) solvesL::7~ONJgi(Xj.17) L. 0 Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena. Then Theo(5. . . J(~)) with the asymptotic variance achieving the infonnation bound Jl(B). OneParameter Discrete Exponential Families. .8) = exp{OT(x) . then. E open C R and corresponding density/frequency functions p('. Xk}) and rem 5. which (i) maximizes L::7~o Nj logp(xj.3 that the information bound (5.. .3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check.4.3..19) The binomial (n.I L::i~1 T(Xi) = " k T(xJ )".4. .3. Suppose i.Xn are tentatively modeled to be distributed according to Po... by ( 2."j~O (5. 0).Oo) = E.4. if it exists and under regularity conditions.4.
p(x. On is consistent on P = {Pe : 0 E e}. 0) is differentiable. Under AOA5.p(X" On) n = O. A4: sup. J is well defined on P. < co for all PEP.O(P)) +op(n.. i' On where = O(P) +.p(x.. O(P))) .20) In what follows we let p.p E p 80 (X" O(P») A3: .21) i J A2: Ep. O)ldP(x) < co. We need only that O(P) is a parameter as defined in Section 1.4. j. parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for.!:. O(P» / ( Ep ~~ (XI. 1 n i=l _ .' '. ~ (X" 0) has a finite expectation and i ! . • is uniquely minimized at Bo. .4. as pointed out later in Remark 5.4. PEP and O(P) is the unique solution of(5. L.=1 n ! (5.t) .p(.22) J i I .LP(Xi.p(x. P) = .3. Let On be the minimum contrast estimate On ~ argmin .1.1.O(P)I < En} £. rather than Pe.0(P)) l..p = Then Uis well defined. That is. O)dP(x) = 0 (5. Theorem 5. (5. 0 E e. ~I AS: On £. As we saw in Section 2.1/ 2) n '. o if En ~ O.4. # o. 8.2. under regularity conditions the properties developed in this section are valid for P ~ {Pe : 0 E e}. hence. .p(Xi. ~i  i . f 1 . .~(Xi.. n i=l _ 1 n Suppose AO: .21) and. That is.4.23) I.O).O(p)))I: It . { ~ L~ I (~(Xi. (5. denote the distribution of Xi_ This is because.pix. O(P). .4. Suppose AI: The parameter O(P) given hy the solution of . . O(Pe ) = O.4.p2(X.L .328 Asymptotic Approximations Chapter 5 .
1j2 ).4.4.O(P)) ~ . .O(P)) = n.p) < 00 by AI.Ti(en . we obtain. (iJ1jJ ) In (On . t=1 1 n (5.24) follows from the central limit theorem and Slutsky's theorem.L 1jJ(X" 8(P)) = Op(n.29) .4.20).4.20) and (5. and A3.4 Asymptotic Theory in One Dimension 329 Hence.4.O(P)) 2' (E '!W(X. A2. where (T2(1jJ.22) because .27) we get. applied to (5._ .O(P))) p (5..425)(5.' " 1jJ(Xi . (5.25) where 18~  8(P)1 < IiJn .21).O(P)I· Apply AS and A4 to conclude that (5. By expanding n.4.On)(8n . en) around 8(P).27) Combining (5. O(P)).4..28) .4.4.O(P)) Ep iJO (Xl.. Let On = O(P) where P denotes the empirical probability.4.4. using (5.O(P)) / ( El' ~t (X" O(P))) o while E p 1jJ2(X"p) = (T2(1jJ.24) proof Claim (5.22) follows by a Taylor expansion of the equations (5. Next we show that (5.l   2::7 1 iJ1jJ. n.8(P)) I n n n~ t=1 n~ t=1 e (5.' " iJ (Xi . 8(P)) + op(l) = n ~ 1jJ(Xi .p) = E p 1jJ2(Xl . 1 1jJ(Xi .4.lj2 and L 1jJ(X" P) + op(l) i=1 n EI'1jJ(X 1. But by the central limit theorem and AI.Section 5.26) and A3 and the WLLN to conclude that (5.4.
2.4.4. n I _ . A4 is found. en j• . =0 (5. Remark 5.4.30) Note that (5.O) = O.Eo (XI.d..8).4.4.(x) (5.4. A2. O(P)) + op (I k n n ?jJ(Xi . This is in fact truesee Problem 5. . X n are i.O).4.4. 0) is replaced by Covo(?jJ(X" 0). and we define h(p) = 6(xj )Pj. Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i. I 'iiIf 1 I • I I . Our arguments apply even if Xl. 0) = 6 (x) 0. we see that A6 corresponds to (5. O)dp.30) is formally obtained by differentiating the equation (5. written as J i J 2::.30) suggests that ifP is regular the conclusion of Theorem 5. Suppose lis differen· liable and assume that ! • &1 Eo &O(X I . Solutions to (5. .22) follows from the foregoing and (5. This extension will be pursued in Volume2.4. (2) O(P) solves Ep?jJ(XI.?jJ(XI'O)). for A4 and A6.28) we tinally obtain On . Conditions AO. {xo 1 ••• 1 Xk}. 0) Covo (:~(Xt.2 is valid with O(P) as in (I) or (2).. Remark 5. essentially due to Cramer (1946). 0) = logp(x. An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po.4. e) is as usual a density or frequency function.4. O(P)) ) and (5.! . A6: Suppose P = Po so that O(P) = 0. O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I . and that the model P is regular and letl(x.4.29).20) are called M~estimates.'. B) where p('.1. 0) l' P = {Po: or more generally.4. If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x.=0 ?jJ(x. 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5.2..31) for all O.4.e. Z~ (Xl. B» and a suitable replacement for A3.4. If further Xl takes on a finite set of values. for Mestimates. 1 •• I ! .4.2 may hold even if ?jJ is not differentiable provided that . Identity (5. that 1/J = ~ for some p).i. it is easy to see that A6 is the same as (3.3.1.: 1. • Theorem 5. We conclude by stating some sufficient conditions.21). O)p(x. AI. and A3 are readily checkable whereas we have given conditions for AS in Section 5.12). Our arguments apply to Mestimates. . P but P E a}.O)?jJ(X I . I o Remark 5. 0 • .4.O(P) Ik = n n ?jJ(Xi .
p(x. 0) obeys AOA6.4. satisfies If AOA6 apply to p(x. That is.0'1 < J(O)} 00. .4.. We also indicate by example in the problems that some conditions are needed (Problem 5. 0) and .O) 10gp(x. In this case is the MLE and we obtain an identity of Fisher's. s) diL(X )ds < 00 for some J ~ J(O) > 0.eo (Xl.O) = l(x.1). then the MLE On (5. where EeM(Xl.4.p = a(0) g~ for some a 01 o. 0') .34) Furthermore.4.:d in Section 3.4. then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (.p.. 0) .O). We can now state the basic result on asymptotic normality and e~ciency of the MLE. Theorem 5.20) occurs when p(x.4.4) but A4' and A6' are not necessary (Problem 5.4. w.35) with equality iff. = en en = &21 Ee e02 (Xl.8) is a continuous function of8 for all x. < M(Xl.33) so that (5.O) = l(x.32) where 1(8) is t~e Fisher information introduq. 0') is defined for all x." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems.4. B).3.4. 0) (5.O) and P ~ ~ Pe. Pe) > 1(0) 1 (5.01 < J( 0) and J:+: JI'Ii:' (x.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5. "dP(x) = p(x)diL(x).4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI. (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl.Section 5. 10' . where iL(X) is the dominating measure for P(x) defined in (A. AOA6. 5. 0) < A6': ~t (x.. lO.4. 0) g~ (x.
00. I I Example 5. 0 because nIl' .3 is not valid without some conditions on the esti· mates being considered. and. .37) . PoliXI < n1/'1 . for higherdimensional the phenomenon becomes more disturbing and has important practical consequences. PIIZ + .4. 0 e en e. 0 1 .A'(B).. thus. (5..".4.1). . }. We discuss this further in Volume II.4.30) and (5.4. .nB) .• •. However. The optimality part of Theorem 5.. PoliXI < n 1/4 ] . . B) with J T(x) .36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!. Consider the following competitor to X: B n I 0 if IXI < n.4.3 generalizes Example 5. p.. . Therefore. Then .B).i.i .4. 1998.<1>( _n '/4 . . see Lehmann and Casella.4. I.2.. N(B. . claim (5.38) Therefore. 1 .35) is equivalent to (5. Let Z ~ N(O. We next compule the limiting distribution of . 1 I I.35).'(0) = 0 < l(~)' The phenomenon (5. 5.39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo).l'1 Let X" . if B I' 0. The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(. 442.. By (5.. (5. {X 1.. 1.. . If B = 0.4 Testing ! I i . .4.4. . .ne . We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n.4.r 332 Asymptotic Approximations Chapter 5 Proof Claims (5.. and Polen = OJ .. cross multiplication shows that (5.n(Bn . .1/4 I (5. .nBI < nl/'J <I>(n l/ ' .4.36) Because Eo 'lj.4.4..4. 1.4...(X B). For this estimate superefficiency implies poor behavior of at values close to 0.'(B) = I = 1(~I' B I' 0.2. for some Bo E is known as superefficiency..4..1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise.1/4 X if!XI > n. we know that all likelihood ratio tests for simple BI e e eo.Xn be i. Then X is the MLE of B and it is trivial to calculate [( B) _ 1. B) = 0.nB).439) where . B) is an MLR family in T(X).33) and (5. PolBn = Xl . 1 . 1).34) follow directly by Theorem 5. j 1 " Note that Theorem 5.. Hodges's Example.d..1 once we identify 'IjJ(x.
[Bn > c(a.4.46) PolBn > cn(a. If pC e) is a oneparameter exponential family in B generated by T(X).:cO"=~ ~ 3=3:. The test is then precisely. Xl. 1) distribution.Oo)] PolOn > cn(a.00) other hand.4..0)]. Then ljO > 00 .(a.::.(a. B E (a.r'(O)) where 1(0) (5. 00) .40) > Ofor all O. Thus. Then c. e e e e..Oo)] = a and B is the MLE n n of We will use asymptotic theory to study the behavior of this test when we observe ij.Section 5.. .41) where Zlct is the 1 . a < eo < b.4. (5. < >"0 versus K : >. • • where A = A(e) because A is strictly increasing. and then directly and through problems exhibit other tests wi!. "Reject H for large values of the MLE T(X) of >.4 Asymptotic Theo:c'Y:c. ~ 0. 0 E e} is such thot the conditions of Theorem 5. > AO.2 apply to '0 = g~ and On. ljO < 00 . Theorem 5.m='=":c'.4.0:c":c'=D.a quantile o/the N(O. On the (5.4.45) which implies that vnI(Oo)(c.:c".I / 2 ) (5. eo) denote the critical value of the test using the MLE en based On n observations. .43) Property (5. (5. are of the form "Reject H for T( X) large" with the critical value specified by making the probability of type I error Q at eo.4. Zla (5. this test can also be interpreted as a test of H : >.22) guarantees that sup IPoolvn(Bn . PoolBn > 00 + zl_a/VnI(Oo)] = POo IVnI(Oo) (Bn  00) > ZI_a] ~ a. 00 )]  ~ 1. 00) .4.00) > z] . That is..(a.. X n distributed according to Po.42)  ~ o.. 00)] = PolvnI(O)(Bn  0) > VnI(O)(cn(a. Suppose the model P = {Po ..4.00) > z] ~ 11>(z) by (5.4.44) But Polya's theorem (A..4. .:. (5. Proof.h the same behavior. [o(vn(Bn 0)) ~N(O. PolO n > c. derive an optimality property.42) is sometimes called consistency of the test against a fixed alternative. "Reject H if B > c(a.4.Oo) = 00 + ZIa/VnI(Oo) +0(n.(1 1>(z))1 ~ 0. b).40).3 versus simple 2 .d. and (5.l4. 00 )" where P8 . 1 < z . Let en (0:'. as well as the likelihood ratio test for H versus K..4. The proof is straightforward: PoolvnI(Oo)(Bn . the MLE. It seems natural in general to study the behavior of the test.41) follows.. Suppose (A4') holds as well as (A6) and 1(0) < oofor all O.4.
00 if8 > 80 0 and .k'Pn(X1.4. 80 )  8) . (5.4.5.[. the test based on 8n is still asymptotically MP.jI(80)) ill' < O.8) < z] .50). ~! i • Note that (5. then (5.4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to Ct from the left at 00 and continues rising steeply to 1 to the right of 80 . j .···.   j Proof.jnI(8)(80 . the power of tests with asymptotic level 0' tend to 0'.48) and (5. < 1 lI(Zl_a ")'.4.50) can be interpreted as saying that among all tests that are asymptotically level 0' (obey (5. In either case.jnI(8)(Cn(0'. iflpn(X1 .")' ~ "m(8  80 ). Suppose the conditions afTheorem 5..(80) > O.51) I 1 I I .42) and (5. then by (5. ~ " uniformly in 'Y.4. Claims (5.[.4.4.40) hold uniformly for (J in a neighborhood of (Jo.50)) and has asymptotically smallest probability of type I error for B < Bo.jnI(8)(80 . LetQ) = Po. o if "m(8 . (3) = Theorem 5.jnI(O)(Bn .J 334 Asymptotic Approximations Chapter 5 By (5.48) .jnI(8)(Cn(0'. .48).4.80 ) tends to zero. .jI(80)) ill' > 0 > llI(zl_a ")'. f.50) i. these statements can only be interpreted as valid in a small neighborhood of 80 because 'Y fixed means () + B . X n ) is any sequence of{possibly randomized) critical (test) functions such that .jnI(8)(Bn .47) lor. (5. Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ")' "m( 8 .43) follow.4. . On the other hand.49)) the test based on rejecting for large values of 8n is asymptotically uniformly most powerful (obey (5.8 + zl_a/. Write Po[Bn > cn(O'.jnI(80) + 0(n. 1 ~. assume sup{IP. 80)J 1 P. (5. That is. the power of the test based on 8n tends to I by (5.80 1 < «80 )) I J .. Theorem 5.4. In fact. I.80) tends to infinity.jnI(80) +0(n. . .8 + zl_o/.4.8) + 0(1) . j (5.80) ~ 8)J Po[. then limnE'o+. I .4.4..8) > .8) > .jnI(8)(Bn .2 and (5. 00 if 8 < 80 .  i. X n ) i.(1 lI(z))1 : 18 . 0.4. Furthennore. .017".41).4. .4.4.4.49) ! i .1/ 2 ))]. If "m(8 .jnI(8)(80 .80 ).1 / 2 )) .4..
.4. hence.8) 1. . .jI(Oo)) + 0(1)) and (5.4 Asymptotic Theory in One Dimension 335 If.. note that the Neyman Pearson LR test for H : 0 = 00 versus K : 00 + t.4. 0 The asymptotic!esults we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n.. . 0 n p(Xi . . .53) is Q if. ~ .53) tends to the righthand side of (5. k n (80 ' Q) ~ if OWn (Xl. 0 (5. is asymptotically most powerful as well. These are the likelihood ratio test and the score or Rao test. > dn (Q. It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form "Reject if L log[p(X" 8n )!p(X i=1 n i .4. for Q < ~.8) that. . hand side of (5.Xn) is the critical function of the LR test then.)] ~ L (5. is O. There are two other types of test that have the same asymptotic behavior.4. 00 + E) logPn(X l E I " . .54) Assertion (5.<P(Zla(1 + 0(1)) + .5 in the future will be referred to as a Wald test. X.. " Xn.53) where p(x. The test li8n > en (Q.[logPn( Xl. 00)]1(8n > 80 ) > kn(Oo..4.4. P. 1. € > 0 rejects for large values of z1.O+.4.jn(I(8o) + 0(1))(80 . 0) denotes the density of Xi and dn .50) and. 80)] of Theorems 5.4. . X n .j1i(8  8 0 ) is fixed...<P(Zl_a .7).. Llog·· (X 0) =dn (Q.4 and 5.' L10g i=1 p (X 8) 1.8 0 ) ..4..4. Finally. if I + 0(1)) (5.4....54) establishes that the test 0Ln yields equality in (5.OO+7n) +EnP." + 0(1) and that It may be shown (Problem 5. Further Taylor expansion and probabilistic arguments of the type we have used show that the righthand side of (5.50) note that by the NeymanPearson lemma. i=1 P 1. Q). 8 + fi) 0 P'o+:.80 ) < n p (Xi.. .o + 7. t n are uniquely chosen so that the right.OO)J . The details are in Problem 5.4.4.5.Xn ) = 5Wn (X" .Xn) is the critical function of the Wald test and oLn (Xl.48) follows.52) > 0. To prove (5.4.?. [5In (X" .50) for all. for all 'Y.4.Section 5. .. 1(8) = 1(80 + + fi) ~ 1(80) because our uniformity assumption implies that 0 1(0) is continuous (Problem 5. Thus.
336
Asymptotic Approximations
Chapter 5
where PlI(X 1, ... ,Xn ,8) is the joint density of Xl, ... ,Xn . Fort: small. n fixed, this is approximately the same as rejecting for large values of a~o logPn(X 11 • • • 1 X n ) eo).
,
• ,
The preceding argument doesn't depend on the fact that Xl" .. , X n are i.i.d. with common density or frequency function p{x, 8) and the test that rejects H for large values of a~o log Pn (XI, ... ,Xn , eo) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming
"Reject H iff
t
i=I
iJ~
: ,
logp(X i ,90) > Tn (a,9 0)."
0
I I
(5.4.55)
It is easy to see (Problem 5.4.15) that
Tn(a, 90 ) = Z10 VnI(90) + o(n I/2 )
and that again if G (Xl, ... , X n ) is the critical function of the Rao test then
nn
,
•
.,
,•
Po,+' WRn (X t, ... ,Xn ) = Ow n(X t , ... ,Xn)J ~ 1, rn
(5.4.56)
(Problem 5.4.8) and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(90 ), which d' may require numerical integration, can be replaced by _n l d021n(Bn) (Problem 5.4.10).
5.4,5
Confidence Bounds
Q
We define an asymptotic Levell that
lower confidence bound (LCB) On by the requirement
(5.4.57)
r I I i ,
., '
.'
for all () and similarly define asymptotic level!  a DeBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:
(i) By using a natural pivot.
, f.
(
.,
(ii) By inverting the testing regions derived in Section 5.4.4.
,. "',
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (AO)(A6), (A4'), and I(9) finite for all it follows (Problem 5.4.9) that
e.
Co(
V n  e)) ~ N(o, 1) nI(lin)(li
Z1a/VnI(lin ).
(5.4.58)
for all () and, hence. an asymptotic level!  a lower confidence bound is given by
9~ = lin e~,
(5.4.59)
Turning tto method (ii), inversion of 8Wn gives fonnally
= inf{9: en(a, 9) > 9n }
(5.4.60)
,
=
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
337
or if we use the approximation C (0, e) ~ n
e+ zlQ/vnI(iJ), (5,4,41),
e) > en}'
~~, = inf{e , Cn(C>,

(5,4,61)
In fact neither e~I' or e~2 properly inverts the tests unless cn(Q, e) and Cn (Q, e) are increasing in The three bounds are different as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, e~l is preferable because this bound is not only approximately but genuinely level 1 Q. But computationally it is often hard to implement because cn(Q, 0) needs, in general, to be computed by simulation for a grid of values. Typically, (5.4.59) or some equivalent alternatives (Problem 5,4,10) are preferred but Can be quite inadequate (Problem 5,4,1 I), These bounds e~, O~I' e~2' are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5,4,12 and 5,4,13),
e.
e
Summary. We have defined asymptotic optimality for estimates in oneparameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of in a onedimensional subfamily of the multinomial distributions, showed that the MLE fonnally achieves this bound, and made the latter result sharp in the context of oneparameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M estimates, generalizations of the MLE, along the lines of Huber (1967), The asymptotic formulae we derived are applied to the MLE both under the mooel that led to it and tmder an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980),
e
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as n t 00 in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4, i.i.d. observations from a regular madel in which is open C R or = {e 1 , , , , , e,} finite, and e is identifiable, Most of the questions we address and answer are under the assumption that fJ = 0, an arbitrary specified value, or in frequentist tenns, that 8 is true.
e
e
Consistency The first natural question is whether the Bayes posterior distribution as n + 00 concentrates all mass more and more tightly around B. Intuitively this means that the data that are coming from Po eventually wipe out any prior belief that parameter values not close to are likely, Formalizing this statement about the posterior distribution, II(· I X It •.• , X n ), which is a functionvalued statistic, is somewhat subtle in general. But for = {O l ,' .. , Ok} it is
e
e
i
338
straightforward. Let
Asymptotic Approximations Chapter 5
I.
11(8 i XI,···, Xn)
Then we say that II(·
=PIO = 8 I Xl,···, Xn].
(5.5.1)
e, P,li"(8 I Xl, ... , Xn)  11 > ,] ~ 0 for all f. > O. There is a slightly stronger definition: rIC I XI,' .. ,Xn ) is a.S. iff for all 8 E e, 11(8 I Xl, ... , Xn) ~ 1 a.s. P,.
is consistent iff for all 8 E
General a.s. consistency is not hard to formulate:
I Xl, ... , Xn)
(5.5.2)
consistent
(5.5.3)
11(· I X), ... , Xn)
=}
OJ'} a.s. P,
(5.5.4)
where::::} denotes convergence in law and <5{O} is point mass at satisfactory result for finite.
e
e.
There is a completely
, ,
,
Theorem 5.5.1. Let 1rj  p[e = Bj ], j = 1, ... 1 k denote the prior distribution 0/8. Then II(· I Xl, ... ,Xn ) is consistent (a.s. consistent) iff 7fj > afor j = I, ... , k.
Proof. Let p(., B) denote the frequency or limit j function of X. The necessity of the condition is immediate because 1["] = 0 for some j implies that 1f(Bj I Xl, ... ,Xn ) = 0 for all Xl, .. . , X n because, by (1.2.8),
.,
,J,
11(8j
I Xl, ... ,Xn)
PIO = 8j I Xl, ... ,Xn ] 11j Ir~l p(Xi , 8il
L:.~l 11. ni~l p(Xi , 8.)
k
n .
(5.5.5)
,
,
, ,,
Intuitively, no amount of data can convince a Bayesian who has decided a priori that OJ is impossible. On the other hand, suppose all 71" j are positive. If the true is (J j or equivalently 8 = (J j, then
e
log
11(8.IXl, ... ,Xn) =n 11(8 j I X), ... ,Xn)
(11og+ L.. og P(Xi,8.)) . 11. 1{f.,1 n
7fj
n i~l
p(Xi ,8j)
,
By the weak (respectively strong) LLN, under POi'
.,i
1{f.,log p(Xi ,8.)  Lni~l
p(Xi ,8j )
+
E
OJ
(I
I
P(XI ,8.)) og p(X I ,8j )
i
I
in probability (respectively a.s.). But Eo;
(log: ~:::;))
~
< 0, by Shannon's inequality, if
Ba
• .,
=I=
Bj' Therefore,
11(8.IXI, ... ,Xn) 1 og 11(8j I X), ... , X n )
00
,
in the appropriate sense, and the theorem follows.
o
i ~,
h
_
Section 55
Asymptotic Behavior and Optimality of the Posterior Distribution
339
e
Remark 5.5.1. We have proved more than is stated. Namely. that for each I XI, . .. ,Xn ]  a exponentially.
e E e. Po[O =l0
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution
Under conditions AOA6 for p(x, B) that if B is the MLE,

= lex, Bj
=logp(x, B), we showed in Section 5.4
(5.5.6)
Ca(y'n(e  B» ~ N(O,rl(B)).
Consider C( ..;ii((}  B) I Xl, ... , X n ), the posterior probability distribution of y'n((} B( Xl, ... , X n )), where we emphasize that (j depends only on the data and is a constant given XI, ... , X n . For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in Po probability by convergence a.s. p•. We also add,



A7: For all (), and all 0> o there exists t(o,(})
p. [sup
> 0 such that
{~ t[I(Xi,B') /(Xi,B)]: 18'  BI > /j} < '(0, B)] ~ I.
e such that 1r(') is continuous and positive
AS: The prior distribution has a density 1f(') On at all B. Remarkably,
Theorem 55.2 (UBernsteinlvon Mises"). If conditions ADA3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold. then
C(y'n((}(})
I X1, ... ,Xn )
~N(O,l
1
(B»)
(5.5.7)
a.s. under P%ralle.
We can rewrite (5.5.7) more usefully as
sup IP[y'n((}  e) < x I Xl, ... , X n]  of>(xVI(B»)j ~ 0
x
(5.5.8)
for all a.s. Po and, of course, the statement holds for our usual and weaker convergence in Po probability also. From this restatement we obtain the important corollary. Corollary 5.5.1. Under the conditions of Theorem 5.5.2,
e
sup IP[y'n(O  e) < x j Xl, ... , XnJ  of>(xVl(e)1
x
~0
(5.5.9)
a.s. P%r all B.
1 , ,
340
Asymptotic Approximations Chapter 5
Remarks
(I) Statements (5,5.4) and (5,5,7)(5,5,9) are, in fact, frequentist statements about the
asymptotic behavior of certain functionvalued statistics.
(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in P, probability if A4 and
A5 are used rather than their strong formssee Problem 5.5.7.
(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and
identifiability guarantees consistency of Bin a regular model.

Proof We compute the posterior density of .,fii(O  B) as
(5.5.10)
where en = en(X!, . .. ,Xn) is given by

Divide top and bottom of (5.5.10) by
;,'
II7
1 p(Xi ,
B) to obtain
(5.5.11)

where l(x,B)
= 10gp(x,B) and
,
,
We claim that
for all B. To establish this note that (a) sup { 11" + 1I"(B) : ItI < M} tent and 1T' is continuous. (b) Expanding, (5.5.13)
(e In) 
~ 0 a.s.
for all M because
eis a.s. consis
I
I
1
i
1 ! , ,
I
p
J
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
341
where
Ie  Bit)1 < )n.
We use I:~
1
g~ (Xi, e) ~ 0 here. By A4(a.s.), A5(a.s.),
1
n
sup { n~[}B,(Xi,B'(t))n~[}B,(Xi,B):ltl<M
In [}'I
[}'l
}
~O,
for all M, a.s. Po. Using (5.5.13). the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),
Po
[dnqn(t)~1f(B)exp{Eo:;:(Xl,B)~}
forallt] =1.
(5.5.14)
Using A6 we obtain (5.5.12).
Now consider
dn =
I:
r
+y'n
1f(e+
;")exp{~I(Xi,9+;,,) 1(Xi,e)}ds
(5.5.15)
dnqn(s)ds
J1:;I<o,fii
J
1f(t) exp
{~(l(Xi' t) 1(X
i,
9)) } l(lt 
el > o)dt
By AS and A7,
Po [sup { exp
{~(l(Xi,t) 1(Xi , e») } : It  el > 0} < e"'("O)] ~ 1
(5.5.16)
for all 0 so that the second teon in (5.5.14) is bounded by y'ne"'("O) ~ 0 a.s. Po for all 0> O. Finally note that (Problem 5.5.4) by arguing as for (5.5.14), tbere exists o(B) > 0 such that
Po [dnqn(t) < 21f(8) exp {~ Eo (:;: (Xl, B))
By (5.5.15) and (5.5.16), for all 0
~}
for all It 1 < 0(8)y'n]
~ I.
(5.5.17)
> 0,
(5.5.18)
Po [dn 
r dnqn(s)ds ~ 0] = I. J1:;I<o,fii
exp {_ 8'I(B)} ds
2
Finally, apply the dominated convergence theorem, Theorem B.7.5, to dnqn(sl(lsl < 0(8)y'n)), using (5.5.14) and (5.5.17) to conclude that, a.s. Po,
d ~ 1f(B)
n
r= L=
= 1f(8)v'21i'.
JI(B)
(5.5.19)
,
I
342
Hence, a.S. Po,
Asymptot'lc Approximations Chapter 5
qn(t) ~ V1(e)<p(tvI(e))
where r.p is the standard Gaussian density and the theorem follows from Scheffe's Theorem B.7.6 and Proposition B.7.2. 0 Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N{ (), ( 2 ) distribution with a 2 known and we put aN ('TJ, 7 2 ) prior on 8. Then the posterior
distribution of8 isN(Wln7J!W2nX,
(~I r12)1) where
,,2
W2n
.,
I
• •
WIn
= nT 2 +U2'
= !WIn
(5.5.20)
,
'"
,
r
, .,,
I
Evidently, as n + 00, WIn + 0, X + 8, a.s., if () = e, and (~I T\) 1 + O. That is, the posterior distribution has mean approximately (j and variance approximately 0, for n large, or equivalently the posterior is close to point mass at as we vn(O  9) has posterior distribution expect from Theorem 5.5.1. Because 9 =
;
N ( .,!nw1n(ry  X), n (~+
;'»
1).
x,
e
Now, vnW1n
=
O(n 1/ 2) ~ 0(1) and
0
n (;i + ~ ) 1
rem 5.5.2.
+ (12 =
II (8) and we have directly established the conclusion of Theo
I ,
I
I
Example 5.5.2. Posterior Behavior in the Binomial~Beta Model. (Example 3.2.3 continued). If we observe Sn with a binomial, B(n, 8), distribution, or equivalently we observe X" ... , X n Li.d. Bernoulli (I, e) and put a beta, (3(r, s) prior on e, then, as in Example 3.2.3, (J has posterior (3(8n +r, n+s  8 n ). We have shown in Problem 5.3.20 that if Ua,b has a f3(a, b) distribution, then as a + 00, b + 00,
I
If 0
a) £ (a+b)3]1( Ua,b a+b ~N(o,I). [ ab
< B<
(5.5.21)
j I ,
i j
1 is true, Sn/n ~. () so that Sn + r + 00, n + s  Sn + 00 a.s. Po. By identifying a with Sn + r and b with n + s  Sn we conclude after some algebra that because 9 = X,
vn((J  X)!:' N(O,e(l e))
a.s. Po, as claimed by Theorem 5.5.2.
o
Bayesian optimality of optimal frequentist procedures and frequentist optimality of
Bayesian procedures
•
Theorem 5.5.2 has two surprising consequences. (a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.
,
1 ,
I
I
t
hz _
I
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
343
(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions. As an example of this phenomenon consider the following.
~
Theorem 5.~.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let B be the MLE ofB and let B* be the median ofthe posterior distribution ofB. Then
(i)
(5.5.22)
a.s. Pe for all
e. Consequently,
~, _ I ~ 1 az. I (e)ae(X" e) +op,(n 1/2 ) e  e+ n L.
l=l
(5.5.23)
and LO( .,fii(rr  e)) ~ N(o, rl(e)).
(ii)
(5.5.24)
E( .,fii(111 
el11111 Xl,'"
,Xn) = mjn E(.,fii(111  dl 
1(11) 1Xl.··· ,Xn) + op(I).
(5.5.25)
Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions In (e, d) = .,fii(18 dlIell· But the Bayes estimatesforl n and forl(e, d) = 18dl must agree whenever E(11111 Xl, ... , Xn) < 00. (Note that if E(1111 I Xl, ... , X n ) = 00, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.l4.22)
sup IP[.,fii(O  e)
< x I Xl,'" ,Xu) 1>(xy'""'I(""'e))1 ~
Oa.s. Po.
(5.5.26)
But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.1 I). Thus, any median of the posterior distribution of .,fii(11  e) tends to 0, the median of N(O, II (~)), a.s. Po. But the median of the posterior of .,fii(0  (1) is .,fii(e'  e), and (5.5.22) follows. To prove (5.5.24) note that
~ ~ ~
and, hence, that
E(.,fii(IOelll1e'l) IXl, .. ·,Xn) < .,fiile
e'l ~O
(5.5.27)
a.s. Po, for all B. Because a.s. convergence Po for all B implies. a.s. convergence P (B.?). claim (5.5.24) follows and, hence,
E( .,fii(10 
01 101) I h ... , Xn)
= E( .,fii(10 
0'1  101) I X" ... , X n ) + op(I).
(5.5.28)
344
~
Asymptotic Approximations
Chapter 5
Because by Problem 1.4.7 and Proposition 3.2.1, B* is the Bayes estimate for In(e,d), (5.5.25) and the theorem follows. 0 Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
Bayes credible regions
~
There is another result illustrating that the frequentist inferential procedures based on f) agree with Bayesian procedures to first order.
Theorem 5.5.4. Suppose the conditions afTheorem 5.5.2 are satisfied. Let
where en is chosen so that 1l"(Cn I Xl, ... ,Xn) = 1  0', be the Bayes credible region defined in Section 4.7. Let Inh) be the asymptotically level 1  'Y optimal interval based on B, given by
~
where dn(y)
i •
= z (! !)
JI~). ThenJorevery€ >
0, 0,
~ 1.
P.lIn(a + €) C Cn(X1 , .•. ,Xn ) C In(a  €)J
(5.5.29)
I
I
I
The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of Jii( IJ  0) converges to the N(O,Il (0)) density nnifonnly over compact neighborhoods of 0 for each fixed 0, is sketched in Problem 5.5.6. The message of the theorem should be clear. Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis both in this case and in estimation reveals that any approximations to Bayes procedures on a scale finer than n 1j2 do involve the prior. A particular choice, the Jeffrey's prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n 1 order (see Schervisch, 1995).
Thsting
! ,
•
Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of 00 given X I, ... ,Xn if H is false is of a different magnitude than the pvalue for the same data. For more on this socalled Lindley paradox see Berger (1985) and Schervisch (1995). However, if instead of considering hypothesis specifying one points 00 we consider indifference regions where H specifies [00 + D.) or (00  D., 00 + D.), then Bayes and freqnentist testing procedures agree in the limit. See Problem 5.5.2. Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a prior possible. Second. we established
i
! I
I b
II
!
_
j
TI
Section 5.6
Problems and Complements
345
the socalled Bernsteinvon Mises theorem actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the socalled Bernstein~von Mises theorem and frequentist contjdence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose Xl, ... , X n are i.i.d. as X ous case density.
rv
F, where F has median F 1 (4) and a continu
(a) Show that, if n
= 2k + 1,
, Xn)
EFmed(X),
n (
_
~
)
l'
k (1  t)kdt F' (t)t
EF
med (X"
2
,Xn )
n( 2;) [1P'(t)f tk (1t)k dt
= 1, 3,
(b) Suppose F is unifonn, U(O, 1). Find the MSE of the sample median for n and 5.
2. Suppose Z ~ N(I', 1) and V is independent of Z with distribution X;'. Then T
Z/
=
(~)!
is said to have a noncentral t distribution with noncentrality J1 and m degrees
of freedom. See Section 4.9.2. (a) Show that
where fm(w) is the x~ density, and <P is the nonnal distribution function.
(b) If X" ... ,Xn are i.i.d.N(I',<T2 ) show that y'nX /
(,.~, L:(Xi
_X)2)! has a
noncentral t distribution with noncentrality parameter .fiiJ1/IT and n  1 degrees of freedom. (c) Show that T 2 in (a) has a noncentral :FI,m distribution with noncentrality parameter J12. Deduce that the density of T is
p(t)
= 2L
i=O
00
P[R = iJ . hi+,(f)[<p(t 1')1(t > 0)
+ <p(t + 1')1(t < 0)1
where R is given in Problem B.3.12.
, " I I. ,
346
Hint: Condition on
Asymptotic Approximations
Chapter 5
ITI.
1, then Var(X) < 1 with equality iff X
3. Show that if P[lXI < 11
±1 with
probability! .
Hint: Var(X) < EX 2
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and f. through ..jiif..
(a) Show that the ratio of the Hoeffding function h( VilE) to the Chebychev function e( Jii€) tends to as Jii€ ~ 00 so that he) is arbitrarily better than en in the tails.
°
I,.,'
i~
(b) Show that the normal approximation 24> (
V;€)  1 gives lower results than h in
00.
,
the tails if P[lXI < 1] = 1 because, if ,,2 < 1. 1  <p(t)  <p(t)lt as t ~ Note: Hoeffding (1963) exhibits better bounds for known a 2 .
R has .\(0) ~ 0, is bounded, and has a hounded second derivative .\n. Show that if Xl, ... , X n are i.i.d., EX l = f.L and Var Xl = 02 < 00, then
5. Suppose.\ : R
~
E.\(X 
1') = .\'(0)
;'/!; +
0
(~)
as n >
00.
= E>.'(O)JiiIX  1'1 + E (";' (X  I')(X I'?) where IX  1'1 < Ix  1'1· The last term is < suPx I>." (x) 1,,2 In and the first tends to
Hint: JiiE(.\(IX 1,1) .\(0))
I
",
1! "
.\'(0)" f== Izl<p(z)dz by Remark B.7. 1(2).
Problems for Section 5,2
1. Using the notation of Theorern 5.2.1, show that
"
2. Let X" ... ,Xn be ij,d. N(I',,,2), Show that for all n
Sup p(•• u) [IX
u
>
1, all €
>
°
/,1 > ,] ~ 1.
Hint: Let (J
)
00.
•
3, Establish (5.2.5). Hint: Iiln  q(p)1
> € =} IPn  pi > w (€).
l
J
4. Let (Ui , V;), 1 < i < n, be i.i.d.  PEP.
(a) Let y(P)
= PIU, > 0, V, > OJ. Show that if P = N(O, 0,1,1, p), then
p
~ sin21l' (Y(P)  ~).
(Wald) Suppose e ~ p( X. eol > A} > 0 for some A < 00. e E Rand (i) For some «eo) >0 Eo. (c) Suppose p{P) is defined generally as Covp (U.(eo)} < 00.2. e) . and < > 0 there is o{0) > 0 such that e. (1 (J.d. ". e') . i (e))} > <. eo)} : e E K n {I : 1 IB .2.Section 5. .eol > . eo) : e' E S(O. 05) where ao is known. VarpUVarpV > O}.n 1 (XiJ.lO)2)) . Hint: From continuity of p. {p(X" 0) .p(X. Suppose Xl. .eol > ..14)(i)..p(X.~ . (Ii) Show that condition (5.\} c U s(ejJ(e j=l T J) {~ t n . 5.sup{lp(X.(ii) holds. 5_0  lim Eo. N (/J. for each e eo.lJ.o)} =0 where S( 0) is the 0 ball about Therefore. then is a consistent estimate of p.\} . By compactness there is a finite number 8 1 . Eo.. e') .6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution. Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent.e')p(X.2 rII. where A is an arbitrary positive and finite constant. 6.8) fails even in this simplest case in which X ~ /J is clear. .Xn are i. (i). Show that the maximum contrast estimate 8 is consistent. Or of sphere centers such that K Now inf n {e: Ie . sup{lp(X. by the basic property of maximum contrast estimates. V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00. e) is continuous. and the dominated convergence theorem. A]. t e. . 7. inf{p(X. e)1 (ii) Eo.p(X.e)1 : e' E 5'(e.14)(i) add (ii) suffice for consistency.e'l < . (a) Show that condition (5.p(Xi . n Lt= ". inf{p(X.•. eo) : Ie  : Ie . Hint: K can be taken as [A.i. Hint: sup 1.2. . Prove that (5.l)2 _ + tTl = o:J.
. .d.I'lj < EIX .. n. . . and take the values ±1 with probability ~.X~ are i.xW) < I'lj· 3. i . Establish (5.i.d.3.2.3 I.11).X. ~ .. . and let X' = n1EX.. Hint: Taylor expand and note that if i . Then EIX .3. The condition of Problem 7(ii) can also fail.O'l n i=l p(X"Oo)}: B' E S(Bj. .. L:~ 1(Xi .X'i j .1.Xn but independent of them. < > OJ. and if CI. II I ....) 2] .1. (ii) If ti are ij. with the same distribution as Xl. Let X~ be i. + id = m E II (Y. i .O(BJll}. Establish Theorem 5.k / 2 . in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x. ..1'. 9. 4.348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi.3.. for some constants M j . Establish (5."d' k=l d < mm+1 L d ElY. . then by Jensen's inequality. • • (i) Suppose Xf. 1 + . P > 1.3.. Establish (5.l m < Cm k=I n. . < Mjn~ E (..L.i.[. = 1.)I' < MjE [L:~ 1 (Xi . . .3. Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K.3) for j odd as follows: . li .9) in the exponential model of Example 5. Show that the log likelihood tends to 00 as a + 0 and the condition fails. I 10. 1/(J.)2]' < t Mjn~ E [.3. For r fixed apply the law of large numbers. (72). L:!Xi . .X.. . Compact sets J( can be taken of the form {II'I < A. . Hint: See part (a) of the proof of Lemma 5..d.. < < 17 < 1/<. 8. Extend the result of Problem 7 to the case () E RP. en are constants. Problems for Section 5. 2. J (iii) Condition on IXi  X.
.\k for some .) I : i" .1) +E{IX.6 Problems and Complements 349 Suppose ad > 0. ~ xl".c> . X.\ > 0 and = 1 + JI«k+m) ZIa. P(sll s~ < ~ 1~ a as k ~ 00. theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~.a as k t 00. FandYi""'Yn2 bei. .I) Hint: By the iterated expectation theorem EIX. ~ iLl' I IXli > liLl}P(lXd > 11. j > 2.a~d < [max(al"" .(l'i ."" X n be i.m) Then.). (d) Let Ck. Establish 5.3. respectively..m 00.1)'2:7' .ad)]m < m L aj j=1 m <mmILaj.m with K. 8. . ~ iLli I IX. LetX1"".. 1)'2:7' . .I. 6.aD.i. G.m be Ck.... I < liLl}P(IXd < liLl)· 7.. Let XI. 2 = Var[ ( ) /0'1 ]2 .1 and 'In = n2 . then (sVaf)/(s~/a~) has an .d. Show that if m = . (b) Show that when F and G are nonnal as in part (a). . . replaced by its method of moments estimate. L~=1 i j = m then m a~l. R valued with EX 1 = O. . i.Fk... if a < EXr < 00.i. ~ iLli < 2 i EIX.andsupposetheX'sandY's are independent. s~ ~ (n2 . PH(st! s~ < Ck. . (a) Show that if F and G are N(iL"af) and N(iL2. then EIX. n} EIX. j = 1.(X.d.r'j"./LI = E(Xt}. )=1 5.28. 0'1 = Var(Xd· Xl /LI Ck. Show that ~ sup{ IE( X.l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck.Section 5. 1 < j < m.pli ~ E{IX.Tn) + 1 . km K. where s1 = (n.i. under H : Var(Xtl ~ Var(Y.Xn1 bei. Show that if EIXlii < 00. . Show that under the assumptions of part (c).d.m distribution with k = nl ..
. . then P(sUs~ < qk. . . . • 10. Under the assumptions of part (c). N where the t:i are i. TN i. 7 (1 . i = I. Wnte (l 1pP .+. (cj Show that if p = 0. 4p' (I .lIl) + 1 . . () E such that T i = ti where ti = (ui.p')') and. err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5.) =a 2 < 00 and Var(U) Hint: (a) X ..I' ill "rI' and "2 = Var[( X. then jn(r' .A») where 7 2 = Var(XIl. . use a normal approximation to tind an approximate critical value qk. ..p) ~ N(O. . = ("t . "i.m (0 Let 9. "1 = "2 = I. . suppose i j = j.i.".a as k .i.. Tin' which we have sampled at random ~ E~ 1 Xi.I).1 EX.p) ~ N(O. (a) If 1'1 = 1'2 ~ 0..2" ( 11. 1). . then jn(r .Il2. (b) If (X.p. there exists T I .00. Without loss of generality. .I))T has the same asymptotic distribution as n~ [n.XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters. . (b) Use the delta method for the multivariate case and note (b opt  b)(U . . In particular. Without loss of generality.iL) • . Et:i = 0. (iJl. = = 1. p). Show that under the assumptions of part (e). i = 1.. . jn(r . t N }.2< 7'.1'2)1"2)' such that PH(SI!S~ < Qk.xd. (iJi . . jn((C . to estimate Ii 1 < j < n. .p). as N 2 t Show that.N. Var(€. from {t I. Hint: Use the centra] limit theorem and Slutsky's theorem.d.XN} or {(uI. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3..p') ~ N(O. (a) jn(X . i (b) Suppose Po is such that X I = bUi+t:i.1JT..1 that we use Til. ' . n. In survey sampling the modelbased approach postulates that the population {Xl. n.. Po.I. ! . be qk. Show that jn(XRx) ~N(0.t . H En. if p i' 0.d.. Y) ~ N(I'I. + I+p" ' ) I I I II. (UN..\ < 1. if 0 < EX~ < 00 and 0 < Eyl8 < 00.1.. = (1. . • .l EY? . JLl = JL2 = 0.1 · /.m) + 1 . Consider as estimates e = ') (I X = x. • op(n').3. and if EX.4.(I_A)a2)._ J ..350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g. if ~ then .6. .~) (X . that is.\ 00.m with KI and K2 replaced by their method of moment estimates. suppose in the context of Example 3.m (depending on ~'l ~ Var[ (X I .3. show that i' ~ .Ii > 0. .4.1." •. (I _ P')')." .x) ~ N(O. < 00 (in the supennodel).6.XC) wHere XC = N~n Et n+l Xi.".a: as k + 00. iik. Instead assume that 0 < Var(Y?) < x..xd.. . .l EXiYi .. ~ ] ..+xn n when T. . 0 < . In Example 5. .
X = XO.i'a)(Yb . Here x q denotes the qth quantile of the distribution.3' Justify fonnally j1" variance (72..13(c)... The following approximation to the distribution of Sn (due to Wilson and Hilferty. (h) Use (a) to justify the approximation 17. . X..10.E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d. (h) Deduce fonnula (5. X~.3.14)..3. X n are independent.. (a) Show that the only transformations h that make E[h(X) .3.3. E(WV) < 00. . It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes. . Normalizing Transformation for the Poisson Distribution. (a) Suppose that Ely. This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl). EU 13.i'c) I < Mn'.6 Problems and Complements 351 12. (a) Use this fact and Problem 5.. then E(UVW) = O. X n is a sample from a population with mean third central moment j1. .y'ri has approximately aN (0. Suppose X 1l . . Let Sn have a X~ distribution.. = 0. Show that IE(Y. Suppose XI.. n = 5.14 to explain the numerical results of Problem 5.n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO.6) to explain why. 16.3 < 00.. 1931) is found to be excellent Use (5.3. W).12). and Hint: Use (5. each with HardyWeinberg frequency function f given by .i'b)(Yc . Hint: If U is independent of (V.s. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x ..90. !) distribution.99.. 14. (a) Show that if n is large..25. is known as Fisher's approximation.Section 5. 15.: . (b) Let Sn . Suppose X I I • • • l X n is a sample from a peA) distrihution.
1'2)(X . i 21. [vm +  n (Bm. E(Y) = 1'2. X n be the indicators of n binomial trials with probability of success B. where I' = (a) Find an approximation to P[X < E(X. and h'(t) > for all t.a tends to zero at the rate I/(m + n)2. Let then have a beta distribution with parameters m and n.)? 18. • j b _ . Yn) is a sample from a bivariate population with E(X) ~ 1'1. 1'2)h2(1'1.y). . Show that the only variance stabilizing transformation h such that h(O) = 0.. Y) .n  m/(m + n») < x] 1 > 'li(x). . Show directly using Problem B. 20.n tends to zero at the rate I/(m + n)2.y) = a axh(x..h(l'l.. Var(Y) = C7~. 172 + [h 2(1'l.5 that under the conditions of the previous problem. 1'2) Hint: h(X.• • • ~{[hl (1"... + (mX/nY)] where Xl. I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1.y). .. .a) Hmt: Use Bm.1'2)]2 177 n +2h. Y) where (X" Yi). (1'1. 19. is given by h(t) = (2/1r) sin' (Yt). Y) ~ p!7IC72. 1'2)pC7. (b) Find an approximation to P[JX' < t] in terms of 0 and t. Var(X) = 171...(x. ° . (a) (b) Var(h(X. which are integers. Var Bmnm + n +Rmn = 'm+n ' ' where Rm. (e) What is the approximate distribution of Vr'( X . 1'2)(Y  + O( n '). where I I h.a) E(Bmn ) = .n P . .1') + X 2 . if m/(m + n) . (1'1. Justify fonnally the following expressions for the moments of h(X. h(l) = I. Cov(X.2. (Xn . 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t.y) = a ayh(x. Y" .Xm . v'a(l. 1'2)]2C7n + O(n 2) i . then I m a(l. Bm. h 2 (x.. .n = (mX/nY)i1 independent standard eXJX>nentials. Yn are < < . . V)) '" . Variance Stabilizing Transfo111U1tion for the Binomial Distribution. Let X I. 0< a < I. Show that ifm and n are both tending to oc in such a way that m/(m + n) > a. 1'2) = h.I'll + h2(1'1.
)2 and n(X .J1..)] !:. ()'2 = Var(X) < 00. .. k > I.)] is asymptotically distributed (b) Use part (a) to show that when J1. X n is a sample from a population and that h is a realvalued function 01 X whose derivatives of order k are denoted by h(k). Suppose IM41 (x)1 < M for all x and some constant M and suppose that J. Show that Eh(X) 3 h(J1.4 to give a directjustification of where R n / yin 0 as in n Recall Stirling's approximation: + + 00.)] ~ as ~h(2)(J1.X2 . (a) Show that y'n[h(X) .) 23.. = 0. J1. find the asymptotic distribution of y'n(T. Let Xl>"" X n be a sample Irom a population with and let T = X2 be an estimate of 112 . (It may be shown but is not required that [y'nR" I is bounded.J1. XI. 24.(1. xi. (e) Fiud the limiting laws of y'n(X .3 1/6n2 + M(J1. .l4 is finite.Section 5.2. = ~.) + ": + Rn where IRnl < M 1(J1..2)/24n2 Hint: Therefore.2V where V ~ 00. xi 25. and that h(1)(J. Let Sn "' X~.J1.6 Problems and Complements 353 22. Compare your answer to the answer in part (a). = E(X) "f 0. .h(J1.2) using the delta method.)2. n[X(I .l) = o. Suppose that Xl.J1. < t) = p(Vi < (b) When J1.2V with V "' Give an approximation to the distribution of X (1 . 1 X n be a sample from a population with mean J. (a) When J1. ° while n[h(X h(J1.X) in tenns of the distribution function when J.). find the asymptotic distrintion of nT nsing P(nT y'nX < Vi).J1. _..4 + 3o.)1J1.) + ~h(2)(J1. .l = ~.. Usc Stirling's approximation and Problem 8.l and variance (J2 < Suppose h has a second derivative h<2) continuous at IJ. Let Xl.
9.I I (a) Show that T has asymptotically a standard Donnal distribution as and I I .3.4. al = Var(XIl.3.\)al + . N(Pl (}2). Suppose Xl. Yn2 are as in Section 4.o. .3).~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of . E In] > 1 . . .JTi times the length of the onesample t interval based on the differences.) < (b) Deduce that if p tl ~ <P (t [1.t.9. 28. .3) have correct asymptotic (c) Show that if a~ > al and .\ probability of coverage. I i t I .354 26.1/ S D where D and S D are as in Section 4. 1 X n1 and Y1 .\. . (a) Show that P[T(t. Eo) where Eo = diag(a'. " Yn as separate samples and use the twosample t intervals (4.9. 2pawz)z(1  > 0 and In is given by (4. (c) Apply Slutsky's theorem.) of Example 4.9. so that ntln ~ .\ > 1 .\ul + (1 ~ or al .3. n2 + 00.\)a'4)/(1 ..2 .•. .4. Suppose nl + 00 .\a'4)I'). 'I :I ! t. a~.9.)/SD where D. Hint: (a).../". Show that if Xl" . cyi .. . . /"' = E(YIl. I . Let n _ 00 and (a) Show that P[T(t.3). whereas the situation is reversed if the sample size inequalities and variance inequalities agree. Yd.9.9. . What happens? Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution.2a 4 ). We want to obtain confidence intervals on J1. if n}.a. the interval (4. Let T = (D ..t2.\ < 1.}.a.. Assume that the observations have a common N(jtl.') < 00.d. Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X .) (b) Deduce that if.. Hint: Use (5. . (d) Make a comparison of the asymptotic length of (4. 27. . Suppose (Xl. . .Jli = ~. 1 Xn. and a~ = Var(Y1 ). < tJ ~ <P(t[(. . 29. Viillnl ~ 2vul + a~z(l . We want to study the behavior of the twosample pivot T(Li. (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment. .3) and the intervals based on the pivot ID .3) has asymptotic probability of coverage < 1 . Suppose that instead of using the onesample t intervals based on the differences Vi .(~':~fJ').. = = az. 0 < . p) distribution.. . (c) Show that if IInl is the length of the interval In. and SD are as defined in Section 4. .9.9. .t. n2 + 00.3 independent samples with III = E(XIl.4. then limn PIt. XII are ij. that E(Xt) < 00 and E(Y. Y1 . b l . the intervals (4.33) and Theorem 5.(j ' a ) N(O.\.Xi we treat Xl.
33.a) is the critical value using the Welch approximation. . It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. (a) Show that the MLEs of I'i and (). (c) Let R be the method of moment estimaie of K.6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4.1). . ()._I' Find approximations to P(Yn and P{Vn < XnI (I . as X ~ F and let I' = E(X).8Z = (n .3 using the Welch test based on T rather than the twosample t test based on Sn. and let Va ~ (n .. Let X ij (i = I. (a) SupposeE(X 4 ) < 00.d.1. has asymptotic level a. then lim infn Var(Tn ) > Var(T). < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X.z = Var(X). j = I.i. Generalize Lemma 5.3.I)sz/().9 and 4. X n be i.4.. I< = Var[(X 1')/()'jZ.d.1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" ." . ..a). but Var(Tn ) + 00. T and where X is unifonn.. then P(Vn < va<) t a as 11 t 00. Let . ().£. . (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l . .xdf and Ixl is Euclidean distance..I) + y'i«n . Let XI.1)1 L.£.X)" Then by Theorem B. Tn .. .z are Iii = k.3 by showing that if Y 1.3.4. EIY.16.Section 5. Show that k ~ ex:: as 111 ~ 00 and 112 t 00. .~ t (Xi . Y nERd are Li. X. U( 1. Vn = (n . then for all integers k: where C depends on d. 30. Show that as n t 00.. Plot your results. x = (XI.9. 32.a)) and evaluate the approximations when F is 7.. Show that if 0 < EX 8 < 00.Z).p. Hint: If Ixll ~ L. then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31.3.I k and k only. vectors and EIY 1 1k < 00. where I.. .I)z(a).:~l IXjl. where tk(1. Hint: See Problems B. k) be independent with Xi} ~ N(l'i. . I is the Euclidean norm. Carry out a Monte Carlo study such as the one that led to Figure 5.z has a X~I distribution when F is theN(fJl (72) distribution.3. (d) Find or write a computer program that carries out the Welch test.
Show that. ..p(x) (b) Suppose lbat for all PEP. and (iii) imply that O(P) defined (not uniquely) by Ep.Xn be i.n7(0) ~[. . I i ..i. (Use .nCOn .d..0) = O. . .(0) Varp!/!(X. is continuous and lbat 1(0) = F'(O) exists. (e) Suppose lbat the d. . n 1 F' (x) exists. . . random variables distributed according to PEP.On) .p(X.0))  7'(0)) 0. then 8' !. I . Show that On is consistent for B(P) over P. . . 1 X n. 00 (iii) I!/!I(x) < M < for all x.p( 00) as 0 ~ 00... .) Hint: Show that Ep1/J(X1  8) is nonincreasing in B. .On) < 0).n(On . 1) for every sequence {On} wilb On 1 n = 0 + t/.. Use the bounded convergence lbeorem applied to . Set (.. ! Problems for Section 5.\(On)] !:.0) ~ N Hint: P( . I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 .n(X .. ... Assmne lbat '\'(0) < 0 exists and lbat .. aliI/ > O(P) is finite. (c) Give a consistent estimate of (]'2. Let On = B(P)...p(X. [N(O))' . 1948).. N(O. F(x) of X.n for t E R. 1/).)). ..p(X. .0). Let i !' X denote the sample median. (a) Show that (i)..1)cr'/k.0) !:.356 Asymptotic Approximations Chapter 5 .p(Xi . Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) .. \ . . (c) Assume the conditions in (a) and (b). ..f.4 1. then < t) = P(O < On) = P (..0) !. Let Xl. • 1 = sgn(x)... .p(X.p(X.... Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique. (b) Show that if k is fixed and p ~ 00. N(O. _ . That is the MLE jl' is not consistent (Neyman and Scott. . . Show lbat if I(x) ~ (d) Assume part (c) and A6. (ii)..p(X. under the conditions of (c). N(O) = Cov(. ..0). Ep. Show that _ £ ( .L~.0) and 7'(0) .. . I(X.. (k . O(P) is lbe unique solution of Ep. 1/4/'(0)).O(P)) > 0 > Ep!/!(X. I ! 1 i . where P is the empirical distribution of Xl.
X) =0.6 Problems and Complements 357 (0 For two estImates 0.10. 1979) which you may assume. Show that   ep(X. Thus. then ep(X.'" .~ for 0 > x and is undefined for 0 g~(X. 2.2.0)p(x. . (a) Show that g~ (x.(1:e )n ~ 1. .O)p(x.(x) = (I ..5) where f.20 and note that X is more efficient than X for these gross error cases. X) = 1r/2. the asymptotic 2 . 0 > 0. 4. If (] = I.' . Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley.. Let XI.I / 2 ."' c .d. Show that assumption A4' of this section coupled with AGA3 implies assumption A4.B)) ~ [(I/O). X n be i.O)dl"(x)) = J 1J..b)p(x.7. not O! Hint: Peln(B . x E R.O). 0) = . 0 «< 0. J = 1.(x.(x. 'T = 4.(x. X) as defined in (f). " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 .0) (see Section 3..y. Show that if (g) Suppose Xl has the gross error density f.a)p(x.B) < xl = 1.i. Hint: Apply A4 and the dominated convergence theorem B. and O WIth .exp(x/O)..O))dl"(x) ifforalloo < a < b< 00.(x. Conclude that (b) Show that if 0 = max(XI.(x) + <'P.Xn ) is the MLE.jii(Oj .. evaluate the efficiency for € = .(x). Find the efficiency ep(X.    .9)dl"(x) = J te(1J. ( 2 ).0. then 'ce(n(B .  = a?/o. ~"'. 3. denotes the N(O. lJ :o(1J.O)p(x. not only does asymptotic nonnality not hold but 8 converges B faster than at rate n. Show that A6' implies A6.0) ~ N(O.c)'P.(x.4.b)dl"(x) J 1J. (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2). oj).0.Section 5.a)dl"(x). 82 ) Pis N(I". This is compatible with I(B) = 00.. 0» density.. Hint: teJ1J. .15 and 0.(x . .O) is defined with Pe probability I but < x.5  and '1'.05. U(O.
O') .. under F oo ' and conclude that .(X.4. O.00) ..in. Suppose A4'. 80 +. j.in) 1 is replaced by the likelihood ratio statistic 7. 8) is continuous and I if En sup {:. 8 ) 0 . I I j 6. I .80 + p(Xi . . I . ~ L.80 + p(X .50). . &l/&O so that E.22).4) to g.14. (8) Show that in Theorem 5..358 5. ) n en] .[ L:.5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.·_t1og vn n j = p( x"'o+. .i(X. Show that the conclusions of PrQple~ 5.) P"o (X . 0 ~N ~ (."log .5 continue to hold if .18 .. 1(80 ).p ~ 1(0) < 00. I 1 .4.~ (X. and A6 hold for. i I £"+7.< ~log n p(X.. I (c) Prove (5. 0). ~log (b) Show that n p(Xi .i.in) . Show that 0 ~ 1(0) is continuous.4. 8)p(x.In ~ "( n & &8 10gp(Xi. A2. 8 ) i . Apply the dominated convergence theorem (B."('1(8) .i (x. 0 for any sequence {en} by using (b) and Polyii's theorem (A.i'''i~l p(Xi. Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+". ) ! ..7.0'1 < En}'!' 0 .Oo) p(Xi . i.g. .in) ~ .O) 1(0) and Hint: () _ g..80 p(X" 80) +.In) + .
2.4. Hint: Use Problems 5.56).54) and (5. 11.4.5 hold.3. (a) Establish (5. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5. the bound (5.4. if X = 0 or 1. 10. hence. gives B Compare with the exact bound of Example 4. Then (5.7). ~ 1 L i=I n 8'[ ~ 8B' (Xi. (Ii) Compare the bounds in Ca) with the bound (5.4.4.2. f is a is an asymptotic lower confidence bound for fJ. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 .Section 5. B' _n + op (n 1/') 13. (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ).61) for X ~ 0 and L 12.59).59) coincide.:vell .4.6 Problems and Complements 359 8. (Ii) Suppose the conditions of Theorem 5.a.58).14.59). (a) Show that under assumptions (AO)(A6) for all Band (A4').4.4.4. A.Q" is asymptotically at least as good as B if.4. = X.5 and 5.2. for all f n2 nI n2 .4. setting a lower confidence bound for binomial fJ.6.4. (a) Show that. Establish (5.4. which agrees with (4. 9.3).5.57) can be strengthened to: For each B E e. B' nj for j = 1. all the 8~i are at least as good as any competitors.61).4. and give the behavior of (5. . Compare Theorem 4. (Ii) Deduce that g.4.6 and Slutsky's theorem. We say that B >0 nI Show that 8~I and. at all e and (A4'). Consider Example 4.B lower confidence bounds. Let B be two asymptotic l. ~. which is just (4. B). Let [~11. Hint.4. Let B~ be as in (5. Hint: Use Problem 5.7.4.
i.5. is a given number. 1. (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation. . =f:.55). 1).1.LJ...\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO.2.A 7" 0.d. Suppose that Xl. .riX) = T(l + nr 2)1/2rp (I+~~. By Problem 4. .360 1 Asymptotic Approximations Chapter 5 14.)) and that when I' = LJ. X > 0. N(Jt. 0 given Xl. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : . inverse Gaussian with parameters J. (a) Show that the test that rejects H for large values of v'n(X . = O. (e) Find the Rao score test for testing H : . where .'" XI! are ij. (d) Find the Wald test for testing H : A = AO versus K : A < AO. Jl > OJ ..1.1... 1.6.1 and 3. Let X 1> . 2. ! (a) Show that the posterior probability of (OJ is where m n (.\ l .] versus K . " " I I .I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox"). is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0. T') distribution. Consider the problem of testing H : I' E [0. I ..•. where Jl is known.)l/2 Hint: Use Examples 3.2. the pvalue fJ . Consider testing H : j. 00.d.d.) has pvalue p = <1>(v'n(X .01 . ! I • = '\0 versus K : . 7f(1' 7" 0) = 1 . Show that (J l'. 1 . Show that (J/(J l'. X n i.d.4. J1.\ > o. variables with mean zero and variance 1(0). if H is false..21J2 2x A} .\ < Ao. X n be i. the test statistic is a sum of i.. That is. .i. > ~. Consider the Bayes test when J1.t = 0 versus K : J. I' has aN(O. That is. (c) Suppose that I' = 8 > O.L and A.. Establish (5. . 1) distribution.LJ. N(I'.11).2. (b) Suppose that I' ). Now use the central limit theorem. Problems for Section 5.5 ! 1. 1 15.=2[1  <1>( v'nIXI)] has a I U(O. Hint: By (3.4. A exp . phas a U(O...I).i. each Xi has density ( 27rx3 A ) 1/2 {AX + J.4.10) and (3. the evidence I . 1) distribution. LJ.
Hint: By (5.050 logdnqn(t) = ~ {I(O).for M(. P.0 3. ~ E.O'(t)).I(Xi.2. Establish (5. 0' (i» n L 802 i=l < In n ~sup {82 8021(X i .054 50 .Oo»)+log7f(O+ J.4. Hint: 1 n [PI . By Theorem 5.5.5. By (5.s. Suppose that in addition to the conditions of Theorem 5..13) and the SLLN.S.14). (Lindley's "paradox" of Problem 5.645 and p = 0.i Itlqn(t)dt < J:J')..5.1 is not in effect.2 it is equivalent to show that J 02 7f (0)dO < a. 1) and p ~ L U(O.) sufficiently large.) (d) Compute plimn~oo p/pfor fJ.' " .17).058 20 .1 I'> = 1.034 .046 .I).. L M M tqn(t)dt ~ 0 a. all . 1) prior.l has a N(O. Fe.(Xi..)}.Section 5.052 100 . [0.(Xi. 4.s. (e) Show that when Jl ~ ~.8) ~ 0 a.:.05.O') : 100'1 < o} 5. Apply the argnment used for Theorem 5.5.i It Iexp {iI(O)'. Hint: In view of Theorem 5.042 .2 and the continuity of 7f( 0).2.. ~l when ynX = n 1'>=0..1.17). Extablish (5. 10 . oyn..6 Problems and Complements 361 (b) Suppose that J.5.} dt < .029 .sup{ :. for all M < 00. ifltl < ApplytheSLLN and 0 ous at 0 = O. yn(E(9 [ X) .. yn(anX   ~)/ va.0'): iO' .01 < 0 } continuThen 00. ~ L N(O.~~(:. J:J').5. (e) Verify the following table giving posterior probabilities of 1.. i ~.5.( Xt.. Show that the posterior probability of H is where an = n/(n + I).
(I) If the rightband side is negative for some x. i=l }()+O(fJ) I Apply (5. A.8) and (5. The sets en (c) {t .5.) by A4 and A5. ! ' (2) Computed by Winston Cbow. all d and c(d) / in d. . See Bhattacharya and Ranga Rao (1976) for further discussion. Notes for Section 5.29). Show tbat (5.5. Bernstein and R. (O)<p(tI. Fn(x) is taken to be O.16) noting that vlnen. . . (a) Show that sup{lqn (t) .5.s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) . For n large these do not correspond to distributions one typically faces.I Notes for Section 5.2 we replace the assumptions A4(a. J I . von Misessee Stigler (1986) and Le Cam and Yang (1990).9) hold with a. qn (t) > c} are monotone increasing in c. ~ II• '.s.f.5 " I i .1. I : ItI < M} O.5. .1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 .!. 7.1 . " . for all 6.) ~ 0 and I jtl7r(t)dt < co.Pn where Pn a or 1. 'i roc J.. I: . convergence replaced by convergence in Po probability.3 ). . . to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5.". . Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d.5. A proof was given by Cramer (1946).S. (0» (b) Deduce (5.5. Fisher (1925). jo I I .I(Xi}»} 7r(t)dt.s. . . dn Finally. Finally.7 NOTES Notes for Section 5.4 (1) This result was first stated by R. (1) This famous result appears in Laplace's work and was rediscovered by S.s. I i. .. ~ 0 a. L . .d) for some c(d). . 362 f Asymptotic Approximations Chapter 5 > O. • .) and A5(a. • Notes for Section 5. 5. . Suppose that in Theorem 5.
1938. Theory o/Statistics New York: Springer. BHATTACHARYA. U. 318324 (1953). Assoc. AND G. S. Approximation Theorems of Mathematical Statistics New York: J. "Probability inequalities for sums of bounded random variables. 1979.. 2nd ed.. "Theory of statistical estimation. MA: Harvard University press. 1946.. L. HOEFFDING. H. Cambridge University Press. 1995.S. Math.. New York: McGraw Hill. 17. AND M. 3rd ed. 1999. F.8 REFERENCES BERGER. Some Basic Concepts New York: Springer. A. 1967. RAo. L. 132 (1948). A CoUrse in Large Sample Theory New York: Chapman and Hall. C. FERGUSON. HfLPERTY. HAMMERSLEY.. SCHERVISCH. Mathematical Methods of Statistics Princeton. 'The distribution of chi square. G.. RANGA RAO. BILLINGSLEY. O. H. Amer. LEHMANN. The Behavior of the Maximum Likelihood Estimator Under NonStandard Conditions.. LEHMANN. M. R. I.. Probability and Measure New York: Wiley. p. Normal Approximation and Asymptotic Expansions New York: Wiley. Statistical Decision Theory and Bayesian Analysis New York: SpringerVerlag.. Pearson. Nat. P.Section 5. HUBER. YANG. LE CAM. FISHER. B.8 References 363 5. Box. SERFLING. 1980. New York: J. CASELLA. 1973. R. Editors Cambridge: Cambridge University Press. AND G. NEYMAN. W. E.. 16. I Berkeley.. N.. HANSCOMB. RUDIN. W. Proc. A. R." Proc. DAVID. Phil. Soc. 1986. Statist. 700725 (1925). L. Symp. Elements of LargeSample Theory New York: SpringerVerlag." Proc. S. Vth Berkeley Symposium. 1976. M. S. T.. C. AND D. E. "Nonnormality and tests on variances:' Biometrika. Vol. 58. Hartley and E. 1990.. J. R.. Mathematical Analysis. 684 (1931). The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge. 1987.. CRAMER. 1380 (1963). 3rd ed. L. reprinted in Biometrika Tables/or Statisticians (1966). Statist. M. FISHER. R.. Tables of the Correlation Coefficient. J...A. P. NJ: Princeton University Press. SCOTT. Wiley & Sons. E. Vth Berk. Monte Carlo Methods London: Methuen & Co. CA: University of California Press. 40. Vol. J. 1964. Prob." J. L. "Consistent estimates based on partially consistent observations. 1985. WILSON.. J. 1998. 22. 1958. 1996. J. Asymptotics in Statistics. Linear Statistical Inference and Its Applications. STIGLER.. Camb. H." Econometrica. Statisticallnjerence and Scientific Method. Sci. AND E. Acad. P. Wiley & Sons. AND R. . E.. Theory ofPoint EstimatiOn New York SpringerVerlag.
6. However.7. and prediction in such situations. 2. the number of observations.3. the properties of nonparametric MLEs.2.and semiparametric models. multiple regression models (Examples 1. for instance. The inequalities ofVapnikChervonenkis.3.1. Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II. and efficiency in semiparametric models. the multinomial (Examples 1. real parameters and frequently even more semi. 2.3). often many. There is.1) and more generally have studied the theory of multiparameter exponential families (Sections 1. testing. 365 .3). tests. the number of parameters.2 and 5.1. This chapter is a leadin to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non. We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible. curve estimates. confidence regions.5. 2.4.Chapter 6 INFERENCE IN THE MUlTIPARAMETER CASE 6. an important aspect of practical situations that is not touched by the approximation.4.1 INFERENCE FOR GAUSSIAN LINEAR MODElS • Most modern statistical questions iovol ve large data sets.or nonpararnetric models. [n this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates. we have not considered asymptotic inference. are often both large and commensurate or nearly so. however. 2. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for functionvalued statistics.2.3. 1. the bootstrap. the fact that d. We shall show how the exact behavior of likelihood procedures in this model correspond to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular ddimensional parametric models and shall illustrate our results with a number of important examples. and confidence regions in regular onedimensional parametric models for ddirnensional models {PO: 0 E 8}. with the exception of Theorems 5. in which we looked at asymptotic theory for the MLE in multiparameter exponential families.2.8 C R d We have presented several such models already. the modeling of whose stochastic structure involves complex models governed by several. and n.6.
We consider experiments in which n cases are sampled from a population. 0"2).n (6.. 6. the Zij are called the design values.2(4) in this framework. .Zip' We are interested in relating the mean of the response to the covariate values. .. In Section 6. Regression. are U. . .N(O.3) whereZi = (Zil.1..d.1.1 covariate measurements denoted by Zi2 • .1. Z = (Zij)nxp. The model is Yi j = /31 + €i.1. Here Yi is called the response variable.1. '.1. we have a response Yi and a set of p . i = l.1. .d. .3): l I • Example 6. i = 1. We have n independent measurements Y1 .zipf.2J) (6.'" .1.a2). . and Z is called the design matrix. Example 6.4) 0 where€I.1." . .1. I The regression framewor~ of Examples 1. and for each case.1. . 1 en = LZij{3j j=1 +Ei. we write (6.. The OneSample Location Problem.. • .1 is also of the fonn (6.6 we will investigate the sensitivity of these procedures to the assumptions of the model. .1. e ~N(O.l)T.1) j .. • i .1. In the classical Gaussian (nannal) linear model this dependence takes the fonn p 1 Yi where EI. The normal linear regression model is }i = /31 + L j=2 p Zij/3j + €i. Here is Example 1. In vector and matrix notation. . 1 Yn from a population with mean /31 = E(Y).. andJ is then x nidentity matrix.Herep= 1andZnxl = (l. N(O. I I. let expressions such as (J refer to both column and row vectors. i . when there is no ambiguity.4 and 2. .1. " 'I and Y = Z(3 + e. n (6.j .Zip. Notational Convention: In this chapter we will.... say the ith case.3)..2.1.l p.. . " n (6.5) . 366 Inference in the Multiparameter Case Chapter 6 ! . .. These are among the most commonly used statistical techniques.3).1 The Classical Gaussian linear Model l' f Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Yi among n independent observations has a distribution that depends on known constants Zil. In this section we will derive exact statistical procedures under the assumptions of the model (6... . . i = 1.€nareij. ..2) Ii .. It turn~ out that these techniques are sensible and useful outside the narrow framework of model (6.
. one from each population. (6. (6. . if no = 0.3 and Section 4. The pSample Problem Or One. .. To see that this is a linear model we relabel the observations as Y1 . Frequently... We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional distribution of Y given a set of observed covariate values. we often have more than two competing drugs to compare. and so on.1 Inference for Gaussian Linear Models 367 where (31 is called the regression intercept and /32. we want to do so for a variety of locations. ..3) applies.. .6) where Y kl is the response of the lth subject in the group obtaining the kth treatment. ..1. + n p = n.3.2) and (6. where YI. The random design Gaussian linear regression model is given in Example 1. · · . 0 Example 6. .Way Layout.1. The model (6. If we are comparing pollution levels. TWosample models apply when the design values represent a qualitative factor taking on only two values. Yn1 + 1 .1.9. this terminology is commonly used when the design values are qualitative. and the €kl are independent N(O.1fwe set Zil = 1."i = 1.1.~) random variables.1. Ynt +n2 to that getting the second..1. then the notation (6. Generally. Yn1 correspond to the group receiving the first treatment. In Example I.3 we considered experiments involving the comparisons of two population means when we had available two independent samples.6) is an example of what is often called analysis ojvariance models.0: because then Ok represents the difference between the kth and average treatment . In this case.1.•. we arrive at the one~way layout or psample model. and so on. The model (6.5) is called the fixed design norrnallinear regression model.6) is often reparametrized by introducing ( l ~ pl I:~~1 (3k and Ok = 13k .Section 6. /3p are called the regression coefficients.Yn . .4./. We treat the covariate values Zij as fixed (nonrandom).. the design matrix has elements: 1 if L jl nk + 1<i < L nk k=1 j k=l ootherwise and z= o 0 ••• o o Ip where I j is a column vector of nj ones and the 0 in the "row" whose jth member is I j is a column vector of nj zeros. If the control and treatment responses are independent and nonnally distributed with the same variance a 2 . To fix ideas suppose we are interested in comparing the performance of p > 2 treatments on a population and that we administer only one treatment to each subject and a sample of nk subjects get treatment k.3. 1 < k < p. nl + . 13k is the mean response to the kth treatment. we are interested in qualitative factors taking on several values. . n. Then for 1 < j < p.
. Recall that orthononnal means Vj = 0 fori f j andvTvi = 1. The parameter set for f3 is RP and the parameter set for It is W = {I' = Z(3. Let T denote the number of linearly independent Cj. . k model is Inference in the Multiparameter Case Chapter 6 1.. r i' . i=I (6. Even if {3 is not a parameter (is unidentifiable). the linear Y = Z'(3' + E.. i .. However. The Canonical Fonn of the Gaussian Linear Model The linear model can be analyzed easily using some geometry. .! x (p+l) = (1.1. . This type oflinear model with the number of columns d of the design matrix larger than its rank r... . we call Vi and Vj orthogonal. j = I. an orthononnal basis VI. j = I •. b I _ .. n..po In tenns of the new parameter {3* = (0'.368 effects. .. E ~N(O..P. " . and with the parameters identifiable only once d . • I t Ew ¢:> t = L:(v[t)Vi i=l ¢:> vTt = 0. V n for Rn such that VI. n. . Note that Z· is of rank p and that {3* is not identifiable for {3* E RP+l. It is given by 0 = (itl. . {3* is identifiable in the pdimensional linear subspace {(3' E RP+l : L:~~l dk = O} of RP+ 1 obtained by adding the linear restriction E~=l 15k = 0 forced by the definition of the 15k '5. is common in analysis of variance models. i = 1. . We now introduce the canonical variables and means .. •.(T2J) where Z... .. . the vector of means p of Y always is.17). It follows that the parametrization ({3..3..7) !. i = T + 1. V r span w. Note that any t E Jl:l can be written .... vT n i I t and that T = L(vTt)v. by the GramSchmidt process) (see Section B. .r additional linear restrictions have been specified. .1. 1 JLn)T I' where the Cj = Z(3 = L j=l P .2).6jCj are the columns of the design matrix.. • . Z) and Inx I is the vector with n ones. . then r is the rank of Z and w has dimension r. TJi = E(Ui) = V'[ J1. . Because dimw = r. Ui = v. Cj = (Zlj. .. i5p )T.g.. . I5 h . . (3 E RP}.Y.. We assume that n > r. n. •• Note that w is the linear space spanned by the columns Cj. of the design matrix.p. there exists (e. When VrVj = O. j = 1. .a 2 ) is identifiable if and only ifr = p(Problem6.• znj)T.
which are sufficient for (IJ. '" ~ A 1'1..2 L i=l + _ '" ryiui _ 02~ i=l 1 r r 2 (6. u) based on U £('I.9 N(rli. (T2)T. E w.. . The U1 are independent and Ui 11i = 0..1.. n. = L.Cr are constants.1.. .1. .u) 1 ~ 2 n 2 . . In the canonical/orm ofthe Gaussian linear model with 0. whereas (6... 2 0. v~. and by Theorem B.1.2. U.and 7J are equivalently related. . . .3. is Ui = making vTit. .. Then we can write U = AY. 2 ' " 'Ii i=l ~2(T2 6. which is the guide to asymptotic inference in general.2 known (i) T = (UI •.2<7' L. ais (v) The MLE of.10).. 1 If Cl. .2 ) using the parametrization (11.... (3.2 Estimation 0. and .. Proof. Let A nxn be the orthogonal matrix with rows vi.)T is sufJicientfor'l. Un are independent nonnal with 0 variance 0. . (6. (6. i i = 1. Note that Y~AIU.1]r. n.10) It will be convenient to obtain our statistical procedures for the canonical variables U. (iii) U i is the UMVU estimate of1]i. Moreover. Theorem 6.• . U1 . ..'Ii) .1. .. observing U and Y is the same thing.8)(6. 20.. it = E. . n. n because p. also UMVU for CY..1. and then translate them to procedures for .2"log(21r<7 ) t=1 n 1 n _ _ ' " u.. Theorem 6..1. 1 ViUi and Mi is UMVU for Mi. We start by considering the log likelihood £('1.1.).'J nxn .Section 6. where = r + 1. . IJ.. . . 1]r)T varies/reely over Rr.. r.. •. i it and U equivalent. 7J = AIJ.11) _ n log(21r"').1.. = 1..9) Var(Y) ~ Var(U) = .".•• (ii) U 1 .L = 0 for i = r + 1. i (iv) = 1. then the MLE of 0: Ci1]i is a = E~ I CiUi..2 We first consider the known case. . . .1 Inference for G::'::"::"::"::"::L::'::"e::'::'::M::o::d::e'::' '" '3::6::::.' based on Y using (6.(Ui .2 and E(Ui ) = vi J.. while (1]1.1. • .2.Ur istheMLEof1]I.8) So. . . 0.
4 and Example 3.3.Ur'L~ 1 Ul)T is sufficient.) = '7" i = 1. " V n of R n with VI = C = (c" . (i) By observation. . .1 L~ r+l U. = 0.. 2(J i=r+l ~ U. ( 2 )T. . Theorem 6. . j = 1. (6.. {3.2 (ii). 2 2 ()j = 1'0/0.'7. Proof.11) is a function of 1}1.11) by setting T j = Uj . By Problem 3. WI = Q is sufficient for 6 = Ct. .. But because L~ 1 Ui2 L:~ I Ui2 + r+l Ul..4.(v) are still valid. (ii) U1. ~i = vi'TJ.. "7r because. 0  i Next we consider the case in which a 2 is unknown and assume n >r+ 1. there exists an orthonormal basis VI) . (ii) The MLE ofa 2 is n.3 and Example (iii) is clear because 3. ..4.1. .11) is an exponential family with sufficient statistic T.'  .. .. .) (iii) By Theorem 3. where J is the n x n identity matrix. 2:7 r+l un Tis sufficientfor (1]1.2 . Q is UMVU. That is.1. .r) 2:7 1 L~ r+l ul is an unbiased estimator of (J2. . Assume that at least one c is different from zero.4. this statistic is equivalent to T and (i) follows. 1lr only through L~' 1 CUi . .3. . l Cr .. n.Ui · If all thee's are zero. is q(8) = L~ 1 <. i = 1..1. . I.11) has fJi 1 2 = Ui . n " .6.. By (6.3. Ui is UMVU for E(U. . Let Wi = vTu.1.1. I i I . In the canonical Gaussian linear model with a 2 unknown. (v) Follows from (iv). The maximizer is easily seen to be n 1 L~ r+l (Problem 6.) = a.2..(J2J) by Theorem B. and is UMVU for its expectation E(W.N(~. and give a geometric interpretation of ii. r... . T r + 1 = L~ I and ()r+l = 1/20. ..1). then W . .10. we need to maximize n (log 27f(J2) 2 II ".. the MLE of q(9) = I:~ I c.. apply Theorem 3. 370 Inference in the Multiparameter Case Chapter 6 I iI .ti· 1 . recall that the maximum of (6. 0.. and . (We could also apply Theorem 2. I . wecan assume without loss of generality that 2:::'=1 c.• EUl U? Ul Projections We next express j1 in terms of Y.. obtain the MLE fJ of fl. (iii) 8 2 =(n . (i) T = (Ul1" . (6.. = 1. Proof. r. To show (iv). " as a function of 0.2 . (iv) By the invariance of the MLE (Section 2.. define the norm It I of a vector tERn by  ! It!' = I:~ .4. To show (ii). 1 Ur .2... . . I i > . By GramSchmidt orthogonalization. i > r+ 1. The distribution of W is an exponential family. ."7i)2 and is minimized by setting '(Ii = Ui.2).O)T E R n .1. (U I . 1 U1" are the MLEs of 1}1.4.1... by observation.11).6 to the canonical exponential family obtained from (6. To this end. (iv) The conclusions a/Theorem 6.1.1..2.1]r.
fj = arg min{Iy . ZTZ is nonsingular.3 of.12) implies ZT. Proof.' and is given by ji (ii) jl is orthogonal to Y (iii) 8' ~ = zfj (6. . by (6.J) of a point y Yo =argmin{lyW: tEw}. and /3 = (ZTZll ZT 1"..3. any linear combination ofY's is also a linear combination of U's. because Z has full rank.L. note that 6. . ~ The maximum likelihood estimate .1.L of vectors s orthogonal to w can be written as ~ 2:7 w. Thus.9)..Z/3I' : /3 E W}. and  Jii is the UMVU estimate of J.jil' /(n r) (6. which implies ZT s = 0 for all s E w.1 Inference for Gaussian linear Models 371 E Rn on w is the point Definition 6. fj = (ZTZ)lZTji..3 maximizes 1 log p(y..13) (iv) lfp = r.1. /3 = (ZTZ)l ZT. _.n.L = {s ERn: ST(Z/3) = Oforall/3 E RP}.1. the MLE of {3 equal~ the least squares estimate (LSE) of {3 defined in Example 2.Z/3 I' .ji) = 0 and the second equality in (6. That is. the MLE = LSE of /3 is unique and given by (6. .1 and Section 2..3 E RP.Ii = . In the Gaussian linear model (i) jl is the unique projection oiY on L<. note that the space w.p...3(iv).2(iv) and 6..1. It follows that /3T (ZT s) = 0 for all/3 E RP. (ii) and (iii) are also clear from Theorem r+l VjUj ..1. then /3 is identifiable. f3j and Jii are UMVU because. spans w. j = 1. ZT (Y .3 because Ii = l:~ 1 ViUi and Y .2.2<T' IY .1.n log(21f<T .12) ii. = Z T Z/3 and ZTji = ZTZfj and. <T) = .' and by Theorems 6.1. = Z/3 and (6. ..1. 0 ~ . /3. i=l.1.. To show f3 = (ZTZ)lZTy.. IY . (i) is clear because Z.14) follows..1.ti.14) (v) f3j is the UMVU estimate of (3j.1.. equivalently.1.1.4. To show (iv). The projection Yo = 7l"(Y I L. ) 2 or.Section 6.1. We have Theorem 6. any linear combination of U's is a UMVU estimate of its expectation.
Var(€) = (12(J . .Y = (J ..3) that if J = J nxn is the identity matrix.'. the ith component of the fitted value iJ. • ~ ~ Y=HY where i• 1 . The goodness of the fit is measured by the residual sum of squares (RSS) IY . w~ obtain Pi = Zi(3. : H = Z(Z T Z)IZT The matrix H is the projection matrix mapping R n into w. Example 2. i .2 illustrates this tenninology in the context of Example 6.)I are the vertical distances from the points to the fitted line. the best MSPE predictor of Y if f3 is known as well as z is E(Y) = ZT{3 and its best (UMVU) estimate not knowing f3 is Y = ZT {j.14) and the normal equations (Z T Z)f3 = Z~Y. (ii) (iii) y ~ N(J. = [y.1." As a projection matrix H is necessarily symmetric and idempotent. Suppose we are given a value of the covariate z at which a value Y following the linear model (6. i = 1. 1 · .1.. Note that by (6. n lie on the regression line fitted to the data {('" y. € ~ N(o. . In this method of "prediction" of Yi.16) J I I CoroUary 6. We can now conclude the following. (iv) ifp = r.' € = Y .1. .ill 2 = 1 q.4. That is. (j ~ N(f3. .H).1.. • .). By Theorem 1.372 Inference in the Multiparameter Case Chapter 6 Note that in Example 2..5. 1 < i < n.2 with P = 2.H)Y. In the Gaussian linear model (i) the fitted values Y = iJ. see also Section RIO.3) is to be taken. i = 1. L7 1 .14). ~ ~ ·. 1 < i < n.2.1. Y = il.1. it is commOn to write Yi for 'j1i. . I . The estimate fi = Z{3 of It is called thefitted value and € = y . H T 2 =H.1. In statistics it is also called the hat matrix because it "puts the hat on Y. .1. moreover. n}. and I. (12 (ZTz) 1 ).H)).. I. and the residual € are independent. then (6. .12) and (6. whenp = r. =H . . The residuals are the projection of Y on the orthocomplement of wand ·i .1 we give an alternative derivation of (6. (12(J .1. 1 ~ ~ ! I ~ ~ ~ i .15) Next note that the residuals can be written as ~ I .1.ii is called the residual from this fit.1. H I I It follows from this and (B..({3t + {3. Taking z = Zi. There the points Pi = {31 + fhzi. the residuals €.I" (12H). (6.
.1. In the Gaussian case..Ill'. which we have seen before in the unbiased estimator 8 2 of (72 is L:~ I (Yi ~  Yf/ Problem 1. 'E) is a linear transformation of U and. .Section 6..no The variances of . k = 1.4. The error variance (72 = Var(El) can be unbiasedly estimated by 8 2 = (n _p)lIY .8.. 0 ~ ~ ~ ~ ~ ~ ~ Example 6.1. . 0 Example 6.p.. The independence follows from the identification of j1 and € in tenns of the Ui in the theorem..1.1. . One Sample (continued).8 and that {3j and f1i are UMVU for {3j and J1.2.1. (n .8. } is a multipleindexed sequence of numbers or variables. .1. The OneWay Layout (continued). k = 1. + n p and we can write the least squares estimates as {3k=Yk . Example 6.81 and i1 = {31 Y. If {Cijk _. .1).2)..1 and Section 2. hence.. the treatments. Thus.a of the kth treatment is ~ Pk = Yk..8) = (J2(Z T Z)1 follows from (B.p.. (not Y. k=I.ii. in general) p k=l L and the UMVU estimate of the incremental effect Ok = {3k .1 ~ Inference for Gaussian Linear Models 373 Proof (Y. Y. Var(. . We now see that the MLE of It is il = Z. joint Gaussian.3. then replacement of a subscript by a dot indicates that we are considering the average over that subscript.p. i = 1.. 0 ~ We now return to our examples. In this example the nonnal equations (Z T Z)(3 = ZY become n. then the MLE = LSE estimate is j3 = (ZTZr1 ZTy as seen before in Example 2.. . Here Ji = .ii = Y. where n = nl + ..1. If the design matrix Z has rank P.j = 1. and € are nonnally distributed with Y and € independent. 1=1 At this point we introduce an important notational convention in statistics. . and€"= Y .p. Moreover. Regression (continued).i respectively.8 and (3. a = {3. o .3. is th~ UMVU estimate of the average effect of all a 1 p = Yk . nk(3k = LYkl.1.5.1..Y are given in Corollary 6.3)..3. By Theorem 6. . .2. in the Gaussian model..
q < r. the mean vector is an element of the space {JL : J.\(y) = sup{p(Y. .7 and 2. The first inferential question is typically "Are the means equal or notT' Thus we test H : 131 = .6. which together with a 2 specifies the model.JL): JL E wo} ~ ! . .374 Inference in the Multiparameter Case Chapter 6 Remark 6.2.1. whereas for the full model JL is in a pdimensional subspace of Rn. under H.L are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance L~ 1 IYi .. in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider. . .3 Tests and Confidence Intervals . Thus. j." Now.1. For more on LADEs. which is a onedimensional subspace of Rn. = {3p = 13 for some 13 E R versus K: "the f3's are not all equal.. .17).1. An alternative approach 10 the MLEs for the nonnal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors El. .. .4. j where Zi2 is the dose level of the drug given the ith patient.... 1 and the estimates of f3 and J.p}. The most important hypothesistesting questions in the context of a linear model correspond to restriction of the vector of means JL to a linear subspace of the space w. Now we would test H : 132 = 0 versus K : 132 =I O.al.17) I. i = 1. " .JL): JL E w} sup{p(Y. . 1 < .. in the context of Example 6. see Problems 1. {JL : Jti = 131 + f33zi3. However. we let w correspond to the full model with dimension r and let Wo be a qdimensional linear subspace over which JL can range under the null hypothesis H. In general. i = 1. and the matrix IIziillnx3 with Zil = 1 has rank 3. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEssee Stigler (1986).zT . Zi3 is the age of the ith patient. under H. the LADEs are obtained fairly quickly by modem computing methods.31. • = 131 + f32Zi2 + f33Zi3.En in (6. a regression equation of the form mean response ~ .i 1 6.1) have the Laplace distribution with density . . The LSEs are preferred because of ease of computation and their geometric properties. ! j ! I I .2.1. We first consider the a 2 known case and consider the likelihood ratio statistic .1.1. For instance.. Next consider the psample model of Example 1. n} is a twodimensional linear subspace of the full model's threedimensional linear subspace of Rn given by (6. • .3 with 13k representing the mean resJXlnse for the kth population. . .Li = 13 E R. I.1. j 1 • 1 . see Koenker and D'Orey (1987) and Portnoy and Koenker (1997). (6.
'" (}r)T (see Problem B..20) i=q+I 210g.1. 2 log . .1.iio 2 1 } where i1 and flo are the projections of Yon wand wo. respectively.1. 1) distribution with OJ = 'Ida. V r span wand set .I8) then. Write 'Ii is as defined in (6.\(Y) = L i=q+l r (ud a )'.\(Y) Proof We only need to establish the second equality in (6.1. . We write X.1.1. .q {(}2) for this distribution.q degrees offreedom and noncentrality parameter Ej2 = 181 2 = L:=q+ I where 8 = «(}q+l.Section 6_1 Inference for Gaussian Linear Models 375 for testing H : fL E Wo versus j{ : JL E W  woo Because (6. 2 log ..1.\(Y) ~ exp {  2t21Y .q( (}2) distribution with 0 2 =a 2 ' \ '1]i L. by Theorem 6.J X.l9) then.l2).\(Y) has a X.19).IY . then r._q' = AJL where A L 'I..3. . Note that (uda) has a N(Oi. . (}r. In this case the distribution of L:~q+1 (Uda? is called a chisquare distribution with r .iii' ..1..2(v). v~' such that VI..21).. o . . . V q (6. = IJL i=q+l r JLol'. when H holds. by Theorem 6. span Wo and VI. In the Gaussian linear model with 17 2 known.. i=q+l r = a 21 fL  Ito [2 (6. Proposition 6.\(Y) It follows that = exp 1 2172 L '\' Ui' (6.1.1.1..21 ) where fLo is the projection of fL on woo In particular.. We have shown the following. But if we let A nx n be an orthogonal matrix with rows vi.
iT'): /L E w} .1. The distribution of such a variable is called the noncentral F distribution with noncentrality parameter 0 2 and r . In particular.L n where p{y. The resulting test is intuitive.23) .liol' IY r _q IY _ iii' iii' (r . Substituting j).q and m = n .2.1. has the noncentral F distribution F r _q./Lol'. We have seen in PmIXlsition 6. . We know from Problem 6. We have shown the following.2.liol' (n _ r) 'IY _ iii' .r degrees affreedam (see Prohlem B.1. we can write .'I/L.1.1..m{O') for this distrihution where k = r .• j . and 86 into the likelihood ratio statistic.. In Proposition 6.a:.1 suppose the assumption "0. ]t consists of rejecting H when the fit./L.JLo. which is equivalent to the likelihood ratio statistic for H : Ii.376 Inference in the Multiparameter Case Chapter 6 Next consider the case in which a 2 is unknown.E Wo for K : Jt E W . By the canonicaL representation (6.itol 2 = L~ q+ 1 (Ui /0) 2..itol 2 have a X. Remark 6.wo.~I.') denotes the righthand side of (6.2 is known" is replaced by '.1. statistic :\(y) _ max{p{y. which has a X~r distribution and is independent of u. . as measured by the residual sum of squares under the model specified by H.5) that if we introduce the variance equal likelihood ratio w'" . IIY .1.1 that 021it .'}' YJ. {io.a = ~(Y'E. Proposition 6.max{p{y. T is called the F statistic for the general linear hypothesis.I}.iT') : /L E wo} (6. E In this case.3.1 that the MLEs of a 2 for It E wand It E wQ are . In the Gaussian linear model the F statistic defined by (6. T has the representation .q)'Ili . T = n ./L.1.22) Because T = {n .21ii.14). Thus.. I .2 1JL .2 is the same under Hand K and estimated by the MLE 0:2 for /.19). it can be shown (Problem 6.. J • T = (noncentral X:.r){r .1.L a =  n I' and .l) {Ir . (6.q)'{[A{Y)J'/n ..I.0.. .r IY . 8 2 .JtoI 2 . For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >. when H I lwlds.1..n_r(02) where 0 2 = u. T is an increasing function of A{Y) and the two test statistics are equivalent._q(02) distribution with 0' = .r. .22)./to I' ' n J respectively. we obtain o A(Y) = P Y. = aD IIY . We write :h. .nr distribution. /L.q and n .q variable)/df (central X~T variahle)/df with the numerator and denominator independent. is poor compared to the fit under the general model.'IY _ iii' = L~ r+l (Ui /0)2.18).(Y).1.J. T has the (central) Frq.
1.1.1. and the Pythagorean identity.1.1.25) which we exploited in the preceding derivations.r)ln central X~_cln where T is the F statistic (6.\(Y) = . (6.19) made it possible to recognize the identity IY . r = 1 and T = 1'0 versus K : (3 'f 1'0. We next return to our examples.1. Yl = Y2 Figure 6. The projections il and ilo of Yon w and Wo. In this = (Y 1'0)2 (nl) lE(Y.iLol'.(Y) equals the likelihood ratio statistic for the 0'2.1.2. where t is the onesample Student t statistic of Section 4. 0 . The canonical representation (6.1 Inference for Gaussian Linear Models ~ 377 then >.Section 6.3. It follows that ~ (J2 known case with rT 2 replaced by r.24) Remark 6.q noncentral X? q 2Iog.9.1.iLl' ~++Yl I' I' 1 . Y3 y Iy .IO. This is the Pythagorean identity.iLl' + liL .1.iLol' = IY . (6.1.1'0 I' y.Y)2' which we recognize as t 2 / n. Example 6.1 and Section B. We test H : (31 case wo = {I'o}. See Pigure 6.22). q = 0. One Sample (continued).~T = (n .
.1..1. However. age. 1 02 = (J2(p . .1 L~=. P nk (Y. I • ! .27) 1 .q covariates in multiple regression have an effect after fitting the first q.Q)f3f(ZfZ2)f32.)2....1. _y.22) we can write the F statistic version of the likelihood ratio test in the intuitive fonn =   i .2..26) 0 versus K : f3 2 i' O.P .. . j ! j • litPol2 = LL(Yk _y.  I 1.q) x 1 vector of main (e.1.26) and H.22) we obtain the F statistic for the hypothesis H in the oneway layout .n 378 Inference in the Multiparameter Case Chapter 6 iI II " • Example 6. I • I b .=11=1 k=1 •• Substituting in (6. = {3p. Under H all the observations have the same mean so that. .vy.RSSF)/(dflf .dfp) RSSF/dh I 2 where RSSF = IY and RSSH = IY .ZfZl(ZiZl)'ziz. The One. . economic status) coefficients.1. 0 I j Example 6.1. Z2) where Z1 is n x q and Z2 is 11 X (p .q).: I :I T _ n ~ p L~l nk(Y . As we indicated earlier. Without loss of generality we ask whether the last p . 7) iW l.Way Layout (continued).Yk. The F test rejects H if F is large when compared to the ath quantile of the Fpq.. Regression (continued). I3I) where {32 is a (p . •.. Recall that the least squares estimates of {31. . which only depends on the second set of variables and coefficients.q)1f3nZfZ2 . respectivelY..' • ito = Thus.1. Using (6. • 1 F = (RSSlf . We consider the possibility that a subset of p . anddfF = np and dfH = nq are the corresponding degrees of freedom. In this case f3 (ZrZ)lZry and f3 0 = (Z[Ztl1Z[Y are the ML& under the full model (6. P ... To formulate this question. '1 Yp.{3p are Y1.q covariates does not affect the mean response. in general 02 depends on the sample correlations between the variables in Zl and those in Z2' This issue is discussed further in Example 62.np distribution. respectively.ito 1 are the residual sums of squares under the full model and H.y)2 k . Now the linear model can be written as i i I I ! We test H : f32 (6. L~"(Yk' .)2= Lnk(Yk.3.. Under the alternative F has a noncentral Fp_q.1. 02 simplifies to (J2(p .}f32. . (6.1. we want to test H : {31 = .g. we partition the design matrix Z by writing it as Z = (ZI. In the special case that ZrZ2 = 0 so the variables in Zl are orthogonal to the variables in Z2.g..n_p(fP) distribution with noncentrality parameter (Problem 6.. and we partition {3 as f3T = (f3f.)2' . treatment) effect coefficients and f3 1 is a q x 1 vector of "nuisance" (e..
np distribution with noncentrality parameter (6. . n. identifying 0' and (p .. and the F statistic. which is their ratio. .p) degrees of freedom. . As an illustration.. consider the following data(I) giving blood cholesterol levels of men in three different socioeconomic groups labeled I..29) Thus.1) and 8Sw /(n ..Y.28) where j3 = n I :2:=~"" 1 nifJi. Because 88B/0" and SSw / a' are independent X' variables with (p .2 1M . This information as well as S8B/(P ..25) 88T = 888 + 88w ...1) degrees offreedom as "coming" from SS8/a' and the remaining (n . T has a noncentral Fp1..1 Inference for Gaussian Linear Models 379 When H holds.1. k=l 1=1 measures variation within the samples. are often summarized in what is known as an analysis a/variance (ANOVA) table. the unbiased estimates of 02 and a 2 .1 and 6.p). Ypl . To derive IF. . then by the Pythagorean identity (6. .···.1. into two constituent components.. . sum of squares in the denominator. we have a decomposition of the variability of the whole set of data. ijjT = There is an interesting way of looking at the pieces of infonnation summarized by the F statistic. We assume the oneway layout is valid. 888 =L k=I P nk(Yk. Yi nl .)' . we see that the decomposition (6. . are not all equal.3. Note that this implies the possibly unrealistic ..1. the total sum of squares. respectively.p) of the (n 1) degrees offreedom of SST /0" as "cooling" from S8w/a'.fJ. . ... SST. SSB.1.1. . .30) can also be viewed stochastically. . (3p)T and its projection 1'0 = (ij.. (3p. compute a. See Tables 6. Ypnp' The 88w ~ L L(Y Y p "' k'  k )'. T has a Fpl. ..1) degrees of freedom and noncentrality parameter 0'. is a measure of variation between the p samples YII . k=1 1=1 which measures the variability of the pooled samples. The sum of squares in the numerator.fLo 12 for the vector Jt (rh.1.np distribution. and III with I being the "high" end. [f we define the total sum of squares as 88T =L l' L(Y nk k'  Y. the between groups (or treatment) sum of squares and SSw. If the fJ.Section 6.)'. (6.1) and (n . (31. (3" .. SST/a 2 is a (noncentral) X2 variable with (n . the within groups (or residual) sum of squares.
TABLE 6.1.1. ANOVA table for the one-way layout

                  Sum of squares                                    d.f.    Mean squares         F-value
Between samples   SSB = Σ_{k=1}^p n_k (Ȳ_k. − Ȳ..)²                p − 1   MSB = SSB/(p − 1)    MSB/MSW
Within samples    SSW = Σ_{k=1}^p Σ_{l=1}^{n_k} (Y_kl − Ȳ_k.)²     n − p   MSW = SSW/(n − p)
Total             SST = Σ_{k=1}^p Σ_{l=1}^{n_k} (Y_kl − Ȳ..)²      n − 1
TABLE 6.1.2. Blood cholesterol levels

I     403  311  269  336  259
II    312  222  302  420  420  386  353  210  286  290
III   403  244  353  235  319  260
assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol of the three groups. Here p = 3, n_1 = 5, n_2 = 10, n_3 = 6, n = 21, and we compute
TABLE 6.1.3. ANOVA table for the cholesterol data

                  SS         d.f.   MS        F-value
Between groups     1202.5      2     601.2    0.126
Within groups     85750.5     18    4763.9
Total             86953.0     20
From the F tables, we find that the p-value corresponding to the F-value 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups.
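As a check on the arithmetic in Table 6.1.3, here is a minimal sketch of the one-way ANOVA computation for these data. The use of numpy and scipy.stats is an illustration choice, not part of the original analysis.

```python
import numpy as np
from scipy.stats import f as f_dist

# Blood cholesterol levels for the three socioeconomic groups (Table 6.1.2).
groups = [
    np.array([403, 311, 269, 336, 259]),                            # I
    np.array([312, 222, 302, 420, 420, 386, 353, 210, 286, 290]),   # II
    np.array([403, 244, 353, 235, 319, 260]),                       # III
]
p = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between- and within-group sums of squares as in Table 6.1.1.
SSB = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
SSW = sum(((g - g.mean()) ** 2).sum() for g in groups)

MSB = SSB / (p - 1)
MSW = SSW / (n - p)
F = MSB / MSW

# Under H: beta_1 = ... = beta_p, F has an F_{p-1, n-p} distribution.
p_value = 1 - f_dist.cdf(F, p - 1, n - p)
print(F, p_value)   # roughly 0.126 and 0.88 for these data
```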
,
Inference for Gaussian linear Models
~
381
Confidence Intervals and Regions We next use our distributional results and the method of pivots to find confidence intervals for J.li, 1 < i < n, !3j, 1 <j < p, and in general, any linear combination
n
,p
= ,p(/l) = Lai/Li
i::: 1
~ aT /l
of the J.l's. If we set;j;
= 1:7
1 ai/Ii
= aT fl and
~
~
where H is the hat matrix, then (,p  ,p)ja(,p) has a N(O, I) distribution. Moreover,
(n  r)8 2ja 2 ~
IY  iii 2 ja 2 =
L
i=r+l
n
(Uda 2 )'
has a X~r distribution and is independent of ;;;. Let
~
~
be an estimate of the standard deviation a('IjJ) of
~
'0. This estimated standard deviation is
called the standard error of 'IjJ. By referring to the definition of the t distribution, we find that the pivot
has a TnT. distribution. Let t n  r (1  40) denote the 1 ~o: quantile of the bution, then by solving IT(,p)1 < t n _, (I  ~",) for,p, we find that
Tn r
distri
is, in the Gaussian linear model, a 100(1  a)% confidence interval for 1/J. Example 6.1.1. One Sample (continued). Consider'IjJ = p. We obtain the interval
i' ~ Y ± t n 1
(1
~q) sj.,fii,
which is the same as the interval of Example 4.4.1 and Section 4.9.2. Example 6.1.2. Regression (continued). Assume that p = T. First consider 1/J = f3j for some specified ~gression coefficient (3j. The 100(1  a)% confidence interval for (3j is
(3j
β_j = β̂_j ± t_{n−p}(1 − α/2) s {[(ZᵀZ)⁻¹]_{jj}}^{1/2}
Inference in the Multiparameter Case
Chapter 6
where [(ZTZ)~l]j) is the jthdiagonal element of (ZTZ)~ '. Computersoftware computes (ZTZ)~I and labels S{[(ZTZ)~lI]j); as the standard errOr of the (estimated) jth regression coefficient. Next consider ljJ = j.ti = mean response for the ith case, 1 < i < n. The level (1  0:) confidence interval is
J.Li =
Jii ± t n _ p (1 
!a) sJh:
where hii is the ith diagonal element of the hat matrix H. Here sjh;; is called the standard error of the (estimated) mean of the ith case. Next consider the special case in which p = 2 and
Yi = 131 + 132Zi2 + Eil i = 1 ", n.
"
If we use the identity
n
~)Zi2  Z2)(l'i  Y)
i=1
~)Zi2  Z.2)l'i,
We obtain from Example 2.2.2 that
ih ~
Because Var(Yi) = a 2 , we obtain
~ Var(.6,)
L~l (Zi2  z.,)l'i
L~ 1 (Zi2  Z.2)2 .
(6.1.30)
= (J I L)Zi' i=l
''''
n
Z.2) ,
,
and the 100(1 - α)% confidence interval for β_2 has the form

β_2 = β̂_2 ± t_{n-p}(1 - α/2) s / [Σ_{i=1}^n (z_{i2} - z̄_{·2})²]^{1/2}.

The confidence interval for β_1 is given in Problem 6.1.10.
Similarly, in the p = 2 case, it is straightforward (Problem 6.1.10) to compute

h_{ii} = 1/n + (z_{i2} - z̄_{·2})² / Σ_{i=1}^n (z_{i2} - z̄_{·2})²,

and the confidence interval for the mean response μ_i of the ith case has a simple explicit form.
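A minimal numerical sketch of the intervals just derived, using (Z^T Z)^{-1} and the hat matrix: the design and responses below are simulated and purely illustrative, not taken from the text.

```python
import numpy as np
from scipy.stats import t

def regression_intervals(Z, Y, alpha=0.05):
    """t intervals for the regression coefficients and for the mean responses."""
    n, p = Z.shape
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta_hat = ZtZ_inv @ Z.T @ Y
    H = Z @ ZtZ_inv @ Z.T                       # hat matrix
    resid = Y - Z @ beta_hat
    s2 = resid @ resid / (n - p)                # s^2 = |Y - mu_hat|^2 / (n - p)
    tq = t.ppf(1 - alpha / 2, n - p)
    se_beta = np.sqrt(s2 * np.diag(ZtZ_inv))    # s [(Z'Z)^{-1}]_{jj}^{1/2}
    beta_ci = np.stack([beta_hat - tq * se_beta, beta_hat + tq * se_beta], axis=1)
    se_mean = np.sqrt(s2 * np.diag(H))          # s sqrt(h_ii)
    mu_hat = Z @ beta_hat
    mu_ci = np.stack([mu_hat - tq * se_mean, mu_hat + tq * se_mean], axis=1)
    return beta_hat, beta_ci, mu_ci

# illustrative p = 2 example: intercept plus one covariate
rng = np.random.default_rng(1)
z2 = rng.uniform(0, 10, size=25)
Z = np.column_stack([np.ones(25), z2])
Y = 1.0 + 2.0 * z2 + rng.normal(0, 1, size=25)
beta_hat, beta_ci, mu_ci = regression_intervals(Z, Y)
print(beta_hat, beta_ci, sep="\n")
```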
Example 6.1.3. One-Way Layout (continued). We consider ψ = β_k, 1 ≤ k ≤ p. Because β̂_k = Ȳ_k· ~ N(β_k, σ²/n_k), we find the 100(1 - α)% confidence interval

β_k = Ȳ_k· ± t_{n-p}(1 - α/2) s/√n_k

where s² = SSW/(n - p). The intervals for μ = β̄ and the incremental effect δ_k = β_k - μ are given in Problem 6.1.11. □
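The interval for β_k needs only the group means, the group sizes, and s² = SSW/(n - p); a short sketch (group data again illustrative):

```python
import numpy as np
from scipy.stats import t

def group_mean_intervals(groups, alpha=0.05):
    """Confidence intervals beta_k = Ybar_k +- t_{n-p}(1-alpha/2) s/sqrt(n_k)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n, p = sum(len(g) for g in groups), len(groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    s = np.sqrt(ssw / (n - p))
    tq = t.ppf(1 - alpha / 2, n - p)
    return [(g.mean() - tq * s / np.sqrt(len(g)),
             g.mean() + tq * s / np.sqrt(len(g))) for g in groups]
```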
Joint Confidence Regions
We have seen how to find confidence intervals for each individual β_j, 1 ≤ j ≤ p. We next consider the problem of finding a confidence region C in R^p that covers the vector β with prescribed probability (1 - α). This can be done by inverting the likelihood ratio test or, equivalently, the F test. That is, we let C be the collection of β_0 that is accepted when the level (1 - α) F test is used to test H : β = β_0. Under H, μ = μ_0 = Zβ_0, and the numerator of the F statistic (6.1.22) is based on
|μ̂ - μ_0|² = |Zβ̂ - Zβ_0|² = (β̂ - β_0)^T (Z^T Z)(β̂ - β_0).

Thus, using (6.1.22), the simultaneous confidence region for β is the ellipse

C = {β_0 : (β̂ - β_0)^T (Z^T Z)(β̂ - β_0) / (r s²) ≤ f_{r,n-r}(1 - α)}   (6.1.31)

where f_{r,n-r}(1 - α) is the 1 - α quantile of the F_{r,n-r} distribution.
Example 6.1.2. Regression (continued). We consider the case p = r and, as in (6.1.26), write Z = (Z_1, Z_2) and β^T = (β_1^T, β_2^T), where β_2 is a vector of main effect coefficients and β_1 is a vector of "nuisance" coefficients. Similarly, we partition β̂ as β̂ = (β̂_1^T, β̂_2^T)^T where β̂_1 is q × 1 and β̂_2 is (p - q) × 1. By Corollary 6.1.1, σ²(Z^T Z)^{-1} is the variance-covariance matrix of β̂. It follows that if we let S denote the lower right (p - q) × (p - q) corner of (Z^T Z)^{-1}, then σ²S is the variance-covariance matrix of β̂_2. Thus, a joint 100(1 - α)% confidence region for β_2 is the p - q dimensional ellipse
C = {β_{02} : (β̂_2 - β_{02})^T S^{-1} (β̂_2 - β_{02}) / ((p - q) s²) ≤ f_{p-q,n-p}(1 - α)}.   □
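Numerically, membership of a candidate β_0 in the ellipse (6.1.31) is a one-line check against an F quantile. A sketch (assuming a full-rank design Z; the data are whatever the user supplies):

```python
import numpy as np
from scipy.stats import f

def in_joint_region(Z, Y, beta0, alpha=0.05):
    """True if beta0 lies in the confidence ellipse (6.1.31) for the full vector beta."""
    n, r = Z.shape                                  # assumes Z has full rank r = p
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    resid = Y - Z @ beta_hat
    s2 = resid @ resid / (n - r)
    quad = (beta_hat - beta0) @ (Z.T @ Z) @ (beta_hat - beta0)
    return quad / (r * s2) <= f.ppf(1 - alpha, r, n - r)
```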
Summary. We consider the classical Gaussian linear model in which the response Y_i of the ith case in an experiment is expressed as a linear combination μ_i = Σ_{j=1}^p β_j z_{ij} of covariates plus an error ε_i, where ε_1, ..., ε_n are i.i.d. N(0, σ²). By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients {β_j}, the response means {μ_i}, and linear combinations of these.
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for onedimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
6.2.1
Estimating Equations
Our assumptions are as before save that everything is made a vector: X_1, ..., X_n are i.i.d. P where P ∈ Q, a model containing P = {P_θ : θ ∈ Θ} such that

(i) Θ is an open subset of R^p.

(ii) The densities of P_θ are p(·, θ), θ ∈ Θ.
The following result gives the general asymptotic behavior of the solution of estimating equations.
A0. Ψ = (ψ_1, ..., ψ_p)^T, where ψ_j = ∂ρ/∂θ_j, is well defined and

(1/n) Σ_{i=1}^n Ψ(X_i, θ̂_n) = 0.   (6.2.1)
A solution to (6.2.1) is called an estimating equation estimate or an M estimate.
A1. The parameter θ(P) given by the solution of (the nonlinear system of p equations in p unknowns)

∫ Ψ(x, θ) dP(x) = 0   (6.2.2)

is well defined on Q, so that θ(P) is the unique solution of (6.2.2). Necessarily θ(P_θ) = θ because Q ⊃ P.

A2. E_P |Ψ(X_1, θ(P))|² < ∞ where |·| is the Euclidean norm.
A3. ψ_i(·, θ), 1 ≤ i ≤ p, have first-order partials with respect to all coordinates and, using the notation of Section B.8,

E_P DΨ(X_1, θ(P)),

where DΨ(x, θ) = ||∂ψ_i/∂θ_j (x, θ)||_{p×p}, is nonsingular.
A4. sup{ |(1/n) Σ_{i=1}^n (DΨ(X_i, t) - DΨ(X_i, θ(P)))| : |t - θ(P)| ≤ ε_n } → 0 in probability if ε_n → 0.

A5. θ̂_n →_P θ(P) for all P ∈ Q.
Theorem 6.2.1. Under A0-A5 of this section,

θ̂_n = θ(P) + (1/n) Σ_{i=1}^n Ψ̄(X_i, θ(P)) + o_P(n^{-1/2})   (6.2.3)

where

Ψ̄(x, θ(P)) = -[E_P DΨ(X_1, θ(P))]^{-1} Ψ(x, θ(P)).   (6.2.4)
Hence,

√n(θ̂_n - θ(P)) → N(0, Σ(Ψ, P))   (6.2.5)

where

Σ(Ψ, P) = J(θ, P) E_P[ΨΨ^T(X_1, θ(P))] J^T(θ, P)   (6.2.6)

and

J^{-1}(θ, P) = E_P DΨ(X_1, θ(P)) = E_P ||∂ψ_i/∂θ_j (X_1, θ(P))||.
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,

-(1/n) Σ_{i=1}^n Ψ(X_i, θ(P)) = [(1/n) Σ_{i=1}^n DΨ(X_i, θ*_n)] (θ̂_n - θ(P)).   (6.2.7)
Note that the left-hand side of (6.2.7) is a p × 1 vector; the right is the product of a p × p matrix and a p × 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2, save that we need the observation that the set of nonsingular p × p matrices, when viewed as vectors, is an open subset of R^{p²}, representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that, with probability tending to 1, (1/n) Σ_{i=1}^n DΨ(X_i, θ*_n) is nonsingular.

Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of θ̂_n is motivated by P, the behavior in (6.2.3) is guaranteed for P ∈ Q, which can include P ∉ P. In fact, typically Q is essentially the set of P's for which θ(P) can be defined uniquely by (6.2.2). We can again extend the assumptions of Section 5.4.2 to:

A6. If l(·, θ) is differentiable,
E_θ DΨ(X_1, θ) = -E_θ[Ψ(X_1, θ) Dl(X_1, θ)] = -Cov_θ(Ψ(X_1, θ), Dl(X_1, θ))   (6.2.8)

with Cov_θ defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A6' extend to the multivariate case readily. Note that consistency of θ̂_n is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate θ̃_n will find a solution θ̂_n of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
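In practice the asymptotic variance (6.2.5)-(6.2.6) is usually estimated by plugging in sample averages evaluated at θ̂_n. A minimal sketch for a one-parameter location problem with the Huber ψ (the data, tuning constant, and function names are illustrative assumptions, not from the text):

```python
import numpy as np
from scipy.optimize import brentq

def huber_psi(x, k=1.345):
    return np.clip(x, -k, k)

def m_estimate_location(x, k=1.345):
    """Solve (1/n) sum psi(x_i - theta) = 0; return theta_hat and its estimated variance."""
    x = np.asarray(x, dtype=float)
    g = lambda th: huber_psi(x - th, k).mean()
    theta_hat = brentq(g, x.min(), x.max())       # root of the estimating equation (6.2.1)
    r = x - theta_hat
    EDpsi = -(np.abs(r) <= k).mean()              # sample version of E_P D Psi (here a scalar)
    Epsi2 = (huber_psi(r, k) ** 2).mean()         # sample version of E_P[Psi Psi^T]
    sandwich = Epsi2 / EDpsi ** 2                 # Sigma(Psi, P) estimate from (6.2.6)
    return theta_hat, sandwich / len(x)           # Var(theta_hat) is approximately Sigma/n

rng = np.random.default_rng(2)
print(m_estimate_location(rng.standard_t(df=3, size=200)))
```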
6.2.2 Asymptotic Normality and Efficiency of the MLE
If we take ρ(x, θ) = -l(x, θ) = -log p(x, θ), and Ψ(x, θ) = Dl^T(x, θ) obeys A0-A6, then (6.2.8) becomes

-E_θ D²l(X_1, θ) = E_θ[Dl^T(X_1, θ) Dl(X_1, θ)] = Var_θ Dl(X_1, θ)   (6.2.9)

where I(θ) = Var_θ Dl(X_1, θ) is the Fisher information matrix introduced in Section 3.4. If ρ : Θ → R, Θ ⊂ R^d, is a scalar function, the matrix ||∂²ρ/∂θ_i∂θ_j (θ)|| is known as the Hessian or curvature matrix of the surface ρ. Thus, (6.2.9) states that the expected value of the Hessian of l is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
Theorem 6.2.2. If A0-A6 hold for ρ(x, θ) = -log p(x, θ), then the MLE θ̂_n satisfies

θ̂_n = θ + (1/n) Σ_{i=1}^n I^{-1}(θ) Dl(X_i, θ) + o_p(n^{-1/2})   (6.2.10)

so that

√n(θ̂_n - θ) → N(0, I^{-1}(θ)).   (6.2.11)

If θ̃_n is a minimum contrast estimate with ρ and ψ satisfying A0-A6 and corresponding asymptotic variance matrix Σ(Ψ, P_θ), then

Σ(Ψ, P_θ) ≥ I^{-1}(θ)   (6.2.12)

in the sense of Theorem 3.4.4, with equality in (6.2.12) for θ = θ_0 iff, under θ_0,

θ̃_n = θ̂_n + o_p(n^{-1/2}).   (6.2.13)
Proof. The proofs of (6.2.10) and (6.2.11) parallel those of (5.4.33) and (5.4.34) exactly. The proof of (6.2.12) parallels that of Theorem 3.4.4. For completeness we give it. Note that by (6.2.6) and (6.2.8)

Σ(Ψ, P_θ) = Cov_θ^{-1}(U, V) Var_θ(U) Cov_θ^{-1}(V, U)   (6.2.14)

where U = Ψ(X_1, θ), V = Dl(X_1, θ). But by (B.10.8), for any U, V with Var((U^T, V^T)^T) nonsingular,

Var(V) ≥ Cov(U, V) Var^{-1}(U) Cov(V, U).   (6.2.15)

Taking inverses of both sides yields

I^{-1}(θ) = Var_θ^{-1}(V) ≤ Σ(Ψ, θ).   (6.2.16)
Equality holds in (6.2.15) by (B.10.2.3) iff for some b = b(θ)

U = b + Cov(U, V) Var^{-1}(V) V   (6.2.17)

with probability 1. This means, in view of E_θ Ψ = E_θ Dl = 0, that

Ψ(X_1, θ) = b(θ) Dl(X_1, θ).

In the case of identity in (6.2.16) we must have

-[E_θ DΨ(X_1, θ)]^{-1} Ψ(X_1, θ) = I^{-1}(θ) Dl(X_1, θ).   (6.2.18)

Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds. □
We see that, by the theorem, the MLE is efficient in the sense that for any a_{p×1}, a^T θ̂_n has asymptotic bias o(n^{-1/2}) and asymptotic variance n^{-1} a^T I^{-1}(θ) a, which is no larger than that of any competing minimum contrast estimate. Further, any competitor θ̃_n such that a^T θ̃_n has the same asymptotic behavior as a^T θ̂_n for all a in fact agrees with θ̂_n to order n^{-1/2}.
A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6
on the asymptotic normality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example.

Example 6.2.1. The Linear Model with Stochastic Covariates. Let X_i = (Z_i^T, Y_i)^T, 1 ≤ i ≤ n, be i.i.d. as X = (Z^T, Y)^T where Z is a p × 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:
(i) Y = α + Z^T β + ε   (6.2.19)

where ε is distributed as N(0, σ²) independent of Z and E(Z) = 0. That is, given Z, Y has a N(α + Z^T β, σ²) distribution.

(ii) The distribution H_0 of Z is known with density h_0 and E(ZZ^T) is nonsingular.
The second assumption is unreasonable but easily dispensed with. It readily follows (Problem 6.2.6) that the MLE of β is given by (with probability 1)

β̂ = [Z_{(n)}^T Z_{(n)}]^{-1} Z_{(n)}^T Y.   (6.2.20)
Here Z_{(n)} is the n × p matrix ||Z_{ij} - Z̄_{·j}|| where Z̄_{·j} = n^{-1} Σ_{i=1}^n Z_{ij}. We use the subscript (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Z_{(n)} = (Z_1, ..., Z_n)^T is referred to as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of α and σ² are

α̂ = Ȳ - Σ_{j=1}^p Z̄_{·j} β̂_j,   σ̂² = (1/n) |Y - (α̂ + Z_{(n)} β̂)|².   (6.2.21)
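A sketch of (6.2.20)-(6.2.21) with simulated covariates (the data-generating numbers are illustrative assumptions): it centers the covariates, computes β̂, α̂, σ̂², and estimates E(ZZ^T) by n^{-1} Σ Z_iZ_i^T, which is used for the approximate confidence intervals discussed below.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
Z = rng.multivariate_normal(np.zeros(p), np.eye(p) + 0.3, size=n)   # illustrative covariates
beta_true, alpha_true, sigma = np.array([1.0, -0.5, 2.0]), 0.7, 1.5
Y = alpha_true + Z @ beta_true + rng.normal(0, sigma, size=n)

Zc = Z - Z.mean(axis=0)                                        # centered design matrix Z_(n)
beta_hat = np.linalg.solve(Zc.T @ Zc, Zc.T @ Y)                # (6.2.20)
alpha_hat = Y.mean() - Z.mean(axis=0) @ beta_hat               # (6.2.21)
sigma2_hat = ((Y - alpha_hat - Z @ beta_hat) ** 2).mean()      # (6.2.21)
EZZt_hat = Z.T @ Z / n                                         # estimate of E(ZZ^T)
se_beta = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(EZZt_hat)) / n)
print(beta_hat, se_beta)
```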
Note that although given Z_1, ..., Z_n, β̂ is Gaussian, this is not true of the marginal distribution of β̂. It is not hard to show that A0-A6 hold in this case because if H_0 has density h_0 and if θ denotes (α, β^T, σ²)^T, then
l(X, θ) = -(1/(2σ²))[Y - (α + Z^T β)]² - (1/2)(log σ² + log 2π) + log h_0(Z),   (6.2.22)

Dl(X, θ) = ( ε/σ², (ε/σ²) Z^T, (ε² - σ²)/(2σ⁴) )^T,   ε = Y - (α + Z^T β),

and

I(θ) = diag( σ^{-2}, σ^{-2} E(ZZ^T), (2σ⁴)^{-1} )   (6.2.23)
so that by Theorem 6.2.2
L(√n(α̂ - α, β̂ - β, σ̂² - σ²)) → N(0, diag(σ², σ²[E(ZZ^T)]^{-1}, 2σ⁴)).   (6.2.24)
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction that H_0 is known plays no role in the limiting result for α̂, β̂, σ̂². Of course, these will only be the MLEs if H_0 depends only on parameters other than (α, β, σ²). In this case we can estimate E(ZZ^T) by n^{-1} Σ_{i=1}^n Z_i Z_i^T and give approximate confidence intervals for β_j, j = 1, ..., p.

An interesting feature of (6.2.23) is that because I(θ) is a block diagonal matrix, so is I^{-1}(θ) and, consequently, β̂ and σ̂² are asymptotically independent. In the classical linear model of Section 6.1, where we perform inference conditionally given Z_i = z_i, 1 ≤ i ≤ n, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew σ², the MLE of β would still be β̂ and its asymptotic variance optimal for this model. If we knew α and β, σ̂² would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, σ̂² would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model P = {P_{(θ,η)} : θ ∈ Θ, η ∈ E}
we say we can estimate θ adaptively at η_0 if the asymptotic variance of the MLE θ̂ (or, more generally, an efficient estimate of θ) in the pair (θ̂, η̂) is the same as that of θ̂(η_0), the efficient estimate for P_{η_0} = {P_{(θ,η_0)} : θ ∈ Θ}. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular, consider estimating β_1 in the presence of α, (β_2, ..., β_p) with
(i) α, β_2, ..., β_p known.

(ii) α, β_2, ..., β_p arbitrary.

In case (i), we take, without loss of generality, α = β_2 = ... = β_p = 0. Let Z_i = (Z_{i1}, ..., Z_{ip})^T; then the efficient estimate in case (i) is
β̃_1 = Σ_{i=1}^n Z_{i1} Y_i / Σ_{i=1}^n Z_{i1}²   (6.2.25)
with asymptotic variance σ²[EZ_{11}²]^{-1}. On the other hand, β̂_1 is the first coordinate of β̂ given by (6.2.20). Its asymptotic variance is the (1,1) element of σ²[EZZ^T]^{-1}, which is strictly bigger than σ²[EZ_{11}²]^{-1} unless [EZZ^T]^{-1} is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate β_1 adaptively if β_2, ..., β_p are regarded as nuisance parameters. What is happening can be seen by a representation of [Z_{(n)}^T Z_{(n)}]^{-1} Z_{(n)}^T Y and I^{11}(θ), where I^{-1}(θ) = ||I^{ij}(θ)||. We claim that
β̂_1 = Σ_{i=1}^n (Z_{i1} - Ẑ_{i1}^{(1)}) Y_i / Σ_{i=1}^n (Z_{i1} - Ẑ_{i1}^{(1)})²   (6.2.26)
where Ẑ^{(1)} is the regression of (Z_{11}, ..., Z_{1n})^T on the linear space spanned by (Z_{j1}, ..., Z_{jn})^T, 2 ≤ j ≤ p. Similarly,

I^{11}(θ) = σ² / E(Z_{11} - Π(Z_{11} | Z_{21}, ..., Z_{p1}))²   (6.2.27)

where Π(Z_{11} | Z_{21}, ..., Z_{p1}) is the projection of Z_{11} on the linear span of Z_{21}, ..., Z_{p1} (Problem 6.2.11). Thus, Π(Z_{11} | Z_{21}, ..., Z_{p1}) = Σ_{j=2}^p a_j* Z_{j1}, where (a_2*, ..., a_p*) minimizes E(Z_{11} - Σ_{j=2}^p a_j Z_{j1})² over (a_2, ..., a_p) ∈ R^{p-1} (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing β_2, ..., β_p when the variables Z_2, ..., Z_p are in any way correlated with Z_1, and the price is measured by

[E(Z_{11} - Π(Z_{11} | Z_{21}, ..., Z_{p1}))² / E(Z_{11}²)]^{-1} = (1 - E(Π(Z_{11} | Z_{21}, ..., Z_{p1}))² / E(Z_{11}²))^{-1}.   (6.2.28)

In the extreme case of perfect collinearity the price is ∞, as it should be, because β_1 then becomes unidentifiable. Thus, adaptation corresponds to the case where (Z_2, ..., Z_p) have no value in predicting Z_1 linearly (see Section 1.4). Correspondingly, in the Gaussian linear model (6.1.3) conditional on the Z_i, i = 1, ..., n, β̂_1 is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if E(Z_{11} - Π(Z_{11} | Z_{21}, ..., Z_{p1}))² = 0. □
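The price (6.2.28) is the familiar variance-inflation factor 1/(1 - R²), where R² is the squared multiple correlation of Z_1 with Z_2, ..., Z_p. A quick numerical check under an assumed equicorrelated covariate covariance (the matrix below is purely illustrative):

```python
import numpy as np

p, rho = 4, 0.6
Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # assumed E(ZZ^T), with E(Z) = 0
# ratio of asymptotic variances: (1,1) element of Sigma^{-1} times Sigma[0,0]
price = np.linalg.inv(Sigma)[0, 0] * Sigma[0, 0]
# the same number via the projection form (6.2.28)
S12, S22 = Sigma[0, 1:], Sigma[1:, 1:]
explained = S12 @ np.linalg.solve(S22, S12)            # E[Pi(Z1 | Z2..Zp)^2]
print(price, 1.0 / (1.0 - explained / Sigma[0, 0]))    # the two agree
```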
Example 6.2.2. M-Estimates Generated by Linear Models with General Error Structure. Suppose that the ε_i in (6.2.19) are i.i.d. but not necessarily Gaussian, with density (1/σ) f_0(·/σ); for instance,

f_0(x) = e^x / (1 + e^x)²,

the logistic density. Such error densities often have the more realistic, heavier tails^(1) than the Gaussian density. The estimates β̂_0, σ̂_0 now solve

Σ_{i=1}^n ψ((Y_i - Z_i^T β̂_0)/σ̂_0) Z_i = 0

and
Σ_{i=1}^n χ((Y_i - Z_i^T β̂_0)/σ̂_0) = 0,

where ψ = -f_0'/f_0, χ(y) = -[y f_0'(y)/f_0(y) + 1], and β̂_0 = (β̂_{10}, ..., β̂_{p0})^T. The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if
(i) log f_0 is strictly concave, i.e., f_0'/f_0 is strictly decreasing.

(ii) (log f_0)'' exists and is bounded.

Then, if further f_0 is symmetric about 0,
I(θ) = σ^{-2} ( c_1 E(ZZ^T)   0
                0            c_2 )   (6.2.29)

where

c_1 = ∫ (f_0'(x)/f_0(x))² f_0(x) dx,   c_2 = ∫ (x f_0'(x)/f_0(x) + 1)² f_0(x) dx.
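The constants c_1 and c_2 rarely have closed forms but are easy to compute by numerical integration. A sketch for the logistic density above (the integration limits are a practical assumption; for the logistic, c_1 should come out as 1/3):

```python
import numpy as np
from scipy.integrate import quad

def f0(x):                        # logistic density e^x / (1 + e^x)^2
    return np.exp(x) / (1.0 + np.exp(x)) ** 2

def score(x):                     # f0'(x)/f0(x) = (1 - e^x)/(1 + e^x)
    return (1.0 - np.exp(x)) / (1.0 + np.exp(x))

c1, _ = quad(lambda x: score(x) ** 2 * f0(x), -50, 50)
c2, _ = quad(lambda x: (x * score(x) + 1.0) ** 2 * f0(x), -50, 50)
print(c1, c2)                     # c1 is approximately 1/3 for the logistic
```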
Thus, β̂_0, σ̂_0 are optimal estimates of β and σ in the sense of Theorem 6.2.2 if f_0 is true. Now suppose the f_0 generating the estimates β̂_0 and σ̂_0 is symmetric and satisfies (i) and (ii), but the true error distribution has density f possibly different from f_0. Under suitable conditions we can apply Theorem 6.2.1 with
Ψ(z, y, β, σ) = (ψ_1, ..., ψ_p, ψ_{p+1})^T(z, y, β, σ), where

ψ_j(z, y, β, σ) = ψ((y - Σ_{k=1}^p z_k β_k)/σ) z_j,   1 ≤ j ≤ p,
ψ_{p+1}(z, y, β, σ) = χ((y - Σ_{k=1}^p z_k β_k)/σ),   (6.2.30)

to conclude that

L(√n(β̂_0 - β_0)) → N(0, Σ(Ψ, P)),   L(√n(σ̂_0 - σ_0)) → N(0, σ²(P)),

where β_0, σ_0 solve

∫ Ψ(z, y, β_0, σ_0) dP(z, y) = 0   (6.2.31)
and Σ(Ψ, P) is as in (6.2.6). What is the relation between β_0, σ_0 and β, σ given in the Gaussian model (6.2.19)? If f_0 is symmetric about 0 and the solution of (6.2.31) is unique, then β_0 = β. But σ_0 = c(f_0)σ for some c(f_0) typically different from one. Thus, β̂_0 can be used for estimating β although, if the true distribution of the ε_i is N(0, σ²), it should perform less well than β̂. On the other hand, σ̂_0 is an estimate of σ only if normalized by a constant depending on f_0. (See Problem 6.2.5.) These are issues of robustness, that
is, to have a bounded sensitivity curve (Section 3.5, Problem 3.5.8), we may well wish to use a nonlinear bounded Ψ = (ψ_1, ..., ψ_p)^T to estimate β even though it is suboptimal when ε ~ N(0, σ²), and to use a suitably normalized version of σ̂_0 for the same purpose. One effective choice of ψ_j is the Huber function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. □
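A minimal sketch of such a bounded-ψ regression fit, solving Σ ψ((Y_i - Z_i^T β)/σ̂) Z_i = 0 by iteratively reweighted least squares. The Huber tuning constant, the MAD-based preliminary scale (used in place of jointly estimating σ), and the simulated data are all illustrative assumptions rather than the text's prescription.

```python
import numpy as np

def huber_regression(Z, Y, k=1.345, n_iter=50):
    """IRLS for a Huber-psi regression M-estimate; scale fixed at a preliminary MAD estimate."""
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]             # least squares start
    r = Y - Z @ beta
    sigma = np.median(np.abs(r - np.median(r))) / 0.6745    # MAD-based scale
    for _ in range(n_iter):
        u = (Y - Z @ beta) / sigma
        w = np.ones_like(u)
        big = np.abs(u) > k
        w[big] = k / np.abs(u[big])                         # Huber weights psi(u)/u
        WZ = Z * w[:, None]
        beta = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)          # weighted normal equations
    return beta

rng = np.random.default_rng(4)
Z = np.column_stack([np.ones(100), rng.normal(size=100)])
Y = Z @ np.array([1.0, 2.0]) + 0.5 * rng.standard_cauchy(100)   # heavy-tailed errors
print(huber_regression(Z, Y))
```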
2 The asymptotic theory we have developed pennits approximation to these constants by the procedure used in deriving (5. Wald tests (a generalization of pivots). Op). However. up to the proportionality constant f e .32) a. typically there is an attempt at "exact" calculation. Kadane.2.2..8 we obtain Theorem 6..3. A new major issue that arises is computation.19). . Summary. and Rao's tests. the equivalence of Bayesian and frequentist optimality asymptotically. These methods are beyond the scope of this volume but will be discussed briefly in Volume II.) and A6A8 hold then. the latter can pose a fonnidable problem if p > 2. because we then need to integrate out (03 . All of these will be developed in Section 6.5. O ). Again the two approaches differ at the second order when the prior begins to make a difference.19) (Laplace's method). we are interested in the posterior distribution of some of the parameters. ~ (6. The problem arises also when.2. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as H : fJ 1 < 0 versus K : fh > 0 where fJ 2 . fJ p vary freely. . as is usually the case. for fixed n.2.3 The Posterior Distribution in the Multiparameter Case The asymptotic theory of the posterior distribution parallels that in the onedimensional case exactly. say. The consequences of Theorem 6. 6...3. When the estimating equations defining the M estimates coincide with the likelihood .2. 1T(8) rr~ 1 P(Xi1 8).5. See Schervish (1995) for some of the relevant calculations.s.3. and interpret I . This approach is refined in Kass. in perfonnance and computationally. The three approaches coincide asymptotically but differ substantially. A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems. Ifthe multivariate versions of AOA3. A5(a.s. Although it is easy to write down the posterior density of 8. and Tierney (1989).5. .(t) n~ 1 p(Xi . Using multivariate expansions as in B. the likelihood ratio principle. We defined minimum contrast (Me) and M estimates in the case of pdimensional parameters and established their convergence in law to a nonnal distribution.Section 6. ife denotes the MLE. .I as the Euclidean nonn in conditions A7 and A8. Confidence regions that parallel the tests will also be developed in Section 6.3 are the same as those of Theorem 5.).2 Asymptotic Estimation Theory in p Dimensions 391 Testing and Confidence Bounds There are three principal approaches to testing hypotheses in multiparameter models. We have implicitly done this in the calculations leading up to (5. t)dt. We simply make 8 a vector. under PeforallB. . say (fJ 1. A4(a.s.
We present three procedures that are used frequently: likelihood ratio. In Section 6. ry E £} if the asymptotic distribution of In(() .4.3 • . Another example deals with M estimates based on estimating equations generated by linear models with nonGaussian error distribution. 9 E 8" 8 1 = 8 . I H I! ~. and showed that in several statistical models involving normal distributions.9) . '(x) = sup{p(x. under P. this result gives the asymptotic distribution of the MLE. We use an example to introduce the concept of adaptation in which an estimate f) is called adaptive for a model {PO.3. In these cases exact methods are typically not available. 170 specified. LARGE SAMPLE TESTS AND CONFIDENCE REGIONS ~ 0) converges a.8 0 . 1 '~ J . In such . confidence regions. However. In linear regression. and other methods of inference. PO'  . Finally we show that in the Bayesian framework where given 0. .•• . I j I I . .B} has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of () in the family of models {PO.1 .I . These were treated for f) real in Section 5. In this section we will use the results of Section 6.1 we considered the likelihood ratio test statistic. the exact critical value is not available analytically. covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to he appropriate approximations.4 to vectorvalued parameters. I . 1 . X II . and confidence procedures.4.f/ : () E 8.". to the N(O. . 9 E 8 0 } for testing H : 9 E 8 0 versus K .392 Inference in the Multiparameter Case Chapter 6 .I .. and we tum to asymptotic approximations to construct tests. I . Wald and Rao large sample tests. A(X) simplified and produced intuitive tests whose critical values can be obtained from the Student t and :F distributions.9). adaptive estimation of 131 is possible iff Zl is uncorre1ated with every linear function of Z2) .d.9 and 6.110 : () E 8}..6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. as in the linear model. "' ~ 6.denotes the MLE for PO' then the posterior distribution of yn(O if 0 r' (9)) distribution.2 to extend some of the results of Section 5.s.4. We shall show (see Section 6. Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic i. . X n are i.i. . We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MD or AIestimate if we know the correct model P = {Po : BEe} and use the MLE for this model. 1 Zp. 9 E 8} sup{p(x.1 we developed exact tests and confidence regions that are appropriate in re~ gression and anaysis of variance (ANOVA) situations when the responses are normally distributed. equations. However. 6. we need methods for situations in which. in many experimental situations in which the likelihood ratio test can be used to address important questions. ! I In Sections 4.
The Gaussian Linear Model with Known Variance. Q > 0.. Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations..1 exp{ iJx )/qQ). 0 We iHustrate the remarkable fact that X~q holds as an approximation to the null distribution of 2 log A quite generally when the hypothesis is a nice qdimensional submanifold of an rdimensional parameter space with the following. Moreover. . ..1. Here is an example in which Wilks's approximation to £('\(X)) is useful: Example 6.3.. ~ X.3 Large Sample Tests and Confidence Regions 393 cases we can turn to an approximation to the distribution of .n.. such that VI.3. 0 It remains to find the critical value.i = 1. which is usually referred to as Wilks's theorem or approximation.n.2 we showed that the MLE. In this u 2 known example. 1.3. . .5) to be iJo = l/x and p(x. 2 log .\(Y) = L X. where A nxn is an orthogonal matrix with rows vI. As in Section 6. iJ > O. . Thus. 0) ~ iJ"x·. exists and in Example 2. 0).1..\(x) is available as p(x. The MLE of f3 under H is readily seen from (2. Suppose XL X 2 .d.. . iJo) is the denominator of the likelihood ratio statistic.3. q < r.randXi = Uduo. when testing whether a parameter vector is restricted to an open subset of Rq or R r .q' i=q+1 r Wilks's theorem states that. =6r = O. SetBi = TU/uo. x ~ > 0. distribution with density p(X. iJ)..1. i = r+ 1.tn)T is a member of a qdimensional linear subspace of woo versus the alternative that J. .3 we test whether J.2 we showed how to find B as a nonexplicit solution of likelihood equations. as X where X has the gamma.Wo where w is an rdimensional linear subspace of Rn and W ::> Wo. •. the hypothesis H is equivalent to H: 6 q + I = .tl. Y n be nn. . 0) = IT~ 1 p(Xi. . . qQ. V q span Wo and VI. under regularity conditions.3.l...i = 1. . i = 1.... 0 = (iX. .. = (J. Let YI . Suppose we want to test H : Q = 1 (exponential distribution) versus K : a i.4. Then Xi rvN(B i .Section 6.1. . Wilks's approximation is exact. we conclude that under H.'··' J. E W .v'[.\(Y)). the X~_q distribution is an approximation to £(21og . r and Xi rv N(O. We next give an example that can be viewed as the limiting situation for which the approximation is exact: independent with Yi rv N(Pi.£ X~" for degrees of freedom d to be specified later.Xn are i.i. . 1). ~ In Example 2. Using Section 6.2...."" V r spanw.l. This is not available analytically.1). and we transform to canonical form by setting Example 6. ~ ~ ~ ~ ~ The approximation we shall give is based on the result "2 log '\(X) . . . <1~) where {To is known.\(X) based on asymptotic theory. the numerator of . iJ).
21og>. 0 Consider the general i. and conclude that Vn S X. If Yi are as in = (IL..(Y) defined in Remark ° 6. under H .(Jo) ".d. ranges over a q + Idimensional manifold. I .3. c = and conclude that 210g . and (J E c W.2. ..1.7 to Vn = I:~_q+lXl/nlI:~ r+IX.394 Inference in the Multiparameter Case Chapter 6 1 .3.. By Theorem 6. j . Write the log likelihood as e In«(J) ~ L i=I n logp(X i . 0). I( I In((J) ~ n 1 n L i=l & & &" &"...q also in the 0. where DO is the derivative with respect to e. Suppose the assumptions of Theorem 6.2) V ~ N(o. .(In[ < l(Jn .3. : . Eln«(Jo) ~ I«(Jo).\(Y) £ X. Then. an = n. (72) ranges over an r + Idimensional Example 6.~ L H . In Section 6. o .Onl + IOn  (Jol < 210n .3. . "'" T.i=q+l 'C'" I=r+l Xi' ) X. i\ I . ". I I i Example 6.3 and AA that that In((J~) "'" 2[ln((Jn) In((Jo)] £r ~ V I((Jo)V.(Jol.· . N(o. Here ~ ~ j.2 but (T2 is unknown then manifold whereas under H. 2 log >'(Y) = Vn ".2 with 9(1) = log(l + I). we derived e e 2log >'(Y) ~ n log ( 1 + 2::" L.6. Apply Example 5.1) " I.i. Proof Because On solves the likelihood equation DOln(O) = 0.2. I(J~ . Note that for J. Because .(Jo) for some (J~ with [(J~ .. we can conclude arguing from A.(Jol < I(J~ . (6. ! . (J = (Jo. y'ri(On .3.1 ((Jo)). . I.(X) ~ 1 .3.. an expansion of lnUI) about en evaluated at 8 = 0 0 gives 2[ln«(Jn) In((Jo)] = n«(Jn .2. .1. = 2[ln«(Jn) In((Jo)] ~ ~ ~ £ X. ·'I1 . The result follows because.3. The Gaussian Linear Model with Unknown Variance.3. hy Corollary B. B)..3. . .(Jo) In«(Jn)«(Jn . .(J) Uk uJ rxr c'· • 1 . X.2 unknown case._q as well. where Irxr«(J) is the Fisher information matrix.. !. case with Xl. I1«(Jo))." ..2.2.2 are satisfied.q' Finally apply Lemma 5. where x E Xc R'. . . Hence...logp(X...Xn a sample from p(x.1.. V T I«(Jo)V ~ X.(Jol. "'" (6. 1 • ! We tirst consider the simple hypothesis H : 6 = 8 0 _ Theorem 6.
0(1) = (8 1" ".".80.0:'. 0(2) = (8. where x r (1.n) /n(Oo)l.3. distribution. Then under H : 0 E 8 0.a)) (6..35) where 8(00) = n 1/2 L Dl(X" 0) i=1 n and 8 = (8 1 . SupposetharOo istheMLEofO underH and that 0 0 satisfiesA6for Po.+l. ~ {Oo : 2[/ n (On) /n(OO)] < x.3.3 Large Sample Tests and Confidence Regions 395 = As a consequence of the theorem. T (1) (2) Proof.3.10) and (6.3..8q ).1) applied to On and the corresponding argument applied to 8 0 . By (6.3. for given true 0 0 in 8 0 • '7=M(OOo) where.(Ia). j ~ q + 1.n and (6. (JO. .1 and 6.8.2 hold for p(x. 8 2)T where 8 1 is the first q coordinates of8.81(00).. Let 00.3. dropping the dependence on 6 0 • M ~ PI 1 / 2 (6.n = (80 .2.8.4).q.3) is a confidence region for 0 with approximat~ coverage probability 1 . 10 (0 0 ) = VarO.a.j...T. We set d ~ r . ~(11 ~ (6.0) is the 1 and C!' quantile of the X. Examples 6. and {8 0 .' ".(9 0 .6) . Suppose that the assumptions of Theorem 6. has approximately level 1.)1'..Section 6. where e is open and 8 0 is the set of 0 E e with 8j = 80 . Furthennore.2 illustrate such 8 0 . 0).2.0 0 ). 0b2) = (80"+1". (6. 0 E e. Make a change of parameter.)T.2[1. Let Po be the model {Po: 0 E eo} with corresponding parametn"zation 0(1) = ~(1) (1) ~(1) (8 1.3..0(2». Let 0 0 E 8 0 and write 210g >'(X) ~ = 2[1n(9 n )  In(Oo)] .4) It is easy to see that AOA6 for 10 imply AOA5 for Po. .j} are specified values.)T.(] . Next we tum to the more general hypothesis H : 8 E 8 0 . Theorem 6.(X) 8 0 when > x..2. the test that rejects H : () 2Iog>.3. OT ~ (0(1).
(6. Me : 1]q+l = TJr = a}. Note that.3.3. This asymptotic level 1 .I '1): 11 0 + M.II).LTl(o) + op(l) i=l i=l L i=q+l r Ti2 (0) + op(l).• I . then by 2 log "(X) TT(O)T(O) . if A o _ {(} .3. .6) and (6.In( 00 .all (6. = 0.3. with ell _ ...1'1 '1 E eo} and from (B.Tf(O)T .13) D'1I{x. Moreover..(1 .. ~ 'I. rejecting if .00 : e E 8 0} MAD = {'1: '/.10) LTl(o) . • 1 y{X) = sup{P{x..+I> .1 '1ll/sup{p(x. IIr ) : 2[ln(lI n ) .1 '1)..8 0 : () E 8}.5) to "(X) we obtain.(X) > x r .a) is an asymptotically level a test of H : 8 E Of equal importance is that we obtain an asymptotic confidence region for (e q+ 1. 1e ). • . 18q acting as r nuisance parameters.1I0 + M. which has a limiting X~q distribution by Slutsky's theorem because T(O) has a limiting Nr(O.3..7). (O) r • + op(l) (6.. Such P exists by the argument given in Example 6.3. Tbe result follows from (6. .. .+1 ~ . is invariant under reparametrization . Or)J < xr_. I I eo~ ~ ~ {(O.(X) = y{X) where ! 1 1 ..3. 0 Note that this argument is simply an asymptotic version of the one given in Example 6... ••. under the conditions of Theorem 6.l. because in tenus of 7]. a piece of 8.+ 1 .1I 0 + M.. ).3. . We deduce from (6.1I0 + M. J) distribution by (6.9).1 '1) = [M. .2.396 Inference in the Multiparameter Case Chapter 6 and P is an orthogonal matrix such that.2.1 because /1/2 Ao is the intersection of a q dirnensionallinear subspace of R r with JI/2{8 . His {fJ E applying (6.3.0: confidence region is i .8) that if T('1) then = 1 2 n. Thus.(l .1ITD III(x./ L i=I n D'1I{X" liD + M.9) . = .1/ 2 p = J.8. Now write De for differentiation with respect to (J and Dry for differentiation with respect to TJ.0.2.3. by definition.. : • Var T(O) = pT r ' /2 II.l.. '1 E Me}.1> .00 .11) .
Suppose the assumptions of Theorem 6.i. Wilks's theorem depends critically On the fact that nOt only is open but that if giveu in (6. 2' 2 + X 2 )) 2 and 210g A(X) ~ xf but if 81 + 82 < 1 clearly 2 log A(X) ~ op(I). Theorem 6. such that Dg(O) exists and is of rauk r .8 0 E Wo where Wo is a linear space of dimension q are also covered. B2 .. The formulation of Theorem 6... "'0 ~ {O : OT Vj = 0. t. More complicated linear hypotheses such as H : 6 . .13).3.e:::S:::. if 0 0 is true.g. l B. v r span " R T then.3.. 9j .. The esseutial idea is that.. themselves depending on 8 q + 1 . will appear in the next section. As an example ofwhatcau go wrong.3. let (XiI.2 and the previously conditions on g. It can be extended as follows. 8q)T : 0 E 8} is opeu iu Rq. R. which require the following general theorem. N(B 1 ..3.cTe:::.2)(6. 8. Suppose H is specified by: There exist d functions.( 0 ) ~ o} (6... .o''C6::.3.5 and 6. .2 falls under this schema with 9.2 is still inadequate for most applications. J).2 to this situation is easy and given in Problem 6.. ..."de"'::'"e:::R::e"g.6.13) then the set {(8" . We only need note that if WQ is a linear space spanned by an orthogonal basis v\.:::m.3).3..80.p:::'e:.d..3.::. Suppose the MLE hold (JO..3. . .o:::"'' 397 CC where 00.14) a hypothesis of the fonn (6.3.12) The extension of Theorem 6. 2 e eo Ii o ~ (Xl + X 2) 2 + ~ 1 _ (X. if A(X) is ~ the likelihood ratio statistic for H : 0 E 8 0 given in (6..::Se::. X. More sophisticated examples are given in Problems 6. .q at all 0 E 8..3.13)..) be i. If 81 + B2 = 1.l:::.. Vl' are orthogonal towo and v" ... A(X) behaves asymptotically like a test for H : 0 E 8 00 where 800 = {O E 8: Dg(Oo)(O .'! are the MLEs..(0) ~ 8j .3. Examples such as testing for independence in contingency tables.13) Evideutly... q + 1 < j < r written as a vector g.. Then. (6. The proof is sketched iu Problems (6. Here the dimension of 8 0 and 8 is the same but the boundary of 8 0 has lower dimension.3.'':::':::"::d_C:::o:::. X~q under H.3. q + 1 <j < r}....3. Theorem 6..3.3.2. where J is the 2 x 2 identity matrix aud 8 0 = {O : B1 + O < I}.3. .. 210g A(X) ".:::f. (6. q + 1 < j < r. of f)l" .. 8q o assuming that 8q + 1 . We need both properties because we need to analyze both the numerator and denominator of A(X). " ' J v q and Vq+l " .n under H is consistent for all () E 8 0.8r are known. Defiue H : 0 E 8 0 with e 80 = {O E 8 : g(O) = o}.3.t.
~ D 2ln (IJ n ). ~ ~ 00. It and I(8 n ) also have the advantage that the confidence region One generates {6 : Wn (6) < xp(l .2. r 1 (1J)) where. . Y .3.16) n(9 n _IJ)TI(9 n )(9 n IJ) !:.rl(IJ» asn ~ p ~ L 00.3.9).0.16). 0 ""11 F . hence.2. I .15) and (6. I22(lJo)) if 1J 0 E eo holds.X. the Hessian (problem 6. for instance. (6.10).1.2.q' ~(2) (6.fii(1J IJ) ~N(O.3. 1(8) continuous implies that [1(8) is continuous and. More generally /(9 0 ) can be replaced by any consistent estimate of I( 1J0 ).lJ n ) where IJ n statistic as ~(1) ~(2) ~(1) = (0" . It follows that the Wald test that rejects H : (J = 6 0 in favor of K : () • i= 00 when . it follows from Proposition B.) and IJ n ~ ~ ~(2) ~ (Oq+" .3.Nr(O. X. .3. according to Corollary B.6. i favored because it is usually computed automatically with the MLE.18) i 1 Proof. respectively... . j Wn(IJ~2)) !:.. I 22 (9 n ) is replaceable by any consistent estimate of J22( 8). i .2 hold.7. .2. Then . But by Theorem 6.3. I. (6. The last Hessian choice is ~ ~ . For the more general hypothesis H : (} E 8 0 we write the MLE for 8 E as = ~ ! I: (IJ n . y T I(IJ)Y . ~ Theorem 6. Under the conditions afTheorem 6. has asymptotic level Q.17) ~ ~ e ~ en 1 • I I Wn(IJ~2)) ~ n(9~2) 1J~2)l[I"(9nJrl(9~) _1J~2») where I 22 (IJ) is the lower diagonal block of II (IJ) written as I2 I 1 _ (Ill(lJ) (IJ) I21(IJ) I (1J)) I22(IJ) with diagonal blocks of dimension q x q and d x d.Or) and define the Wald (6.. . More generally. the lower diagonal block of the inverse of .2.~ D 2 1n ( IJ n).2.2 Wald's and Rao's Large Sample Tests The Wald Test Suppose that the assumptions of Theorem 6.a)} is an ellipsoid in W easily interpretable and computablesee (6. ..l(a) that I(lJ n ) ~ I(IJ) asn By Slutsky's theorem B.398 Inference in the Multiparameter Case Chapter 6 6.: b.4.7.15) Because I( IJ) is continuous in IJ (Problem 6.3.2. (6.3.3.~ D 2l n (1J 0 ) or I (IJ n ) or .31). [22 is continuous. theorem completes the proof.. y T I(IJ)Y. in particular .fii(lJn (2) L 1J 0 ) ~ Nd(O.3. Slutsky's it I' I' . If H is true..
= 2 log A(X) + op(l) (6.' (0 0 )112 (00) (6. (Problem 6. in practice they can be very different. the asymptotic variance of .n) ~  where :E is a consistent estimate of E( (}o). the two tests are equivalent asymptotically.. and so on. under H.1/ 2 D 2 In (0) where D 1 l n represents the q x 1 gradient with respect to the first q coordinates and D 2 l n the d x 1 gradient with respect to the last d. 1(0 0 )) where 1/J n = n .20) vnt/Jn(OO) !. (6. ~ B ) T given by {8(2) : r W n (0(2)) < x r _ q (1 These regions are ellipsoids in R d Although.n) 2  + D21 ln (00.nlE ~ (2) _ T  . It can be shown that (Problem 6.3.. 112 is the upper right q x d block. by the central limit theorem. is.21) where III is the upper left q x q block of the r x r infonnation matrix I (80 ).Section 6. requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests.Q).19) : (J c: 80.. The Rao test is based on the statistic Rn(Oo ) ~n'1'n(Oo.3. which rejects iff HIn (9b ») > x 1_ q (1 .9. N(O. therefore.9) under AOA6 and consistency of (JO.8) E(Oo) = 1. The Rao Score Test For the simple hypothesis H that. This test has the advantage that it can be carried out without computing the MLE. Rn(Oo) = nt/J?:(Oo)r' (OO)t/Jn(OO) !.22) ..nl]  2  1 . What is not as evident is that.:El ((}o) is ~ n 1 [D.3.\(X) is the LR statistic for H Thus. The test that rejects H when R n ( (}o) > x r (1..ln(8 0 . as n _ CXJ.3. as (6.3... the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n.. a consistent estimate of .n) under n H. The argument is sketched in Problem 6.3..n] D"ln(Oo.n under H. ~1 '1'n(OO. Furthermore..19) indicates.ln(Oo. It follows from this and Corollary B.3. X.I Dln ((Jo) is the likelihood score vector. The Wald test leads to the Wald confidence regions for (B q +] . 2 Wn(Oo ) ~(2) where .n)[D.2 that under H. Rao's SCOre test is based on the observation (6.n W (80 . The extension of the Rao test to H : (J E runs as follows.(00 )  121 (0 0 )/.0:) is called the Rao score test. and the convergence Rn((}o) ~ X. (J = 8 0 .. asymptotically level Q. Let eo '1'n(O) = n.3. . un.6.3 Large Sample Tests and Confidence Regions 399 The Wold test..
400 Inference in the Multiparameter Case Chapter 6 where D~ is the d x d matrix of second partials of ill with respect to matrix of mixed second partials with respect to e(l). 2 log A(X) has an asymptotic X. Finally.19) holds under On and that the power behavior is unaffected and applies to all three tests. Summary. The analysis for 8 0 = {Oo} is relatively easy.5. On the other hand.On) + t.2 but with Rn(lJb 2 1 1 » !:. which measures the distance between the hypothesized value of 8(2) and its MLE. it shares the disadvantage of the Wald test that matrices need to be computed and inverted. For instance. .) .0 0 ) I(On)(On . called the Wald statistic. D 21 the d x d Theorem 6. we introduced the Rao score test. We also considered a quadratic fonn. X. and so on. and Wald Tests It is possible as in the onedimensional case to derive the asymptotic power for these tests for alternatives of the form On = ~ eo + ~ where 8 0 E 8 0 .2. where X~ (')'2) is the noncentral chi square distribution with m degrees of freedom and noncentrality parameter "'? It may he shown that the equivalence (6. for the Wald test neOn . .q distribution under H.3.3. X~' 2 j > xd(1 .a)} and • The Rao large sample critical and confidence regions are {R n (Ob » {0(2) : Rn (0(2) < xd(l. We established Wilks's theorem.. . . .0 0 ) = T'" "" (vn(On . I . under regularity conditions. 0(2). Power Behavior of the LR. e(l). Rao. The asymptotic distribution of this quadratic fonn is also q• I e X: X: 6.. b .8 0 where is an open subset of R: and 8 0 is the collection of 8 E 8 with the last r ~ q coordinates 0(2) specified.)I(On)( vn(On  On) + A) !:. which stales that if A(X) is the LR statistic. Under H : (J E A6 required only for Po eo and the conditions ADAS a/Theorem 6. In particular we shall discuss problems of . ._q(A T I(Oo)t. We considered the problem of testing H : 8 E 80 versus K : 8 E 8 . then. I I . Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score testssee Rao (1973) for more on this.a)}.. . and showed that this quadratic fonn has limiting distribution q .4 LARGE SAMPLE METHODS FOR DISCRETE DATA In this section we give a number of important applications of the general methods we have developed to inference for discrete data. The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under H. which is based on a quadratic fonn in the gradient of the log likelihood. .
treated in more detail in Section 6.nOo.3. L(9. k. = L:~ 1 1{Xi = j}.. Because ()k = 1 . where N. j=l . # j. k1. we consider the parameter 9 = (()l.2... log(N.]2100.1 GoodnessofFit in a Multinomial Model. .2.lnOo.d.1. trials in which Xi = j if the ith trial prcx:luces a result in the jth category. For i. j=1 To find the Wald test.." . In. 2. Let ()j = P(Xi = j) be the probability of the jth category. + n LL(Bi . . j=l i=1 The second term on the right is kl 2 n Thus.)/OOk.8. we need the information matrix I = we find using (2.2..7.4. ~ N. I ij [ . . and 2.32) thaI IIIij II.()k_l)T and test the hypothesis H : ()j = ()OJ for specified (JOj.OOi)(B. .8 we found the MLE 0. In Example 2. or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory. j = 1. j = 1. .33) and (3. ..6. 6.i.. . . j=1 Do.00k)2lOOk. Thus. with ()Ok =  1 1 E7 : kl j=l ()OJ. Ok if i if i = j.E~. j = 1.Section 6. k Wn(Oo) = L(N. Ok Thus.+ ] 1 1 0. .4 Large Sample Methods for Discret_e_D_a_ta ~_ _ 401 goodnessoffit and special cases oflog linear and generalized linear models (GLM).5.4..)2InOo. consider i.0).OO. ()j.) /OOk = n(Bk . It follows that the large sample LR rejection region is ~ k 2IogA(X) = 2 LN.00.) > Xkl(1. Pearson's X2 Test As in Examples 1. the Wald statistic is klkl Wn(Oo) = n LfB. we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution.3. k .
2.2). we could invert] or note that by (6. 80j 80k 1 and expand the square keeping the square brackets intact.4. note that from Example 2.. It is easily remembered as 2 = SUM (Observed .Expected)2 Expected (6. the Rao statistic is . . The second term on the right is n [ j=1 (!.11). .2) .1S.80k).. by A..Nk_d T and.~ 8 8 0j 0k  2 80j ) ( n LL _ _ L kl kl kl ( J=1 z=1 ~ 8_Z 80i ~) 8k 80k (~ 8 _J 80J ~) ) 8k .2.4..80j ) = (Ok . ~ . . 80i 80j 80k (6.~) 8 80j 80 k ~ ~ ~ ] 2 0j = n [~ 80 k ~ 1] 2 To simplify the first term on the right of (6. Then.4. The general form (6. n ( L kl ( j=1 ~ ~) !.8... we write ~ ~  8j 80j .4.L . with To find ]1. ]1(9) = ~ = Var(N). ~ = Ilaijll(kl)X(kl) with Thus. To derive the Rao test..402 Inference in the Multiparameter Case Chapter 6 The term on the right is called Pearson's chisquare (X 2 ) statistic and is the statistic that is typically used for this multinomial testing problem.1) X where the sum is over categories and "expected" refers to the expected frequency E H (Nj ).1) of Pearson's X2 will reappear in other multinomial applications in this section. j=1 . 8k 80 k = {[8 0k (8 j  80j )]  [8 0j (8 k ~  80k ) ] } .. because kl L(Oj . where N = (N1 .13.L .. . '" .
25 + (2. (2) wrinkled yellow.25)2 312.4). which is a onedimensional curve in the twodimensional parameter space 8. Example 6. this value may be too small! See Note 1. Contingency Tables Suppose N = (NI . However. If we assume the seeds are produced independently. n2 = 101.fh. LOi=l}. 7. fh = 03 = 3/16.. Mendel observed nl = 315. (3) round green. distribution. Then. • 6. in the HardyWeinberg model (Example 2. 04 = 1/16. n3 = 108. In experiments on pea breeding. 8). which has a pvalue of 0. and we want to test whether the distribution of types in the n = 556 trials he performed (seeds he observed) is consistent with his theory.25)2 104. n040 = 34.75. 2. nOlO = 312. Testing a Genetic Theory.1. Here testing the adequacy of the HardyWeinberg model means testing H : 8 E 8 0 versus K : 8 E . j. . 04.75)2 104.1. k = 4 X 2 = (2.1) dimensional parameter space k 8={8:0i~0. M(n.75 + (3.k.25.25 + (3.9 when referred to a X~ table.2) becomes It follows that the Rao statistic equals Pearson 's X2. n020 = n030 = 104. where 8 0 is a composite "smooth" subset of the (k . For comparison 210g A = 0. Nk)T has a multinomial.48 in this case.75)2 = 04 34. .4. Mendel's theory predicted that 01 = 9/16. Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds.75 .4.2 GoodnessofFit to Composite Multinomial Models. We will investigate how to test H : 0 E 8 0 versus K : 0 ¢:. i=1 For example.i::. 3. Possible types of progeny were: (1) round yellow. we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1.: . l::.75. and (4) wrinkled green. There is insufficient evidence to reject Mendel's hypothesis.Section 6. 4 as above and associated probabilities of occurrence fh. 8 0 . n4 = 32.. 02.4.4 Large Sample Methods for Discrete Data 403 the first term on the right of (6.
. The algebra showing Rn(8 0 ) = X2 in Section 6.. under H.2~(1 . . We suppose that we can describe 8 0 parametrically as where 11 = ('r/1. .3. is a subset of qdimensional space. . The Rao statistic is also invariant under reparametrization and. by the algebra of Section 6.1. However. ..4. £ is open.~) and test H : e~ = O. i=l t n. j = q + 1. That is. To apply the results of Section 6.q' Moreover.4. l~j~q.(t)~ei(11)=O. The Wald statistic is only asymptotically invariant under reparametrization..4.." For instance.3 and conclude that. i=l k If 11 ~ e( 11) is differentiable in each coordinate... j = 1. ek(11))T takes £ into 8 0 . the Wald statistic based on the parametrization 8 (11) obtained by replacing e by e (ij). .. 8 0 • Let p( n1. 1 ~ j ~ q or k (6. we obtain the Rao statistic for the composite multinomial hypothesis by replacing eOj in (6.X approximately has a X.4. .2) by ej(ij).. approximately X.nk.ne j (ij)]2 = 2 ne. nk.. nk) = L ndlog(ni/n) log ei(ij)]. To avoid trivialities we assume q < k . then it must solve the likelihood equation for the model. .4.. . .. 8(11)) : 11 E £}.1). If a maximizing value. we define ej = 9j (8).. If £ is not open sometimes the closure of £ will contain a solution of (6. 8(11)) for 11 E £. . Consider the likelihood ratio test for H : e E 8 0 versus K : e 1:. . Then we can conclude from Theorem 6..l now leads to the Rao statistic R (8("") n 11 =~ ~ j=l [Ni . ij satisfies l a a'r/j logp(nl. which will be pursued further later in this section.3 that 2 log .."" 'r/q) T.q) exists. .3) Le.. {p(.3). . Maximizing p(n1. the log likelihood ratio is given by log 'x(nb . .r. ij = (iiI. nk.8 0 . Other examples. e~ = e2 . nk. also equal to Pearson's X2 • .4. to test the HardyWeinberg model we set e~ = e1 .. 8) for 8 E 8 0 is the same as maximizing p( n1. ... e~) T ranges over an open subset of Rq and ej = eOj .1.. 8(11)) = 0. .. and ij exists. 8) denote the frequency function of N. where 9j is chosen so that H becomes equivalent to "( e~ . a 11 'r/J . thus. r.q distribution for large n. .. and the map 11 ~ (e 1(11). involve restrictions on the e obtained by specifying independence assumptions on i classifications of cases into different categories. 2 log .404 Inference in the Multiparameter Case Chapter 6 8 1 where 8 1 = 8 . .r for specified eOj . j Oj is. .("") X J 11 I where the righthand side is Pearson's X2 as defined in general by (6.X approximately has a xi distribution under H.
We found in Example 2. we obtain critical values from the X~ tables.4. (}4) distribution.2. A linkage model (Fisher. An individual either is or is not inoculated against a disease. (3) starchywhite. .a) with if (2nl + n2)/2n. To study the relation between the two characteristics we take a random sample of size n from the population.4.TJ). ij {}ij = (Bil + Bi2 ) (Blj + B2j ). A selfcrossing of maize heterozygous on two characteristics (starchy versus sugary. Independent classification then means that the events [being an A] and [being a B] are independent or in terms of the B . T() test the validity of the linkage model we would take 8 0 {G(2 + TJ). We often want to know whether such characteristics are linked or are independent.Section 6. do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic A and A and of the second Band B.3) becomes i(1 °:.1).4. (2) sugarygreen. If Ni is the number of offspring of type i among a total of n offspring. HardyWeinberg. The only root of this equation in [0. 301).4) which reduces to a quadratic equation in if. i(1 .4.4. H is rejected if X2 2:: Xl (1 .4. The Fisher Linkage Model. AB. 9(ij) ((2nl + n2) 2n 2n 2. and so on.. TJ). 1958. {}22. 1} a "onedimensional curve" of the threedimensional parameter space 8.5. is or is not a smoker.. (2nl + n2)(2 3 + n2) .• . Then a randomly selected individual from the population can be one of four types AB. (4) starchygreen. then (Nb . Because q = 1. Denote the probabilities of these types by {}n. 0 Testing Independence of Classifications in Contingency Tables Many important characteristics have only two categories. k 4. respectively. green base leaf versus white base leaf) leads to four possible offspring types: (1) sugarywhite. AB. (h. . p. N 4 ) has a M(n. The likelihood equation (6. (6. iTJ) : TJ :. nl (2+TJ) (n2 + n3) n4 (1TJ) +:ry 0. AB. 1] is the desired estimate (see Problem 6. . ( 2n3 + n2) 2) T 2n 2n o Example 6.4 Methods for Discrete Data 405 Example 6. {}21.. is male or female. {}12. The results are assembled in what is called a 2 x 2 contingency table such as the one shown.6 that Thus. For instance. specifies that where TJ is an unknown number between 0 and 1. .
which vary freely. N 22 )T.4.~ + n12) iiI (nll + n2d (nll i72 (n21 + n22) (1 . 0 ::. the likelihood equations (6. 'r/2 to indicate that these are parameters. 'r/2 ::. 021 . This suggests that X2 may be written as the square of a single (approximately) standard normal variable. q = 2. if N = (Nll' N 12 . Then. 'r/1 (1 . For () E 8 0 . (6. for example N12 is the number of sampled individuals who fall in category A of the first characteristic and category B of the second characteristic. X2 has approximately a X~ distribution.2). 'r/2 (1 .'r/1). 1. because k = 4. 0 12 .5) whose solutions are Til 'r/2 = (n11+ n 12)/n (nll + n21)/n. 'r/1 ::. C j = N 1j + N 2j is the jth column sum. These solutions are the maximum likelihood estimates.4.4. Pearson's statistic is then easily seen to be (6.3) become .4. (1 .RiCj/n) are all the same in absolute value and. where 8 0 is a twodimensional subset of 8 given by 8 0 = {( 'r/1 'r/2. 0 11 .6) the proportions of individuals of type A and type B. we have N rv M(n. N 21 . We test the hypothesis H : () E 8 0 versus K : 0 f{.7) where Ri = Nil + Ni2 is the ith row sum.'r/2).4. respectively. In fact (Problem 6.Tid (n12 + n22) (1 . Here we have relabeled 0 11 + 0 12 .'r/2)) : 0 ::. I}.'r/1) (1 . 8 0 .406 Inference in the Multiparameter Case Chapter 6 l A 11 The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column. By our theory if H is true.Ti2) (6. ( 22 ). the (Nij .011 + 021 as 'r/1. Thus. where z tt [R~jl1 2=1 J=l .
if X2 measures deviations from independence. a. Z indicates what directions these deviations take. j :::.4. hair color). Positive values of Z indicate that A and B are positively associated (i.4 Large Sample Methods for Discrete Data 407 An important altemative form for Z is given by (6... (Jij TJil The hypothesis that the characteristics are assigned independently becomes H : TJil TJj2 for 1 :::. TJj2 are nonnegative and 2:~=1 = 2:~=1 TJj2 = 1. . if and only if. eye color. b). 1 Nu 2 N12 .. 1 :::. it is reasonable to use the test that rejects.4.. Next we consider contingency tables for two nonnumerical characteristics having a and b states.. and only if. j = 1.. The X2 test is equivalent to rejecting (twosidedly) if. j :::. j :::.. . b where the TJil. If we take a sample of size n from a population and classify them according to each characteristic we obtain a vector N ij . Nab Cb Ra n . respectively... peA I B)) versus K : peA I B) > peA I B). 1 :::. peA I B) = peA I B).a) as a level a onesided test of H : peA I B) = peA I B) (or peA I B) :::.Section 6. B.8) Thus. . If (Jij = P[A randomly selected individual is of type i for 1 and j for 2]..e.. . B to denote the event that a randomly selected individual has characteristic A..3) that if A and B are independent. a. then Z is approximately distributed as N(O.. b ~ 2 (e. 1 :::. . Z ~ z(l.g. Thus. b NIb Rl a Nal C1 C2 . B. i :::. a.4. a.. Therefore. The N ij can be arranged in a a x b contingency table. . (Jij : 1 :::.peA I B)] [~(B) ~(~)ll/2 peA) peA) where P is the empirical distribution and where we use A. B. i = 1. then {Nij : 1 :::. b} "J M(n. Z = v'n[P(A I B) . . that A is more likely to occur in the presence of B than it would in the presence of B). b where N ij is the number of individuals of type i for characteristic 1 and j for characteristic 2. i :::. i :::. that is. . It may be shown (Problem 6. ... a. 1).. .4.
In this section we will consider Bernoulli responses Y that can only take on the values 0 and 1." We assume that the distribution of the response Y depends on the known covariate vector ZT.10. whicll we introduced in Example 1. perhaps after a transfonnation. usually called the log it. (2) election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0). [Xi C :i"J log + m. or (3) market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0). The log likelihood of 7r = (7rl."F~1 Yij where Yij is the response on the jth of the mi trials in block i. and the loglog transform g2(7r) = 10g[log(1 . we call Y = 1 a "success" and Y = 0 a "failure.. Other transforms. a simple linear representation zT ~ for 7r(. we observe the number of successes Xi = L. Instead we turn to the logistic transform g( 7r).4. Thus.4.408 Inference in the Multiparameter Case Chapter 6 with row and column sums as indicated.9) which has approximately a X(al)(bl) distribution under H..1 we considered linear models that are appropriate for analyzing continuous responses {Yi} that are. we obtain what is called the logistic linear regression model where .6.{3p.4."" 7rk)T based on X = (Xl"'" Xk)T is t.. 6. In this section we assume that the data are grouped or replicated so that for each fixed i. Because 7r(z) varies between 0 and 1. (6. we observe independent Xl' . log(l "il] + t. B(mi' 7ri)' where 7ri = 7r(Zi) is the probability of success for a case with covariate vector Zi. Maximum likelihood and dimensionality calculations similar to those for the 2 x 2 table show that Pearson's X2 for the hypothesis of independence is given by (6.. As is typical. log ( ~: ) .7r)].f. 1 :::.8 as the canonical parameter zT TJ = g ( 7r) = log [7r / (1 .4. (6.11) When we use the logit transfonn g(7r). with Xi binomial.3 Logistic Regression for Binary Responses In Section 6.) over the whole range of z is impossible.Li = L. . such as the probit gl(7r) = <I>1(7r) where <I> is the N(O. . Next we choose a parametric model for 7r(z) that will generate useful procedures for analyzing experiments with binary responses. ' XI. Examples are (1) medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0). 1) d.7r)] are also used in practice. i :::. k. The argument is left to the problems as are some numerical applications. approximately normally distributed ~ for known constants {Zij} and and whose means are modeled as J.~ =1 Zij {3j = unknown parameters {31.
Section 6.3.2 can be used to compute the MLE of f3.3.f3p )T is. we can guarantee convergence with probability tending to 1 as N + 00 as follows. It follows that the NILE of f3 solves E f3 (Tj ) = T ..+ . Note that IN(f3) is the log likelihood of a pparameter canonical exponential model with parameter vector f3 and sufficient statistic T = (T1' .+ . the likelihood equations are just ZT(X . in (6. that if m 1fi > 0 for 1 ::.p.J. k ( ) (6. if N = 2:7=1 mi.4 Large Sample Methods for Discrete Data 409 The special case p = 2.4.15) Vi = log X+. . mi 2mi Xi 1 Xi 1 = .. i ::.~ Xi 2 1 +"2 ' (6. or Ef3(Z T X) = ZTX.3. .[ i(l7r iW')· . As the initial estimate use (6.4.4.4 the Fisher information matrix is I(f3) = ZTWZ (6.1fi)*}' + 00. Similarly. k m} (V.1. mi 2mi Here the adjustment 1/2mi is used to avoid log 0 and log 00..1 applies and we can conclude that if 0 < Xi < mi and Z has rank p.J) SN(O. Theorem 2. IN(f3) of f3 = (f31..l. The condition is sufficient but not necessary for existencesee Problem 2.. . Zi The log likelihood l( 7r(f3)) = == k (1.. Although unlike coordinate ascent NewtonRaphson need not converge.4. the empirical logistic transform.. . j = 1.14) where W = diag{ mi1fi(l1fi)}kxk.Li = E(Xi ) = mi1fi. We let J. The coordinate ascent iterative procedure of Section 2. Zi)T is the logistic regression model of Problem 2. 7r Using the 8method.1fi )* = 1 . where Z = IIZij Ilrnxp is the design matrix.4.4.L) = O.4.! ) ( mi . it follows by Theorem 5.(1..12) where T j = 2:7=1 ZijXi and we make the dependence on N explicit. Alternatively with a good initial value the NewtonRaphson algorithm can be employed.1 and by Proposition 3. (6. j Thus. W is estimated using Wo = diag{mi1f. Because f3 = (ZTz) 1 ZT 'T] and TJi = 10g[1fi (1 1fi) in TJi has been replaced by 730 is a plugin estimate of f3 where 1fi and (1 1f. ( 1 .. Then E(Tj ) = 2:7=1 Zij J.3.Li. I N(f3) = f. .14).13) By Theorem 2..3.3. p (Jj Tj  ~ mi loge1+ exp{Zif3} ) + ~ log '. the solution to this equation exists and gives the unique MLE 73 of f3.4.log(l:'.16) 1fi) 1]. Tp) T.
Because Z has rank p, it follows (Problem 6.4.14) that β̂_0 is consistent, and by the multivariate delta method it is asymptotically normal; with this starting point, Newton–Raphson converges to the MLE with probability tending to 1.

Testing

In analogy with Section 6.1, linear subhypotheses are important. We let

ω = {η : η_i = Σ_{j=1}^p z_ij β_j, 1 ≤ i ≤ k, β ∈ R^p}

and let r be the dimension of ω. We want to contrast ω to the case where there are no restrictions on η; that is, we set Ω = R^k and consider η ∈ Ω. In this case the likelihood is a product of independent binomial densities, and the MLEs of π_i and μ_i are X_i/m_i and X_i, i = 1, ..., k. To get expressions for the MLEs of π and μ under ω, recall from Example 1.6.8 that the inverse of the logit transform g is the logistic distribution function; thus, the MLE of π_i is

π̂_i = g^{−1}( Σ_{j=1}^p z_ij β̂_j ) = [1 + exp{−z_i^T β̂}]^{−1}.

The LR statistic 2 log λ for testing H : η ∈ ω versus K : η ∈ Ω − ω is denoted by D(X, μ̂), where μ̂ is the MLE of μ for η ∈ ω. From (6.4.11) and (6.4.12),

D(X, μ̂) = 2 Σ_{i=1}^k [ X_i log(X_i/μ̂_i) + X_i' log(X_i'/μ̂_i') ]    (6.4.17)

where X_i' = m_i − X_i and μ̂_i' = m_i − μ̂_i. D(X, μ̂) measures the distance between the fit μ̂ based on the model ω and the data X. D(X, μ̂) has asymptotically a χ²_{k−r} distribution for η ∈ ω as m_i → ∞, i = 1, ..., k, k < ∞—see Problem 6.4.13.

If ω_0 is a q-dimensional linear subspace of ω with q < r, then we can form the LR statistic for H : η ∈ ω_0 versus K : η ∈ ω − ω_0,

2 log λ = 2 Σ_{i=1}^k [ X_i log(μ̂_i/μ̂_{0i}) + X_i' log(μ̂_i'/μ̂_{0i}') ] = D(X, μ̂_0) − D(X, μ̂)    (6.4.18)

where μ̂_0 is the MLE of μ under H and μ̂_{0i}' = m_i − μ̂_{0i}. The statistic 2 log λ has an asymptotic χ²_{r−q} distribution as m_i → ∞. Here is a special case.

Example. The Binomial One-Way Layout. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of m_i patients and recording the number X_i of patients that recover. For a second example, suppose we want to compare k different locations with respect to the percentage that have a certain attribute, such as the intention to vote for or against a certain proposition. We obtain k independent samples, one from each location, and for the ith location count the number X_i among m_i that have the given attribute. The samples are collected independently and we observe X_1, ..., X_k independent with X_i ~ B(m_i, π_i).
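The deviance (6.4.17) and the nested comparison (6.4.18) can be computed directly; the sketch below is illustrative only, with the usual convention that terms with X_i = 0 or X_i = m_i contribute only their finite parts.

```python
import numpy as np

def binomial_deviance(X, m, mu):
    """D(X, mu) = 2 * sum[ X log(X/mu) + (m - X) log((m - X)/(m - mu)) ].

    Uses the convention x log(x / mu) -> 0 as x -> 0, so boundary counts
    (X_i = 0 or X_i = m_i) are handled without producing infinities."""
    X, m, mu = (np.asarray(a, dtype=float) for a in (X, m, mu))
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(X > 0, X * np.log(X / mu), 0.0)
        t2 = np.where(m - X > 0, (m - X) * np.log((m - X) / (m - mu)), 0.0)
    return 2.0 * float(np.sum(t1 + t2))

# For nested fits, binomial_deviance(X, m, mu0) - binomial_deviance(X, m, mu1)
# is the LR statistic of (6.4.18), referred to a chi-square with r - q degrees
# of freedom.
```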
This model corresponds to the one-way layout of Section 6.1, and as in that section, an important hypothesis is that the populations are homogeneous. Thus, we test

H : π_1 = π_2 = ... = π_k = π, π ∈ (0, 1),

versus the alternative that the π's are not all equal. Under H the log likelihood in canonical exponential form is

β T − N log(1 + exp{β}) + Σ_{i=1}^k log C(m_i, X_i)

where T = Σ_{i=1}^k X_i, N = Σ_{i=1}^k m_i, and β = log[π/(1 − π)]. It follows from Theorem 2.3.1 that if 0 < T < N the MLE exists and is the solution of (6.4.13). Using Z as given for the one-way layout in Section 6.1, we find that the MLE of π under H is π̂ = T/N. The LR statistic is given by (6.4.18) with μ̂_{0i} = m_i π̂, μ̂_0 = (m_1 π̂, ..., m_k π̂)^T. The Pearson statistic

χ² = Σ_{i=1}^k (X_i − m_i π̂)² / [m_i π̂(1 − π̂)]

is a Wald statistic, and the χ² test is equivalent asymptotically to the LR test (Problem 6.4.15).

Summary. We used the large sample testing results of Section 6.3 to find tests for important statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's χ²," which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H. When the hypothesis is that the multinomial parameter is in a q-dimensional subset of the (k − 1)-dimensional parameter space Θ, the Rao statistic is again of the Pearson χ² form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, the Pearson statistic is shown to have a simple intuitive form. We also considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates. We derive the likelihood equations, discuss algorithms for computing MLEs, and give the LR test. In the special case of testing equality of k binomial parameters, we give explicitly the MLEs and the χ² test.

6.5 GENERALIZED LINEAR MODELS

In Sections 6.1 and 6.4.3 we considered experiments in which the mean μ_i = E(Y_ij) of a response Y_ij is expressed as a function of a linear combination

ξ_i = z_i^T β = Σ_{j=1}^p z_ij β_j

of covariate values. In particular, in the case of a Gaussian response, μ_i = ξ_i; and for the binary responses of Section 6.4.3, μ_i = π_i = g^{−1}(ξ_i), where g^{−1} is the logistic distribution function.
More generally, McCullagh and Nelder (1983, 1989) synthesized a number of previous generalizations of the linear model, most importantly the log linear model developed by Goodman and Haberman. See Haberman (1974).

The generalized linear model with dispersion depending only on the mean

The data consist of an observation (Z, Y), where Y = (Y_1, ..., Y_n)^T is n × 1 and Z^T_{p×n} = (Z_1, ..., Z_n) with Z_i = (Z_i1, ..., Z_ip)^T nonrandom, and Y has density

p(y, η) = exp{ Σ_{i=1}^n η_i y_i − A(η) } h(y)    (6.5.1)

where η is not in E, the natural parameter space of the n-parameter canonical exponential family (6.5.1), but in a subset of E obtained by restricting η_i to be of the form η_i = h(z_i^T β), where h is a known function. As we know from Section 1.6, the mean μ of Y is related to η via μ = Ȧ(η). Note that if Ȧ is one-to-one, μ determines η and thereby Var(Y) = Ä(η). Typically, Ȧ_i(η) = A_0'(η_i) for some A_0, in which case μ_i = A_0'(η_i).

We assume that there is a one-to-one transform g(μ) of μ, called the link function, such that

g(μ) = Σ_{j=1}^p β_j z^(j) = Zβ

where z^(j) = (z_1j, ..., z_nj)^T is the jth column vector of Z and β, which is p × 1, ranges freely. Typically, g(μ) is of the form (g(μ_1), ..., g(μ_n))^T, in which case g is also called the link function.

Canonical links

The most important case corresponds to the link being canonical, that is,

g = Ȧ^{−1}, or η = Σ_{j=1}^p β_j z^(j) = Zβ.

In this case, the GLM is the canonical subfamily of the original exponential family generated by Z^T Y, which is p × 1. Special cases are:

(i) The linear model with known variance. The Y_i are independent Gaussian with known variance 1 and means μ_i = Σ_{j=1}^p z_ij β_j. The model is a GLM with canonical link g(μ) = μ, the identity.
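As an illustrative check (not from the text), the sketch below lists the mean function A_0' and its inverse, the canonical link, for three standard one-parameter families, and verifies numerically that g(A_0'(η)) = η.

```python
import numpy as np

# (mean function A_0'(eta), canonical link g(mu)) for three standard families.
links = {
    "gaussian":  (lambda eta: eta,                    lambda mu: mu),                     # identity
    "bernoulli": (lambda eta: 1 / (1 + np.exp(-eta)), lambda mu: np.log(mu / (1 - mu))),  # logit
    "poisson":   (lambda eta: np.exp(eta),            lambda mu: np.log(mu)),             # log
}

eta = 0.3
for name, (mean_fn, link) in links.items():
    # The canonical link is the inverse of the mean function: g(A_0'(eta)) = eta.
    print(name, np.allclose(link(mean_fn(eta)), eta))
```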
(ii) Log linear models. Suppose (Y_1, ..., Y_p)^T is M(n, θ_1, ..., θ_p), θ_j > 0, Σ_{j=1}^p θ_j = 1. Then, as seen in Example 1.6.7, η_j = log θ_j, 1 ≤ j ≤ p, are canonical parameters. If we take g(μ) = (log μ_1, ..., log μ_p)^T, the link is canonical. But, in general, the β_j are free (unidentifiable) parameters. The models we obtain are called log linear—see Haberman (1974) for an extensive treatment. The log linear label is also attached to models obtained by taking the Y_i independent Bernoulli (θ_i), 0 < θ_i < 1, with canonical link g(θ) = log[θ(1 − θ)^{−1}]. This is just the logistic linear model of Section 6.4.3.

Suppose, for example, that Y = ||Y_ij||, 1 ≤ i ≤ a, 1 ≤ j ≤ b, so that Y_ij is the indicator of, say, classification i on characteristic 1 and j on characteristic 2. Then θ = ||θ_ij||, 1 ≤ i ≤ a, 1 ≤ j ≤ b, Σ_{i,j} θ_ij = 1, and the log linear model corresponding to

log θ_ij = β_i + β_j, 1 ≤ i ≤ a, 1 ≤ j ≤ b,

where the β_i, β_j are free, is that of independence, θ_ij = θ_{i+} θ_{+j}, where θ_{i+} = Σ_{j=1}^b θ_ij and θ_{+j} = Σ_{i=1}^a θ_ij. See Haberman (1974) for a further discussion.

Algorithms

If the link is canonical, then by Theorem 2.3.1, if maximum likelihood estimates exist they necessarily uniquely satisfy the equation

Z^T Y = E_β(Z^T Y) = Z^T Ȧ(Zβ).    (6.5.2)

It is interesting to note that (6.5.2) can be interpreted geometrically in somewhat the same way as in the Gaussian linear model—the "residual" vector Y − μ(β̂) is orthogonal to the column space of Z. But, in general, μ(β̂) is not a member of that space. The coordinate ascent algorithm can be used to solve (6.5.2) (or ascertain that no solution exists). With a good starting point β_0 one can achieve faster convergence with the Newton–Raphson algorithm of Section 2.4. In this situation, and more generally even for noncanonical links, Newton–Raphson coincides with Fisher's method of scoring described in the problems. For the canonical link that procedure is just

β̂_{m+1} = β̂_m + [Z^T W_m Z]^{−1} Z^T (Y − Ȧ(Zβ̂_m))    (6.5.3)

where W_m = diag{A_0''(z_i^T β̂_m)}_{n×n}.
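A minimal sketch of this iteration for the Poisson (log-linear) case follows; the data, the starting rule, and the function name are assumptions made for illustration, not part of the text.

```python
import numpy as np

def fit_poisson_loglinear(Z, Y, n_iter=15):
    """Newton-Raphson (= Fisher scoring for canonical links) for the log-link
    Poisson GLM: solves the likelihood equation Z'Y = Z' exp(Z beta)."""
    Y = np.asarray(Y, dtype=float)
    # Least-squares fit to log(Y + 1/2) as a rough, safe starting value.
    beta = np.linalg.solve(Z.T @ Z, Z.T @ np.log(Y + 0.5))
    for _ in range(n_iter):
        mu = np.exp(Z @ beta)                   # A_0'(eta_i) for the Poisson family
        score = Z.T @ (Y - mu)
        info = Z.T @ (mu[:, None] * Z)          # A_0''(eta_i) = mu_i
        beta = beta + np.linalg.solve(info, score)
    return beta

# Illustrative counts (invented) with an intercept and one covariate.
z = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
Z = np.column_stack([np.ones_like(z), z])
Y = np.array([2, 4, 7, 13, 20])
print(fit_poisson_loglinear(Z, Y))
```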
For the Gaussian linear model with known variance σ_0², the procedure converges in one step to the least squares estimate (Problem 6.5.2). In general, let Δ̂_{m+1} ≡ β̂_{m+1} − β̂_m, which satisfies the equation

Z^T W_m Z Δ̂_{m+1} = Z^T (Y − Ȧ(Zβ̂_m)).    (6.5.4)

That is, the correction Δ̂_{m+1} is given by the weighted least squares formula (2.2.20) when the data are the residuals from the fit at stage m, the variance-covariance matrix is W_m, and the regression is on the columns of W_m Z (Problem 6.5.2). In this context, the algorithm is also called iterated weighted least squares; this name stems from the interpretation of (6.5.4) as a weighted least squares fit to the current residuals. If the starting value β̂_0 is close enough to the true value of β, then with probability tending to 1 as n → ∞ the algorithm converges to the MLE if it exists.

Testing in GLM

Testing hypotheses in GLM is done via the LR statistic. As in the linear model, we can define the biggest possible GLM M of the form (6.5.1), for which p = n. In that case the MLE of μ is μ̂_M = (Y_1, ..., Y_n)^T (assume that Y is in the interior of the convex support of {y : p(y, η) > 0}). Write η(·) for Ȧ^{−1}. We can think of the test statistic

2 log λ = 2[ l(Y, η(Y)) − l(Y, η(μ_0)) ]    (6.5.5)

for the hypothesis that μ = μ_0 within M as a "measure" of (squared) distance between Y and μ_0. This quantity, called the deviance between Y and μ_0 and written D(Y, μ_0), is always ≥ 0. The LR statistic for H : μ ∈ ω_0 is just D(Y, μ̂_0) ≡ inf{D(Y, μ) : μ ∈ ω_0}, where μ̂_0 is the MLE of μ in ω_0. The LR statistic for H : μ ∈ ω_0 versus K : μ ∈ ω_1 − ω_0, with ω_1 ⊃ ω_0, is D(Y, μ̂_0) − D(Y, μ̂_1), where μ̂_1 is the MLE under ω_1. If ω_0 ⊂ ω_1 we can write

D(Y, μ̂_0) = [D(Y, μ̂_0) − D(Y, μ̂_1)] + D(Y, μ̂_1),    (6.5.6)

a decomposition of the deviance between Y and μ̂_0 as the sum of two nonnegative components, Δ(μ̂_0, μ̂_1) ≡ D(Y, μ̂_0) − D(Y, μ̂_1) and the deviance of Y to μ̂_1, each of which can be thought of as a squared distance between its arguments. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6.1. Unfortunately Δ ≠ D generally, except in the Gaussian case.
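The sketch below illustrates an analysis-of-deviance comparison for the Poisson case; it assumes two nested fitted mean vectors mu0 and mu1 are already available (for example from the fitting sketch above) and is not taken from the text.

```python
import numpy as np
from scipy.stats import chi2

def poisson_deviance(Y, mu):
    """D(Y, mu) = 2 * sum[ y log(y/mu) - (y - mu) ], with y log(y/mu) -> 0 as y -> 0."""
    Y = np.asarray(Y, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(Y > 0, Y * np.log(Y / mu), 0.0)
    return 2.0 * float(np.sum(term - (Y - mu)))

def analysis_of_deviance(Y, mu0, mu1, df):
    """Split D(Y, mu0) = [D(Y, mu0) - D(Y, mu1)] + D(Y, mu1); the bracketed part is
    the LR statistic for the submodel, referred to chi-square with df = dim difference."""
    delta = poisson_deviance(Y, mu0) - poisson_deviance(Y, mu1)
    return delta, 1.0 - chi2.cdf(delta, df)
```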
Formally, if ω_0 is a GLM of dimension p and ω_1 ⊃ ω_0 is one of dimension q with canonical links, then Δ(μ̂_0, μ̂_1) is thought of as being asymptotically χ²_{q−p} under H : μ ∈ ω_0.    (6.5.7)

This can be made precise for stochastic GLMs obtained by conditioning on Z_1, ..., Z_n in the sample (Z_1, Y_1), ..., (Z_n, Y_n). More details are discussed in what follows.

Asymptotic theory for estimates and tests

If (Z_1, Y_1), ..., (Z_n, Y_n) can be viewed as a sample from a population and the link is canonical, the theory of Sections 6.2 and 6.3 applies straightforwardly in view of the general smoothness properties of canonical exponential families. For instance, if we assume the covariates in logistic regression with canonical link to be stochastic, that is, if we take the Z_i as having marginal density q_0, which we temporarily assume known, then (Z_1, Y_1), ..., (Z_n, Y_n) has density

exp{ Σ_{i=1}^n [Y_i Z_i^T β − A_0(Z_i^T β)] } Π_{i=1}^n q_0(Z_i).    (6.5.8)

This is not unconditionally an exponential family in view of the A_0(Z_i^T β) term. However, there are easy conditions under which the conditions of Theorems 6.2.1, 6.2.2, and 6.3.3 hold (Problem 6.5.3), so that the MLE β̂ exists asymptotically, is unique, is consistent, and

√n(β̂ − β) → N(0, I^{−1}(β)).    (6.5.9)

What is I(β)? The efficient score function (∂/∂β) log p(z, y, β) is (y − Ȧ_0(z^T β))z, and so

I(β) = E[ Ä_0(Z^T β) Z Z^T ],

which can be estimated by the plug-in Î = n^{−1} Σ_{i=1}^n Ä_0(Z_i^T β̂) Z_i Z_i^T, in order to obtain approximate confidence procedures. If we wish to test hypotheses such as H : β_1 = ... = β_d = 0, d < p, we can calculate

2 log λ_n = 2[ l_n(β̂) − l_n(β̂_H) ]    (6.5.10)

where β̂_H is the (p × 1) MLE for the GLM with β^{p×1} = (0, ..., 0, β_{d+1}, ..., β_p)^T, and can conclude that the statistic of (6.5.10) is asymptotically χ²_d under H. Similar conclusions follow for the Wald and Rao statistics.
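As one concrete instance of the Wald version of such a test, here is a minimal sketch for the grouped logistic model, using the estimated information Z'WZ at the fitted value; the function name and inputs are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

def wald_test_first_d(Z, m, beta_hat, d):
    """Wald test of H: beta_1 = ... = beta_d = 0 in the grouped logistic GLM."""
    pi = 1.0 / (1.0 + np.exp(-Z @ beta_hat))
    W = m * pi * (1.0 - pi)
    info = Z.T @ (W[:, None] * Z)          # total Fisher information Z' W Z at beta_hat
    cov = np.linalg.inv(info)              # approximate covariance of beta_hat
    b = beta_hat[:d]
    stat = float(b @ np.linalg.solve(cov[:d, :d], b))
    return stat, 1.0 - chi2.cdf(stat, d)   # statistic and approximate p-value
```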
Note that these tests can be carried out without knowing the density q_0 of Z_1. These conclusions remain valid for the usual situation in which the Z_i are not random, but their proof depends on asymptotic theory for independent nonidentically distributed variables, which we postpone to Volume II.

General link functions

Links other than the canonical one can be of interest. For instance, if in the binary data regression model of Section 6.4.3 we take g(μ) = Φ^{−1}(μ), so that π_i = Φ(z_i^T β), we obtain the so-called probit model. Cox (1970) considers the variance stabilizing transformation, which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. As he points out, the results of analyses with these various transformations over the range .1 ≤ μ ≤ .9 are rather similar. From the point of view of analysis for fixed n, noncanonical links can cause numerical problems because the models are now curved rather than canonical exponential families. Existence of MLEs and convergence of algorithm questions all become more difficult, and so canonical links tend to be preferred.

The generalized linear model

The GLMs considered so far force the variance of the response to be a function of its mean. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6.5.1) depend on an additional scalar parameter τ. It is customary to write the model as

p(y, η, τ) = exp{ c^{−1}(τ)(η^T y − A(η)) } h(y, τ)    (6.5.11)

for c(τ) > 0. Because ∫ p(y, η, τ) dy = 1,

A(η)/c(τ) = log ∫ exp{ c^{−1}(τ) η^T y } h(y, τ) dy.    (6.5.12)

The left-hand side of (6.5.12) is of product form A(η)[1/c(τ)], whereas the right-hand side cannot always be put in this form. However, when it can, then it is easy to see that

E(Y) = Ȧ(η)    (6.5.13)

Var(Y) = c(τ) Ä(η)    (6.5.14)

so that the variance can be written as the product of a function of the mean and a general dispersion parameter. Important special cases are the N(μ, σ²) and gamma (p, λ) families. For further discussion of this generalization see McCullagh and Nelder (1983, 1989).
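The remark above that logit and probit analyses agree closely over the middle range of probabilities can be checked numerically; the short sketch below, with arbitrarily chosen grid points, shows that the two transforms are nearly proportional for .1 ≤ π ≤ .9.

```python
import numpy as np
from scipy.stats import norm

# Grid avoids 0.5, where both transforms are exactly 0.
pi = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
logit = np.log(pi / (1 - pi))
probit = norm.ppf(pi)

# Over this range the ratio stays roughly constant (about 1.6-1.7), which is
# why fitted logit and probit analyses tend to give similar conclusions.
print(np.round(logit / probit, 2))
```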
Another, even more important, set of questions, having to do with selection between nested models of different dimension, is touched on in Problem 6.5.8 but otherwise also postponed to Volume II.

Summary. We considered generalized linear models defined as canonical exponential models in which the mean vector of the vector Y of responses can be written as a function, called the link function, of a linear predictor of the form Σ_j β_j z^(j), where the z^(j) are observable covariate vectors and β is a vector of regression coefficients. We considered the canonical link function that corresponds to the model in which the canonical exponential model parameter equals the linear predictor. We discussed algorithms for computing MLEs of β. In the random design case, we use the asymptotic results of the previous sections to develop large sample estimation results, confidence procedures, and tests.

6.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS

As we most recently indicated in Example 6.2.2, the distributional and implicit structural assumptions of parametric models are often suspect. In Example 6.2.2 we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian. We found that if we assume an error density f_0 which is symmetric about 0, the resulting MLEs, optimal under f_0, continue to estimate β as defined by (6.2.19), even if the true errors are N(0, σ²) or, in fact, have any distribution symmetric around 0. That is, roughly speaking, if we consider the semiparametric model

P_1 = {P : (Z^T, Y)^T ~ P given by (6.2.19) with ε_i i.i.d. with density f for some f symmetric about 0},

the right thing to do if one assumes the (Z_i, Y_i) are i.i.d. P ∈ P_1 is "act as if the model were the one given in Example 6.2.2." Of course, the LSE is not the best estimate of β for this model. There is another semiparametric model,

P_2 = {P : (Z^T, Y)^T ~ P that satisfies E_P(Y | Z) = Z^T β, E_P Z^T Z nonsingular, E_P Y² < ∞}.

For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II. Furthermore, if P_3 is the nonparametric model where we assume only that (Z, Y) has a joint distribution, and if we are interested in estimating the best linear predictor μ_L(Z) of Y given Z, then the LSE of β is, in fact, still a consistent, asymptotically normal estimate, and under further mild conditions on P, so is any estimate solving the corresponding estimating equations, for estimating μ_L(Z) in a submodel of P_3 with

Y = Z^T β + σ_0(Z) ε    (6.6.1)

where ε is independent of Z but σ_0(Z) is not constant and σ_0 is assumed known.
Robustness in Estimation

We drop the assumption that the errors ε_1, ..., ε_n in the linear model are normal. Instead we assume the Gauss–Markov linear model

Y = Zβ + ε, E(ε) = 0, Var(ε_i) = σ², Cov(ε_i, ε_j) = 0, i ≠ j,    (6.6.2)

where Z is an n × p matrix of constants, β is p × 1, and Y and ε are n × 1; that is, the errors have mean zero, common variance, and are uncorrelated. Many of the properties stated for the Gaussian case carry over to the Gauss–Markov case.

Proposition 6.6.1. If the Gauss–Markov linear model (6.6.2) holds, then β̂ and μ̂ are still unbiased, Var(μ̂) = σ²H, Var(β̂) = σ²(Z^T Z)^{−1}, Var(ε̂) = σ²(I − H), and, if p = r, the conclusions (1) and (2) of Theorem 6.1.4 are still valid.

Moreover, the optimality of the estimates μ̂_i and β̂_j of Section 6.1 in general still holds when they are compared to other linear estimates. The Gauss–Markov theorem makes this precise: for any parameter of the form α = Σ_{i=1}^n a_i μ_i for some constants a_1, ..., a_n, the estimate α̂ = Σ_{i=1}^n a_i μ̂_i, in addition to being UMVU in the normal case, has uniformly minimum variance among all unbiased estimates linear in Y_1, ..., Y_n, for all models with EY_i² < ∞.

Theorem 6.6.2. Suppose the Gauss–Markov linear model (6.6.2) holds. Then α̂ = Σ_{i=1}^n a_i μ̂_i has uniformly minimum variance among all unbiased estimates of α = Σ_{i=1}^n a_i μ_i that are linear in Y_1, ..., Y_n.

Proof. Let ᾱ = Σ_{i=1}^n d_i Y_i denote any linear estimate. Because E(ε_i) = 0 and Cov(Y_i, Y_j) = Cov(ε_i, ε_j), if ᾱ is unbiased then E(ᾱ) = Σ_{i=1}^n a_i μ_i = α, and

Var_GM(ᾱ) = Σ_{i=1}^n d_i² Var(ε_i) + 2 Σ_{i<j} d_i d_j Cov(ε_i, ε_j) = σ² Σ_{i=1}^n d_i²,

where Var_GM refers to the variance computed under the Gauss–Markov assumptions. The preceding computation shows that Var_GM(ᾱ) = Var_G(ᾱ), where Var_G(ᾱ) stands for the variance computed under the Gaussian assumption that ε_1, ..., ε_n are i.i.d. N(0, σ²). By Theorems 6.1.2(iv) and 6.1.3(iv), Var_G(α̂) ≤ Var_G(ᾱ) for all unbiased ᾱ. Because Var_G = Var_GM for all linear estimators, the result follows. □

Note that the preceding result and proof are similar to those of Section 1.4, where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case.

Example 6.6.1. One Sample (continued). Here μ = β_1 and μ̂ = β̂_1 = Ȳ. The Gauss–Markov theorem shows that Ȳ, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with EY_i² < ∞. Moreover, our current μ̂ coincides with the empirical plug-in estimate of the optimal linear predictor (1.4.14).
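A small Monte Carlo sketch (all design values and error distribution invented) can illustrate the proposition: the variance of the LSE matches σ²(Z'Z)⁻¹ even when the errors are decidedly non-Gaussian, as long as they have mean zero and common variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative fixed design; centered exponential errors: mean 0, variance 1,
# far from Gaussian, but satisfying the Gauss-Markov assumptions.
z = np.linspace(-1.0, 1.0, 10)
Z = np.column_stack([np.ones_like(z), z])
beta, sigma2, reps = np.array([1.0, 2.0]), 1.0, 20000

eps = rng.exponential(1.0, size=(reps, 10)) - 1.0
Y = Z @ beta + eps                               # reps x 10 matrix of responses
betas = np.linalg.solve(Z.T @ Z, Z.T @ Y.T).T    # LSE for each replication

print(np.cov(betas.T))                           # Monte Carlo Var(beta_hat)
print(sigma2 * np.linalg.inv(Z.T @ Z))           # sigma^2 (Z'Z)^{-1}
```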
However, for n large, Ȳ has a larger variance (Problem 6.6.5) than the nonlinear estimate given by the sample median when the density of Y is the Laplace density

(1/2)λ exp{−λ|y − μ|}, λ > 0, y ∈ R, μ ∈ R,

which does not belong to the Gaussian model. For this density, and for all symmetric densities, Ȳ is unbiased. Thus, a major weakness of the Gauss–Markov theorem is that it only applies to linear estimates. Another weakness is that it only applies to the homoscedastic case where Var(Y_i) is the same for all i. Suppose we have a heteroscedastic version of the linear model where E(ε_i) = 0 but Var(ε_i) depends on i. If the variances are known, we can use the Gauss–Markov theorem to conclude that the weighted least squares estimates of Section 2.2 are UMVU among linear estimates; when the variances are unknown, however, our estimates are not optimal even in the class of linear estimates.

Robustness of Tests

In Section 5.3 we investigated the robustness of the significance levels of t tests for the one- and two-sample problems using asymptotic and Monte Carlo methods. Now we will use asymptotic methods to investigate the robustness of levels more generally. The asymptotic behavior of the LR, Wald, and Rao tests depends critically on the asymptotic behavior of the underlying MLEs θ̂ and θ̂_0. From the theory developed in Section 6.2 we know that if Ψ(·, θ) = Dl(·, θ) and P is true, we expect that θ̂_n → θ(P), which is the unique solution of

∫ Ψ(x, θ) dP(x) = 0,    (6.6.3)

and that √n(θ̂ − θ(P)) → N(0, Σ(Ψ, P)), with Σ(Ψ, P) as given in Section 6.2. Thus, consider H : θ = θ_0 and the Wald test statistic

T_W = n(θ̂ − θ_0)^T I(θ_0)(θ̂ − θ_0).

If θ(P) ≠ θ_0, evidently T_W → ∞. But if θ(P) = θ_0, then T_W → V^T I(θ_0) V, where V ~ N(0, Σ(Ψ, P)). Because Σ(Ψ, P) ≠ I^{−1}(θ_0) in general, the asymptotic distribution of T_W is not χ²_r, and the asymptotic level of the test is not α. This observation holds for the LR and Rao tests as well—but see the problems. There is an important special case where all is well: the linear model we have discussed in Example 6.2.1.

Example 6.6.2. The Linear Model with Stochastic Covariates. Suppose (Z_1, Y_1), ..., (Z_n, Y_n) are i.i.d. as (Z, Y), where Z is a (p × 1) vector of random covariates and we model the relationship between Z and Y as

Y = Z^T β + ε
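The Laplace comparison above is easy to verify by simulation; the sketch below uses arbitrary sample sizes and seeds and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Laplace (double-exponential) errors with scale 1: variance 2, so the sample
# mean has variance about 2/n, while the sample median (the MLE of the center
# under this density) has asymptotic variance about 1/n -- a nonlinear estimate
# beating the best linear one, which the Gauss-Markov theorem cannot exclude.
n, reps = 100, 20000
samples = rng.laplace(loc=0.0, scale=1.0, size=(reps, n))
print(samples.mean(axis=1).var())        # roughly 2/n = 0.02
print(np.median(samples, axis=1).var())  # roughly 1/n = 0.01
```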
with the distribution P of (Z, Y) such that ε and Z are independent, E_P ε = 0, E_P ε² < ∞ (see Examples 6.1.2 and 6.2.2). Then β(P) specified by (6.6.3) equals β in the model above and σ²(P) = Var_P(ε). Consider, say, H : β = 0. Under H, the LR, Wald, and Rao tests are all equivalent to the F test: Reject if

T_n = β̂^T Z_(n)^T Z_(n) β̂ / (p s²) ≥ f_{p, n−p}(1 − α),

where β̂ is the LSE, Z_(n) = (Z_1, ..., Z_n)^T, and

s² = (n − p)^{−1} Σ_{i=1}^n (Y_i − Ŷ_i)² = (n − p)^{−1} |Y − Z_(n)β̂|².

Now, by the law of large numbers,

n^{−1} Z_(n)^T Z_(n) = n^{−1} Σ_{i=1}^n Z_i Z_i^T → E(ZZ^T) and s² → σ²,

and, even if ε is not Gaussian, it is still true by Theorem 6.2.1 that √n(β̂ − β) is asymptotically normal with mean 0 and variance σ²[E(ZZ^T)]^{−1}. It follows by Slutsky's theorem that, under H : β = 0, the limiting distribution of pT_n is χ²_p. Because the critical value f_{p, n−p}(1 − α) converges to the corresponding quantile of the limiting distribution, the F test has asymptotic level α even if the errors are not Gaussian. This kind of robustness holds for H : β_{q+1} = β_{0,q+1}, ..., β_p = β_{0,p}, or more generally for β ∈ L_0 + β_0, a q-dimensional affine subspace of R^p. It is intimately linked to the fact that even though the parametric model on which the test was based is false, (i) the set of P satisfying the hypothesis remains the same, and (ii) the (asymptotic) variance of β̂ is the same as under the Gaussian model. Moreover, confidence procedures based on approximating the variance of β̂ by s²[Z_(n)^T Z_(n)]^{−1} are still asymptotically of correct level.
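A short simulation sketch of this level robustness follows; the sample size, error distribution, and number of replications are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(2)

n, p, reps = 50, 2, 5000
rejections = 0
for _ in range(reps):
    Z = rng.normal(size=(n, p))            # stochastic covariates
    eps = rng.standard_t(df=5, size=n)     # heavy-tailed, non-Gaussian errors
    Y = eps                                # H: beta = 0 is true
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    resid = Y - Z @ beta_hat
    s2 = resid @ resid / (n - p)
    T = beta_hat @ (Z.T @ Z) @ beta_hat / (p * s2)
    rejections += T >= f_dist.ppf(0.95, p, n - p)

print(rejections / reps)                   # close to the nominal level 0.05
```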
If the first of these conditions fails, then it is not clear what H : θ = θ_0 means anymore. If it holds but the second fails, the theory goes wrong. We illustrate with two final examples.

Example 6.6.3. The Two-Sample Scale Problem. Suppose our model is that X = (X^(1), X^(2)), where X^(1), X^(2) are independent E(λ_1), E(λ_2), respectively, the lifetimes of paired pieces of equipment, and we observe n independent copies X_1, ..., X_n of X. Write θ_j = 1/λ_j = E(X^(j)). If we take H : λ_1 = λ_2, the standard Wald test is

"Reject iff n(θ̂_1 − θ̂_2)² / σ̂² ≥ x_1(1 − α)"    (6.6.10)

where θ̂_j = n^{−1} Σ_{i=1}^n X_i^(j) and

σ̂ = √2 (1/2n) Σ_{i=1}^n (X_i^(1) + X_i^(2)).    (6.6.11)

The conditions of Theorem 6.2.2 clearly hold, and under the exponential model the test has asymptotic level α. However, suppose now that X^(1)/θ_1 and X^(2)/θ_2 are identically distributed but not exponential. Then H is still meaningful because, under H, X^(1) and X^(2) are identically distributed. But the test (6.6.10) does not have asymptotic level α in general. To see this, note that under H, σ̂ → √2 E_P(X^(1)), whereas the asymptotic standard deviation of √n(θ̂_1 − θ̂_2) is [2 Var_P(X^(1))]^{1/2}, and √2 E_P(X^(1)) ≠ [2 Var_P(X^(1))]^{1/2} in general. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general: simply replace σ̂² by

σ̂*² = (1/n) Σ_{i=1}^n { (X_i^(1) − (X̄^(1) + X̄^(2))/2)² + (X_i^(2) − (X̄^(1) + X̄^(2))/2)² },    (6.6.15)

which is a consistent estimate of 2 Var_P(X^(1)) under H. □

Example 6.6.4. The Linear Model with Stochastic Covariates with ε and Z Dependent. Suppose E(ε | Z) = 0, so that the parameters β of Example 6.6.2 are still meaningful, but ε and Z are dependent; that is, we assume the variances are heteroscedastic. For simplicity, let the distribution of ε given Z = z be that of σ(z)ε', where ε' is independent of Z, and suppose without loss of generality that Var(ε') = 1. Then H is still meaningful, but condition (ii) fails, and in fact, by Theorem 6.2.1,

√n(β̂ − β) → N(0, Q), Q = [E(ZZ^T)]^{−1} E(σ²(Z) ZZ^T) [E(ZZ^T)]^{−1},

where Q ≠ σ²[E(ZZ^T)]^{−1} in general, so that the F test of Example 6.6.2 does not have asymptotic level α in general.
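For Example 6.6.4, a consistent estimate of Q of the "sandwich" form can replace σ²[E(ZZ^T)]^{−1}; the sketch below is one standard way to compute it and is offered as an illustration, not as the text's own formula.

```python
import numpy as np

def sandwich_cov(Z, Y, beta_hat):
    """Sandwich (Huber-type) estimate of Var(beta_hat) for the LSE:
    (Z'Z)^{-1} [sum_i e_i^2 z_i z_i'] (Z'Z)^{-1}.

    Remains consistent (as an estimate of Q/n) when Var(eps | Z = z) = sigma^2(z)
    varies with the covariates, unlike s^2 (Z'Z)^{-1}."""
    resid = Y - Z @ beta_hat
    bread = np.linalg.inv(Z.T @ Z)
    meat = Z.T @ (resid[:, None] ** 2 * Z)
    return bread @ meat @ bread
```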
. the confidence procedures derived from the normal model of Section 6. Wald. the methods need adjustment.422 Inference in the Multiparameter Case Chapter 6 (Problem 6..n l1Y . Ur..6) and. Yn . 6. in general. For d = 2 above this is just the asymptotic solution of the BehrensFisher problem discussed in Section 4. It is easy to see that our tests of H : {32 = . .fiT) o:2)T of U 2)T .6. and Rao tests need to be modified to continue to be valid asymptotically. . = {3d = 0 fail to have correct asymptotic levels in general unless (52 (Z) is constant or A1 = . In particular. 0 < Aj < 1.Id . (52) are still reasonable when the true error distribution is not Gaussian. and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model. For the canonical linear Gaussian model with (52 unknown. . (i) the MLE does not exist if n = r. 1967).. The GaussMarkov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i. . then the MLE (iil.11) with both 11 and (52 unknown. . 0 To summarize: If hypotheses remain meaningful when the model is false then.Id) has a multinomial (A1. in general. N(O. . . LR.d. use Theorem 3. N(O.1. Compare this bound to the variance of 8 2 • . . In partlCU Iar.6. = Ad = 1/ d (Problem 6..i.Ad)T where (I1. A solution is to replace Z(n)Z(n)/ 8 2 in (6.9. then the MLE and LR procedures for a specific model will fail asymptotically in the wider model. We also demonstrated that when either the hypothesis H or the variance of the MLE is not preserved when going to the wider model."i=r+ I i . the twosample problem with unequal sample sizes and variances.11 . identical variances. In this case.3.1. provided we restrict the class of estimates to linear functions of Y1 . 1 ::.. . We considered the behavior of estimates and tests when the model that generated them does not hold. Specialize to the case Z = (1. The simplest solution at least for Wald tests is to use as an estimate of [Var y'n8]1 not 1(8) or ~D2ln(8) but the socalled "sandwich estimate" (Huber.. .. This is the stochastic version of the dsample model of Example 6.. In the linear model with a random design matrix. Wald. Show that in the canonical exponential model (6. ••• .6..3 to compute the information lower bound on the variance of an unbiased estimator of (52.16) Summary.6).6) does not have the correct level.4. n 1 L. we showed that the MLEs and tests generated by the model where the errors are U. where Qis a consistent estimate of Q. .4.6. d.6. ""2 _ ""12 (TJ1. j ::.. and Rao tests.Ad) distribution.1 1. (ii) if n 2: r + 1.d. and are uncorrelated.. the test (6..6) by Q1. .7 PROBLEMS AND COMPLEMENTS Problems for Section 6. (6. TJr. (5 2)T·IS (U1..1 are still approximately valid as are the LR.A1. .. (5 .. (52) assumption on the errors is replaced by the assumption that the errors have mean zero.JL • ""n 2.
see Problem 2. . i = 1.... .2 with p = r coincides with the empirical plugin estimate of JLL = ({LL1. Derive the formula (6.. Yl).29). is (c) Show that Y and Bare unbiased.Li = {Ly+(zi {Lz)f3. (d) Show that Var(O) ~ Var(Y). the 100(1 . 5. n. where a. . zi = (Zi21"" Zip)T.d. .2 ).13..2 coincides with the likelihood ratio statistic A(Y) for the 0. Var(Uf) = 20.. 0. = (Zi2. Show that Ii of Example 6. . N(O.1.a)% confidence interval for /31 in the Gaussian linear model is Z2)2 ] . i = 1.. (Zi..d. and the 1.'''' Zip)T.7 Problems and Complements 423 Hint: By A.. fn where ei = ceil + fi. Let Yi denote the response of a subject at time i.1.Section 6. i = 1. then B the MLE of (). Derive the formula (6. Find the MLE B ().4. {Ly = /31. Show that >:(Y) defined in Remark 6. 9. n.14).. and (3 and {Lz are as defined in (1.n. Yn ) where Z..1. 1 {LLn)T. .. .1] is a known constant and are i. of 7. t'V (b) Show that if ei N(O. . .29) for the noncentrality parameter 82 in the oneway layout.l+c (_c)j+l)/~ (1. i = 1. . . 8. . Consider the model (see Example 1. 1 n f1. where IJ.i. Show that in the regression example with p = r = 2.1. c E [0. Here the empirical plugin estimate is based on U. J =~ i=O L)) c i (1. 4.28) for the noncentrality parameter ()2 in the regression example. fO 0.."" (Z~.5) Yi () + ei.4 • 3. .2 replaced by 0: 2 • 6. Let 1.2 known case with 0.n ~ where fi can be written as fi = ceil + ei for given constant c satisfying 0 ~ c are independent identically distributed with mean zero and variance 0. i eo = 0 (the €i are called moving average errors.2 ). Suppose that Yi satisfies the following model Yi ei = () + fi.2 . (e) Show that Var(B) < Var(Y) unless c = O.2. 0.22.1..(_C)i)2 l+c L i=l (a) Show that Bis the weighted least squares estimate of ().
(a) Find a level (1 . ..Z.1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::.. (Zi2 Z.p (1 .. In the oneway layout. ~(~ + t33) t31' r = 2. where n is even. = 2(Y .":=::.{3i are given by = ~(f.2). (a) Show that level (1 .2)2 a and 8k in the oneway layout Var(8k ) .a) confidence intervals for linear functions of the form {3j . . . l(Yh . Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6.. then Var( a) is minimized by choosing ni = n / p. .p)S2 /x n .I). a 2 ::..• 424 10...a)% confidence intervals for a and 6k • 12.2) L(Zi2 . ~ 1 . Show that for the estimates (a) Var(a) = +==_=:. (b) Find a level (1 a) prediction interval for Y (i.Z.1 and variance a 2 . Y ::. (c) If n is fixed and divisible by 2(p . 'lj.p (~a) (n . ::.:1 Yi + (3/ 2n) L~= ~ n+ 1 }i.. then the hat matrix H = (h ij ) is given by 1 n 11.e . (n .1. Often a treatment that is beneficial in small doses is harmful in large doses. Yn ) such that P[t. = np n/2(p 1). Yn . (d) Give the 100(1 . The following model is useful in such situations. 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13.. Yn ). . Assume the linear regression model with p future observation Y to be taken at the pont z.a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z.. which is the amount .. 15. n2 = .a). Consider a covariate x. Yn be a sample from a population with mean f. then Var(8k ) is minimized by choosing ni = n/2. We want to predict the value of a 14.2)(Zj2 .• statistics HYl. (b) Find confidence intervals for 'lj. Note that Y is independent of Yl. . Let Y 1 .2.~ L~=1 nk = (p~:)2 + Lk#i .k)' C (b) If n is fixed and divisible by p. T2 1 (1/ 2n) L.p)S2 /x n .. ••. Consider the three estimates T1 = and T3 Y.
8) = 2. Problems for Section 6.or dose of a treatment. . (33.I::llogxi' You may use X = 0.5 Zi2 * + (33zi2 where Zil = Xi  X. ( 2 ) density. But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O... X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi .7 Probtems and Lornol. 17. (b) Plot (Xi.• . P. (a) For the following data (from Hald. 7 2 ) does not exist so that A6 doesn't hold. 8). xC' = 0 => x = 0 for any rvector x.77 167. .r2) (x). : {L E R.0'2)(x ) + E<t'(JL. Hint: Because C' is of rank r. p(x.1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL. ... and Q = P. fli) where fi1. ( 2 ). a 2 > O}. n are independent N(O. (b) Show that the MLE of ({L. hence. ( 2 ). Show that if C is an n x r matrix of rank r. fi2. En Xi. . p.r2 )l 1 2 7 > O.0'2) is the N(Il'. Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1.95 Yi = e131 e132xi Xf3. r :::.A6 when 8 = ({L.2 1. Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133. 1 2 where <t'CJL. ( 2 )T is identifiable if and only if r = p. . 16.0289. Yi) and (Xi. and a response variable Y. fi3 and level 0. ( 2 ) logp(x. In the Gaussian linear model show that the parametrization ({3. Y (yield) 3.emEmts 425 . Let 8. a 2 <7 2 .2. Check AO. then the r x r matrix C'C is of rank r and. 653). i = 1. (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL. Assume the model = (31 + (32 X i + (33 log Xi + Ei. n.Section 6. 1952. •.1) + 2<t'CJL. compute confidence intervals for (31. and p be as in Problem 6. E :::. (32. which is yield or production. P = {N({L. nonsingular.
2. iT 2 ) are conditional MLEs. given Z(n).2.24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then.426 Inference in the Multiparameter Case Chapter 6 I . that (d) (fj  (3)TZ'f.. show that ([EZZ T jl )(1.2.20).2. The MLE of based on (X. Show that if we identify X = (z(n). e Euclidean. n[z(~)~en)jl)' X. X .2.21).". In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H.1 show thatMLEs of (3. /1.1) EZ . (fj multivariate nannal distribution with mean 0 and variance. (P("H) : e E e.Zen) (31 2 e ~ 7. 6. .1. Fill in the details of the proof of Theorem 6.I3).(b) The MLE minimizes . Z i ~O.2. (a) In Example 6. then ({3.(3) = op(n. (e) Construct a method of moment estimate moments which are ~ consistent. Hint: (a). i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric.2.10 that the estimate Raphson estimates from On is efficient. T 2 ) based on the first two I (d) Deduce from Problem 6. 8.. . (6.2.2.  I .i > 1. T(X) = Zen). In Example 6.2. Establish (6.. hence. and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ).2. and .. b . 3. I .1/ 2 ).24).2 are as given in (6. (I) Combine (aHe) to establish (6.'. (c) Apply Slutsky's theorem to conclude that and. t) is then called a conditional MLE. Hint: !x(x) = !YIZ(Y)!z(z). In Example 6. .  (3)T)T has a ("._p distribution. Y). I I ..)Z(n)(fj . show that the assumptions of Theorem 6. ell of (} ~ = (Ji.2. In Example 6.2 hold if (i) and (ii) hold. . 5. show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic. H E H}.Pe"H)..2. '".. (e) Show that 0:2 is unconditionally independent of (ji. (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4. H abstract.IY . .1.fii(ji .
<).1/ 2 ).. and f(x) = e.l + aCi where fl. p is strictly convex. Him: You may use a uniform version of the inverse function theorem: If gn : Rd j.Oo) i=cl (1 . (al Let ii n be the first iterate of the NewtonRapbson algorithm for solving (6. . (a) Show that if solves (Y ao is assumed known a unique MLE for L. ] t=l 1 n . (logistic).1) starting at 0.•. t=l 1 tt Show that On satisfies (6. Hint: n . the <ball about 0 0 . = J.'2:7 1 'l'(Xi.7 Problems and Complements 427 9. R d are such that: (i) sup{jDgn(O) .0 0 1 < <} ~ 0. ..p(Xi'O~) + op(1) n i=l n ) (O~ . p i=l J.(XiI') _O..LD..l exists and uniquely ~ . Suppose AGA4 hold and 8~ is vn consistent. Examples are f Gaussian.3).2.Dg(O)j .0 0 ). hence.L'l'(Xi'O~) n. Y. Let Y1 •.L 'l'(Xi'O~) n i=l 1 1 =n n L 'l'(Xi.O) has auniqueOin S(Oo. 10 .2. (b) Show that under AGA4 there exists E > 0 such that with probability tending to 1. E R. (iii) Dg(0 0 ) is nonsingular. . (iv) Dg(O) is continuous at 0 0 .l real be independent identically distributed Y.' . a' . that is. a > 0 are unknown and ( has known density f > 0 such that if p(x) log f(x) then p" > 0 and.LD.1' _ On = O~  [ . ao (}2 = (b) Write 8 1 uniquely solves 1 8. Show that if BB a unique MLE for B1 exists and 10. (ii) gn(Oo) ~ g(Oo).p(Xi'O~) n.Section 6. 8~ = 1 80 + Op(n.x (1 + C X ) .
1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1. • (6. LJ j=2 Differentiate with respect to {31.ip range freely and Ei are i. 2 ° 2. •. 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}.2. . 'I) : 'I E 3 0 } and. .6).3. Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution. Problems for Section 6./3)' = L i=1 1 CjZiU) n p . there exists a j and their image contains a ball S(g(Oo).d.II(Zil I Zi2.(j) l. Zip)) + L j=2 p "YjZij + €i J .O).(Z" . N(O. 1 Zw Find the asymptotic likelihood ratio. that Theorem 6.. 0 < ZI < . P I .3). 'I) = p(. Zi.(Z"  ~(') z.12) and the assumptions of Theorem (6. .Zi )13. iteration of the NewtonRaphson algorithm starting at 0. q(.' " 13... hence.3. minimizing I:~ 1 (Ii 2 i.3 i 1. P(Ai).. . f. log Ai = ElI + (hZi. > 0 such that gil are 1 . CjIJtlZ. i2. . then it converges to that solution. . . i .. and Rao tests for testing H : Ih. {Vj} are orthonormal. .i.. Hint: Write n J • • • i I • DY. .428 then. Similarly compute the infonnation matrix when the model is written as Y. .. Thus. 1 Yn are independent Poisson variables with Yi .27). . I '" LJ i=l ~(1) Y. Inference in the Multiparameter Case Chapter 6 > O.2 holds for eo as given in (6. = 131 (Zil .. Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj.. ~. i=l • Z.26) and (6. and ~ .. t• where {31. Wald. •. converges to the unique root On described in (b) and that On satisfies 1 1 • j..2. Suppose that Wo is given by (6..l hold for p(x. = versus K : O oF 0. II. Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n ..O E e.2.3. Suppose responses YI .3. < Zn 1 for given covariate values ZI. Show that if 3 0 = {'I E 3 : '1j = 0. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X. Y. )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3. for "II sufficiently large. Establish (6.~_. a 2 ).12)..
. 1 < i < n.. respectively.X. aTB. (ii) E" (a) Let ). S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo).. Deduce that Theorem 6.3 hold.n:c"' 4c:2=9 3. be ii.) 1(Xi .IJ) where q(. Let e = {B o . N(Oz.. Xi.. = ~ 4.d. hence. Oil + Jria(00 . X n Li. > 0 or IJz > O.Xn ) be the likelihood ratio statistic. Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo. =  p(·. Consider testing H : 8 1 = 82 = 0 versus K : 0.o log (X 0) PI. There exists an open ball about 1J0.3. Consider testing H : (j = 00 versus K : B = B1 . . .7 Problems and C:com"p"":c'm"'.n 7J(Bo.'1 8 0 = and.2. (Adjoin to 9q+l.'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio. .) where k(Bo. .(X" . j = 1. Assume that Pel of Pea' and that for some b > 0. with density p(. OJ}.) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('. independent. (b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ . (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0). 1 5. 210g.. 0 ») is nK(Oo.. IJ.n) satisfy the conditions of Problem 6. \arO whereal. Suppose that Bo E (:)0 and the conditions of Theorem 6. .(I(X i ..3.. {IJ E S(lJo): ~j(lJ) = 0.. q + 1 <j< rl. IJ. N(IJ" 1).3 is valid. . q('. .gr. Suppose 8)" > 0.. . 1).3. Let (Xi.Section 6.d. B). pXd . 'I) S(lJo)} then.) O. q + 1 <j < r S(lJo) .l.2..(X1 . with Xi and Y... Oil and p(X"Oo) = E. Testing Simple versus Simple.aq are orthogonal to the linear span of ~(1J0) . Vi).'" .. < 00. Show that under H.'1) and Tin '1(lJ n) and. p . even if ~ b = 0.
which is a mixture of point mass at O. ii. . =0. li).0. j ~ ~ 6.2 for 210g. = v'1~c2' 0 <.0"10' O"~o. = a + BB where .. be i.. (h) : 0 < 0. < . Sucb restrictions are natural if.=0 where U ~ N(O. mixture of point mass at 0. (d) Relate the result of (b) to the result of Problem 4(a). Z 2 = poX1Y v' .\(Xi. for instance. Show that 2log.430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n.6.li : 1 <i< n) has a null distribution. (b) If 0. = (T20 = 1 andZ 1= X 1.i.2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6.)) ~ N(O. Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson. ' · H In!: Consl'defa1O . /:1.) have an N.0. > OJ. Yi : 1 and X~ with probabilities ~. Po) distribution and (Xi.. Po ..19) holds. O > 0.1. .0).0. and Dykstra. 1 < i < n. Show that (6.) if 0. • i . (0. and ~ where sin. Wright. (b) Suppose Xil Yi e 4 (e) Let (X" Y. ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0. 0. ±' < i < n) is distributed as a respectively. (iii) Reparametrize as in Theorem 6.d.3. > 0.) under the model and show that (a) If 0. > 0.6. we test the efficacy of a treatment on the basis of two correlated responses per individual. = 0. 1) with probability ~ and U with the same distribution. 7. Let Bi l 82 > 0 and H be as above. I . XI and X~ but with probabilities ~ . Yi : l<i<n). are as above with the same hypothesis but = {(£It. 0. Then xi t.\(X). In the model of Problem 5(a) compute the MLE (0.. under H. 2 log >'( Xi.\( Xi.3.3. 2 e( y'n(ii. < cIJ"O. Exhibit the null distribution of 2 log .  OJ.. 1988). Hint: ~ (i) Show that liOn) can be replaced by 1(0). Hint: By sufficiency reduce to n = 1.. (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular.  0.1.~.0.
2 to lin . 1.22) is a consistent estimate of ~l(lIo). Show that under A2.P(A)P(B) JP(A)(1 .. Exhibit the two solutions of (6.4. (a) Show that the correlation of Xl and YI is p = peA n B) . 9. hence.4. Under conditions AOA6 for (a) and AOA6 with A6 for (a) ~(1 ) i!~1) for (b) establish that [~D2ln(en)]1 is a consistent estimate of 1.4) explicitly and find the one that corresponds to the maximizer of the likelihood.3. is of the fonn (b) Deduce that X' ~ Z' where Z is given by (6.7 Problems and Complements ~(1 ) 431 8.8) for Z.P(B))· (b) Show that the sample correlation coefficient r studied in Example 5. A611 Problems for Section 6.3. 10.nf.3. Show that under AOA5 and A6 for 8 11 where ~(lIo) is given by (6.6 is related to Z of (6. A3.4 ~ 1(11) is continuous.2. In the 2 x 2 contingency table model let Xi = 1 or 0 according as the ith individual sampled is an A or A and Yi = 1 or 0 according as the ith individual sampled is a Born. Hint: Argue as in Problem 5.21).P(A))P(B)(1 .8) by Z ~ . (b) (6.3. 2.0 < PCB) < 1. (a) Show that for any 2 x 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero.4. (e) Conclude that if A and B are independent. 3. Hint: Write and apply Theorem 6. then Z has a limitingN(011) distribution. .1(0 0 ).Section 6.4.10. 0 < P( A) < 1.8) (e) Derive the alternative form (6.
2 is 1t(Ci.2. b I Ri ( = Ti.1. C i = Ci. nab) A ) = B. TJj2. Hint: (a) Consider the likelihood as a function of TJil.~~. 432 Inference in the Multiparameter Case Chapter 6 i 4. Let R i = Nil + N i2 • Ci = Nii + N 2i · Show that given R 1 = TI. .. n) can be then the test that rejects (conditionally on R I = TI' C 1 = GI) if N ll > j(a) is exact level o. j.b1only.. N 12 • N 21 ... N 22 ) rv M (u. (b) Sbow that 812 /(8l! ° + 812 ) ~ 821 /(8 21 + 822 ) iff R 1 and C 1 are independent.. This is known as Fisher's exact test. CI. .D.u. TJj2 ~ = Cj n where R. TJj2 are given by TJil ~ = ..~~.1 II .9) and has approximately a X1al)(bl) distribution under H.TI.54). ( rl. ... =T 1.. n R. j = 1. a . . 8(r2' 82 I/ (8" + 8.. Suppose in Problem 6. in principle. n a2 ) . Consider the hypothesis H : Oij = TJil TJj2 for all i.. n.. 6. 8l! / (8 1l + 812 )). I (c) Show that under independence the conditional distribution of N ii given R. 811 \ 12 .C.ra) nab .4.~l. B!c1b!. . (22 ) as in the contingency table. (b) Deduce that Pearson's X2 is given by (6.4.)). TJj2 = 2::: a x b contingency table with associated probabilities Bij and 1 Oij. Cj = Cj] : ( nll. It may be shown (see Volume IT) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5. C j = L' N'j. Ti) (the hypergeometric distribution). . . . (a) Show that the maximum likelihood estimates of TJil. . .1 .. S. 7. ( where ( ..6 that H is true. R 2 = T2 = n . nal ) n12. 1 : X 2 (b) How would you. . = Lj N'j. Fisher's Exact Test From the result of Problem 6. (a) Let (NIl. . are the multinomial coefficients.. ... N ll and N 21 are independent 8( r. i = 1..4 deduce that jf j(o:) (depending on chosen so that Tl. j = 1.~. use this result to construct a test of H similar to the test with probability of type I error independent of TJil' TJj2? 1 .4. i = 1. Let N ij be the entries of an let 1Jil = E~=l (}ij.. 8 21 . (a) Show that then P[N'j niji i = 1.
Show that.93'1 215 103 69 172 Deny 225 162 n=387 (d) Relate your results to the phenomenon discussed in (a). The following table gives the number of applicants to the graduate program of a small department of the University of California. Establish (6.Section 6. C2). Admit Men Women 1 235 1~35' 38 7 273 42 n = 315 Deny Admit 270 45 Men Women I 122 1'. if and only if... Zi not all equal. for suitable a.7 Problems and Complements 433 8. 9.4. but (iii) does not.5? Admit Deny Men Women 1 19 .. ~i = {3.12 I .BINDEPENDENTGIVENC) n B) = P(A)P(B) (A.) Show that (i) and (ii) imply (iii). classified by sex and admission status. consider the assertions.4. and that under H. Test in both cases whether the events [being a man] and [being admitted] are independent.(rl + cd or N 22 < ql + n . 10. (h) Construct an experiment and three events for which (i) and (ii) hold. n. (i) p(AnB (ii) I C) = PeA I C)P(B I C) (A. N 22 is conditionally distributed 1t(r2. 5 Hint: (b) It is easier to work with N 22 • Argue that the Fisher test is equivalent to rejecting H if N 22 > q2 + n . = 0 in the logistic model. and petfonn the same test on the resulting table.0.ziNi > k. C are three events. 2:f . Then combine the two tables into one. if A and C are independent or B and C are independent. Suppose that we know that {3. . B. Give pvalues for the three cases. Would you accept Or reject the hypothesis of independence at the 0. B INDEPENDENT) (iii) PeA (C is the complement of C. (e) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex.ZiNi > k] = a. (b).1""". there is a UMP level a test. + IhZi.(rl + cI). (a) If A.14).BINDEPENDENTGNENC) p(AnB I C) ~ peA I C)P(B I C) (A.05 level (a) using the X2 test with approximate critical value? (b) using Fisher's exact test of Problem 6. which rejects. where Pp~ [2:f . and that we wish to test H : Ih < f3E versus K : Ih > f3E. 11.
but under K may be either multinomial with 0 #. 434 • Inference in the Multiparameter Case Chapter 6 f • 12. . which is valid for the i.. a5 j J j 1 J I .5.. This is an approximation (for large k. Show that the likelihO<Xt ratio test of H : O} = 010 . 13. = rn I < /3g and show that it agrees with the test of (b) Suppose that 131 is unknown. In the binomial oneway layout show that the LR test is asymptotically equivalent to Pearson's  X2 test in the sense that 2log'\  X2 .)llmi~i(1 'lri). case.f.4. Show that if Wo C WI are nested logistic regression models of dimension q < r < k and mI. then 130 as defined by (6.. . .i. I). a under H. i Problems for Section 6. .3.3. Suppose that (Z" Yj). k and a 2 is unknown..OiO)2 > k 2 or < k}. .3) l:::~ I (Xi .) ~ B(m.Ld. in the logistic regression model. < f3g in this (c) By conditioning on L~ 1 Xi and using the approach of Problem 6.5 1.11 are obtained as realization of i. ..4.11.4. with (Xi I Z. ...20) for the regression described after (6. for example.5 construct an exact test (level independent of (31). Xi) are i. .z(kJl) ~I 1 ... mk ~ 00 and H : fJ E Wo is true then the law of the statistic of (6.d...: nOiO. X k be independent Xi '" N (Oi. .4) is as claimed formula (2. · .d. E {z(lJ.5. . a 2 = 0"5 is of the form: Reject if (1/0". 15. or Oi = Bio (known) i = 1. . n) and simplification of a model under which (N1. if the design matrix has rank p.4. .LJ2 Z i )). (a) Compute the Rao test for H : (32 Problem 6. but Var.5.(Ni ) < nOiO(1 ...4). Use this to imitate the argument of Theorem 6.. Let Xl. Nk) ~ M(n. asymptotically N(O..15) is consistent.18) tends 2 to Xrq' Hint: (Xi .'Ir(. Suppose the ::i in Problem 6.. . • I 1 • • (a)P[Z.8) and...". Given an initial value ()o define iterates  Om+l . 010. . . 1 i Show that for GLM this method coincides with the NewtonRaphson method of Section 2. " Ok = 8kO. Tn. 1 . 2. a 2 ) where either a 2 = (known) and 01 . Compute the Rao test statistic for H : (32 case. . i 16. Y n ) have density as in (6.4. (Zn. Fisher's Method ofScoring The following algorithm for solving likelihood equations was proosed by Fishersee Rao (1973).00 or have Eo(Nd : . Verify that (6. .1 14. Zi and so that (Zi..Oio)("Cooked data").i.4.OkO) under H.. . .. I. . Show that..2. I < i < k are independent. " lOk vary freely...l 1 . 3. ~ 8 m + [1(8 m )Dl(8 m ).
. (Zn. has the Poisson. j = 1.. distribution.T)exp { C(T) where T is known.5.. .(. g(p..5. Suppose Y.. Let YI. J JrJ 4. h(y.b(Oi)} p(y.. Give 0.F. and v(..Section 6.5). and b' and 9 are monotone.) = ~ Var(Y)/c(T) b"(O).= 1 . 1'0) = jy 1'01 2 /<7. Set ~ = g(.. P(I"). Y) and that given Z = z. . b(B). (e) Suppose that Y. Show that.Oi)~h(y.. . . d". Wi = w(".5.) . . Show that for the Gaussian linear model with known variance D(y.. as (Z. your result coincides with (6. ball about "k=l A" zlil in RP . T).) and C(T) such that the model for Yi can be written as O.). k.).. In the random design case..)z'j dC.. ..O(z)) where O(z) solves 1/(0) = gI(zT {3).y .9). Yn ) are i. and v(.) and v(. C(T). Show that when 9 is the canonical link. L.. . ~ .. (a) Show that the likelihood equations are ~ i=I L.i. Y follow the model p(y. b(9).. then the convex support of the conditional distribution of = 1 Aj Yj zU) given Z j = Z (j) . ~ N("" <7. give the asymptotic distribution of y'n({3 . T. Yn be independent responses and suppose the distribution of Yi depends on a covariate vector Zi. Give 0.. b(9). 00 d" ~ a(3j (b) Show that the Fisher information is Z]. T.. k.WZ v where Zv = Ilz'jll is the design matrix and W = diag( WI. . (d) Gaussian GLM.. . C(T). g("i) = zT {3. Show that the conditions AQA6 hold for P = P{3o E P (where qo is assumed known). . 9 = (b')l.p. Assume that there exist functions h(y. under appropriate conditions..ztkl} is RP (c) P[ZI ~ z(jl] > 0 for all j.9). . the resuit of (c) coincides with (6. Find the canonical link function and show that when g is the canonical link.1 = 0 V fJ~ .) ~ 1/11("i)( d~i/ d".d. . Hint: Show that if the convex support of the conditional distribution of YI given ZI = zU) contains an open interval about p'j for j = 1. (y. h(y. (c) Suppose (Z" Y I ). contains an open L. <.7 Problems and Complements 435 (b) The linear span of {ZII). . T).(3). Hint: By the chain rule a l(y a(3j' 0) = i3l dO d" a~ . Wn). 05.T). J . the deviance is 5.
(6.i. Suppose ADA6 are valid. hence. let 1r = s(z.2. Show that the standard Wald test forthe problem of Example 6.1 under this model and verifying the fonnula given.10).1) ~ .1. (a) Show that if f is symmetric about 1'. then 0'2(P) < 0'2. 0 < Vae f < 00. Suppose that the ttue P does not belong to"P but if f}(P) is defined by (6.4. and R n are the corresponding test statistics. 7. Consider the Rao test for H : f} = f}o for the model "P = {P/I : /I E e} and ADA6 hold. the unique median of p. I: .j 436 Inference in the Multiparameter Case Chapter 6 i Problems for Section 6. (3) where s(t) is the continuous distribution function of a random variable symmetric about 0.d.3. set) = 1. Consider the linear model of Example 6. I "j 5. Establish (6. Apply Theorem 6. then under H.3 and. n . . that is. . Wald. Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation. then the Rao test does not in general have the correct asymptotic level.6. then O' 2(p) > 0'2 = Varp(X. j.O' 2 (p») = 1/4f(v(p».6. 2. Show that 0: 2 given in (6. if P has a positive density v(P).2. = 1" (b) Show that if f is N(I'. Wn Wn + op(l) + op(I). then it is.Q+1>'" .(3p ~ (3o.10) creates a valid level u test. t E R.6.3. replacing &2 by 172 in (6.6. W n and R n are computed under the assumption of the Gaussian linear model with a 2 known. Suppose Xl. but if f~(x) = ~ exp Ix 1'1.14) is a consistent estimate of2 VarpX(l) in Example 6. then the sample median X satisfies ~ f at I I Vri(X where O' (p) 2 yep») ~ N(0.6.Xn are i. but that if the estimate ~ L:~ dDIllDljT (Xi.6.p under the sole assumption that E€ = 0. I 4. In the hinary data regression model of Section 6.).. P.3 is as given in (6. . . . Show that the LR. the infonnation bound and asymptotic variance of Vri(X 1'). Note: 2 log An. O' 2(p)/0'2 = 2/1r.6. then v( P) .7. ! I I ! I . 0'2).1. • I 3. " . /10) is estimated by 1(80 ).15) by verifying the condition of Theorem 6.3) then f}(P) = f}o.6. . Show that. W n . if VarpDl(X..s(t). . f}o) is used.6 1.2 and the hypothesis (3q+l = (3o. in fact. By Problem 5.. l 1 . . i 6. I . and Rao tests are still asymptotically equivalent in the sense that if 2 log An.
A natural goal to entertain is to obtain new values Yi". /3P+I = '" = /3d ~ O. Let f3(p) be the LSE under this assumption and Y.1) Yi = J1i + ti.... hence. .{3lp) p and {3(P) ~ (/31. 1 n. i . . ~ ~ (d) Show that (a).. . where Xln) ~ {(Y..I EL(Y. (Model Selection) Consider the classical Gaussian linear model (6.p don't matter... Show that if the correct model has Jri given by s as above and {3 = {3o. for instance). are indepeqdent of}] 1 ' · ' 1 Yn and ~* is distributed as }'i. .9..lp)2 Here Yi" l ' •• 1 Y.lpI)2 be the residual sum of squares. 1 n. then 13L is not a consistent estimate of f3 0 unless s(t) is the logistic distribution. _. . Suppose that . then ~ Vri(rJL QI ({3) Var(ZI (Y1   {3 Ll has a limiting normal distribution with mean 0 and variance A(Zr {3)) )[QI ({3)J where Q({3) = E(Zr A(Zr{30)ZI) is p x p and necessarily nonsingular. where ti are i. = (3p = 0 by the (average) expected prediction error ~ ~ n EPE(p) = n.d.2 is known.9. .. y~p) and. . Gaussian with mean zero J1i = Z T {3 and variance 0'2. 1973. 1 V.2..1.1. But if 13L is defined as the solution of EZ1s(Zr {30) = Q(/3) where Q({3) = E(Zr A(Zr/3) is p x 1. .p.' i=l .2 (b) Show that (1 + ~D + . i = 1.2 is an unbiased estimate of EPE(P).7 Problems and Complements 437 (a) Show that ~ Jr can be written in this form for both the probit and logit models.. Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d .i . Zi... (b) Suppose that Zi are realizations of U.Section 6. .d. the model with 13d+l = .Zn and evaluate the performance of Yjep) . (a) Show that EPE(p) ~ . L:~ 1 (P.. be the MLE for the logit model. .(b) continue to hold if we assume the GaussMarkov model.. 0.. Hint: Apply Theorem 6. Zi are ddimensional vectors for covariate (factor) values. 8. Let RSS(p) = 2JY.. O)T and deduce that (c) EPE(p) = RSS(p) + ~. i = 1. at Zll'" . Model selection consists in selecting p to minimize EPE(p) and then using Y(P) as a predictor (Mallows.. f3p. Zi) : 1 <i< n}. that Zj is bounded with probability 1 and let ih(X ln )).jP)2 where ILl ) = z.(p) the corresponding fitted value.
Note for Section 6. i=1 Derive the result for the canonical model. .5. Y. and (ii) 'T}i = /31 Zil + f32zi2. . 1969.. + Evaluate EPE for (i) ~i ~ "IZ.~I'.L . (b) The result depends only on the mean and covariance structure of the i=l"". .8 NOTES Note for Section 6. 3rd 00. AND F. this makes no sense for the model we discussed in this section. 1 n n L (I'. Y/" j1~. The moral of the story is that the practicing statisticians should be on their guard! For more on this theme see Section 6. ti.6. DIXON. which are not multinomial. if we consider alternatives to H. MASSEY. Give values of /31.i2} such that the EPE in case (i) is smaller than in case (d) and vice versa. W.n. LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of X2 as well as large ones (Problem 6. R.438 (e) Suppose p ~ Inference in the Multiparameter Case Chapter 6 2 and 11(Z) ~ Ii.2 (1) See Problem 3. Introduction to Statistical Analysis. 6. . 1 I . 1970. . .1 (1) From the L A. ! . 1 "'( i=1 . Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good.) . Use 0'2 = 1 and n = 10.".9 for a discussion of densities with heavy tails..' . Note for Section 6.4. but it is reasonable. A. For instance. To guard against such situations he argued that the test should be used in a twotailed fashion and that we should reject H both for large and for small values of X2 • Of course. Heart Study after Dixon and Massey (1969).16). EPE(p) n R SS( p) = .4 (1) R. Hint: (a) Note that ". t12 and {Z/I.. i 6. D. ~(p) n . The Analysis of Binary Data London: Methuen.z.  J .9 REFERENCES Cox. )I'i). we might envision the possibility that an overzealous assistant of Mendel "cooked" the data. New York: McGrawHill.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.

GRAYBILL, F., An Introduction to Linear Statistical Models, Vol. I. New York: McGraw-Hill, 1961.

HABERMAN, S., The Analysis of Frequency Data. Chicago: University of Chicago Press, 1974.

HALD, A., Statistical Theory with Engineering Applications. New York: Wiley, 1952.

HUBER, P. J., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob., I, Berkeley: Univ. of California Press, 221-233 (1967).

KASS, R., J. KADANE, AND L. TIERNEY, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).

KOENKER, R., AND V. D'OREY, "Computing regression quantiles," J. Roy. Statist. Soc., Ser. C, 36, 383-393 (1987).

LAPLACE, P.-S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (Reprinted in Oeuvres Complètes, Gauthier-Villars, Paris) (1789).

MALLOWS, C., "Some comments on Cp," Technometrics, 15, 661-675 (1973).

McCULLAGH, P., AND J. NELDER, Generalized Linear Models, 2nd ed. London: Chapman and Hall, 1989.

PORTNOY, S., AND R. KOENKER, "The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROBERTSON, T., F. WRIGHT, AND R. DYKSTRA, Order Restricted Statistical Inference. New York: Wiley, 1988.

SCHEFFÉ, H., The Analysis of Variance. New York: Wiley, 1959.

SCHERVISH, M., Theory of Statistics. New York: Springer, 1995.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Harvard University Press, 1986.

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need.

The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics; therefore, we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply.

The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that, to be called an experiment, such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize.
This long-term relative frequency n_A/n, where n_A is the number of times the possible outcome A occurs in n repetitions, is to many statisticians, including the authors, the operational interpretation of the mathematical concept of probability. The relation between the experiment and the model is given by the correspondence "A occurs if and only if the actual outcome of the experiment is a member of A." Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, whether it is conceptually repeatable or not. In this sense, almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment." For a discussion of this approach and further references the reader may wish to consult Savage (1954), Savage (1962), Raiffa and Schlaiffer (1961), Lindley (1965), de Groot (1970), and Berger (1985).

We now turn to the mathematical abstraction of a random experiment, the probability model. A random experiment is described mathematically in terms of the following quantities.

A.1.1 The sample space is the set of all possible outcomes of a random experiment. We denote it by Ω.

A.1.2 A sample point is any member of Ω and is typically denoted by ω.

A.1.3 Subsets of Ω are called events. We denote events by A, B, C, and so on or by a description of their members. If ω ∈ Ω, {ω} is called an elementary event. If A contains more than one point, it is called a composite event. The null set or impossible event is denoted by ∅; the complement of an event A is denoted by A^c. We shall use the symbols ∪, ∩, ⊂, A^c, − for union, intersection, inclusion, complementation, and set theoretic difference as is usual in elementary set theory. In this section and throughout the book, we presume the reader to be familiar with elementary set theory and its notation at the level of Chapter 1 of Feller (1968) or Chapter 1 of Parzen (1960). The set operations we have mentioned have interpretations also. For example, the relation A ⊂ B between sets considered as events means that the occurrence of A implies the occurrence of B.

A.1.4 We will let A denote a class of subsets of Ω to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability P to every subset of Ω. A is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complementation (cf. Chung, 1974; Loeve, 1977; Grimmett and Stirzaker, 1992).

A probability distribution or measure is a nonnegative function P on A having the following properties:

(i) P(Ω) = 1.

(ii) If A_1, A_2, ... are pairwise disjoint sets in A, then

P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

Recall that ⋃_{i=1}^∞ A_i is just the collection of points that are in any one of the sets A_i and that two sets are disjoint if they have no points in common.
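As a small illustration (not part of the original text; the sample space and weights below are hypothetical), the following sketch encodes a discrete model (Ω, A, P) for a loaded die, taking A to be all subsets of Ω, and checks properties (i) and (ii) numerically.

```python
# A small discrete probability model: Omega is the faces of a loaded die,
# A is taken to be all subsets of Omega, and P is built from point masses.
omega = [1, 2, 3, 4, 5, 6]
point_mass = {1: 0.1, 2: 0.1, 3: 0.15, 4: 0.15, 5: 0.2, 6: 0.3}

def P(event):
    """Probability of an event (any subset of Omega)."""
    return sum(point_mass[w] for w in event)

# (i) P(Omega) = 1
assert abs(P(omega) - 1.0) < 1e-12

# (ii) additivity over pairwise disjoint events
A = {1, 3}
B = {2, 6}
assert not (A & B)                                  # disjoint
assert abs(P(A | B) - (P(A) + P(B))) < 1e-12

print("P(even face) =", P({2, 4, 6}))
```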
For convenience, when we refer to events we shall automatically exclude those that are not members of A.

A.1.5 The three objects Ω, A, and P together describe a random experiment mathematically. We shall refer to the triple (Ω, A, P) either as a probability model or identify the model with what it represents as a (random) experiment.

References
Gnedenko (1967)
Grimmett and Stirzaker (1992)
Hoel, Port, and Stone (1971)
Parzen (1960)
Pitman (1993)

A.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS

The following are consequences of the definition of P.

A.2.1 If A ⊂ B, then P(B) ≥ P(A).

A.2.2 P(A^c) = 1 − P(A); in particular, P(∅) = 0.

A.2.3 If A ⊂ B, then P(B − A) = P(B) − P(A).

A.2.4 0 ≤ P(A) ≤ 1.

A.2.5 P(⋃_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ P(A_n), and P(⋂_{i=1}^n A_i) ≥ 1 − Σ_{i=1}^n P(A_i^c) (Bonferroni's inequality).

A.2.6 If A_1 ⊂ A_2 ⊂ ··· ⊂ A_n ⊂ ···, then

A.2.7 P(⋃_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).

References
Gnedenko (1967)
Grimmett and Stirzaker (1992)
Hoel, Port, and Stone (1971)
Parzen (1960)
Pitman (1993)

A.3 DISCRETE PROBABILITY MODELS

A.3.1 A probability model is called discrete if Ω is finite or countably infinite and every subset of Ω is assigned a probability. That is, we can write Ω = {ω_1, ω_2, ...} and take A to be the collection of all subsets of Ω. In this case, we have for any event A,

P(A) = Σ_{ω_i ∈ A} P({ω_i}).   (A.3.2)
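A countably infinite discrete model works the same way. The sketch below is illustrative only: it uses the frequency function p(i) = 2^{-i} on Ω = {1, 2, ...}, verifies that the point masses sum to 1, evaluates P(A) via (A.3.2), and checks Bonferroni's inequality (A.2.5) on two events.

```python
# Discrete model on a countably infinite sample space Omega = {1, 2, 3, ...}
# with frequency function p(i) = 2**(-i); the truncation error is negligible.
N = 200                                   # truncation point for numerical sums
p = {i: 2.0 ** (-i) for i in range(1, N + 1)}

total = sum(p.values())
print("sum of p(i):", total)              # very close to 1

def P(event):
    """P(A) = sum of p(i) over i in A, per (A.3.2)."""
    return sum(p.get(i, 0.0) for i in event)

A = set(range(1, 11, 2))                  # odd outcomes up to 9
B = set(range(1, 6))                      # outcomes 1..5
print("P(A) =", P(A))

# Bonferroni: P(A intersect B) >= 1 - P(A^c) - P(B^c)
lhs = P(A & B)
rhs = 1 - (1 - P(A)) - (1 - P(B))
print("Bonferroni holds:", lhs >= rhs - 1e-12)
```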
(A. From a heuristic point of view P(A I B) is the chance we would assign to the event A if we were told that B has occurred. machines. shaking well.3) yield U.3. the function P(.4.• 1 B n are (pairwise) disjoint events of positive probability whose union is fl. 1 1 PiA n B) = P(B)P(A I B).3. say N..4. we define the conditional probability of A given B.. Sections 67 Pitman (1993) Section l. (A.3) If B l l B 2 .. guinea pigs. Then P( {w}) = 1/ N for every wEn. Such selection can be carried out if N is small by putting the "names" of the Wi in a hopper.444 A Review of Basic Probability Theory Appendix A An important special case arises when n has a finite number of elements.I B) is a probability measure on (fl. by l P(A I B) ~ PtA n B) P(B) . and P( A ) ~ j . I B) = ~ PiA. I B). flowers.:c Number of elements in A N .(A n B j ). then P( A I B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur. Given an event B such that P( B) > 0 and any other event A.3) 1 I A. A) which is referred to as the conditional probability measure given B.1 • 1 . (A. (A. all of which are equally likely. . Sections 45 Parzen (1960) Chapter I.1. Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another. . is an experiment leading to the model of (A.• P (Q A.4)(ii) and (A.2) In fact. (A. WN are the members of some population (humans. For large N. the identity A = .4.. A.~ 1 1 ~• 1 . j=l (A. . are (pairwise) disjoint events and P(B) > 0. which we write PtA I B). ~f 1 .3). n PiA) = LP(A I Bj)P(Bj ).4. a random number table or computer can be used. If A" A 2 ..' .=.4) . i References Gnedenko (1967) Chapter I.).4 Suppose that WI. for fixed B as before. etc.:c:.4. . Transposition of the denominator in (AA.. and drawing.3.1) If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment.=:.l n' " I"I ". i • i: " i:. then .4 CONDITIONAL PROBABILITY AND INDEPENDENCE I j 1 1 ! . . selecting at random. '< = ~=~:..l) gives the multiplication rule..:. .
. .7) whenever P(B I n . Port..4. Section 4..4) and obtain Bayes rule . .i.En such that P(B I n . ... B I . B ll is written P(A B I •..Section AA Conditional Probability and Independence 445 If P( A) is positive.8) > 0.• .." . .A. and (A.) = II P(A.} such thatj ct {i" . and Stone (1971) Sections lA.) j=l k (AA.) ~ P(A J ) (AA..8) may be written P(A I B) ~ P(A) (AA. Ifall theP(A i ) are positive.4.. Chapter 3. . Sections IA Pittnan (1993) Section lA . .) A) ~ ""_ PIA I B )p(BT L.9) In other words.lO) is equivalent to requiring that P(A J I A..I_1 . n Bnd > O..i k } of the integers {l. P(B" I Bl.3). .}...I J (AA. A and B are independent if knowledge of B does not affect the probability of A.)P(B.···..S parzen (1960) Chapter 2. .II) for any j and {i" . I... B n ) and for any events A. (AA. relation (A. Sections 9 Grimmett and Stirzaker (1992) Section IA Hoel....4.. n B n ) > O. I PeA I B. References Gnedenko (1967) Chapter I. Two events A and B are said to be independent if P(A n B) If P( B) ~ P(A)P(B).. .. B 2 ) .S) The conditional probability of A given B I defined by . the relation (AA.i. . ..lO) for any subset {iI... P(il.. Simple algebra leads to the multiplication rule. .. BnJl (AA. .. we can combine (A. The events All .n}. P(B 1 n·· n B n ) ~ P(B 1 )P(B2 I BJlP(B3 I ill. (AA.. n··· nA i . An are said to be independent if P(A i .1 ).
(A53) It may be shown (Billingsley. the sigma field corresponding to £i."" .. independent. .5 COMPOUND EXPERIMENTS There is an intuitive notion of independent experiments. These will be discussed in this section. .. that is..' .5. . We shall speak of independent experiments £1. . On the other hand.. This makes sense in the compound experiment. . W2 is the outcome of £2 and E Oi corresponds to the Occurrence of the comso on.0 if and only if WI is the outcome of £1. the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. W n ) : W~ E Ai. 1977) that if P is defined by (A. If P is the probability measure defined on the sigma field A of the compound experiment. I n  I . x " .. \ I I . then Ai corresponds to . For example.. the Cartesian product Al x .1 ... £n with respective sample spaces OJ.. A. There are certain natural ways of defining sigma fields and probabilities for these experiments. if Ai E Ai. on (fl2. X A 2 X '" x fl n ).0 1 X '" x . the subsets A ofn to which we can assign probability(1).. The interpretation of the sample space 0 is that (WI. x fl 2 x '" x fln)P(fl. then the sample space . .6 where examples of compound experiments are given.. then (A52) defines P for A) x ". then intuitively we should have alI classes of events AI. Pn({W n }) foral! Wi E fli.wn )}) I: " = PI ({Wi}) . . . ) = P(A. .. X . . To be able to talk about independence and dependence of experiments. . then the probability of a given chip in the second draw will depend on the outcome of the first dmw. and if we do not replace the first chip drawn before the second draw. I <i< n. 1974. .... . Pn(A n ).An with Ai E Ai...5. . Loeve. r. . I ! i w? 1 . . . it can be uniquely extended to the sigma field A specified in note (l) at the end of this appendix. An is by definition {(WI. 1995..'" £n and recording all n outcomes. it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip. InformaUy. we should have .. x On in the compound experiment. (A5A) i. . 1 < i < n}. x An of AI. . In the discrete case (A. More generally.. (A.0) given by . 1 £n if the n stage compound experiment has its probability structure specified by (A5. I P n on (fln> An). x An) ~ P(A. x An. wn ) is a sample point in . The (n stage) compound experiment consists in performing component experiments £1. 446 A Review of Basic Probability Theory Appendix A A. .2) 1 i " 1 If we are given probabilities Pi on (fl" AI).) .on. If we want to make the £i independent. A2). (A5. if we toss a coin twice. a compound experiment is one made up of two or more component experiments.3). x flnl n Ifl) x A 2 x ' .w n ) E o : Wi = wf}. 1 1 . P.0 1 x··· XOi_I x {wn XOi+1 x··· x.53) holds provided that P({(w"". The reader not interested in the formalities may skip to Section A... To say that £t has had outcome pound event (in .' . X fl n 1 X An). . P(fl... x'" . P([A 1 x fl2 X . . ..1 Recall that if AI." X An) = P.o i ..o n = {(WI. x flnl n ' . An are events. x .3) for events Al x .1 X Ai x 0i+ I x . we introduce the notion of a compound experiment.. x An by 1 j P(A. .. Chung.0 ofthe n stage compound experiment is by definition 0 1 x·· . If we are given n experiments (probability models) t\.on.
.w n )}) = P(£l has outcome wd P(£2 hasoutcomew21 £1 has outcome WI)'" P(£T' has outcome W I £1 has outcome WI..·. and Stone (1971) Section 1. Sampling With and Without Replacement 447 Specifying P when the £1 are dependent is more complicated. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). . If the experiment is perfonned n times independently. ..6 Hoel.S) ..6. If o is the sample space of the compound experiment. In the discrete case we know P once we have specified P( {(wI . the following.6.: n ) with Wi E 0 1..5. . Port.Pq' If fl is the sample space of this experiment and W E fl..: n )}) for each (WI. any point wEn is an ndimensional vector of S's and F's and. n The probability structure is determined by these conditional probabilities and conversely.6. If we repeat such an experiment n times independently.2) where k(w) is the number of S's appearing in w.3) where n ) ( k = n! kl(n . A. .'" . then (A.. i = 1" . the compound experiment is called n multinomial trials with probabilities PI... ill the discrete case... we shall refer to such an experiment as a Bernoulli trial with probability of success p. . J.£"_1 has outcome wnd.6. . 1. . if an experiment has q possible outcomes WI. (A. .Section A 6 Bernoulli and Multinomial Trials. A.SP({(w] .7) we have. /I. we say we have performed n Bernoulli trials with success probability p. References Grimmett and Stirzaker ( (992) Sections (. we refer to such an experiment as a multinomial trial with probabilities PI. Ifweassign P( {5}) = p.5 Parzen (1960) Chapter 3 A.4 More generally.. By the multiplication rule (A.3) is known as the binomial probability.. SAMPLING WITH AND WITHOUT REPLACEMENT A. ..u.6. which we shall denote by 5 (success) and F (failure).4. If Ak is the event [exactly k S's occur].5.k)!' The fonnula (A. .1 Suppose that we have an experiment with only two possible outcomes.wq and P( {Wi}) = Pi.. then (A. Other examples will appear naturally in what follows.· IPq.6 BERNOULLI AND MULTINOMIAL TRIALS.6.
. i .7 If we perform an experiment given by (. .. 10) is known as the hypergeometric probability.1 . . = N! (N .k.N (1 . .6. Sections 14 Pitman (1993) Section 2. . with replacement is added to distinguish this situation from that described in (A. i .n)!' If the case drawn is replaced before the next drawing. PtA ) k =( n ) (Np). . then ~ .6. . · · • A. kq!Pt' .. . A..) of the compound experiment. Port.S) as follows.P))nk k (N)n = (A6.6. and the component experiments are independent and P( {a}) = liNn. 1 . exactly k 2 wz's are observed. for any outcome a = (Wil"" 1Wi. we are sampling with replacement. (N)n .7 PROBABILITIES ON EUCLIDEAN SPACE Random experiments whose outcomes are real numbers playa central role in theory and practice. .S If we have a finite population of cases = {WI"" WN} and we select cases Wi successively at random n times without replacement. ·Pq t (A.4 Parzen (1960) Chapter 3. then P( k" . I' .448 A Review of Basic Probability Theory Appendix A where k~(w) = number of times Wi appears in the sequence w. we shall sometimes refer to the outcome of the compound experiment as a sample of size n from the population given by (n. and Stone (1971) Section 2.. . .1O) J for max(O.1 . P) independently n times.k q is the event (exactly k l WI 's are observed. .0. References Gnedeuko (1967) Chapter 2.1 j .(N(I. .. and P(Ak) = 0 otherwise. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space. Np). A. If Np of the members of n have a "special" characteristic S and N (1 ~ p) have the opposite characteristic F and A k = (exactly k "special" individuals are obtained in the sample). exactly kqwq's are observed).9) . __________________J i . I A.A l P). n ..) A = n! k k k !. When n is finite the tenn. The fonnula (A.. . Sectiou 11 Hoel.6) where the k i are natural numbers adding up to n.6..6.. the component experiments are not independent and. If AkJ.6. .p)) < k < min( n. n P({a}) ~ (N)n where 1 I (A.
We shall use the notation R^k for k-dimensional Euclidean space and denote members of R^k by symbols such as x or (x_1, ..., x_k)^T, where ( )^T denotes transpose. We will write R for R^1 and B for B^1.

A.7.1 If (a_1, b_1), ..., (a_k, b_k) are k open intervals, we shall call the set (a_1, b_1) × ··· × (a_k, b_k) = {(x_1, ..., x_k) : a_i < x_i < b_i, 1 ≤ i ≤ k} an open k-rectangle.

A.7.2 The Borel field in R^k, which we denote by B^k, is defined to be the smallest sigma field having all open k-rectangles as members. Any subset of R^k we might conceivably be interested in turns out to be a member of B^k.

A.7.3 A discrete (probability) distribution on R^k is a probability measure P such that Σ_{i=1}^∞ P({x_i}) = 1 for some sequence of points {x_i} in R^k. This definition is consistent with (A.3.1) because the study of this model and that of the model that has Ω = {x_1, ..., x_n, ...} are equivalent. The frequency function p of a discrete distribution is defined on R^k by

p(x) = P({x}).   (A.7.4)

Conversely, any nonnegative function p on R^k vanishing except on a sequence {x_1, ..., x_n, ...} of vectors and that satisfies Σ_{i=1}^∞ p(x_i) = 1 defines a unique discrete probability distribution by the relation

P(A) = Σ_{x_i ∈ A} p(x_i).   (A.7.5)

A.7.6 A nonnegative function p on R^k, which is integrable and which has ∫_{R^k} p(x) dx = 1, is called a density function.

A.7.7 A continuous probability distribution on R^k is a probability P that is defined by the relation

P(A) = ∫_A p(x) dx   (A.7.8)

for some density function p and all events A. Probability distributions defined by (A.7.8) are usually called absolutely continuous; we will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely. Integrals should be interpreted in the sense of Lebesgue. However, for practical purposes, Riemann integrals are adequate. Recall that the integral on the right of (A.7.8) is by definition ∫_{R^k} 1_A(x) p(x) dx, where 1_A(x) = 1 if x ∈ A, and 0 otherwise. It may be shown that a function P so defined satisfies (A.1.4). Geometrically, P(A) is the volume of the "cylinder" with base A and height p(x) at x.

It turns out that a continuous probability distribution determines the density that generates it "uniquely."(1) Although in a continuous model P({x}) = 0 for every x, the density function has an operational interpretation close to that of the frequency function. For instance, if p is a continuous density on R, x_0 and x_1 are in R, and h is close to 0, then by the mean value theorem

P([x_0 − h, x_0 + h]) ≈ 2h p(x_0)   (A.7.9)

and

P([x_0 − h, x_0 + h]) / P([x_1 − h, x_1 + h]) ≈ p(x_0)/p(x_1).   (A.7.10)

The ratio p(x_0)/p(x_1) can, thus, be thought of as measuring approximately how much more or less likely we are to obtain an outcome in a neighborhood of x_0 than one in a neighborhood of x_1.

A.7.11 The distribution function (d.f.) F is defined by

F(x_1, ..., x_k) = P((−∞, x_1] × ··· × (−∞, x_k]).   (A.7.12)

The d.f. defines P in the sense that if P and Q are two probabilities with the same d.f., then P = Q. When k = 1, F is a function of a real variable characterized by the following properties:

x ≤ y ⟹ F(x) ≤ F(y)  (Monotone)   (A.7.13)

x_n ↓ x ⟹ F(x_n) → F(x)  (Continuous from the right)   (A.7.14)

lim_{x→∞} F(x) = 1, lim_{x→−∞} F(x) = 0.   (A.7.15)

It may be shown that any function F satisfying (A.7.13)–(A.7.15) defines a unique P on the real line. We always have

F(x) − F(x−) = P({x}).(2)   (A.7.16)

Thus, F is continuous at x if and only if P({x}) = 0.   (A.7.17)

References
Gnedenko (1967)
Hoel, Port, and Stone (1971)
Parzen (1960)
Pitman (1993)
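The local interpretation of a density in (A.7.10) and the d.f. properties above can be checked numerically. Below is a minimal sketch, illustrative only, that uses the standard normal density and d.f. (computed from the standard library's error function) and compares small-interval probabilities with the density approximations.

```python
from math import exp, pi, sqrt, erf

def p(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def F(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# P([x0 - h, x0 + h]) is approximately 2*h*p(x0) for small h
x0, x1, h = 0.5, 2.0, 1e-3
interval_prob = F(x0 + h) - F(x0 - h)
print("exact:", interval_prob, " approx 2*h*p(x0):", 2 * h * p(x0))

# The ratio of two such small-interval probabilities is close to p(x0)/p(x1)
ratio = (F(x0 + h) - F(x0 - h)) / (F(x1 + h) - F(x1 - h))
print("interval ratio:", ratio, " density ratio:", p(x0) / p(x1))

# F is monotone with the correct tail limits
assert F(-10) < 1e-12 and F(10) > 1 - 1e-12
```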
7.2 A random vector X = (Xl •.'s. and so on. the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred.Xk)T is ktuple of random variables. m > 1. k. Here is the formal definition of such transformations. Px ) given by n Px(B) = PIX E BI· (A. dj. Forexample.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete.Section A. by definition. 13k .5) p(x)dx .8 Random VariOlbles and Vectors: Transformations 451 A. and so on to indicate which vector or variable they correspond to unless the reference is clear from the context in which case they will be omitted.8. Thus.a RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS Although sample spaces can be very diverse. The probability of any event that is expressible purely in tenns of X can be calculated if we know only the probability distribution of X. or equivalently a function from to Rk such that the set {w . referring to those features of its probability distribution.. such that(2) gl(B) = {y E: Rk : g(y) E: .S.3) A. The study of real. the probability measure Px in the model (R k . the yield per acre of a field of wheat in a given year. The event Xl( B) will usually be written [X E B] and P([X E BJ) will be written PIX E: H]. In the discrete case this means we need only know the frequency function and in the continuous case the density.1 A random variable X is a function from Oto Rsuch that the set {w: X(w) E B} X. these quantities will correspond to random variables and vectors. the concentration of a certain pollutant in the atmosphere.5) and (A. A.8) PIX E: A] LP(X).or vectorvalued functions of a random vectOr X is central in the theory of probability and of statistics. . ifXisdiscrete xEA L (A.8. X(w) E: B} ~ X.1 (H) is in A for every B E BkJI) For k = 1 random vectors are just random variables.8. the time to breakdown and length of repair time for a randomly chosen machine. dJ. in fact. Letg be any function from Rk to Rm. The probability distribution of a random vector X is. we will refer to the frequency Junction. density. if X is continuous.1 (B) is in 0 fnr every BE B.(1) = A. we measure the weight of pigs drawn at random from a population.7. In the probability model. When we are interested in particular random variables or vectors. The subscript X or X will be used for densities. from (A. we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined. .8. Similarly. and so on of a random vectOr when we are.
(A. Pg(X) (I) = j. Then g(X) is continuous with density given by .X)2. it may be shown (as a consequence of (A.y). known as the marginal frequency function. If g(X) ~ aX + 1'.y)(x. Furthennore. Yf is a discrete random vector with frequency function p(X. • II ! I • . y)T is continuous with density p(X. . and X is continuous. The probability distribution of g(X) is completely detennined by that of X through L:: P[g(X) E BI = PIX E gI(B)]. j .8. a random vector obtained by putting two random vectors together.8) Suppose that X is continuous with density PX and 9 is realvalued and onetoone(3) on an open set S such that P[X E 5] = 1. then 1 (I .1') .y)(x.9) t E g(S).11) and (A. (A. Then the random tran~form(lti(m g( X) is defined by g(X)(w) = g(X(w)).8. Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa.1O) From (A. is given by(4) i I ( PX(X) = LP(X. This is called the change of variable formula.6) An example of a transformation often used in statistics is g (91.12) by summing or integrating out over yin P(X. max{X. ! for Pg(x)(I) ~ PX(gl(t)) Ig'(g 1(1))1 (A. a # 0.8. .S. The (marginal) frequency or density of X is found as in (A.8)) that X is a marginal density function given by px(x) ~ 1: P(X. y). V). and 0 otherwise.1 1 Xi = X and 92(X) = k. .8.8. Another common example is g(X) = (min{X. then the frequency function of X.8.92/ with 91 (X) = k.Y).452 A Review of BClsic ProbClbility Theory Appendix A [J} E BI.7.(5) (A. if (X. .S.Y). i . assume that the derivative l of 9 exists and does not vanish on S.7) If X is discrete with frequency function Px.y)dy. (A. .8) it follows that if (X. . .).1 E~' l(X i .12) 1 .7) and (A. y (A.)'.: for every B E Bill.11) Similarly.Y) (x. then g(X) is discrete and has frequency function Pg(X)(t) = L {x:g(x)=t} Px(x).jPX a (A.8.8.8. These notions generalize to the case Z = (X.
.S.7 The preceding equivalences are valid for random vectors XlJ .l5 and B.7). 1 X n are independent if. if (Xl.[Xn E An] are independent.3 By (A. .. Port. ' .... A. 5.6. References Gnedenko (1967) Chapter 4..9.1 Two random variables X I and X 2 are said to be independent if and only if for sets A and B in E. .6IftheXi are all continuous and independent..Section A.9.) and Y" and so on.Xn are said to be (mutually) independent if and only if for any sets AI.1. X 2 ) and (Yt> Y2 ) are independent.4) (A.9. all random variables are discrete because there is no instrument that can measure with perfect accuracy. Sections 2124 Grimmett and Stirzaker (1992) Section 4. so are Xl + X2 and YI Y2 . Theorem A. Nevertheless.4 Parzen (1960) Chapter 7. the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A. which may be easier to deal with... it is common in statistics to work with continuous distributions. .4 A. " X n ) is either a discrete or continuous random vector. either ofthe (A. and only if. the events [X I E All.. A. Suppose X (Xl. .9 Independence of Random Variables and Vectors 453 In practice. . 9 Pitman (1993) Section 4.. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical modeL Or else.9. .2 The random variables X I.. .13 A convention: We shan write X = Y if the probability of the event IX fe YI is O. .9.3. whatever be g and h.7.9. then X = (Xl.9 INDEPENDENCE OF RANDOM VARIABLES AND VECTORS A.Xn (not necessarily of the same dimensionality) we need only use the events [Xi E Ail where Ai is a set in the range of Xi' A. . the events [XI E AI and IX. A.S. if X and Yare independent. The justification for this may be theoretical or pragmatic. Sections 15.5) = following two conditions hold: A. S. (X"XIX. Then the random variables Xl.9. E BI are independent.1..2.An in S. so are g(X) and h(Y). " X n ) is continuous. 6.9. . . X n with X (XI>"" Xn)' . and Stone (1971) Sections 3. To generalize these definitions to random vectors Xl.7 Hoel.. For example.
3. decompose {Xl l X2.9) If we perfonn n Bernoulli trials with probability of success p and we let Xi be the indicator of the event (success on the ith trial). .S If Xl . Then a reasonable measure of the center of the distribution of X is the average height of an individual in the given population..4) from a population and measuring k characteristics on each member. A. . I. The same quantity arises (approximately) if we use the longrun frequency interpretation of probability and calculate the average height of the individuals in a large sample from the population in question. In statistics. . "" 1 X q are the only heights present in the population.. X l l is called a random sample (l size n from a population with dJ.9. we define the random variable 1 A. we define the expectation or mean of X. .4 Par""n (1960) Chapter 7. X2.) A.. F x or density (frequency function) PX..5.454 A Review of Basic Probability Theory Appendix A A. by 00 1 i . 7 Pitman (1993) Sections 2. if X is discrete.. 5. If X is a nonnegative.. by 1 ifw E A o otherwise. . .I0 THE EXPECTATION OF A RANDOM VARIABLE r: r I Let X be the height of an individual sampled at random from a finite population. 1 Xi =i. Sections 6. Such samples win be referred to as the indicators of n Bernoulli tn"als with probability ofsuccess p" References Gnedenko (1967) Chapter 4. and Stone (1971) Section 3. Take • . If XI . Sections 23.' (Infinity is a possible value of E(X). X n are independent identically distributed kdimensional random vectors with dJ. Fx or density (frequency function) Px. 4. px(i) = i(i+1)' i= 1.. If A is any event.2. written E(X)..I) .Ll XiP[X = Xi] where P[X = Xi] is just the proportion of individuals of height Xi in the population.j j ..~ ( .p) to 0. such a random sample is often obtained by selecting n members at random in the sense of (A. (A9. .2.X.I . ..~ 1 i E(X) = LXiPX(Xi). } into two sets A and B where A consists of aU nonnegative Xi and B of all negative Xi· If either 'L..2 Hoel.I0.• }. In line with these ideas we develop the general concept of expectation as follows.EA XiPX (Xi) < . 24 Grimmett and Stirzaker (1992) Sections 3. then Xl . .. i==l (A. the indicator of the event A.. ..3 . i ..I l I .W. Port. then the Xi form a sample from a distribution that assigns probability p to 1 and (1. discrete random variable with possible values {Xl. it follows that this average is given by 'L. . I i I I .2 More generally.
Jo= xpx(x)dx or A. n."5'. i = 1.9.v') = c for all w..'_""o"_of_'""R_'_"d"o:. (A.)Px (. then E(X) = P(A).rn".6) Taking g(x) = L:~ 1 O:'iXi we obtain the fundamental relationship (A. Otherwise.IO. If X is a constant. 10. A. Those familiar with Lebesgue integration will realize that this leads to E(X) = 1: xpx(x)dx whenever (A.. Here are some properties of the expectation that hold when X is discrete. . . 10. we define E(X) unambiguously by (A. if 9 is a realvalued function on Rn."_0_"_A". . 00 455 or L. If X is a continuous random variable. an are constants and E(!Xil) < 00. . and if E(lg(X) I) < 00. I 0..I).'"":.J:.. If X (A.5) As a consequence of this result. E(X) is left undefined.3) = IA (cf (A.V. 10. then E(X) < E(Y)."_"_b_. we leave E(X) undefined.8From (A.l"O_T_h''CEx. It may be shown that if X is a continuous k~dimensional random vector and g(X) is any random variable such that JR' { Ig(x)IPx(x)dx < 00.tO A random variable X is said to be integrable if E(IXI) < 00.p. we have 00 E(IXI) ~ L i=l IXiIPx(Xi).9)).. it is natural to attempt a definition of the expectation via approximation from the discrete case. Otherwise. .lO.) < 00.7) if Qt. E(Y) are defined.ES( x. then E(X) = c. 10.7) it follows that if X < Y and E(X). then it may be shown that 00 E(g(X)) = Lg(Xi)PX(Xi) i=l (A. . 10. lOA) If X is an ndimensional random vector..IO. .9) as the definition of the expectation or mean of X f~oo xpx(x)dx is finite.. X(t. (A.c..
7).. I0. 10.. 7. . (A. The interested reader may consult an advanced text such as Chung (1974).3 Hoel. 1 . . In the continuous case dJ1(x) = dx and M(X) is called Lebesgue measure. In the discrete case J1 assigns weight one to each of the points in {x : p(x) > O} and it is called counting measure. . g(X)Px(x)dx.. the moments depend on the distribution of X only.lO.!I) In the continuous case expectation properties (A.5) and (A 10. It is possible to define the expectation of a random variable in general using discrete approximations. and (AIO. A...3. I: We refer to J1 = IlP as the dominating measure for P.. 'I I' I. 4. I In general. By (A.S) as well as continuous analogues of (A..B) g(x)p(x)dx.H.3). which means J .11).IRk g(x)dP(x) (A. • g(x)dP(x) = L g(Xi)p(Xi).. I0..! If k is any natural number and X is a random variable. (AIO.5) and (A. 10.3. I .. discrete case f: f References i~l (AIO. the kth moment of X is defined to be the expectation of X k. POrl. Section 26 Grimmett and Stirzaker (1992) Sections 3.6) hold.. 10. Chapter 3. A convenient notation is dP(x) = p(x)dl'(x).. I I) are both sometimes written as E(g(X)) ~ r ink = g(x)dF(x) or r . Chapter S. 4. E(X k ) j = Lxkpx(x) if X is discrete . lOA).5).. 3A. and Stone (1971) Sections 4. The fonnulae (A 10.IO.ll MOMENTS 1 . continuous case.1...5) and (A. j .2) xkpx(x)dx if X is continuous. It I! II .456 then E(g(X» exists and E(g(X» A Review of BaSIC Probability Theory Appendix A = 1. j i .1 I 1 .. II. (A..12) where F denotes the distribution function of X and P is the probability function of X defined by (AS.. Chung (1974) Chapter 3 Gnedenko (1967) Chapter 5.' • 1: x I (A I 1. We assume that all moments written here exist. j A.1 parzen (1960) Chapter 5. We will often refer to p(x) as the density of X in the discrete case as well as the continuous case. Sections 14 Pitman (1993) Sections 3.
2 are expressed in tenns of cumulants. is by definition E[(X . E(XiX~). These results follow..ll Moments 457 A.1 and the kunosis ')'2.6) it follows then that A. are the same as those of X.II.E(XIl)i(X. are used in the coefficient of skewness .7) Var(aX + b) = a2 Var X. Another measure of the same type is E(jX . The central product moment of order (i. then 11 A. If XI and X 2 are random variables and i. from (A. A.E(X)J). These descriptive measures are useful in comparing the shapes of various frequently used densities.S The second central moment is called the variance of X and will be written Var X. then by (A. (A. then the coefficient of skewness and the kurtosis of Y = 12 = O. A. X = E(X) (a constant). This is the case.4 The kth central moment of X (X . 1) is .E(X))/vVar x.1l.2).3 The distribution of a random variable is typically uniquely specified by its moments. the kth moment of A. The central product moment of order (1. for example.n. which is often referred to as the mean deviation.12 It is possible to generalize the notion of moments to random vectors.n.n. The variance of X is finite if and only if the second moment of X is finite (cf. . See also Section A.IO The third and fourth central moment.ll. then the product moment of order (i. then X = O. and is denoted by I'k. j are natural numbers. 10.15».7 If X is any random variable with welldefined (finite) mean and variance. for instance. The standard deviation measures the spread of the distribution of X about its expectation. A.) dardized version or Zscore of X is the random variable Z ~ (X .S) A.j) of X.E(X 2 ))i].7) and (All. and X 2 is again by definition E[(X.Section A. If X ~ N(/l. the stan E(Z) = 0 and Var Z = 1. .ll.H.E(X)). By (A 10. If a and b are constants.U.9 If E(X 2 ) = 0. by definition. It is also called a measure of scale. If Var X = 0. 0 2). which are defined by where 0'2 = Var X. (All.15. if the random variable possesses a moment generating function (cf.6) (One side of the equation exists if and only if the other does.12 where.n If Y = a + bX with b > 0. (AIl.E(X))k]. (AI2.1».j) of Xl and X 2 is.l and. For sim plicity we consider the case k = 2. The nonnegative square root of Var X is called the standard deviation of X.
19) 1 • t .).I 1. This is the correlation inequality.11l3) (A. The correlation of Xl and X 2.E( X. with equality holding if and only if (1) Xl or X 2 is a constant Or ! • • . • i .X. . (2) (XI .4.3) and (AID. i .) + od Cov(X"X. is defined whenever Xl and X 2 are not constant and the variances of Xl and X 2 are finite by (A.) = lfwe put XI ~ ~E(XI . Equality holds if and only if one of Zll Z2 equals 0 or 2 1 = aZ2 for some constant a. A proof of the CauchySchwartz inequality is given in Remark 1. then Cov(XI. J I • for any two random variables Z" Z. denoted by Corr(XI1 X2).E(X J!)( X.14). .X 2).IIl4) If Xi and Xf are distributed as Xl and X 2 and are independent of Xl and X 2. It may be obtained from the CauchySchwartz inequality. Z. =X in (A.458 A Review of Basic Prob<lbility Theory Appendix A called the cOl'Orionce of X 1 and.).) = ac Cov(X I .) (X. 0 # 0) of XI' l il .E(X I ») = CO~(X~X. Equality holds if and only if X. is linear function (X. (AI 117) j r . + dX. The correlation inequality correspcnds to the special case ZI = Xl . .lI.) (A. o. . •  II .. . X. I 1 1 j I I .X.. = a + oXI . . X.I6) . such that E(Zf) < 00.\:2 and is written Cov(. X 3 ) + be Cov(X" X 3 ) and + ad Cov(X I . ar '2 . Cov(aX I + oX" eX. = X.)(X. By expanding the product (XI .[E(X)j2 (A.X.).\ 1.)) and using (AI 0. we get the formula Var X = E(X') .E(X I ).1.E(X. E(Zil < 00. I liS) The covariance is defined whenever Xl and X 2 have finite variances and in that case (AII.I US) The correlation of Xl and X 2 is the covariance of the standardized versions of Xl and X 2· The correlation inequality is equivalent to the statement (A. we obtain the relations.. 7).E(X.
3) k=O .J4).II). Mx(s) ~ E(e'X) is well defined for and is called the moment generating function of X. are uncorrelated) need be independent.. It is not trUe in general that X I and X 2 that satisfy (AIl.X.. 30 Hoel..llf E(c'nIXI) < 00 for some So > 0. i ~ 1. 12. 10. lsi < so} of zero. Port..) > 0.22) and (AI 1. are If M x is well defined in a neighhorhood finite and {s ..II. 28..) ~ o when Var(X.ID.II.l2. XII have finite variances.4. I 1. (A. all moments of X k! sk. +2L '<7 COV(X.23) A. As a consequence of (AI 1. =L = E(X k ) lsi < so· (A.2) eSXpx(x)dx if X is continuous..bX 1.20) If Xl and X 2 are independent and X I and X 2 are integrable. By (A.l1.2. Sections 14 Pitman (1993) Section 6. respectively).5) and (A. we obtain as a consequence of (A. and Stone (1971) Sections 4. + X n ) = L '1=1 n Var X. The correlation coefficient roughly measures the amount and sign of linear relationship between Xl and X 2. . 7. lsi < So Mx(s) LeSX'PX(Xi) if X is discrete 1: Mx(s) i~1 (A. CoV(X1. then (A.22) (i.Section A.12. Xl)· (A..24. 11.22) This may be checked directly. b < orb> 0.4 ° + .X2 ) = Corr(X1.3 Parzen (1960) Chapter 5. then Var(X 1 References Gnedenko (1967) Chapter 5.'" .e.13) the relation Var(X 1 + .. It is 1 Or 1 in the case of perfect relationship (X2 = a l.21 ) or in view of(All. Chapter 8.12 MOMENT AND CUMULANT GENERATING FUNCTIONS A. Sections 27. + X n ) = L Var Xi· i=I n (A.5.Xn are independent with finite variances. we see that if Xl.12 Moment and Cumulant Generating Functions 459 If Xl. See also Section 1.20).
SOME CLASSICAL DISCRETE AND CONTINUOUS DISTRIBUTIONS I f • .l1. ~ Cj(X) ~ d .'(8) ds~ = E(X').9) l 1 " I J . is called the jth cumulant of X.+. Cj = 0 for j > 3.6) This follows by induction from the definition and (A. The first cumulant Cl is the mean J. = 0 and dk :\1. . . the probability distribution of a random variable or vector is just a probability measure on a suitable Euclidean space. ! . .5 If defined.+x"I(S) = II Mx. K x can be represented by the convergent Taylor expansion Kx(s) = where L j=O 00 Cisj (A. .8) J. For a generalization of the notion afmoment generating function to random vectors. • By definition. 11. + XI! has moment generating function given by M(x.12. then Cj(X + Y) = c..12. c.t4 . Section 8.(s). j > L For j > 2 and any constant a. .2l). Chapter 8.13 .3J1~.Us hils derivatives of all orders at :. . See • 1 ~ . and Stone (1971) Chapter 8.460 A Review of Basic: Probability Theory Appendix A A. References Hoel. I I i I A. see Section B.1x I ' . and C4 = J..3.12.. C2 and C3 equal the second and third central moments f12 and f13 of X.12.7) is called the cumulant generating function of X.'(".5. In this section we introduce certain families of b •. A/x determines the distribution of X uniquely and is itself uniquely determined by the distribution of X. Cj(X + a) = Cj(X). i=l n (A. If XI .t of X. . • 111.(X) + cJ(Y). 12.4 The moment generuting function .~o sJ dj (A. XII are independent random variables with moment generating functions 11. c3/ci and 1'2 = C4/C~.8. Ii . The coefficients of skewness and kurtosis (see (A. Port.. If X and Yare independent..Kx(s)I. If lv/x is well defined in some neighborhood of zero.12.10» can be written as 1'1 = Problem 8. j A. Section 3.1 parzen (1960) Chapter 5.. then Xl + . Sections 23 Rao (1973) Section 2bA j i i I . If X is nonnally distributed. The function Kx(s) = log M x (s) (A.
I. p(k) = (A. n) distribution (see (A6. This result may be derived by using (A. (Al3. B(n2.l.4). respectively.13.6. DIN) distribution. The parameters D and n may be any natural nwnbers that are less than or equal to the natural number N. + . X has a B(n.12.13. + n"O) distribution. N. (A I 3. ..13 Some Classical Discrete and Continuous Distributions 461 distributions.n. Similarly.n . .1) > 0 whereas 0 may be any number in [0.l3. and n : 1t(D. then X has a B( n.OW.. If X has a B( n. . Var X = nO(1  0). . . N.71f X is the number of defectives (special objects) in a sample of size n taken without replacement from a population with D defectives and N .D» < k < min(n.13.Xk are independent random variables distributed as B(n l1 8).Section A.3». p(k)~ (~)0'(10)"k.1 0». then E(X) ~ nO.4) . 0) distribution (see (A. The hypergeometric distribution with parameters D.. which arise frequently in probability and statistics. 0). A. it is assumed to be zero to the "left" ofthe set and one to the "right" ofthe set.13.5 If Xl. B)" for "the binomial distribution with pmameter (n .6) in conjunction with (A.D nondefectives. k=O. . has a B(n. Discrete Distributions The binomial distribution with parameters nand B : B( n.. 0).3) Higherorder moments may be computed from the moment generating function Mx(t) = [Oe' + (1 .5) and (A.n). . 1]. Following the name of each distribution we give a shorthand notation that will sometimes be used as will obvious abbreviations such as "binomial (n. B(n. The parameter n can be any integer (A.. 0) distribution.. If anywhere below p is not specified explicitly for some value of x it shall be assumed that p vanishes at that point. + X 2 + '" + X.12. N. then X.6) for k a natural number with max(O. and list some of their properties.(N .D). A. A. The symbol p as usual stands for a frequency or density function.l3. If the sample is taken with replacement. .2 If X is the total number of successes obtained in n Bernoulli trials with probability of success 0. B). if the value of the distribution function F is not specified outside some set. then X has an H( D. Or. X 2 .
13.' .. 2. Oil .6).. If X has a M( n.462 A Review of Basic Probability Theory Appendix A If X has an H(D. where Xi multinomial trials with probabilities (0 1 . (A. ~ • (AI 1. . then D D ( D) Nn E(X) = n N' Var X = n N 1. 1. . and . • • I • E(X) = Var X ~ A..7).12 If X" X 2 .. k ( k) ]0 = e'A k! (A 13.? i 1 ki = n.13.1O) f The moment generating function of X is given by Mx(t) ~ e. e = {(O" ..8) Formulae (AI3. tion (see (A.l3) The par~meter n is any 'L.. . t=l = (Xl.IS) ..9) for k = 0. i of:.. lOA). to. ]O(k" .(.20). .. " f" is the number of times outcome Wi occurs in n .0... (AI3.8) may be obtained directly from the definition (A 13. . .l . ~ I}. Bq : M(n.1 < < q.. then X has a M(n.kq) = k.1I) A.Bq) distribu ! i : E(Xi ) Cov(Xi.N N 1' (A.. ..71Ij where I j = 1 if the jth object sampled is defective and 0 otherwise.. Oq).. q. The parameter>.. 13."" P(A n ) distributions. " X n are independent random variables with P(A').• .Oq): 0. P(A2). . Oq' whenever k i are nonnegative integers such that natural number while (Ol. . . 10.l3. + X 2 + +Xn has the P(AI + A2 + . (A. If X has a peA) distribution.. .Xq )'.. kq. An easier way is to use the interpretation (A. j = 1. The Poisson distribution with parameter A: peA). and then applying formulae (A. This result may be derived in th same manner as the corresponding fact for the binomial distribution.) n(JiOj. then X. n) distribution.. B1 .6.'l). then .14 If X 0.j l i. .. respectively.•• . 0 ' .Xj ) ne" Var Xi ~ nOi(I..Oq). 01 "" . + An) distribution..l3. Oq) distribution.'" Oq) is any vector in n! k k (A.i3. > A.l3. . (A. N.I . The multinomial distribution with parameters n.7) by writing X = 'L.O. can be any positive number. .6)).
=1 ()i J ) distribution for any set {i l .16 If X has a M (n'(:h. and (A. be the d./") + ) F"(.=1 ()i. T > 0. By definition. if X .13. Continuous Distributions Before beginning our listing we intr<X1uce some convenient notations: X .7).8) and an application of formulas (A. . and Y is said to generate F s .I3. we may take E(Y) = 0.F.2:.'i s } C {I. ifY has a second moment. if Y has a second moment. . : '00 < j.13. '"Y. X '" Fp.s = {Fp"a: 00 < j. we may select F as being the unique member of the family F shaving Var Y = 1 and then X '" F. {::} X/a . . C).S. and a a scale parameter.T '"Y . be the dJ. iJ' II. . {::} X .t < oo} is called a location parameter family.q}.X'. · · . . j. Alternatively.. .I3). : cr > OJ is called a scale parameterfamily. a is a scale parameter.) = (r(x a. X .L F. and X '" p will similarly mean that X has density or frequency function p.. for any a > 0.a > O} is called a locationscale parameter family.j. it follows that we may without loss of generality (as far as generating FL goes) assume that E(Y) = O.