Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Bickel, Peter J.
Mathematical statistics: basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-850363-X (v. 1)
1. Mathematical statistics. I. Doksum, Kjell A. II. Title.

QA276.B47 2001
519.5 dc21    00-031377
Acquisition Editor: Kathleen Boothby Sestak
Editor in Chief: Sally Yagan
Assistant Vice President of Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Production Editor: Bob Walters
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Marketing Assistant: Vince Jansen
Director of Marketing: John Tweeddale
Editorial Assistant: Joanne Wendelken
Art Director: Jayne Conte
Cover Design: Jayne Conte
© 2001, 1977 by Prentice-Hall, Inc., Upper Saddle River, New Jersey 07458

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
CONTENTS
PREFACE TO THE SECOND EDITION: VOLUME I

PREFACE TO THE FIRST EDITION

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
  1.1 Data, Models, Parameters, and Statistics
    1.1.1 Data and Models
    1.1.2 Parametrizations and Parameters
    1.1.3 Statistics as Functions on the Sample Space
    1.1.4 Examples, Regression Models
  1.2 Bayesian Models
  1.3 The Decision Theoretic Framework
    1.3.1 Components of the Decision Theory Framework
    1.3.2 Comparison of Decision Procedures
    1.3.3 Bayes and Minimax Criteria
  1.4 Prediction
  1.5 Sufficiency
  1.6 Exponential Families
    1.6.1 The One-Parameter Case
    1.6.2 The Multiparameter Case
    1.6.3 Building Exponential Families
    1.6.4 Properties of Exponential Families
    1.6.5 Conjugate Families of Prior Distributions
  1.7 Problems and Complements
  1.8 Notes
  1.9 References

2 METHODS OF ESTIMATION
  2.1 Basic Heuristics of Estimation
    2.1.1 Minimum Contrast Estimates; Estimating Equations
    2.1.2 The Plug-In and Extension Principles
  2.2 Minimum Contrast Estimates and Estimating Equations
    2.2.1 Least Squares and Weighted Least Squares
    2.2.2 Maximum Likelihood
  2.3 Maximum Likelihood in Multiparameter Exponential Families
  *2.4 Algorithmic Issues
    2.4.1 The Method of Bisection
    2.4.2 Coordinate Ascent
    2.4.3 The Newton-Raphson Algorithm
    2.4.4 The EM (Expectation/Maximization) Algorithm
  2.5 Problems and Complements
  2.6 Notes
  2.7 References

3 MEASURES OF PERFORMANCE
  3.1 Introduction
  3.2 Bayes Procedures
  3.3 Minimax Procedures
  *3.4 Unbiased Estimation and Risk Inequalities
    3.4.1 Unbiased Estimation, Survey Sampling
    3.4.2 The Information Inequality
  *3.5 Nondecision Theoretic Criteria
    3.5.1 Computation
    3.5.2 Interpretability
    3.5.3 Robustness
  3.6 Problems and Complements
  3.7 Notes
  3.8 References

4 TESTING AND CONFIDENCE REGIONS
  4.1 Introduction
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models
  4.4 Confidence Bounds, Intervals, and Regions
  4.5 The Duality Between Confidence Regions and Tests
  *4.6 Uniformly Most Accurate Confidence Bounds
  *4.7 Frequentist and Bayesian Formulations
  4.8 Prediction Intervals
  4.9 Likelihood Ratio Procedures
    4.9.1 Introduction
    4.9.2 Tests for the Mean of a Normal Distribution, Matched Pair Experiments
    4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations
    4.9.4 The Two-Sample Problem with Unequal Variances
    4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions
  4.10 Problems and Complements
  4.11 Notes
  4.12 References

5 ASYMPTOTIC APPROXIMATIONS
  5.1 Introduction: The Meaning and Uses of Asymptotics
  5.2 Consistency
    5.2.1 Plug-In Estimates and MLEs in Exponential Family Models
    5.2.2 Consistency of Minimum Contrast Estimates
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications
    5.3.1 The Delta Method for Moments
    5.3.2 The Delta Method for In Law Approximations
    5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
  5.4 Asymptotic Theory in One Dimension
    5.4.1 Estimation: The Multinomial Case
    *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates
    *5.4.3 Asymptotic Normality and Efficiency of the MLE
    *5.4.4 Testing
    *5.4.5 Confidence Bounds
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution
  5.6 Problems and Complements
  5.7 Notes
  5.8 References

6 INFERENCE IN THE MULTIPARAMETER CASE
  6.1 Inference for Gaussian Linear Models
    6.1.1 The Classical Gaussian Linear Model
    6.1.2 Estimation
    6.1.3 Tests and Confidence Intervals
  *6.2 Asymptotic Estimation Theory in p Dimensions
    6.2.1 Estimating Equations
    6.2.2 Asymptotic Normality and Efficiency of the MLE
    6.2.3 The Posterior Distribution in the Multiparameter Case
  *6.3 Large Sample Tests and Confidence Regions
    6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
    6.3.2 Wald's and Rao's Large Sample Tests
  *6.4 Large Sample Methods for Discrete Data
    6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's χ² Test
    6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables
    6.4.3 Logistic Regression for Binary Responses
  *6.5 Generalized Linear Models
  *6.6 Robustness Properties and Semiparametric Models
  6.7 Problems and Complements
  6.8 Notes
  6.9 References

A A REVIEW OF BASIC PROBABILITY THEORY
  A.1 The Basic Model
  A.2 Elementary Properties of Probability Models
  A.3 Discrete Probability Models
  A.4 Conditional Probability and Independence
  A.5 Compound Experiments
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement
  A.7 Probabilities on Euclidean Space
  A.8 Random Variables and Vectors: Transformations
  A.9 Independence of Random Variables and Vectors
  A.10 The Expectation of a Random Variable
  A.11 Moments
  A.12 Moment and Cumulant Generating Functions
  A.13 Some Classical Discrete and Continuous Distributions
  A.14 Modes of Convergence of Random Variables and Limit Theorems
  A.15 Further Limit Theorems and Inequalities
  A.16 Poisson Process
  A.17 Notes
  A.18 References

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS
  B.1 Conditioning by a Random Variable or Vector
    B.1.1 The Discrete Case
    B.1.2 Conditional Expectation for Discrete Variables
    B.1.3 Properties of Conditional Expected Values
    B.1.4 Continuous Variables
    B.1.5 Comments on the General Case
  B.2 Distribution Theory for Transformations of Random Vectors
    B.2.1 The Basic Framework
    B.2.2 The Gamma and Beta Distributions
  B.3 Distribution Theory for Samples from a Normal Population
    B.3.1 The χ², F, and t Distributions
    B.3.2 Orthogonal Transformations
  B.4 The Bivariate Normal Distribution
  B.5 Moments of Random Vectors and Matrices
    B.5.1 Basic Properties of Expectations
    B.5.2 Properties of Variance
  B.6 The Multivariate Normal Distribution
    B.6.1 Definition and Density
    B.6.2 Basic Properties. Conditional Distributions
  B.7 Convergence for Random Vectors: O_p and o_p Notation
  B.8 Multivariate Calculus
  B.9 Convexity and Inequalities
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory
    B.10.1 Symmetric Matrices
    B.10.2 Order on Symmetric Matrices
    B.10.3 Elementary Hilbert Space Theory
  B.11 Problems and Complements
  B.12 Notes
  B.13 References

C TABLES
  Table I    The Standard Normal Distribution
  Table I′   Auxiliary Table of the Standard Normal Distribution
  Table II   t Distribution Critical Values
  Table III  χ² Distribution Critical Values
  Table IV   F Distribution Critical Values

INDEX
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared, statistics has changed enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.
(2) The generation of enormous amounts of data: terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once been unthinkable.

The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data. As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with "infinite" number of parameters.

(2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.

(3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo.
(4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious.

(5) The study of the interplay between numerical and statistical considerations. Despite advances in computing speed, some methods run quickly in real time. Others do not and some, though theoretically attractive, cannot be implemented in a human lifetime.

(6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories.

There have, of course, been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. These will not be dealt with in our work.

In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. However, our focus and order of presentation have changed. As a consequence our second edition, reflecting what we now teach our graduate students, is much changed from the first. Our one long book has grown to two volumes, each to be only a little shorter than the first edition.

Volume I, which we present in 2000, covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing conclusions. Volume I covers the material of Chapters 1-6 and Chapter 10 of the first edition with pieces of Chapters 7-10 and includes Appendix A on basic probability theory. However, Chapter 1 now has become part of a larger Appendix B, which includes more advanced topics from probability theory such as the multivariate Gaussian distribution, weak convergence in Euclidean spaces, and probability inequalities as well as more advanced topics in matrix theory and analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on Rd as well as an elementary introduction to Hilbert space theory. Appendix B is as self-contained as possible, with proofs of most statements, problems, and references to the literature for proofs of the deepest results such as the spectral theorem. The reason for these additions are the changes in subject matter necessitated by the current areas of importance in the field.

As in the first edition, we do not require measure theory but assume from the start that our models are what we call "regular." That is, we assume either a discrete probability whose support does not depend on the parameter set, or the absolutely continuous case with a density. Hilbert space theory is not needed, but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis.

Specifically, instead of beginning with parametrized models we include from the start non- and semiparametric models, then go to parameters and parametric models stressing the role of identifiability. From the beginning we stress function-valued parameters, such as the density, and function-valued statistics, such as the empirical distribution function. We also, from the start, include examples that are important in applications, such as regression experiments. There is more material on Bayesian models and analysis. Save for these changes of emphasis the other major new elements of Chapter 1, which parallels Chapter 2 of the first edition, are an extended discussion of prediction and an expanded introduction to k-parameter exponential families. These objects that are the building blocks of most modern models require concepts involving moments of random vectors and convexity that are given in Appendix B.

Chapter 2 of this edition parallels Chapter 3 of the first and deals with estimation. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter exponential families. Other novel features of this chapter include a detailed analysis, including proofs of convergence, of a standard but slow algorithm for computing MLEs in multiparameter exponential families and an introduction to the EM algorithm, one of the main ingredients of most modern algorithms for inference.

Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions, including some optimality theory for estimation as well and elementary robustness considerations. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage.

Chapter 5 of the new edition is devoted to asymptotic approximations. It includes the initial theory presented in the first edition but goes much further, with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Also new is a section relating Bayesian and frequentist inference via the Bernstein-von Mises theorem.

Chapter 6 is devoted to inference in multivariate (multiparameter) models. Included are asymptotic normality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. Generalized linear models are introduced as examples. Robustness from an asymptotic theory point of view appears also. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the more advanced topics of Volume II.

As in the first edition, problems play a critical role by elucidating and often substantially expanding the text. Almost all the previous ones have been kept, with an approximately equal number of new ones added to correspond to our new topics and point of view. The conventions established on footnotes and notation in the first edition remain, if somewhat augmented.

Chapters 1-4 develop the basic principles and examples of statistics. Nevertheless, we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. Although we believe the material of Chapters 5 and 6 has now become fundamental, there is clearly much that could be omitted at a first reading that we also star. There are clear dependencies between starred sections that follow:

5.4.2 ⇒ 5.4.3 ⇒ 6.2 ⇒ 6.3 ⇒ 6.4 ⇒ 6.5; 6.2 ⇒ 6.6

Volume II is expected to be forthcoming in 2003. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance. Examples of application such as the Cox model in survival analysis, other transformation models, and the classical nonparametric k sample and independence problems will be included. Semiparametric estimation and testing will be considered more generally, greatly extending the material in Chapter 8 of the first edition. The topic presently in Chapter 8, density estimation, will be studied in the context of nonparametric function estimation. We also expect to discuss classification and model selection using the elementary theory of empirical processes. The basic asymptotic tools that will be developed or presented, in part in the text and in part in appendices, are weak convergence for random processes, elementary empirical process theory, and the functional delta method. A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo. With the tools and concepts developed in this second volume students will be ready for advanced research in modern statistics.

For the first volume of the second edition we would like to add thanks to new colleagues, particularly Jianqing Fan, Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill, and the many students who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for proofreading that found not only typos but serious errors, and Prentice Hall for generous production support. Last and most important we would like to thank our wives, Nancy Kramer Bickel and Joan H. Fujimura, and our families for support, encouragement, and active participation in an enterprise that at times seemed endless, appeared gratifyingly ended in 1976, but has, with the field, taken on a new life.

Peter J. Bickel
bickel@stat.berkeley.edu

Kjell Doksum
doksum@stat.berkeley.edu
PREFACE TO THE FIRST EDITION

This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). Because the book is an introduction to statistics, we need probability theory and expect readers to have had a course at the level of, for instance, Hoel, Port, and Stone's Introduction to Probability Theory. Our appendix does give all the probability that is needed. However, the treatment is abridged with few proofs and no examples or problems.

We feel such an introduction should at least do the following:

(1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice.

(2) Give careful proofs of the major "elementary" results such as the Neyman-Pearson lemma, the Lehmann-Scheffe theorem, the information inequality, and the Gauss-Markoff theorem.

(3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates, and the structure of both Bayes and admissible solutions in decision theory. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated.

(4) Show how the ideas and results apply in a variety of important subfields such as Gaussian linear models, multinomial models, and nonparametric models.

Although there are several good books available for this purpose, we feel that none has quite the mix of coverage and depth desirable at this level. The work of Rao, Linear Statistical Inference and Its Applications, 2nd ed., covers most of the material we do and much more but at a more abstract level employing measure theory. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig, Introduction to Mathematical Statistics, 3rd ed. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential, such as existence and computation of procedures and large sample behavior.

Our book contains more material than can be covered in two quarters. In the two-quarter courses for graduate students in mathematics, statistics, the physical sciences, and engineering that we have taught, we cover the core Chapters 2 to 7, which go from modeling through estimation and testing to linear models. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Finally, we select topics from Chapter 8 on discrete data and Chapter 9 on nonparametric models.

Chapter 1 covers probability theory rather than statistics. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. It may be integrated with the material of Chapters 2-7 as the course proceeds rather than being given at the start, or it may be included at the end of an introductory probability course that precedes the statistics course.

A special feature of the book is its many problems. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text.

Conventions: (i) In order to minimize the number of footnotes we have added a section of comments at the end of each chapter preceding the problem section. These comments are ordered by the section to which they pertain. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers, 1 for the first, 2 for the second, and so on. The comments contain digressions, reservations, and additional references. They need to be read only as the reader's curiosity is piqued.

(ii) Various notational conventions and abbreviations are used in the text. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text.

(iii) Basic notation for probabilistic objects such as random variables and vectors, densities, distribution functions, and moments is established in the appendix.

We would like to acknowledge our indebtedness to colleagues, students, and friends who helped us during the various stages (notes, preliminary edition, final draft) through which this book passed. E. L. Lehmann's wise advice has played a decisive role at many points. R. Pyke's careful reading of a next-to-final version caught a number of infelicities of style and content. Many careless mistakes and typographical errors in an earlier version were caught by D. Minassian, who sent us an exhaustive and helpful listing. W. Carmichael, in proofreading the final version, caught more mistakes than both authors together. A serious error in Problem 2.2.5 was discovered by F. Scholz. Among many others who helped in the same way we would like to mention C. Chen, S. J. Chou, G. Drew, C. Gray, U. Gupta, P. X. Quang, and A. Samulon. Without Winston Chow's lovely plots Section 9.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day.

We would also like to thank the colleagues and friends who inspired and helped us to enter the field of statistics. The foundation of our statistical knowledge was obtained in the lucid, enthusiastic, and stimulating lectures of Joe Hodges and Chuck Bell, respectively. Later we were both very much influenced by Erich Lehmann, whose ideas are strongly reflected in this book.

Peter J. Bickel
Kjell Doksum
Berkeley
1976
Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Second Edition
Chapter 1

STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA

1.1 DATA, MODELS, PARAMETERS, AND STATISTICS

1.1.1 Data and Models

Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor. Data can consist of:

(1) Vectors of scalars, measurements, and/or characters, for example, a single time series of measurements.

(2) Matrices of scalars and/or characters, for example, digitized pictures or more routinely measurements of covariates and response on a set of n individuals (see Example 1.1.4 and Sections 2.2.1 and 6.1).

(3) Arrays of scalars and/or characters as in contingency tables (see Chapter 6) or more generally multifactor multiresponse data on a number of individuals.

(4) All of the above and more, in particular, functions as in signal processing, trees as in evolutionary phylogenies, and so on.

The goals of science and society, which statisticians share, are to draw useful information from data using everything that we know. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically.

A detailed discussion of the appropriateness of the models we shall discuss in particular situations is beyond the scope of this book, but we will introduce general model diagnostic tools in Volume 2, Chapter 1. Moreover, we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. A generic source of trouble often called gross errors is discussed in greater detail in the section on robustness (Section 3.5.3). In any case all our models are generic and, as usual, "The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. Subject matter specialists usually have to be principal guides in model formulation.

In this book we will study how, starting with tentative models:

(1) We can conceptualize the data structure and our goals more precisely. We begin this in the simple examples that follow and continue in Sections 1.2-1.5 and throughout the book.

(2) We can derive methods of extracting useful information from data and, in particular, give methods that assess the generalizability of experimental results. For instance, if we observe an effect in our data, to what extent can we expect the same effect more generally? Estimation, testing, confidence regions, and more general procedures will be discussed in Chapters 2-4.

(3) We can assess the effectiveness of the methods we propose. We begin this discussion with decision theory in Section 1.3 and continue with optimality principles in Chapters 3 and 4.

(4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. Goodness of fit tests, robustness, and diagnostics are discussed in Volume 2, Chapter 1.

(5) We can be guided to alternative or more general descriptions that might fit better. Hierarchies of models are discussed throughout.

Here are some examples:

(a) We are faced with a population of N elements, for instance, a shipment of manufactured items. An unknown number Nθ of these elements are defective. It is too expensive to examine all of the items. So to get information about θ, a sample of n is drawn without replacement and inspected. The data gathered are the number of defectives found in the sample.

(b) We want to study how a physical or economic feature, for example, height or income, is distributed in a large population. An exhaustive census is impossible, so the study is based on measurements on a sample of n individuals drawn at random from the population. The population is so large that, for modeling purposes, we approximate the actual process of sampling without replacement by sampling with replacement.

(c) An experimenter makes n independent determinations of the value of a physical constant μ. His or her measurements are subject to random fluctuations (error) and the data can be thought of as μ plus some random errors.

(d) We want to compare the efficacy of two ways of doing something under similar conditions, such as brewing coffee, reducing pollution, treating a disease, producing energy, learning a maze, and so on. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population. We run m + n independent experiments as follows: m + n members of the population are picked at random, and m of these are assigned to the first method and the remaining n are assigned to the second method. For instance, we can assign two drugs, A to m, and B to n, randomly selected patients and then measure temperature and blood pressure, have the patients rated qualitatively for improvement by physicians, and so on. Random variability here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.

We shall use these examples to arrive at our formulation of statistical models and to indicate some of the difficulties of constructing such models. First consider situation (a), which we refer to as:

Example 1.1.1. Sampling Inspection. The mathematical model suggested by the description is well defined. A random experiment has been performed. The sample space consists of the numbers 0, 1, ..., n corresponding to the number of defective items found. On this space we can define a random variable X given by X(k) = k, k = 0, 1, ..., n. If Nθ is the number of defective items in the population sampled, then by (A.13.1)

P[X = k] = (Nθ choose k)(N(1 − θ) choose n − k) / (N choose n)   (1.1.1)

if max(n − N(1 − θ), 0) ≤ k ≤ min(Nθ, n). That is, X has the hypergeometric distribution H(Nθ, N, n). The main difference that our model exhibits from the usual probability model is that Nθ is unknown and, in principle, can take on any value between 0 and N. So, although the sample space is well defined, we cannot specify the probability structure completely but rather only give a family {H(Nθ, N, n)} of probability distributions for X, any one of which could have generated the data actually observed. □
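To make the family {H(Nθ, N, n)} concrete, here is a minimal computational sketch using Python with scipy; the population size N = 100, sample size n = 10, and the candidate defective counts are arbitrary illustrative choices, not values from the text.

    # Sketch: the sampling-inspection family {H(N*theta, N, n)} of Example 1.1.1.
    # N, n, and the candidate defective counts D are illustrative only.
    from scipy.stats import hypergeom

    N, n = 100, 10                    # population size, sample size
    for D in (0, 5, 20):              # D = N*theta, the unknown number of defectives
        rv = hypergeom(N, D, n)       # scipy argument order: population, successes, draws
        # P[X = k] for each possible number k of defectives in the sample
        probs = [rv.pmf(k) for k in range(n + 1)]
        print(D, [round(p, 3) for p in probs])

Each value of D yields one member of the family; the data alone do not single out a member, which is exactly why inference is needed.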
Example 1.1.2. One-Sample Models. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not. It can also be thought of as a limiting case in which N = ∞, so that sampling with replacement replaces sampling without. Formally, if the measurements are scalar, we observe x1, ..., xn, which are modeled as realizations of X1, ..., Xn independent, identically distributed (i.i.d.) random variables with common unknown distribution function F. We often refer to such X1, ..., Xn as a random sample from F, and also write that X1, ..., Xn are i.i.d. as X with X ~ F, where "~" stands for "is distributed as." The model is fully described by the set F of distributions that we specify. The same model also arises naturally in situation (c). Here we can write the n determinations of μ as

X_i = μ + ε_i, 1 ≤ i ≤ n   (1.1.2)

where ε = (ε_1, ..., ε_n)T is the vector of random errors. What should we assume about the distribution of ε, which together with μ completely specifies the joint distribution of X1, ..., Xn? Of course, that depends on how the experiment is carried out. Given the description in (c), we postulate

(1) The value of the error committed on one determination does not affect the value of the error at other times. That is, ε_1, ..., ε_n are independent.

(2) The distribution of the error at one determination is the same as that at another. Thus, ε_1, ..., ε_n are identically distributed.

(3) The distribution of ε is independent of μ. Equivalently X1, ..., Xn are a random sample and, if we let G be the distribution function of ε_1 and F that of X1, then

F(x) = G(x − μ)   (1.1.3)

and the model is alternatively specified by F, the set of F's we postulate, or by {(μ, G) : μ ∈ R, G ∈ G}, where G is the set of all allowable error distributions that we postulate. Commonly considered G's are all distributions with center of symmetry 0, or alternatively all distributions with expectation 0. Often the final simplification is made.

(4) The common distribution of the errors is N(0, σ²), where σ² is unknown. That is, the X_i are a sample from a N(μ, σ²) population, or equivalently F = {Φ((· − μ)/σ) : μ ∈ R, σ > 0}, where Φ is the standard normal distribution.

This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations, for instance, heights of individuals or log incomes. It is important to remember that these are assumptions at best only approximately valid. All actual measurements are discrete rather than continuous. There are absolute bounds on most quantities: 100 ft high men are impossible. Heights are always nonnegative. The Gaussian distribution, whatever be μ and σ, will have none of this. □
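A minimal simulation sketch of the measurement model (1.1.2) under the Gaussian default assumption (4); the values of μ, σ, and n below are arbitrary illustrative choices.

    # Sketch: n determinations X_i = mu + eps_i with N(0, sigma^2) errors (Example 1.1.2).
    # mu, sigma, and n are arbitrary illustrative values.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n = 2.5, 0.1, 25
    x = mu + sigma * rng.standard_normal(n)   # realizations of X_1, ..., X_n
    print(x.mean())                           # close to mu for moderate n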
As our examples suggest.1.2 is that. ~(w) is referred to as the observations or data. When w is the outcome of the experiment. In others. we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients. in Example 1. P is the . On this sample space we have defined a random vector X = (Xl. We are given a random experiment with sample space O. Parameters.1. The danger is that.Xn ). The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise. The number of defectives in the first example clearly has a hypergeometric distribution. Fortunately. The advantage of piling on assumptions such as (I )(4) of Example 1. we can ensure independence and identical distribution of the observations by using different. In some applications we often have a tested theoretical model and the danger is small. for instance. This will be done in Sections 3.1).4.2.2. However. Models. the number of a particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed. For instance. Experiments in medicine and the social sciences often pose particular difficulties. For instance. have been assigned to B.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease.3 when F. we now define the elements of a statistical model. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. A review of necessary concepts and notation from probability theory are given in the appendices. For instance. In Example 1. our analyses. we can be reasonably secure about some aspects. Since it is only X that we observe. but not others. we need only consider its probability distribution. if they are true. and Statistics 5 How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. Statistical methods for models of this kind are given in Volume 2. there is tremendous variation in the degree of knowledge and control we have concerning experiments.6.'" .1. in comparative experiments such as those of Example 1.1. the data X(w).1. That is. we observe X and the family P is that of all bypergeometric distributions with sample size n and population size N.Section 1. the methods needed for its analysis are much the same as those appropriate for the situation of Example 1. we know how to combine our measurements to estimate JL in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4. if they are false.1 Data. P is referred to as the model.1.3 and 6. All the severely ill patients might.1. G are assumed arbitrary. equally trained observers with no knowledge of each other's findings. may be quite irrelevant to the experiment that was actually performed. It is often convenient to identify the random vector X with its realization. Using our first three examples for illustrative purposes. in Example 1. we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated.5. In this situation (and generally) it is important to randomize. though correct for the model written down. if (1)(4) hold. 
This distribution is assumed to be a member of a family P of probability distributions on Rn.
n) distribution. having expectation 0.G) has density n~ 1 g(Xi 1'). However. we know we are measuring a positive quantity in this model.. ..Xrn are identically distributed as are Y1 . . . as we shall see later.. as a parameter and in Example 1. When we can take to be a nice subset of Euclidean space and the maps 0 + Po are smooth. 1 • • • . 02 and yet Pel = Pe 2 • Such parametrizations are called unidentifiable. the parameter space e. Goals.e. we only wish to make assumptions (1}(3) with t:. in (l.2 with assumptions (1)(4) we have = R x R+ and. The only truly nonparametric but useless model for X E R n is to assume that its (joint) distribution can be anything. If.3 with only (I) holding and F.1. 02 ). . for eXample.2) suppose that we pennit G to be arbitrary. and Performance Criteria Chapter 1 family of all distributions according to which Xl. What parametrization we choose is usually suggested by the phenomenon we are modeling. models P are called parametric. It's important to note that even nonparametric models make substantial assumptionsin Example 1.l.. In Example 1. the first parametrization we arrive at is not necessarily the one leading to the simplest analysis. NO. . 1) errors.1 we can use the number of defectives in the population. .1. G) : I' E R.. we may parametrize the model by the first and second moments of the normal distribution of the observations (i. The critical problem with such parametrizations is that ev~n with "infinite amounts of data. under assumptions (1)(4). ..I} and 1. Thus.6 Statistical Models. . Note that there are many ways of choosing a parametrization in these and all other problems. parts of fJ remain unknowable. by (tt. knowledge of the true Pe. 0. Pe 2 • 2 e ° . in Example 1.G) : I' E R.Xn are independent and identically distributed with a common N (IL. N.1.. that is.. Finally. For instance.G taken to be arbitrary are called nonparametric. lh i O => POl f. . Of even greater concern is the possibility that the parametrization is not onetoone.3 that Xl. X rn are independent of each other and Yl .1. in Example = {O. G) into the distribution of (Xl. if) (X'.. we can take = {(I'. moreover. () t Po from a space of labels.2 Parametrizations and Parameters i e To describe P we USe a parametrization. if () = (Il.. Pe the H(NB. G with density 9 such that xg(x )dx = O} and p(". 1) errors lead parametrization is unidentifiable because. We may take any onetoone function of 0 as a new parameter.2 ) distribution. Now the and N(O. we have = R+ x R+ . models such as that of Example 1. Yn .X l . I 1.1. . For instance.2. tt = to the same distribution of the observations a~ tt = 1 and N (~ 1. Pe the distribution on Rfl with density implicitly taken n~ 1 .t2 + k e e e J . or equivalently write P = {Pe : BEe}.2)). Ghas(arbitrary)densityg}. () is the fraction of defectives. in senses to be made precise later. . that is.1. . that is. tt is the unknown constant being measured.1 we take () to be the fraction of defectives in the shipment. to P. a map. Yn .1. we will need to ensure that our parametrizations are identifiable. Thus.2 with assumptions (1)(3) are called semiparametric..JL) where cp is the standard normal density. .. Then the map sending B = (I'. . such that we can have (}1 f. X n ) remains the same but = {(I'. Models such as that of Example 1. j. still in this example." that is.1. e j . on the other hand. If...
that is.2 with assumptions (1)(4) the parameter of interest fl. ) evidently is and so is I' + C>. the fraction of defectives () can be thought of as the mean of X/no In Example 1. v. X n independent with common distribution. For instance. When we sample.M(P) can be characterized as the mean of P. For instance. As we have seen this parametrization is unidentifiable and neither f1 nor ~ arc parameters in the sense we've defined. . formally a map. But = Var(X.1 Data.2 is now well defined and identifiable by (1. if the errors are normally distributed with unknown variance 0'2. For instance. thus. if POl = Pe'). it is natural to write e e e . Then "is identifiable whenever flx and flY exist. Similarly.1. from to its range P iff the latter map is 11. which can be thought of as the difference in the means of the two populations of responses. where 0'2 is the variance of €. implies q(BJl = q(B2 ) and then v(Po) q(B). which indexes the family P.. and Statistics 7 Dual to the notion of a parametrization.Section 1. instead of postulating a constant treatment effect ~.1. consider Example 1. say with replacement. that is. in Example 1. as long as P is the set of all Gaussian distributions. Here are two points to note: (1) A parameter can have many representations. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter (). More generally. A parameter is a feature v(P) of the distribution of X. a map from some e to P. and observe Xl. In addition to the parameters of interest.. ~ Po. is that of a parameter.3 with assumptions (1H2) we are interested in~.2. there are also usually nuisance parameters. or more generally as the center of symmetry of P. Implicit in this description is the assumption that () is a parameter in the sense we have just defined. . or the midpoint of the interquantile range of P. E(€i) = O.1. the focus of the study. Models. The (fl." = f1Y .flx.1. in Example 1. in Example 1. For instance.3) and 9 = (G : xdG(x) = J O}. our interest in studying a population of incomes may precisely be in the mean income. Parameters.. (J is a parameter if and only if the parametrization is identifiable. or the median of P. make 8 + Po into a parametrization of P.1. (2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable). in Example 1.3. from P to another space N.1.2 again in which we assume the error € to be Gaussian but with arbitrary mean~. which correspond to other unknown features of the distribution of X. we can start by making the difference of the means. G) parametrization of Example 1. a function q : + N can be identified with a parameter v( P) iff Po. then 0'2 is a nuisance parameter. . Formally. we can define (J : P + as the inverse of the map 8 + Pe.1.1.2 where fl denotes the mean income and.1. implies 8 1 = 82 . Sometimes the choice of P starts by the consideration of a particular parameter. But given a parametrization (J + Pe. Then P is parametrized by 8 = (fl1~' 0'2). For instance.
1964). For instance. In this volume we assume that the model has . Goals. statistics.. can be related to model formulation as we saw earlier. For future reference we note that a statistic just as a parameter need not be real or Euclidean valued. Mandel.. called the empirical distribution function. Deciding which statistics are important is closely connected to deciding which parameters are important and. but the true values of parameters are secrets of nature. The link for us are things we can compute. Informally. Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. this difference depends on the patient in a complex manner (the effect of each drug is complex). and Performance Criteria Chapter 1 1. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient. Our aim is to use the data inductively. a common estimate of 0. These issues will be discussed further in Volume 2. < xJ.1. X n ) are a sample from a probability P on R and I(A) is the indicator of the event A.• 1 X n ) = X ~ L~ I Xi.1. F(P)(x) = PIX. For instance. a statistic T is a map from the sample space X to some space of values T. consider situation (d) listed at the beginning of this section. to narrow down in useful ways our ideas of what the "true" P is.3 Statistics as Functions on the Sample Space I j • . Formally. a statistic we shall study extensively in Chapter 2 is the function valued statistic F. for example.2 is the statistic I i i = i i 8 2 = n1 "(Xi . is the statistic T(X 11' . in Example 1.. Thus. the fraction defective in the sample.1. Models and parametrizations are creations of the statistician. usually a Euclidean space. then our attention naturally focuses on estimating this constant If.X)" L i=l I n How we use statistics in estimation and other decision procedures is the subject of the next section. x E Ris F(X " ~ I . It estimates the function valued parameter F defined by its evaluation at x E R. This statistic takes values in the set of all distribution functions on R. we can draw guidelines from our numbers and cautiously proceed. hence.2 a cOmmon estimate of J1. Next this model. which now depends on the data.8 Statistical Models. T( x) is what we can compute if we observe X = x. . Xn)(x) = n L I(X n i < x) i=l where (X" . is used to decide what estimate of the measure of difference should be employed (ct. we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure. Databased model selection can make it difficult to ascenain or even assign a meaning to the accuracy of estimates or the probability of reaChing correct conclusions. however. . T(x) = x/no In Example 1. which evaluated at ~ X and 8 2 are called the sample mean and sample variance. Nevertheless.1..
0). 0).1 Data. 'L':' Such models will be called regular parametric models. Regression Models We end this section with two further important examples indicating the wide scope of the notions we have introduced.. Xl). in Example 1. are continuous with densities p(x..1. in situation (d) again.lO. 0). The experimenter may. Regression Models. There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. The distribution of the response Vi for the ith subject or case in the study is postulated to depend on certain characteristics Zi of the ith subject. This is obviously overkill but suppose that. However. Y n ) where Y11 ••• . Models. the number of patients in the study (the sample size) is random. Zi is a d dimensional vector that gives characteristics such as sex. assign the drug that seems to be working better to a higher proportion of patients. Example 1. .1. It will be convenient to assume(1) from now on that in any parametric model we consider either: (I) All of the P.4. patients may be considered one at a time. For instance. (B. 1. . Expectations calculated under the assumption that X rv P e will be written Eo. (A. the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other. . height. 0). and so On of the ith subject in a study. In most studies we are interested in studying relations between responses and several other variables not just treatment or control as in Example 1. For instance. age.3.. Yn are independent. In the discrete case we will use both the tennsfrequency jimction and density for p(x. for example. density and frequency functions by p(.). after a while. Thus...Section 1. Parameters. drugs A and B are given at several . Moreover.3 we could take z to be the treatment label and write onr observations as (A. Notation.1. and Statistics 9 been selected prior to the current experiment. . This is the Stage for the following. these and other subscripts and arguments will be omitted where no confusion can arise. We observe (ZI' Y. We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more infonnation. Y. weight. ) that is independent of 0 such that 1 P(Xi' 0) = 1 for all O. (zn.X" . 8). When dependence on 8 has to be observed. Thus. (2) All of the P e are discrete with frequency functions p(x.. 1990). They are not covered under our general model and will not be treated in this book. Lehmann. assign the drugs alternatively to every other patient in the beginning and then. Distribution functions will be denoted by F(·. (B. and there exists a set {XI.4 Examples. Problems such as these lie in the fields of sequential analysis and experimental deSign. This selection is based on experience with previous similar experiments (cf. we shall denote the distribution corresponding to any particular parameter value 8 by Po. in the study.1.. Regular models. ). See A. and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients. sequentially. Yn ). X m ).
In the two sample models this is implied by the constant treatment effect assumption. Then. .. then we can write. (c) whereZ nxa = (zf.~II3Jzj = zT (3 so that (b) becomes (b') This is the linear model.. d = 2 and can denote the pair (Treatment Label. in Example 1. (b) where Ei = Yi .treat the Zi as a label of the completely unknown distributions of Yi. Here J.Yn) ~ II f(Yi I Zi). Treatment Dose Level) for patient i.8.1. n.. Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems. . .2 with assumptions (1)(4).1. . In fact by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations.L(z) is an unknown function from R d to R that we are interested in. The most common choice of 9 is the linear fOlltl.3(3) is a special case of this model. semiparametric as with (I) and (2) with F arbitrary. For instance. A eommou (but often violated assumption) is (I) The ti are identically distributed with distribution F. (3) g((3. 0 . then the model is zT (a) P(YI.. On the basis of subject matter knowledge and/or convenience it is usually postulated that (2) 1'(z) = 9((3.z) = L. Then we have the classical Gaussian linear model. That is.. Zi is a nonrandom vector of values called a covariate vector or a vector of explanatory variables whereas Yi is random and referred to as the response variable or dependent variable in the sense that its distribution depends on Zi. which we can write in vector matrix form.Z~)T and Jisthen x n identity. So is Example 1. . the effect of z on Y is through f. By varying the assumptions we obtain parametric models as with (I). and nonparametric if we drop (1) and simply . .E(Yi ). .. We usually ueed to postulate more.L(z) denote the expected value of a response with given covariate vector z.1.2 uuknown.10 Statistical Models.1.. . See Problem 1. [n general. Clearly..(z) only. Often the following final assumption is made: (4) The distributiou F of (I) is N(O.2) with ... I3d) T of unknowns. (3) and (4) above. and Performance Criteria Chapter 1 dose levels.1.3 with the Gaussian twosample model I'(A) ~ 1'. I'(B) = I' + fl. . If we let f(Yi I zd denote the density of Yi for a subject with covariate vector Zi. Example 1. i=l n If we let J. i = 1.. z) where 9 is known except for a vector (3 = ((31. Goals.
xn ) ~ f(x1 I') II f(xj . . and Statistics 11 Finally.. save for a brief discussion in Volume 2.. . is that N(O.t... Of course.t = E(Xi ) be the average time for an infinite series of records.(3) + (3X'l + 'i. . x n ). ... The default assumption. . Models. . . the model for X I. we have p(edp(C21 e1)p(e31 e"e2)". at best an approximation for the wave example. and the associated probability theory models and inference for dependent data are beyond the scope of this book.. Let XI.. Consider the model where Xi and assume = J..(3cd··· f(e n Because €i =  (3e n _1). .. en' Using conditional probability theory and ei = {3ei1 + €i...(3)I'). model (a) assumes much more but it may be a reasonable first approximation in these situations.. . Parameters.1. C n where €i are independent identically distributed with density are dependent as are the X's. In fact we can write f. 0'2) density.5. . To find the density p(Xt. say. the elapsed times X I. eo = a Here the errors el. Example 1. the conceptual issues of stationarity. Then we have what is called the AR(1) Gaussian model J is the We include this example to illustrate that we need not be limited by independence. i ~ 2. n.1 Data. . 0 . X n spent above a fixed high level for a series of n consecutive wave records at a point on the seashore.Section 1.. X n is P(X1. . Xi . .. we give an example in which the responses are dependent.{3Xj1 j=2 n (1... . An example would be..t+ei. . p(e n I end f(edf(c2 .cn _d p(edp(e2 I edp(e3 I e2) . ' . A second example is consecutive measurements Xi of a constant It made by the same observer who seeks to compensate for apparent errors. ..p(e" I e1. However.(1 . .Il. Xi ~ 1. It is plausible that ei depends on Cil because long waves tend to be followed by long waves... . Let j.. X n be the n determinations of a physical constant J.n.n ei = {3ei1 + €i. Measurement Model with Autoregressive Errors. i = 1. Xl = I' + '1. . ergodicity.. i = l. . . . we start by finding the density of CI.
1. in the past. PIX = k. That is. given 9 = i/N. X has the hypergeometric distribution 'H( i.. in the inspection Example 1."" 1l"N} for the proportion (J of defectives in past shipments. N. ([. 1. The notions of parametrization and identifiability are introduced. to think of the true value of the parameter (J as being the realization of a random variable (} with a known distribution. we can construct a freq uency distribution {1l"O. Models are approximations to the mechanisms generating the observations. N. the most important of which is the workhorse of statistics. N.1) Our model is then specified by the joint distribution of the observed number X of defectives in the sample and the random variable 9. In this section we introduced the first basic notions and formalism of mathematical statistics.2.1. n). We know that. There are situations in which most statisticians would agree that more can be said For instance. the regression model.2 BAYESIAN MODELS Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data.. This is done in the context of a number of classical examples. we have had many shipments of size N that have subsequently been distributed. . This distribution does not always corresp:md to an experiment that is physically realizable but rather is thought of as measure of the beliefs of the experimenter concerning the true value of (J before he or she takes any data. and Performance Criteria Chapter 1 Summary. The general definition of parameters and statistics is given and the connection between parameters and pararnetrizations elucidated. They are useful in understanding how the outcomes can he used to draw inferences that go beyond the particular experiment. . How useful a particular model is is a complex mix of how good the approximation is and how much insight it gives into drawing inferences. it is possible that. .. i = 0. 1l"i is the frequency of shipments with i defective items.2) This is an example of a Bayesian model. Goals. If the customers have provided accurate records of the number of defective items that they have found. Thus. vector observations X with unknown probability distributions P ranging over models P. .I 12 Statistical Models. and indeed necessary. There is a substantial number of statisticians who feel that it is always reasonable. Now it is reasonable to suppose that the value of (J in the present shipment is the realization of a random variable 9 with distribution given by P[O = N] I = IT" i = 0.. 0 = N I I ([. We view statistical models as useful tools for learning from the outcomes of experiments and studies.2. a .
Lindley (1965).1 of being defective independently of the other members of the shipment. we can obtain important and useful results and insights. An interesting discussion of a variety of points of view on these questions may be found in Savage et al. This would lead to the prior distribution e.. However. We shall return to the Bayesian framework repeatedly in our discussion. For a concrete illustration. In this section we shall define and discuss the basic clements of Bayesian models. the information about (J is described by the posterior distribution. and Berger (1985). (J) as a conditional density or frequency function given 8 = we will denote it by p(x I 0) for the remainder of this section. then by (B.Section 1.1)'(0..9)'00'. . 1. The joint distribution of (8. De Groot (1969). whose range is contained in 8. insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). (1962). Savage (1954).2.2. The most important feature of a Bayesian model is the conditional distribution of 8 given X = x.3). (1. After the value x has been obtained for X. suppose that N = 100 and that from past experience we believe that each item has probability . x) = ?T(O)p(x.3) Because we now think of p(x. let us turn again to Example 1.3). We now think of Pe as the conditional distribution of X given (J = (J. . 100. For instance.2.1. Raiffa and Schlaiffer (1961). To get a Bayesian model we introduce a random vector (J. Eqnation (1. the information or belief about the true value of the parameter is described by the prior distribution. However. the resulting statistical inference becomes subjective. There is an even greater range of viewpoints in the statistical community from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of (J has an objective interpretation in tenus of frequencies. = ( I~O ) (0. Suppose that we have a regular parametric model {Pe : (J E 8}.2) is an example of (1.1. (1. (1) Our own point of view is that SUbjective elements including the views of subject matter experts arc an essential element in all model building.4) for i = 0.2 Bayesian Models 13 Thus. ?T. which is called the posterior distribution of 8.2. (8. by giving (J a distribution purely as a theoretical tool to which no subjective significance is attached. X) is that of the outcome of a random experiment in which we first select (J = (J according to 7r and then. If both X and (J are continuous or both are discrete. In the "mixed" cases such as (J continuous X discrete. 1. I(O.O). select X according to Pe. with density Or frequency function 7r. The theory of this school is expounded by L. Before the experiment is performed. the joint distribution is neither continuous nor discrete. Before sampling any items the chance that a given shipment contains . given (J = (J. The function 7r represents our belief or information about the parameter (J before the experiment and is called the prior density or frequency function.l. X) is appropriately continuous or discrete with density Or frequency function.
2.. the number of defectives left after the drawing.Xn are indicators of n Bernoulli trials with probability of success () where 0 < 8 < 1.2.1. (1. . is iudependeut of X and has a B(81. In the cases where 8 and X are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of (0..52) 0.2. 0..7) In general.2.X > 10 I X = lOJ P [(1000 . .1.3). P[1000 > 20 I X = 10] P[IOOO .k dt (1..:. Goals.1 > .1100(0..6) we argue loosely as follows: If be fore the drawing each item was defective with probability ... 1r(9)9k (1 _ 9)nk 1r(Blx" . If we assume that 8 has a priori distribution with deusity 1r.8. Therefore.1) distribution.xn )= Jo'1r(t)t k (lt)n.<I> 1000 . some variant of Bayes' rule (B.9) I .30. 1r(t)p(x I t) if 8 is discrete. (i) The posterior distribution is discrete or continuous according as the prior distri bution is discrete or continuous.6) To calculate the posterior probability given iu (1. This leads to P[1000 > 20 I X = lOJ '" 0.001. . to calculate the posterior.1 and good with probability .30.1) (1. . P[IOOO > 201 = 10] P [ .1)(0.2.8) as posterior density of 0. .2. ( 1. and Performance Criteria Chapter 1 I . 14 Statistkal Models.II ~. X) given by (1. Thus.9 independently of the other items.181(0. this will continue to be the case for the items left in the lot after the 19 sample items have been drawn. Example 1. 1. 1008 .9)(0.8) roo 1r(9)p(x 19) 1r(t)p(x I t)dt if 8 is continuous. theu 1r(91 x) 1r(9)p(x I 9) 2..9)(0. Here is an example.1)(0.2. I 20 or more bad items is by the normal approximation with continuity correction.X) .10).5) 5 ) = 0. Bernoulli Trials.10 '" .<1>(0.1) '" I .1100(0.. Suppose that Xl.9) > .9 ] ..4) can be used.X.J C3 (1.9) I . (ii) If we denote the corresponding (posterior) frequency function or density by 1r(9 I x).. Now suppose that a sample of 19 has been drawn in which 10 defective items are found. . Specifically. we obtain by (1.2.181(0.2. (A 15.j .
= 1.2. x n ).6. which corresponds to using the beta distribution with r = . Or else we may think that 1[(8) concentrates its mass near a small number.2 indicates.05 and its variance is very small. .2.16. we only observe We can thus write 1[(8 I k) fo". we need a class of distributions that concentrate on the interval (0.11» be B(k + T. . s) density (B.k + s) where B(·.05.15.2. and the posterior distribution of B givenl:X i =kis{3(k+r.2. We also by example introduce the notion of a conjugate family of distributions.8) distribution. must (see (B. As Figure B. (A..2.2. This class of distributions has the remarkable property that the resulting posterior distributions arc again beta distributions.2.10) The proportionality constant c. Suppose. which has a B( n. For instance.2. Another bigger conjugate family is that of finite mixtures of beta distributionssee Problem 1. If we were interested in some proportion about which we have no information or belief.2. Summary. and s only. the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions though by no means all. L~ 1 Xi' We also obtain the same posterior density if B has prior density 1r and 1 Xi. k = I Xi· Note that the posterior density depends on the data only through the total number of successes.<. Then we can choose r and s in the (3(r. r.2. say 0.9) we obtain 1 L: L7 (}k+rl (1 _ 8)nk+s1 c (1.nk+s). we are interested in the proportion () of "geniuses" (IQ > 160) in a particular city. Now we may either have some information about the proportion of geniuses in similar cities of the country or we may merely have prejudices that we are willing to express in the fonn of a prior distribution on B.2 Bayesian Models 15 for 0 < () < 1.·) is the beta function. We present an elementary discussion of Bayesian models. 0 A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family. 8) distribution given B = () (Problem 1. so that the mean is r/(r + s) = 0. we might take B to be uniformly distributed on (0. To get infonnation we take a sample of n individuals from the city. We return to conjugate families in Section 1. Specifically. where k ~ l:~ I Xj. If n is small compared to the size of the city. . 1).13) leads us to assume that the number X of geniuses observed has approximately a 8(n.n..(8 I X" . nonVshaped bimodal distributions are not pennitted. which depends on k. Such families are called conjugaU? Evidently the beta family is conjugate to the binomial. .Section 1.9). We may want to assume that B has a density with maximum value at o such as that drawn with a dotted line in Figure B..s) distribution. upon substituting the (3(r. To choose a prior 11'.ll) in (1. for instance.. One such class is the twoparameter beta family. 'i = 1.2. n . The result might be a density such as the one marked with a solid line in Figure B.1). introduce the notions of prior and posterior distributions and give Bayes rule. Xi = 0 or 1.
of g((3.3. sex. Given a statistical model.3." In testing problems we. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not pennitted). (age.16 Statistical Models. if there are k different brands.1. For instance.e. the receiver wants to discriminate and may be able to attach monetary Costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment.2. the expected value of Y given z. at a first cut. z) we can estimate (3 from our observations Y. A very important class of situations arises when. However.3 THE DECISION THEORETIC FRAMEWORK I . Example 1. In other situations certain P are "special" and we may primarily wish to know whether the data support "specialness" or not. such as.1 or the physical constant J. i we can try to estimate the function 1'0. the information we want to draw from data can be put in various forms depending on the purposes of our analysis. if we believe I'(z) = g((3. shipments. 0 In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function. Ranking. as it's usually called. In Example 1. These are estimation problems. For instance. If J. placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to pennit the marketing of drugs that do no good. in Example 1. We may have other goals as illustrated by the next two examples. 0 Example 1.l > J. "hypothesis" or "nonspecialness" (or alternative). there are k! possible rankings or actions. Prediction. Unfortunately {t(z) is unknown. Making detenninations of "specialness" corresponds to testing significance. state which is supported by the data: "specialness" or. Thus.1.2.l < Jio or those corresponding to J. Thus. a reasonable prediction rule for an unseen Y (response of a new patient) is the function {t(z).1.3. in Example 1. we have a vector z.1. as in Example 1." (} > Bo.1.lo as special. there are many problems of this type in which it's unclear which oftwo disjoint sets of P's. We may wish to produce "best guesses" of the values of important parameters. say a 50yearold male patient's response to the level of a drug. i. P's that correspond to no treatment effect (i. A consumer organization preparing (say) a report on air conditioners tests samples of several brands. with (J < (Jo.1 do we use the observed fraction of defectives . Note that we really want to estimate the function Ji(')~ our results will guide the selection of doses of drug for future patients. whereas the receiver does not wish to keep "bad. if we have observations (Zil Yi).lo means the universe is expanding forever and J. and as we shall see fonnally later.lo is the critical matter density in the universe so that J.1 contractual agreement between shipper and receiver may penalize the return of "good" shipments.1. Po or Pg is special and the general testing problem is really one of discriminating between Po and Po. Zi) and then plug our estimate of (3 into g.l < J. Goals. drug dose) T that can be used for prediction of a variable of interest Y.4.l > JkO correspond to an eternal alternation of Big Bangs and expansions. As the second example suggests. For instance. for instance. and Performance Criteria Chapter 1 1. Intuitively.L in Example 1.. then depending on one's philosophy one could take either P's corresponding to J. 1 < i < n. say. There are many possible choices of estimates.1. say. 
the fraction defective B in Example 1. one of which will be announced as more consistent with the data than others.
I} with 1 corresponding to rejection of H. and reliability of statistical procedures. A = {a. On the other hand. 1. Ranking. I)}. We usnally take P to be parametrized. Here are action spaces for our examples. Thus.1. in Example 1. it is natural to take A = R though smaller spaces may serve equally well.1. if we have three air conditioners. for instance. ~ 1 • • • l I} in Example 1.2. 1. (2. accuracy. . 2). (4) provide guidance in the choice of procedures for analyzing outcomes of experiments.. (3. or p.1.. . .1. P = {P. we would want a posteriori estimates of perfomlance. By convention. (3) provide assessments of risk. That is. Here quite naturally A = {Permntations (i I . on what criteria of performance we use. 3.. 2).. 3. ik) of {I. Here only two actions are contemplated: accepting or rejecting the "specialness" of P (or in more usual language the hypothesis H : P E Po in which we identify Po with the set of "special" P's). (1. Testing. 1.Section 1. defined as any value such that half the Xi are at least as large and half no bigger? The same type of question arises in all examples. whatever our choice of procedure we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we're doing. Action space. Intuitively. or combine them in some way? In Example 1. we need a priori estimates of how well even the best procedure can do. taking action 1 would mean deciding that D. If we are estimating a real parameter such as the fraction () of defectives.3. in ranking what mistakes we've made. (2. These examples motivate the decision theoretic framework: We need to (I) clarify the objectives of a study. (2) point to what the different possible actions are. is large. we begin with a statistical model with an observation vector X whose distribution P ranges over a set P. Thus.1. and so on. even if. A = {O. X = ~ L~l Xi. 2. 3). : 0 E 8).3 The Decision Theoretic Framework 17 X/n as our estimate or ignore the data and use hislOrical infonnation on past shipments. .3 even with the simplest Gaussian model it is intuitively clear and will be made precise later that.6. in Example 1.2 lO estimate J1 do we use the mean of the measurements.1 Components of the Decision Theory Framework As in Section 1. in estimation we care how far off we are.1.1. Thus.1.1. In designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter. 3). in testing whether we are right or wrong. once a study is carried out we would probably want not only to estimate ~ but also know how reliable our estimate is. or the median. A = {( 1. 1). most significantly. Estimation. For instance. a large a 2 will force a large m l n to give us a good chance of correctly deciding that the treatment effect is there. The answer will depend on the model and. in Example 1. k}}. 2. there are 3! = 6 possible rankings. (3. A new component is an action space A of actions or decisions or claims that we can contemplate making. In any case. . # 0.3. in Example 1.
as we shall see (Section 5. is P. a) 1(0. Although estimation loss functions are typically symmetric in v and a. Estimation. Loss function. Q is the empirical distribution of the Zj in .qd(lJ)) and a ~ (a""" ad) are vectors. Here A is much larger. a) if P is parametrized.1). they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient. = max{laj  vjl. 1'() is the parameter of interest. I(P.. M) would be our prediction of response or no response for a male given treatment B. and z E Z.. For instance.al < d. sometimes can genuinely be quantified in economic terms. ~ 2. l(P.. the expected squared error if a is used. examples of loss functions are 1(0.V. "does not respond" and "responds. In estimating a real valued parameter v(P) or q(6') if P is parametrized the most commonly used loss function is. and z = (Treatment. l( P.2.a)2. a) = 1 otherwise. Sex)T. the probability distribution producing the data. although loss functions.Vd) = (q1(0)." respectively.3. which penalizes only overestimation and by the same amount arises naturally with lower confidence bounds as discussed in Example 1.a)2). The interpretation of I(P. a). as the name suggests. say. Far more important than the choice of action space is the choice of loss function defined as a function I : P X A ~ R+. . is the nonnegative loss incurred by the statistician if he or she takes action a and the true "state of Nature.a) ~ (q(O) . r " I' 2:)a. if Y = 0 or 1 corresponds to. in the prediction example 1. .." that is. If we use a(·) as a predictor and the new z has marginal distribution Q then it is natural to consider..3. or 1(0. As we shall see. If.i = We can also consider function valued parameters. [v(P) . then a(B. asymmetric loss functions can also be of importance. a) = J (I'(z) . 1 d} = supremum distance.: la. Other choices that are. For instance. Closely related to the latter is what we shall call confidence interval loss. a) = (v(P) . [f Y is real. If v ~ (V1. a) = min {(v( P) . .a)2 (or I(O. d'}. Goals. ' . Quadratic Loss: I(P. This loss expresses the notion that all errors within the limits ±d are tolerable and outside these limits equally intolerable. a) 1(0. and truncated quadraticloss: I(P. say. less computationally convenient but perhaps more realistically penalize large errors less are Absolute Value Loss: l(P. For instance. a) I d . a) = l(v < a).18 Statistical Models. .)2 = Vj [ squared Euclidean distance/d absolute distance/d 1.a(z))2dQ(z).ai. and Performance Criteria Chapter 1 Prediction. a) = [v( P) . r I I(P.3. a) = 0. a is a function from Z to R} with a(z) representing the prediction we would make if the new unobserved Y had covariate value z. Evidently Y could itself range over an arbitrary space Y and then R would be replaced by Y in the definition of a(·). . A = {a ..
. a) ~ 0 if 8 E e a (The decision is correct) l(0.. We next give a representation of the process Whereby the statistician uses the data to arrive at a decision.. that is.I loss: /(8.). Testing.).3 with X and Y distributed as N(I' + ~. in Example 00 defectives results in a penalty of s dollars whereas every defective item sold results in an r dollar replacement cost. . the statistician takes action o(x).. is a or not. other economic loss functions may be appropriate. .y is close to zero. I) /(8. respectively.1.3 The Decision Theoretic Framework 19 the training set (Z 1..3. (Zn.1) Decision procedures. if we are asking whether the treatment effect p'!1'ameter 6. The decision rule can now be written o(x.1. . a Estimation.9.2) and N(I'..Section 1. we implicitly discussed two estimates or decision rules: 61 (x) = sample mean x and 02(X) = X = sample median. ]=1 which is just n. y) = o if Ix::: 111 (J <c (1. Here we mean close to zero relative to the variability in the experiment. where {So. If we take action a when the parameter is in we have made the correct decision and the loss is zero. Of course. The data is a point X = x in the outcome or sample space X. This 0 . the decision is wrong and the loss is taken to equal one. relative to the standard deviation a. Using 0 means that if X = x is observed.. then a reasonable rule is to decide Ll = 0 if our estimate x . 0.1 times the squared Euclidean distance between the prediction vector (a(z. . e ea.2) I if Ix ~ 111 (J >c . In Example 1. Otherwise. Then the appropriate loss function is 1. (1.).a) = I n n. Y).2).3. is a partition of (or equivalently if P E Po or P E P. this leads to the commonly considered I(P.. For the problem of estimating the constant IJ.3 we will show how to obtain an estimate (j of a from the data. I) 1(8.. in the measurement model. In Section 4.1'(zn))T Testing...1 loss function can be written as ed. and to decide Ll '# a if our estimate is not close to zero. L(I'(Zj) _0(Z.))2. Y n). For instance. . a(zn) jT and the vector parameter (I'( z.0) sif8<80 Oif8 > 80 rN8. a) = 1 otherwise (The decision is wrong). We ask whether the parameter () is in the subset 6 0 or subset 8 1 of e.1 suppose returning a shipment with °< 1(8. We define a decision rule or procedure to be any function from the sample space taking its values in A. .
. fi) = Ep(fi(X) .d. fJ(x)). Estimation of IJ.3. . and assume quadratic loss. lfwe use the mean X as our estimate of IJ. MSE(fi) I. Example 1. then the loss is l(P. (fi . the risk or riskfunction: The risk function. We illustrate computation of R and its a priori use in some examples...E(fi)] = O. (Continued). and Performance Criteria Chapter 1 where c is a positive constant called the critical value. Proof.. Moreover. Suppose v _ v(P) is the real parameter we wish to estimate and fJ(X) is our estimator (our decision rule).v(P))' (1. (]"2) errors.i.) 0 We next illustrate the computation and the a priori and a posteriori use of the risk function. .6) = Ep[I(P. so is the other and the result is trivially true. measurements of IJ. the cross term will he zero because E[fi . our risk function is called the mean squared error (MSE) of. 6 (x)) as a random variable and introduce the riskfunction R(P. The MSE depends on the variance of fJ and on what is called the bias ofv where = E{fl) . The other two terms are (Bias fi)' and Var(v). withN(O. Suppose Xl.. but for a range of plausible x·s.1']. If we expand the square of the righthand side keeping the brackets intact and take the expected value. We do not know the value of the loss because P is unknown. Goals. e Estimation.. That is.) = 2L n ~ n . 1 is the loss function.3. i=l .v) = [V  + [E(v) .3. R maps P or to R+.3. we turn to the average or mean loss over the sample space. How do we choose c? We need the next concept of the decision theoretic framework. and X = x is the outcome of the experiment. and is given by v= MSE{fl) = R(P. X n are i. for each 8. (If one side is infinite. If we use quadratic loss. we regard I(P. Thus. Thus. then Bias(X) 1 n Var(X) . we typically want procedures to have good properties not at just one particular x.1. A useful result is Bias(fi) Proposition 1. R('ld) is our a priori measure of the performance of d.3) where for simplicity dependence on P is suppressed in MSE. If d is the procedure used. () is the true value of the parameter.v can be thought of as the "longrun average error" of v. Write the error as 1 = (Bias fi)' + E(fi)] Var(v)." Var(X.20 Statistical Models. 6(X)] as the measure of the perfonnance of the decision rule o(x).
3.1")' ~ E(~) can only be evaluated numerically (see Problem 1. 00 (1. in general. say. or numerical and/or Monte Carlo computation. The choice of the weights 0. whereas if we have a random sample of measurements X" X 2. computational difficulties arise even with quadratic loss as soon as we think of estimates other than X.4.6 through a .23). n (1. that the E:i are i.2 ... is possible.4) which doesn't depend on J.4) can be used for an a priori estimate of the risk of X. If. 1 X n from area A. Ii ~ (0. analytic. . but for absolute value loss only approximate. for instance by 8 2 = ~ L~ l(X i .Xn ) (and we. If we want to be guaranteed MSE(X) < £2 we can do it by taking at least no = <:TolE measurements. Example 1.3.1. Let flo denote the mean of a certain measurement included in the U. a natural guess for fl would be flo.t. .1 2 MSE(X) = R(I". . (.3 The Decision Theoretic Framework 21 and. .a 2 . = X.2 and 0.6). .1"1 ~ Eill 2 ) where £.3. write for a median of {aI.Section 1. Suppose that the precision of the measuring instrument cr2 is known and equal to crJ or where realistically it is known to be < crJ. or approximated asymptotically.3.fii V.i. census. Next suppose we are interested in the mean fl of the same measurement for a certain area of the United States.3.X)2. the £.1). Suppose that instead of quadratic loss we used the more natural(l) absolute value loss. then for quadratic loss.. If we have no data for area A. Then (1.X) =. a'. we may wantto combine tLo and X = n 1 L~ 1 Xi into an estimator.2. age or income. planning is not possible but having taken n measurements we can then estimate 0.5) This harder calculation already suggests why quadratic loss is really favored.8)X.3.1".. For instance. of course. The a posteriori estimate of risk (j2/n is.3..fii 1 af2 00 Itl<p(t)dt = .a 2 then by (A 13. if X rnedian(X 1 . X) = a' (P) / n still. In fact.fii/a)[ ~ a . for instance.. ( N(o. If we only assume. X) = EIX . Then R(I".d.S. an}). an estimate we can justify later. We shall derive them in Section 1.8 can only be made on the basis of additional knowledge about demography or the economy. as we assumed. or na 21(n . 1) and R(I". as we discussed in Example 1. areN(O. itself subject to random error.X) = ". E(i ..2. 0 ~ = a We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1. .1 is useful for evaluating the performance of competing estimators.2)1"0 + (O. R( P. with mean 0 and variance a 2 (P). by Proposition 1. If we have no idea of the value of 0.
21'0 R(I'. However.3. respectively. 22 Statistical Models. Because we do not know the value of fL.64)0" In MSE(ii) = . We easily find Bias(ii) Yar(ii) 0. Here we compare the performances of Ii and X as estimators of Jl using MSE.0)P[6(X. if we use as our criteria the maximum (over p.1')' + (. D Testing. X) = 0" In of X with the minimum relative risk inf{MSE(ii)IMSE(X). = 0. Ii) of ii is smaller than the risk R(I'. Y) = 01 if i'> i O.n. neither estimator can be proclaimed as being better than the other. MSE(ii) MSE(X) i . I' E R} being 0. R(i'>. Figure 1.64)0" In.2(1'0 . the risk is = 0 and i'> i 0 can only take on the R(i'>.1 gives the graphs of MSE(ii) and MSE(X) as functions of 1'. The two MSE curves cross at I' ~ 1'0 ± 30'1 . A test " " e e e .1. then X is optimal (Example 3.3. the risk R(I'.81' .6) P[6(X.6) = 1(i'>. The test rule (1.1') (0. Y) = which in the case of 0 . Goals. I)P[6(X. iiJ + 0. I. using MSE.04(1'0 . and Performance Criteria Chapter 1 i formal Bayesian analysis using a nonnal prior to illustrate a way of bringing in additional knowledge.64 when I' = 1'0.2) for deciding between i'> two values 0 and 1. Y) = IJ. thus.8)'Yar(X) = (.I.1 loss is 01 + lei'>. In the general case X and denote the outcome and parameter space.I' = 0. The mean squared errdrs of X and ji. and we are to decide whether E 80 or E 8 where 8 = 8 0 U 8 8 0 n 8 . Y) = I] if i'> = 0 P[6(X.3.) ofthe MSE (called the minimax criteria). If I' is close to 1'0..4).3. Figure 1.
6(X) = IIX E C]. where 1 denotes the indicator function. the focus is on first providing a small bound. and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding Do = 0 when Do ¥ 0). If <5(X) = 1 and we decide 8 E 8 1 when in fact E 80.3 The Decision Theoretic Framework 23 function is a decision rule <5(X) that equals 1 On a set C C X called the critical region and equals 0 on the complement of C.[X < k[.8) for all possible distrihutions P of X. If (say) X represents the amount owed in the sample and v is the unknown total amount owed. a < v(P) . Here a is small. on the probability of Type I error. that is.3.1. in the treatments A and B example. Finding good test functions corresponds to finding critical regions with smaIl probabilities of error. 8 > 80 .18). This corresponds to an a priori bound on the risk of a on v(X) viewed as a decision procedure with action space R and loss function.[X < k].6) P(6(X) = 0) if 8 E 8 1 Probability of Type II error. an accounting finn examining accounts receivable for a finn on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed. (1. Such a v is called a (1 . we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding ~ =1= 0 when ~ = 0). "Reject the shipment if and only if X > k.1 lead to (Prohlem 1. I(P. For instance.05 or .3.1) and tests 15k of the fonn.a) upper coofidence houod on v.05. Thus. say . 8 < 80 rN8P.7) Confidence Bounds and Intervals Decision theory enables us to think clearly about an important hybrid of testing and estimation. [n the NeymanPearson framework of statistical hypothesis testing. Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter v. confidence bounds and intervals (and more generally regions). it is natural to seek v(X) such that P[v(X) > vi > 1.8) sP. For instance. usually . For instance. whereas if <5(X) = 0 and we decide () E 80 when in fact 8 E 8 b we call the error a Type II error. and then trying to minimize the probability of a Type II error.Section 1.6) Probability of Type [error R(8. the risk of <5(X) is e R(8. a > v(P) I.o) 0.a (1.3.01 or less. the loss function (1.6) E(6(X)) ~ P(6(X) ~ I) if8 E 8 0 (1.IX > k] + rN8P. R(8." in Example 1.3. This is not the only approach to testing.3. we call the error committed a Type I error.
The 0.2 . We shall go into this further in Chapter 4. e Example 1. . for some constant c > O.1 nature makes it resemble a testing loss function and..a>v(P) . are available. rather than this Lagrangian form.3. and a3. The same issue arises when we are interested in a confidence interval ~(X) l v(X) 1for v defined by the requirement that • P[v(X) < v(P) < Ii(X)] > 1 . The loss function I(B.24 Statistical Models. where x+ = xl(x > 0). or repair it. and the risk of all possible decision procedures can be computed and plotted. we could operate. c ~1 . the connection is close. or sell partial rights. 1. though. Suppose the following loss function is decided on TABLE 1.5. I I (Oil) (No oil) Bj B2 0 12 10 I 5 6 \ .a ·1 for all PEP. though upper bounding is the primary goal. Suppose we have two possible states of nature. a patient either has a certain disease or does not. replace it. general criteria for selecting "optimal" procedures. as we shall see in Chapter 4.3. Suppose that three possible actions. we could drill for oil. a) (Drill) aj (Sell) a2 (Partial rights) a3 . which we represent by Bl and B . or wait and see.a)  a~v(P) . We next tum to the final topic of this section. In the context of the foregoing examples. an asymmetric estimation type loss function. For instance = I(P. It is clear. sell the location. and so on. The decision theoretic framework accommodates by adding a component reflecting this.a<v(P). We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model.: Comparison of Decision Procedures In this section we introduce a variety of concepts used in the comparison of decision procedures. in fact it is important to get close to the truthknowing that at most 00 doJlars are owed is of no use. A has three points. Typically.3. aJ. a component in a piece of equipment either works or does not work. we could leave the component in.8) and then see what one can do to control (say) R( P. For instance. and Performance Criteria Chapter 1 1 . a 2 certain location either contains oil or does not. Goals. administer drugs. What is missing is the fact that. it is customary to first fix a in (1. a2. We shall illustrate some of the relationships between these ideas using the following simple example in which has two members. .1. thai this fonnulation is inadequate because by taking D 00 we can achieve risk O. Ii) = E(Ii(X) v(P))+.3.
5.5 4. and so on. B) given by the following table TABLE 1.4." and so on.) R(B. whereas if there is no oil.5 9.6 and 0.Section 1. if there is oil and we drill.4) ~ 7. R(025.. the loss is 12. R(B2. (R(O" 5). a. an experiment is conducted to obtain information about B resulting in the random variable X with possible values coded as 0. 5)) and if k = 2 we can plot the set of all such points obtained by varying 5.4 10 6 6.)) 1 1 R(B . it is known that formation 0 occurs with frequency 0. 2 a.4.3. and frequency function p(x.). i Rock formation x o I 0.)) are given in Table 1. Risk points (R(B 5.4 = 1.2 for i = 1.4 8 8. formations 0 and 1 occur with frequencies 0.5(X))] = I(B. Possible decision rules 5i (x) .5. a.3. 0 . B2 Thus.5 8.0 9 5 6 It remains to pick out the rules that are "good" or "best.)P[5(X) = ad +1(B. .) 0 12 2 7 7. 3 a." Criteria for doing this will be introduced in the next subsection. the loss is zero.7. and when there is oil. Next.3. TABLE 1.4 and graphed in Figure 1. 8 a3 a2 9 a3 a3 Here.6 0.. a3 4 a2 a.). We list all possible decision rules in the following table. e TABLE 1. . ." 02 corresponds to ''Take action Ul> if X = 0. a. 5 a.5 3 7 1.3 and formation 1 with frequency 0.7 0..2 (Oil) (No oil) e. The risk of 5 at B is R(B.3) For instance. we can represent the whole risk function of a procedure 0 by a point in kdimensional Euclidean space.3.. I x=O x=1 a. a.52 ) + 10(0. 6 a. 1 9. e. .a2)P[5(X) ~ a2] + I(B.3 The Decision Theoretic Framework 25 Thus.3 0.6 4 5 1 " 3 5. a3 7 a3 a. take action U2. The risk points (R(B" 5..3.6. 01 represents "Take action Ul regardless of the value of X. if X = 1. 5. whereas if there is no oil and we drill.6) + 1(0.3.5) E[I(B. R(B" 52) R(B2 . X may represent a certain geological formation.7) = 7 12(0.).6 3 3.a3)P[5(X) = a3]' 0(0.2. The frequency function p(x. 1. If is finite and has k members. R(B k .
1. Consider.4 . a5).   .S) < R(8. S. Extensions of unbiasedness ideas may be found in Lehmann (1997. Symmetry (or invariance) restrictions are discussed in Ferguson (1967). It is easy to see that there is typically no rule e5 that improves all others. in estimating () E R when X '"'' N(().S. Si).9. if we ignore the data and use the estimate () = 0.) > R( 8" S6)' The problem of selecting good decision " procedures has been attacked in a variety of ways. neither improves the other. and only if. and Performance Criteria Chapter 1 R(8 2 .3 Bayes and Minimax Criteria The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing.3. For instance.8 5 .) I • 3 10 .3. Section 1. . . and S6 in our example. we obtain M SE(O) = (}2.5 0 0 5 10 R(8 j . We say that a procedure <5 improves a procedure Of if. The risk points (R(8 j .9 . Goals. Here R(8 S. Researchers have then sought procedures that improve all others within the class.26 Statistical Models. or level of significance (for tests).Si) Figure 1. R(8. unbiasedness (for estimates and tests). (2) A second major approach has been to compare risk functions by global crite .2 .S') for all () with strict inequality for some (). The absurd rule "S'(X) = 0" cannot be improved on at the value 8 ~ 0 because Eo(S'(X)) = 0 ifand only if O(X) = O. Usually.2. i (1) Narrow classes of procedures have been proposed using criteria such as con siderations of symmetry. for instance. We shall pursue this approach further in Chapter 3..6 .7 .. R( 8" Si)). i = 1.) < R( 8" S6) but R( 8" S.5). if J and 8' are two rules.
9 8.8 6 In the Bayesian framework 0 is preferable to <5' if. (1.. the only reasonable computational method.48 7. In this framework R(O. = 0. r(o.6 4 4.3. The method of computing Bayes procedures by listing all available <5 and their Bayes risk is impracticable in general. This quantity which we shall call the Bayes risk of <5 and denote r( <5) is then.5 7 7.3.I.3.3.7 6. it has smaller Bayes risk. Then we treat the parameter as a random variable () with possible values ()1.9) The second preceding identity is a consequence of the double expectation theorem (B. suppose that in the oil drilling example an expert thinks the chance of finding oil is .2. + 0.3. ..38 9. If we adopt the Bayesian point of view.3.10) r( 0) ~ 0.). 0)11(0).5 gives r( 0. to Section 3.3.6 12 2 7.8 10 6 3. Bayes: The Bayesian point of view leads to a natural global criterion.9).2.4 8 4.2. We postpone the consideration of posterior analysis. but can proceed to calculate what we expect to lose on the average as () varies. r(o") = minr(o) reo) = EoR(O. .4 5 2.) maxi R(0" Oil. r( 09) specified by (1. O(X)) I () = ()]. Note that the Bayes approach leads us to compare procedures on the basis of. 0) isjust E[I(O. 11(0.5 we for our prior.20) in Appendix B. (1.5.o(x))]. Recall that in the Bayesian model () is the realization of a random variable or vector () and that Pe is the conditional distribution of X given 0 ~ O. TABLE 1. the expected loss. If there is a rule <5*. . given by reo) = E[R(O. and only if.) ~ 0. such that • see that rule 05 is the unique Bayes rule then it is called a Bayes rule.2R(0..3 The Decision Theoretic Framework 27 ria rather than on a pointwise basis. R(0" Oi)) I 9. .o)] = E[I(O. Bayes and maximum risks oftbe procedures of Table 1.0).5 9 5.8R(0. To illustrate. . if () is discrete with frequency function 'Tr(e).02 8. therefore. we need not stop at this point. and reo) = J R(O. which attains the minimum Bayes risk l that is.92 5. We shall discuss the Bayes and minimax criteria. if we use <5 and () = ().o)1I(O)dO.8.3. ()2 and frequency function 1I(OIl The Bayes risk of <5 is. From Table 1.0) Table 1.Section 1.6 3 8.
is called minimax (minimizes the maximum risk). Player II then pays Player I. Nevertheless l in many cases the principle can lead to very reasonable procedures.J') . Nature's intentions and degree of foreknowledge are not that clear and most statisticiaqs find the minimax principle too conservative to employ as a general rule.5 we might feel that both values ofthe risk were equally important. < sup R(O.J). .20 if 0 = O . J). The maximum risk of 0* is the upper pure value of the game. if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used.4. For instance. if and only if. This is. J)) we see that J 4 is minimax with a maximum risk of 5. R(O. Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to emplOy.4. Nature's choosing a 8. To illustrate computation of the minimax rule we tum to Table 1. R(02. in Example 1. in Example 1.75 if 0 = 0.3. we prefer 0" to 0"'. It aims to give maximum protection against the worst that can happen. Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk. e 4. = infsupR(O. the sel of all decision procedures. but only ac. we toss a fair coin and use 04 if the coin lands heads and 06 otherwise. Such comparisons make sense even if we do not interpret 1r as a prior density or frequency. J'). 6). suppose that.3." Nature (Player I) picks a independently of the statistician (Player II). I Randomized decision rules: In general. It is then natural to compare procedures using the simple average ~ [R( fh 10) + R( fh. But this is just Bayes comparison where 1f places equal probability on f}r and ()2. supR(O. Of course.(2) We briefly indicate "the game of decision theory. Our expected risk would be.28 Statistical Models. J) A procedure 0*. Goals. For simplicity we shall discuss only randomized . 4. a randomized decision procedure can be thought of as a random experiment whose outcomes are members of V. which has . . a weight function for averaging the values of the function R( B. .5. and Performance Criteria Chapter 1 if (J is continuous with density rr(8). which makes the risk as large as possible. For instance. The criterion comes from the general theory of twoperson zero sum games of von Neumann. This criterion of optimality is very conservative. if V is the class of all decision procedures (nonrandomized). 2 The maximum risk 4. From the listing ofmax(R(O" J). who picks a decision procedure point 8 E J from V. 8)]. The principle would be compelling.3. sup R(O.75 is strictly less than that of 04.
Finding the Bayes rule corresponds to finding the smallest c for which the line (1. then all rules having Bayes risk c correspond to points in S that lie on the line irl + (1 .).J) ~ L.(0 d = i = 1 . S is the convex hull of the risk points (R(0" Ji ).. All points of S that are on the tangent are Bayes.\. (1. Jj ). A point (Tl. i=l (1. (1. 0 < i < 1. If the randomized procedure l5 selects l5i with probability Ai. S= {(rI.. By (1. 9 (Figure 2 1.. (1. l5q of nOn randomized procedures. T2) on this line can be written AR(O"Ji ) + (1.3..'Y) that is tangent to S al the lower boundary of S. Ji )). we then define q R(O.5.Section 1.r). For instance. R(O .3).R(OI.3. TWo cases arise: (I) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule.13) intersects S.R(O. Ai >0. i = 1 . A randomized minimax procedure minimizes maxa R(8. AR(02.3. minimizes r(d) anwng all randomized procedures. including randomiZed ones.1). This is thai line with slope . Ji ) + (1 ..13) defines a family of parallel lines with slope i/(1 . J)) and consider the risk sel 2 S = {(RrO"J).5.J. We will then indicate how much of what we learn carries over to the general case. That is. i = 1.12) r(J) = LAiEIR(IJ. J).A)R(O"Jj ).. Jj .3). when ~ = 0.i/ (1 .. 1 q.Ji)' r2 = tAiR(02.R(02.r2):r.3.3. the Bayes risk of l5 (1.3. which is the risk point of the Bayes rule Js (see Figure 1.3. this point is (10. Ei==l Al = 1. . R(O .d) anwng all randomized procedures.13) As c varies.2.3.10).Ji )] i=l A randomized Bayes procedure d. = ~A. we represent the risk of any procedure J by the vector (R( 01 . (2) The tangent is the line connecting two unonrandomized" risk points Ji . ~Ai=1}.J.3.A)R(02. given a prior 7r on q e.J): J E V'} where V* is the set of all procedures. .11) Similarly we can define.3. .14) . If .3.3 The Decision Theoretic Framework 29 procedures that select among a finite set (h 1 • • • . As in Example 1.(02 ).i)r2 = c. We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1..). .
namely Oi (take'\ = 1) and OJ (take>. (\)). i = 1. the first square that touches S).15) Each one of these rules. and Performance Criteria Chapter 1 r2 I 10 5 Q(c') o o 5 10 Figure 1. (1.3.9.• all points on the lowet boundary of S that have as tangents the y axis or lines with nonpositive slopes).>'). ranges from 0 to 1. Because changing the prior 1r corresponds to changing the slope . OJ with probability (1 .3. . thus. In our example. = 0). and. where 0 < >. Let c' be the srr!~llest c for which Q(c) n S i 0 (i. 0 < >. Then Q( c*) n S is either a point or a horizontal or vertical line segment.11) corresponds to the values Oi with probability>..e.3.3.16) whose diagonal is the line r.3. the set B of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of S (Le. oil. as >. = r2.13). Goals. 2 The point where the square Q( c') defined by (1.. < 1.3.. R(B . See Figure 1. (1.3.)') of the line given by (1.16) touches S is the risk point of the minimax rule.3.3. < 1. by (1. It is the set of risk points of minimax rules because any point with smaller maximum risk would belong to Q(c) n S with c < c* contradicting the choice of c*./(1 . To locate the risk point of the nilhimax rule consider the family of squares. The convex hull S of the risk poiots (R(B!. the first point of contact between the squares and S is the .30 Statistical Models. We can choose two nonrandomized Bayes rules from this class. is Bayes against 7I". .
\ the solution of From Table 1. A rule <5 with risk point (TIl T2) is admissible. R(8" 0)) : 0 E V'} where V" is the set of all randomized decision procedures. R( 81 . From the figure it is clear that such points must be on the lower left boundary.5 can be shown to hold generally (see Ferguson. {(x. we can define the risk set in general as s ~ {( R(8 . then any Bayes procedure corresponding If e is not finite there are typically admissible procedures that are not Bayes. or equivalently. There is another important concept that we want to discuss in the context of the risk set.4. that <5 2 is inadmissible because 64 improves it (i. agrees with the set of risk points of Bayes procedures. (3) Randomized Bayes procedures are mixtures of nonrandomized ones in the Sense of (1. . (c) If e is finite(4) and minimax procedures exist.)). which yields .0). if and only if. A decision rule <5 is said to be inadmissible if there exists another rule <5' such that <5' improves <5.e. if and only if..\) = 5.) and R( 8" 04) = 5.3.\).. If e is finite. there is no (x. (a) For any prior there is always a nonrandomized Bayes procedure. under some conditions. e = {fh. 0.4'\ + 3(1 . Thus. Naturally.5(1 . . 04) = 3 < 7 = R( 81 . for instance.6 ~ R( 8" J...14) with i = 4. the minimax rule is given by (1.\ + 6.3 The Decision Theoretic Framework 31 intersection between TI = T2 and the line connecting the two points corresponding to <54 and <56 .Section 1. j = 6 and . they are Bayes procedures.3.V) : x < TI.\ ~ 0. the set of all lower left boundary points of S corresponds to the class of admissible rules and. this equation becomes 3. if there is a randomized one. 1967.59. thus. 1 . The following features exhibited by the risk set by Example 1. y) in S such that x < Tl and y < T2.3.3. . In fact..14).3.4 we can see. all admissible procedures are either Bayes procedures or limits of . To gain some insight into the class of all admissible procedures (randomized and nOnrandomized) we again use the risk set. Y < T2} has only (TI 1 T2) in common with S. > a for all i. l Ok}. for instance). (d) All admissible procedures are Bayes procedures.4 < 7. all rules that are not inadmissible are called admissible. (b) The set B of risk points of Bayes procedures consists of risk points on the lower boundary of S whose tangent hyperplanes have normals pointing into the positive quadrant. (e) If a Bayes prior has 1f(Oi) to 1f is admissible. However. Using Table 1.
Using this information. and prediction. The basic global comparison criteria Bayes and minimax are presented as well as a discussion of optimality by restriction and notions of admissibility. • 1. For more information on these topics. In terms of our preceding discussion. Similar problems abound in every field. and risk through various examples including estimation. loss function. although it usually turns out that all admissible procedures of interest are indeed nonrandomized. Section 3.32 . A college admissions officer has available the College Board scores at entrance and firstyear grade point averages of freshman classes for a period of several years. Other theorems are available characterizing larger but more manageable classes of procedures. I II .. Bayes procedures (in various senses). We want to find a function 9 defined on the range of Z such that g(Z) (the predictor) is "close" to Y. We assume that we know the joint probability distribution of a random vector (or variable) Z and a random variable Y. which is the squared prediction error when g(Z) is used to predict Y. he wants to predict the firstyear grade point averages of entering freshmen on the basis of their College Board scores. Z is the information that we have and Y the quantity to be predicted. The frame we shaH fit them into is the following. A government expert wants to predict the amount of heating oil needed next winter. We introduce the decision theoretic foundation of statistics inclUding the notions of action space. at least when procedures with the same risk function are identified. Goals. Since Y is not known.g(Z») = E[g(Z) ~ YI 2 or its square root yE(g(Z) .2 presented important situations in which a vector z of 00variates can be used to predict an unseen response Y. An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson. Summary. testing. The basic biasvariance decomposition of mean square error is presented. Statistical Models. These remarkable results. in the college admissions situation. Z would be the College Board score of an entering freshman and Y his or her firstyear grade point average. ranking. The MSPE is the measure traditionally used in the .I 'jlI . I The prediction Example 1. and Performance Criteria Chapter 1 '. We stress that looking at randomized procedures is essential for these conclusions. which include the admissible rules. ]967. decision rule. confidence bounds. A meteorologist wants to estimate the amount of rainfall in the coming spring. are due essentially to Waldo They are useful because the property of being Bayes is ea·der to analyze than admissibility. For example.3.Yj'. The joint distribution of Z and Y can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal.4). Here are some further examples of the kind of situation that prompts our study in this section. One reasonable measure of "distance" is (g(Z) . Next we must specify what close means. A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio.4 PREDICTION . we referto Blackwell and Girshick (1954) and Ferguson (1967). we tum to the mean squared prediction error (MSPE) t>2(Y. . at least in their original fonn.y)2.
4. we can conclude that Theorem 1.4 Prediction 33 mathematical theory of prediction whose deeper results (see.g(z))' IZ = z] = E[(Y I'(z))' I Z = z] + [g(z) I'(z)]'. or equivalently.C)2 has a unique minimum at c = J.4.L and the lemma follows.2 where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss.1 assures us that E[(Y .20). see Problem 1.1.4.6.c)2 is either oofor all c or is minimized uniquely by c In fact.5 and Section 3.(Y.g(z))' I Z = z].4. Because g(z) is a constant.25. 1957) presuppose it. 1.c)' ~ Var Y + (c _ 1')'.711). see Example 1. that is. By the substitution theorem for conditional expectations (B. 0 Now we can solve the problem of finding the best MSPE predictor of Y.g(Z))' I Z = z] = E[(Y .c = (Y 1') + (I' .L = E(Y).4. we can find the 9 that minimizes E(Y . consider QNP and the class QL of linear predictors of the form a + We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information.1) < 00 if and only if E(Y . (1. in which Z is a constant.Section 1. when EY' < 00.1) follows because E(Y p.c) implies that p. The method that we employ to prove our elementary theorems does generalize to other measures of distance than 6.3. (1. Just how widely applicable the notions of this section are will become apparent in Remark 1. Let (1.1.4. Lemma 1. exists. The class Q of possible predictors 9 may be the nonparametric class QN P of all 9 : R d Jo R or it may be to some subset of this class. for example.4. ljZ is any random vector and Y any random variable.4. we have E[(Y .4.I'(Z))' < E(Y . EY' = J.4.2) I'(z) = E(Y I Z = z). E(Y . L1=1 Lemma 1.c)' < 00 for all c.4. Grenander and Rosenblatt. We see that E(Y . given a vector Z. E(Y . Proof. g(Z)) such as the mean absolute error E(lg( Z) .) = 0 makes the cross product term vanish. then either E(Y g(Z))' = 00 for every function g or E(Y . See Remark 1.l.16).4) .3) If we now take expectations of both sides and employ the double expectation theorem (B. In this section we bjZj . EY' < 00 Y .4. In this situation all predictors are constant and the best one is that number Co that minimizes B(Y .4.YI) (Problems 1.c)2 as a function of c. and by expanding (1.g(Z))2.g(Z))' (1.
E{h(Z)<1 E{ E[h(Z)< I Z]} E{h(Z)E[Y I'(Z) I Z]} = 0 because E[Y I'(Z) I Z] = I'(Z) I'(Z) = O. then by the iterated expectation theorem.4.5) follows from (a) because (a) implies that the cross product term in the expansionof E ([Y 1'(z)J + [I'(z) g(z)IP vanishes. Vat Y = E(Var(Y I Z» + Var(E(Y I Z). Infact.6) and that (1.4.E(V)J Let € = Y .4.E(U)] = O.7) IJVar Y • < 00 strict inequality ooids unless 1 Y = E(Y I Z) (1.2.4.4. That is. I'(Z) is the unique E(Y . and recall (B. If E(IYI) < 00 but Z and Y are otherwise arbitrary. and Performance Criteria Chapter 1 for every 9 with strict inequality holding unless g(Z) best MSPE predictor. As a consequence of (1.8) or equivalently unless Y is a function oJZ. Goals.4. Proof. To show (a).5) is obtained by taking g(z) ~ E(Y) for all z. then we can write Proposition 1. so is the other. (1.6) which is generally valid because if one side is infinite. Property (1. = I'(Z).g(Z)' = E(Y . Properties (b) and (c) follow from (a).1(c) is equivalent to (1.4. which will prove of importance in estimation theory.1. Var(Y I z) = E([Y .34 Statistical Models. (1.6) is linked to a notion that we now define: Two random variables U and V with EIUVI < 00 are said to be uncorrelated if EIV .4. when E(Y') < 00. .E(Y I z)]' I z).4. let h(Z) be any function of Z.I'(Z))' + E(g(Z) 1'(Z»2 ( 1. Theorem 1.5) becomes. 0 Note that Proposition 1. Suppose that Var Y < 00.6).5) An important special case of (1. Equivalently U and V are uncorrelated if either EV[U E(U)] = 0 or EUIV .20).4. 1. = O. we can derive the following theorem.Il(Z) denote the random prediction error. Write Var(Y I z) for the variance of the condition distribution of Y given Z = z. then Var(E(Y I Z)) < Var Y.4. that is.E(v)IIU . then (1.4.4. then (a) f is uncorrelated with every function ofZ (b) I'(Z) and < are uncorrelated (c) Var(Y) = Var I'(Z) + Var <.
4. y) = p (Z = Z 1 Y = y) of the number of shutdowns Y and the capacity state Z of the line for a randomly chosen day.025 0. o Example 1.025 0.10 0.10 0.E(Y I Z))' =L x 3 ~)y y=o .E(Y I Z))' ~ 0 By (A. and only if.'" 1 Y.10. or 3 failures among all days.20. E(Y .45.6). the average number of failures per day in a given month.10 I 0. (1. E(Var(Y I Z)) ~ E(Y .lI. I. Each day there can be 0. Equality in (1. 2. The row sums of the entries pz (z) (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state.1 2. . E (Y I Z = ±) = 1. I z) = E(Y I Z). 2.30 py(y) I 0.7) follows immediately from (1. p(z.15 3 0. y) = 0. or 3 shutdowns due to mechanical failure.25 2 0.25 0.4.05 0.30 I 0. as we reasonably might.15 I 0. The first is direct.4. the best predictor is E (30.9) this can hold if.8) is true.25 1 0. Within any given month the capacity status does not change. and only if. whereas the column sums py (y) yield the frequency of 0. i=l 3 E (Y I Z = ~) ~ 2.05 0. 1.y) pz(z) 0.50 z\y 1 0 0. The following table gives the frequency function p( z.25 0. In this case if Yi represents the number of failures on day i and Z the state of the assembly line.4.10 0.885.7) can hold if. or quarter capacity.E(Y I Z = z))2 p (z. But this predictor is also the right one. These fractional figures are not too meaningful as predictors of the natural number values of Y.1. also.45 ~ 1 The MSPE of the best predictor can be calculated in two ways.:. The assertion (1. half. if we are trying to guess.05 0. We want to predict the number of failures for a given day knowing the state of the assembly line for the month. We find E(Y I Z = 1) = L iF[Y = i I Z = 1] ~ 2.Section 1. An assembly line operates either at full.4 Prediction 35 Proof.4.
36 Statistical Models. p < 0 indicates that large values of Z tend to go with small values of Y and we have negative dependence. Suppose Y and Z are bivariate normal random variables with the same mean . whereas its magnitude measures the degree of such dependence.p')). Goals.4. If p = 0. the best predictor is just the constant f.10) I I " The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution. u~.885 as before. The term regression was coined by Francis Galton and is based on the following observation. Theorem B. . p) distribution.4. Thus. 2 I . E((Y ./lz). (1 .J. is usually called the regression (line) of Y on Z.4.LY. In the bivariate normal case.4. Regression toward the mean. . a'k.6) we can also write Var /lO(Z) p = VarY . the best predictor of Y using Z is the linear function Example 1. for this family of distributions the sign of the correlation coefficient gives the type of dependence between Z and Y.E(Y I Z p') (1. If p > O. Therefore.9) I . the predictor is a monotone increasing function of Z indicating that large (small) values of Y tend to be associated with large (small) values of Z.11) I The line y = /lY + p(uy juz)(z ~ /lz).2 tells us that the conditional dis /lO(Z) Because . One minus the ratio of the MSPE of the best predictor of Y given Z to Var Y.L[E(Y I Z y . " " is independent of z. and Performance Criteria Chapter 1 The second way is to use (1.4.y as we would expect in the case of independence.6) writing. Similarly. o tribution of Y given Z = z is N City + p(Uy j UZ ) (z .E(Y I Z))' (1. = z)]'pz(z) 0. this quantity is just rl'. which is the MSPE of the best constant predictor. can reasonably be thought of as a measure of dependence. the MSPE of our predictor is given by. The larger this quantity the more dependent Z and Yare. = Z))2 I Z = z) = u~(l _ = u~(1 _ p'). O"~. E(Y ~ E(Y I Z))' VarY ~ Var(E(Y I Z)) E(Y') ~ E[(E(Y I Z))'] I>'PY(Y) . E(Y . Y) has a N(Jlz l f. Because of (1. = /lY + p(uyjuz)(Z ~ /lZ). The Bivariate Normal Distribution. (1.2. If (Z.4. which corresponds to the best predictor of Y given Z in the bivariate normal model.4.
12) with MSPE ElY l"o(Z)]' ~ E{EIY l"o(zlI' I Z} = E(uYYlz) ~ Uyy  EyzEziEzy. is closer to the population mean of heights tt than is the height of the father.6) in which I' (I"~. tall fathers tend to have shorter sons.. Ezy O"yy E zz is the d x d variancecovariance matrix Var(Z) . The variability of the predicted value about 11. (Zn.4. these were the heights of a randomly selected father (Z) and his son (Y) from a large human population. Let Z = (Zl. . . Y) is unavailable and the regression line is estimated on the basis of a sample (Zl. .6). " Zd)T be a d x 1 covariate vector with mean JLz = (ttl. This quantity is called the multiple correlation coefficient (MCC).4. Then the predicted height of the son.6. E zy = (COV(Z" Y). y)T has a (d + !) multivariate normal.p)tt + pZ.5 states that the conditional distribution of Y given Z = z is N(I"Y +(zl'zf. be less than that of the actual heights and indeed Var((! . E). .. Yn ) from the population. distribution (Section B. the MCC equals the square of the .. should.. I"Y) T. and positive correlation p. We shall see how to do this in Chapter 2. ttd)T and suppose that (ZT. there is "regression toward the mean. Theorem B. By (1.8 = EziEzy anduYYlz ~ uyyEyzEziEzy. We write Mee = ' ~! _ PZy ElY /lO(Z))' ~ Var l"o(Z) Var Y Var Y . so the MSPE of !lo(Z) is smaller than the MSPE of the constant predictor J1y.11). uYYlz) where. In Galton's case.4. The quadratic fonn EyzEZ"iEzy is positive except when the joint nonnal distribution is degenerate. . Y» T T ~ E yZ and Uyy = Var(Y). variance cr2. coefficient ofdetennination or population Rsquared. usual correlation coefficient p = UZy / crfyCT lz when d = 1.4. the distribution of (Z.. N d +1 (I'..4 Prediction 37 tt. consequently. or the average height of sbns whose fathers are the height Z. The Multivariate Normal Distribution. (1 .8. 0 Example 1.COV(Zd. . One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with Y. Note that in practice.Section 1.3. in particular in Galton's studies. Thus. YI ).. ." This is compensated for by "progression" toward the mean among the sons of shorter fathers and there is no paradox. where the last identity follows from (1. I"Y = E(Y). .p)1" + pZ) = p'(T'. Thus. the best predictor E(Y I Z) ofY is the linear function (1.
Goals.1 and 2. In words. Y) a I VarZ. l.5%. and E(Y _ boZ)' = E(Y') _ [E(ZY)]' E(Z') (1. Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of Z and Y in order to calcujate E(Y I Z) and that the best predictor may be a complicated function of Z. Y. knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33. respectively. yn)T See Sections 2.13) . Let us call any random variable of the form a + bZ a linear predictor and any such variable with a = 0 a zero intercept linear predictor. Z2 = father's height). We first do the onedimensional case. (Z~.3.4.bZ)' = E(Y') + E(Z')(b  Therefore. The natural class to begin with is that of linear combinations of components of Z. respectively.y = . E(Y . In practice. y)T is unknown. If we are willing to sacrifice absolute excellence.y = . P~. .39 l. Suppose(l) that (ZT l Y) T is trivariate nonnal with Var(Y) = 6. Z2)T be the heights in inches of a IOyearold girl and her parents (Zl = mother's height. are P~.3%.38 Statistical Models. when the distribution of (ZT.E(Z')b~. The best linear predictor. What is the best (zero intercept) linear predictor of Y in the sense of minimizing MSPE? The answer is given by: Theorem 1. The percentage reductions knowing the father's and both parent's heights are 20. The problem of finding the best MSPE predictor is solved by Theorem 1. Then the unique best zero intercept linear predictor i~ obtained by taking E(ZY) b=ba = E(Z') .bZ}' = E(Y)  b E(Z) 1· = (Y  [Z(b . p~y = .l Proof.4.4. the hnear predictor l'a(Z) and its MSPE will be estimated using a 0 sample (Zi. let Y and Z = (Zl.ba) + Zba]}' to get ba )' . we can avoid both objections by looking for a predictor that is best within a class of simple predictors.2.393. We expand {Y . Suppose that E(Z') and E(Y') are finite and Z and Y are not constant.335.209.zz ~ (.. E(Y . .zy ~ (407.:.1.)T.bZ)' is uniquely minintized by b = ba. and parents.9% and 39.~ ~:~~)... and Performance Criteria Chapter 1 For example. whereas the unique best linear predictor is ILL (Z) = al + bl Z where b = Cov(Z.. 29W· Then the strength of association between a girl's height and those of her mother and father.
E(Z)f[Z .boZ)' = 0. On the other hand.Z)'.l).Section 1.4.I'l(Z)1' depends on the joint distribution P of only through the expectation J. in Example 1.4. E(Y .boZ)' > 0 is equivalent to the CauchySchwarz inequality with equality holding if.bZ)' ~ Var(Y .l7) in the appendix. I 1.E(ZWW ' E([Z . 0 Remark 1..a.bZ)2 is uniquely minimized by taking a ~ E(Y) . Let Po denote = .4. then the Unique best MSPE predictor is I'dZ) Proof.4. whatever be b. whiclt corresponds to Y = boZo We could similarly obtain (A. OUf linear predictor is of the fonn d I'I(Z) = a + LbjZj j=l = a+ ZTb [3 = (E([Z . is the unique minimizing value. Best Multivariate Linear Predictor.4 Prediction 39 To prove the second assertion of the theorem note that by (lA.t and covariance E of X.4.a)'.b(Z .4.16) directly by calculating E(Y . 0 Note that if E(Y I Z) is of the form a + bZ.4. (1.2. and only if.14) Yl T Ep[Y . and b = b1 • because.bE(Z).E(Z)IIZ .E(Z)J))1 exist.1). E(Y . If EY' and (E([Z .4.13) we obtain tlte proof of the CauchySchwarz inequality (A. We can now apply the result on zero intercept linear predictors to the variables Z .bZ) + (E(Y) .E(Y)) . That is. Theorem 1.E(Z) and Y .bE(Z) . From (1.bZ)2 we see that the b we seek minimizes E[(Y .E(Y) to conclude that b.4. Substituting this value of a in E(Y .a .I1. Note that R(a.E(Z)J[Y .1. Therefore.b.I'd Z )]' = 1 05 ElY I'(Z)j' . E[Y .tzf [3.a . by (1. This is because E(Y . . In that example nothing is lost by using linear prediction. = I'Y + (Z  J. This is in accordance with our evaluation of E(Y I Z) in Example 1. A loss of about 5% is incurred by using the best linear predictor.a .5).E(Y)]) = EziEzy. if the best predictor is linear. then a = a. it must coincide with the best linear predictor. E(Y .E(Z))]'.1 the best linear predictor and best predictor differ (see Figure 1. b) X ~ (ZT.
4. .4. . The line represents the best linear predictor y = 1.3 to d > 1. b) = Epo IY . In the general.4. thus. . Ro(a.4.4.I'LCZ)). that is.4. I"(Z) = E(Y I Z) = a + ZT 13 for unknown Q: E R and f3 E Rd.I'(Z) and each of Zo.4.3.17 for an overall measure of the strength of this relationship.4.00 z Figure 1. Set Zo = 1.4. and R(a. 0 Remark 1. b) is minimized by (1.45z. However. N(/L. . that is.50 0..4. By Proposition 1. b) is miuimized by (1.1.1(a). We could also have established Theorem 1. the multivariate uormal. Because P and Po have the same /L and E.. b) = Ro(a.4 by extending the proof of Theorem 1.4. Y). b). 2 • • 1 o o 0. A third approach using calculus is given in Problem 1.2. 0 Remark 1.4.. Zd are uncorrelated. j = 0. Thus. . The three dots give the best predictor. See Problem 1.05 + 1. Goals. ~ Y .75 1.. E(Zj[Y . . not necessarily nonnal.3.40 Statistical Models. E). distribution and let Ro(a..(a + ZT 13)]) = 0. the MCC gives the strength of the linear relationship between Z and Y. (1. p~y = Corr2 (y.19. R(a.I'I(Z)]'.14). We want to express Q: and f3 in terms of moments of (Z. . . By Example 1. case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the correlation between Y and the best linear predictor of Y.14). Suppose the mooel for I'(Z) is linear.4. 0 Remark 1.4.d.15) . our new proof shows how secondmoment results sometimes can be established by "connecting" them to the nannal distribution.25 0. . and Performance Criteria Chapter 1 y .
With these concepts the results of this section are linked to the general Hilbert space results of Section 8. we see that r(5) ~ MSPE for squared error loss 1(9. usually R or Rk.4. When the class 9 of possible predictors 9 with Elg(Z)1 space as defined in Section B. If we identify II with Y and X with Z.6. Consider the Bayesian model of Section 1. Remark 1. and projection 1r notation.I 0 and there is a 90 E 9 such that 0 < oc form a Hilbert go = arg inf{".(Y 19NP). The notion of mean squared prediction error (MSPE) is introduced.10. then 90(Z) is called the projection of Y on the space 9 of functions of Z and we write go(Z) = .4. If T assigns the same value to different sample points.14) (Problem 1. We consider situations in which the goal is to predict the (perhaps in the future) value of a random variable Y. the optimal MSPE predictor is the conditional expected value of Y given Z. 1. I'dZ» = ".(Y I 9). It is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear. Because the multivariate normal model is a linear model.8) defined by r(5) = E[I(II. we can conclude that I'(Z) = . g(Z) and h(Z) are said to be orthogonal if at least one has expected value zero and E[g(Z)h(Z)] ~ O.4.Thus.5(X»]. I'(Z» + ""'(Y. o Summary. We return to this in Section 3. can also be a set of functions.4.. Moreover. Using the distance D.(Y 1ge) ~ "(I'(Z) 19d (1..16) is the Pythagorean identity.Section 1.4.1.2.23). the optimal MSPE predictor E(6 I X) is the Bayes procedure for squared error loss. I'dZ) = .. Remark 1.12).3. and it is shown that if we want to predict Y on the basis of information contained in a random vector Z.(Y. but as we have seen in Sec~ tion 1.. T(X) = X loses information about the Xi as soon as n > 1. I'(Z» Y .. we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation.' (l'e(Z).4.5 SUFFICIENCY Once we have postulated a statistical model. Recall that a statistic is any function of the observations generically denoted by T(X) or T.5) = (9 ..4. Note that (1.2 and the Bayes risk (1.g(Z»): 9 E g). then by recording or taking into account only the value of T(X) we have a reduction of the data. The optimal MSPE predictor in the multivariate normal distribution is presented.4. .5)'. Thus. The range of T is any space of objects T. this gives a new derivation of (1.5.15) for a and f3 gives (1.17) ""'(Y.I'(Z) is orthogonal to I'(Z) and to I'dZ).16) (1.5 Sufficiency 41 Solving (1.4.3. We begin by fonnalizing what we mean by "a reduction of the data" X EX.
and Performance Criteria Chapter 1 Even T(X J.Xn ) does not contain any further infonnation about 0 or equivalently P. recording at each stage whether the examined item was defective or not. . However.5. Therefore. We give a decision theory interpretation that follows. We could then represent the data by a vector X = (Xl. X 2 the time between the arrival of the first and second customers. Instead of keeping track of several numbers. when Xl + X.1) where Xi is 0 or 1 and t = L:~ I Xi. Using our discussion in Section B. . The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information. the conditional distribution of X 0 given T = L:~ I Xi = t does not involve O. X.) and that of Xlt/(X l + X.. are independent and the first of these statistics has a uniform distribution on (0.Xn ) where Xi = 1 if the ith item sampled is defective and Xi = 0 otherwise. it is intuitively clear that if we are interested in the proportion 0 of defective items nothing is lost in this situation by recording and using only T. the conditional distribution of XI/(X l + X.0.X.42 Statistical Models. .l. Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) Let Xl be the time of arrival of the first customer. " . Thus.' • . Begin by noting that according to TheoremB. X(n))' loses information about the labels of the Xi. A statistic T(X) is called sufficient for PEP or the parameter if the conditional distribution of X given T(X) = t does not involve O.) are the same and we can conclude that given Xl + X. . By (A. + X. • X n ) (X(I) .Xn ) is the record of n Bernoulli trials with probability 8.5. Each item produced is good with probability 0 and defective with probability 1. Suppose there is no dependence between the quality of the items produced and let Xi = 1 if the ith item is good and 0 otherwise. 16. Goals. = t. suppose that in Example 1. where 0 is unknown.1. Although the sufficient statistics we have obtained are "natural.4). .) and Xl +X. whatever be 8. Xl has aU(O.l. Thus. = tis U(O.. It follows that. X n ) into the same number. 1) whatever be t. T = L~~l Xi.) given Xl + X. 0 In both of the foregoing examples considerable reduction has been achieved. P[X I = XI.'" . the sample X = (Xl. given that P is valid. By (A. whatever be 8. Y) where X is uniform on (0. We prove that T = X I + X 2 is sufficient for O. .Xn = xnl = 8'(1. ~ t. is a statistic that maps many different values of (Xl. 1). .l..)](X I + X. o Example 1.. A machine produces n items in succession.'" . .9. Xl and X 2 are independent and identically distributed exponential random variables with parameter O. X.. The most trivial example of a sufficient statistic is T(X) = X because by any interpretation the conditional distribution of X given T(X) = X is point mass at X.2.l we see that given Xl + X2 = t. XI/(X l +X." it is important to notice that there are many others e.) is conditionally distributed as (X.5). we need only record one.'" . Thus.1 we had sampled the manufactured items in order. The total number of defective items observed. .3. t) and Y = t .2. the conditional distribution of Xl = [XI/(X l + X.5. Example 1.. . t) distribution. T is a sufficient statistic for O. For instance. is sufficient. Then X = (Xl. in the context of a model P = {Pe : (J E e}.8)n' (1. once the value of a sufficient statistic T is known. One way of making the notion "a statistic whose use involves no loss of infonnation" precise is the following.1. (Xl.. By Example B.
e porT = til = I: {x:T(x)=t.O) I: {x:T(x)=t. Neyman. By our definition of conditional probability in the discrete case.5. we need only show that Pl/[X = xjlT = til is independent of for every i and j.. X2. it is enough to show that Po[X = XjlT ~ l. Fortunately.5) . We shall give the proof in the discrete case. To prove the sufficiency of (152). Applying (153) we arrive at.In a regular model. a statistic T(X) with range T is sufficient for e if.] Po[X = Xj.Section 1.. a simple necessary and sufficient criterion for a statistic to be sufficient is available. if (152) holds. (153) By (B.S. Such statistics are called equivalent.} p(x. 0) porT ~ Ii] ~~CC+ g(li..} h(x). Po [X ~ XjlT = t.] o if T(xj) oF ti if T(xj) = Ii. e) definedfor tin T and e in on X such that p(X.5 Sufficiency 43 that will do the same job. Now. T = t. More generally.") be the set of possible realizations of X and let t i = T(Xi)' Then T is discrete and 2::~ 1 porT = Ii] = 1 for every e. Section 2. In general. .2.6). Po[X = XjlT = t. The complete result is established for instance by Lehmann (1997. forO E Si. h(xj) (1.I.. if Tl and T2 are any two statistics such that 7 1 (x) = T 1 (y) if and only if T2(x) = T 2(y). e)h(Xj) Po[T (15. It is often referred to as the factorization theorem for sufficient statistics. 0 E 8. and e and a/unction h defined (152) = g(T(x). Being told that the numbers of successeS in five trials is three is the same as knowing that the difference between the numbers of successes and the number of failures is one. Let (Xl. i = 1.. there exists afunction g(t.Ll) and (152). Proof.]/porT = Ii] p(x.4) if T(xj) ~ Ii 41 o if T(xj) oF t i . and Halmos and Savage. checking sufficiency directly is difficult because we need to compute the conditional distribution. This result was proved in various forms by Fisher.e) = g(ti. then T 1 and T2 provide the same information and achieve the same reduction of the data.j is independent ofe on cach of the sets Si = {O: porT ~ til> OJ. • only if.O) Theorem I. O)h(x) for all X E X.
A whole class of distributions.5. .7) o Example 1...1.5. . we can show that X(n) is sufficient. X(n) is a sufficient statistic for O. The population is sampled with replacement and n members of the population are observed and their labels Xl.Xn are recorded. and Performance Criteria Chapter 1 Therefore.5. n P(Xl.. 0 o Example 1. we need only keeep track of X(n) = max(X .3.5. if T is sufficient..4)). •.O) where x(n) = onl{x(n) < O}. The probability distribution of X "is given by (1.2 (continued).... _ _ _1 .O)=onexP[OLXil i=l (1. 0 Example 1.S.1 to conclude that T(X 1 .').5.8) if all the Xi are > 0.5. = max(xl. . l X n are the interarrival times for n customers. By Theorem 1. both of which are unknown. Let 0 = (JL. .. and h(xl •..44 Statistical Models.X n JI) = 0 otherwise. ~ Po[X = x. and P(XI' .10) P(Xl. Takeg(t. 1. which admits simple sufficient statistics and to which this example belongs.~... ' .'t n /'[exp{ . T is sufficient. . .Xn ) is given by I. . .2. .8) = eneStift > 0.'I. T = T(x)] = g(T(x).. 1 X n be independent and identically distributed random variables each having a normal distribution with mean {l and variance (j2. Common sense indicates that to get information about B. Then the density of (X" .xn }. 1=1 (!. Goals. X n ). . Conversely.' (L x~ 1=1 1 n n 2JL LXi))].Il) .Xn.'" . I x n ) = 1 if all the Xi are > 0.9) can be rewritten as " 1 Xn .4.5. 10 fact. .Xn ) = L~ IXi is sufficient.9) if every Xi is an integer between 1 and B and p( X 11 • (1. and both functions = 0 otherwise.. Expression (1. We may apply Theorem 1.3). Consider a population with () members labeled consecutively from I to B.2. . Let Xl. let g(ti' 0) = porT = tiL h(x) = P[X = xIT(X) = til Then (1.5.[27f. O} = 0 otherwise. If X 1. }][exp{ ..5.. . Estimating the Size of a Population. .16.. > 0.X. then the joint density of (X" .. O)h(x) (1.Xn ) is given by (see (A.6) p(x. [27"..5.JL)'} ~=l I n I! 2 . 0) by (B..5.' L(Xi ..n /' exp{ . are introduced in the next section.
5.5.~ I: xn By the factorization theorem. An equivalent sufficient statistic in this situation that is frequently used IS S(X" . Suppose X" . where we assume diat the given constants {Zi} are not all identical. find a randomized decision rule o'(T(X)) depending only on T(X) that does as well as O(X) in the sense of having the same risk function. a 2)T is identifiable (Problem 1.~0)}(2"r~n exp{ . o Sufficiency and decision theory Sufficiency can be given a clear operational interpretation in the decision theoretic setting. . .. with JLi following the linear regresssion model ("'J . [1/(n i=l n n 1)1 I:(Xi . . 2:>.1.} + EY. that Y I .6. Ezi 1'i) is sufficient for 6. we construct a rule O'(X) with the sarne risk = mean squared error as o(X) as follows: Conditionally. (1. if T(X) is sufficient.xnJ)) is itself a function of (L~ upon applying Theorem 1.4 with d = 2.2a I3. Here is an example. we can.Zi)'} exp {Er. Let o(X) = Xl' Using only X.. T = (EYi . .' + 213 2a + 213. 0 X Example 1. Then pix.EZiY. X n ) = (I: Xi.• Yn are independent. {32. Then 6 = {{31. Thus.) i=l i=l is sufficient for B.5.o) = R(O. that is. a<.1. for any decision procedure o(x).l) distributed. Yi N(Jli. X is sufficient.12) t ofT(X) and a Example 1.. X n ) = [(lin) where I: Xi. 1 2 2 . EYi 2 . X n are independent identically N(I). R(O.5.5. .. Suppose. respectively.Section 1. in Example 1. Specifically. 0') for allO.X)'].··· .( 2?Ta ')~ exp p {E(fJ. a 2 ).9) and _ (y.1 we can conclude that n n Xi. i= 1 = (lin) L~ 1 Xi. I)) = exp{nI)(x .5 Sufficiency 45 I Evidently P{Xl 1 • •• . The first and second components of this vector are called the sample mean and the sample variance. I)) . By randomized we mean that o'(T(X») can be generated from the value random mechanism not depending on B. L~l x~) and () only and T(X 1 .
o Sufficiency aud Bayes models There is a natural notion of sufficiency of a statistic T in the Bayesian context where in addition to the model P = {Po : 0 E e} we postulate a prior distribution 11 for e. T = I Xi was shown to be sufficient. Thus.. if Xl. This 6' (T(X)) will have the same risk as 6' (X) because.5.6') ~ E{E[R(O. where k In this situation we call T Bayes sufficient. ).12) follows along the lines of the preceding example: Given T(X) = t. T(X) is Bayes sufficient for IT if the posterior distribution of 8 given X = x is the same as the posterior (conditional) distribution of () given T(X) = T(x) for all x. Minimal sufficiency For any model there are many sufficient statistics: Thus.4.5. 0) as p(x. This result and a partial converse is the subject of Problem 1.6).1 (Bernoulli trials) we saw that the posterior distribution given X = x is 1 Xi. .. But T(X) provides a greater reduction of the data. 0 The proof of (1. 0 and X are independent given T(X). then T(X) = (L~ I Xi. n~l) distribution. X n ) are hoth sufficient.5..2. (Kolmogorov). ) Var(T') = E[Var(T'IX)] + VarIE(T'IX)]  ~ nl n + 1 n ~ 1 = Var(X .2. R(O.14. L~ I xl) and S(X) = (X" . Example 1.46 Statistical Models. We define the statistic T(X) to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic S(X). Now draw 6' randomly from this conditional distribution.l and (1. (T2) sample n > 2. Using Section B. choose T" = 15"' (X) from the normal N(t. Goals. we find E(T') ~ E[E(T'IX)] ~ E(X) ~ I" ~ E(X . and Performance Criteria Chapter 1 given X = t. it is Bayes sufficient for every 11.Xn is a N(p. Theorem }. Let SeX) be any other sufficient statistic. Equivalently. by the double expectation theorem. .. the distribution of 6(X) does not depend on O. In this Bernoulli trials case. the same as the posterior distribution given T(X) = L~l Xi = k.6'(T))IT]} ~ E{E[R(O. In Example 1. IfT(X) is sufficient for 0. = 2:7 Definition.O) = g(S(x). we can find a transformation r such that T(X) = r(S(X)).. Then by the factorization theorem we can write p(x. in that.5.6(X»)IT]} = R(O. O)h(x) E:  . 6'(X) and 6(X) have the same mean squared error.1 (continued).6). .
1).5 Sufficiency 47 Combining this with (1. ~ 1/3. }exp ( I . we find T = r(S(x)) ~ (log[2ng(s(x). For any two fixed (h and fh.O). when we think of Lx(B) as a function of (). 1/3)]}/21og 2. the statistic L takes on the value Lx. the ratio of both sides of the foregoing gives In particular.11) is determined by the twodimensional sufficient statistic T Set 0 = (TI. In this N(/". we find OT(1 . However.) n/' exp {nO. it gives. t. take the log of both sides of this equation and solve for T. 2/3)/g(S(x). In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around x. ~ 2/3 and 0. Now. for a given observed x. The likelihood function o The preceding example shows how we can use p(x.Section 1. if we set 0. the "likelihood" or "plausibility" of various 8. for a given 8.O)h(x) forall O.1) nlog27r . We define the likelihood function L for a given observed data vector x as Lx(O) = p(x. the likelihood function (1. Thus. L x (()) gives the probability of observing the point x. if X = x. t2) because. ()) for different values of () and the factorization theorem to establish that a sufficient statistic is minimally sufficient. Example 1.5. Lx is a map from the sample space X to the class T of functions {() t p(x.5. = 2 log Lx(O.o)nT ~g(S(x). Thus.4 (continued). (T'). (T') example. 20' 20 I t. ()) x E X}.) ~ (L Xi.5.)}. L Xf) i=l i=l n n = (0 10 0. 20. L x (') determines (iI.(t. for example. T is minimally sufficient. It is a statistic whose values are functions.5.T. as a function of ().) = (/".8) for the posterior distribution can then be remembered as Posterior ex: (Prior) x (Likelihood) where the sign ex denotes proportionality as functions of 8. In the discrete case. The formula (1..O E e. then Lx(O) = (27rO.
)..5. Let p(X. . The "irrelevant" part of the data We can always rewrite the original X as (T(X).. itself sufficient. 1) (Problem 1.. the ranks. S(X) = (R" . Let Ax OJ c (x: p(x. then for any decision procedure J(X).48 Statistical Models.5. hence.Xn . We define a statistic T(X) to be Bayes sufficient for a prior 7r if the posterior distribution of f:} given X = x is the same as the posterior distribution of 0 given T(X) = T(x) for all X. If.O) > for all B.. in Example 1. 0) = g(T(X). and Scheffe. L is minimal sufficient. B ul the ranks are needed if we want to look for possible dependencies in the observations as in Example 1. .Oo) > OJ = Lx~6o)' Thus.5. the residuals. or for the parameter O. 1 X n ). O)h(X). if the conditional distribution of X given T(X) = t does not involve O. 0 the likelihood ratio of () to Bo. ifT(X) = X we can take SeX) ~ (Xl .1. SeX)~ where SeX) is a statistic needed to uniquely detennine x once we know the sufficient statistic T(x).X. and Performance Criteria Chapter 1 with a similar expression for t 1 in terms of Lx (O. (X( I)' .5.X Cn )' the order statistics..17). . Goals. . . or if T(X) ~ (X~l)' . 0 In fact. where R i ~ Lj~t I(X. 1 X n are a random sample. A sufficient statistic T(X) is minimally sufficient for () if for any other sufficient statistic SeX) we can find a ttansformation r such that T(X) = r(S(X). a 2 is assumed unknown. We show the following result: If T(X) is sufficient for 0. l:~ I (Xi .. See Problem ((x:).5. If T(X) is sufficient for 0. Suppose there exists 00 such that (x: p(x.X)2) is sufficient. (X. but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding S(X)see Problem 1. as in the Example 1. Suppose that X has distribution in the class P ~ {Po : 0 E e}. Lehmann. Summary.12 for a proof of this theorem of Dynkin.5. < X. 1) and Lx (1. 0) and heX) such that p(X.5. but if in fact a 2 I 1 all information about a 2 is contained in the residuals.4. .5. L is a statistic that is equivalent to ((1' i 2) and. we can find a randomized decision rule J'(T(X» depending only on the value of t = T(X) and not on 0 such that J and J' have identical risk functions. If P specifies that X 11 •. SeX) becomes irrelevant (ancillary) for inference if T(X) is known but only if P is valid. 1. The likelihood function is defined for a given data vector of . hence. Then Ax is minimal sufficient.. The factorization theorem states that T(X) is sufficient for () if and only if there exist functions g( t.. a statistic closely related to L solves the minimal sufficiency problem in general.13.Rn ). Thus. . Thus.X).1 (continued) we can show that T and.. For instance. _' . X is sufficient. if a 2 = 1 is postulated. We say that a statistic T (X) is sufficient for PEP. By arguing as in Example 1. OJ denote the frequency function or density of X. it is Bayes sufficient for O. Consider an experiment with observation vector X = (Xl •. • X( n» is sufficient. Ax is the function valued statistic that at (J takes on the value p x.
The Poisson Distribution. p(x. This is clear because we need only identify exp{'7(B)T(x) .B(B)} with g(T(x). and if there is a value B E such that o e e. Pitman. B E' sufficient for B. B) > O} c {x: p(x.6. In a oneparameter exponential family the random variable T(X) is sufficient for B.2) .B) and h(x) with itself in the factorization theorem. Let Po be the Poisson distribution with unknown mean IJ. = 1 .o I x. B) of the Pe may be written p(x.B(B)} (1.6. many other common features of these families were discovered and they have become important in much of the modern theory of statistics. We return to these models in Chapter 2. x. 1.6. then.Section 1. beta. Ax(B) is a minimally sufficient statistic. (1. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman. B). by the factorization theorem.2"" }. binomial.6.6 EXPONENTIAL FAMILIES The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. Note that the functions 1}.B o) > O}. 1. and Dannois through investigations of this property(l).1. Subsequently.1) where x E X c Rq. gamma. We shall refer to T as a natural sufficient statistic of the family. 1. is said to be a oneparameter exponentialfami/y. such that the density (frequency) functions p(x. if there exist realvalued functions fJ(B).B) = eXe.exp{xlogBB}. Probability models with these common features include normal. the likelihood ratio Ax(B) = Lx(B) Lx(Bo) depends on X through T(X) only. B. for x E {O.B) ~ h(x)exp{'7(B)T(x) .1 The OneParameter Case The family of distributions of a model {Pe : B E 8}. and multinomial regression models used to relate a response variable Y to a set of predictor variables. BEe.6 Exponential Families 49 observations X to be the function of B defined by Lx(B) = p(X. these families form the basis for an important class of models called generalized linear models. They will reappear in several connections in this book. realvalued functions T and h on Rq. Poisson. Then. B( 0) on e. and T are not unique. Example 1. More generally. IfT(X) is {x: p(x. Here are some examples. B>O.
o The families of distributions obtained by sampling from oneparameter exponential families are themselves oneparameter exponential families. n) < 0 < 1. (1.) i=1 m B(8)] ( 1. 0) distribution. I I! .6.6.h(X)=( : ) . Goals. Specifically.' x.Z)OI) Z)'O'J} (2rrO)1 exp { (2rr)1 exp {  ~ [z' + (y  ~z' } exp { .6.Xm are independent and identically distributed with common distribution Pe.) exp[~(O)T(x. . . h(x) = (2rr)1 exp { . + OW. where the p. . Statistical Models.B(O)=nlog(I0). 1 (1.5) o = 2.1](0) = logO.~O'(y  z)' logO}. Suppose X has a B(n.O) = ( : ) 0'(1_ Or" ( : ) ex p [xlog(1 ~ 0) + nlog(l. suppose Xl.O) = f(z)f.6.~O'. the family of distributions of X is a oneparameter exponential family with q=I. fonn a oneparameter exponential family as in (1. . 0 E e. B(O) = 0.6. q = 1.T(x)=x.O) f(z.6.. .3) I' . Here is an example where q (1. Then > 0.6. for x E {O. Z and W are f(x.. 1).. Suppose X = (Z.h(x) = . 1 X m ) considered as a random vector in Rmq and p(x.~z.50 .T(x) = (y  z)'.3. . Example 1. The Binomial Family.O) IIh(x. we have p(x. and Performance Criteria Chapter 1 Therefore.2. is the family of distributions of X = (Xl' .0)].1). T(x) ~ x. B(O) = logO. If {pJ=».y. 0 Example 1.(y I z) = <p(zW1<p((y . This is a oneparameter exponential family distribution with q = 2.. y)T where Y = Z independent N(O.. 0 Then. the Pe form a oneparameter exponential family with ~~I . .1](0) = . 1. . .} .6) .1](O)=log(I~0). B) are the corresponding density (frequency) functions. p(x.4) Therefore.
and h.6. .8) .6..\ 1"/'" (p (s 1) 1) 1) (r T(x) x (x . For example.\) (3(r. We leave the proof of these assertions to the reader. This family of Poisson distIibutions is oneparameter exponential whatever be m. and h. then the m ) fonn a I Xi.1 the sufficient statistic T :m)(x1. pi pi I:::n TABLE 1. . 1 X m).. I:::n I::n Theorem 1.1.x log x 10g(l x) logx The statistic T(m) (X I. i=l (1. 1)(0) N(I'.Section 1. 1]... · ..Xm ) = I Xi is distributed as P(mO).B(O)} for suitable h * . ily of distributions of a sample from any of the foregoin~ is just In our first Example 1. and pJ LT(Xi).6. .1 Family of distributions . oneparameter exponential family with natural sufficient statistic T(m)(x) = Some other important examples are summarized in the following table. X m ) corresponding to the oneparameter exponential fam1 T(Xd. then qCm) = mq.' fixed I' fixed p fixed . B.\ fixed r fixed s fixed 1/2.· . Let {Pe } be a oneparameter exponential family of discrete distributions with corresponding functions T... . fl..2) r(p. B. then the family of distributions of the statistic T(X) is a oneparameter exponential family of discrete distributions whose frequency functions may be written h'Wexp{1)(O)t . In the discrete case we can establish the following general result. if X = (Xl.7) Note that the natural sufficient statistic T(m) is onedimensional whatever be m.6 Exponential Families 51 where x = (x 1 . 1 X m ) is a vector of independent and identically distributed P(O) random variables and m ) is the family of distributions of x. Therefore. If we use the superscript m to denote the corresponding T. .. •. .Il)" . . the m) fonn a oneparameter exponential family. B(m)(o) ~ mB(O).6. h(m)(x) ~ i=l II h(xi). .
where 1} = log 8.6.. 00 00 exp{A(1})} andE = R. = L(e"' (x!) = L(e")X (x! = exp(e").A(1})1 for s in some neighborhood 0[0.6. selves continuous.2.1}) = (1(x!)exp{1}xexp[1}]}.6.6. if q is definable. x E X c Rq (1.52 Statistical Models.. Goals.2.8) h(x) exp[~(8)T(x) . .6. E is either an interval or all of R and the class of models (1.6.9) with 1} ranging over E is called the canonical oneparameter exponential family generated by T and h.8) exp[~(8)t . By definition.6. the result follows. the momentgenerating function o[T(X) exists and is given by M(s) ~ exp[A(s + 1}) . £ is called the natural parameter space and T is called the natural sufficient statistic. }. The Poisson family in canonical form is q(x.B(8)] L {x:T(x)=t} ( 1.9) and ~ is an Ihlerior point of E. x E {O.1. We obtain an important and useful reparametrization of the exponential family (1. The exponential family then has the form q(x. x=o x=o o Here is a useful result.. If X is distributed according to (1. L {x:T{x)=t} h(x)}.1. P9[T(x) = tl L {x:T(x)=t} p(x. If 8 E e.6.1) by letting the model be indexed by 1} rather than 8. Then as we show in Section 1. Example 1. . then A(1}) must be finite. The model given by (1.9) where A(1}) = logJ .2.B(8)]{ Ifwe let h'(t) ~ L:{"T(x)~t} h(x). Theorem 1.A(~)I. (continued). 0 A similar theorem holds in the continuous case if the distributions ofT(X) are them Canonical exponential families.6. Jh(x)exp[1}T(x)]dx in the continuous case and the integral is replaced by a sum in the discrete case.9) with 1} E E contains the class of models with 8 E 8.. Let E be the collection of all 1} such that A(~) is finite.1}) = h(x)exp[1}T(x) . and Performance Criteria Chapter 1 Proof.
.4 Suppose X 1> ••• . 1 Tk. We compute M(s) = = E(exp(sT(X))) {exp[A(s ~ + 1])  A(1])]} J'" J J'" J h(x)exp[(s + 1])T(x) . We give the proof in the continuous case. is said to be a kparameter exponential family. B(O) the natural sufficient statistic E~ 2n(j2 and variance nlt}2 4n04 .8) = (il(x. This is known as the Rayleigh distribution. Therefore.6. A family of distrib~tioos {PO: 0 E 8}.6 Exponential Families 53 Moreover.17k and B of 6. Pitman. and realvalued functions T 1. 02 ~ 1/21].. 1. Now p(x. is one. c R k . More generally.8) ~ (x/8 2)exp(_x 2/28 2).A(1])] because the last factor. 0 Here is a typical application of this result. if there exist realvalued functions 171. Proof. x> 0.8 > O.<·or .8(0)].O) = h(x) exp[L 1]j(O)Tj (x) . (1. X n is a sample from a population with density p(x. and Dannois were led in their investigations to the following family of distributions. exp[A(s + 1]) . x E X j=1 c Rq. The rest of the theorem follows from the momentenerating property of M(s) (see Section A. E(T(X)) ~ A'(1]). 2 i=1 i=l n 1 n Here 1] = 1/20 2. Example 1. which is naturally indexed by a kdimensional parameter and admit a kdimensional sufficient statistic. h on Rq such that the density (frequency) functions of the Po may be written as. Direct computation of these moments is more complicated. ' e k p(x. nlog0 J. 0 = n log 82 and A(1]) 1 xl has mean nit} = = nlog(21]).10) . .j8 2))exp(i=l n n I>U282) ~=1 = (il xi)exp[202 LX.Section 1.6. Koopman.. being the integral of a density.12).2 The Multiparameter Case Our discussion of the "natural form" suggests that oneparameter exponential families are naturally indexed by a onedimensional real parameter fJ and admit a onedimensional sufficient statistic T(x). Var(T(X)) ~ A"(1]). It is used to model the density of "time until failure" for certain types of equipment.A(1])]dx h(x)exp[(s + 1])T(x)  A(s + 1])Jdx = . _ .6..
10).'l) = h(x)exp{TT(x)'l.6. i=l . in the conlinuous case. The density of Po may be written as e ~ {(I". suppose X = (Xl.').(X) •. I ! In the discrete case.LX. (1. Goals.A('l)}. . (]"2 > O}.Ot = fl. then the preceding discussion leads us to the natural sufficient statistic m m (LXi.5. . The Normal Family.. It will be referred to as a natural sufficient statistic of the family..2.') population.TdX))T is sufficient..LTk(Xi )) t=1 Example 1..x ..I' 54 Statistical Models... q(x. .'). . .4)..O) =exp[". the canonical kparameter exponential indexed by 71 = (fJI.1...') : 00 < f1 x2 1 J12 2 p(x.. . X m ) from a N(I". + log(27r"'». we define the natural parameter space as t: = ('l E R k : 00 < A('l) < oo}. 8 2 = and I" 1 2' Tl(x) = x. and Performance Criteria Chapter 1 By Theorem 1.. In either case.fJk)T rather than family generated by T and h is e. 1)2(0) = 2 " B(O) " 1"' 1 " T. . which corresponds to a twOparameter exponential family with q = I..11) (]"2.6. .'" . A(71) is defined in the same way except integrals over Rq are replaced by sums.x E X c Rq where T(x) = (T1 (x).. i=l i=1 which we obtained in the previous section (Example 1. letting the model be Thus.(x) = x'.5. +log(27r" »)]. .3. .' . 0 Again it will be convenient to consider the "biggest" families. . Then the distributions of X form a kparameter exponential family with natural sufficient statistic m m TCm)(x) = (LTl(Xi). If we observe a sample X = (X" . Again. J1 < 00. h(x) = I.Tk(x)f and. . = N(I". . Suppose lhat P.2(". 2 (". the vector T(X) = (T.6.Xm ) where the Xi are independent and identically distributed and their common distribution ranges over a k~parameter exponential family given by (1.
and t: = Rk . ~l ~ /. . .)T.j j=l = 1. and rewriting kl q(x. A(1/) = 4n[~:+m'~5+z~1~'+2Iog(rr/~3)].. ./(7'.(x»).~3 ~ 1/2(7'.. . .6 Exponential Families 55 (x. A) = I1:~1 AJ'(X). We observe the outcomes of n independent trials where each trial can end up in one of k possible categories./(7'.5 that Y I .6./(7'... 0. j=1 k This is a kparameter canonical exponential family generated by TIl"" Tk and h(x) = II~ I l[Xi E {I. A(1/) ~ ~[(~U2~.'I2. Linear Regression.. . Let T.6. x') = (T..~2): ~l E R.k.. . a 2 ).~3): ~l E R. . .Section 1.~3 < OJ. h(x) = I andt: ~ R x R.5. Now we can write the likelihood as k qo(x.4 and 1...(x). 0) ~ exp{L . Then p(x. the density of Y = (Yb . . 0 + el) ~ qo(x.EziY. This can be identifiable because qo(x. L71 Aj = I}.. ~.= {(~1. where A is the simplex {A E R k : 0 < A. 2. I <j < k 1. with J.(x) = L:~ I l[X i = j]. is not and all c. In this example.5. Example 1. k}] with canonical parameter 0. we can achieve this by the reparametrization k Aj = eO] /2:~ e Oj ..id.kj.'12 ~ (3. and AJ ~ P(X i ~ j).. E R k . in Examples 1. In this example. where m. as X and the sample space of each Xi is the k categories {I.."k.j = 1.~. TT(x) = T(Y) ~ (EY"EY.. E R. Yi rv N(Jli.1. 0) for 1 = (1. . . ... A E A.k... n.5. Suppose a<. I yn)T can be put in canonical form with k = 3.) + log(rr/.n log j=l L exp(.T. < OJ. 1/) =exp{T'f._l)(x)1/nlog(l where + Le"')) j=1 . Example 1. However 0.~·(x) . From Exam~ pIe 1. Multinomial Trials. .6. . We write the outcome vector as X = (Xl. < l..~2)]. ~3 and t: = {(~1. Y n are independent. . i = 1.~.)j. .'.5. remedied by considering IV ~j = log(A. = 1/2(7'.~l= (3. It will often be more convenient to work with unrestricted parameters.ti = f31 + (32Zi. k ~ 2.Xn)'r where the Xi are i. ./Ak) = "..6.. (N(/" (7') continued}. = n'Ezr Example 1.7.
0) ~ q(x.7 and X = (X t.6. let Xt =integers from 0 to ni generated Vi < n.6. ) 1(0 < However. is an npararneter canonical exponential family with Yi by T(Yt .6.13) . . Let Y. B(n. Here is an example of affine transformations of 6 and T. More10g(P'I[X = jI!P'IIX ~ k]).·.3 Building Exponential Families Submodels A submodel of a kparameter canonical exponential family {q(x. are identifiable. . Ai). 0 1.2.6.1. .56 Note that q(x. I < k. 11 E £ C R k } is an exponential family defined by p(x. < . See Problem 1. the parameters 'Ii = Note that the model for X Statistical Models.).17 e for details. is unchanged.6. 17). . be independent binomial. 1 <j < k . " . Similarly.d. and 1] is a map from e to a subset of R k • Thus. If the Ai are unrestricted. from Example 1. 1/(0») (1. k}] with canonical parameter TJ and £ = Rk~ 1.. h(y) = 07 I ( ~. M(T) sponding to and ~ MexkT + hex" it is easy to see that the family generated by M(T(X» and h is the subfamily of P corre 1/(0) = MTO..' A(1/) = L:7 1 ni 10g(l + e'·). if X is discrete Affine transformations from Rk to RI defined by UP is the canonical family generated by T kx 1 and hand M is the affine transformation I' .i. as X. 1 < i < n.. where (J E e c R1. Goals. and Performance Criteria Chapter 1 1 parameter canonical exponential family generated by T (k 1) {I. Here 'Ii = log 1\. . then all models for X are exponential families because they are submodels of the multinomial trials model. 0 < Ai < 1.8.6. X n ) T where the Xi are i.Y n ) ~ Y.. < X n be specified levels and (1. this... " Example 1. Logistic Regression. then the resulting submodel of P above is a submodel of the exponential family generated by BTT(X) and h. 1 < i < n.6. 1]) is a k and h(x) = rr~ 1 l[xi E over.12) taking on k values as in Example 1. if c Re and 1/( 0) = BkxeO C R k .
a 2 ). x).d...6.1..)T . LocationScale Regression.6...x = (Xl.i/a.Xn i. the 8 parametrization has dimension 2..8. Then (and only then)..12) with the range of 1J( 8) restricted to a subset of dimension l with I < k .6. + 8. If each Iii ranges over R and each a.. i=I This model is sometimes applied in experiments to determine the toxicity of a substance. PIX < xl ~ [1 + exp{ (8.9. are called curved exponential families provided they do not form a canonical exponential family in the 8 parametrization. which is called the coefficient of variation or signaltonoise ratio. Example 1. E R.Section 1. Yn. generated by . Example 1. Assume also: (a) No interaction between animals (independence) in relation to drug effects (b) The distribution of X in the animal population is logistic.00). . y. The Yi represent the number of animals dying out of ni when exposed to level Xi of the substance. 8) in the 8 parametrization is a canonical exponential family. so it is not a curved family..'V T(Y) = (Y" . SetM = B T . then this is the twoparameter canonical exponential family generated by lYIY = (L~l. . 'fJ1(B) curved exponential family with l = 1.13) holds.). .xn)T. p(x. a. ~ (1. .8.. However.) = L ni log(1 + exp(8.. . where 1 is (1.!t is assumed that each animal has a random toxicity threshold X such that death results if and only if a substance level on or above X is applied.1)T. n. L:~ 1 Xl yi)T and h with A(8. Y n are independent..PIX < x])) 8. . this is by Example 1. Then. that is. '. . is a known constant '\0 > O. Suppose that YI ~ . . .5 a 2nparamctercanonical exponential family model with fJi = P. o Curved exponential families Exponential families (1. log(P[X < xl!(1 .6.6..x and (1.14) 8. + 8. i = 1.10. and l1n+i = 1/2a.6 Exponential Families 57 This is a linear transformation TJ(8) = B nxz 8 corresponding to B nx2 = (1.8. which is less than k = n when n > 3. suppose the ratio IJlI/a. Y. . + 8.. }Ii N(J1. I ii. with B = p" we can write where T 1 = ~~ I Xi.6. .6. N(Jl.x)W'.i.. > O. This is a 0 In Example 1.Xi)). ranges over (0.i.'. T 2 = L:~ 1 X. In the nonnal case with Xl. Gaussian with Fixed SignaltoNoise Ratio... = '\501 and 'fJ2(0) = ~'\5B2.
.6.8. Sections 2.(0). 0) ~ q(y. and Performance Criteria Chapter 1 and h(Y) = 1.5. 1989.8 note that (1.. 1988.d. ~. For 8 = (8 1 .6 are heteroscedastic and homoscedastic models. then p(y. the map 1/(0) is Because L:~ 11]i(6)Yi + L:~ i17n+i(6)Y? cannot be written in the fonn ~.1/(O)) as defined in (6. 83 > 0 (e. 1 Tj(Y. sampling.i. and Snedecor and Cochran. We extend the statement of Theorem 1. In Example 1.(O)Tj'(Y) for some 11.xjYj ) and we can apply the supennode! approach to reach the same conclusion as before. a. Thus. Let Yj . say. E R. 1978. and B(O) = ~.g.) and 1 h'(YJ)' with pararneter1/(O).1 generalizes directly to kparameter families as does its continuous analogue.E Yj c Rq.6.6. 1 Bj(O). M(s) as the momentgenerating function. with an exponential family density Then Y = (Y1. Ii.2. We return to curved exponential family models in Section 2. Models in which the variance Var(}'i) depends on i are called heteroscedastic whereas models in which Var(Yi ) does not depend on i are called homoscedastic. n.) = (Yj. and = Ee sTT .13) exhibits Y. TJ'(Y).4 Properties of Exponential Families Theorem 1.10 and 1. Examples 1. Section 15.. Even more is true. as being distributed according to a twoparameter family generated by Tj(Y. respectively.10). .5 that for any random vector Tkxl.3. yn)T is modeled by the exponential family generated by T(Y) ~.12) is not an exponential family model. Supermode1s We have already noted that the exponential family structure is preserved under i.) depend on the value Zi of some covariate. Goals.58 Statistical Models. we define I I . Recall from Section B. 1 < j < n.).12.6. for unknown parameters 8 1 E R.1. Carroll and Ruppert.6._I11. be independent.6. 8. 1. Next suppose that (JLi.8..6. . but a curved exponential family model with 0 1=3. Bickel.
6. ('70).6. . We prove (b) first. ('70) II· The corollary follows immediately from Theorem B. Theorem 1.6.3(c).5.6.4). By the Holder inequality (B. t: and (a) follows. then T(X) has under 110 a moment M(s) ~ exp{A('7o + s) . If '7" '72 E + (1 . ~ = 1 .3. (with 00 pennitted on either side). h) with corresponding natural parameter space E and function A(11). ('70)) . v(x). Since 110 is an interior point this set of5 includes a baLL about O. Then (a) E is convex (b) A: E ) R is convex (c) If E has nonempty interior in R k generating function M given by and'TJo E E. u(x) ~ exp(a'7. 8".1 give a classical result in Example 1. Suppose '70' '71 E t: and 0 < a < 1.T(x)).a)'72 E is proved in exactly the same way as Theorem 1.1.A('70) = II 8".a)'7rT(x)) and take logs of both sides to obtain. Fiually (c) 0 The formulae of Corollary 1.3 V. S > 0 with ~ + ~ = 1. . Substitute ~ ~ a. A(U'71 Which is (b).15) is finite.8".r(T) = IICov(7~. J exp('7 TT (x))h(x)dx > 0 for all '7 we conclude from (1.15) t: the righthand side of (1.6.6. Corollary 1. Under the conditions ofTheorem 1. J u(x)v(x)h(x)dx < (J ur(x)h(x)dx)~(Jv'(x)h(x)dx)~. h(x) > 0..6. + (1 . Proof of Theorem 1.a)'72) < aA('7l) + (1 .6.a.9.Section 1. Let P be a canonicaL kparameter exponential famiLy generated by (T.2.6.r'7o T(X) . 1O)llkxk.6. A where A( '70) = (8"..A('7o)} vaLidfor aLL s such that 110 + 5 E E. T. vex) = exp«1 . .1 and Theorem 1.15) that a'7.6.a)A('72) Because (1. 8A 8A T" = A('7o) f/J. for any u(x).3.6 Expbnential Families 59 V.6.
6.(Tj(X» = P>.6. in Example 1. Our discussion suggests a link between rank and identifiability of the 'TJ parameterization. Formally. f=l . and Performance Criteria Chapter 1 Example 1. o The rank of an exponential family I . 7) E f} is a canonical exponential/amily generated by (TkXI ' h) with natural parameter space e such that E is open.1. However. for all 9 because 0 < p«X.lx ~ j] = AJ ~ e"'ILe a .~. Then the following are equivalent.. Suppose P = (q(x."I 11 J Evidently every kparameter exponential family is also k'dimensional with k' > k. and ry. k A(a) ~ nlog(Le"') j=l and k E>. Theorem 1. (X)". if we consider Y with n'> 2 and Xl < X n the family as we have seen remains of rank < 2 and is in fact of rank 2. But the rank of the family is 1 and 8 1 and 82 are not identifiable.4 that follows. Here.4. (i) P is a/rank k.(9) = 9.6. using the a: parametrization. Goads.1 is in fact its rank and this is seen in Theorem 1.7 we can see that the multinomial family is of rank at most k . . p:x.8.6. 2' Going back to Example 1.~~) < 00 for all x.x. Tk(X) are linearly independent with positive probability.4. we are writing the oneparameter binomial family corresponding to Yl as a twoparameter family with generating statistic (Y1 . + 9. . Xl Y1 ). .7).6.7. T. However. there is a minimal dimension. ifn ~ 1. . ajTj(X) = ak+d < 1 unless all aj are O. (ii) 7) is a parameter (identifiable). It is intuitively clear that k . such that h(x) > O. Note that PO(A) = 0 or Po (A) < 1 for some 0 iff the corresponding statement holds i. 9" 9.60 Statistical Models. P7)[L. We establish the connection and other fundamental relationships in Theorem 1. II ! I Ii .6. An exponential family is of rank k iff the generating statistic T is kdimensional and 1. Sintilarly. (iii) Var7)(T) is positive definite. (continued).
A is defined on all ofE.6. We give a detailed proof for k = 1. The proof for k details left to a problem.. which that T implies that A"(TJ) = 0 for all TJ and. Note that.' .3. of 1)0· Let Q = (P1)o+o(1). (iii) = (iv) and the same discussion shows that (iii) .. A' is constant. ranges over A(I').A(TJ.(ii) = (iii). Taking logs we obtain (TJI ..) is false.6.2 and. all 1) = (~i) II. = Proof ofthe general case sketched I." Then ~(i) > 1 is then sketched with '* P"[a. This is equivalent to Var"(T) ~ 0 '*~ (iii) {=} f"V(ii) There exist T}I =I= 1J2 such that F Tll = Pm' Equivalently exp{r/. III. by our remarks in the discussion of rank.4 hold and P is of rank k. 1)0) E n· Q is the exponential family (oneparameter) generated by (1)1 . for all T}. 0 Corollary 1.. = Proof.. thus. by Theorem 1. Then (a) P may be uniquely parametrized by "'(1)) E1)T(X) where".1)o)TT.'t·· . We. Thus.A(TJI)}h(x) ~ exp{TJ2T(x) . with probability 1.6 Exponential Families 61 (iv) 1) ~ A(1)) is 11 onto (v) A is strictly convex on E. .. have (i) . 1)) is a strictly concavefunction of1) On 1'. ~ (i) =~ (iii) ~ (i) = P1)[aT T = cJ ~ 1 for some a of 0. A"(1]O) = 0 for some 1]0 implies c. .6. Now (iii) =} A"(TJ) > 0 by Theorem 1. F!~ . :.6. all 1) ~ (iii) = a T Var1)(T)a = Var1)(aT T) = 0 for some a of 0.1)o) : 1)0 + c(1).~> . Suppose tlult the conditions of Theorem 1. ~ P1)o some 1).Section 1. '" '. Apply the case k = 1 to Q to get ~ (ii) =~ (i).(v). (b) logq(x.2. ".. 0 . ~ (ii) ~ _~ (i) (ii) = P1). A'(TJ) is strictly monotone increasing and 11. hence. Proof. Conversely.T ~ a2] = 1 for al of O.. . because E is open.TJ2)T(X) = A(TJ2) . Let ~ () denote "(. This is just a restatement of (iv) and (v) of the theorem.T(x) . hence.A(TJ2))h(x).) with probability 1 =~(i). (iv) '" (v) = (tii) Properties (iv) and (v) are equivalent to the statements holding for every Q defined as previously for arbitrary 1)0' 1).
62 Statistical Models.11. the 8(11) 0) family is parametrized by E(X). Suppose Xl.11.Jl)TE. E) = Idet(E) 1 1 2 / .6.(9) LT. .ljh<i<.. then X .21). families to which the posterior after sampling also belongs. For {N(/" "')}.._. Goals.6.I.2 (Y . . B(9) = (log Idet(E) I+JlTEl Jl). The p Van'ate Gaussian Family.~p).6. E. which is obviously a 11 function of (J1.a 2 + J12).X') = (JL.1 '" 11"'. .)] eXP{L 1/. Thus. 0 ! = 1. and.17) The first two terms on the right in (1. and Performance Criteria Chapter 1 The relation in (a) is sometimes evident and the j. .6.29) that I) generate this family and that the rank of the family is indeed p(p + 3)/2. and that (.1 (Y .6... the relation in (a) may be far from obvious (see Problem 1. Then p(xI9) = III h(x.6. as we always do in the Bayesian context.3..6.t parametrization is close to the initial parametrization of classical P. . so that Theorem 1. It may be shown (Problem 1.{Y.a 2 ). revealing that this is a k = p(p + 3)/2 parameter exponential family with statistics (Yi. is open."5) family by E(X).(Y). E).. Recall that Y px 1 has a p variate Gaussian distribution. . write p(x 19) for p(x.. However. (1. .E)._. (1. We close the present discussion of exponential families with the following example.6.6. where we identify the second element ofT... An important exponential family is based on the multivariate Gaussian distributions of Section 8..6.5 Conjugate Families of Prior Distributions In Section 1.6. which is a p x p symmetric matrix.17) can be rewritten ( 2: 1 <i<j~p aijYiYj + ~ 2: aii r:2 ) + L(2: aij /lj)Yi i=l i=l j=l p where E.Xn is a sample from the kparameter exponential family (1. N p (1J. Y n are iid Np(Jl.2 we considered beta prior distributions for the probability of success in n Bernoulli trials. E)..Yp. See Section 2.p/2 I exp{ . The corollary will prove very important in estimation theory.5. 9 ~ (Jl. h(Y) .2 p p (1.Jl. E) _~yTEIy + (E1JllY 2 I 2 (log Idet(E)1 + JlT I P log". ..4 applies.(Xi) i=l j=l i=l n • n nB(9)}.. This is a special case of conjugate families of priors. ifY 1.10). E(X. with mean J.18) _. Example 1.... ... where X is the Bernoulli trial.6.Jl) . T (and h generalizing Example 1.16) Rewriting the exponent we obtain 10gf(Y. ~iYi Vi).. the N(/l.Yn)T follows the k = p(p + 3)/2 parameter exponential family with T = (EiYi .tpx 1 and positive definite variance covariance matrix L::pxp' iff its density is f(Y.. with its distinct p(p + I) /2 entries. By our supermodel discussion..Jl)}. 0). Jl.
6.. we consider the conjugate family of the (1.18) and" by (1. A conjugate exponential family is obtained from (1. .fk+IB(O) logw(t)) j=1 (1. n)T D It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way. .O<W(tJ.. the parameter t of the prior distribution is updated to s = (t + a). is a conjugate prior to p(xIO) given by (1. tk+ I) T and w(t) = 1= eXP{hiJ~J(O)' _='" _= 1 k ik+JB(II)}dO J···dOk (1.fkIJ)<oo) with integrals replaced by sums in the discrete case.(O) ex exp{L ~J(O)(LT. .6.6..fk+Jl. Because two probability densities that are proportional must be equal.21) where S = (SI. then Proposition 1.u5) sample. . X n become available. The (k + I)parameter exponential/amity given by k ". tk+l) E 0.6. Proof.20).21) is an updating formula in the sense that as data Xl. be "parameters" and treating () as the variable of interest. .21) and our assertion follows. . (1.6..1. 1r(6jx) is the member of the exponential family (1. Example 1."" Sk+l)T = ( t} + ~TdxiL . . and t j = L~ 1 Tj(xd. where a = (I:~ 1 T1 (Xi)...6.(Olx) ex p(xlO)". k.22) 02 p(xIO) ex exp{ 2 .36).6.18). ik + ~Tk(Xi)' tk+l + 11. .6..Section 1. Forn = I e.) . We assume that rl is nonempty (see Problem 1.(0) ~ exp{L ~j(O)fj .19) o {(iJ.2)' Uo 2uo Ox .12. If p(xIO) is given by (1.'''' I:~ 1 Tk(Xi)..20) given by the last expression in (1. where u6 is known and (j is unknown.. ..1.(0). then k n .6. Note that (1.6.6....6.Xn is a N(O. let t = (tl. Suppose Xl. . 0 Remark 1.'" . To choose a prior distribution for model defined by (1. . .6 Exponential Families 63 where () E e.(Xi)+ f. That is.20) where t = (fJ.6.6.. n n )T and ex indicates that the two sides are proportional functions of (). ..18) by letting 11. j = 1.20).6. which is k~dimensional.(lk+1 j=1 i=1 + n)B(O)) ex ".6.
6.( 0 . and Performance Criteria Chapter 1 This is a oneparameter exponential family with The conjugate twoparameter exponential family given by (1. = nTg(n)I(1~.6.23) with (70 TO . (J Np(TJo.23) Upon completing the square. t.6.(n) = .25) By (1.64 Statistical Models.24) Thus. Eo). we find that 1r(Olx) is a normal density with mean Jl(s. 1 < i < n.6. Our conjugate family.6.6.(S) = t.6.6.6. the posterior has a density (1. = rg(n)lrg so that w.28) where W.d. (0) is defined only for t. > 0 and all t I and is the N (tI!t" (1~ It.) density.(O) <X exp{ .n) and variance = t. 761) where '10 varies over RP.26) intuitively as ( 1.30) that the Np()o" f) family with )0. W. Tg is scalar with TO > 0 and I is the p x p identity matrix (Problem 1. we must have in the (t l •t2) parametrization "g (1. therefore. tl(S) = 1]0(10 2 TO 2 + S. we obtain 1r. Using (1.20) has density (1.. = s. . 0 These formulae can be generalized to the case X.6. + n.6.W.6. 1r.) } 2a5 h t2 t1 2 (1. rJ) distributions where Tlo varies freely and is positive. i.6.24). consists of all N(TJo. E RP and f symmetric positive definite is a conjugate family f'V . 75) prior density. If we start with aN(f/o. Goals. Moreover. Eo known.26) (1.(n) (g +n)I[s+ 1]0~~1 TO TO (1. = 1 . if we observe EX. Np(O.37).27) Note that we can rewrite (1. it can be shown (Problem 1..i.21).
'T}k and B on e.5. If bas a nonempty interior in R k and '10 E then T(X) has for X ~ P7Jo the momentgenerating function .. h on Rq such that the density (frequency) function of Pe can be written as k p(x..29) (Tl (X).6.6. Eo). and Darmois. a theory has been built up that indicates that under suitable regularity conditions families of distributions. B) are not exponential. Discussion Note that the uniform U( (I. (1. In fact. .6 Exponential Families 65 but a richer one than we've defined in (1. Despite the existence of classes of examples such as these. . which admit kdimensional sufficient statistics for all sample sizes. e. with integrals replaced by sums in the discrete case. 0) = h(x) expl2:>j(I1)Ti (x) .. The set E is convex. . Summary. is not of the form L~ 1 T(Xi).Xn ).6.32. E 1 R is convex. x j=1 E X c Rq. is a kparameter exponential/amity of distributions if there are realvalued functions 'T}I . and realvalued functions TI. 8 C R k . " Tk. In the onedimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. 2. ••• . the map A . Some interesting results and a survey of the literature may be found in Brown (1986). Pitman.10 is a special result of this type. The canonical kparameter exponentialfamily generated by T and h is T(X) = where A(7J) ~ log J:. must be kparameter exponential families. ..1 are often too restrictive. ....... the conditions of Proposition 1.6. .20). which is onedimensional whatever be the sample size. The set is called the natural parameter space.A(7Jo)} e e. In fact. starting with Koopman. {PO: 0 E 8}.Section 1. T k (X)) is called the natural sufficient statistic of the family. r) is a p(p + 3)/2 rather than a p + I parameter family.20) except for p = 1 because Np(A.3 is not covered by this theory.6. See Problems 1. to Np (8 .J: h(x)exp{TT(x)7J}dx in the continuous case. for all s such that '10 + s is in Moreover E 7Jo [T(X)] = A(7Jo) and Var7Jo[T(X)] A( '10) where A and A denote the gradient and Hessian of A..31 and 1.6. .6. Problem 1.B(O)].(s) = exp{A(7Jo + s) . O}) model of Example 1. the family of distributions in this example and the family U(O. The natural sufficient statistic max( X 1 ~ . It is easy to see that one can consUUct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1.
1 units. . (iii) Var'7(T) is positive definite. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean J.6.7 PROBLEMS AND COMPLEMENTS Problems for Section 1.29). Goals. (iv) the map '7 ~ A('7) is I ' Ion 1'.L and variance a 2 . and Performance Criteria Chapter 1 An exponential family is said to be of rank k if T is kdimensional and 1.. is conjugate to the exponential family p(xl9) defined in (1.L and a 2 but has in advance no knowledge of the magnitudes of the two parameters.1 1.1)) (O)t j .. Suppose that the measuring instrument is known to be biased to the positive side by 0. (b) A measuring instrument is being used to obtain n independent determinations of a physical constant J..(0) = exp{L1)j(O)t) . T" .66 Statistical Models. and t = (t" . (a) A geologist measures the diameters of a large number n of pebbles in an old stream bed.L. (ii) '7 is identifiable.. He wishes to use his observations to obtain some infonnation about J. then the following are equivalent: (i) P is of rank k. (v) A is strictlY convex on f.. 1.B(O)tk+l logw} j=l where w = 1:" 1: E exp{l:. . tk+1) 0 ~ {(t" . A family F of priOf distributions for a parameter vector () is called a conjugate family of priors to p(x I 8) if the posterior distribution of () given x is a member of F. State whether the model in question is parametric or nonparametric.B(O)}dO. Give a formal statement of the following models identifying the probability laws of the data and the parameter space. tk+. Ii II I .) E R k+l : 0 < w < oo}.. Assume that the errors are otherwise identically distributed nonnal random variables with known variance. The (k + 1).parameter exponential family k ". l T k are linearly independent with positive Po probability for some 8 E e. If P is a canonical exponential family with E open.
..2 ) and Po is the distribution of Xll. Show that Fu+v(t) < Fu(t) for every t..1(d).. . and Po is the distribution of X (b) Same as (a) with Q = (X ll .p.. .2)..1(c).··. . then X is (b) As in Problem 1. An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest. . . .) (a) The parametrization of Problem 1.1 describe formaHy the foHowing model. . . Which of the following parametrizations are identifiable? (Prove or disprove. B = (1'1.1. e (e) Same as (d) with (Qt.) < Fy(t) for every t..Section 1.for this model? (d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean A. (d) Xi. 2.Xp ). Q p ) and (A I . At. 1 Ab) restricted to the sets where l:f I (Xi = 0 and l:~~1 A.ap ) : Lai = O}... each egg has an unknown chance p of hatching and the hatChing of one egg is independent of the hatching of the others. Each .. . ~ N(l'ij. 0. 1'2) and we observe Y X. restricted to p = (all' . respectively. v..2) where fLij = v + ai + Aj. .. (b) The parametrization of Problem 1. . Are the following parametrizations identifiable? (Prove or disprove. Once laid. .1.l(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case. .) (a) Xl.Xpb .. 4. . 0. (a) Let U be any random variable and V be any other nonnegative random variable. Can you perceive any difficulties in making statements about f1.7 Problems and Complements 67 (c) In part (b) suppose that the amount of bias is positive but unknown.2) and N (1'2. . i = 1..2 ). (If F x and F y are distribution functions such that Fx(t) said to be stochastically larger than Y. Ab.Xp are independent with X~ ruN(ai + v. = o..t. .. ... i=l (c) X and Y are independent N (I' I . = (al. Q p ) {(al.. (e) The parametrization of Problem l.1. . are sampled at random from a very large population.. bare independeht with Xi. j = 1."" a p . Two groups of nl and n2 individuals... 3.
nk·mI . I nl.· .. Let Po be the distribution of a treatment response.c) = PrY > c . i = 1. In each of k subsequent years the number of students graduating and of students dropping out is recorded. Let Y and Po is the distribution of Y.0.. It is known that the drug either has no effect or lowers blood pressure. is the distribution of X e = (0. Goals. 68 Statistical Models.p(0.. Show that Y .. 0'2).k is proposed where 0 < 71" < I. . 7.p(0..t) = 1 . (0).. . . the density or frequency function p of Y satisfies p( c + t) ~ p( c ... mk·r. . e = = 1 if X < 1 and Y {J.1. . .Li < + ..2). whenXis unifonnon {O. I. 5.Nk = nk. o < J1 < 1. (b)P. What assumptions underlie the simplification? 6. • "j :1 member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour. Vk) is unknown.9 and they oecur with frequencies p(0. + Vk + P = 1. 2. if and only if. ....1.. Hint: IfY ..1.k + VI + .0). 0 = (1'.. . Mk. + nk + ml + . =X if X > 1. The simplification J. + mk + r = n 1. Let N i be the number dropping out and M i the number graduating during year i..c bas the same distribution as Y + c.4(1).t) fj>r all t...O}... 0 < V < 1 are unknown. The number n of graduate students entering a certain department is recorded. Which of the following models are regular? (Prove or disprove. .Lk VI n". but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown. 8.c has the same distribution as Y + c.9). k.. . (e) Suppose X ~ N(I'. .2. + fl.. Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour.:".I nl . (a) What are the assumptions underlying this model'? (b) (J is very difficult to estimate here if k is large. I ! fl. . 0 < Vi < 1.1)..··· ... then P(Y < t + c) = P( Y < t . 1 < i < k and (J = (J.L. Consider the two sample models of Examples 1.P(Y < c .t)... is the distribution of X when X is unifonn on (0.LI ni + .. }. .. J.2.0'2) (d) Suppose the possible control responses in an experiment are 0. VI. and Performance Criteria Chapter 1 ":.0. The following model is proposed. V k mk P r where J. .Li = 71"(1 M)iIJ.  . Vi = (1 11")(1 ~ v)iI V for i = 1. M k = mk] I! n. • . ml . P8[N I .. I ...) (a) p. Suppose the effect of a treatment is to increase the control response by a fixed amount (J.3(2) and 1. Both Y and p are said to be symmetric about c.. .. 0 < J.LI... = n11MI = mI.' I .
C(t} = F(tjo). . o(x) ~ 21' + tl .. 10. I} and N is known.. Sbow that {p(j) : j = 0.Xn are observed i... . N.tl) implies that o(x) tl. (b) In part (a). Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci..d.tl) does not imply the constant treatment effect assumption... if the number of = (min(T. . or equivalently. (b) Deduce that (th. . i . log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl.i. . . (Yl. X m and Yi. .. Sx(t) = . then C(·) ~ F(. I(T < C)) = j] PIC = j] PIT where T. then satisfy a scale model with parameter e. fJp) are not identifiable if n pardffieters is larger than the number of observations. 12.2x and X ~ N(J1. Suppose X I. Therefore. {r(j) : j ~ 0. e) vary freely over F ~ {(I'. Let X < 1'.zp. . Y.3 let Xl. . ci '" N(O. > 0. 1 < i < n. are not collinear (linearly independent)."" Yn ). N} are identifiable. = (Zlj.{3p) are identifiable iff Zl.. and (I'. . .7 Problems and Complements 69 (a) Show that if Y ~ X + o(X).N = 0. Does X' Y' = yc satisfy a scale model? Does log X'.2x yield the same distribution for the data (Xl.. C are independent. . In Example 1. according to the distribution of X. Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0. . The Scale Model.6. that is. 0 < j < N..Znj?' (a) Show that ({31. Y = T I Y > j]. That is. . The Lelunann 1\voSample Model. Hint: Consider "hazard rates" for Y min(T.tl and o(x) ~ 21' + tl.. r(j) = = PlY = j.. . = 9. 11. Yn denote the survival times of two groups of patients receiving treatments A and B. . . p(j).. j = 0.Section 1. eX o. C).1. t > O. e) : p(j) > 0. suppose X has a distribution F that is not necessarily normal... Let c > 0 be a constant.Xn ). .tl).. log y' satisfy a shift model? = Xc. . C). N}.. . then C(·) = F(. 0 'Lf 'Lf op(j) = I. the two cases o(x) . C(·) = F(· . e(j) > 0. . (]"2) independent."'). Show that if we assume that o(x) + x is strictly increasing. e(j).. For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1. eY and (c) Suppose a scale model holds for X.. o (a) Show that in this case.'" ."') and o(x) is continuous.
Tk > t all occur. where TI". Set C. thus.1.12. . F(t) and Sy(t) ~ P(Y > t) = 1 .  . Then h.{a(t). I (b) Assume (1. then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll. Moreover.2) is equivalent to Sy (t I z) = sf" (t). So (T) has a U(O. Goals.. Hint: By Problem 8. = exp{g(. t > 0. = = (.z)}. The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(. Zj) + cj for some appropriate €j..G(t). . (b) By extending (bjn) from the rationals to 5 E (0.. (c) Suppose that T and Y have densities fort) andg(t). then Yi* = g({3. .7. k = a and b. t > 0. hy(t) = c. respectively. as T with survival function So.7.2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 . I) distribution.I ~I ..(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y. . (1. ~ a5. Hint: See Problem 1. log So(T) has an exponential distribution. A proportional hazard model.(t) if and only if Sy(t) = Sg.l3. Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy. show that there is an increasing function Q'(t) such that ifYt = Q'(Y. Survival beyond time t is modeled to occur if the events T} > t.(t). I (3p)T of unknowns. Also note that P(X > t) ~ Sort). 13.i. . t > 0.. we have the Lehmann model Sy(t) = si.. " T k are unobservable and i. and Performance CriteriCl Chapter 1 t Ii . Show that if So is continuous.ll) with scale parameter 5.l3. (c) Under the assumptions of (b) above. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t)..). Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t).) Show that Sy(t) = S'. 12. h(t I Zi) = f(t I Zi)jSy(t I Zi).(t). 1 P(X > t) =1 (.2) and that Fo(t) = P(T < t) is known and strictly increasing. 7.2.00).l3.(t) with C. Sy(t) = Sg.h.. . Show that hy(t) = c. z)) (1. I' 70 StCitistical Models.7. Specify the distribution of €i.. z) zT. The most common choice of 9 is the linear form g({3.1) Equivalently.d.. For treatments A and E.) Show that (1. are called the survival functions.ho(t) is called the Cox proportional hazard model..
Generally.4 1 0. 15. G.1. .000 is the minimum monthly salary for state workers.2 1. Show that both . Merging Opinions.2 unknown..O) 1 .1..p and C.6 Let 1f be the prior frequency function of 0 defined by 1f(I1.d. Consider a parameter space consisting of two points fh and ()2. the median is preferred in this case.Section 1. .5) or mean I' = xdF(x) = Fl(u)du. '1/1' > 0.1. . .. . still identifiable? If not. and suppose that for given (). (a) Suppose Zl' Z.11 and 1. what parameters are? Hint: Ca) PIXI < tl = if>(.2 with assumptions (t){4).L . X n are independent with frequency fnnction p(x posterior1r(() I Xl.." . Let Xl.2 0. Examples are the distribution of income and the distribution of wealth.(x/c)e. 'I/1(±oo) = ±oo.. have a N(O. the parameter of interest can be characterl ized as the median v = F.. Find the median v and the mean J1 for the values of () where the mean exists.l (0. Problems for Section 1.p(t)).i. Are. and Zl and Z{ are independent random variables. wbere the model {(F..12. Show how to choose () to make J. Observe that it depends only on l:~ I Xi· I 0).i.. Yl . Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x. 1) distribution. Find the .X n ). J1 and v are regarded as centers of the distribution F. (b) Suppose X" . F. 14.d.2) distribution with .v arbitrarily large. x> c x<c 0. have a N(O. 1f(O2 ) = ~. When F is not symmetric. . Here is an example in which the mean is extreme and the median is not. Yn be i.Xm be i. and for this reason.8 0. J1 may be very much pulled in the direction of the longer tail of the density. an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0.p and ~ are identifi· able. ) = ~. In Example 1.7 Problems and Complements 71 Hint: See Problems 1. (b) Suppose Zl and Z.. where () > 0 and c = 2.G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R. Ca) Find the posterior frequency function 1f(0 I x).
25.)=1. 71"1 (0 2 ) = . does it matter which prior. " ••. . c(a). 3. . 3. . " J = [2:..=l1r(B i )p(x I 8i ).8). Find the posteriordensityof8 given Xl = Xl. .. ... 0 (c) Find E(81 x) for the two priors in (a) and (b).. k = 0. For this convergence. "X n ) = c(n 'n+a m) ' J=m.72 Statistical Models. the outcome X has density p(x OJ ~ (2x/O'). 0 < x < O. Compare these B's for n = 2 and 100. 7l" or 11"1. Unllormon {'13} . This is called the geometric distribution (9(0)). . for given (J = B. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O.m+I. ~ = (h I L~l Xi ~ = ..5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. (3) Find the posterior density of 8 when 71"( 0) ~ 1.. .2.'" 1rajI Show that J + a. where a> 1 and c(a) .75. 4'2'4 (b) Relative to (a). 1r(J)= 'u . ) ~ . and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 . 1. (J. (b) Find the posterior density of 8 when 71"( OJ = 302 .0 < 0 < 1.0 < B < 1. Consider an experiment in which. Suppose that for given 8 = 0.. X n are independent with the same distribution as X. 'IT .Xn are natural numbers between 1 and Band e= {I. Let 71" denote a prior density for 8. I XI ..O)kO. (d) Suppose Xl. < 0 < 1.2. ':. what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta.'" . . Let X I. . 2. . Then P. 7r (d) Give the values of P(B n = 2 and 100. IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree.. . ... X n be distributed as where XI. (3) Suppose 8 has prior frequency.. Show that the probability of this set tends to zero as n t 00. 4. Assume X I'V p(x) = l:. Goals. }. l3(r.[X ~ k] ~ (1. is used in the fonnula for p(x)? 2. X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is . ..Xn = X n when 1r(B) = 1.
I Xn+k) given . f3(r.. where VI.. .n.8) that if in Example 1. Show in Example 1.• X n = Xn.. XI = XI.f) = [~. ZI..• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e.. X n+ k ) be a sample from a population with density f(x I 0).2. b) denote the posterior distribution.b> 1. n...Section 1. Xn) + 5.1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •." .7 Problems . then !3( a.. then the posterior distribution of D given X = k is that of k + Z where Z has a B(N . ... If a and b are integers. . } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is ... are independent standard exponential. 9..and Complements 73 wherern = max(xl.'" . Suppose Xl. 11'0) distribution. where {i E A and N E {l. D = NO has a B(N.) = n+r+s Hint: Let!3( a. f(x I f).1. S. In Example 1. . a regular model and integrable as a function of O. W. Let (XI. .2. . X n = Xn is that of (Y.1. . . . Show that ?T( m I Xl. Next use the central limit theorem and Slutsky's theorem. .c(b..Xn is a sample with Xi '"'' p(x I (}). Xn) = Xl = m for all 1 as n + 00 whatever be a.1= n+r+s _ .. . 11'0) distribution."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl.. (a) Show that the family of priors E.0 E Let 9 have prior density 1r. 10. Va' W t .··.2... Justify the following approximation to the posterior distribution where q. 7. 6. Show that a conjugate family of distributions for the Poisson family is the gamma family.. Assume that A = {x : p( x I 8) > O} does not involve O. (b) Suppose that max(xl.1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta. where E~ 1 Xi = k.."'tF·t'. is the standard normal distribution function and n _ T 2 x + n+r+s' a J. Show rigorously using (1. b) is the distribution of (aV loW)[1 + (aV IbW)]t. ..(1. s).2. Interpret this result.J:n). . X n+ Il . Show that the conditional distribution of (6.
vB has a XX distribution.~vO}. X and 5' are independent with X ~ N(I".. . Q'r)T.d.) u/' 0 < cx) L"."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x..74 where N' ~ N Statistical Models.J t • I X.. 1 X n . T6) densities.1 < j < T.6 ""' 1r. X 1. (b) Discuss the behavior of the two predictive distributions as 15. (c) Find the posterior distribution of 0". Find I 12. a = (a1.. has density r("" a) IT • fa(u) ~ n' r(. In a Bayesian model whereX l . Q'j > 0. (N.. 9= (OJ. .~l . 0 > O. •• • .i. Find the posterior distribution 7f(() I x) and show that if>' is an integer. then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I.i. . given (x. Show that if Xl. . <0<x and let7t(O) = 2exp{20}. . and () = a. . x> 0. . Letp(x I OJ =exp{(xO)}. 8 2 ) is such that vn(p. .d../If (Ito. 01 )=1 j=l Uj < 1. . D(a). (12). po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { . X n .. Xl .) pix I 0) ~ ({" . 0 > O. .O.. 11. v > 0. L.i. 'V .") ~ .. j=1 '\' • = 1. posterior predictive distribution.. 82 I fL. (. . < 1. (a) If f and 7t are the N(O. The posterior predictive distribution is the conditional distribution of X n +l given Xl. The Dirichlet distribution.2 is (called) the precision of the distribution of Xi. the predictive distribution is the marginal distribution of X n + l . The Dirichlet distribution is a conjugate prior for the multinomial. . 52) of J1.. . "5) and N(Oo. Next use Bayes rule.. unconditionally. .. LO. . . and Performance Criteria Chapter 1 + nand ({...\ > 0. X n are i.d.2) and we formally put 7t(I". I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}. Let N = (N l ..9). Note that. Here HinJ: Given I" and ". i). ....Xn + l arei. O(t + v) has a X~+n distribution. compute the predictive and n + 00. j=1 • . I(x I (J). N(I". = 1.. N.. . 14.. .. (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8. 0> 0 a otherwise. . ..Xn. 13. Goals.O the posterior density 7t( 0 Ix). x n ). Suppose p(x I 0) is the density of i.)T.) be multinomial M(n. where Xi known. 0<0.~X) '" tnI.J u. given x.
a3. or 0 > a be penoted by I..3. Find the Bayes rule for case 2. (d) Suppose that 0 has prior 1C(OIl (a).1.. the possible actions are al. = 11'.1.3. n r ).p) (I . 1C(02) l' = 0. .q) and let when at.5 and (ii) l' = 0. (b)p=lq=. and the loss function l((J. (J2. . (J) given by O\x a (I .7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a).3. 1.3. 1957) °< °= a 0.3. . ..159 be the decision rules of Table 1. a) is given by 0. O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x. I..5. . = 0.3. (c) Find the minimax rule among the randomized rules. 159 of Table 1. Compute and plot the risk points (a)p=q= . Suppose that in Example 1.5. a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St. Suppose the possible states of nature are (Jl.. 159 for the preceding case (a). 1C(O. See Example 1.Section 1.1. Find the Bayes rule when (i) 3.) = 0.. Problems for Section 1. Let the actions corresponding to deciding whether the loss function is given by (from Lehmann. (b) Find the minimax rule among {b 1 . (c) Find the minimax rule among bt. The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O. (d) Suppose 0 has prior 1C(01) ~ 1'.. 8 = 0 or 8 > 0 for some parameter 8. respectively and suppose . I b"9 }.1. 0. . then the posteriof7r(0 where n = (nt l · · · .5.. .3 IN ~ n) is V( a + n). a2.
E. . 0 be chosen to make iiI unbiased? I (b) Neyman allocation.j = I.g. (. iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj. b<f>(y'ns) + b<I>(y'nr).=l iii = N. 0<0 0=0 0 >0 R(O.(X) 0 b b+c 0 c b+c b 0 where b and c are positive.andastratumsamplemeanXj.. 1 < j < S. geographic locations or age groups).0)). Assume that 0 < a. < 00.) c<f>( y'n(r . and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J".s. (a) Compute the biases. . c<I>(y'n(sO))+b<I>(y'n(rO)).Xnjj. Stratified sampling.. 1) distribution function.) r=2s=1. and <I> is the N(O.. 11 .3) . 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s. < J' < s. Show that the strata sample sizes that minimize MSE(!i2) are given by (1. are known. i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4.I L LXij. n = 1 and . . ... J". I I I.j = 1. Goals.7.. Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0. 1 < j < S.i. Suppose X is aN(B.d. We want to estimate the mean J1. How should nj.. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e. where <f> = 1 <I>. (b) Plot the risk function when b = c = 1. variances.0)) +b<f>( y'n(s .76 StatistiCCII Models. I . Weassurne that the s samples from different strata are independent. are known (estimates will be used in a latet chapter). Within the jth stratum we have a sample of i. 1 (l)r=s=l. and MSEs of iiI and 'j1z. random variables Xlj"".8.
.7 Problems and Complements 77 Hint: You may use a Lagrange multiplier. I).. that X is the median of the sample. show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift).0 .=1 Pj(O'j where a.0.=1 pp7J" aV. U(O. Hint: See Problem B. ~ 7. (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X)..= 2:.2.13.. N(o.. 1).20. l X n . ~ ~  ~ 6. 5. Next note that the distribution of X involves Bernoulli and multinomial trials. (iii) F is normal. b = '.1. ..'. (d) Find EIX . > k) (ii) F is uniform. where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~.. We want to estimate "the" median lJ of F.5. and = 1. ° = 15. p = . Hint: By Problem 1.bl when n = compare it to EIX . and that n is odd.2.25.2p. .3.5. Let X and X denote the sample mean and median. < ° P < 1.2'. Suppose that n is odd. .5.b. . set () = 0 without loss of generality. Let XI. X n be a sample from a population with values 0.i. (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i. . ~ a) = P(X = c) = P. Hint: Use Problem 1. Use a numerical integration package.0 ~ + 2'.2. The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = . . and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii). .3) is N.d.I 2:..15.3. ~ ~ ~ .9.b.45. ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c. Also find EIX ..3.p). (b) Evaluate RR when n = 1.) with nk given by (1. 75. plot RR for p = . a = ..5(0 + I) and S ~ B(n.40.bl for the situation in (i).. .2. Each value has probability . P(X ~ b) = 1 .5. Let Xb and X b denote the sample mean and the sample median of the sample XI b. [f the parameters of interest are the population mean and median of Xi .. Hint: See Problem B.bl..0 + '. respectively. .5. (c) Same as (b) except when n ~ 1. ..3. > 0. (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0. n ~ 1. X n are i. Suppose that X I.Section 1.4.7.. ali X '"'' F..
A decision rule 15 is said to be unbiased if E. e 8 ~ (.(o(X)). and only if. a) = (6 .2)(. 6' E e.. = c L~ 1 (Xi .0(X))) for all 6. . 10. defined by (J(6. .1)1 E~ 1 (Xi . In Problem 1. Let Xl. If the true 6 is 60 .3(a) with b = c = 1 and n = I. A person in charge of ordering equipment needs to estimate and uses I .8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company. then expand (Xi .3. 9.J.g.(1(6'. . II.2 ~~ I (Xi . 0) (J(6'. . ' ..1'])2. ~ 2(n _ 1)1. being lefthanded). for all 6' E 8 1. Let () denote the proportion of people working in a company who have a certain characteristic (e. suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i. (b) Suppose Xi _ N(I'. I (b) Show that if we use the 0 1 loss function in testing. . You may use the fact that E(X i .. I . .').[X .X)2 is an unbiased estimator of u 2 • Hint: Write (Xi . Find MSE(6) and MSE(P). (a)r = 8 =1 (b)r= ~s =1.X)2 has a X~I distribution. .L)4 = 3(12...(1(6.0(X))) < E. (a) Show that if 6 is real and 1(6. 0 < a 2 < 00. the powerfunction.0) = E.for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100.10) + (. (a) Show that s:? = (n . satisfies > sup{{J(6. Show that the value of c that minimizes M S E(c/.. (i) Show that MSE(S2) (ii) Let 0'6 c~ .3) that .X)2.) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B.  . I I 78 Statistical Models. 8. and Performance Criteria Chapter 1 :! .3. Which one of the rules is the better one from the Bayes point of view? 11. It is known that in the state where the company is located.a)2. I .0): 6 E eo}.4.1'] .. 10% have the characteristic. then a test function is unbiased in this sense if. then this definition coincides with the definition of an unbiased estimate of e. Goals.X)' = ([Xi . i1 .Xn be a sample from a population with variance a 2 .X)2 keeping the square brackets intact.
Ok) as a function of B. . z > 0. 0 < u < 1.UfO. Convexity ofthe risk set. s = r = I.Section 1.1.1. b. Show that J agrees with rp in the sense that. (c) Same as (b) except k = 2. find the set of I' where MSE(ji) depend on n. there is a randomized procedure 03 such that R(B. and possible actions at and a2. 0. The interpretation of <p is the following. if <p(x) = 1. Suppose that the set of decision procedures is finite. 18. use the test J u . Compare 02 and 03' 19. consider the estimator < MSE(X).1. and then JT. Suppose the loss function feB.3.3. (a) find the value of Wo that minimizes M SE(Mw). Show that the procedure O(X) au is admissible..~ is unbiased 13. For Example 1. . If X ~ x and <p(x) = 0 we decide 90. In Problem 1. If U = u. B = . A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known. 03) = aR(B. 16. show that if c ::.3.3. (h) If N ~ 10. In Example 1.Polo(X) ~ 01 = Eo(<p(X)).3.3. Consider the following randomized test J: Observe U. (b) find the minimum relative risk of P. In Example 1. Your answer should Mw If n. a) is ." (a) Show that the risk is given by (1. 1) and is independent of X..4.4...) for all B. Consider a decision problem with the possible states of nature 81 and 82 . plot R(O.' and 0 = II' .a )R(B. then. procedures.7).7 Problems and Complements 79 12.1) and let Ok be the decision rule "reject the shipment iff X > k.wo to X. but if 0 < <p(x) < 1. wc dccide El. o Suppose that U . Define the nonrandomized test Ju .ao) ~ O. by 1 if<p(X) if <p(X) >u < u. Furthersuppose that l(Bo. Show that if J 1 and J 2 are two randomized. 0. 14. consider the loss function (1.1'0 I· 17. Suppose that Po. 15. we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise. given 0 < a < 1. (B) = 0 for some event B implies that Po (B) = 0 for all B E El. and k o ~ 3.) + (1 . Po[o(X) ~ 11 ~ 1 .
80 Statistical Models. Give the minimax rule among the randomized decision rules.8 0. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly. Give an example in which Z can be used to predict Y perfectly.4 = 0.9. What is 1.(0.. 6.l. (a) Find the best predictor of Y given Z. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y.1. !. al a2 0 3 2 I 0.7 find the best predictors ofY given X and of X given Y and calculate their MSPEs. Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error. An urn contains four red and four black balls.4 I 0. Let X be a random variable with probability function p(x ! B) O\x 0. and the ratio of its MSPE to that of the best and best linear predictors..6 (a) Compute and plot the risk points of the nonrandomized decision rules. (b) Compute the MSPEs of the predictors in (a). Give the minimax rule among the nonrandomized decision rules. U2 be independent standard normal random variables and set Z Y = U I. and Performance Criteria Chapter 1 B\a 0.4. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. the best linear predictor.. 5.Is Z of any value in predicting Y? = Ur + U?. (b) Give and plot the risk set S. . 7.2 0. (c) Suppose 0 has the prior distribution defined by .1 calculate explicitly the best zero intercept linear predictor. Four balls are drawn at random without replacement.) = 0.) the Bayes decision rule? Problems for Section 1. its MSPE. Goals. The midpoint of the interval of such c is called the conventionally defined median or simply just the median. 3. 2. !. 4. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z). Let U I . 0. In Problem B. 0 0.(0. and the best zero intercept linear predictor. In Example 1.
a 2.1"1/0] where Q(t) = 2['I'(t) + t<I>(t)].cl) = oQ[lc .L. Y'. which is symmetric about c.el) as a function of c. Show that c is a median of Z. (a) Show that the relation between Z and Y is weaker than that between Z' and y J . Y" are N(v. ICor(Z. Y)I < Ipl. Let Zl and Zz be independent and have exponential distributions with density Ae\z. Y) has a bivariate normal distribution. Show that if (Z. p) distribution and Z". Let ZI. p( z) is nonincreasing for z > c. z > 0. and otherwise. Let Y have a N(I". that is.z) for 11. If Y and Z are any two random variables. Define Z = Z2 and Y = Zl + Z l Z2.  = ElY cl + (c  co){P[Y > co] . Suppose that Z.col < eo.2 ) distrihution. Y" be the corresponding genetic and environmental components Z = ZJ + Z". 7 2 ) variables independent of each other and of (ZI. Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor. Y') have a N(p" J.t. 9.Section 1. Yare measurements on such a variable for a randomly selected father and son. 12. which is symmetric about c and which is unimodal. p( c all z. 0. Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) . (b) Show directly that I" minimizes E(IY . 10. Y = Y' + Y". Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» . a 2. exhibit a best predictor of Y given Z for mean absolute prediction error. (b) Suppose (Z. where (ZI .PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8. Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables.7 Problems and Complements 81 Hint: If c ElY . Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error. Suppose that Z has a density p. (a) Show that P[lZ  tl < s] is maximized as a function oftfor each s > ° by t = c. + z) = p(c . Suppose that Z has a density p. Z". that is. (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13. y l ). ° 14. (a) Show that E(IY .YI > s.
in the linear model of Remark 1. IS. .4. .exp{ >.y}. and Doksurn and Samarov. 1). Consider a subject who walks into a clinic today. Show that Var(/L(Z))/Var(Y) = Corr'(Y./L(Z)) = max Corr'(Y..4.) = Var( Z.3.: (0 The best linear MSPE predictor of Y based on Z = z. At the same time t a diagnostic indicator Zo of the severity of the disease (e.) ~ 1/>. (c) Let '£ = Y . = Corr'(Y. Show that. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1. >. .S from infection until detection. and is diagnosed with a certain disease. g(Z)) 9 where g(Z) stands for any predictor./L)/<J and Y ~ /3Yo/<J. Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo.) ~ 1/>''.1] . that is. j3 > 0. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y. and Performance Criteria Chapter 1 . Hint: Recall that E( Z.4. then'fJh(Z)Y =1JZY. 2 'T ' .15. hat1s. . Show that p~y linear predictors. . .(y) = >. (See Pearson.g. > 0.g(Z)) where £ is the set of 17. 82 Statisflcal Models. . at time t. where piy is the population multiple correlation coefficient of Remark 1./L£(Z)) ~ maxgEL Corr'(Y. 1905. on estimation of 1]~y. a blood cell or viral load measurement) is obtained.. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z.I • ! . We are interested in the time Yo = t . I. y> O. Goals. mvanant un d ersuchh. Hint: See Problem 1.4. and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease.) (a) Show that 1]~y > piy. €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18. 1995. ~~y = 1 . (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y. IS.ElY .) = E( Z. Let S be the unknown date in the past when the sUbject was infected./LdZ) be the linear prediction error. and Var( Z./L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y). 16. Let /L(z) = E(Y I Z ~ z). Predicting the past from the present. Yo > O. It will be convenient to rescale the problem by introducing Z = (Zo .
4. (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>. This density is called the truncated (at zero) normal. 1). 1990. (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo .7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density . Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2.z). then = E{Cov[r(Y). (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo.6) when r = s.. all the unknowns... 21. s(Y» given Z = z. Write Cov[r(Y).g(Zo)l· Hint: See Problems 1. Cov[r(Y). Find Corr(Y}.~ [y  (z . (c) Show that if Z is real. see Berman.>')1 2 } Y> ° where c = 4>(z . (a) Show that ifCov[r(Y). and Nonnand and Doksum. Establish 1. 7r(Y I z) ~ (27r) .9. b).4.>.). . 2000). Y2) using (a). (b) Show that (a) is equivalent to (1. s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y). s(Y)] < 00. c. need to be estimated from cohort studies. (c) The optimal predictor of Y given X = x. and checking convexity. Find (a) The mean and variance of Y. N(z .s(Y)] Covlr(Y). x = 1.. wbere Y j . .>. s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}.14 by setting the derivatives of R(a. (b) The MSPE of the optimal predictor of Y based on X.Section 1. I Z]' Z}. including the "prior" 71". where Z}. b) equal to zero. Let Y be the number of heads showing when X fair coins are tossed.z) .(>' .7 and 1. .I exp { . Hint: Use Bayes rule. density. (In practice.4.4. where X is the number of spots showing when a fair die is rolled.. Z2 and W are independent with finite variances. 6. + b2 Z 2 + W. solving for (a. 20. 19. Let Y be a vector and let r(Y) and s(Y) be real valued.
Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad. a > O. \ (b) Establish the same result using the factorization theorem..4. z) = z(1 .z) be a positive realvalued function. . 1 X n be a sample from a Poisson.14). z) and c is the constant that makes Po a density. 25.. This is known as the Weibull density. density. Verify that solving (1.2eY + e' < 2(Y' + e'). Find the 22.Xn is a sample from a population with one of the following densities. Oax"l exp( Ox").. Then [y . 24. Problems for Section 1. and Performance Criteria Chapter 1 (e) In the preceding model (d).5 1. Show that EY' < 00 if and only if E(Y . (a) Show directly that 1 z:::: 1 Xi is sufficient for 8. and (ii) w(y. 0 > 0. and = 0 otherwise.z)lw(y. 1). Let Xl. .e' < (Y . 1 2 y' . a > O. . are N(p. Goals.6(r. .. 2. (a) Let w(y.density.e)' = Y' . z "5). (c) p(x.g(z)]'lw(y. p(e).4. Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z).6(O. 0 < z < I. z) ~ ep(y. > a.. 00 (b) Suppose that given Z = z. Zz). s). optimal predictor of Yz given (Yi. In this case what is Corr(Y. Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem.z). In Example 1.0> O. if b1 = b2 and Zl. Suppose Xl. g(Z)) < for some 9 and that Po is a density. we say that there is a 50% overlap between Y 1 and Yz. Y z )? (f) In model (d). 23.4..x > 0. . Assume that Eow (Y.z) = 6w (y. population where e > O. n > 2.p~y). where Po(Y. x  . .~(1 .l . .15) yields (1. Zz and ~V have the same variance (T2. (a) p(x. 0) = Ox'. Find Po(Z) when (i) w(y. Y ~ B(n. . 3. This is thebeta. 0 > 0. 1. suppose that Zl and Z.z).84 Statistical Models. and suppose that Z has the beta. z) = I. 0) = Oa' Ix('H).g(z») is called weighted squared prediction error. show that the MSPE of the optimal predictor is . 0) = < X < 1.e)' Hint: Whatever be Y and c.3. < 00 for all e. 0 (b) p(x.') and W ~ N(po. Let Xi = 1 if the ith item drawn is bad.
Section 1. 7
Problems and Complements
85
This is known as the Pareto density. In each case. find a realvalued sufficient statistic for (), a fixed. 4. (a) Show that T 1 and T z are equivalent statistics if, and only if, we can write T z = H (T1 ) for some 11 transformation H of the range of T 1 into the range of T z . Which of the following statistics are equivalent? (Prove or disprove.) (b) n~ (c) I:~
1
x t and I:~
and I:~
I:~
1
log Xi, log Xi.
Xi Xi
I Xi
1
>0 >0
l(Xi X
1 (Xi 
(d)(I:~ lXi,I:~ lxnand(I:~ lXi,I:~
(e) (I:~
I Xi, 1
?)
xl)
and (I:~
1 Xi,
I:~
X)3).
5. Let e = (e lo e,) be a bivariate parameter. Suppose that T l (X) is sufficient for e, whenever 82 is fixed and known, whereas T2 (X) is sufficient for (h whenever 81 is fixed and known. Assume that eh ()2 vary independently, lh E 8 1 , 8 2 E 8 2 and that the set S = {x: pix, e) > O} does not depend on e. (a) Show that ifT, and T, do not depend one2 and e, respectively, then (Tl (X), T2 (X)) is sufficient for e. (b) Exhibit an example in which (T, (X), T2 (X)) is sufficient for T l (X) is sufficient for 8 1 whenever 8 2 is fixed and known, but Tz(X) is not sufficient for 82 , when el is fixed and known. 6. Let X take on the specified values VI, .•. 1 Vk with probabilities 8 1 , .•• ,8k, respectively. Suppose that Xl, ... ,Xn are independently and identically distributed as X. Suppose that IJ = (e" ... , e>l is unknown and may range over the set e = {(e" ... ,ek) : e, > 0, 1 < i < k, E~ 18i = I}, Let Nj be the number of Xi which equal Vj' (a) What is the distribution of (N" ... , N k )? (b) Sbow that N = (N" ... , N k _,) is sufficient for 7. Let Xl,'"
1
e,
e.
X n be a sample from a population with density p(x, 8) given by
pix, e)
o otherwise.
Here
e = (/1, <r) with 00 < /1 < 00, <r > 0.
(a) Show that min (Xl, ... 1 X n ) is sufficient for fl when a is fixed. (b) Find a onedimensional sufficient statistic for a when J1. is fixed. (c) Exhibit a twodimensional sufficient statistic for 8. 8. Let Xl,. " ,Xn be a sample from some continuous distribution Fwith density f, which is unknown. Treating f as a parameter, show that the order statistics X(l),"" X(n) (cf. Problem B.2.8) are sufficient for f.
,
"
I
!
86
Statistical Models, Goals, and Performance Criteria
Chapter 1
9. Let Xl, ... ,Xn be a sample from a population with density
j,(x)
a(O)h(x) if 0, < x < 0,
o othetwise
where h(x)
> 0,0= (0,,0,)
with
00
< 0, < 0, <
00,
and a(O)
=
[J:"
h(X)dXr'
is assumed to exist. Find a twodimensional sufficient statistic for this problem and apply your result to the U[()l, ()2] family of distributions. 10. Suppose Xl" .. , X n are U.d. with density I(x, 8) = ~elx61. Show that (X{I),"" X(n», the order statistics, are minimal sufficient. Hint: t,Lx(O) =  E~ ,sgn(Xi  0), 0 't {X"" . , X n }, which determines X(I),
. " , X(n)'
11. Let X 1 ,X2, ... ,Xn be a sample from the unifonn, U(O,B). distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for O.
12. Dynkin, Lehmann, Scheffe's Theorem. Let P = {Po : () E e} where Po is discrete concentrated on X = {x" x," .. }. Let p(x, 0) p.[X = xl Lx(O) > on Show that f:xx(~~) is minimial sufficient. Hint: Apply the factorization theorem.
=
=
°
x,
13. Suppose that X = (XlI" _, X n ) is a sample from a population with continuous distribution function F(x). If F(x) is N(j1., ,,'), T(X) = (X, ,,'). where,,2 = n l E(Xi 1')2, is sufficient, and S(X) ~ (XCI)"" ,Xin»' where XCi) = (X(i)  1')/'" is "irrelevant" (ancillary) for (IL, a 2 ). However, S(X) is exactly what is needed to estimate the "shape" of F(x) when F(x) is unknown. The shape of F is represented hy the equivalence class F = {F((·  a)/b) : b > 0, a E R}. Thus a distribution G has the same shape as F iff G E F. For instance, one "estimator" of this shape is the scaled empirical distribution function F,(x) jln, x(j) < x < x(i+1)' j = 1, . .. ,nl
~
0, x
< XCI)
> x(n)
1, x
~
Show that for fixed x, F,((x  x)/,,) converges in prohahility to F(x). Here we are using F to represent :F because every member of:F can be obtained from F.
I
I ,
'I i,
14. Kolmogorov's Theorem. We are given a regular model with e finite.
(a) Suppose that a statistic T(X) has the property that for any prior distribution on 9, the posterior distrihution of 9 depends on x only through T(x). Show that T(X) is sufficient.
(b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the posterior distribution depends on x only through T(x).
Section 1.7
Problems and Complements
87
Hint: Apply the factorization theorem.
15. Let X h .··, X n be a sample from f(x  0), () E R. Show that the order statistics arc minimal sufficient when / is the density Cauchy Itt) ~ I/Jr(1 + t 2 ). 16. Let Xl,"" X rn ; Y1 ,· . " ~l be independently distributed according to N(p, (72) and N(TI, 7 2 ), respectively. Find minimal sufficient statistics for the following three cases:
(i) p, TI,
0", T
are arbitrary:
00
< p, TI < 00, a <
(J,
T.
(ii)
(J
=T
= TJ
and p, TI, (7 are arbitrary. and p,
0", T
(iii) p
are arbitrary.
17. In Example 1.5.4. express tl as a function of Lx(O, 1) and Lx(l, I). Problems to Sectinn 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose X I, ... , X n is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of X fonns a oneparameter exponential family. Identify 'TI, B, T, and h. 3. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability nf success O. Then P, IX = k] = (I  0)'0, k ~ 0 0 1,2, ... This is called thc geometric distribution (9 (0». (a) Show that the family of geometric distributions is a oneparameter exponential family with T(x) ~ x. (b) Deduce from Theorem 1.6.1 that if X lo '" oXn is a sample from 9(0), then the distributions of L~ 1 Xi fonn a oneparameter exponential family. (c) Show that E~
1
Xi in part (b) has a negative binomial distribution with parameters
(noO)definedbyP,[L:71Xi = kJ =
(n+~I
)
(10)'on,k~0,1,2o'"
(The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success 0.) Hint: By Theorem 1.6.1, P,[L:7 1 Xi = kJ = c.(1  o)'on. 0 < 0 < 1. If
=
' " CkW'
I
= c(;'',)::, 0 lw n
k=O
L..J
< W < I, then
4. Which of the following families of distributions are exponential families? (Prove or
disprove.) (a) The U(O, 0) fumily
88
(b) p(.", 0)
Statistical Models, Goals, and Performance Criteria
Chapter 1
(c)p(x,O)
= {exp[2Iog0+log(2x)]}I[XE = ~,xE {O.I +0, ... ,0.9+0j = 2(x +
0)/(1 + 20), 0
(0,0)1
(d) The N(O, 02 ) family, 0 > 0
(e)p(x,O)
< x < 1,0> 0
(f) p(x,9) is the conditional frequency function of a binomial, B(n,O), variable X, given that X > O.
5. Show that the following families of distributions are twoparameter exponential families and identify the functions 1], B, T, and h. (a) The beta family. (b) The gamma family. 6. Let X have the Dirichlet distribution, D( a), of Problem 1.2.15. Show the distribution of X form an rparameter exponential family and identify fJl B, T, and h.
7. Let X = ((XI, Y I ), ... , (X no Y n » be a sample from a bIvariate nonnal population.
Show that the distributions of X form a fiveparameter exponential family and identify 'TJ, B, T, and h.
8. Show that the family of distributions of Example 1.5.3 is not a one parameter eX(Xloential family. Hint: If it were. there would be a set A such that p(x, 0) > on A for all O.
°
9. Prove the analogue of Theorem 1.6.1 for discrete kparameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (XI, X 2) is a sample of size 2 from f(·, 0), then XI + X2 is sufficient for B. Show that f(·, B) corresponds to a onearameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x" X2) such that log f(x" 0) + log f(X2, 0) = g(xI + X2, 0) + h(XI, X2). Fix 00 and let r(x, 0) = log f(x, 0)  log f(x, 00), q(x, 0) = g(x,O)  g(x,Oo). Then, q(xI + X2,0) = r(xI,O) +r(x2,0), and hence, [r(x" 0) r(O, 0)1 + [r(x2, 0)  r(O, 0») = r(xi + X2, 0)  r(O, 0). 11. Use Theorems 1.6.2 and 1.6.3 to obtain momentgenerating functions for the sufficient statistics when sampling from the following distributions. (a) normal, () ~ (ll,a 2 )
(b) gamma. r(p, >.), 0
= >., p fixed
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3)
(0 gamma. r(p, >'). ()
= (p, >.).
 
,

Section 1. 7
Problems and Complements
89
12. Show directly using the definition of the rank of an ex}X)nential family that the multinomialdistribution,M(n;OI, ... ,Ok),O < OJ < 1,1 <j < k,I:~oIOj = 1, is of rank k1. 13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any (k ~ I)dimensional hyperplane. 14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of &. Show that if k = 1 and &0 oJ 0 and A, A are defined on all of &, then Theorem 1.6.3 continues to hold. 15. Let P = {P. : 0 E e} where p. is discrete and concentrated on X = {x" X2, ... }, and let p( x, 0) = p. IX = x I. Show that if P is a (discrete) canonical ex ponential family generated bi, (T, h) and &0 oJ 0, then T is minimal sufficient. Hint: ~;j'Lx('l) = Tj(X)  E'lTj(X). Use Problem 1.5.12.
16. Life testing. Let Xl,.'" X n be independently distributed with exponential density (20)l e x/2. for x > 0, and let the ordered X's be denoted by Y, < Y2 < '" < YnIt is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r aparticles have been emitted. Show that
(i) The joint distribution of Y1 , •.. , Yr is an exponential family with density
n! [ (20), (n _ r)! exp (ii) The distribution of II:: I Y;
(iii) Let
1
I::l Yi + (n 20
r)Yr]
' 0  Y,  ...  Yr·
<
<
<
+ (n 
r)Yrl/O is X2 with 2r degrees of freedom.
denote the time required until the first, second,... event occurs in a Poisson process with parameter 1/20' (see A.I6). Then Z, = YI/O', Z2 = (Y2 Yr)/O', Z3 = (Y3  Y 2)/0', ... are independently distributed as X2 with 2 degrees of freedom, and the joint density of Y1 , ••. , Yr is an exponential family with density
Yi, Yz , ...
The distribution of Yr/B' is again XZ with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burnedout tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
I ,
90
Statistical Models, Goals, and Performance Criteria Chapter 1
1)(Y; l~~l)/e (I = 1", .. ,') are independently distributed as X2 with 2 degrees of freedom, and [L~ 1 Yi + (n  7")Yr]/B = [(ii): The random variables Zi ~ (n  i
+
L::~l Z,.l
17. Suppose that (TkXl' h) generate a canonical exponential family P with parameter k 1Jkxl and E = R . Let
(a) Show that Q is the exponential family generated by IlL T and h exp{ cTT}. where IlL is the projection matrix of Tonto L = {'I : 'I = BO + c). (b) Show that ifP has full rank k and B is of rank I, then Q has full rank l. Hint: If B is of rank I, you may assume
18. Suppose Y1, ... 1 Y n are independent with Yi '" N(131 + {32Zi, (12), where Zl,'" , Zn are covariate values not all equaL (See Example 1.6.6.) Show that the family has rank 3.
Give the mean vector and the variance matrix of T.
19. Logistic Regression. We observe (Zll Y1 ), ... , (zn, Y n ) where the Y1 , .. _ , Y n are independent, Yi "' B(TIi, Ad The success probability Ai depends on the characteristics Zi of the ith subject, for example, on the covariate vector Zi = (age, height, blood pressure)T. The function I(u) ~ log[u/(l  u)] is called the logil function. In the logistic linear re(3 where (3 = ((31, ... ,/3d ) T and Zi is d x 1. gression model it is assumed that I (Ai) = Show that Y = (Y1 , ... , yn)T follow an exponential model with rank d iff Zl, ... , Zd are
zT
not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). 20. (a) In part IT of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by ('11 'Io)TT and that ~(ii) =~(i). (b) Fill in the details of part III of the proof of Theorem 1.6.4. 21. Find JJ.('I) ~ EryT(X) for the gamma,
qa, A), distribution, where e = (a, A).
I
22. Let X I, . _ . ,Xn be a sample from the k·parameter exponential family distribution (1.6.10). Let T = (L:~ 1 1 (Xi ), ... , L:~ 1Tk(X,») and let T
I
S
~
((ryl(O), ... ,ryk(O»): e E 8).
Show that if S contains a subset of k + 1 vectors Vo, .. _, Vk+l so that Vi  Vo, 1 < i are not collinear (linearly independent), then T is minimally sufficient for 8.
< k.
I .' jl,
"
23. Using (1.6.20). find a conjugate family of distributions for the gamma and beta families. (a) With one parameter fixed. (b) With both parameters free.
:
I
Section 1.7
Problems and Complements
91
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter 0 = (O!, O ) where O! = E,(X), 0, ~ l/(Var oX) (cf. Problem 1.2.12). 2 25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with (72 known. Find a conjugate family of prior distributions for (131,132) T. 26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15. 27. Let P denote the canonical exponential family genrated by T and h. For any TJo E £, set ho(x) = q(x, '10) where q is given by (1.6.9). Show that P is also the canonical exponential family generated by T and h o.
28. Exponential/amities are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by h(f)
~ E(logf(X)) =
l:IIOgf(X)I!(X)dx.
This quantity arises naturally in infonnation in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S ~ {x: f(x) > OJ. (a) Show that the canonical kparameter exponential family density
f(x, 'I)
= exp
• ryjrj(x) 1/0 + I:
j:=1
A('I)
, XES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is
f(x)dx
~ 1,
Is
f(x)rj(x)
~ aj,
1 < j < k,
where η0, ..., ηk are chosen so that f satisfies the constraints. Hint: You may use Lagrange multipliers. Maximize the integrand. (A sketch of this computation follows Problem 29.) (b) Find the maximum entropy densities when rj(x) = x^j and (i) S = (0, ∞), k = 1, a1 > 0; (ii) S = R, k = 2, a1 ∈ R, a2 > 0; (iii) S = R, k = 3, a1 ∈ R, a2 > 0, a3 ∈ R.

29. As in Example 1.6.11, suppose that Y1, ..., Yn are i.i.d. Np(μ, Σ), where μ varies freely in R^p and Σ ranges freely over the class of all p × p symmetric positive definite matrices. Show that the distribution of Y = (Y1, ..., Yn) is the p(p + 3)/2 canonical exponential family generated by h = 1 and the p(p + 3)/2 statistics

Tj = Σ_{i=1}^n Yij, 1 ≤ j ≤ p; Tjl = Σ_{i=1}^n Yij Yil, 1 ≤ j ≤ l ≤ p,

where Yi = (Yi1, ..., Yip). Show that E is open and that this family is of rank p(p + 3)/2. Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = p(p + 3)/2 statistics Tj(Y) = Yj, 1 ≤ j ≤ p, and Tjl(Y) = Yj Yl, 1 ≤ j ≤ l ≤ p,
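A sketch (not from the text) of the Lagrange-multiplier computation behind Problem 28(a), maximizing the integrand pointwise:

```latex
% Maximize h(f) subject to \int_S f = 1 and \int_S f r_j = a_j, 1 \le j \le k.
% Form the Lagrangian with multipliers \eta_0 - 1, \eta_1, \dots, \eta_k:
\mathcal{L}(f) = \int_S \Big[-f\log f + (\eta_0 - 1) f
                + \sum_{j=1}^{k} \eta_j r_j f\Big]\,dx .
% Maximize the integrand in f for each fixed x (set the f-derivative to zero):
-\log f(x) - 1 + (\eta_0 - 1) + \sum_{j=1}^{k} \eta_j r_j(x) = 0
\;\Longrightarrow\;
f(x) = \exp\Big\{\eta_0 + \sum_{j=1}^{k} \eta_j r_j(x)\Big\},
% with \eta_0, \dots, \eta_k chosen so that the constraints hold.
```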
generate Np(μ, Σ). As Σ ranges over all p × p symmetric positive definite matrices, so does Σ^{−1}. Next establish that for symmetric matrices M,

∫ exp{−u^T M u} du < ∞ iff M is positive definite,

by using the spectral decomposition (see B.10.1.2)

M = Σ_{j=1}^p λj ej ej^T for e1, ..., ep orthogonal, λj ∈ R.

To show that the family has full rank m, use induction on p to show that if Z1, ..., Zp are i.i.d. N(0, 1) and if B_{p×p} = (bjl) is symmetric, then

P(Σ_{j=1}^p aj Zj + Σ_{j,l} bjl Zj Zl = c) = P(a^T Z + Z^T B Z = c) = 0

unless a = 0, B = 0, c = 0. Next recall (Appendix B.6) that since Y ~ Np(μ, Σ), then Y = SZ + μ for some nonsingular p × p matrix S.
30. Show that if X1, ..., Xn are i.i.d. Np(θ, Σ0) given θ, where Σ0 is known, then the Np(λ, Γ) family is conjugate to Np(θ, Σ0), where λ varies freely in R^p and Γ ranges over all p × p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(μj, τj) : 1 ≤ j ≤ k} be a given collection of pairs with μj ∈ R, τj > 0. Let (μ, τ) be a random pair with λj = P((μ, τ) = (μj, τj)), 0 < λj < 1, Σ_{j=1}^k λj = 1. Let θ be a random variable whose conditional distribution given (μ, τ) = (μj, τj) is normal, N(μj, τj²). Consider the model X = θ + ε, where θ and ε are independent and ε ~ N(0, σ0²), σ0² known. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λj φ_{τj}(θ − μj)   (1.7.4)

where φ_τ denotes the N(0, τ²) density. Also note that (X | θ) has the N(θ, σ0²) distribution.

(a) Find the posterior

π(θ | x) = Σ_{j=1}^k P((μ, τ) = (μj, τj) | x) π(θ | (μj, τj), x)

and write it in the form

Σ_{j=1}^k λj(x) φ_{τj(x)}(θ − μj(x))
for appropriate λj(x), τj(x) and μj(x). This shows that (1.7.4) defines a conjugate prior for the N(θ, σ0²) distribution. (b) Let Xi = θ + εi, 1 ≤ i ≤ n, where θ is as previously and ε1, ..., εn are i.i.d. N(0, σ0²). Find the posterior π(θ | x1, ..., xn), and show that it belongs to the class (1.7.4). Hint: Consider the sufficient statistic for p(x | θ).
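A minimal numerical sketch (not from the text; the mixture components below are hypothetical) of the posterior computation in Problem 31(a). It uses the standard normal-normal updates: the marginal of X given component j is N(μj, τj² + σ0²), and π(θ | (μj, τj), x) is normal with variance τj²(x) = τj²σ0²/(τj² + σ0²) and mean μj(x) = (τj²x + σ0²μj)/(τj² + σ0²):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical mixture prior: components (mu_j, tau_j) with weights lambda_j.
mu = np.array([-1.0, 0.0, 2.0])
tau = np.array([0.5, 1.0, 1.5])
lam = np.array([0.3, 0.4, 0.3])
sigma0 = 1.0
x = 0.8  # observed X

# Posterior weights: lambda_j(x) is proportional to
# lambda_j * N(x; mu_j, tau_j^2 + sigma0^2).
w = lam * norm.pdf(x, loc=mu, scale=np.sqrt(tau**2 + sigma0**2))
lam_x = w / w.sum()

# Conditional posteriors pi(theta | (mu_j, tau_j), x) are N(mu_j(x), tau_j(x)^2).
mu_x = (tau**2 * x + sigma0**2 * mu) / (tau**2 + sigma0**2)
tau_x = np.sqrt(tau**2 * sigma0**2 / (tau**2 + sigma0**2))

# pi(theta | x) = sum_j lam_x[j] * N(mu_x[j], tau_x[j]^2): again of form (1.7.4).
print(lam_x, mu_x, tau_x)
```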
32. A Hierarchical Binomial-Beta Model. Let {(rj, sj) : 1 ≤ j ≤ k} be a given collection of pairs with rj > 0, sj > 0, let (R, S) be a random pair with P(R = rj, S = sj) = λj, 0 < λj < 1, Σ_{j=1}^k λj = 1, and let θ be a random variable whose conditional density π(θ, r, s) given R = r, S = s is beta, β(r, s). Consider the model in which (X | θ) has the binomial, B(n, θ), distribution. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λj π(θ, rj, sj).   (1.7.5)

Find the posterior

π(θ | x) = Σ_{j=1}^k P(R = rj, S = sj | x) π(θ | (rj, sj), x)

and show that it can be written in the form Σ λj(x) π(θ, rj(x), sj(x)) for appropriate λj(x), rj(x) and sj(x). This shows that (1.7.5) defines a class of conjugate priors for the B(n, θ) distribution.
33. Let p(x, η) be a one-parameter canonical exponential family generated by T(x) = x and h(x), x ∈ X ⊂ R, and let ψ(x) be a nonconstant, nondecreasing function. Show that E_η ψ(X) is strictly increasing in η. Hint:

Cov_η(ψ(X), X) = ½ E{(X − X′)[ψ(X) − ψ(X′)]}

where X and X′ are independent, identically distributed as X (see A.11.12).
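The hint's identity can be verified directly by a standard symmetrization argument, sketched here:

```latex
% For X, X' i.i.d., expand the symmetrized product:
E\{(X - X')[\psi(X) - \psi(X')]\}
 = E[X\psi(X)] - E[X\psi(X')] - E[X'\psi(X)] + E[X'\psi(X')]
 = 2\,E[X\psi(X)] - 2\,E[X]\,E[\psi(X)]
 = 2\,\mathrm{Cov}(\psi(X), X).
% Since \psi is nondecreasing, (X - X')[\psi(X) - \psi(X')] \ge 0, so
% \mathrm{Cov}_\eta(\psi(X), X) \ge 0, with strict inequality when \psi is
% nonconstant on the support.
```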
34. Let (X1, ..., Xn) be a stationary Markov chain with two states 0 and 1. That is,

P[Xi = εi | X1 = ε1, ..., X_{i−1} = ε_{i−1}] = P[Xi = εi | X_{i−1} = ε_{i−1}] = p_{ε_{i−1} εi},

where

( p00  p01 )
( p10  p11 )

is the matrix of transition probabilities. Suppose further that

(i) p00 = p11 = p, so that p10 = p01 = 1 − p;

(ii) P[X1 = 0] = P[X1 = 1] = ½.
(a) Show that if 0 < p < 1 is unknown this is a full rank, one-parameter exponential family with T = N00 + N11, where Nij = the number of transitions from i to j. For example, 01011 has N01 = 2, N11 = 1, N00 = 0, N10 = 1.

(b) Show that E(T) = (n − 1)p (by the method of indicators or otherwise).
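A quick simulation sketch (not from the text) checking part (b) numerically:

```python
import numpy as np

def mean_T(n, p, reps=200_000, seed=0):
    """Simulate T = N00 + N11 for the two-state chain of Problem 34.
    Since p00 = p11 = p, each of the n-1 transitions keeps the current
    state with probability p, regardless of that state and of X1; so T
    counts the 'stays' and is Binomial(n-1, p)."""
    rng = np.random.default_rng(seed)
    stays = rng.random(size=(reps, n - 1)) < p
    return stays.sum(axis=1).mean()

n, p = 10, 0.3
print(mean_T(n, p), (n - 1) * p)  # empirical E(T) vs. (n-1)p
```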
35. A Conjugate Prior for the Two-Sample Problem. Suppose that X1, ..., Xn and Y1, ..., Yn are independent N(μ1, σ²) and N(μ2, σ²) samples, respectively. Consider the prior π for which for some r > 0, k > 0, rσ^{−2} has a χ²_k distribution and, given σ², μ1 and μ2 are independent with N(ξ1, σ²/k1) and N(ξ2, σ²/k2) distributions, respectively, where ξj ∈ R, kj > 0, j = 1, 2. Show that π is a conjugate prior.
36. The inverse Gaussian density, IG(μ, λ), is

f(x, μ, λ) = [λ/2π]^{1/2} x^{−3/2} exp{−λ(x − μ)²/2μ²x}, x > 0, μ > 0, λ > 0.

(a) Show that this is an exponential family generated by T(X) = −½(X, X^{−1})^T and h(x) = (2π)^{−1/2} x^{−3/2}.

(b) Show that the canonical parameters η1, η2 are given by η1 = μ^{−2}λ, η2 = λ, and that A(η1, η2) = −[½ log(η2) + √(η1η2)], E = [0, ∞) × (0, ∞). (The algebra is sketched after Problem 37.)

(c) Find the moment-generating function of T and show that E(X) = μ, Var(X) = μ³λ^{−1}, E(X^{−1}) = μ^{−1} + λ^{−1}, Var(X^{−1}) = (λμ)^{−1} + 2λ^{−2}.

(d) Suppose μ = μ0 is known. Show that the gamma family, Γ(α, β), is a conjugate prior.

(e) Suppose that λ = λ0 is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to μ. That is, Ω defined in (1.6.19) is empty.

(f) Suppose that μ and λ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, Ω defined in (1.6.19) is empty.

37. Let X1, ..., Xn be i.i.d. as X ~ Np(θ, Σ0), where Σ0 is known. Show that the conjugate prior generated by (1.6.20) is the Np(η0, τ0² I) family, where η0 varies freely in R^p, τ0² > 0, and I is the p × p identity matrix.
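A sketch of the algebra behind Problem 36(a)-(b): expanding the exponent of the IG(μ, λ) density puts it in canonical form.

```latex
% Expand the exponent of f(x,\mu,\lambda):
-\frac{\lambda(x-\mu)^2}{2\mu^2 x}
 = -\frac{\lambda}{2\mu^2}\,x \;-\; \frac{\lambda}{2}\,x^{-1} \;+\; \frac{\lambda}{\mu}.
% With T(x) = -\tfrac12 (x, x^{-1})^T, \eta_1 = \lambda\mu^{-2}, \eta_2 = \lambda,
% and \lambda/\mu = \sqrt{\eta_1\eta_2}, \sqrt{\lambda} = e^{\frac12\log\eta_2}:
f(x,\mu,\lambda) = h(x)\,
  \exp\Big\{-\tfrac{\eta_1}{2} x - \tfrac{\eta_2}{2} x^{-1}
            + \sqrt{\eta_1\eta_2} + \tfrac12\log\eta_2\Big\},
\quad h(x) = (2\pi)^{-1/2} x^{-3/2},
% identifying A(\eta_1,\eta_2) = -[\tfrac12\log\eta_2 + \sqrt{\eta_1\eta_2}\,].
```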
38. Let Xi = (Zi, Yi)^T be i.i.d. as X = (Z, Y)^T, 1 ≤ i ≤ n, where X has the density of Example 1.6.3. Write the density of X1, ..., Xn as a canonical exponential family and identify T, h, A, and E. Find the expected value and variance of the sufficient statistic.
39. Suppose that Y1, ..., Yn are independent, Yi ~ N(μi, σ²), n ≥ 4.
(a) Write the distribution of Y1, ..., Yn in canonical exponential family form. Identify T, h, η, A, and E.

(b) Next suppose that μi depends on the value zi of some covariate and consider the submodel defined by the map η : (θ1, θ2, θ3)^T → (μ^T, σ²)^T, where η is determined by

μi = exp{θ1 + θ2 zi}, z1 < z2 < ... < zn; σ² = θ3,
where θ1 ∈ R, θ2 ∈ R, θ3 > 0. This model is sometimes used when μi is restricted to be positive. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 3.
40. Suppose Y1, ..., Yn are independent exponentially, E(λi), distributed survival times, n ≥ 3.

(a) Write the distribution of Y1, ..., Yn in canonical exponential family form. Identify T, h, η, A, and E.

(b) Recall that μi = E(Yi) = λi^{−1}. Suppose μi depends on the value zi of a covariate. Because μi > 0, μi is sometimes modeled as

μi = exp{θ1 + θ2 zi}, i = 1, ..., n,

where not all the z's are equal. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 2. (A numerical rank check is sketched below.)
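A numerical sketch (illustrative; the covariate values are hypothetical) of the rank computation behind Problem 40(b): the parameter map θ → η(θ) = (−λ1, ..., −λn), with λi = exp{−θ1 − θ2 zi}, has a Jacobian of rank 2 exactly when the zi are not all equal.

```python
import numpy as np

z = np.array([0.5, 1.0, 1.5, 2.0])   # hypothetical covariates, not all equal
theta = np.array([0.2, -0.4])

# eta_i(theta) = -lambda_i = -exp{-(theta1 + theta2 * z_i)}, since mu_i = 1/lambda_i.
lam = np.exp(-(theta[0] + theta[1] * z))

# Analytic Jacobian d(eta)/d(theta), an n x 2 matrix:
# d eta_i / d theta1 = lam_i, d eta_i / d theta2 = lam_i * z_i.
J = np.column_stack([lam, lam * z])
print(np.linalg.matrix_rank(J))       # 2 when the z_i are not all equal

# With a constant covariate column, the two columns are proportional: rank 1.
J_flat = np.column_stack([lam, lam * np.full_like(z, z[0])])
print(np.linalg.matrix_rank(J_flat))
```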
1.8
NOTES
Note for Section 1.1

(1) For the measure theoretically minded we can assume more generally that the P_θ are all dominated by a σ-finite measure μ and that p(x, θ) denotes dP_θ/dμ, the Radon-Nikodym derivative.
Notes for Section 1.3

(1) More natural in the sense of measuring the Euclidean distance between the estimate θ̂ and the "truth" θ. Squared error gives much more weight to those θ̂ that are far away from θ than those close to θ.

(2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r.
Note for Section 1.4

(1) Source: Hodges, Jr., J. L., D. Krech, and R. S. Crutchfield, StatLab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975.
Notes for Section 1.6

(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particles; see Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory; see Cover and Thomas (1991).

(2) The restriction that x ∈ R^q and that these families be discrete or continuous is artificial. In general, if μ is a σ-finite measure on the sample space X, p(x, θ) as given by (1.6.1)
can be taken to be the density of X with respect to μ; see Lehmann (1997). This permits consideration of data such as images, positions, and spheres (e.g., the Earth), and so on.

Note for Section 1.7

(1) That is, u^T M u > 0 for all p × 1 vectors u ≠ 0.

1.9 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.

BERMAN, S. M., "A Stochastic Model for the Distribution of HIV Latency Time Based on T4 Counts," Biometrika, 77, 733-741 (1990).

BICKEL, P. J., "Using Residuals Robustly I: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist., 6, 266-291 (1978).

BLACKWELL, D., AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York: Wiley, 1954.

BOX, G. E. P., "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion)," J. Royal Statist. Soc. A, 143, 383-430 (1979).

BROWN, L., Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, IMS Lecture Notes-Monograph Series, Hayward, CA, 1986.

CARROLL, R. J., AND D. RUPPERT, Transformation and Weighting in Regression New York: Chapman and Hall, 1988.

COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.

DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1969.

DOKSUM, K. A., AND A. SAMAROV, "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression," Ann. Statist., 23, 1443-1473 (1995).

FERGUSON, T. S., Mathematical Statistics New York: Academic Press, 1967.

FEYNMAN, R. P., R. B. LEIGHTON, AND M. SANDS, The Feynman Lectures on Physics, Vol. 1, Ch. 40, "Statistical Mechanics of Physics," Reading, MA: Addison-Wesley, 1963.

GRENANDER, U., AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York: Wiley, 1957.

HODGES, J. L., JR., D. KRECH, AND R. S. CRUTCHFIELD, StatLab: An Empirical Introduction to Statistics New York: McGraw-Hill, 1975.

KENDALL, M. G., AND A. STUART, The Advanced Theory of Statistics, Vols. I, II, and III New York: Hafner Publishing Co., 1961, 1966.

LEHMANN, E. L., "A Theory of Some Multiple Decision Problems, I and II," Ann. Math. Statist., 28, 1-25, 547-572 (1957).

LEHMANN, E. L., "Model Specification: The Views of Fisher and Neyman, and Later Developments," Statist. Science, 5, 160-168 (1990).

LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.

MANDEL, J., The Statistical Analysis of Experimental Data New York: J. Wiley & Sons, 1964.

NORMAND, S-L., AND K. A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Lecture Notes in Statistics, Editors: S. E. Ahmed and N. Reid New York: Springer, 2000.

PEARSON, K., "On the General Theory of Skew Correlation and Non-linear Regression," Proc. Roy. Soc. London, 71, 303 (1905). (Draper's Research Memoirs, Biometrics Series II.)

RAIFFA, H., AND R. SCHLAIFFER, Applied Statistical Decision Theory Boston: Division of Research, Graduate School of Business Administration, Harvard University, 1961.

SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.

SAVAGE, L. J., ET AL., The Foundations of Statistical Inference London: Methuen & Co., 1962.

SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989.

WETHERILL, G. B., AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
Chapter 2

METHODS OF ESTIMATION

2.1 BASIC HEURISTICS OF ESTIMATION

2.1.1 Minimum Contrast Estimates; Estimating Equations

Our basic framework is as before, X ∈ X, X ~ P ∈ P, usually parametrized as P = {P_θ : θ ∈ Θ}. In this parametric case, how do we select reasonable estimates for θ itself? That is, how do we find a function θ̂(X) of the vector observation X that in some sense "is close" to the unknown θ? The fundamental heuristic is typically the following. We consider a function that we shall call a contrast function

ρ : X × Θ → R

and define

D(θ0, θ) ≡ E_{θ0} ρ(X, θ).

As a function of θ, D(θ0, θ) measures the (population) discrepancy between θ and the true value θ0 of the parameter. In order for ρ to be a contrast function we require that D(θ0, θ) be uniquely minimized for θ = θ0. That is, if P_{θ0} were true and we knew D(θ0, θ) as a function of θ, we could obtain θ0 as the minimizer. Of course, we don't know the truth, so this is inoperable, but in a very weak sense (unbiasedness), ρ(X, θ) is an estimate of D(θ0, θ). So it is natural to consider θ̂(X) minimizing ρ(X, θ). This is the most general form of the minimum contrast estimate we shall consider in the next section.

Now suppose Θ is Euclidean ⊂ R^d, the true θ0 is an interior point of Θ, and θ → D(θ0, θ) is smooth. Then we expect

∇_θ D(θ0, θ) |_{θ = θ0} = 0   (2.1.1)

where ∇ denotes the gradient. Arguing heuristically again, we are led to estimates θ̂ that solve

∇_θ ρ(X, θ̂) = 0.   (2.1.2)

The equations (2.1.2) define a special form of estimating equations.
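As a quick numerical illustration (a sketch, not from the text; the Gaussian location contrast is chosen for simplicity), minimizing the squared-error contrast and solving the corresponding estimating equation (2.1.2) reproduces the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Minimum contrast estimation in a one-parameter location model.
# Contrast: rho(X, theta) = sum_i (X_i - theta)^2, whose population
# discrepancy D(theta0, theta) is uniquely minimized at theta = theta0.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)

rho = lambda theta: np.sum((X - theta) ** 2)

# theta_hat solves the estimating equation d/dtheta rho(X, theta) = 0,
# i.e., -2 * sum_i (X_i - theta) = 0, whose root is the sample mean.
theta_hat = minimize_scalar(rho).x
assert np.isclose(theta_hat, X.mean(), atol=1e-6)
print(theta_hat, X.mean())
```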
In fact. Method a/Moments (MOM).. 0 Example 2. To apply the method of moments to the problem of estimating 9. . . This very important example is pursued further in Section 2. I'd of X.Section 2. The method of moments prescribes that we estimate 9 by the solution of p.9) where Zv IIZijllnxd is the design matrix. 8 E R d and 8 is identifiable. Here is another basic estimating equation example. = 1'. N(O.1 Basic Heuristics of Estimation 101 the system becomes (2.Xi t=l .(8)..8) the normal equations. .. as X ~ P8. More generally. we need to be able to express 9 as a continuous function 9 of the first d moments. once defined we have a method of computing a statistic fj from the data X = {(Zi 1 Xi).1.. Suppose Xl. say q(8) = h(I'I. The motivation of this simplest estimating equation example is the law of large numbers: For X "' Po.2.i. We return to the remark that this estimating method is well defined even if the Ci are not i. . Thus. 1 1 < j < d. I'd( 8) are the first d moments of the population we are sampling from.1. . These equations are commonly written in matrix fonn (2.1 <j < d if it exists. and then using h(p!. we obtain a MOM estimate of q( 8) by expressing q( 8) as a function of any of the first d moments 1'1.d. which can be judged on its merits whatever the true P governing X is.. I'd). Thus. . .2 and Chapter = 6.. d > k. thus. Least squares.Xn are i. suppose is 1 . ~ . lid) as the estimate of q( 8)..  n n. /1j converges in probability to flj(fJ).d.i. 1 1j ~_l". Suppose that 1'1 (8). . we assume the existence of Define the jth sample moment f1j by.. provides a first example of both minimum contrast and estimating equation methods. . 1 < i < n}. u5). L. .1 from R d to Rd. if we want to estimate a Rkvalued function q(8) of 9.1. ...
. Vi = i. Suppose we observe multinomial trials in which the values VI.6.Pk are completely unknown.2 The PlugIn and Extension Principles We can view the method of moments as an example of what we call the plugin (or substitution) and extension principles. then setting (2.1 0) .X 2 • In this example. . Here k = 5.102 Methods of Estimation Chapter 2 For instance.. A>O. . Pi is the proportion of men in the population in the ith job category and Njn is the sample proportion in this category.3. . X n be Ltd..10. ~ iii ~ (X/a)'..)]xO1exp{Ax}. I.1. An algorithm for estimating equations frequently used when computationofM(X. for instance. A = X/a' where a 2 = fl. Example 2. In this case f) ~ ('" A). . 2. case. as X and Ni = number of indices j such that X j = Vi. .(1 + ")/A 2 Solving .i. This algorithm and others will be discussed more extensively in Section 2. 4. then the natural estimate of Pi = P[X = Vi] suggested by the law of large numbers is Njn. There are many algorithms for optimization and root finding that can be employed.4 and in Chapter 6. or 5. and 1" ~ E(X') = . i = 1.d. . If we let Xl...•. two other basic heuristics particularly applicable in the i. the method of moment estimator is not unique. consider a study in which the survival time X is modeled to have a gamma distribution. A = Jl1/a 2 . We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments. It is defined by initializing with eo. . . 1968).5. 3. .. f(u.1. in general.. express () as a function of IJ" and fl3 = E(X 3 ) and obtain a method of moment estimator based on /11 and fi3 (Problem 2. x>O.>0.~ (I'l/a)'. . with density [A"/r(. We can. As an illustration consider a population of men whose occupations fall in one of five different job categories. neither minimum contrast estimates nor estimating equation solutions can be obtained in closed fonn. . but their respective probabilities PI. the proportion of sample values equal to Vi.. Frequency Plugin(2) and Extension.1. l Vk of the population being sampled are known. .(X")1 dxd J isquickandM is nonsingular with high probability is the NewtonRaphson algorithm.. in particular Problem 6.·) fli =D'I'(X.·)  l~t. I. '\). 2. 1'1 for () gives = E(X) ~ "/ A.11). Here is some job category data (Mosteller.2 and 0=2 = n 1 EX1. 0 Algorithmic issues We note that.
1 < think of this model as P = {all probability distributions Pan {VI. use (2. . Suppose .03 2 84 0.Pk by the observable sample frequencies Nt/n. the multinomial empirical distribution of Xl.. and Then q(p) (p. That is.. If we use the frequency substitution principle. in v(P) by 0 P ). then (NIl N 2 . Next consider the marc general problem of estimating a continuous function q(Pl' . v(P) = (P4 + Ps) and the frequency plugin principle simply says to replace P = (PI.13 n2::. .0)..P2." .1. the difference in the proportions of bluecollar and whitecollar workers. N 3 ) has a multinomial distribution with parameters (n. categories 4 and 5 correspond to bluecollar jobs. . . For instance. ..1.Ni =708 2:::o. lPk). Now suppose that the proportions PI' .Pk) with Pi ~ PIX = 'Vi]. P2 ~ 20(1. let P dennte p ~ (P". suppose that in the previous job category table... i < k.. Pk) of the population proportions. .Xn . whereas categories 2 and 3 correspond to whitecollar jobs. + P3). . v.12).11) to estimate q(Pll··. Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles.12) If N i is the number of individuals of type i in the sample of size n.12 3 289 0..Pk) = (~l . 1 ()d) and that we want to estimate a component of 8 or more generally a function q(8).4. The frequency plugin principle simply proposes to replace the unknown population frequencies PI.31 5 95 0.41 4 217 0..Ps) = (P4 + Ps) . that is..pl .. Example 2. Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type. the estimate is which in our case is 0. P3 PI = 0 = (1..(P2 + P3). Equivalently.. 1~!. P3) given by (2. HardyWeinberg Equilibrium.53 = 0.1 Basic Heuristics of Estimation 103 Job Category l I Ni Pi ~ 23 0. . together with the estimates Pi = NiJn.»i=1 for Danish men whose fathers were in category 3. .. . . Pk do not vary freely but are continuous functions of some ddimensional parameter 8 = ((h. .1. ..09.Section 2.. .}}. We would be interested in estimating q(P"". (2. we are led to suppose that there are three types of individuals whose frequencies are given by the socalled HardyWeinberg proportions 2 . ..0 < 0 < 1.1.44 . can be identified with a parameter v : P ~ R. If we assume the three different genotypes are identifiable.0)2. . 1 Nk/n.0.
E A) n. then v(P) is the plugin estimate of v.Pk. . as X p.Xn ) =h N.14) As we saw in the HardyWeinberg case.. the representation (2. suppose that we want to estimate a continuous Rlvalued function q of e... that is. the frequency of one of the alleles. P > R where v(P) . F(x) > a}. We can think of the extension principle alternatively as follows.d. We shall consider in Chapters 3 (Example 3.1. The plugin and extension principles can be abstractly stated as follows: Plugin principle. in the Li..14) are not unique. Po ~ R given by v(PO) = q(O). If PI. T(Xl. however. Note.1. .4) and 5 how to choose among such estimates. that we can also Nsfn is also a plausible estimate of O. Nk) . Because f) = ~.4.1.1. (2. (2.d.1. ..Pk are continuous functions of 8.. . case if P is the space of all distributions of X and Xl. (2. 1  V In general.'" .. I: t=l (2. consider va(P) = [Fl (a) + Fu 1(a )1. we can use the principle we have introduced and estimate by J N l In.1. if X is real and F(x) = P(X < x) is the distribution function (dJ. F(x) < a}. Let be a submodel of P.13) defines an extension of v from Po to P via v. we can usually express q(8) as a continuous function of PI.13) and estimate (2. If w.h(p) and v(P) = v(P) for P E Po.Pk(IJ»). X n are Ll. . by the law of large numbers.). . 0 write 0 = 1 .E have an estimate P of PEP such that PEP and v : P 4 T is a parameter.. a natural estimate of P and v(P) is a plugin estimate of v(P) in this nonparametric context For instance.104 Methods of Estimation Chapter 2 we want to estimate fJ. . Fu'(a) = sup{x. ..13) Given h we can apply the extension principle to estimate q( 8) as. where a E (0. In particular.. ( . I) and ! F'(a) " = inf{x. the empirical distribution P of X given by    f'V 1 n P[X E Al = I(X.. . "·1 is. Then (2. . ./P3 and.1. .15) ~ ". thus. q(lJ) with h defined and continuous on = h(p. Now q(lJ) can be identified if IJ is identifiable by a parameter v . .(IJ).16) . .
' Nl = h N () ~ ~ = v(P) and P is the empirical distribution. context is the jth sample moment v(P) = xjdF(x) ~ 0.1. if X is real and P is the class of distributions with EIXlj < 00. ~ I ~ ~ Extension principle. . because when P is not symmetric. is a continuous map from = [0. = Lvi n i= 1 k .) h(p) ~ L vip.. As stated. The plugin and extension principles must be calibrated with the target parameter.i. then the plugin estimate of the jth moment v(P) = f.Section 2. Remark 2. 1/.1. However. i=1 k /1j ~ = ~ LXI 1=1 I n .13). let Po be the class of distributions of X = B + E where B E R and the distribution of E ranges over the class of symmetric distributions with mean zero. is called the sample median. ~ For a second example.4. The plugin and extension principles are used when Pe. For instance. II to P. A natural estimate is the ath sample quantile . = v(P). In this case both VI(P) = Ep(X) and V2(P) = "median of P" satisfy v(P) = v(P).1.3 and 2. v(P8 ) = q(8) = h(p(8)) is a continuous map from to Rand D( P) = h(p) is a continuous map from P to R.17) where F is the empirical dJ. For instance.1. but only VI(P) = X is a sensible estimate of v(P). This reasoning extends to the general i. Let viP) be the mean of X and let P be the class of distributions of X = 0 + < where B E R and the distribution of E ranges over the class of distributions with mean zero. these principles are general. (P) is called the population (2. they are mainly applied in the i. then Va.1. casebut see Problem 2.1 Basic Heuristics of Estimation 105 V 1. P 'Ie Po.tj = E(Xj) in this nonparametric .d. ~ ~ .1.1. P E Po..14. Suppose Po is a submodel of P and P is an element of P but not necessarily Po and suppose v: Po ~ T is a p~eter. case (Problem 2. Here x!.12) and to more general method of moment estimates (Problem 2.1.i. If v: P ~ T is an extension of v in the sense that v(P) = viP) on Po. in the multinomial examples 2. and v are continuous. (P) is the ath population quantile Xa. the sample median V2(P) does not converge in probability to Ep(X).d. Here x ~ = median.2. With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plugin estimates for multinomial trials because I'j(8) where =L i=l k vfPi(8) = h(p(8» = viP.1. Pe as given by the HardyWeinberg p(O).1 I:~ 1 Xl. then v(P) is an extension (and plugin) estimate of viP). e e Remark 2.
X. (X.2. . therefore.2 with assumptions (1)(4) holding. Because we are dealing with (unrestricted) Bernoulli trials. . they are often difficult to compute. we find I' = E. these are the frequency plugin (substitution) estimates (see Problem 2. 'I I[ i['l i .1. Remark 2. 0 = 21' . . a 2 ) sample as in Example 1. those obtained by the method of maximum likelihood.8) we are led by the first moment to the estimate.LI (8) = (J the method of moments leads to the natural estimate of 8. for large amounts of data. minimax. What are the good points of the method of moments and frequency plugin? (a) They generally lead to procedures that are easy to compute and are. Discussion. these estimates are likely to be close to the value estimated (consistency). Plugin is not the optimal way to go for the Bayes. This minimal property is discussed in Section 5. as we shall see in Section 2.6. or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3.) = ~ (0 + I). For example. The method of moments can lead to either the sample mean or the sample variance. . If the model fits.. . Suppose X I. X n is a N(f1. the frequency of successes. When we consider optimality principles. ".7.1.5. X n are the indicators of a set of Bernoulli trials with probability of success fJ. O}.106 Methods of Estimation Chapter 2 Here are three further simple examples illustrating reasonable and unreasonable MOM estimates.1. where Po is n. because Po = P(X = 0) = exp{ O}. if we are sampling from a Poisson population with parameter B. For instance. optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions.d. We will make a selection among such procedures in Chapter 3. Because f. .X). )" I I • . Estimating the Size of a Population (continued). (b) If the sample size is large. a frequency plugin estimate of 0 is Iogpo. X). as we shall see in Chapter 3. Suppose that Xl. a special type of minimum contrast and estimating equation method. I . and there are best extensions.3. there are often several method of moments estimates for the same q(9). U {I.4.1 because in this model B is always at least as large as X (n)' 0 As we have seen. a saving grace becomes apparent in Chapters 5 and 6. However. It does turn out that there are "best" frequency plugin estimates.3 where X" . Example 2. estimation of B real with quadratic loss and Bayes priors lead to procedures that are data weighted averages of (J values rather than minimizers of functions p( (J. Moreover..4. To estimate the population variance B( 1 . = 0].1.1 is a method of moments estimate of B. the plugin principle is justified.1. X(l .. then B is both the population mean and the population variance. valuable as preliminary estimates in algorithms that search for more efficient estimates.I and 2X . •i . See Section 2.. Algorithms for their computation will be introduced in Section 2.. The method of moments estimates of f1 and a 2 are X and 2 0 a. 0 Example 2. In Example 1. Example 2.. • . . . 2. we may arrive at different types of estimates than those discussed in this section.4. Unfortunately. C' X n are ij. [ [ .l [iX. This is clearly a foolish estimate if X (n) = max Xi> 2X . .Ll).5. Thus. " .
.(3) ~ L[li .. z) = ZT{3. When P is the empirical probability distribution P E defined by Pe(A) ~ n. ~ ~ ~ ~ 2.. Suppose X . Method of moment estimates are empirical PIEs based on v(P) = (f.2 Minimum Contrast Estimates and Estimating Equations 107 Summary..) = q(O) and call v(P) a plugin estimator of q(6). For the model {Pe : () E e} a contrast p is a function from X x H to R such that the discrepancy D(lJo. a least squares estimate of {3 is a minimizer of p(X. In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6. 6). then is called the empirical PIE. If P is an estimate of P with PEP. Zi). In the multinomial case the frequency plugin estimators are empirical PIEs based on v(P) = (Ph . . P E Po.2. ~ ~ ~ ~ ~ v we find a parameter v such that v(p. P.2 2.Zi)j2. f. If P = PO.lj = E( XJ\ 1 < j < d. . Let Po and P be two statistical models for X with Po c P. where ZD = IIziJ·llnxd is called the design matrix. For this contrast.Section 2. 0).ll.Pk). where Pj is the probability of the jth category. and the contrast estimating equations are V Op(X.. I < i < n. the associated estimating equations are called the normal equations and are given by Z1 Y = ZtZD{3. li) : I < i < n} with li independent and E(Yi ) = g((3.1 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS Least Squares and Weighted Least Squares Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters. where 9 is a known function and {3 E Rd is a vector of unknown regression coefficients.lJ) ~ E'o"p(X. D(P) is called the extensionplug·in estimate of v(P). For data {(Zi. when g({3..g((3. A minimum contrast estimator is a minimizer of p(X. The general principles are shown to be related to each other. . is parametric and a vector q( 8) is to be estimated. An extension ii of v from Po to P is a parameter satisfying v(P) = v(P)..O) = O. 1 < j < k..1 L:~ 1 I[X i E AI.. 0 E e c Rd is uniquely minimized at the true value 8 = 8 0 of the parameter. It is of great importance in many areas of statistics such as the analysis of variance and regression theory. The plugin estimate (PIE) for a vector parameter v = v(P) is obtained by setting fJ = v( P) where P is an estimate of P. . 0 E e.ld)T where f.
we can compute still Dp(l3o.).E:::'::'..4}{2. • .En) is any distribution satisfying the GaussMarkov assumptions.1. .zn)f is 11. I<i<n.2. j. . Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ. the model is semiparametric with {3.g(l3.) + <" I < i < n n (2.2.g(l3. .d.. Y) ~ PEP = {Alljnintdistributions of(Z.2. as in (a) of Example 1. Z)f. 1 < i < n. that is. Z could be educationallevel and Y income. z. NCO.. a 2 and H unknown.C=ha:::p::.. (Z" Y.) = E(Y..g(l3.)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3.1.13) n n i=1  L Varp«.Y) such thatE(Y I Z = z) = g(j3.i. I) are i. 13 = I3(P) is the miniml2er of E(Y .2).6) is difficult Sometimes z can be viewed as the realization of a population variable Z. Note that the joint distribution H of (C}. and 13 ~ (g(l3.2. z.::. This follows "I .3) continues to be valid. This is frequently the case for studies in the social and biological sciences. (3)  E p p(X. .g(l3.) i. j. 13) = LIY. . 1 < j < n.z.4 with <j simply defined as Y. (2. = 0. . That is.=.m::a.) = g(l3. E«. i=l (2. .d.5) (2.. .. 13 E R d }.4) (2. " (2.1 we considered the nonlinear (and linear) Gaussian model Po given by Y.2.2.::'hc. . < i < n. . as (Z.. For instance. I' . which satisfies (but is not fully specified by) (2. that is.:0::d=':Of.) + L[g(l3 o. The least squares method of estimation applies only to the parameters {31' . then 13 has an interpretation as a parameter on p.) =0..1. aJ) and (3 ranges over R d or an open subset. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj. 1_' .z. . is modeled as a sample from a joint distribution. .) = 0.»)2..:1.z.g(l3.. The estimates continue to be reasonable under the GaussMarkov assumptions.108_~ ~_ _~ ~ ~cMc. where Ci = g(l3.2. 1 < i < j < n 1 > 0\ ..(z.z).3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3.i. fj) because (2.) . . Zn»)T is 1.2.o=nC. (Zi. I Zj = Zj).6) Var( €i) = u 2 COV(fi.2.).::2 : In Example 2. The contrast p(X.zt}.:. " . z. If we consider this model.2. E«.2) Are least squares estimates still reasonable? For P in the semiparametric model P. Yi).f3d and is often applied in situations in which specification of the model beyond (2.E(Y.
.2.2.2.4. Yi) are a sample from a (d + 1)dimensional distribution and the covariates that are the coordinates of Z are continuous. J for Zo an interior point of the domain.zo). Yi).7) (2.41.1).Section 2. As we noted in Example 2.. and Seber and Wild (1989). we can approximate p(z) by p(z) ~ p(zo) +L j=1 d a: a (zo)(z . Zd.5. (2. {3d). which. is called the linear (multiple) regression modeL For the data {(Zi..I to each of the n pairs (Zil Yi).1. = py .1. i = 1. y)T has a nondegenerate multivariate Gaussian distribution Nd+1(Il.4.2.2.8) d f30 = («(3" .. E1=1 g~ (zo)zo as an + 1) dimensional d p(z) = L(3jZj. We continue our discussion for this important special case for which explicit fonnulae and theory have been derived.. a further modeling step is often taken and it is assumed that {Zll . j=1 (2. we are in a situation in which it is plausible to assume that (Zi.1. z) = zT (3. Fan and Gijbels (1996). For nonlinear cases we can use numerical methods to solve the estimating equations (2.2 Minimum Contrast Estimates and Estimating Equations 109 from Theorem 1. in conjnnction with (2. We can then treat p(zo)  nnknown (30 and identify (zo) with (3j to give an approximate (d 1 and Zj as before and linear model with Zo = t!.2. . E(Y I Z = z) can be written as d p(z) where = (30 + L(3jZj j=1 (2. In this case we recognize the LSE {3 as simply being the usual plugin estimate (3(P). In that case. and Volnme n. (2. see Rnppert and Wand (1994).L{3jPj.. j=O This type of approximation is the basis for nonlinear regression analysis based on local polynomials. Sections 6. 2:). . where P is the empirical distribution assigning mass n. The linear model is often the default model for a number of reasons: (I) If the range of the z's is relatively small and p(z) is smooth. n} we write this model in matrix fonn as Y = ZD(3 + €    where Zv = IIZij 11 is the design matrix.5) and (2..2.2. I)/. . (2) If as we discussed earlier.9) .3 and 6.6). See also Problem 2.1 the most commonly used 9 in these models is g({3. as we have seen in Section lA.4).7).
Nine samples of soil were treated with different amounts Z of phosphorus. ~ = (lin) L:~ 1Yi = ii.2.I 'i. il.2. ( (J2 2 ) distribution where = ayy Therefore.10) Here are some examples.) = 1 and the normal equation is L:~ 1(Yi . Example 2. 1 64 4 71 5 54 9 81 11 76 13 23 77 93 23 95 28 109 The points (Zi' Yi) and an estimate of the line 131 + 132z are plotted in Figure 2.2. d = 1. LZi(Yi . L 1 that. necessarily. For certain chemicals and plants. The parametrization is identifiable if and only ifZ D is of rank d or equivalently if Z'bZD is affulI rank d.no Furthennore. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil. Zi 1'. we assume that for a given z. see Problem 2. 1967. If we run several experiments with the same z using plants and soils that are as nearly identical as possible.il. is independent of Z and has a N(O. i=l (2. .) = O.il. {3. if the parametrization {3 ) ZD{3 is identifiable. p.il2Zi) = 0.12) . we will find that the values of y will not be the same.8).2. Zi. . Y is the amount of phosphorus found in com plants grown for 38 days in the different samples of soil.(3zz.2. In the measurement model in which Yi is the detennination of a constant Ih. In that case.1. We have already argued in Examp}e 2. we get the solutions (2.2. I I. < Methods of Estimation Chapter 2 =y  I'(Z) Eyz:Ez~Ezy. Following are the results of an experiment to which a regression model can be applied (Sned.1 '. We want to estimate f31 and {32. Y is random with a distribution P(y I z). the relationship between z and y can be approximated well by a linear equation y = 131 + (32Z provided z is restricted to a reasonably small interval. . For this reason.1. 1 < i < n.ill . exists and is unique and satisfies the Donnal equations (2. the solution of the normal equations can be given "explicitly" by (2. il. 139).11) When the Zi'S are not all equal.1. il.ecor and Cochran. whose solution is . g(z.) = il" a~.2. the least squares estimate. 0 Ii.25.) ~ 0. i I .2. given Zi = 1 < i < n. Estimation of {3 in the linear regression model. we have a Gaussian linear regression model for Yi. The nonnal equations are n i=l n ~(Yi . g(z. Example 2.
2.Section 2. that is p I'(z) ~ 2:). . The line Y = /31 + (32Z is an estimate of the best linear MSPE predictor Ul + b1Z of Theorem 1. . i = 1... P > d and postulate that 1'( z) is a linear combination of 91 (z). . . 9p( z).. The vertical distances €i = fYi . j=1 Then we are still dealing with a linear regression model because we can define WpXI .132 z ~ ~ (2. Yn). €6 is the residual for (Z6. 91. n} and sample regression line for the phosphorus data. . (zn. . 0 ~ ~ ~ ~ ~ Remark 2.2.13) where Z ~ (lin) I:~ 1 Zi..1. Here {31 = 61. &atter plot {(Zi.3. .Yi) and a line Y ~ a + bz vertically by di = Iy. Yi)..(Z).4..2 Minimum Contrast Estimates and Estimating Equations 111 y 50 • o 10 20 x 30 Figure 22. The linear regression model is considerably more general than appears at first sight. .Yn on ZI...1.9p. i = 1. if we measure the distance between a point (Zi.. For instance.(/31 + (32Zi)] are called the residuals of the fit. and fi = (lin) I:~ 1 Yi· The line y = /31 + f32z is known as the sample regression line or line of best fit of Yl. .42.. . . .2. This connection to prediction explains the use of vertical distance! in regression. .1.9. Y6). suppose we select p realvalued functions of z. ... and 131 = fi . Geometrically.(a + bzi)l. n.. .. then the regression line minimizes the sum of the squared distances to the n points (ZI' Yl).58 and 132 = 1. The regression line for the phosphorus data is given in Figure 2. Zn.
I . Zi) + Ei . Zi) = {3l + fhZi. 1 n. • I' "0 r . we arrive at quadratic regression.."..Zi)]2 = L ~. In Example 2. if d = 1 and we take gj (z) = zj..16) as a function of {3. Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter. we can write (2.. ..wi.17) . Zi2 = Zi. Example 2.wi) g((3.. That is. .2.. gp( z)) T as our covariate and consider the linear model Yi = where L j=l p ()jWij + Ci.2. Zil = 1. and (32 that minimize ~ ~ "I. which for given Yi = yd.3. Weighted least squares.. 1 <i < n Yi n 1 . if we setg«(3.z. Yi = 80 + 8 1Zi + 82 + Cisee Problem 2.. For instance. Weighted Linear Regression. .2.zd + ti.5) fails.. Zi)/.filii minimizes ~  I i i L[1Ii .. I and the Y i satisfy the assumption (2.15) . .. However..wi.. _ Yi .. then = WW2/Wi = .. We return to this in Volume II.g(.2.. Consider the case in which d = 2.5). fi = <d.. The weighted least squares estimate of {3 is now the value f3.g«(3. but the Wi are known weights.8. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic).)]2 i=l i=l ~ n (2.wi 1 < . The method of least squares may not be appropriate because (2.'l are sufficient for the Yi and that Var(Ed.2. i = 1. < n.2. and g({3.wi  _ g((3.24 for more on polynomial regression. .2. 0 < j < 2. ..14) where (J2 is unknown as before.wi . 1<i <n Wij = 9j(Zi).II 112 Methods of Estimation Chapter 2 (91 (z). iI L Vi[Yi i=l n ({3. Note that the variables . Thus. z.= g({3. We need to find the values {3l and /32 of (3. . we may be able to characterize the dependence of Var( ci) on Zi at least up to a multiplicative constant. + /3zzi)J2 (2.. Zi) = (2. [Yi . .2.= y.2 and many similar situations it may not be reasonable to assume that the variances of the errors Ci are the same for all levels Zi of the covariate variable.2..2.
Thus.(L:~1 UlYi)(L:~l uizd Li~] Ui Zi . i ~ I....2.Y. Var(Z") and  _ L:~l UiZiYi .4.2 Minimum Contrast Estimates and Estimating Equations 113 where Vi = l/Wi_ This problem may be solved by setting up analogues to the normal equations (2.n.2. That is.1) and (2.1.B that minimizes (2. Zi) = z.~ UiYi .Section 2. . we may allow for correlation between the errors {Ei}.(Li~] Ui Z i)2 n 2 n (2.18) ~ ~ ~I" (31 = E(Y') . When ZD has rank d and Wi > 0.13 minimizing the least squares contrast in this transformed model is given by (2.3. Yn) and probability distribution given by PI(Z". .n where Ui n = vi/Lvi.2.. We can also use the results on prediction in Section 1.2.. .2.1 leading to (2. wn ) and ZD = IIZijllnxd is the design matrix.2.4 as follows.1.19) and (2. then its MSPE is ElY" ~ 1l1(Z")f ~ :L UdYi i=l n «(3] + (32 Zi)1 2 It follows that the problem of minimizing (2.4H2. Moreover. . as we make precise in Problem 2. More generally.(32E(Z') ~ . we find (Problem 2. that weighted least squares estimates are also plugin estimates. . . a__ Cov(Z"'. Then it can be shown (Problem 2. the . when g(. 1 < i < n. Let (Z*.2.16) for g((3.2.)] ~Ui. we can write Remark 2.2.B.7).(3. z) = zT..28) that the model Y = ZD..6)..B.27) that f3 satisfy the weighted least squares normal equations ~ ~ where W = diag(wI. Y*) V. suppose Var(€) = a 2W for some invertible matrix W nxn. 0 Next consider finding the.B+€ can be transformed to one satisfying (2. By following the steps of Exarnple 2. i i=l = 1. Y*) denote a pair of discrete random variables with possible values (z" Yl)..2.1 7) is equivalent to finding the best linear MSPE predictor of y . (3 and for general d. . . . ~ . If Itl(Z*) = {31 given by + {32Z* denotes a linear predictor of y* based on Z .2. using Theorem 1..20).26.2.~ UiZi· n n i"~l I" n n F"l This computation suggests. (Zn.8). ..Y") ~(Zi.2.2.
.
.
.
.
. equivalently. .6. 0) = 20'(1.2. let nJ.. 2. the dual point of view of (2. respectively.l. In general. p(2. Consider a popUlation with three kinds of individuals labeled 1. which is maximized by 0 = 0. O)p(l. Nevertheless. A is an unknown positive constant and we wish to estimate A using X...4)... then X has a Poisson distribution with parameter nA.2. then I • Lx(O) ~ p(l. X3 = 1. + n. } with probabilities. . 2n (2. Here are two simple examples with (j real. Evidently. situations with f) well defined but (2.5. OJ = (1  0)' where 0 < () < 1 (see Example 2.OJ'".28) If 2n..7. (2. the MLE does not exist if 112 + 2n3 = O. 1).2. X2 = 2.1. which has the unique solution B = ~.0)' < ° for all B E (0.lx(0) = 5 1 0' . the maximum likelihood estimate exists and is given by 8(x) = 2n.O) = 0'.118 Methods of Estimation Chapter 2 which again enables us to analyze the behavior of B using known properties of sums of independent random variables.ny l. 1.27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section. Example 2.22) and (2.0). 0 e Example 2. p(X. Here X takes on values {O. and 3 and occurring in the HardyWeinberg proportions p(I. x n } equal to 1..>.2.(1.2. If we make the usual simplifying assumption that the arrivals form a Poisson process. .27) doesn't make sense. represents the expected number of arrivals in an hour or. . 8' 80. and as we have seen in Example 2.2. there may be solutions of (2.x=O.2. Then the same calculation shows that if 2nl + nz and n2 + 2n3 are both positive.29) '(j . is zero.0) The likelihood equation is 8 80 Ix (0) ~ = 5 1 0 10 =0. Similarly.27) that are not maxima or only local maxima. 0) ~ 20(1 . the likelihood is {1 . p(3. n2. Because . I).2. + n. x. where ). so the MLE does not exist because = (0. the rate of arrival.2. and n3 denote the number of {Xl. .2. ]n practice. Let X denote the number of customers arriving at a service counter during n hours. ~ maximizes Lx(B). If we observe a sample of three individuals and obtain Xl = 1. 0)p(2.) = ·j1 e'"p. 2 and 3.
7. .2...8. = L.n. k. trials in which each trial can produce a result in one of k categories. this estimate is the MLE of ). and k j=1 k Ix(fJ) = LnjlogB j ..8B " 1=13 k k =0. is that l be concave in If l is twice differentiable.lx(B) <0. . If x = 0 the MLE does not exist. p(x. the maximum is approached as .ekl with kl Ok ~ 1.32). this is well known to be equivalent to ~ e. = jl. Example 2.\ I 0. consider an experiment with n Li.30)). 31=1 By (2. the MLE must have all OJ > 0. fJ) = TI7~1 B7'. . = j) be the probability J of the jth category.32) We first consider the case with all the nj positive.2.31 ) To obtain the MLE (J we consider l as a function of 8 1 . As in Example 1. and the equation becomes (BkIB j ) n· OJ = .2. let B = P(X.2 Minimum Contrast Estimates and Estimating Equations 119 The likelihood equation is which has the unique solution). 8' (2. We assume that n > k ~ 1. e. However. Then. fJE6= {fJ:Bj >O'LOj ~ I}. L. . A similar condition applies for vector parameters.logB..6..kl. 8B. and let N j = L:~ 1 l[Xi = j] be the number of observations in the jth category. familiar from calculus. and must satisfy the likelihood equa~ons ~ 8B1x(fJ) = 8B 3 8 8 " n.6.LBj j=1 • (2. 8) = 0 if any of the 8j are zero.d.1.2. A sufficient condition. .. 8Bk/8Bj (2. (see (2.2. 80. Multinomial Trials. j = 1.Section 2..2. j=d (2. for an experiment in which we observe nj ~ 2:7 1 I[X. 0 To apply the likelihood equation successfully we need to know when a solution is an MLE. thus.. Let Xi = j if the ith trial produces a result in the jth category. . J = I. Then p(x.32) to find I. = x/no If x is positive..o. ~ n .30) for all This is the condition we applied in Example 2. ~ ~ .2.2...
'ffi f. i=l g«(3. are known functions and {3 is a parameter to be estimated from the independent observations YI . zn) 03W.1 holds and X = (Y" .202 L.2. we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n. The 0 < OJ < 1.3. Then (} with OJ = njjn. Zi). k. . Zi)) 00 ao n log 2 I"" 2 "2 (21Too) . As we have IIV. N(lll 0. ((g((3.. Using the concavity argument.).2 ) with J1..2. In Section 2..(Xi . .X)2 (Problem 2.1. where E(V. YnJT. is still the unique MLE of fJ. OJ > ~ 0 case. ~ g((3. Next suppose that nj = 0 for some j.gi«(3) 1 . Suppose the model Po of Example 2.6. See Problem 2.  seen and shall see more in Section 6. (2.2. Then Ix «(3) log IT ~'P (V.2.11(a)). i = 1. 1 < i < n.2. .) = gi«(3)..IV.2 both unknown. see Problem 2. then <r <k I. Zi)J[l:} . ...Zi)]2.1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV. .28.120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8). n. Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix. . zi)1 .. 0.33) It follows that in this nj > 0. Suppose that Xl. we check the concavity of lx(O): let 1 1 <j < k ..30... 1 < j < k.d. where Var(Yi) does not depend on i.g«(3..2. as maximum likelihood estimates for f3 when Y is distributed as lV. o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2.. X n are Li..1.. 1 Yn . . Example 2. version of this example will be considered in the exponential family case in Section 2.. and 0. j = 1.34) g«(3.1). Thus. z. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3. g«(3. ~ g«(3.lV. Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O).' . wi(5).2. More generally. .1 L:~ . gi.2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood. .9. these estimates viewed as an algorithm applied to the set of data X make sense much more generally. we can consider minimizing L:i. Summary. least squares estimates are maximum likelihood for the particular model Po. This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • .
Suppose also that e t R where e c RP is open and 1 is (2.. These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence. (a. as k .2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2.6.oo}}.00.bE {a.4 and Corollaries 1.. which are appropriate when Var(Yj) depends on i or the Y's are correlated.3 and 1. though the results of Theorems 1. are given. (h).3. (m. Suppose we are given a function 1: continuous.1. See Problem 2.1 ).a < b< oo} U {(a. Concavity also plays a crucial role in the analysis of algorithms in the next section. In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d.. where II denotes the Euclidean norm. it is shown that the MLEs coincide with the LSEs. (}"2) distribution. .2 and other exponential family properties also playa role. Let &e = be the boundary of where denotes the closure of in [00.6. We start with a useful general framework and lemma. Suppose 8 c RP is an open set.6. In Section 2. we define 8 1n .1 only. That is. oo]P. For instance. Proof. Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}.3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil. m).I ) all tend to De as m ~ 00. (a. For instance. .1 and 1. or 8 1nk diverges with 181nk I . 2 e Lemma 2. z=1=1 ~ 2. in the N(O" O ) case. e e I"V e De= {(a. Properties that derive solely fTOm concavity are given in Propositon 2. m.ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e.00. = RxR+ and ee e.3.Section 2. Extensions to weighted least squares. b). (m.b) .6. m. 8). a8 is the set of points outside of 8 that can be obtained as limits of points in e.3.5.3. . In general. This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI.a= ±oo.1.1) lim{l(li) : Ii ~ ~ De} = 00. In the case of independent response variables Yi that are modeled to have a N(9i({3).3.3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly. (m. including all points with ±oo as a coordinate.zid. Formally. if X N(B I .2. b). b) : aER. for a sequence {8 m } of points from open.
Proofof Theorem 2. which implies existence of 17 by Lemma 2.9) we know that II lx(lI) is continuous on e. Let x be the observed data vector and set to = T(x). If 0.1. ! " .1.). Without loss of generality we can suppose h(x) = pix.3. thus.3. Furthennore.)) > ~ (fx(Od +lx (0. (a) If to E R k satisfies!!) Ifc ! 'I" 0 (2.2).122 Methods of Estimation Chapter 2 Proposition 2. [ffunher {x (II) e = e e t 88. Suppose the cOlulitions o/Theorem 2. Suppose X ~ (PII . and 0. II E j. then the MLE8(x) exists and is unique. ~ Proof.3. By Lemma 2.)) 0 Themem 2.6. Existence and Uniqueness o/the MLE ij. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data.  (hO! + 0. . Applications of this theorem are given in Problems 2.3. then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT.3.8 and 2.27).1.3. then Ix lx(O.3) has i 1 !.'1) with T(x) = 0.3. We show that if {11 m } has no subsequence converging to a point in E. II). II) is strictly concave and 1.1. '10) for some reference '10 E [ (see Problem 1. h) and that (i) The natural parameter space.1.3. a contradiction.1. = . .12.00. open C RP.3. Then.'1 . " We.. (ii) The family is of rank k. if to doesn't satisfy (2. is open..2) then the MLE Tj exists.3) I I' (b) Conversely. . then lx(11 m ) . then the MLE doesn't exist and (2. E. and is a solution to the equation (2. " .to.. We can now prove the following. Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1. /fCT is the convex suppon ofthe distribution ofT (X).3. Corollary 2.1 hold.3. is unique. From (B. lI(x) =  exists. we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) .3. ' L n . no solution. U m = Jl~::rr' Am = . We give the proof for the continuous case. i . are distinct maximizers. Suppose P is the canonical exponential/amity generated by (T. Write 11 m = Am U m . if lx('1) logp(x.3. I I.(11) ~ 00 as densities p(x. with corresponding logp(x.
that is. fOT all 1].Section 2.d.1 follow.6. Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists. I' E R.3.* P'IlcTT = 0] = I. CT = R )( R+ FOT n > 2.1. Nonexistence: if (2. So. The Gaussian Model. (1"2 > O.1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2.Xn are i. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist.J = 00. . if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows. u mk t u. iff.3) by TheoTem 1. Po[uTT(X) > 61 > O. This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0.3. 0 Example 2.4.3. both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets. Then. CT = C~ and the MLE always exists.3.9. 0 (L:7 L:7 CJ.i. lIu m ll 00. Theorem 2. Write Eo for E110 and Po for P11o . N(p" ( 2 ). t u..3.3). existence of MLEs when T has a continuous ca. I Xl) and 1. . By (B. Evidently.k Lx( 11m.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it. It is unique and satisfies x (2.2. As we observed in Example 1. um~. Then AU ¢ £ by assumption. which is impossible. So In either case limm. Suppose the conditions ofTheorem 2.3. Then because for some 6 > 0. this is the exponential family generated by T(X) = 1 Xi. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0.6. The equivalence of (2. thus.2) and COTollary 2. contradicting the assumption that the family is of rank: k.5.3. If ij exists then E'I T ~ 0 E'I(cTT) ~ 0.2) fails. So we have Case 2: Amk t A. 0 ° '* '* PrOD/a/Corollary 2. T(X) has a density and. In fact. Por n = 1.3.3.1.se density is a general phenomenon.3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1. for every d i= 0. Suppose X"". Case 1: Amk t 111]=11.
.3.4) and (2._t)T. I <t< k ..1 in the second.3.). Thasadensity.6. 0 = If T is discrete MLEs need not exist..2 follows. The statistic of rank k . the best estimate of 8. We follow the notation of Example 1.3. we see that (2. They are determined by Aj = Tj In. then and the result follows from Corollary 2.1. Thus. >. Thus. I <j < k. I < j < k.1 that in this caseMLEs of"'j = 10g(AjIAk). > O. In Example 3. using (2. The likelihood equations are equivalent to (problem 2. We conclude from Theorem 2. the maximum likelihood principle in many cases selects the «best" estimate among them. P > 0. Suppose Xl. .3. in a certain sense. exist iff all Tj > O.I.n..3.. I < j < k iff 0 < Tj < n. lit ~ .1. We assume n > k . where 0 < Aj PIX = j] < I. x > 0. The boundary of a convex set necessarily has volume 0 (Problem 2.5) have a unique solution with probability 1. but only Os is a MLE.i.2(b» r' log .4..3. . Because the resulting value of t is possible if 0 < tjO < n. o Remark 2.6.5) ~=X A where log X ~ L:~ t1ogXi. It is easy to see that ifn > 2.4 we will see that 83 is. with i'.3. Multinomial Trials. the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2.2. Here is an example.I and verify using Theorem 2.T(X) = A(. The TwoParameter Gamma Family. I < j < k. From Theorem 1.3.)e>"XxPI. h(x) = XI. liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2. = = L L i i bJ _ I .2) holds.13 and the next example). if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO .d.2(a). LXi).1).124 Methods of Estimation Chapter 2 Proof.1.3).2 that (2.3 we know that E. For instance.ln. To see this note that Tj > 0. in the HardyWeinberg examples 2.(X) = r~. by Problem 2.>.3.= r(jJ) A log X (2.3.. This is a rank 2 canonical exponential family generated by T   = (L log Xi..3.Xn are i. Example 2. I < j < k.2..3.4.3. and one of the two sums is nonempty because c oF 0. with density 9p.7. When method of moments and frequency substitution estimates are not unique.1. thus.1. How to find such nonexplicit solutions is discussed in Section 2.6.4) (2. A nontrivial application of Theorem 2.4 and 2.3.9). Example 2.3.3. where Tj(X) L~ I I(Xi = j).I which generates the family is T(k_l) = (Tl>"" T..3. if T has a continuous case density PT(t)..
1). Similarly.3. Remark 2.3.1 and Haberman (1974). the bivariate normal case (Problem 2.3. .2) so that the MLE ij in P exists. h).3. Unfortunately strict concavity of Ix is not inherited by curved exponential families. (2. In some applications.8 we saw that in the multinomial case with the clQsed parameter set {Aj : Aj > 0.2..1 we can obtain a contradiction to (2. when we put the multinomial in canonical exponential family form. for example. whereas if B = [0. the following corollary to Theorem 2.3. However. L. the MLE does not exist if 8 ~ (0.1 is useful.6) on c( II) = ~~.3. our parameter set is open.13). Alternatively we can appeal to Corollary 2.1.1.3.B) ~ h(x)exp LCj(B)Tj(x) .A(c(ii)) ~ = O.3. 1)T.3. . The following result can be useful.2.3.10). the MLEs ofA"j = 1. Ck (B)) T and let x be the observed data. Corollary 2. 1 < i < k . Consider the exponential family k p(x. When P is not an exponential family both existence and unicity of MLEs become more problematic. Let CO denote the interior of the range of (c. then it is the unique MLE ofB. .3) does not have a closedform solution as in Example 1. Then Theorem 2.exist and are unique.1. and unicity can be losttake c not onetoone for instance. x E X. if any TJ = 0 or n..1] it does exist ~nd is unique. In Example 2.2. .3.6. n > kl. If the equations ~ have a solution B(x) E Co. then so does the MLE II in Q and it satisfies the likelihood equation ~ cT(ii) (to . be a cUlVed exponential family p(x. 0 < j < k . mxk e.k. if 201 + 0.3.Section 2.3.3 can be applied to determine existence in cases for which (2.. 0 The argument of Example 2.~l Aj = I}. Here [ is the natural paramerer space of the exponential fantily P generated by (T. lfP above satisfies the condition of Theorem 2. The remaining case T k = a gives a contradiction if c = (1.8see Problem 2.3.1. (B). .6.I.2) by taking Cj = 1(i = j).. (II) ..3. Let Q = {PII : II E e). .1 directly D (Problem 2. m < k . note that in the HardyWeinberg Example 2. = 0..7) Note that c(lI) E c(8) and is in general not ij.B(B) j=l . c(8) is closed in [ and T(x) = to satisfies (2.3 Maximum likelihood in Multiparameter Exponential Families 125 On the other hand. BEe.A(c(II))}h(x) Suppose c : 8 ~ [ C R k has a differential (2.lI) ~ exp{cT (II)T(x) .3. e open C R=. .
'. Because It > 0. C(O) and from Example 1. . This is a curved exponential ¥. Methods of Estimation Chapter 2 family with (It) = C2 (It) = .~Y'1.... '() A 1/ Thus.\()'.. ry.6. we see that the distribution of {01 : j = 1.i..2 .ryl > O.n. i = 1. .ry.A6iL2 = 0 Ii. .".d..ry.7) becomes = 0.. suppose Xl. with I.Zn are given constants. af = f)3(8 1 + 82 z i )2. . m} is a 2nparameter canonical exponential family with 'TJi = /tila. .10. . 1 n. L: Xi and I. are n independent random samples.) : ry. ..!0b''. Suppose that }jI •.?' m)T " ". Gaussian with Fixed Signal to Noise..6. I . we can conclude that an MLE Ii always exists and satisfies (2.3..nit.< O.6. Equation (2.3.) : ry.6. Now p(y.126 The proof is sketched in Problem 2.5 and 1.. Using Examples 1. the o q. .3. j = 1.2 and 2. . Jl > 0.3. i Example 2.. t.5.3.10.n(it' + )"5it'))T which with Ji2 = n.it. .3. < O}. As a consequence of Theorems 2. E R..3. < Zn where Zt.3.it. LocationScale Regression.3 )(t.gx±). l}m. Ii± ~ ~[).7) if n > 2.: Note that 11+11_ = '6M2 solution we seek is ii+.. .Xn are i. tLi = (}1 + 82 z i .2. = ( ~Yll'''' .11. ). which implies i"i+ > 0. l = 1.6. . 'TJn+i = 1/2a..oJ). . = 1 " = 2 n( ryd'1' .3. m m m f?.. 11..1/'1') T ./2'1' .5X' +41i2]' < 0.9. 9) is a curved exponential family of the form (2..~Yn. which is closed in E = {( ryt> ry. We find CI Example 2.g( it' . corresponding to 1]1 = //x' 1]2 = ..6) with " • .2~~2 .4.5 = .0'2) with Jl/ a = AO > a known. Next suppose. N'(fl. As in Example 1.3)T. where 0l N(tLj 1 0'1). as in Example 1. simplifies to Jl2 + A6XIl. n. = L: xl.. that . '1. . < O}. Zl < .I LX.. generated by h(Y) = 1 and "'> I T(Y) '. = ~ryf. ... Evidently c(8) = {(lh.'' .
Finally. These results lead to a necessary condition for existence of the MLE in curved exponential families but without a guarantee of unicity or sufficiency. then the full 2nparameter model satisfies the conditions of Theorem 2.7). Let £ be the canonical parameter set for this full model and let e ~ (II: II. L 10). by the intermediate value theorem. which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with £ open (Theorem 2. Given f continuous on (a. In fact. even in the context of canonical multiparameter exponential families.4. then.3. > OJ. It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. Ixd large enough.O. However. entrust ourselves. an MLE (J of (J exists and () 0 ~ ~ Summary. The packages that produce least squares estimates do not in fact use fonnula (2.1 work. Here. b) such that f(x*) = O.1. is isolated and shown to apply to a broader class of models. is the bisection algOrithm to find x*.Section 2. order d3 operations to invert. L 10) for {3 is easy to write down symbolically but not easy to evaluate if d is at all large because inversion of Z'bZD requires on the order of nd 2 operations to evaluate each of d(d + 1)/2 tenns with n operations to get Z'bZD and then.1 The Method of Bisection The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in kparameter exponential families. even in the classical regression model with design matrix ZD of full rank. b). We begin with the bisection and coordinate ascent methods. strict concavity. Given tolerance € > a for IXfinal .02 E R. Initialize x61d ~ XI. we will discuss three algorithms of a type used in different statistical contexts both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all. 2. in this section. the basic property making Theorem 2.3. in pseudocode. ~ 2.1). at various times. x o1d = xo. > 2. f i strictly.). E R. f( a+) < 0 < f (b.1 and Corollary 2. the fonnula (2.3. MLEs may not be given explicitly by fonnulae but only implicitly as the solutions of systems of nonlinear equations. such as the twoparameter gamma. d.3.4 ALGORITHMIC ISSUES As we have seen.x*l: Find Xo < x" f(xo) < 0 < f(x') by taking Ixol. . Then c(8) is closed in £ and we can conclude that for m satisfies (2.3.3.4 Algorithmic Issues 127 If m > 2. there exists unique x*€(a.1. if implemented as usual.
for m = log2(lxt . xfinal = Xnew· (4) If f(xnew) < 0. End Lemma 2. may befound (to tolerance €) by the method afbisection applied to f(..4. .xoldl Methods of Estimation Chapter 2 < 2E. in addition.1. the MLE Tj. o ! i If desired one could evidently also arrange it so that. I (3) and X m + x* i as m j. which exists and is unique by Theorem 2.to· I' Proaf. From this lemma we can deduce the following. lx. X n be i.3.4.xol/€). .1. Theorem 2.d. !'(. Let p(x I 1]) be a oneparameter canonical exponentialfamily generated by (T. [(8. xfinal = !(x~ld + x old ) and return xfinal' (2) Else. f(a+) < 0 < f(b). • .) = VarryT(X) > 0 for all ..6..IXm+l . by the intermediate value theorem.. Moreover. 0 Example 2. I.4. (2. x~ld = Xnew· Go to (I). The bisection algorithm stops at a solution xfinal such that i Proot If X m is the mth iterate of Xnew i (I) Moreover.x*1 < €.1 and T = to E C~. 00. Let X" . Xm < x" < X m +l for all m.4.i. h). x~ld (5) If f(xnew) > 0.1.... the interior (a. exists. By Theorem 1.1.3.1) .4. satisfying the conditions of Theorem 2.. so that f is strictly increasing and continuous and necessarily because i. If(xfinal)1 < E.xol· 1 (2) Therefore..1). Then.) = EryT(X) .. i I i. .. b) of the convex support a/PT. The Shape Parameter Gamma Family.128 (I) If IX~ld . Xnew = !(x~ld + x~ld)' ~ Xnew· (3) If f(xnew) = 0.
getting 1j 1r). say c.'fJk· Repeat. which is slow.· .4. It solves the r'(B) r(B) T(X) n which by Theorem 2. we would again set a tolerenceto be.···.2 Coordinate Ascent The problem we consider is to solve numerically. 1 k. d and finally 1] _ ~Il) _ =1] (~1 1]1"".'fJk' ~1) an soon. E1/(T(X» = A(1/) = to when the MLE Tj = Tj(to) exists. 1] ~Ok = (~1 ~1 {} "") 1]I.4 Algorithmic Issues 129 Because T(X) = L:~' equation I log Xi has a density for all n the MLE always exists.4.1 can be evaluated by bisection. In fact.'TJk =tk' ) . The general case: Initialize ~o 1] = ("" 'TJll···' "") • 'TJk Solve ~1 f or1]k: Set 1] 8'T]k [) A(~l ~1 1]ll1'J2. The case k = 1: see Theorem 2.4.Section 2. r > 1. The function r(B) = x'Iexdx needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method. However.oJ _ = (~1 "" {}) ~02 _ 1]1. in cycle j and stop possibly in midcycle as soon as <I< . but as we shall see. always converges to Ti.1.'TJk. Notes: (1) in practice.1]2. eventually. it is in fact available to high precision in standard packages such as NAG or MATLAB.'TJ3.···.. This example points to another hidden difficulty. for a canonical kparameter exponential family. Here is the algorithm.1]2. bisection itself is a defined function in some packages. for each of the if'. 0 J:: 2. .
Bya standard argument it follows that.3. Consider a point we noted in Example 2.. Because l(ijI) ~ Aand the MLE is unique. . (I) l(ij'j) Tin j for i fixed and in. limi. 0 ~ (4) :~i (77') = 0 because :~. Theorem 2. (3) and (4) => 1]1 . x ink) ( '~ini . I • . Let 1(71) = tif '1. . in fact.'11 11 t t (I . is computationally explicit and simple.4.1J k) . is the expectation of TI(X) in the oneparameter exponential family model with all parameters save T1I assumed known. 71J. = 7Jk. as it should. W = iji = ij. For n > 2 we know the MLE exists..4. We pursue this discussion next. Then each step of the iteration both within cycles and from cycle to cycle is quick. ij = (V'I) . I(W ) = 00 for some j.2: For some coordinates l..pO) = We now use bisection ' r' . Therefore. _A (1»).2. 1 < j < k. .\(0) = ~. 0 Example 2. rp differ only in the second coordinate. Here we use the strict concavity of t. . .'. (i).. the algorithm may be viewed as successive fitting of oneparameter families. . ij as r t 00. We use the notation of Example 2.··. This twodimensional problem is essentially no harder than the onedimensional problem of Example 2.. The TwoParameter Gamma Family (continued).. TIl = . the log likelihood.2. Thus. to ¢ Fortunately in these cases the algorithm. (3) I (71j) = A for all j because the sequence oflikelihoods is monotone.I) solving r0'. (6) By (4) and (5).I ~(~ (1) 1 to get V. ... (Wn') ~ O. ff 1 < j < k. .1. Else lim...130 Methods of Estimation Chapter 2 (2) Notice that (iii.2. 1J ! But 71j E [. (2) The sequence (iji!. Continuing in this way we can get arbitrarily close to 1j. 'II.1 because the equa~ tion leading to Anew given bold' (2. A(71') ~ to.A(71) + log hex). .5).4.. ij'j and ij'(j+I) differ in only one coordinate for which iji(j+1) maximizes l. 'tn. ilL" iii.3.... We note some important generalizations. We give a series of steps.3. j (5) Because 1]1. (fij(e) are as above. Proof.. (ii) a/Theorem 2. Hence. Suppose that this is true for each l... fI.I)) = log X + log A 0) and then A(1) = pY. Whenever we can obtain such steps in algorithms. 1j(r) ~ 1j.4. We can initialize with the method 2 of moments estimate from Example 2. The case we have C¥. 71 1 is the unique MLE.=1 d j = k and the problem of obtaining ijl(tO.j l(i/i) = A (say) exists and is > 00.2. they result in substantial savings of time.2'  It is natural to ask what happens if. To complete the proof notice that if 1j(rk ) is any subsequence of 1j(r) that converges to ij' (say) then. the MLE 1j doesn't exist.l(ij') = A.. ijik) has a convergent subsequence in t x .4. Suppose that we can write fiT = (fir. j ¥ I) can be solved in closed form.1j{ can be explicit.1 hold and to E 1j(r) t g:. refuses to converge (in fI space!)see Problem 2. by (I).2.? Continuing. .) where flj has dimension d j and 2::.. that is.···) C!J.
Then it is easy to see that Theorem 2.1 in which Ix(O). 1 B. . r = k.1 illustrates the j process. See also Problem 2. that is.. The graph shows log likelihood contours.4.92. each of whose members can be evaluated easily.4.B2 )T where the log likelihood is constant.2 has a generalization with cycles of length r. find that member of the family of contours to which the vertical (or horizontal) line is tangent.4..7. and Holland (1975).3.4. = dr = 1. Change other coordinates accordingly..1. iterate and proceed.. ... The coordinate ascent algorithm can be slow if the contours in Figure 2.4..1 are not close to sphericaL It can be speeded up at the cost of further computation by Newton's method. values of (B 1 .B~) = 0 by the method of j bisection in B to get OJ for j = 1. e ~ 3 2 I o o 1 2 3 Figure 2.. the log likelihood for 8 E open C RP.p. which we now sketch. At each stage with one coordinate fixed. Next consider the setting of Proposition 2. . for instance. .10. .4.4 Algorithmic Issues 131 just discussed has d 1 = .Section 2. Figure 2. Solve g~: (Bt. 1' B . The coordinate ascent algorithm. is strictly concave.4. A special case of this is the famous DemingStephan proportional fitting of contingency tables algorithmsee Bishop. If 8(x) exists and Ix is differentiable. the method extends straightforwardly. Feinberg. and Problems 2. BJ+l' .
3. coordinate ascent. Here is the method: If 110 ld is the current value of the algorithm.. the argument that led to (2. then by f1new is the solution for 1] to the approximation equation given by the right.3. when it converges. methods such as bisection.and lefthand sides. B)}F(Xi.to). (2. B) The density is = [I + exp{ (x = B) WI.B) i=l 1 1 2 L i=l n I(X" B) < O.3) Example 2.f. can be shown to be faster than coordinate ascent.4.2) The rationale here is simple. A hybrid of the two methods that always converges and shares the increased speed of the NewtonRaphson method is given in Problem 2.10.4. Bjork.4. F(x.6.1. in general. We return to this property in Problem 6.k I (iiold)(A(iiold) . and NewtonRaphson's are still employed. If 7J o ld is close to the root 'ij of A(ij) expanding A(ij) around 11old' we obtain = to. Let Xl. which may counterbalance its advantage in speed of convergence when it does converge. is the NewtonRaphson method.B) We find exp{ (x . _. In this case. 1 l(x. though there is a distinct possibility of nonconvergence or convergence to a local rather than global maximum.3 The NewtonRaphson Algorithm An algorithm that. and Anderson (1974). this method is known to converge to ij at a faster rate than coordinate ascentsee Dahlquist.7.B)} [i+exp{(xB)}j2' n 1 l(B) l(B) n2Lexp{(X. A onedimensional problem 7 ..132 Methods of Estimation Chapter 2 2. X n be a sample from the logistic distribution with d. Newton's method also extends to the framework of Proposition 2. then ii new = iiold . if l(B) denotes the log likelihood.4. I 1 The NewtonRaphson method can be implemented by taking Bold X. When likelihoods are noncave.  I o The NewtonRaphson algorithm has the property that for large n. If 110 ld is close enough to fj. This method requires computation of the inverse of the Hessian.4. iinew after only one step behaves approximately like the MLE.2) gives (2.4.
a)] = B2. Lumped HardyWeinberg Data. e = Example 2. It does tum out that in this simplest case an explicit maximum likelihood solution is still possible. If we suppose (say) that observations 81.B).€i3). Po[X (0. . and Anderson (1974). difficult to compute.4.. There are ideal observations. Xi = (EiI.4.x(B) is "easy" to maximize. S = S(X) where S(X) is given by (2.4.. let Xi.. is not X but S where = 5i 5i Xi.OJ] i=l n (2.4. in Chapter 6 of Dahlquist. be a sample from a population in HardyWeinberg equilibrium for a twoallele locus. What is observed.4) Evidently. . 0) is difficultto maximize. the homozygotes of one type (€il = 1) could not be distinguished from the heterozygotes (€i2 = 1).0 E c Rd Their log likelihood Ip. I)] (1 . but the computation is clearly not as simple as in the Original HardyWeinberg canonical exponential family example.4. Here is another important example. . A prototypical example folIows. 2. we observe 5 5(X) ~ Qo with density q(s. Yet the EM algorithm. For detailed discussion we refer to Little and Rubin (1987) and MacLachlan and Krishnan (1997). Many examples and important issues and methods are discussed. and so on. the function is not concave. 0 leads us to an MLE if it exists in both cases. E'2. €i2 + €i3). X ~ Po with density p(x. The algorithm was fonnalized with many examples in Dempster. 0.0. Laird. 1 <i< m (€i1 +€i2. Say there is a closedfonn MLE or at least Ip.4).4. a < 0 < I.(0) })2EiI 10gB + Ei210g2B(1 .4 Algorithmic Issues 133 in which such difficulties arise is given in Problem 2. 0) where I"..B) + 2E'310g(1 . however.Section 2. then explicit solution is in general not lXJssible.4 The EM (Expectation/Maximization) Algorithm There are many models that have the following structure. Soules.0)] ~ 28(1 . and its main properties. where Po[X = (1. with an appropriate starting point. . Bjork. Ei3). We give a few examples of situations of the foregoing type in which it is used.6. The log likelihood of S now is 1". Petrie. and Weiss (1970). As in Example 2. for instance. 1 8 m are not Xi but (€il. 1.13. This could happen if. (2. for some individuals. and Rubin (977).x(B) is concave in B. Po[X = (0. A fruitful way of thinking of such problems is in terms of 8 as representing part of X.4.2.0)2. m+ 1 <i< n. i = 1.5) + ~ [(EiI+Ei2)log(1(10)')+2Ei310g(IB)] i=m+l a function that is of curved exponential family fonn. 0).n. (0) = log q(s. Unfortunately. though an earlier general form goes back to Baum. the rest of X is "missing" and its "reconstruction" is part of the process of estimating (} by maximum likelihood.
Let . :B 10gq(s.B) =E. the M step is easy and the E step doable.2)andO <. = 01.12) q(s. .Sn is a sample from a population P whose density is modeled as a mixture of two Gaussian densities.12).Bo) and (2.o log p(X. I That is.B) I S(X) = s) 0=00 (2.Bo) 0 p(X.4. including the examples.\. which we give for 8 real and which can be justified easily in the case that X is finite (Problem 2..4. o (:B 10gp(X.4.9) follows from (2. This fiveparameter model is very rich pennitting up to two modes and scales. (P(X.tz E Rand 'Pu (5) = . Suppose 8 1 .4. • • i ! I • . Initialize with Bold = Bo· Tbe first (E) step of the algorithm is to compute J(B I Bold) for as many values of Bas needed. Here is the algorithm.6) where A. .Ai)I'Z.4. Then we set Bnew = arg max J(B I Bold). I. ~i tells us whether to sample from N(Jil.\"'u.(f~).(sI'z)where() = (.. It is not obvious that this falls under our scheme but let (2.4) = L()(Si I Ai) = N(Ail'l + (1. Suppose that given ~ = (~11"" ~n).. the Si are independent with L()(Si I . we can think of S as S(X) where X is given by (2.p (~).8).4.(sl'd +.6). The EM Algorithm. i.4..ll. . reset Bold = Bnew and repeatthe process.p() [A.4.i = 1.B) J(B I Bo) ~ E.\)"'u. A.11).. Thus.)<T~).8) by o taking logs in (2. Again.\ < 1.B) IS(X)=s) q(s.\ = 1. p(s. The log likelihood similarly can have a number of local maxima and can tend to 00 as e tends to the boundary of the parameter space (Problem 2. As we shall see in important situations. EM is not particularly appropriate. are independent identically distributed with p() [A.I.4. J.<T. (P(X. Bo) _ I S(X) = s ) (2.7) ii. If this is difficul~ the EM algorithm is probably not suitable. . .<T.). .0'2 > 0. I .4. O't.8) . j.(I'. differentiating and exchanging E oo and differentiation with respect i. + (1  A. .5. . I:' . Although MLEs do not exist in these models. that under fJ.4.B) 0=00 = E. I' Hi where we suppress dependence on s. S has the marginal distribution given previously.9) I for all B (under suitable regularity conditions). B) = (1. if this step is difficult.. The EM algorithm can lead to such a local maximum. It is easy to see (Problem 2. The rationale behind the algorithm lies in the following formulas. = 11 = . we have given.4. The second (M) step is to maximize J(B I Bold) as a function of B.134 Methods of Estimation Chapter 2 Example 2. Note that (2. a local maximum close to the true 8 0 turns out to be a good "proxy" for the 0 nonexistent MLE.af) or N(J12. Mixture of Gaussians.
Lemma 2. On the other hand. 0 0 = 0old' = Onew. 0old)' Equality holds in (2. hence.11)  D IOgq(8. J(Onew I 0old) > J(Oold I 0old) = 0 by definition of Onew. I S(X) ~ s > 0 } (2.Section 2.o log .3.1. formally.0 ) I S(X) ~ 8 . DO The main reason the algorithm behaves well follows.4.4.4.15) q(s. then .E.4..4 Algorithmic Issues 135 to 0 at 00 . Let SeX) be any statistic. O n e w ) } log ( " ) ~ J(Onew I 0old) .0) (2. uold 0 r s. q S. uold Now.13) q(8.4.O)r(x I 8.4. 0 ) +E.Onew) {r(X I 8 .0) = O.4. Suppose {Po : () E e} is a canonical exponential family generated by (T. Fat x EX.(X 18. Id log (X I . Onew) > q(s.Onew) E'Old { log r(X I 8. 0) ~ q(s.14) whete r(· j ·. 00ld are as defined earlier and S(X) (2." ) I S(X) ~ 8 .3. Lemma 2. the result holds whenever the quantities in J(O I (0 ) can be defined in a reasonable fashion. (2. r(Xj8. (2. Because. DJ(O I 00 ) [)O and.1.2.17) o The most important and revealing special case of this lemma follows.0) is the conditional frequency function of X given S(X) J(O I 00 ) If 00 = s.0) } ~ log q(s.4.4.12) = s.10) [)J(O I 00 ) DO it follows that a fixed point 0 of the algorithm satisfies the likelihood equation.4. 0) I S(X) = 8 ) DO (2. We give the proof in the discrete case.00Id) by Shannon's ineqnality.16) ° q(s. Then (2. Theorem 2. (2. In the discrete case we appeal to the product rule.1. ~ E '0 (~I ogp(X .h) satisfying the conditions a/Theorem 2.13) iff the conditional distribution of X given S(X) forOnew asfor 00ld and Onew maximizes J(O I 0old)' =8 is the same Proof.lfOnew. However.O) { " ( X I 8.4. S (x) ~ 8 p(x.
4. which is necessarily a local maximum J(B I Bo) I I = Eoo{(B .4.4.A('1)}h(x) (2. (b) lfthe sequence of iterates {Bm} so obtained is bounded and the equation A(B) o/q(s.(A(O) . (2. X is distributed according to the exponential family I where p(x.4.(A(O) .Bof Eoo(T(X) I S(X) = y) .25) I" .4. In this case.BO)TT(X) . A'(1)) = 2nB (2.16. 1 <j < 3.4. 0 Example 2.= +Eo ( t (2<" + <i') I <it + <i'..(x). I • Under the assumption that the process that causes lumping is independent of the values of the €ij.22) • ~ 0) .4.4. Proof. • I E e (2Ntn + N'n I S) = 2Nt= + N.+l (2. i=m.23) I . A(1)) = 2nlog(1 + e") and N jn = L:~ t <ij(Xi).(1 . Now.21) Part (a) follows.B) 1) = log (1 = exp(1)(2Ntn (x) + N'n(x)) . A proof due to Wu (I 983) is sketched in Problem 2. = Ee(T(X) I S(X) = s) ~ (2. I . we see.4 (continued). i i. 1 I • • Po I<ij = 1 I <it + <i' = OJ =  0.4.B). then it converges to a limit 0·.24) " ! I . that. Thus.19) = Bnew · If a solution of(2.0)' 1. = 11 <it +<" = IJ.4.136 (a) The EM algorithm consists of the alternation Methods of Estimation Chapter 2 A(Bnew ) = Eoold(T(X) I S(X) Bold ~ = s) (2.A(Bo)) (2. Part (b) is more difficult.20) has a unique solution. after some simplification.18) (2. m + 1 < i < n) .Pol<i. h(x) = 2 N i .B) 1 . .I<j<2 I r r: I I r! Pol<it 11 <it + <i' = 11 0' B' B' + 20(1 . 1'I.4.A(Bo)) I S(X) = s} (B .18) exists it is necessarily unique. .
(y. the conditional expected values equal their observed values. which is indeed the MLE when S is observed. (Zn. I Z. ~ T l (801 d). For the Mstep. and find (Problem 2. unew  _ 2N1m + N 2m n + 2 .4.12) that if2Nl = Bm.= > N 3 =)) 0 and M n > 0.4.4 Algoritnmic Issues 137 where Mn ~ Thus. a~ + ~. To compute Eo (T I S = s).4. 112.. to conclude ar. Suppose that some of the Zi and some of the Yi are missing as follows: For 1 < i < nl we observe both Zi and Yi.4). . [T5 (8ald) 2 = T2 (8old)' iii. T3 = The observed data are n l L zl. I). T 2 = Y.): 1 <i< nt} U {Z.). where B = (111. then B' .4. This completes the Estep. we observe only Yi .l L ZiYi...26) It may be shown directly (Problem 2.4.1.new T4 (Bo1d ) . compute (Problem 2.T = (1"1. 2 Mn . we note that for the cases with Zi andlor Yi observed.new ii~.) E. i=I i=l n n n i=l s = {(Z" Y.T2 ]}' .d. where (Z.27) hew i'tT2Ji{[T3 (8ol d) . T 5 = n. . Y) ~ N (I"" 1"" 1 O'~. Y).l LY/.Tl IlT4 (8ol d) .Bold n ~ (2. at Example 2.(ZiY.) E. + 1 <i< n}. for nl + 1 < i < n2.. Jl2. we oberve only Zi. new = T 3 (8ol d) . converges to the unique root of ~ + N.(2N3= + N 2=)B + 2 (N1= + (I n n =0 0 in (0.new ~ + I'i.4.T{ (2. r) (Problem 2. For other cases we use the properties of the bivariate nonna] distribution (Appendix BA and Section 1.1). B). a~.Yn) be i.1) that the Mstep produces I1l. /L2. b. and for n2 + 1 < i < n. I'l)/al [1'2 + P'Y2(Z.p2)a~ [1'2 + P'Y2(Z. ii2 . iii.: nl + 1 <i< n.4. p). Let (Zl'yl).2 I Z.8) of B based on the observed data. In this case a set of sufficient statistics is T 1 = Z.i. as (Z. ai ~ We take Bo~ = B MOM ' where BMOM is the method of moment estimates (11" 112.Section 2. T4 = n.(Y.: n. al a2P + 1"1/L2).1) A( B) = E. I Zi) 1"2 + P'Y2(Z.T 2 .} U {Y. l"l)/al]Zi with the corresponding Z on Y regression equations when conditioning on Yi (Problem 2.. the EM iteration is L i=tn+l n ('iI + <. 1"1)/ad 2 + (1 .6.
0'2) based on the first two moments. j : > I) that one system 3. E"IJ(O I 00)] is the KullbackLeiblerdivergence (2.23).5 PROBLEMS AND COMPLEMENTS Problems fnr Sectinn 2. (c) Suppose X takes the values 1. what is a frequency substitution estimate of the odds ratio B/(l. &(>. show that T 3 is a method of moment estimate of (). (3( 0'1.4. .d. Summary. Now the process is repeated with B MOM replaced by ~ ~ ~ ~ 0new... Consider n systems with failure times X I . Remark 2. based on the first moment.i' I' I i. Consider a population made up of three different types of individuals occurring in the HardyWeinberg proportions 02.0) and (I .B)? .0..P3 given by the HardyWeinberg proportions. where 0 < B < 1. Find the method of moments estimates of a = (0'1. X n assumed to be independent and identically distributed with exponential.2. based on the second moment. (a) Show that T 3 = N1/n + N 2 /2n is a frequency substitution estimate of e. 0 Because the Estep. based On the first two moments.. 0) is minimized.4. use this algorithm as a building block for the general coordinate ascent algorithm. which as a function of () is maximized where the contrast logp(X. Also note that. B)/p(X.4 we derive and discuss the important EM algorithm and its basic properties.2. .4... X n have a beta. the EM algorithm is often called multiple imputation. distributions. Finally in Section 2. (b) Find the method of moments estimate of>. Suppose that Li.. . in the context of Example 2. By considering the first moment of X.1.6. are discussed and introduced in Section 2. . in general. Important variants of and alternatives to this algorithm. respectively.3 and the problems.B o)]. (c) Combine your answers to (a) and (b) to get a method of moment estimate of >. then J(B i Bo) is log[p(X. . 2B(1 . including the NewtonRaphson method.). 2. We then..138 Methods of Estimation Chapter 2 where T j (B) denotes T j with missing values replaced by the values computed in the Estep and T j = Tj(B o1d )' j = 1.2.B)2. ! I 2.P2. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all oneparameter canonical exponential families with E open (when it exists). ~' ". (d) Find the method of moments estimate of the probability P(X1 will last at least a month.1 with respective probabilities PI.4. which yields with certainty the MLEs in kparameter canonical exponential families with E open when it exists. X I.1 1.. j.0'2) distribution.4. Note that if S(X) = X. . involves imputing missing values. in Section 2. (a) Find the method of moments estimate of >. (b) Using the estimate of (a).
. X n . < X(n) be the order statistics of a sample Xl. (c) Argue that in this case all frequency substitution estimates of q(8) must agree with q(X). of Xi n ~ =X. X 1l be the indicators of n Bernoulli trials with probability of success 8.Xn be a sample from a population with distribution function F and frequency function or density p.. . . given the order statistics we may construct F and given P. 8. (b) Show that in the continuous case X ~ rv F means that X (c) Show that the empirical substitution estimate of the jth moment JLj is the jth sample moment JLj' Hinr: Write mj ~ f== xjdF(x) ormj = Ep(Xj) where XF. (Zn. The empirical distribution function F is defined by F(x) = [No.. If q(8) can be written in the fonn q(8) ~ s(F) for sOme function s of F we define the empirical substitution principle estimate of q( 8) to be s( F). . (a) Show that X is a method of moments estimate of 8. . (Z2.. Let Xl.8)ln first using only the first moment and then using only the second moment of the population.. . Let X(l) < ... ... < ~ ~ Nk+1 = n(l  F(tk)). Y2 ). Hint: Consi~r (N I . The jth cumulant '0' of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant Cj' Give the first three sample cumulants..) There is a Onetoone correspondence between the empirical distribution function ~ F and the order statistics in the sense that.t ) = Number of vectors (Zi. .8. we know the order statistics.. ~ ~ ~ (a) Show that in the finite discrete case. See A.5. Yi) such that Zi < sand Yi < t n .... 4. Nk+I) where N I ~ nF(t l ). (b) Exhibit method of moments estimates for VaroX = 8(1 . which we define by ~ F ~( s. ..2. Let X I. Show that these estimates coincide. = Xi with probability lin. (See Problem B. N 2 = n(F(t2J . . . Give the details of this correspondence.findthejointfrequencyfunctionofF(tl). . of Xi < xl/n.Section 2. ~ Hint: Express F in terms of p and F in terms of P ~() X = No. empirical substitution estimates coincides with frequency substitution estimates.5 Problems and Complements 139 Hint: See Problem 8. . 6.2.F(t.. ~  7. 5... (d) FortI tk. The natural estimate of F(s. . t). < . Yn ) be a set of independent and identically distributed random vectors with common distribution function F. t) is the bivariate empirical distribution function FCs.F(tk). Let (ZI. .12. YI )..)).
. Show that the least squares estimate exists.4. L I."ZkYk . . (c) Use the empirical substitution principle to COnstruct an estimate of cr using the relation E(IX.. 11. In Example 2.1. 9. Zn and Y11 .. Let X".IU9) that 1 < T < L . as the corresponding characteristics of the distribution F. z) is continuous in {3 and that Ig({3. : J • .2. respectively.L. {3) is continuous on K. There exists a compact set K such that for f3 in the complement of K. are independent N"(O. (a) Find an estimate of a 2 based on the second mOment. ..Y) = Z)(Yk .". Show that the sample product moment of order (i. .2). .. The All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates.17. k=l n  n . 1". (b) Define the sample product moment of order (i.. . Y are the sample means of the sample correlation coefficient is given by Z11 •. find the method of moments estimate based on i 12. I . ~ i . Since p(X. . the sampIe correlation.• . Hint: See Problem B. .Xn ) where the X.j) is given by ~ The sample covariance is given by n l n L (Zk k=l .. >. p(X. 10. z) I tends to 00 as \.2 with X iiI and /is. . Vk and that q(9) can be written as ec q«(J) = h(!'I«(J)"" . " . and so on. as X ~ P(J...' where Z. Suppose X = (X"".) is the distribution function of a probability P on R2 assigning mass lin to each point (Zi. j). (b) Construct an estimate of a using the estimate of part (a) and the equation a . 0). the result follows.Yn . . (See Problem 2.).) Note that it follows from (A. . Vi). .j2. Hint: Set c = p(X. X n be LLd. the sample covariance. (J E R d. I) = ". In Example 2. Suppose X has possible values VI.1. suppose that g({3. . = #. {3) > c.140 Methods of Estimation Chapter 2 (a) Show that F(..ZY. . with (J identifiable. f(a.81 tends to 00..!'r«(J)) I .
to give a method of moments estimate of (72. . with (J E R d and (J identifiable.:::m:::. rep. When the data are not i.1) (iii) Raleigh.ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill . as X rv P(J..(X) ~ Tj(X). Vk and that q(O) for some Rkvalued function h. In the fOllowing cases. 1'(1. IG(p. find the method of moments estimates (i) Beta.  1/' Po) / . .. General method of moment estimates(!). A).1.O) ~ (x/O')exp(x'/20').d. 1'(0.0 > 0 14. X n are i. .i. .36.10). = h(PI(O). . Let g.d.0) (ii) Beta. .p:::'.6.i..Pr(O)) . Suppose that X has possible values VI. can you give a method of moments estimate of {3? .• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments.5. 1 < j < k. (a) Use E(X. in estimate. . ~ (X....l Lgj(Xi ).. j i=l = 1.9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X)). Show that the method of moments estimate q = h(j1!. 13. where U.) to give a method of moments estimate of p. See Problem 1. A). (c) If J.. (b) Suppose P ~ po and I' = b are fixed.l and (72 are fixed.1.. . Let 911 . 0). .x (iv) Gamma.6. Use E(U[). p fixed (v) Inverse Gaussian. .:::nt:::' ~__ 141 for some R k valued function h. Suppose X 1. Hint: Use Corollary 1. fir) can be written as a frequency plugin estimate. > 0. Iii = n..Section 2. r...5 Problems and Com". 0 ~ (p..6. (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1.p(x.. Consider the Gaussian AR(I) model of Example 1. .
. = Y . )..ZY ~ . FF. and (J3.. S = 1.. Show that <j < i ! ..)] = [Y..1 " Xir Xk J. _ (Z)' ' a. For a vector X = (Xl.. HardyWeinberg with six genotypes.. II. Multivariate method a/moments...•• Om) can be expressed as a function of the moments. 17. Y) and (ai.. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S... the method of moments l by replacing mjkrs by mjkrs..S 1 . P3 + "2PS + '2P6 PI are frequency plugin estimates of OJ.. Establish (2..z.. Q.q... Let 8" 8" and 83 denote the probabilities of S. + '2P4 + 2PS .z.... i = 1...~b. and F at one locus resulting in six genotypes labeled SS... and F.Y.• l Yn are taken at times t 1.."" X iq ). .. 1 q.. we define the empirical or sample moment to be j ~ .g(I3. I. Let X = (Z.. 16...)] + [g(l3o. I I . n. ()2. .. = n 1 L:Z. . The reading Yi ._ .. I • .. The HardyWeinberg model specifies that the six genotypes have probabilities L7=1 I Genotype Genotype Probability SS 2 II 8~ 3 FF 8j 4 SI 28. >..)].1".  t=1 liB = (0 1 . For independent identically distributed Xi = (XiI.J is' _ n n. ... b. . 5 SF 28.4... .) ..8.. n 1  Problems for Sectinn 2... bI) are as in Theorem 1. 1 6 and let Pl ~ N j / n. of observations.. An Object of unit mass is placed in a force field of unknown constant intensity 8.. k > 0.3. Y) and B = (ai.1.. SI. k> Q mjkrs L.bIZ. Show that method of moments estimators of the parameters b1 and al in the best linear predictor are estimate ~ 8 of B is obtained L:Z. I. SF. let the moments be mjkrs = E(XtX~). 11P2 + "2P4 + '2P6 .. Hint: [y.z.g(l3o..6). 83 6 IF 28. . I . I ... and IF. where (Z. Readings Y1 ..! .... . t n on the position of the object...z. respectively. . 1".142 Methods of Estimation Chapter 2 15..83 8t Let N j be the number of plants of genotype j in a sample of n independent plants..  1. where OJ = 1. .2 1.. ' l X q ). . j > 0.. 1 I . .. 1 .g(I3.. .
" 7 5..)..2. . Find the LSE of 8.2. Find the least squares estimates of ()l and B . nl + n2. where €nlln2 are independent N(O.. What is the least squares estimate based on YI . < O.4.. 2 10. x > 0.3. Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants. (a) f(x. . .2.n. Show that the fonnulae of Example 2.6) under the restrictions BJ > 0. B) = Be'x. g(zn.B.5) provided that 9 is differentiable with respect to (3" 1 < i < d.. (zn. Find the MLE of 8. Yn). .1. A new observation Ynl." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi. . (Pareto density) .. B > O. B) = Bc'x('+JJ.. i + 1. nl and Yi = 82 + ICi. to have mean 2.2 may be derived from Theorem 1. Hint: The quantity to be minimized is I:7J (y. . Find the least squares estimate of 0:. Hint: Write the lines in the fonn (z.. The regression line minimizes the sum of the squared vertical distances from the points (Zl.. . . Show that the least squares estimate is always defined and satisfies the equations (2.Yi).t of the best (MSPE) predictor of Yn + I ? 4. (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1. . .5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l .)' 1+ B5 6. Yl). .. all lie on aline. Suppose Yi ICl"..(zn.<) ~ . (exponential density) (b) f(x. the range {g(zr. (J2) variables..Yn). . Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi. X n denote a sample from a population with one of the following densities Or frequency functions. .I is to be taken at time Zn+l. . . . B. 13). i = 1. _. and 13 ranges over R d 8.BJ .y.. x > c...We suppose the oand be uncorrelated with constant variance.. Y. ~p ~(yj}) ~ .z.. Find the line that minimizes the sum of the squared perpendicular distance to the same points.. B > O. 9. Yn have been taken at times Zl. i = 1. . .4. 7.4)(2.13 E R d } is closed. 13). Zn and that the linear regression model holds. . Let X I.. c cOnstant> 0. in fact.. if we consider the distribution assigning mass l/n to each of the points (zJ. (a) Let Y I . . ..Section 2. Find the least squares estimates for the model Yi = 8 1 (2. Suppose that observations YI . . 3. .
 e .5 show that no maximum likelihood estimate of e = (1". 0 > O. Show that depends on X through T(X) only provided that 0 is unique.16(b). density) (e) fix.t and a 2 . Hint: Use the factorization theorem (Theorem 1. they are equivariant 16. x > 1".. Show that any T such that X(n) < T < X(1) + is a maximum likelihood estimate of O. ry) denote the density or frequency function of X in terms of T} (Le. x > 0.. WEn .. q(9) is an MLE of w = q(O). 0) = VBXv'O'. (Pareto density) (d) fix. . (We write U[a. Suppose that T(X) is sufficient for 0 and ~at O(X) is an MLE of O. then i  WMLE = arg sup sup{Lx(O) : 0 E e(w)}.. If n exists. I).t and a 2 are both known to be nonnegative but otherwise unspecified..l exp{ OX C } . Show that the MLE of ry is h(O) (i. Let X I. is a sample from a N(/l' ( 2 ) distribution.0"2).:r > 0. (Wei bull density) 11. the MLE of w is by definition ~ W.) ! ! !' ! 14.5. ncR'. 0) = Ocx c . . Because q is onto n. (72 Ii (a) Show that if J1 and 0. 0 <x < I. X n be a sample from a U[0 0 + I distribution. and o belongs to only one member of this partition.e. (beta. 0 > O.1). c constant> 0.: I I \ . e c Let q be a map from e onto n.X)" > 0. 0 E e and let 0 denote the MLE of O.a)l rather than 0.l L:~ 1 (Xi .I")/O") .. (a) Let X .r(c+<). . 12. be independently and identically distributed with density f(x. . .II ". I i' 13.0"2) 15. Let X" . I < k < p. Thus.. > O. Hint: Let e(w) = {O E e : q(O) = w}.  under onetoone transformations). 0) = (x/0 2) exp{ _x 2/20 2 }.144 Methods of Estimation Chapter 2 (c) f(". J1 E R. . X n .2.P > I. (b) Find the maximum likelihood estimate of PelX l > tl for t > 1". Define ry = h(8) and let f(x.0"2 (a) Find maximum likelihood estimates of J. Suppose that Xl. 0 > O.O) where 0 ~ (1". then {e(w) : wEn} is a partition of e. bl to make pia) ~ p(b) = (b . < I" < 00. Pi od maximum likelihood estimates of J1 and a 2 . . n > 2. iJ( VB. 0> O. 00 = I 0" exp{ (x . ~ I in Example 2. MLEs are unaffected by reparametrization.. = X and 8 2 ~ n. x> 0. (Rayleigh density) (f) f(x. for each wEn there is 0 E El such that w ~ q( 0). Hint: You may use Problem 2. reparametrize the model using 1]).2 are unknown. say e(w). c constant> 0. be a family of models for X E X C Rd. I • :: I Suppose that h is a onetoone function from e onto h(e). then the unique MLEs are (b) Suppose J. X n • n > 2. Show that if 0 is a MLE of 0. . 0) = cO".1. I I I (b) LetP = {PO: 0 E e}. .Pe.
Bremner.. (KieferWolfowitz) Suppose (Xl." Y: .. 21. and so on.) Let M = number of indices i such that Y i = r + 1.0") E = {(I'. X n ) is a sample from a population with density f(x.n . Thus. Censored Geometric Waiting Times. but e ..1 (I_B)... Derive maximum likelihood estimates in the following models..Xn be independently distributed with Xi having a N( Oi. •• .  .. where 0 < 0 < 1. Show that maximum likelihood estimates do not exist. 0 < a 2 < oo}. . . and have commOn frequency function. < 00.[X ~ k] ~ Bk . If time is measured in discrete periods. Y n IS _ B(Y) = L.X I = B(I. We want to estimate O.. and Brunk (1972). (b) The observations are Xl = the number of failures before the first success. in a sequence of binomial trials with probability of success fJ.P. 19. . Show that the maximum likelihood estimate of 0 based on Y1 .B) = 1.B) =Bk . Suppose that we only record the time of failure.2.O") : 00 < fl. we observe YI . identically distributed. M 18.r f(r + I. 1 <i < n. ..1 (1 .. .Section 2. 17. .[X < r] ~ 1 LB k=l k  1 (1_ B) = B'. Yn which are independent. f(k.B). .5 Problems and Complements 145 Now show that WMLE ~ W ~ q(O). 1) distribution.B). 20... A general solution of this and related problems may be found in the book by Barlow. X 2 = the number of failures between the first and second successes. k=I.. Let Xl. (We denote by "r + 1" survival for at least (r + 1) periods.~1 ' Li~1 Y. In the "life testing" problem 1. . if failure occurs on or before time r and otherwise just note that the item has lived at least (r + 1) periods.6. k ~ 1. 16(i). (b) Solve the problem of part (a) for n = 2 when it is known that B1 < B. a model that is often used for the time X to failure of an item is P. Bartholomew. B) = 100' cp 9 (x I') + 0' 10 cp(x 1') 1 where cp is the standard normal density and B = (1'. (a) The observations are indicators of Bernoulli trials with probability of success 8. find the MLE of B. . .. (a) Find maximum likelihood estimates of the fJ i under the assumption that these quantities vary freely. We want to estimate B and Var.
.6. . where [t] is the largest integer that is < t.8 0. Let Xl.2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3.0 5. Ii equals one of the numbers 22.. J2 J2 o o o I 3.2.1.' wherej E J andJisasubsetof {UI .jp): 0 <j.z!. Xl.0 2. Show that the MLE of = (fl.6).8 14.146 Methods of Estimation Chapter 2 that snp.up(X.Xm and YI . (1') populations. .4 o o o o o o o o o o o o o Life 20. = fl. and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise.a 2 ) = Sllp~. n). x) as a function of b.8 2. N.2. 1 <k< p}.1 2.4)(2.2 3.2 4. where'i satisfy (2.5 86.ji.5 I' I I . respectively. . 7 .x n .8 3.0 0. x)/ L(b.. . TABLE 2.J. Assume that Xi =I=.Y)' i=1 j=1 /(rn + n).5 66.'.< J. . distribution. . Polynomial Regression.tl.. and only if. and assume that zt'·· .a 2 ) and NU"..I' fl. Hint: Coosider the mtio L(b + 1. 1i(b. . II I I :' 24.5 0.9 3. Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer.. 1 1 1 o o .X)' + L(Y.t.0 I . . i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution).0 11.. 1985). Y.(Zi) + 'i. (1') is If = (X. Set zl ~ . (1') where e n ii' = L(Xi ..2 4. the following data were obtained (from S. 23. Suppose Y. Suppose X has a hypergeometric.p(x.0"2) if. Weisberg. Tool life data Peed 1 Speed 1 1 Life 54. " Yn be two independent samples from N(J.Xj for i =F j and that n > 2.
Show that (3. Let W.Y') have density v(z..3. Both of these models are approximations to the true mechanism generating the data. Consider the model Y = Z Df3 + € where € has covariance matrix (12 W.6) with g({3. + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +. . Derive the weighted least squares nonnal equations (2.900)/300.2. (b) Let P be the empirical probability . y). ~. .2.(P) = Cov(Z'. Use these estimated coefficients to compute the values of the contrast function (2. Y) have joint probability P with joint density ! (z.1). let v(z. 28. Show that the follow ZDf3 is identifiable.8 and let v(z.1 (see (B. .2. However.4)(2.4)(2. E{v(Z.2.19). Show that p. Consider the model (2. z) ing are equivalent. weighted least squares estimates are plugin estimates.2.1. this has to be balanced against greater variability in the estimated coefficients.0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models. Y)Z') and E(v(Z.5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life. of Example 2. Being larger.y)dzdy. Y)Y') are finite. (c) Zj. Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix.(P)E(Z·). in Problem 2.6) with g({3. This will be discussed in Volume II...(P) and p. y) > 0 be a weight funciton such that E(v(Z.2.Section 2.6. Y)[Y .y)/c where c = I Iv(z. Set Y = Wly.Y = Z D{3+€ satisfy the linear regression mooel (2. the second model provides a better approximation. ZI ~ (feed rate .6)). ZD = Wz ZD and € = WZ€. z) ~ Z D{3.2. 25. (2.1. Two models are contemplated (a) Y ~ 130 + pIZI + p. = (cutting speed .Z)]'). .2. (a) Show tha!.y)!(z.l be a square root matrix of W. That is. l/Var(Y I Z = z).(P) coincide with PI and 13. .5) for (a) and (b).z. (12 unknown.1). (a) Let (Z'.y)!(z. Y')/Var Z' and pI(P) = E(Y*) 11.ZD is of rank d.13)/6.(P)Z of Y is defined as the minimizer of t = zT {3.(b l + b.. Let (Z. 26. (2. z.  27..y) defined . The best linear weighted mean squared prediction error predictor PI (P) + p. (a) The parameterization f3 (b) ZD is of rank d.
097 1.058 3.071 0.1. i = 1.666 3....17.229 4.. = nq = 0. Use a weighted least squares c~mputer routine to compute the weighted least squares estimate Ii of /1.6)T(y .392 4..155 4. Hint: Suppose without loss of generality that ni 0. where 1'.457 2. suppose some of the nj are zero.ZD. ~ ~ ~Show that the MLE of OJ is 0 with OJ = nj In.908 1.918 2.6) is given by (2.091 3. Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay. .075 4. Then = n2 = ..274 5.. n.391 0.'.. .20).131 5.8.676 5..156 5. \ (e) Find the weighted least squares estimate of p.100 3.. .5.455 9.. which vanishes if 8 j = 0 for any j = q + 1.300 6.064 5. Chon) shonld be read row by row.ZD.716 7.038 3. . in this model the optimal " MSPE predictor of the future Yi+ 1 given the past YI ..870 30. n. .480 5. (a) Show that E(1'.114 1..+l I Y .093 5.(Y . .723 1.393 3. j = 1.019 6.) ~ ~ (I' + 1'. Consider the model Yi = 11 + ei.453 1. (See Problem 2. That is.€n+l are i.834 3. ' I (f) The following data give the elapsed times YI . <J > 0. k.703 4.916 6.301 2. 0) = II j=q+l k 0..a. .046 2. k.511 3.j}.379 2.. 31. then the  f3 that minimizes ZD.. .) (c) Find a matrix A such that enxl = Anx(n+l)ECn+l)xl' (d) Find the covariance matrix W of e.2. with mean zero and variance a 2 • The ei are called moving average errors.054 1.968 2.856 2. 2.196 2.249 1.~=l Z. . where El" . 1'.jf3j for given covariate values {z..093 1. Is ji. Show that the MLE of .360 1.081 2. Yn are independent with Yi unifonnly distributed on [/1i .788 2. Suppose YI .564 1. .476 3. 4 (b) Show that Y is a multivariate method of moments estimate of p. Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore.i.182 0. .582 2.nk > 0.6) T 1 (Y .689 4. I Yi is (/1 + Vi).039 9.1. = L.148 ~ Methods of Estimation Chapter 2 (b) Show that if Z  D has rank d...669 7.958 10. . _'..611 4.053 4.020 8.2. . The data (courtesy S.397 4.860 5. nq+1 > p(x.921 2.).599 0. Let ei = (€i + {HI )/2. /1i + a}.ZD.971 0.6) W. = (Y  29.665 2. J. . i = 1.858 3.968 9. In the multinomial Example 2. different from Y? I I TABLE 2. .d.
Show that the sample median ii is the minimizer of L~ 1 IYi . 1). . where Jii = L:J=l Zij{3j... + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x .I'. .I/a}. be i. i < j.28Id(F. Hint: See Example 1. 35.. (a) Show Ihat if Ill[ < 1.. Hint: Use Problem 1... (a) Show that the MLE of ((31. A > 0. the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l).3.i.6./Jr ~ ~ and iii. and x. .5 Problems and Complements 149 (f3I' . . .. (3p.J. . Find the MLE of A and give its mean and variance. x E R.:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x. a) is obtained by finding (31.tLil and then setting a = max t IYi .  Illi < 1... .{3p.. .l1. Yn are independent with Vi having the Laplace density 1 2"exp {[Y.) Suppose 1'.Jitl..Ili I and then setting (j = n~I L~ I IYi .fin are called (b) If n is odd.i.. ./3p that minimizes the maximum absolute value conrrast function maxi IYi . (See (2. . Let B = arg max Lx (8) be "the" MLE.. F)(x) *F denotes convolution.Section 2. . 8 E R. = L Ix.4.. An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass .1. y)T where Y = Z + v"XW. Show that XHL is a plugin estimate of BHL. Y( n) denotes YI .). be the observations and set II = ~ (Xl .d. let Xl and X. If n is even. . Let g(x) ~ 1/".2.. with density g(x . the sample median yis defined as ~ [Y(. a)T is obtained by finding 131.>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}. .) + Y(r+l)] where r = ~n. be i. then the MLE exists and is unique. Z and W are independent N(O..7 with Y having the empirical distribution F.d. ...{:i.. Give the MLE when . Suppose YI .O) Hint: See Problem 2. The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj). .32(b).x.8). . Yn ordered from smallest to largest. . These least absolute deviation estimates (LADEs). ~ 33. = I' for each i.17).. where Iii = ~ L:j:=1 ZtjPj· 32.& at each Xi. " . be the Cauchy density. as (Z.(1 + x'j. Let X. .d. Let x. ~ f3I. 34. XHL i& at each point x'..(3p that minimizes the least absolute deviation contrast function L~ I IYi .
+~l Vi and 8 2 = E. h I (ry») denote the density or frequency function of X for the ry parametrization. . 51..B) g(x + t> . . where x E Rand (} E R. Also assume that S.B I ) (K'(ryo. Let (XI. symmetric about 0.h(y . B) denote their joint density. Let Vj and W j denote the number of hits of type 1 and 2 on day j. 3. that 8 1 E.. . On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). b). Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x . BI ) (p' (x. 38. Ii I. Let 9 be a probability density on R satisfying the following three conditions: I. 1 n + m. Sl. then h"(y) > 0 for some nonzero y. Problem 35 can be generalized as follows (Dhannadhikari and JoagDev. ~ (XI + x.Xn are i.j'.B )'/ (7'. then B is not unique. Let K(Bo. (7') and let p(x.d. Assume :1. . Define ry = h(B) and let p' (x. Show that the entropy of p(x.i. . Let Xi denote the number of hits at a certain Web site on day i.. and positive everywhere.. B) is and that the KullbackLiebler divergence between p(x.B).0). 37. 'I i 39. o Ii !n 'I . If we write h = logg. P(nA). 9 is continuous. . Suppose X I. Find the MLEs of Al and A2 hased on 5.B)g(x. then the MLE is not unique. The likelihood function is given by Lx(B) g(xI .150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. Show that (a) The likelihood is symmetric about ~ x. X. Bo) is tn( B. . (a. .. B) = g(xB). . and 8 2 are independent.. > h(y) .. Let X ~ P" B E 8.x2)/2. 1985).. 2. and 5. ryl Show that o n ».) be a random samplefrom the distribution with density f(x.\2 = A.+~l Wj have P(rnAl) and P(mA2) distributions.f)) in the likelihood equation. ~ Let B ~ arg max Lx (B) be "the" MLE. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . such that for every y E (a. (b) Either () = x or () is not unique. .ryt» denote the KullbackLeibler divergence between p( x. n.b). B) and pix. distribution. B ) and p( X.B)g(i . i = 1. Let Xl and X2 be the observed values of Xl and X z and write j.. where Al +.)/2 and t> = (XI . = I ! I!: Ii. a < b.h(y) . . ry) = p(x. ~ (d) Use (c) to show that if t> E (a.t> . ryo) and p' (x. Assume that 5 = L:~ I Xi has a Poisson. j = n + I. N(B.. .. 9 is twice continuously differentiable everywhere except perhaps at O. 36. Suppose h is a II function from 8 Onto = h(8).
5. .. and 1'.) Show that statistics. A) = g(z. [3. exp{x/O.Section 2..2. [3 > 0.}. a get.7) for a.)/o]}" (b) Suppose we have observations (t I . Give the least squares where El" •.O) ~ a {l+exp[[3(tJ1. x > 0. o + 0. give the lea. 1).I[X. An example of a neural net model is Vi = L j=l p h(Zij.. Aj) + Ei. 1959. j 1 1 cxp{ x/Oj}.5 Problems and Complements 151 40. a > 0.)/o)} (72.olog{1 + exp[[3(t. Yn).. on a population of a large number of organisms. ..c n are uncorrelated with mean zero and variance (72. 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0. 1 En are uncorrelated with mean 0 and variance estimating equations (2. a.p. 0 > O. . Variation in the population is modeled on the log scale by using the model logY. (tn.I[X. The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards. Let Xl.7) for estimating a. .. i = 1. n > 4. Suppose Xl. . . h(z. (. yd 1 ••• . X n be a sample from the generalized Laplace distribution with density 1 O + 0. + C. 1'. = log a . where OJ > O.1.. [3. = LX. . [3. i = 1. and fj..1. > OJ and T. and g(t. (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism. Seber and Wild. T. and /1. n where A = (a. 1'). I' E R. . .1'.~t square estimating equations (2.~e p = 1.  J1.j ~ x <0 1.. 6.. . n. [3. 42. . < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case. /3. j = 1. . .1. = L X. 41. For the ca. .0). where 0 (a) Show that a solution to this equation is of the form y (a. 0). 1 X n satisfy the autoregressive model of Example 1.
'12 = A and where r denotes the gamma function. 1 Xl <i< n.0 . L x. 0 for Xi < _£I.). . . .. show that the MLE of fi is jj =  2. < Show that the MLE of a.x. a. = OJ.15. C2' y' 1 1 for (a) Show that the density of X = (Xl. = 1] = p(x"a. + 8. Suppose Y I . Let Xl. Consider the HardyWeinberg model with the six genotypes given in Problem 2. < I} and let 03 = 1.. . Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p. Hint: Let C = 1(0). There exists a compact set K c e such that 1(8) < c for all () not in K. find the covariance matrix W of the vector € = (€1. t·. . + C1 > 0).1 > _£l. + 0. _ C2' 2. n i=l n i=l n i=1 n i=l Cl LYi + C. .i.J.3. write Xi = j if the ith plant has genotype j. Under what conditions on (Xl 1 .p).) 1 II(Z'i /1)2 (b) If j3 is known.y.'" £n)T of autoregression errors. Let = {(01.x.fi) = 1log p pry.3.3..)I(c..Xi)Y' < L(C1 + c. . n > 2.l.. S. (One way to do this is to find a matrix A such that enxl = Anxn€nx 1. = L(CI + C. the bound is sharp and is attained only if Yi x.l.. . Yn are independent PlY. (X..) Then find the weighted least square estimate of f.::. fi) = a + fix. .02): 01 > 0. . Hint: I . I < j < 6. .5)... " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4.152 Methods of Estimation Chapter 2 (aJ If /' is known. Is this also the MLE of /1? Problems for Section 2. Xn· Ip (x. Give details of the proof or Corollary 2. In a sample of n independent plants.4) and (2. Yn) is not a sequence of 1's followed by all O's or the reverse.. I .. gamma. = If C2 > 0. • II ~. (b) Show that the likelihood equations are equivalent to (2. X n be i.0.. fi exists iff (Yl .1.. < .3.d. This set K will have a point where the max is attained. 3.. Prove Lenama 2.. > 0.(8 ... . r(A. .l  L: ft)(Xi /.3 1..
Let Y I . ..Xn E RP be i. (3) Show that. .9. w( ±oo) = 00.. • It) w' ( Xi It) .d. convex set {(Ij. e E RP.3.n. See also Problem 1. 9.10 with that the MLE exists and is unique. Hint: If BC has positive volume. . . Zj < . Prove Theorem 2.. [(Ai). the MLE 8 exists but is not unique if n is even.3.0). show 7.Section 2. fe(x) wherec. the MLE 8 exists and is unique. Then {'Ij} has a subsequence that converges to a point '10 E f..3.1 to show that in the multinomial Example 2.fl. .. if n > 2.z=7 11. u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm. But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations. with density.O.40. Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 . .0). (O. 0 < Zl < . . J.lkIl :Ij >0.3..en. and assume forw wll > 0 so that w is strictly convex.. Suppose Y has an exponential.1 <j < ac k1. 1 <j< k1. 8. .' ..A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00..) ~ Ail = exp{a + ilz. .1 (a) ~ ~ c(a) exp{ Ix . X n be Ll. n > 2. 12..l E R. 10. Show that the boundary of a convex C set in Rk has volume 0.. (b) Show that if a = 1. C! > 0.0 < ZI < . distribution where Iti = E(Y.8) . < Show that the MLE of (a. Let XI..0.'~ve a unique solution (. " (a) Show that if Cl: > ~ 1. > 3. and Zi is the income of the person whose duration time is ti. 0:0 = 1.}. < Zn.. _.6.1} 0 OJ = (b) Give an algorithm such that starting at iP = 0. ji(i) + ji. Let Xl.. In the heterogenous regression Example 1. then it must contain a sphere and the center of the sphere is an interior point by (B.. Hint: The kpoints (0.. . .t). . a(i) + a. < Zn Zn.Ld.1).. . MLEs of ryj exist iffallTj > 0.6.n) are the vertices of the :Ij <n}.. (0. . Yn denote the duration times of n independent visits to a Web site. ~fo (X u I.5 Problems and Complements 153 n 6.ill T exists and is unique. the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i . Use Corollary 2.. .3..
(c) Show that for the logistic distribution F_0(x) = [1 + exp{−x}]⁻¹, w is strictly convex; give the likelihood equations for μ. Hint: (a) The function D(a, b) = Σ_i w(aX_i − b) − n log a is strictly convex in (a, b), and D(a, b) → ∞ as (a, b) → (a_0, b_0) if either a_0 = 0 or ∞ or b_0 = ±∞. (b) Reparametrize by a = 1/σ, b = μ/σ and consider varying a, b successively. Note: You may use without proof (see Appendix B.9): (i) If a strictly convex function has a minimum, it is unique. (ii) If ∂²D/∂a² > 0 and (∂²D/∂a²)(∂²D/∂b²) > (∂²D/∂a∂b)², then D is strictly convex.

13. Let (X_1, Y_1), …, (X_n, Y_n) be a sample from a N(μ_1, μ_2, σ_1², σ_2², ρ) population. (a) Show that the MLEs of σ_1², σ_2², and ρ when μ_1 and μ_2 are assumed to be known are σ̃_1² = (1/n) Σ_i (X_i − μ_1)², σ̃_2² = (1/n) Σ_i (Y_i − μ_2)², and ρ̃ = [(1/n) Σ_i (X_i − μ_1)(Y_i − μ_2)]/σ̃_1 σ̃_2, respectively. (b) If n > 5 and μ_1 and μ_2 are unknown, show that the estimates of μ_1, μ_2, σ_1², σ_2², ρ coincide with the method of moments estimates of Problem 2.1.8. Hint: (b) Because (X_1, Y_1) has a density, you may assume that σ̂_1² > 0, σ̂_2² > 0, |ρ̂| < 1. Note that E_θ T = (μ_1, μ_2, σ_1² + μ_1², σ_2² + μ_2², ρσ_1σ_2 + μ_1μ_2)ᵀ.

Problems for Section 2.4

1. EM for bivariate data. (a) In the bivariate normal Example 2.4.6, complete the E-step by finding E(Z_i² | Y_i) and E(Z_i Y_i | Y_i). (b) In Example 2.4.6, verify the M-step by showing that the updates have the stated form.

2. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model Y = Z_D β + ε, where ε_1, …, ε_n are i.i.d. N(0, σ²), σ² > 0, and rank(Z_D) = k. (Check that you are describing the Gauss–Seidel iterative method for solving a system of linear equations. See, for example, Golub and Van Loan, 1985, Chapter 10.)

3. Show that if T is minimal and Ē is open and the MLE doesn't exist, then the coordinate ascent algorithm doesn't converge to a member of Ē.
4. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it. For families in which one member has the disease, it is desired to estimate the proportion θ that has the genetic trait. Suppose that in a family of n members in which one has the disease (and, thus, also the trait), X is the number of members who have the trait. Because it is known that X ≥ 1, the model often used for X is that it has the conditional distribution of a B(n, θ) variable, given X ≥ 1.

(a) Show that

P(X = x | X ≥ 1) = C(n, x) θˣ (1 − θ)^{n−x} / [1 − (1 − θ)ⁿ], x = 1, …, n,

and that the MLE exists and is unique.

(b) Use (2.4.3) to show that the Newton–Raphson algorithm gives a closed-form first approximation θ̂_1 to the maximum likelihood estimate, where θ̂ = θ̂_old is the preliminary estimate.

5. Let (I_i, Y_i), 1 ≤ i ≤ n, be independent and identically distributed according to P_θ, θ = (λ, μ) ∈ (0, 1) × R, where P_θ[I_1 = 1] = λ = 1 − P_θ[I_1 = 0], and, given I_i = j, Y_i is distributed N(μ, σ_j²), j = 0, 1, with σ_0², σ_1² known. (a) Show that {(I_i, Y_i) : 1 ≤ i ≤ n} is distributed according to an exponential family; exhibit its natural sufficient statistic T. (b) Show that the likelihood equations have an explicit solution. (c) Give explicitly the maximum likelihood estimates of μ and λ, when they exist.

6. Suppose the I_i in Problem 5 are not observed. (a) Justify the following crude estimates of μ and λ:

μ̃ = Ȳ, λ̃ = [(1/n) Σ_i (Y_i − Ȳ)² − σ_0²] / (σ_1² − σ_0²).

Do you see any problems with λ̃? (b) Give as explicitly as possible the E- and M-steps of the EM algorithm for this problem. Hint: Use Bayes rule.
(c) If n = 5, x = 2, find θ̂_1 of (b) above using θ̂ = x/n as a preliminary estimate.

7. Let X_1, X_2, X_3 be independent observations from the Cauchy distribution about θ, f(x, θ) = π⁻¹(1 + (x − θ)²)⁻¹. Suppose X_1 = 0, X_2 = 1, X_3 = a. Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and near a. (a) Deduce that, depending on where bisection is started, the sequence of iterates may converge to one or the other of the local maxima. (b) Make a similar study of the Newton–Raphson method in this case.

8. Consider the following algorithm under the conditions of Theorem 2.4.2. Let η̂_new = η̂(λ*), where λ* maximizes the function λ ↦ η̂(λ)ᵀ t_0 − A(η̂(λ)). Show that the sequence defined by this algorithm converges to the MLE if it exists. Hint: Apply the argument of the proof of Theorem 2.4.2, noting that the sequence of iterates {η̂_m} is bounded and, hence, the sequence (η̂_m, η̂_{m+1}) has a convergent subsequence.

9. Let X_1, …, X_n be i.i.d. copies of X = (U, V, W), where P[U = a, V = b, W = c] = p_{abc}, 1 ≤ a ≤ A, 1 ≤ b ≤ B, 1 ≤ c ≤ C, and Σ_{a,b,c} p_{abc} = 1.

(a) Suppose that, for all a, b, c,

(1) log p_{abc} = μ_{ac} + ν_{bc}, where −∞ < μ, ν < ∞.

Show that this holds iff P[U = a, V = b | W = c] = P[U = a | W = c] P[V = b | W = c], i.e., iff U and V are independent given W.

(b) Show that the family of distributions obtained by letting μ, ν vary freely is an exponential family of rank (C − 1) + C(A + B − 2) = C(A + B − 1) − 1, generated by N_{++c}, N_{a+c}, N_{+bc}, where N_{abc} = #{i : X_i = (a, b, c)} and "+" indicates summation over the index.

(c) Show that the MLEs exist iff 0 < N_{a+c}, N_{+bc} < N_{++c} for all a, b, c, and then are given by

p̂_{abc} = (1/n) N_{a+c} N_{+bc} / N_{++c}.

Hint: (c) The model implies p_{abc} = p_{a+c} p_{+bc} / p_{++c}; use the likelihood equations.
10. Suppose X is as in Problem 9, but now

(2) log p_{abc} = μ_{ac} + ν_{bc} + γ_{ab},

where μ, ν, γ vary freely.

(a) Show that this is an exponential family of rank A + B + C − 3 + (A − 1)(B − 1) + (A − 1)(C − 1) + (B − 1)(C − 1) = AB + AC + BC − (A + B + C). Hint: Consider N_{a+c} − N_{++c}/A, N_{+bc} − N_{++c}/B.

(b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model. Initialize:

p̂_{abc}^{(0)} = N_{a++} N_{+b+} N_{++c} / n³,
p̂_{abc}^{(1)} = p̂_{abc}^{(0)} (N_{ab+}/n) / p̂_{ab+}^{(0)},
p̂_{abc}^{(2)} = p̂_{abc}^{(1)} (N_{a+c}/n) / p̂_{a+c}^{(1)},
p̂_{abc}^{(3)} = p̂_{abc}^{(2)} (N_{+bc}/n) / p̂_{+bc}^{(2)}.

Reinitialize with p̂_{abc}^{(3)}. Show that the algorithm converges to the MLE if it exists and diverges otherwise. Hint: Note that because {p̂_{abc}^{(0)}} belongs to the model, so do all subsequent iterates, and that p̂^{(1)} is the MLE for the exponential family

p_{abc} = e^{γ_{ab}} p̂_{abc}^{(0)} / Σ_{a′,b′,c′} e^{γ_{a′b′}} p̂_{a′b′c′}^{(0)}

obtained by fixing the "b, c" and "a, c" parameters.

11. Justify formula (2.4.8). Hint: P_θ[X = x | S(X) = s] is proportional to p(x, θ) 1(S(x) = s).

12. (a) Show that S in Example 2.4.5 has the specified mixture of Gaussian distribution. (b) Give explicitly the E- and M-steps of the EM algorithm in this case.

13. Let f_θ(x) = f_0(x − θ), where f_0(x) = (1/3)φ(x) + (2/3)φ(x − a)
and φ is the N(0, 1) density. Show for n = 1 that bisection may lead to a local maximum of the likelihood, if a is sufficiently large.

14. Establish part (b) of Theorem 2.4.2. Hint: Show that {(θ̂_m, θ̂_{m+1})} has a subsequence converging to (θ*, θ*) and, thus, necessarily θ* is the global maximizer.

15. Establish the last claim in part (2) of the proof of Theorem 2.4.3. Hint: Use the canonical nature of the family and openness of Ē.

16. EM and Regression. For X = {(Z_i, Y_i) : i = 1, …, n}, consider the model

Y_i = β_1 + β_2 Z_i + ε_i,

where ε_1, …, ε_n are i.i.d. N(0, σ²), and Z_1, …, Z_n are i.i.d. N(μ_1, σ_1²) and independent of ε_1, …, ε_n. Suppose that for 1 ≤ i ≤ m we observe both Z_i and Y_i, and for m + 1 ≤ i ≤ n we observe only Y_i. Complete the E- and M-steps of the EM algorithm for estimating (μ_1, σ_1², β_1, β_2, σ²).

17. Limitations of the EM Algorithm. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component X_j of the data vector X is missing, given the rest of the data vector, is not a function of X_j. That is, the process determining whether X_j is missing is independent of X_j, given X − {X_j}. This condition is called missing at random. For instance, in Problem 16 the probability that Y_i is missing may depend on Z_i, but not on Y_i; that is, the "missingness" of Y_i is independent of Y_i, given Z_i. If Y_i represents the seriousness of a disease, this assumption may not be satisfied. For example, suppose all subjects with Y_i ≥ 2 drop out of the study; that is, Y_i is missing iff Y_i ≥ 2. Then using the E-step to impute values for the missing Y's would greatly underpredict the actual Y's, because all the Y's used in the imputation would have Y < 2.

18. In Example 2.4.6, if μ_2 = 1, σ_1 = σ_2 = 1, and ρ = 0.5, find the probability that E(Y_i | Z_i) underpredicts Y_i.

2.6 NOTES

Notes for Section 2.1

(1) "Natural" now was not so natural in the eighteenth century when the least squares principle was introduced by Legendre and Gauss. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986).

(2) The frequency plug-in estimates are sometimes called Fisher consistent. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected. These considerations lead essentially to maximum likelihood estimates.
Note for Section 2.2

(1) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964).

Notes for Section 2.3

(1) Recall that in an exponential family, for any A, P[T(X) ∈ A] = 0 for all or for no P ∈ P.

(2) For further properties of Kullback–Leibler divergence, see Cover and Thomas (1991).

Note for Section 2.5

(1) In the econometrics literature (e.g., Campbell, Lo, and MacKinlay, 1997), a multivariate version of minimum contrast estimates are often called generalized method of moment estimates.

2.7 REFERENCES

BARLOW, R. E., D. J. BARTHOLOMEW, J. M. BREMNER, AND H. D. BRUNK, Statistical Inference Under Order Restrictions New York: Wiley, 1972.
BAUM, L. E., T. PETRIE, G. SOULES, AND N. WEISS, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Ann. Math. Statist., 41, 164-171 (1970).
BISHOP, Y. M. M., S. E. FEINBERG, AND P. W. HOLLAND, Discrete Multivariate Analysis: Theory and Practice Cambridge, MA: MIT Press, 1975.
CAMPBELL, J. Y., A. W. LO, AND A. C. MACKINLAY, The Econometrics of Financial Markets Princeton, NJ: Princeton University Press, 1997.
COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.
DAHLQUIST, G., A. BJÖRK, AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.
DEMPSTER, A. P., N. M. LAIRD, AND D. B. RUBIN, "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm," J. Roy. Statist. Soc. B, 39, 1-38 (1977).
DHARMADHIKARI, S., AND K. JOAG-DEV, "Examples of Nonunique Maximum Likelihood Estimators," The American Statistician, 39, 199-200 (1985).
EISENHART, C., "The Meaning of Least in Least Squares," Journal Wash. Acad. Sciences, 54, 24-33 (1964).
FAN, J., AND I. GIJBELS, Local Polynomial Modelling and Its Applications London: Chapman and Hall, 1996.
FISHER, R. A., "On the Mathematical Foundations of Theoretical Statistics," reprinted in Contributions to Mathematical Statistics (by R. A. Fisher, 1950) New York: J. Wiley and Sons, 1922.
GOLUB, G. H., AND C. F. VAN LOAN, Matrix Computations Baltimore: Johns Hopkins University Press, 1985.
HABERMAN, S. J., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
KOLMOGOROV, A. N., "On the Shannon Theory of Information Transmission in the Case of Continuous Signals," IRE Trans. Inform. Theory, IT-2, 102-108 (1956).
LITTLE, R. J. A., AND D. B. RUBIN, Statistical Analysis with Missing Data New York: J. Wiley, 1987.
MACLACHLAN, G. J., AND T. KRISHNAN, The EM Algorithm and Extensions New York: Wiley, 1997.
MOSTELLER, F., "Association and Estimation in Contingency Tables," J. Amer. Statist. Assoc., 63, 1-28 (1968).
RICHARDS, F. J., "A Flexible Growth Function for Empirical Use," J. Exp. Botany, 10, 290-300 (1959).
RUPPERT, D., AND M. P. WAND, "Multivariate Locally Weighted Least Squares Regression," Ann. Statist., 22, 1346-1370 (1994).
SEBER, G. A. F., AND C. J. WILD, Nonlinear Regression New York: Wiley, 1989.
SHANNON, C. E., "A Mathematical Theory of Communication," Bell System Tech. Journal, 27, 379-423, 623-656 (1948).
SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 6th ed. Ames, IA: Iowa State University Press, 1967.
STIGLER, S., The History of Statistics Cambridge, MA: Harvard University Press, 1986.
WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
WU, C. F. J., "On the Convergence Properties of the EM Algorithm," Ann. Statist., 11, 95-103 (1983).
Chapter 3

MEASURES OF PERFORMANCE, NOTIONS OF OPTIMALITY, AND OPTIMAL PROCEDURES

3.1 INTRODUCTION

Here we develop the theme of Section 1.3, which is how to appraise and select among decision procedures. In Sections 3.2 and 3.3 we show how the important Bayes and minimax criteria can in principle be implemented. However, actual implementation is limited. Our examples are primarily estimation of a real parameter. In Section 3.4 we study, in the context of estimation, the relation of the two major decision theoretic principles to the non-decision theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness. We also discuss other desiderata that strongly compete with decision theoretic optimality, in particular computational simplicity and robustness. We return to these themes in Chapter 6, after similarly discussing testing and confidence bounds in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case.

3.2 BAYES PROCEDURES

Recall from Section 1.3 that if we specify a parametric model P = {P_θ : θ ∈ Θ}, action space A, and loss function l(θ, a), then for data X ~ P_θ and any decision procedure δ, randomized or not, we can define its risk function R(·, δ) : Θ → R⁺ by

R(θ, δ) = E_θ l(θ, δ(X)).

We think of R(·, δ) as measuring a priori the performance of δ for this model. Strict comparison of δ_1 and δ_2 on the basis of the risks alone is not well defined unless R(θ, δ_1) ≤ R(θ, δ_2) for all θ, or vice versa. However, by introducing a Bayes prior density (say) π for θ, comparison becomes unambiguous by considering the scalar Bayes risk,

r(π, δ) = E R(θ, δ) = E l(θ, δ(X)), (3.2.1)
where (θ, X) is given the joint distribution specified by (1.2.3). This exercise is interesting and important even if we do not view π as reflecting an implicitly believed-in prior distribution on θ. If π is thought of as a weight function roughly reflecting our knowledge, it is plausible that the Bayes procedure, if computable, will behave reasonably even if our knowledge is only roughly right. We may have vague prior notions such as "|θ| ≥ 5 is physically implausible" if, for instance, θ denotes mean height of people in meters. It is in fact clear that prior and loss function cannot be separated out clearly either: the parametrization plays a crucial role here, and weighting the risk by π(θ) can be absorbed into the loss, so that π(θ) ≡ 1 plays a special role ("equal weight") only relative to a given parametrization (Problem 3.2.4). For testing problems the hypothesis is often treated as more important than the alternative, and π may express that we care more about the values of the risk in some rather than other regions of Θ. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). We don't pursue them further except in Problem 3.2.5, and instead turn to construction of Bayes procedures.

In this section we shall show systematically how to construct Bayes rules. Suppose θ is a random variable (or vector) with (prior) frequency function or density π(θ). In the continuous case with θ real valued and prior density π,

r(π, δ) = ∫ R(θ, δ) π(θ) dθ. (3.2.2)

In the discrete case, we just need to replace the integrals by sums. Recall also that we can define

r(π) = inf{r(π, δ) : δ ∈ D}, (3.2.3)

the Bayes risk of the problem, and that in Section 1.3 we showed how, in an example, we could identify Bayes rules δ*_π, when they exist, such that

r(π, δ*_π) = r(π). (3.2.4)

We first consider the problem of estimating q(θ) with quadratic loss, l(θ, a) = (q(θ) − a)², using a nonrandomized decision rule δ. This is just the problem of finding the best mean squared prediction error (MSPE) predictor of q(θ) given X (see Remark 1.4.5). Using our results on MSPE prediction, we find that either r(π, δ) = ∞ for all δ or the Bayes rule δ* is given by

δ*(X) = E[q(θ) | X]. (3.2.5)

This procedure is called the Bayes estimate for squared error loss. In view of formulae (1.2.8) for the posterior density and frequency functions, we can give the Bayes estimate a more explicit form. In the continuous case,

δ*(x) = ∫ q(θ) p(x | θ) π(θ) dθ / ∫ p(x | θ) π(θ) dθ. (3.2.6)
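Before turning to examples, here is a minimal numerical sketch of (3.2.5)-(3.2.6): the posterior mean computed by discretizing the prior on a grid. The N(θ, 1) likelihood and N(0, 1) prior below are illustrative choices only, not tied to any particular example in the text.

```python
import numpy as np

def bayes_estimate(x, thetas, prior, likelihood, q=lambda t: t):
    """Approximate E[q(theta) | X = x] = int q p(x|theta) pi / int p(x|theta) pi,
    i.e., formula (3.2.6), on a grid of theta values."""
    weights = likelihood(x, thetas) * prior          # integrand of (3.2.6), up to dtheta
    return np.sum(q(thetas) * weights) / np.sum(weights)

# Illustration (assumed setup): one observation X ~ N(theta, 1), prior theta ~ N(0, 1).
thetas = np.linspace(-6.0, 6.0, 2001)
prior = np.exp(-thetas**2 / 2)                       # unnormalized N(0, 1) density
lik = lambda x, t: np.exp(-(x - t)**2 / 2)           # N(theta, 1) density at x
print(bayes_estimate(1.0, thetas, prior, lik))       # ~ 0.5, the exact posterior mean
```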
.2. Fonnula (3. Intuitively. Because the Bayes risk of X.2. the Bayes estimate corresponding to the prior density N(1]O' 7 2 ) differs little from X for n large. In fact.E(e I X))' I X)] E[(J'/(l+~)]=nja 2 + 1/72 ' 1 2 n n/ No finite choice of '1]0 and 7 2 will lead to X as a Bayes estimate. E(Y I X) is the best predictor because E(Y I X ~ . if we substitute the prior "density" 7r( 0) = 1 (Prohlem 3. for each x.6. X is the estimate that (3.2. This action need not exist nor be unique if it does exist. 1]0 fixed). For more on this. But X is the limit of such estimates as prior knowledge becomes "vague" (7 . If we choose the conjugate prior N( '1]0. J' )]/r( 7r. J') ~ 0 as n + 00.2 Bayes Procedures 163 Example 3. .1. Applying the same idea in the general Bayes decision problem.7) Its Bayes risk (the MSPE of the predictor) is just r(7r.6) yields. we obtain the posterior distribution The Bayes estimate is just the mean of the posterior distribution J'(X) ~ 1]0 l/T' ] _ [ n/(J2 ] [n/(J' + l/T' + X n/(J' + l/T' (3. " Xn. if X = :x and we use action a.00 with. a) I X = x). we see that the key idea is to consider what we should do given X = x. /2) as in Example 1.5. If we look at the proof of Theorem 1.) minimizes the conditional MSPE E«(Y . take that action a = J*(x) that makes r(a I x) as small as possible. see Section 5.7) reveals the Bayes estimate in the proper case to be a weighted average MID + (1 . . However.12. 0 We now tum to the problem of finding Bayes rules for gelleral action spaces A and loss functions l.2. J') E(e . tends to a as n _ 00. '1]0. a 2 / n. Such priors with f 7r(e) ~ 00 or 2: 7r(0) ~ 00 are called impropa The resulting Bayes procedures are also called improper.E(e I X))' = E[E«e . In fact. we should.2. Thus.w)X of the estimate to be used when there are no observations. This quantity r(a I x) is what we expect to lose. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior.4. X is approximately a Bayes estimate for anyone of these prior distributions in the sense that Ir( 7r.1). r) . and X with weights inversely proportional to the Bayes risks of these two estimates. Suppose that we want to estimate the mean B of a nonnal distribution with known variance a 2 on the basis of a sample Xl. that is. To begin with we consider only nonrandomized rules.Section 3.r( 7r. we fonn the posterior risk r(a I x) = E(l(e.a)' I X ~ x) as a function of the action a.1.
We now turn to the problem of finding Bayes rules for general action spaces A and loss functions l. To begin with we consider only nonrandomized rules. If we look at the proof of Theorem 1.4.1, we see that the key idea is to consider what we should do given X = x. Intuitively, E(Y | X) is the best predictor because E(Y | X = x) minimizes the conditional MSPE E((Y − a)² | X = x) as a function of the action a. Applying the same idea in the general Bayes decision problem, we form the posterior risk

r(a | x) = E(l(θ, a) | X = x).

This quantity r(a | x) is what we expect to lose if X = x and we use action a. Intuitively, we should, for each x, take that action a = δ*(x) that makes r(a | x) as small as possible. This action need not exist nor be unique if it does exist.

Proposition 3.2.1. Suppose that there exists a function δ*(x) such that

r(δ*(x) | x) = inf{r(a | x) : a ∈ A}. (3.2.8)

Then δ* is a Bayes rule.

Proof. As in the proof of Theorem 1.4.1, for any δ,

E[l(θ, δ(X)) | X = x] = r(δ(x) | x) ≥ r(δ*(x) | x) = E[l(θ, δ*(X)) | X = x]. (3.2.9)

Therefore,

r(π, δ) = E[l(θ, δ(X))] = E[E(l(θ, δ(X)) | X)] ≥ E[E(l(θ, δ*(X)) | X)] = r(π, δ*),

and the result follows from (3.2.9). □

The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all competing procedures.

As a first illustration, consider the oil-drilling example (Example 1.3.5) with a prior π on {θ_1, θ_2}. Suppose we observe x = 0. Then the posterior distribution of θ is given by (1.2.8), and the posterior risks of the actions a_1, a_2, and a_3 can be computed from it; a_2 has the smallest posterior risk and, therefore, δ*(0) = a_2. A similar computation with x = 1 (the two smallest posterior risks there are 3.74 and 3.70) again identifies the optimal action, and we recover δ* = δ_5 as we found previously.

Example 3.2.2. Bayes Procedures When Θ and A Are Finite. Let Θ = {θ_0, …, θ_p}, A = {a_0, …, a_q}, let w_ij ≥ 0 be given constants, and let the loss incurred when θ_i is true and action a_j is taken be given by

l(θ_i, a_j) = w_ij.
Section 3.2
Bayes Procedures
165
Let 1r(0) be a prior distribution assigning mass 1ri to Oi, so that 1ri > 0, i = 0, ... ,p, and Ef 0 1ri = 1. Suppose, moreover, that X has density or frequency function p(x I 8) for each O. Then, by (1.2.8), the posterior probabilities are
and, thus,
raj (
x I) =
EiWipTiP(X I Oi) . Ei 1fiP(X I Oi)
(3.2.10)
The optimal action 6* (x) has
r(o'(x) I x)
=
O<rSq
min r(oj
I x).
Here are two interesting specializations. (a) Classification: Suppose that p
= q, we identify aj with OJ, j
1, i f j
=
0, ... ,p, and let
Wii
O.
This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,
r(Oi
Ix) = Pl8 f ei I X = xl
and minimizing r(Oi I x) is equivalent to the reasonable procedure of maximizing the posterior probability,
PI8 = e I X = i
xl
=
1fiP(X lei) Ej1fjp(x I OJ)
(b) Testing: Suppose p = q = 1, 1ro = 1r, 1rl = 1  'Jr, 0 < 1r < 1, ao corresponds to deciding 0 = 00 and al to deciding 0 = 01 • This is a special case of the testing fonnulation of Section 1.3 with 8 0 = {eo} and 8 1 = {ed. The Bayes rule is then to
decide e decide e
= 01 if (1 1f)p(x I 0, ) > 1fp(x I eo) = eo if (1  1f)p(x IeIl < 1fp(x Ieo)
and decide either ao or al if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing .between ao and al if equality occurs. As we let 'Jr vary between zero and one. we obtain what is called the class of NeymanPearson tests, which provides the solution to the problem of minimizing P (type II error) given P (type I error) < Ct. This is treated further in Chapter 4. D
166
Measures of Performance
Chapter 3
To complete our illustration of the utility of Proposition 3.2. L we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation ofthe Probability ofSuccess in n Bernoulli Trials. Suppose that we wish to estimate () using X I, ... , X n , the indicators of n Bernoulli trials with
probability of success
e, We shall consider the loss function I given by
(0  a)2 1(0, a) = 0(1 _ 0)' 0 < 0 < I, a real.
(3.2.11)
"
" " ,

d
This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for () close to zero, this l((), a) is close to the relative squared error (fJ  a)2 JO, lt makes X have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5. By sufficiency we need only consider the number of successes, S. Suppose now that we have a prior distribution. Then, if aU terms on the righthand side are finite,
rea I k)
(3.2.12)
;,
I
, " " I
,

I
Minimizing this parabola in a, we find our Bayes procedure is given by
J'(k) = E(1/(1  0) I S = k) E(I/O(I  0) IS  k)
;
(3.2.13)

i! , ,.
provided the denominator is not zero. For convenience let us now take as prior density the density br " (0) of the bela distribution i3( r, s). In Example 1. 2.1 we showed that this leads to a i3(k +r,n +s  k) posteriordistributionforOif S = k. Ifl < k < nI and n > 2, then all quantities in (3.2.12) and (3.2.13) are finite, and
.
,
";1 
J'(k)
Jo'(I/(1  O))bk+r,n_k+,(O)dO
J~(1/0(1 O))bk+r,n_k+,(O)dO
B(k+r,nk+sI) B(k + r  I, n  k + s  I) k+rl n+s+r2'
(3.2.14)
where we are using the notation B.2.11 of Appendix B. If k = 0, it is easy to see that a = is the only a that makes r(a I k) < 00. Thus, J'(O) = O. Similarly, we get J'(n) = 1. If we assume a un!form prior density, (r = s = 1), we see that the Bayes procedure is the usual estimate, X. This is not the case for quadratic loss (see Problem 3.2.2). 0
°
"Real" computation of Bayes procedures
The closed fonos of (3.2.6) and (3.2.10) make the compulation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that 0 ~ (0" . .. , Op) has a hierarchically defined prior density,
"(0,, O ,... ,Op) = 2
"1 (Od"2(02
I( 1 ) ... "p(Op I Op_ d.
(3.2.15)
I
Section 3.2
Bayes Procedures
167
Here is an example. Example 3.2.4. The random effects model we shall study in Volume II has
(3.2.16)
where the €ij are i.i.d. N(O, (J:) and Jl, and the vector ~ = (AI, .. ' ,AI) is independent of {€i; ; 1 < i < I, 1 < j < J} with LI." ... ,LI.[ i.i.d. N(O,cri), 1 < j < J, Jl, '"'" N(Jl,O l (J~). Here the Xi)" can be thought of as measurements on individual i and Ai is an "individual" effect. If we now put a prior distribution on (Jl,,(J~,(J~) making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by () = (J.L, (J~, (J~, AI,. '. ,AI) with the Xii I () independent N(f.l + Ll. i . cr~). Then p(x I 0) = lli,; 'Pu.(Xi;  f.l Ll. i ) and
11'(0)
=
11'}(f.l)11'2(cr~)11'3(cri)
II 'PU" (LI.,)
i=l
I
(3.2.17)
where <Pu denotes the N(O, (J2) density. ]n such a context a loss function frequently will single out some single coordinate Os (e.g., LI.} in 3.2.17) and to compute r(a I x) we will need the posterior distribution of ill I X. But this is obtainable from the posterior distribution of (J given X = x only by integrating out OJ, j t= s, and if p is large this is intractable. ]n recent years socalled Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable 0 and the use of Bayesian methods has spread. We return to the topic in Volume II.
Linear Bayes estimates
When the problem of computing r( 1r, 6) and 611" is daunting, an alternative is to consider  of procedures for which r( 'R', 6) is easy to compute and then to look for 011" E 'D  a class 'D that minimizes r( 1f 16) for 6 E 'D. An example is linear Bayes estimates where, in the case of squared error loss [q( 0)  a]2, the problem is equivalent to minimizing the mean squared prediction error among functions of the form a + L;=l bjXj _ If in (1.4.14) we identify q(8) with Y and X with Z, the solution is

8(X)
= Eq(8) + IX 
E(X)]T{3
where f3 is as defined in Section 1.4. For example, if in the 11l0del (3.2.16), (3.2.17) we set q(fJ) = Ll 1 , we can find the linear Bayes estimate of Ll 1 hy using 1.4.6 and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of LI.} is
(3.2.18)
where E(Ll.I) given model
= 0,
X = (XIJ, ... ,XIJ)T, P.
= E(X)
and/l
=
ExlcExA,. For the
168
Var(X I ;)
Measures of Performance
Chapter 3
=E
Var(X,)
I B) +
Var E(XIj
I BJ = E«T~) +
(T~ + (T~,
E Cov(X,j , Xlk I B)
+ Cov(E(X'j I B), E(Xlk I BJ)
0+ Cov(1' + II'!, I' + tl.,)
= (T~ +
(T~,
Cov(tl.
Xlj)
=E
Cov(X,j , tl.,
!: ,,
I',
" From these calculations we find and Norberg (1986).
I B) + Cov(E(XIj I B), E(tl. l I 0» = 0 + (T~, =
(Tt·
f3
and OL(X). We leave the details to Problem 3.2.10.
'I
Linear Bayes procedures are useful in actuarial science, for example, Biihlmann (1970)
if
ii "
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier. the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is (the usually improper) prior 71"(0) ;,::; c. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than
modes and that again accounts in part for the popularity of maximum likelihood. An important property of the MLE is equivariance: An estimating method M producing the estimate (j M is said to be equivariant with respect to reparametrization if for every onetoDne function h from e to fl = h (e). the estimate of w '" h(B) is W = h(OM); that is, M
~
I
= h(BM ). In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss. then the Bayes procedure (}B = E(8 I X) is not equivariant
(h(O»
M
~

~
~
for nonlinear transfonnations because
E(h(O) I X)
op h(E(O I X))
1 ,
,
,,
for nonlinear h (e.g., Problem 3.2.3). The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is
i
re(a I x)
= LIB  a]'7f(O I x).
'Ee
(3.2.19)
i
If we set w = h(O) for h onetoone onto fl = heel, then w has prior >.(w) and in the w parametrization, the posterior Bayes risk is
= 7f(h
1
(w))
I
,
r,,(alx)
= L[waj2>,(wlx)
wE"
L[h(B)  a)2 7f (B I x).
(3.2.20)
'Ee
Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, r,,(a I x) op Te(h 1(a) Ix).
Section 3.2
Bayes Procedures
169
Loss functions of the form l((}, a) = Q(Po, Pa) are necessarily equivariant. The KullbackLeibler divergence K«(},a), (J,a E e, is an example of such a loss function. It satisfies Ko(w, a) = Ke(O, h 1 (a)), thus, with this loss function,
ro(a I x)
~
re(h1(a) I x).
See Problem 2.2.38. In the discrete case using K means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.l 0». In the N(O, case the K L (KullbackLeibler) loss K( a) is ~n(a  0)' (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families
"J)
e,
K(1], a) = L[ryj  ajJE1]Tj
j=1
•
+ A(1]) 
A(a).
Moreover, if we can find the KL loss Bayes estimate 1]BKL of the canonical parameter 7] and if 1] = c( 9) :  t E is onetoone, then the K L loss Bayes estimate of 9 in the
e
general exponential family is 9 BKL = C 1(1]BKd. For instance, in Example 3.2.1 where J.l is the mean of a nonnal distribution and the prior is nonnal, we found the squared error Bayes estimate jiB = wf/o + (lw)X, where 1]0 is the prior mean and w is a weight. Because the K L loss is equivalent to squared error for the canonical parameter p, then if w = h(p), WBKL = h('iiUKL), where 'iiBKL =
~
wTJo + (1  w)X.
Bayes procedures based on the KullbackLeibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume 11.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1%9; Lindley, 1965; Savage, 1954) who argue on nonnative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior 1r. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally as we consider all possible priors that the class 'Do of Bayes procedures and their limits is complete in the sense that for any 6 E V there is a 60 E V o such that R(0, 60 ) < R(0, 6) for all O. Summary. We show how Bayes procedures can be obtained for certain problems by COmputing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures require sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
170
Measures of Performance
Chapter 3
3.3
MINIMAX PROCEDURES
In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worstcase analysis; the true () is one that is as "hard" as possible. That is, 6 1 is better than 82 from a minimax point of view if sUPe R(0,6 1 ) < sUPe R(B, 82 ) and is said to be minimax if
o·
supR(O, 0')
.
= infsupR(O,o).
,
.
,
I·
Here (J and 8 are taken to range over 8 and V = {all possible decision procedures (possibly randomized)} while P = {p. : 0 E e}. It is fruitful to consider proper subclasses of V and subsets of P. but we postpone this discussion. The nature of this criterion and its relation to Bayesian optimality is clarified by considering a socalled zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set V of all randomized decision procedures whereas Nature has at her disposal all prior distributions 1r on 8. For the basic game, 5 picks 0 without N's knowledge, N picks 1f without 5's knowledge and then all is revealed and S pays N
r(Jr,o)
, ,
,
=
J
R(O,o)dJr(O)
where the notation f R(O, o)dJr(0) stands for f R(0, o)Jr(O)dII in the continuous case and L, R(Oj, o)Jr(O j) in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice 0 of 5 before picking 1r and 5 knows the rules of the game. Then
Ii ,
N naturally picks 1f,; such that
r(Jr"o)
that is, 1f,; is leastfavorable against such that
~supr(Jr,o),
o. Knowing the rules of the game S naturally picks 0*
•
(3.3.1)
r( Jr,., 0·) = sup r(Jr, 0') = inf sup r(Jr, 0).
We claim that 0* is minimax. To see this we note first that,
.
,
.
(3.3.2)
for allJr, o. On the other hand, if R(0" 0) = suP. R(0, 0), then if Jr, is point mass at 0" r( Jr" 0) = R(O" 0) and we conclude that supr(Jr,o) = sup R(O, 0)
,
•
•
(3.3.3)
I
,
•
•
Section 3.3
Minimax Procedures
171
and our claim follows. II: S is told the choke 7r of N before picking 6 and N knows the rules of the game. Then S naturally picks 6 1r such that
That is, r5 rr is a Bayes procedure for
11".
Then N should pick
7r*
such that (3.3.4)
For obvious reasons, 1f* is called a least favorable (to S) prior distribution. As we shall see by example, altbough the rigbthand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be umque. The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both
e and D are finite,
?5
11"
then:
(a)
v=supinfr(1I",6), v=infsupr(1I",6)
rr
?5
are both assumed by (say)
7r*
(least favorable), 6* minimax, respectively. Further,
") =v v=r1f,u (
.
(3.3.5)
v and v are called the lower and upper values of the basic game. When v (saY), v is called the value of the game.
=
v
=
v
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification ~ {eo} and = {eIl (Example 3.2.2) but is too reSlrictive in ilS and testing when assumption for the great majority of inference problems. A generalization due to Wald and Karlinsee Karlin (1 959)states that the conclusions of the theorem remain valid if and D are compact subsets of Euclidean spaces. There are more farreaching generalizations but, as we shall see later, without some form of compactness of and/or D, although equality of v and v holds quite generally, existence of least favorable priors and/or minimax procedures may fail.
eo
e,
e
e
The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on and D and are easy to prove.
e
Proposition 3.3.1. Suppose 6**. 7r** can be found such that
U
£** =
r ()1r.',
1f **
= 1rJ••
(3.3.6)
172
Measures of Performance
Chapter 3
that is, 0** is Bayes against 11"** and 11"** is least favorable against 0**. Then v R( 11"** ,0**). That is, 11"** is least favorable and J*'" is minimax.
To utilize this result we need a characterization of 11"8. This is given by
v
=
Proposition 3.3.2.11"8 is leastfavorable against
°iff
1r,jO: R(O,b) = supR(O',b)) = 1.
.'
(3,3,7)
That is, n a assigns probability only to points () at which the function R(·, 0) is maximal.
Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that 11"8 may not be unique. In particular, if R(O, 0) = constant. the rule has constant risk, then all 11" are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
Proof of Proposition 3.3.1. Note first that we always have
v<v
because, trivially,
i~fr(1r,b)
(3.3.8)
< r(1r,b')
(3.3,9)
for aIln, 5'. Hence,
v = sup in,fr(1r,b) < supr(1r,b')
•
(3,3.10)
•
for all 0' and v
<
infa, sUP1r 1'(11", (/) =
v. On the other hand. by hypothesis,
sup1'(1r,6**) > v.
v> inf1'(11"*'",6)
,
= 1'(1I"*",0*'") =
.
(3.3.11)
Combining (3.3.8) and (3.3.11) we conclude that
:r
v
as advertised.
= i~f 1'(11"**,0) = 1'(11"**,6**) = s~p1'(1I",0*"')
= V
(3.3.12)
" 'I
,.;
~l
o
Proofof Proposition 3.3.2. 1r is least favorable for b iff E.R(8,6) =
,.,
f r(O,b)d1r(O) =s~pr( ..,6).
•
(3.3.13)
But by (3.3.3),
supr(..,b) = supR(O, 6),
(3.3,14)
•
i ,
Because E.R(8, 6)
= sUP. R(O, b), (3.3,13) is possible iff (3.3.7) holds.
o
Putting the two propositions together we have the following.
•
Section 3.3
Minimax Procedures
173
11"*
Theorem 3.3.2. Suppose 0* has sUPo R((}, 0*) = 1" < 00. If there exists a prior that 0* is Bayes for 11"* and tr" {(} : R( (}, 0") = r} = 1, then 0'" is minimax.
such
Example 3.3.1. Minimax estimation in the Binomial Case. Suppose S has a B(n,B) distribution and X = Sjn,as in Example 3.2.3. Let I(B, a) ~ (Ba)'jB(IB),O < B < 1. For this loss function,
R(B X) = E(X _B)' , B(IB)

=
B(IB) = ~ nB(IB) n'

and X does have constant risk. Moreover, we have seen in Example 3.2.3 that X is Bayes, when 8 is U(Ol 1). By Theorem 3.3.2 we conclude that X is minimax and, by Proposition 3.3.2, the uniform distribution least favorable. For the usual quadratic loss neither of these assertions holds. The minimax estimate is
o's=S+hln = .,fii X+ I 1 () n+.,fii .,fii+l .,fii+1 2
This estimate does have constant risk and is Bayes against a (J( y'ri/2, vn/2) prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as n t 00 of the ratio of the risks of 0* and X is > 1 for every () =f ~. At B = the ratio tends to 1. Details are left to Problem 3.3.4. 0
!
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity v > 0 for n seconds or, if it is not working, does not answer. Because of the general "noise" level in space the signals received on Earth vary randomly whether the satellite is sending ~r not. The mean voltage per second of the signal for each of the n seconds is recorded. Denote the mean voltage of the signal received through the ith second less expected mean voltage due to noise by Xi. We assume that the Xi are independently and identically distributed as N(p" 0'2) where p, = v, if the satellite functions, and otherwise. The variance 0'2 of the "noise" is assumed known. Our problem is to decide whether "J.l = 0" or"p, = v." We view this as a decision problem with 1 loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk? A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability 1r to and 1  1r to v, use 0  1 loss, and set L(x, 0, v) = p( x Iv) j p(x I 0), then the Bayes test decides I' = v if
°
°°
L(x,O,v)
and decides p,
=
=
exp
°
" {
2"EXi   , a 2a
nv'} >
17l'
if L(x,O,v) <:17l' 7l'
174
Measures of Performance
Chapter 3
This test is equivalent to deciding f.t
= v (Problem 3.3.1) if. and only if,
"yn
1 ;;;:EXi>t,
T=
, , ,
where,
t
If we call this test d,r.
=
" [log " + 2 nv'] ;;;: vyn a
17r
2
'
R(O,J. )
1 ~ <I>(t)
<I>
= <1>( t)
R(v,o,)
(t  vf)
To get a minimax test we must have R(O, 61r ) = R( V, 6ft), which is equivalent to
v.,fii t = t  "'
h •
or
"
','
I ,~
~.
, ,
•
v.,fii t= . 20
•
Because this value of t corresponds to 7r = the intuitive test, which decides JJ only ifT > ~[Eo(T) + Ev(T)J, is indeed minimax.
!.
= v if and
0
'I
•
If is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem
e
3.3.2.
Theorem 3,3,3, Let 0' be a rule such that sup.R(O,o') = r < 00, let {"d denote a sequence of pn'or distributions such that 'lrk;{8 : R(B,o*) = r} = 1, and let Tk = inffJ r( ?rie, J), where r( 7fkl 0) denotes the Bayes risk wrt 'lrk;. If
Tk Task 00,
(3.3.i5)
then J* is minimax. Proof Because r( "k, 0') = r
supR(B, 0') = rk + 0(1)
•
where 0(1)
~
0 as k
~ 00.
But hy (3.3.13) for any competitor 0
supR(O,o) > E.,(R(B,o)) > rk ~supR(O,o') 0(1).
•
•
(3.3.16)
,
'.
If we let k _ suP. R(B, 0').
00
the lefthand side of (3.3.16) is unchanged, whereas the right tends to
0
j
1
•
Section 3.3
M'mimax Procedures
175
Example 3.3.3. Normal Mean. We now show that X is minimax in Example 3.2.1. Identify 1fk with theN(1Jo, 7 2 ) prior where k = 7 2 . Then
whereas the Bayes risk of the Bayes rule of Example 3.2.1 is
i~frk(J) ~ (,,'/n) +7' n
Because
(0
7
2
0
2
=
00,
n  (,,'/n) +7'
a2
1
0
2
n·
2
In)1 « (T2 In) + 7 2 )
>
0 as T 2

we can conclude that
X is minimax.
0
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose XI,'" ,Xn arei.i.d, FE:F
Then X is minimax for estimating B(F) EF(Xt} with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let 1fk be a prior distribution on :F constructed as foUows:(J)
(i)
=
".{F: VarF(XIl # M}
= O.
(ii) ".{ F : F
# N(I', M) for some I'} = O.
(iii) F is chosen by first choosing I' = 6(F) from a N(O, k) distribution and then taking F = N(6(F),M).
Evidently, the Bayes risk is now the same as in Example 3.3.3 with 0"2 evidently,
= M.
Because.
) VarF(X, ) max R(F, X = max :F :F n
Theorem 3.3.3 applies and the result follows. Minimax procedures and symmetry
M
n
o
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" 0, There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and CaseUa (1998), for instance. We shall discuss this approach somewhat, by example. in Chapters 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading. Summary. We introduce the minimax principle in the contex.t of the theory of games. Using this framework we connect minimaxity and Bayes metbods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called leas! favorable." R(0.4. for instance. .1. 5') for aile. This notion has intuitive appeal. This approach has early on been applied to parametric families Va. such as 5(X) = q(Oo).6.I . then in estimating a linear function of the /3J. the game is said to have a value v. Moreover. In the nonBayesian framework. The most famous unbiased estimates are the familiar estimates of f. . are minimax. An alternative approach is to specify a proper subclass of procedures. Do C D.. . for example.• •••. if Y is postulated as following a linear regression model with E(Y) = zT (3 as in Section 2.1 Unbiased Estimation. 0'5) model. This approach coupled with the principle of unbiasedness we now introduce leads to the famous GaussMarkov theorem proved in Section 6. II . I I . all 5 E Va. Von Neumann's Theorem states that if e and D are both finite.2. S(Y) = L:~ 1 diYi . and then see if within the Do we can find 8* E Do that is best according to the "gold standard. or more generally with constant risk over the support of some prior.2. N (Il. Survey Sampling In the previous two sections we have considered two decision theoretic optimality principles. there is a least favorable prior 1[" and a minimax rule 8* such that J* is the Bayes rule for n* and 1r* maximizes the Bayes risk of J* over all priors. the solution is given in Section 3. then the game of S versus N has a value 11. computational ease.3. When v = ii. in many cases. We introduced. we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule 8 and N selects a prior 1['.L and (72 when XI.d. We show that Bayes rules with constant risk. symmetry. I X n are d.4 UNBIASED ESTIMATION AND RISK INEQUALITIES 3. Bayes and minimaxity. D. for which it is possible to characterize and. the notion of bias of an estimate O(X) of a parameter q(O) in a model P {Po: 0 E e} as = Biaso(5) = Eo5(X)  q(O).. Obviously.. An estimate such that Biase (8) 0 is called unbiased. ruling out. we can also take this point of view with humbler aims. in Section 1. The lower (upper) value v(v) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk. 176 Measures of Performance Chapter 3 . v equals the Bayes risk of the Bayes rule 8* for the prior 1r*. 5) > R(0. on other grounds. (72) = . it is natural to consider the computationally simple class of linear estimates. and so on..1 . which can't be beat for 8 = 80 but can obviously be arbitrarily terrible. compute procedures (in particular estimates) that are best in the class of all procedures. When Do is the class of linear procedures and I is quadratic Joss. according to these criteria. E Do that minimizes the Bayes risk with respect to a prior 1r among all J E Do. Ii I More specifically. estimates that ignore the data. 3. looking for the procedure 0. This result is extended to rules that are limits of Bayes rules with constant risk and we use it to show that x is a minimax rule for squared error loss in the N (0 .
.< 2 ~ ~ (3.1 ) ~ 1 nl L (Xi ~ ooc n ~ X) 2 . . UN for the known last census incomes.3) = 0 otherwise. UMVU (uniformly minimum variance unbiased).4.(Xi 1=1 I" N ~2 x) ..14) and has = iv L:fl (3.. . .' . X N ) as parameter . is to estimate by a regression estimate ~ XR Xi. We ignore difficulties such as families moving. reflecting the probable correlation between ('ttl. .5) This method of sampling does not use the information contained in 'ttl. (3. X N ).3.Section 3. Ui is the last census income corresponding to = it E~ 1 'Ui.8) Jt =X .···. Unbiased Estimates in Survey Sampling.4. and u =X  b(U .Xn denote the incomes of a sample of n families drawn at random without replacement. . for instance. ...il) Clearly for each b. . As we shall see shortly for X and in Volume 2 for . . We want to estimate the parameter X = Xj. to determine the average value of a variable (say) monthly family income during a time between two censuses and suppose that we have available a list of families in the unit with family incomes at the last census. . ... . Example 3....3. This leads to the model with x = (Xl.an}C{XI.4..XN} 1 (3..3 and Problem 1. . We let Xl..4.4. UN) and (Xl. . ~ N L..1.4. (~) If{aj.2) 1 Because for unbiased estimates mean square error and variance coincide we call an unbiased estimate O*(X) of q(O) that has minimum MSE among all unbiased estimat~s for all 0.4 Unbiased Estimation and Risk Inequalities 177 given by (see Example 1. these are both UMVU. a census unit.4. UN' One way to do this.6) where b is a prespecified positive constant. . (3.4) where 2 CT. [J = ~ I:~ 1 Ui· XR is also unbiased..2.. If .. Suppose we wish to sample from a finite population. Unbiased estimates playa particularly important role in survey sampling. X N for the unknown current family incomes and correspondingly UI... It is easy to see that the natural estimate X ~ L:~ 1 Xi is unbiased (Problem 3. Write Xl. (3.4.
j . .4.3.11. It is possible to avoid the undesirable random sample size of these schemes and yet have specified 1rj. " 1 ". I . the unbiasedness principle has largely fallen out of favor for a number of reasons. q(e) is biased for q(e) unless q is linear. The result is a sample S = {Xl.1 the correlation of Ui and Xi is positive and b < 2Cov(U. ". ' . ..I. . However.18. 1. An alternative approach to using the Uj is to not sample all units with the same prob ability.15). . The HorvitzThompson estimate then stays unbiased. (i) Typically unbiased estimates do not existsee Bickel and Lehmann (1969) and Problem 3. (iii) Unbiased_estimates do not <:. ~ . .4. X is not unbiased but the following estimate known as the HorvitzThompson estimate is: Ef .~'! I . Specifically let 0 < 7fl.19). N toss a coin with probability 7fj of landing heads and select Xj if the coin lands heads.4. A natural choice of 1rj is ~n. They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later.i . . Ifthc 'lrj are not all equal..2Gand minimax estimates often are. X M } of random size M such that E(M) = n (Problem 3... 0 Discussion. .4. (ii) Bayes estimates are necessarily biasedsee Problem 3. I j .. Unbiasedness is also used in stratified sampling theory (see Problem 1.j' I ': Because 1rj = P[Xj E Sj by construction unbiasedness follows.. 178 Measures of Performance Chapter 3 i ..4. 111" N < 1 with 1 1fj = n. estimate than X and the best choice of b is bopt The value of bopt is unknown but can be estimated by = The resulting estimate is no longer unbiased but behaves well for large samplessee Problem 5.4).. .'I . for instance.7) II ! where Ji is defined by Xi = xJ. j=1 3 N ! I: . . . M (3. . This makes it more likely for big incomes to be induded and is intuitively desirable. Xl!Var(U).bey the attractive equivariance property. To see this write ~ 1 ' " Xj XHT ~ N LJ 7'" l(Xj E S)... Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems. If B is unbiased for e. this will be a beller cov(U. For each unit 1. II it >: .. :I . outside of sampling.3.X)!Var(U) (Problem 3. . I II.Xi XHT=LJN i=l 7fJ.
From this point on we will suppose p(x.OJdx (3. p.. 0 or equivalently Vare(Bn)/MSEe(Bn) t 1 as n t 00. in the linear regression model Y = ZDf3 + e of Section 2. for integration over Rq.O) > O} docs not depend on O. unbiased estimates are still in favor when it comes to estimating residual variances. The lower bound is interesting in its own right. as we shall see in Chapters 5 and 6.2 The Information Inequality The oneparameter case We will develop a lower bound for the variance of a statistic. e e. This preference of 8 2 over the MLE 52 = iT e/n is in accord with optimal behavior when both the number of observations and number of parameters are large. Some classical conditiolls may be found in Apostol (1974). For instance.   . f3 is the least squares estimate.4. B)dx.OJdx] = J T(xJ%oP(x. See Problem 3. %0 [j T(X)P(x.8) whenever the righthand side of (3. (II) 1fT is any statistic such that E 8 (ITf) < 00 for all 0 E then the operations of integration and differentiation by B can be interchanged in J T( x )p(x. Then 11 holdS provided that for all T such that E. good estimates in large samples ~ 1 ~ are approximately unbiased.p) where i = (Y . and appears in the asymptotic optimality theory of Section 5. Simpler assumptions can be fonnulated using Lebesgue integration theory. Assumption II is practically useless as written. We make two regularity assumptions on the family {Pe : BEe}. What is needed are simple sufficient conditions on p(x. 167. Note that in particular (3. e.4.4 Unbiased Estimation and Risk Inequalities 179 Nevertheless. We suppose throughout that we have a regular parametric model and further that is an open subset of the line.8) is finite. B) for II to hold. B) is a density.4.O E 81iJO log p( x.2. and we can interchange differentiation and integration in p(x. (I) The set A ~ {x : p(x.4.ZDf3). For instance. Finally. the variance a 2 = Var(ei) is estimated by the unbiased estimate 8 2 = iTe/(n . The arguments will be based on asymptotic versions of the important inequalities in the next subsection.4. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. which can be used to show that an estimate is UMVU. OJ exists and is finite.( lTD < 00 J .Section 3. We expect that jBiase(Bn)I/Vart (B n ) . and p is the number of coefficients in f3. That is. suppose I holds.4. B)dx. 3. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous~case theorems given later.8) is assumed to hold if T(xJ = I for all x. For all x E A. has some decision theoretic applications.9.
. 1 X n is a sample from a Poisson PCB) population. Ii ! I h(x) exp{ ~(O)T(x) . O)dx = O. Suppose that 1 and II hold and that E &Ologp(X. . o Example 3..180 for all Measures of Performance Chapter 3 e. O)dx ~ :0 J p(x.i J T(x) [:OP(X. Then (3.0 Xi) 1 = nO =  0' n O· o . 00. then I and II hold. t. Then Xi .2.6.4. I 1(0) 1 = Var (~ !Ogp(X.4.1) ~(O) = 0la' and 1 and 11 are satisfied. O)dx :Op(x.O)] dx " I'  are continuous functions(3) of O. It is not hard to check (using Laplace transform theory) that a oneparameter exponential family quite generally satisfies Assumptions I and II.O)] dxand J T(x) [:oP(X. (J2) population.4. . I' .9) Note that 0 < 1(0) < Lemma 3. (~ logp(X.4.4.B(O)} is an exponential family and TJ(B) has a nonvanishing continuous derivative on 8.  Proof. the Fisher information number. where a 2 is known. 1 and 11 are satisfied for samples from gamma and beta distributions with one parameter fixed.I [I ·1 . (3.4. Similarly.1. thus. (3. .n and 1(0) = Var (~n_l .&0 '0 {[:oP(X. For instance.O)).11 ) . O)dx. Suppose Xl.O)} p(x. If I holds it is possible to define an important characteristic of the family {Po}. suppose XI. 1 X n is a sample from a N(B.O)]j P(x..O) & < 00. /fp(x. the integrals .10) and.. which is denoted by I(B) and given by Proposition 3. 0))' = J(:Ologp(x.1. I J J & logp(x 0 ) = ~n 1 . 0))' p(x. Then (see Table 1. 0) ~ 1(0) = E.
Proposition 3. e)) = I(e). (3. 0 E e. .4. Let [.O)dX.4. We get (3.Section 3. > [1/. 1/.(T(X)) by 1/.O))P(x.(0). Theorem 3.4.1.O).17) . CoroUary 3.1. .1. and that the conditions of Theorem 3.4. T(X)) .(0). Then Var. Denote E. then I(e) = nI.'(0) = J T(X):Op(x.4. e) and T(X). nI.(0) is differentiable and Var (T(X)) . (3.I 1. 10gp(X.12) Proof.. ej. 1/. Suppose the conditions of Theorem 3. Then for all 0.(0) = E(feiogf(X"e»)'.14) Now let ns apply the correlation (CauchySchwarz) inequality (A.4.4.1 hold and T is an unbiased estimate ofB.' (e) = Cov (:e log p(X. Using I and II we obtain.O)dX= J T(x) UOIOgp(x.l4) and Lemma 3.15) The theorem follows because. If we consider the class of unbiased estimates of q(B) = B.I1. we obtain a universal lower bound given by the following.4. by Lemma 3. Here's another important special case.4.16) to the random variables 8/8010gp(X.2.'(~)I)'.(T(X)) > [1/.(T(X)) < 00 for all O. 0 (3. 1/. Let T(X) be all)' statistic such that Var.'(0)]'  I(e) . Suppose that X = (XI.4.16) The number 1/I(B) is often referred to as the information or CramerRoo lower bound for the variance of an unbiased estimate of 'tjJ(B).(T(X) > I(O) 1 (3.4. (3.. 0 The lower bound given in the information inequality depends on T(X) through 1/.1 hold. (Information Inequality).Xu) is a sample from a population with density f(x.1. (e) and Var.4.4 Unbiased Estimation and Risk Inequalities 181 Here is the main result of this section.13) By (A.4. Var (t. Suppose that I and II hold and 0 < 1(0) < 00.
4.(B) such that Var. As we previously remarked. This is no accident.4. If the family {Pe} satisfies I and II and if there exists an unbiased estimate T* of 1/. Note that because X is UMVU whatever may be a 2 . .B(B)J.4. i g .B) = h(x)exp[ry(B)T'(x) . then  1 (3..1) with natural sufficient statistic T( X) and .182 Measures of Performance Chapter 3 Proof. we have in fact proved that X is UMVU even if a 2 is unknown. Then {P9} is a oneparameter exponential family with density or frequency function af the fonn II p(x. X n is a sample from a nonnal distribution with unknown mean B and known variance a 2 . Because X is unbiased and Var( X) ~ Bin.(B).3. (Br Now Var(X) ~ a'ln. 0 We can similarly show (Problem 3. These are situations in which X follows a oneparameter exponential family.2.B)] Var " i) ~ 8B iogj(Xi. Next we note how we can apply the information inequality to the problem of unbiased estimation.2.19) Ii i .4.Xn are the indicators of n Bernoulli trials with probability of success B. Theorem 3..4.4. Conversely.. Example 3. then T' is UMVU as an estimate of 1.4.1 and I(B) = Var [:B !ogp(X. the conditions of the information inequality are satisfied. This is a consequence of Lemma 3.j. .4. We have just shown that the information [(B) in a sample of size n is nh (8).4..B)] = nh(B).. o II (B) is often referred to as the information contained in one observation. 1) density.B) ] [ ~var [%0 ]Ogj(Xi. . Suppose Xl.1 for every B. Suppose that the family {p.p'(B)]'ll(B) for all BEe.(T(X». if {P9} is a oneparameter exponentialfamily of the form (1. whereas if 'P denotes the N(O. which achieves the lower bound ofTheorem 3.18) 1 a ' . .1 we see that the conclusion that X is UMVU follows if ~  Var(X) ~ nI.4. By Corollary 3. For a sample from a P(B) distribution.1) that if Xl. the MLE is B ~ X.[T'(X)] = [. .6. I I "I and (3.. then X is UMVU. (3. : BEe} satisfies assumptions I and II and there exists an unbiased estimate T* of1.. then X is a UMVU estimate of B.(8) has a continuous nonvanishing derivative on e. Example 3.18) follows. (Continued).j. then T(X) achieves the infonnation inequality bound and is a UMVU estimate of E.
2. (3.4. B . B) IdB. Thus.20) with Pg probability 1 for each B.(Ae) = 1 for all B' (Problem 3.20) guarantees Pe(Ae) ~ 1 and assumption 1 guarantees P. However..B) so that = T(X) .4 Unbiased Estimation and Risk Inequalities 183 Proof We start with the first assertion.p(B) = A'(B) and. (B)T'(X) + a2(B) (3.20) with respect to B we get (3. then (3. and only if.6).4. " and both sides are continuous in B. B) = a.16) we know that T'" achieves the lower bound for all (J if.4.19) is highly technical.22) for j = 1.10g(1 .. continuous in B.2 and.19). But now if x is such that = 1..4.B)) + nz)[log B .B) I + 2nlog(1  BJ} . we see that all a2 are linear combinations of {} log p( Xj. Then ~ ( 80 logp X. J 2 Pe.1) we assume without loss of generality (Problem 3. Let B . Suppose without loss of generality that T(xI) of T(X2) for Xl.21 ) Upon integrating both sides of (3. there exist functions al ((J) and a2 (B) such that :B 10gp(X.4.4 and 2. The passage from (3. Here is the argument.ll. it is necessary.4. If A.4. X2 E A"''''.4.14) and the conditions for equality in the correlation inequality (A.A'(B» = VareT(X) = A"(B). thus. By solving for a" a2 in (3. Note that if AU = nmAem .4.(B)T' (x) +a2(B) for all BEe}.4.B) = a. (3. denotes the set of x for which (3. 0 Example 3. By (3.Section 3. In the HardyWeinberg model of Examples 2.4. p(x.A (B) .4.4.(A") = 1 for all B'.25) But . the information bound is [A"(B))2/A"(B) A"(B) Vare(T(X») so that T(X) achieves the information bound as an estimate of EeT(X).24) [(B) = Vare(T(X) .3) that we have the canonical case with ~(B) = Band B( Bj = A(B) = logf h(x)exp{BT(x))dx.4. :e Iogp(x.20) hold.6. j hence.20) to (3.6.4.2. (3. be a denumerable dense subset of 8. 2 A ** = A'" and the result follows. Our argument is essentially that of Wijsman (1973).. + n2) 10gB + (2n3 + n3) Iog(1 . then (3.4. From this equality of random variables we shall show that Pe[X E A'I = 1 for all B where A' = {x: :Blogp(x.1. Conversely in the exponential family case (1.23) for all B1 . B) = aj(B)T'(x) + a2(B) (3. B) = 2"' exp{(2nj 2"' exp{(2n.4.4.23) must hold for all B. B . .
For a sample from a P( B) distribution o EB(::. I(B) = Eo aB' It turns out that this identity also holds outside exponential families. DiscRSsion. I I' f .4. See Volume II.. .4. The variance of B can be computed directly using the moments of the multinomial distribution of (NI • N z .27) and integrate both sides with respect to p(x. Theorem 3. but the variance of the best estimate is not equal to the bound [.2.7) Var(B) ~ B(I .2. Extensions to models in which B is multidimensional are considered next. Na).p' (B) I'( I (B). () (ell" .B)) which equals I(B).25) we obtain a' 10gp(X. . ~ ~ This T coincides with the MLE () of Example 2. '. . (Continued).4.ed)' In particular.6. that I and II fail to hold.B)] and then using Theorem 1. although UMVU estimates exist.4.4. We find (Problem 3. assumptions I and II are satisfied and UMVU estimates of 1/. It often happens. =e'E(~Xi) = . e). or by transforming p(x. Suppose p('. .2 implies that T = (2N 1 + N z )/2n is UMVU for estimating E(T) ~ (2n)1[2nB' + 2nB(1 .2. . B).3. B)  2 (a log p(x.B)] ~ B.184 Measures of Performance Chapter 3 where we have used the identity (2nl + HZ) + (2n3 + nz) = 211. Because this is an exponential family. as Theorem 3. we have iJ' aB' logp (X.. Even worse. B) )' aB (3. Then (3.4. B) 2 1 a = p(x. A third metho:! would be to use Var(B) = I(I(B) and formula (3. B) aB' p(x. for instance..'" .4.4. in many situations.4. B) satisfies in addition to I and II: p(·. . Sharpenings of the information inequality are available but don't help in general. By (3.6.IOgp(X. B) example.2 suggests. in the U(O. " 'oi . B) to canonical form by setting t) = 10g[B((1 .25). Proof We need only check that I a aB' log p(x.26) Proposition 3.B) is twice differentiable and interchange between integration and differentiation is pennitted. we will find a lower bound on the variance of an estimator .24). 0 ~ Note that by differentiating (3.B)(2n.4. Example 3.(e) exist. • The multiparameter case We will extend the information lower bound to the case of several parameters.4.B) ~ A "( B)..26) holds. 0 0 . (3.
Section 3.4. (0) iJiJ lt2(0) ~ E [aer 2 iJl" logp(x. We assume that e is an open subset of Rd and that {p(x. Under the conditions in the opening paragraph. .32) Proof... j = 1. ~ N(I".2 ] ~ er. nh (O) where h is the information matrix ofX. . Then 1 = log(211') 2 1 2 1 2 loger . (a) B=T 1 (3. 1 <j< d.Ok IOgp(X.4. 0)] = E[er.. . 0). d. Let p( x.Xnf' has information matrix 1(0) = EO (aO:. = Var('. The (Fisher) information matrix is defined as (3. . III (0) = E [~:2 logp(x.4..4. er 2). The arguments follow the d Example 3. . .4 E(x 1") = 0 ~ I. Suppose X = 1 case and are left to the problems.. as X.. 1 <k< d. .31) and 1(0) (b) If Xl. 0). in addition. .4.29) Proposition 3. iJO logp(X.0) = er. er 2 ). (3.Bd are unknown.28) where (3.? ologp(X. . 0) j k That is.. (3. then X = (Xl.. a ).5. (c) If.Xn are U.30) iJ Ijd O) = Covo ( eO logp(X.4.O) ] iJ2 ] l22(0) = E [ (iJer2)21ogp(x.2 = er.4.4 /2. pC B) is twice differentiable and double integration and differentiation under the integral sign can be interchanged. B) : 0 E 8} is a regular parametric model with conditions I and II satisfied when differentiation is with respect OJ.O») . 0 = (I".( x 1") 2 2a 2 logp(x 0) .4 Unbiased Estimation and Risk Inequalities 185 of fh when the parameters 82 .4. 6) denote the density or frequency function of X where X E X C Rq...d.
in this case [(0) ! .30) and Corollary 1. .2 ( 0 0 ) . Assume the conditions of the .(0) exists and (3.6..35) o ~ Next suppose 8 1 = T is an estimate of (Jl with 82. A(O). (continued).O) = T(X) . We claim that each of Tj(X) is a UMVU . Then Theorem 3. 1/. 0).O») = VOEO(T(X» and the last equality follows 0 from the argument in (3.33) o Example 3.38) where Lzy = EO(TVOlogp(X.6. Suppose k .4.4...6.4. Suppose the conditions of Example 3. J)d assumed unknown. 0) .34) () E e open. Example 3.36) Proof. I'L(Z) = I'Y + (Z l'z)TLz~ Lzy· Now set Y (3. . The conditions I. Let 1/. then [(0) = VarOT(X). We will use the prediction inequality Var(Y) > Var(I'L(Z».4. Then VarO(T(X» > Lz~ [1(0) Lzy (3.!pening paragraph hold and suppose that the matrix [(0) is nonsingular Then/or all 0. UMVU Estimates in Canonical Exponential Families.4.6 hold. (3.3. Here are some consequences of this result. where I'L(Z) denotes the optimal MSPE linear predictor of Y. that is.'(0) = V1/. II are easily checked and because VOlogp(x. .13).A.1.(0) = EOT(X) and let ".4/2 ..4.4..' = explLTj(x)9j j=1 . Z = V 0 logp(x. By (3.4. [(0) = VarOT(X) = (3.( 0) be the d x 1 vector of partial derivatives. = .37) = T(X). Canonical kParameter Exponential Family.4.4.186 Measures of Performance Chapter 3 Thus.4.. ! p(x.A(O)}h(x) (3.4.(O).
the lower bound on the variance of an unbiased estimator of 1/Jj(O) = E(n1Tj(X) = >. Ak) to the canonical form p(x.Tk_I(X)). .A(0) j ne ej = (1 + E7 eel _ eOj ) . . ..d.A(O)} whereTT(x) = (T!(x). kI A(O) ~ 1£ log Note that 1+ Le j=l Oj 80 A (0) J 8 = 1 "k 11 ne~ I 0 +LJl=le l = 1£>'.0). To see our claim note that in our case (3.. j. .i. then Nj/n is UMVU for >'j.. . = nE(T.41) because .j. j=1 X = (XI. 0 = (0 1 . . j = 1. Thus.. Example 3.A(O) is just VarOT! (X).4. .. We have already computed in Proposition 3.(X) = I)IXi = j]. we let j = 1.. (1 + E7 / eel) = n>'j(1.p(OJrI(O). (3. This is a different claim than TJ(X) is UMVU for EOTj(X) if Oi.. . are known. n T. .Section 3. " Ok_il T. But f.4..) = Var(Tj(X».(1.7.4.39) where.. . But because Nj/n is unbiased and has 0 variance >'j(1 .p(0)11(0) ~ (1.3. In the multinomial Example 1. hence.Xn)T.0. i =j:.T(O) = ~:~ I (3. X n i...4... we transformed the multinomial model M(n. .4.40) J We claim that in this case .4. .(X)) 82 80. . is >.6. AI.A.4 Unbiased Estimation and Risk Inequalities 187 estimate of EOTj(X)....p(0) is the first row of 1(0) and.>'j)/n.. . k.O) = exp{TT(x)O . . by Theorem 3.>'.4 8'A rl(O)= ( 8080 t )1 kx k . Multinomial Trials.6 with Xl. . without loss of generality. a'i X and Aj = P(X = j).)/1£.
. (72) then X is UMVU for J1 and ~ is UMVU for p."xl' Note that both sides of (3.42) are d x d matrices.4. If Xl. interpretability of the procedure. . LX. X n are i.8.4. ~ In Chapters 5 and 6 we show that in smoothly parametrized models..4..pd(OW = (~(O)) dxd .4. N(IL. Let 1/J(O) ~EO(T(X))dXl and "'(0) = (1/Jl(O).4. " il 'I.B)a > Ofor all .21.. we show how the infomlation inequality can be extended to the multiparameter case. Also note that ~ o unbiased '* VarOO > r'(O). But it does not follow that n~1 L:(Xi . . Using inequalities from prediction theory.3 hold and • is a ddimensional statistic. We establish analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal. .5 NON DECISION THEORETIC CRITERIA In practice. The Normal Case.1 Computation Speed of computation and numerical stability issues have been discussed briefly in Section 2. These and other examples and the implications of Theorem 3.d.2 + (J2. They are dealt with extensively 'in books on numerical analysis such as Dahlquist. We study the important application of the unbiasedness principle in survey sampling.4. reasonable estimates are asymptotically unbiased. Asymptotic analogues oftbese inequalities are sharp and lead to the notion and construction of efficient estimates.3 whose proof is left to Problem 3. 1 j Here is an important extension of Theorem 3.5. . We derive the information inequa!ity in oneparameter models and show how it can be used to establish that in a canonical exponential family.4. Then J (3. Suppose that the conditions afTheorem 3. • Theorem 3. 188 MeasureS of Performance Chapter 3 j Example 3. Summary. 3.. even if the loss function and model are well specified.4.X)2 is UMVU for 0 2 . The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure." .4. 3. and robustness to model departures.3 are 0 explored in the problems.. .42) where A > B means aT(A . .4. T(X) is the UMVU estimate of its expectation. features other than the risk function are also of importance in selection of a procedure.i.
(72) Example 2. 3. The improvement in speed may however be spurious since AI is costly to compute if d is largethough the same trick as in computing least squares estimates can be used.11). with ever faster computers a difference at this level is irrelevant. takes on the order of log log ~ steps (Problem 3. variance and computation As we have seen in special cases in Examples 3.3.1). The closed fonn here is deceptive because inversion of a d x d matrix takes on the order of d 3 operations when done in the usual way and can be numerically unstable. estimates of parameters based on samples of size n have standard deviations of order n.1. It is in fact faster and better to Y.5.1 / 2 .2 is the empirical variance (Problem 2.5 Nondecision Theoretic Criteria 189 Bjork. Its maximum likelihood estimate X/a continues to have the same intuitive interpretation as an estimate of /1.Section 3. and Anderson (1974). The interplay between estimated.4.3 and 3.2 is given by where 0.4. fI(j) = iPl) .1 / 2 is wasteful.AI (flU 1) (T(X) . say. It is clearly easier to compute than the MLE. On the other hand.2.A(BU ~l»)).4. < 1 then J is of the order of log ~ (Problem 3.1. solve equation (2. for this population of measurements has a clear interpretation. Closed form versus iteratively computed estimates At one level closed form is clearly preferable. It follows that striving for numerical accuracy of ord~r smaller than n.9) by.10). the NewtonRaphson method in ~ ~(J) which the jth iterate. But it reappears when the data sets are big and the number of parameters large. Of course.1. On the other hand. For instance.5 we are interested in the parameter III (7 • • This parameter. at least if started close enough to 8.2 Interpretability Suppose that iu the normal N(Il.3.4./ a even if the data are a sample from a distribution with . Then least squares estimates are given in closed fonn by equ<\tion (2. Gaussian elimination for the particular z'b Faster versus slower algorithms Consider estimation of the MLE 8 in a general canonical exponential family as in Section 2. a method of moments estimate of (A.1.P) in Example 2. consider the Gaussian linear model of Example 2. if we seek to take enough steps J so that III III < .2). the signaltonoise ratio. Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the COnstants involved.5.5. in the algOrithm we discuss in Section 2.2. We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory. It may be shown that.
. To be a bit formal. say () = N p" where N is the population size and tt is the expected consumption of a randomly drawn individuaL (b) We imagine that the random variable X* produced by the random experiment we are interested in has a distribution that follows a ''true'' parametric model with an interpretable parameter B. and both the mean tt and median v qualify.4) is for n large a more precise estimate than X/a if this model is correct. if gross errors occur. the HardyWeinberg parameter () has a clear biological interpretation and is the parameter for the experiment described in Example 2. For instance.e. The actual observation X is X· contaminated with "gross errOfs"see the following discussion. Gross error models Most measurement and recording processes are subject to gross errors.13. economists often work with median housing prices.')1/2 = l. 1 X~) could be taken without gross errors then p.. However.\)' distribution.3 Robustness Finally. the form of this estimate is complex and if the model is incorrect it no longer is an appropriate estimate of E( X) / [Var( X)] 1/2. but we do not necessarily observe X*. but there are several parameters that satisfy this qualitative notion. we may be interested in the center of a population. suppose that if n measurements X· (Xi. This idea has been developed by Bickel and Lehmann (l975a. what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). X n ) where most of the Xi = X:.)(p/>. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied. P(X > v) > ~). anomalous values that arise because of human error (often in recording) or instrument malfunction.1. We can now use the MLE iF2. This is an issue easy to point to in practice but remarkably difficult to formalize appropriately. Similarly. (c) We have a qualitative idea of what the parameter is. that is. amoog others. However. they may be interested in total consumption of a commodity such as coffee.5. See Problem 3. we could suppose X* rv p.190 Measures of Performance Chapter 3 mean Jl and variance 0'2 other than the normal. . we observe not X· but X = (XI •.5. we turn to robustness. E p.5. 1975b. suppose we initially postulate a model in which the data are a sample from a gamma. On the other hand. . Then E(X)/lVar(X) ~ (p/>... would be an adequate approximation to the distribution of X* (i.. For instance. Alternatively. We return to this in Section 5. We consider three situations (a) The problem dictates the parameter. 9(Pl. which as we shall see later (Section 5. but there are a few • • = . E P*).I1'. 1976) and Doksum (1975).4. v is any value such that P(X < v) > ~. We will consider situations (b) and (c). However. the parameter v that has half of the population prices on either side (fonnally. 3. But () is still the target in which we are interested.
In our new formulation it is the xt that obey (3.l..is not a parameter. and symmetric about 0 with common density f and d.. Then the gross error model issemiparametric.5.J) for all such P. we do not need the symmetry assumption. p(".5 Nondecision Theoretic Criteria ~ 191 wild values.l. The breakdown point will be discussed in Volume II. Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of X· . Consider the onesample symmetric location model P defined by ~ ~ i=l""l n . X~) is a good estimate.) are ij. .\ is the probability of making a gross error. has density h(y .. . . specification of the gross error mechanism.5. the sensitivity curve and the breakdown point. for example. However. = {f (.i. Unfortunately. Formal definitions require model specification. Xi Xt with probability 1 .remains identifiable. X n ) will continue to be a good or at least reasonable estimate if its value is not greatly affected by the Xi I Xt. A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the Ci still i.1') : f satisfies (3.5. We return to these issues in Chapter 6.i. so it is unclear what we are estimating.) for 1'1 of 1'2 (Problem 3. Example 1.l. X n ) knowing that B(Xi. the assumption that h is itself symmetric about 0 seems patently untenable for gross errors. Without h symmetric the quantity jJ. That is.1'). The advantage of this formulation is that jJ. and definitions of insensitivity to gross errors. the gross errors. make sense for fixed n. Further assumptions that are commonly made are that h has a particular fonn.h(x). but with common distribution function F and density f of the form f(x) ~ (1 . . (3. with probability .1') and (Xt.18). h = ~(7 <p (:(7) where K ::» 1 or more generally that h is an unknown density symmetric about O. and then more generally. Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6.5..)~ 'P C) + .1 ) where the errors are independent. F. where Y.d. where f satisfies (3. ISi'1 or 1'2 our goal? On the other hand.2) Here h is the density of the gross errors and . (3. X is the best estimate in a variety of senses.j) E P. .5. Y. If the error distribution is normal..2. if we drop the symmetry assumption. .Section 3. Now suppose we want to estimate B(P*) and use B(X 1 . . Again informally we shall call such procedures robust. We next define and examine the sensitivity curve in the context of the Gaussian location model. .2) for some h such that h(x) = h(x) for all x}. it is possible to have PC""f.j = PC""f. iff X" .1).d.1.d..A Y. Informally B(X1 . identically distributed. it is the center of symmetry of p(Po. This corresponds to. •. That is.2).. P. two notions.Xn are i. in situation (c).f. we encounter one of the basic difficulties in formulating robustness in situation (b). However. with common density f(x .5.
. .. Are there estimates that are less sensitive? A classical estimate of location based on the order statistics is the sample median X defined by ~ ~ X X(k+l) !(X(k) ifn~2k+1 + X(k+l)) ifn = 2k where X (I)' . . that is. . where Xl..1.O(Xl' . How sensitive is it to the presence of gross errors among XI. . . has the plugin property.1.. The empirical plugin estimate of 0 is 0 = O(P) where P is the empirical probability distribution.. in particular. Xl. Now fix Xl!'" .192 The sensitivity curve Measures of Performance Chapter 3 .1 for which the estimator () gives us the right value of the parameter and then we see what the introduction of a potentially deviant nth observation X does to the value of ~ We return to the location problem with e equal to the mean J.. (}(PU. 1') = 0. See Problem 3.16). not its location.od Problem 2. X"_l )J. Thus. ..1 for all p(/l. Often this is done by fixing Xl. X n ordered from smallest to largest. This is equivalent to shifting the Be vertically to make its value at x = 0 equal to zero.J.2.17). X n ) = B(F). that is.od because E(X.l. (}(X 1 . Then ~ ~ O. We are interested in the shape of the sensitivity curve.L = E(X).4). X(n) are the order statistics. At this point we ask: Suppose that an estimate T(X 1 . therefore.1. .j)) = J. I .5.. where F is the empirical d. ~ ) SC( X. See Section 2. . +X"_I+X) n = x. The sample median can be motivated as an estimate of location on various grounds. 0) = n[O(xl. X" 1'). X n ) . . See (2. and it splits the sample into two equal halves. • ." . P.. We start by defining the sensitivity curve for general plugin estimates. x) . . Suppose that X ~ P and that 0 ~ O(P) is a parameter.L. .. X =n (Xl+ . .32. . .f) E P. . . (2. . Xnl as an "ideal" sample of size n .Xn ? An interesting way of studying this due to Tukey (1972) and Hampel (1974) is the sensitivity curve defined as follows for plugin estimates (which are well defined for all sample sizes n)..L = (}(X l 1'.. the sample mean is arbitrarily sensitive to gross errOfa large gross error can throw the mean off entirely. The sensitivity curoe of () is defined as ~ ~ ~ ~ ~ ~ ~ ~ ~ ...5. In our examples we shall.··. (i) It is the empirical plugin estimate of the population median v (Problem 3. Because the estimators we consider are location invariant.Xnl so that their mean has the ideal value zero.2... 1 Xnl represents an observed sample of size nl from P and X represents an observation that (potentially) comes from a distribution different from P.. . . is appropriate for the symmetric location model.X.14.. shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas. we take I' ~ 0 without loss of generality. .. .I. SC(x..
..5. n = 2k + 1 is odd and the median of Xl.Section 3. we obtain .5. SC(x) SC(x) x x Figure 3.2. The sensitivity curve of the median is as follows: If. .5. x) nx(k) nx = _nx(k+l) for for < x(k) for x(k) < x x <x(k+I) nx(k+l) x> x(k+l) where xCI) < . SC(x.5. Although the median behaves well when gross errors are expected.5.. The sensitivity curves of the mean and median. say.32 and 3. its perfonnance at the nonnal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of X.1) is the Laplace (double exponential) density f(x) = 1 exp{l x l/7}. 27 a density having substantially heavier tails than the normaL See Problems 2..Xn_l = (X(k) + X(ktl)/2 = 0. x is an empirical (iii) The sample median is the MLE when we assume the common density f(x) of the errors {cd in (3.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when X is near Ji.. A class of estimates providing such intermediate behavior and including both the mean and .1).. XnI. v coincides with fL and plugin estimate of fL. The sensitivity curve in Figure 3. . < X(nl) are the ordered XI..1.9.5 Nondecision Theoretic Criteria 193 (ii) In the symmetric location model (3.
+ X(n[nuj) n . Xa  X..) SC(x) X x(n[naJ) .2[nnl where [na] is the largest integer < nO' and X(1) < . the mean is better than any trimmed mean with a > 0 including the median.5. which corresponds approximately to a = This can be verified in tenns of asymptotic variances (MSEs}see Problem 5. 4.2. and Tukey (1972). < X (n) are the ordered observations. Haber. Hampel.8) than the Gaussian density. The estimates can be justified on plugin grounds (see Problem 3...4.5. . . That is. for example. Xa.. If f is symmetric about 0 but has "heavier tails" (see Problem 3.5. . f = <p.5. . and Hogg (1974). The range 0. Note that if Q = 0.. whereas as Q i ~.4.1. There has also been some research into procedures for which a is chosen using the observations. We define the a (3. For more sophisticated arguments see Huber (1981). the sensitivity Jo ~ CUIve of an Q trimmed mean is sketched in Figure 3. Figure 3.l)Q] and the trimmed mean of Xl. 4e'x'. then the trimmed means for a > 0 and even the median can be much better than the mean. the sensitivity curve calculation points to an equally intuitive conclusion.2[nn]/n)I.10 < a < 0. Intuitively we expect that if there are no gross errors. Xu = X." see Jaeckel (1971). I I ! . f(x) = 1/11"(1 + x 2 ). Huber (1972).20 seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the nonnal distribution.. that is. However. For instance. f(x) = or even more strikingly the Cauchy..2.5). l Xnl is zero.3) Xu = X([no]+l) + . Which a should we choose in the trimmed mean? There seems to be no simple answer.194 Measures of Performance Chapter 3 the median has ~en known since the eighteenth century.1 again.5. suppose we take as our data the differences in Table 3. i I . The sensitivity curve of the trimmed mean. Bickel. infinitely better in the case of the Cauchysee Problem 5.1. See Andrews. the Laplace density. .5. If [na] = [(n . . by <Q < 4. For a discussion of these and other forms of "adaptation. (The middle portion is the line y = x(1 . we throw out the "outer" [na] observations on either side and take the average of the rest. Let 0 trimmed mean. Rogers.
Xu is called a Ctth quantile and X. Let B(P) = Var(X) = 0"2 denote the variance in a population and let XI.5. The IQR is often calibrated so that it equals 0" in the N(/l.5.25). then the variance 0"2 or standard deviation 0.1.25 are called the upper and lower quartiles. &) (3..Section 3. Spread. < x(nI) are the ordered Xl. Similarly. X n is any value such that P( X < x n ) > Ct. say k. A fairly common quick and simple alternative is the IQR (interquartile range) defined as T = X. P( X > xO') > 1 . 0 Example 3.XnI.75 . . Let B(P) = X o deaote a ath quantile of the distribution of X.. If we are interested in the spread of the values in a population. If no: is an integer.2 . . 0 < 0: < 1.10).5 Nondecision Theoretic Criteria 195 Gross errors or outlying data points affect estimates in a variety of situations.is typically used. and let Xo. We next consider two estimates of the spread in the population as well as estimates of quantiles.1.X.2.I L:~ I (Xi .75 and X. Xo: = x(k). To simplify our expression we shift the horizontal axis so that L~/ Xi = O..25. = ~ [X(k) + X(k+l)]' and at sample size n  1. n + 00 (Problem 3.. where Xu has 100Ct percent of the values in the population on its left (fonnally. Write xn = 11. Then a~ = 11.674)0".5.Ct). Example 3. denote the o:th sample quantile (see 2. other examples will be given in the problems. the nth sample quantile is Xo. X n denote a sample from that population. ..1x.16). Quantiles and the lQR. .742(x. 0"2) model. SC(X. where x(t) < . then SC(X.X)2 is the empirical plugin estimate of 0. &2) It is clear that a:~ is very sensitive to large outlying Ixi values. . Because T = 2 x (..75 .1 L~ 1 Xi = 11.X. the scale measure used is 0. .5.4) where the approximation is valid for X fixed.
25) and the sample IQR is robust with respect to outlying gross errors x.2.1. ~ ~ ~ Next consider the sample lQR T=X.5. ~ . x (t) that (Problem 3. Ii I~ .1 denotes the empirical distribution based on Xl.. and computability. An exposition of this point of view and some of the earlier procedures proposed is in Hampel.196 thus.: : where F n . F) and~:z.O(F)] = . . The sensitivity of the parameter B(F) to x can be measured by the influence function.15) I[t < xl).25· Then we can write SC(x. = <1[0((1  <)F + <t. Summary.(x.Xnl. median. although this difficulty appears to be being overcome lately. xa: is not sensitive to outlying x's. which is defined by IF(x. SC(x. r: ~i II i . . Discussion. and Stabel (1983).O. Most of the section focuses on robustness. Rousseuw.. 1'75) . for 2 Measures of Performance Chapter 3 < k < n . It plays an important role in functional expansions of estimates. Unfortunately these procedures tend to be extremely demanding computationally.(x. We will return to the influence function in Volume II. i) = SC(x..5. ~' is the distribution function of point mass at x (. Other aspects of robustness. It is easy to see ~ ! '.5... We discuss briefly nondecision theoretic considerations for selecting procedures including interpretability. The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean. trimmed mean. in particular the breakdown point.F) dO I. Ronchetti. 1'0) ~[Xlkl) _ xlk)] x < XlkI) 2 '  2 2 1 Ix _ xlkl] XlkI) < x < xlk+11 '  (3. discussing the difficult issues of identifiability.5) 1 [xlk+11 _ xlkl] x> xlk+1) ' Clearly.F) where = limIF.) . have been studied extensively and a number of procedures proposed and implemented.6. 1'.O. and other procedures. 0. .SC(x.• .75 X. o Remark 3.
3). Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin. (c) In Example 3.. a(O) > 0. Suppose IJ ~ 71'(0). where OB is the Bayes estimate of O. In the Bernoulli Problem 3. change the loss function to 1(0. Show that (h) Let 1(0.2 preceeding. Consider the relative risk e( J.1.6 Problems and Complements 197 3. 3. Find the Bayes risk r(7f. ..2.r 7f(0)p(x I O)dO. Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior.24. density. then the improper Bayes rule for squared error loss is 6"* (x) = X.2 with unifonn prior on the probabilility of success O.O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite. give the MLE of the Bernoulli variance q(0) = 0(1 .O) = p(x I 0)71'(0) = c(x)7f(0 .6 PROBLEMS AND COMPLEMENTS Problems for Section 3. In Problem 3. the parameter A = 01(1 . Hint: See Problem 1. which is called the odds ratio (for success). (3(r. x) where c(x) =.2 o e 1.. ~ ~ 4.4. .O)P. (72) sample and 71" is the improper prior 71"(0) = 1. is preferred to 0.0)' and that the prinf7r(O) is the beta. 1 X n is a N(B. E = R. Show that if Xl. That is. 0) is the quadratic loss (0 .2. J) ofJ(x) = X in Example 3. if 71' and 1 are changed to 0(0)71'(0) and 1(0.. 0) = (0 .3.4. J). 0) the Bayes rule is where fo(x.Xn be the indicators of n Bernoulli trials with success probability O.0)/0(0).0 E e.a)' jBQ(1. (a) Show that the joint density of X and 0 is f(x. respectively. . In some studies (see Section 6.2. under what condition on S does the Bayes rule exist and what is the Bayes rule? 5.Section 3. where R( 71') is the Bayes risk.0). Suppose 1(0.0) and give the Bayes estimate of q(O). 71') = R(7f) /r( 71'. we found that (S + 1)/(n + 2) is the Bayes rule. 6. 2. the Bayes rule does not change. 71') as . Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule. Check whether q(OB) = E(q(IJ) I x). Let X I. Compute the limit of e( J.2. lf we put a (improper) uniform prior on A. (X I 0 = 0) ~ p(x I 0). = (0 o)'lw(O) for some weight function w(O) > 0. s).
Suppose we have a sample Xl. do not derive them. Let q( 0) = L:.. . find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x). On the basis of X = (X I.\(B) = r . Assume that given f). 76) distribution.\(B) = I(B. For the following problems. B = (B" . . (e) a 2 + 00.3.a)' / nj~. (E. compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O. One such function (Lindley. Xl. a r )T. c2 > 0 I. a) = Lj~. .E).ft B be the difference in mean effect of the generic and namebrand drugs. a = (a" .2. E). (Bj aj)2. Measures of Performance Chapter 3 (b) n . I j l . . E).(0) should be negative when 0 E (E. Hint: If 0 ~ D(a). 1). where ao = L:J=l Qj. to a close approximation.) with loss function I( B. . a) = (q(B) . where is known.. (a) If I(B. Bioequivalence trials are used to test whether a generic drug is. . given 0 = B are multinomial M(n. i: agency specifies a number f > a such that if f) E (E.O)  I(B. I ~' . I) = difference in loss of acceptance and rejection of bioequivalence.aj)/a5(ao + 1). and that 0 has the Dirichlet distribution D( a).. 7. B). where CI. B.198 (a) T + 00. Var(Oj) = aj(ao .. then E(O)) = aj/ao.. a) = [q(B)a]'. then the generic and brandname drugs are. Suppose tbat N lo .2(d)(i) and (ii). " X n of differences in the effect of generic and namebrand effects fora certain drug. (a) Problem 1.. defined in Problem 1.15. (.. (c) We want to estimate the vector (B" .19(c).3. E) and positive when f) 1998) is 1. 0) and I( B. . B .Xu are Li. find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule. .I(d). (b) Problem 1.• N. by definition.9j ) = aiO:'j/a5(aa + 1).=l CjUj .) (b) When the loss function is I(B.B.  I .... (c) Problem 1. equivalent to a namebrand drug. and Cov{8j . Note that ). 8.d. There are two possible actions: 0'5 a a with losses I( B.3.O'5). .. . • 9. bioequivalent. 00. . and that 9 is random with aN(rlO. Cr are given constants.)T.exp {  2~' B' } . . where E(X) = O.Xn ) we want to decide whether or not 0 E (e. A regulatory " ! I i· . N{O. (Use these results... Set a{::} Bioequivalent 1 {::} Not Bioequivalent .. Find the Bayes decision rule. Let 0 = ftc .
Hint: See Example 3. . Yo) = inf g(xo.2 show that L(x.\(±€. . .. v) > ?r/(l ?r) is equivalent to T > t. (a) Show that a necessary condition for (xo. yo) is a saddle point of 9 if g(xo.. respectively..Yp).2. Discuss the preceding decision rule for this "prior. In Example 3. Note that . 2. Con 10. and 9 is twice differentiable.Section 3. Yo) is in the interior of S x T.6. representing x = (Xl. Yo) = sup g(x.. .) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l((). RP. S x T ~ R. A point (xo.y = (YI.. (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00"). (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3.Xm). Yo) to be a saddle point is that.. Suppose 9 .2.1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3. 0) and l((). 0. ..17). 1) are not constant.6. (b) the linear Bayes estimate of fl. + = 0 and it 00").2..1. For the model defined by (3. Any two functions with difference "\(8) are possible loss functions at a = 0 and I. S T Suppose S and T are subsets of Rm.Yo) Xi {}g {}g Yj = 0.) + ~}" .6 Problems and Complements 199 where 0 < r < 1.1) < (T6(n) + c'){log(rg(~)+." (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b).3.3 1. (Xv.Yo) = {} (Xo. y).16) and (3. {} (Xo. (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3. find (a) the linear Bayes estimate of ~l.
i..l.d.=1 CijXiYj with XE 8 m . ..n)/(n + . o')IR(O.y) ~ L~l 12. (b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is..)w". . B ) and suppose that Lx (B o.i) =0.d< p. and o'(S) = (S + 1 2 . • " = '!.. Yc Yd foraHI < i.c." 1 Xi ~ l}. 0») equals 1 when B = ~. Yo . d) = (a) Show that if I' is known to be 0 (!.j=O. prior. and 1 .a.a)'.(X) = 1 if Lx (B o. I • > 1 for 0 i ~. iij. Suppose i I(Bi. 2 J'(X1 . &* is minimax.0). I}. the conclusion of von Neumann's theorem holds.1(0.0•• ). Let S ~ 8(n.n12. f.<7') and 1(<7'.. y ESp..PIB = Bo]. .. X n be i. (b) Show that limn_ooIR(O.. 5. and show that this limit ( .2. A = {O. B1 ) >  (l.= 1 .b < m. Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g. n: I>l is minimax.200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo. BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1. . a) ~ (0 .2. the test rule 1511" given by o. Suppose e = {Bo.. I j 3.n12).i. • . Let Lx (Bo.:.n). 4. (b) Suppose Sm = (x: Xi > 0..j)=Wij>O. Bd.Yo) > 0 8 8 8 X a 8 Xb 0.thesimplex. 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = . f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta. fJ( . Thus.._ I)'. I(Bi.1 < i < m.Xn ) .o•• ) =R(B1 . B1 ) Ip(X. B ) ~ p(X. N(I'. o(S) = X = Sin. 1 <j. Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo. . L:. Let X" .andg(x. and that the mndel is regnlar. Hint: See Problem 3.
where (Ill...tj.. .. . PI. M (n. . = tj. . Hint: Consider the gamma. respectively.. Let k . . h were t·.j.Pi)' Pjqj <j < k. then is minimax for the loss function r: l(p. Hint: (a) Consider a gamma prior on 0 = 1/<7'. b. o(X 1..a? 1. hence.. . Xl· (e) Show that if I' is unknown. 0') = (1. . X is no longer unique minimax.a < b" = 'lb... (jl..). Xk)T. . . o'(X) is also minimax and R(I'... that both the MLE and the estimate 8' = (n . . . . ftk) = (Il?l'"'' J1. X k be independent with means f. 0 1.k... For instance. . show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible. . 9.1. .\.. . that is. . a) = (. . Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l . Let Xl. Show that X has a Poisson (. N k ) has a multinomial. . ik) is smaller than that of (i'l"'" i~). / . 1 = t j=l (di . Remark: Stein (1956) has shown that if k > 3. .. . . (c) Use (B..\) distribution and 1(. o(X) ~ n~l L(Xi . jl~ < .).29). J. .~~n X < Pi < < R(I'. ik). 00. Jlk.I)..2. d = (d).) . Show that if = (I'I''''..d) where qj ~ 1 .I'kf. Let Xi beindependentN(I'i. Rk) where Rj is the rank of Xi.P. i=l then o(X) = X is minimax.j. dk)T..?. ttl . 1 < j < k. . > jm). < im. ik is an arbitrary unknown permutation of 1.. 1 <i< k. X k ) = (R l . Show that if (N). .X)' are inadmissible.. I' (X"".\. ..X)' is best among all rules of the form oc(X) = c L(Xi .tn I(i.. " a.12..k} I«i). d) = L(d. . and iI. LetA = {(iI. prior.\ . . < /12 is a known set of values. . f(k. an d R a < R b.Pi.. Write X  • "i)' 1(1'....l . 8. .. Permutations of I.» = Show that the minimax rule is to take L l. .. 1). See Problem 1. j 7. 6.Section 3_6 Problems and Complements 201 (b) If I' ~ 0..3. b = ttl. . distribution.. .j.o) for alII'. . . Then X is nurnmax.X)' and.1)1 L(Xi . See Volume II.
4.15.. (b) How does the risk of these estimates compare to that of X? 12. . Let K(po.2. . For a given x we want to estimate the proportion F(x) of the population to the left of x. Hint: Consider the risk function of o~ No. X has a binomial. ! p(x) = J po(x)1r(B)dB...3.I)!. n) be i. B(n..1'»2 of these estimates is bounded for all nand 1'._.=1 Show that the Bayes estimate is (Xi . 1 ..2. I: I ir'z . .q)1r(B)dB. . + l)j(n + k). LB= j 01 1...1r) = Show that the marginal density of X. ~ I I .i f X> .a) and let the prior be the uniform prior 1r(B" .. Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss. ..202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI. Suppose that given 6 ~ B. ... Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B.. X n be independent N(I'...i.. 14. Let X" . 10. See also Problem 3.d with density defined in Problem 1. .. of Xi < X • 1 + 1 o.8.. Define OiflXI v'n < d < v'n d v'n d d X .B). I: • .... M(n.. 13..BO)T. See Problem 3. BO l) ~ (k . with unknown distribution F. d X+ifX 11.. X = (X"".. Let the loss " function be the KullbackLeibler divergence lp(B. . . . B j > 0. a) is the posterior mean E(6 I X). v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) . 1)..d... distribution.. q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q. Pk.. Suppose that given 6 = B = (B ... Let Xi(i = 1. J K(po.Xo)T has a multinomial. distribution.. . B).
then R(O.~). X n be the indicators of n Bernoulli trials with success probability B.X is called the mutual information between 0 and X. 0) and B(n. Prove Proposition 3. We shall say a loss function is convex. if I(O. a < a < 1. Let X . N(I". Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij. .1 .6 Problems and Complements 203 minimizes k( q. Show that in theN(O. . . for any ao. . suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8). Show that B q(ry) = Bp(h. Show that if 1(0. 15.f [E.a )I( 0.k(p.4." A density proportional to VJp(O) is called Jeffrey's prior. 2. Reparametrize the model by setting ry = h(O) and let q(x.. Give the Bayes rules for squared error in these three cases. K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality.2) with I' . Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable.1"0 known. h1(ry)) denote the model in the new parametrization. 4. ry). Jeffrey's priors are proportional to 1.. then That is. Let A = 3. J).12) for the two parametrizations p(x. (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations.'? /1 as in (3. X n are Ll. .4.. . 7f) and that the minimum is I(J. 0. Show that X is an UMVU estimate of O. Hint: k(q. Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient.1 E: 1 (Xi  J. p(X) IO. . . N(I"o. 0) cases. (ry»).). Suppose X I. respectively. . .aao + (1 . and O~ (1 .4.tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible. then E(g(X) > g(E(X». 0) and q(x.x =. 0) with 0 E e c R.. Problems for Section 3. p( x. {log PB(X)}] K(O)d(J. It is often improper. that is.O)!.. a) is convex and J'(X) = E( J(X) I I(X».Section 3. R.alai) < al(O. ao) + (1 . Fisher infonnation is not equivariant under increasing transformations of the parameter. a. a" (J. Equivariance. ... Jeffrey's "Prior. (b) Equivariance of the Fisher Information Bound. Let X I.d. J') < R( (J. ry) = p(x. Show that (a) (b) cr5 = n.4 1. K) . S. the Fisher information lower bound is equivariant.
\ = a + bB is UMVU for estimating . 1).o.. Show that 8 = (Y . Find lim n.8. 0) = Oc"I(x > 0).1 and variance 0.5(b). compute Var(O) using each ofthe three methods indicated. 00. (a) Find the MLE of 1/0.4. 14. 204 Hint: See Problem 3. find the bias o f ao' 6.2 (0 > 0) that satisfy the conditions of the information inequality. 9. and . .. In Example 3.p) is an unbiased estimate of (72 in the linear regression model of Section 2. .3. ( 2). Suppose (J is UMVU for estimating fJ. Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0.ZDf3)/(n . ~ ~ ~ (a) Write the model for Yl give the sufficient statistic. 7.1.. B(O.\ = a + bOo 11.4.4. Let F denote the class of densities with mean 0. P.Ii . Does X achieve the infonnation . . .. . I I . Show that assumption I implies that if A . Establish the claims of Example 3.4.2. i = 1. 8. For instance.ZDf3)T(y . =f.Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate. n. . Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1). Hint: Use the integral approximation to sums.' . . . Show that a density that minimizes the Fisher infonnation over F is f(x. . Ct.XN }.. . Let a and b be constants.P. . . Y n in twoparameter canonical exponential form and (b) Let 0 = (a. a ~ i 12.. Show that if (Xl. ...11(9) as n ~ give the limit of n times the lower bound on the variances of and (J. X n be a sample from the beta. Hint: Consider T(X) = X in Theorem 3. distribution. Show that . • 13... then = 1 forallO. Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered. then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· . X n ) is a sample drawn without replacement from an unknown finite population {Xl. {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J. . 2 ~ ~ 10.{x : p(x. 0) for any set E.13 E R. . Suppose Yl .. a ~ (c) Suppose that Zi = log[i/(n + 1)1. Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji. Let X" .
16. UN are as in Example 3. B(n. : 0 E for (J such that E((J2) < 00.". P[o(X) = 91 = 1..4. .. = N(O.~ for all k.. 7. Define ~ {Xkl.. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~. . Show that is not unbiasedly estimable. even for sampling without replacement in each stratum... then E( M) ~ L 1r j=1 N J ~ n. (b) Deduce that if p.Xkl. that ~ is a Bayes estimate for O. k=l K ~1rkXk. 15.X)/Var(U). (See also Problem 1.p)..) Suppose the Uj can be relabeled into strata {xkd. :/i.4. k = 1.} and X =K  1".3. (c) Explain how it is possible if Po is binomial.1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o.511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ . Suppose X is distributed accordihg to {p.l. . B(n.• .4. . ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss. (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl. K.6 Problems and Complements 205 (b) The variance of X is given by (3. Show that X k given by (3. G 19. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population. .6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U. X is not II Bayes estimate for any prior 1r.4)...~).Section 3. 18. = 1. Let 7fk = ~ and suppose 7fk = 1 < k < K. 'L~ 1 h = N. distribution. Stratified Sampling. 20. Let X have a binomial. 1 < i < h. More generally only polynomials of degree n in p are unbiasedly estimable.1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n. Suppose UI. if and only if. XK. . Show that if M is the expected sample size. . _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk. 17. . 8).4.
• Problems for Section 35 I is an integer. for all adx 1./(9)a [¢ T (9)a]T II (9)[¢ T (9)a].. = 2k is even. If n IQR.. B) be the uniform distribution on (0. however.75' and the IQR. and we can thus define moments of 8/8B log p(x.5) to plot the sensitivity curve of the 1. Let X ~ U(O. is an empirical plugin estimate of ~ Here case. (iii) 2X is unbiased for . the upper quartile X.25 and net =k 3.25 and (n . B) is differentiable fat anB > x. that is. Yet show eand has finite variance. B») ~ °and the information bound is infinite.9)' 21. 22. Note that logp(x. with probability I for each B. Show that the a trimmed mean XCII. Regularity Conditions are Needed for the Information Inequality. J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6. for all X!.4. Hint: It is equivalent to show that. Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3. Show that. 1 2. 5. give and plot the sensitivity curves of the lower quartile X. Show that the sample median X is an empirical plugin estimate of the population median v. B).4. Prove Theorem 3. B).3.206 Hint: Given E(ii(X) 19) ~ 9. An estimate J(X) is said to be shift or translation equivariant if.5."" F (ij) Vat (:0 logp(X.25.. Var(a T 6) > aT (¢(9)II(9). 1 X n1 C. If a = 0. If a = 0. give and plot the sensitivity curve of the median. . . 4. E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) .4.l)a is an integer. use (3. T .
One reasonable choice for k is k = 1. is symmetrically distributed about O. The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00. .. and the HodgesLehmann estimate. It has the advantage that there is no trimming proportion Q that needs to be subjectively specified.6 Problems and Complements 207 It is antisymmetric if for all Xl. then (i. Xu. X. . (b) Show that   . (7= moo 1 IX. plot the sensitivity curves of the mean.5 and for (j is. and xiflxl < k kifx > k kifx<k. .JL) where JL is unknown and Xi . (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1.3. k t 0 to the median. . For x > .. '& is an estimate of scale.). X are unbiased estimates of the center of symmetry of a symmetric distribution.. F(x .30. J is an unbiased estimate of 11. X a arc translation equivariant and antisymmetric.  XI/0.30.5. The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj).1.. .. xH L is translation equivariant and antisymmetric.67.e. 7. trimmed mean with a = 1/4. Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite.. I)order statistics). In .(a) Show that X. (b) Suppose Xl.1.03. Its properties are similar to those of the trimmed mean. . . median.03 (these are expected values of four N(O. . Deduce that X. i < j..Section 3.) ~ 8. X n is a sample from a population with dJ.6.. <i< n Show that (a) k = 00 corresponds to X. (See Problem 3.
Let 1'0 = 0 and choose the ideal sample Xl. . Use a fixed known 0"0 in place of Ci. and "" "'" (b) the Iratio of Problem 3. . Xk) is a finite constant.O)/ao) where fo(x) for Ixl for Ixl <k > k. Show that SC(x. ii. thenXk is the MLEofB when Xl.. the standard deviation does not exist.) has heavier tails than f() if g(x) is above f(x) for Ixllarge.5. we say that g(. . This problem may be done on the computer. For the ideal sample of Problem 3.x}2. Location Parameters. 11. where S2 = (n . In the case of the Cauchy density.. (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr. (b) Find thetml probabilities P(IXI > 2)..5. If f(·) and g(. The (student) tratio is defined as 1= v'n(x I'o)/s. 9.. iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00.• 13.~ !~ t . Let JJo be a hypothesized mean for a certain population. plot the sensitivity curve of (a) iTn . 00.X. 1 Xnl to have sample mean zero. with density fo((. The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \. Laplace. 00. P(IXI > 3) and P(IXI > 4) for the nonnal. then limlxj>00 SC(x.. . . n is fixed. In what follows adjust f and 9 to have v = 0 and 'T = 1.d. (d) Xk is translation equivariant and antisymmetric (see Problem 3. we will use the IQR scale parameter T = X.1)1 L:~ 1(Xi . = O." .) are two densities with medians v zero and identical scale parameters 7. (e) If k < 00. X 12. thus. with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O. Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00.7(a). •• . Let X be a random variable with continuous distribution function F.6). Suppose L:~i Xi ~ . and is fixed.i. (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10.25. and Cauchy distributions.75 . .208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0. [.5.11.Xn arei.
. . i/(F)] and. (a) Show that if F is symmetric about c and () is a location parameter. Show that if () is shift and location equivariant. and trimmed population mean JiOl (see Problem 3. where T is the median of the distributioo of IX location parameter. An estimate ()n is said to be shift and scale equivariant if for all xl.. ~ ~ 15. ~ ~ 8n (a +bxI.t. xnd.67 and tPk is defined in Problem 3. (b) Show that the mean Ji. (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :. then v(F) and i/( F) are location parameters. 0 < " < 1. 8) = IF~ (x._.5.5) are location parameters. Show that J1. ~ (a) Show that the sample mean. let H(x) be the distribution function whose inverse is H'(a) = ![xax. . (e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval. sample median... and note thatH(xi/(F» < F(x) < H(xv(F». · . Hint: For the second part.:.(F) = !(xa + XI"). then for a E R. SC(a + bx. b > 0.Xn_l) to show its dependence on Xnl (Xl... median v._.. c + dO. .1: (a) Show that SC(x.. In this case we write X " Y. . . Let Y denote a random variable with continuous distribution function G..).. Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F).8. v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v. (jx. If 0 is scale and shift equivariant.xn .. .8. Also note that H(x) is symmetric about zero. i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F).F).. the ~ = bdSC(x. any point in [v(F). n (b) In the following cases. then ()(F) = c. b > 0. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y. c E R.Section 3. it is called a location parameter. X is said to be stochastically smaller than Y if = G(t) for all t E R.xn ). Fn _ I ).O. F(t) ~ P(X < t) > P(Y < t) antisymmctric. 8. .) 14.t )) = 0. 8) and ~ IF(x.(k) is a (d) For 0 < " < 1. F) lim n _ oo SC(x. ~ = d> 0..5. a + bx. let v" = v. a. i/(F)] is the value of some location parameter. ~ (b) Write the Be as SC(x. 8. 8. ~ se is shift invariant and scale equivariant.5. x n _. compare SC(x.6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :.) That is.]. if F is also strictly increasing.(F): 0 <" < 1/2}. ([v(F). .  vl/O. and ordcr preserving. In Remark 3. and sample trimmed mean are shift and scale equivariant.a +bxn ) ~ a + b8n (XI.
Show that in the bisection method. IlFfdF(x). We define S as the . . I : . 81" < 0. • • . In the gross error model (3. . ) (a) Show by example that for suitable t/J and 1(1(0) . in order to be certain that the Jth iterate (j J is within e of the desired () such that 'IjJ( fJ) = 0. This is. Let d = 1 and suppose that 'I/J is twice continuously differentiable. where B is the class of Borel sets.1/. B E B.field generated by SA. (e) Does n j [SC(x. is not identifiable. and (iii) preceding? ~ 16.1) Hint: (a) Try t/J(x) = Alogx with A> 1.210 (i) Measures of Performance Chapter 3 O(F) ~ 1'1' ~ J . {BU)} do not converge.8) . 18. C < 00. I I Notes for Section 3. we shall • I' . • . 0.0 > 0 (depending on t/J) such that if lOCO) .B ~ {F E :F : Pp(A) E B}. The NewtonRaphson method in this case is .OJ then l(I(i) ..Bilarge enough.1 is commonly known as the Cramer~Rao inequality. ~ I(x  = X n .4 I .BI < C1(1(i. (ii) O(F) = (iii) e(F) "7. we in general must take on the order of log ~ steps. Assume that F is strictly increasing. Because priority of discovery is now given to the French mathematician M.3 (1) A technical problem is to give the class S of subsets of:F for which we can assign probability (the measurable sets). (1) The result of Theorem 3. . F)! ~ 0 in the cases (i). 3.t is identifiable. . then fJ.7 NOTES Note for Section 3. A.4.2). (b) If no assumptions are made about h. (b) ~ ~ . (ii). . Frechet.5.. ~ ~ 17. consequently. then J.rdF(l:). also true of the method of coordinate ascent.I F(x. show that (a) If h is a density that is symmetric about zero. and we seek the unique solution (j of 'IjJ(8) = O. · .' > 0. (b) Show that there exists.
K. 2nd ed. AND E. 1972. AND E. P.. A. 1994. J. F. Statistical Decision Theory and Bayesian Analysis New York: Springer. BICKEL. Statist.8 REFERENCES ANDREWS. BAXTER. W. HAMPEL. J. J. LEHMANN.. P. "Descriptive Statistics for Nonparametric Models. >. LEHMANN. Mathematical Analysis. J.. BERGER." Ann. H. DOWE. DoKSUM. ofStatist.] = Jroo.. AND E. AND C. Statist. R. Optimal Statistical Decisions New York: McGrawHill. A. in Proceedings of the Second Pa. 2. "Descriptive Statistics for Nonparametric Models.10451069 (1975b). H. I. LEHMANN. 1122 (1975). (3) The continuity of the first integral ensures that : (J [rO )00 . SMITH. P. T. W. Bayesian Theory New York: Wiley. G.. D." Scand. 11391158 (1976).8).thematical Methods in Risk Theory Heidelberg: Springer Verlag. "Descriptive Statistics for Nonparametric Models.Section 3.. BICKEL. DAHLQUIST. AND A. Ma. F.8 References 211 follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement. BICKEL. Point Estimation Using the KullbackLeibler Loss Function and MML. p(x.)dXd>. II. Math.. AND J. "Unbiased Estimation in Convex. OLIVER. III. 0. J.(T(X)) = 00. BERNARDO.. Robust Estimates of Location: Sun>ey and Advances Princeton. 3. M.. F.cific Asian Conference on Knowledge Discovery and Data Mintng Melbourne: SpringerVerlag. 1. D. Introduction. 1998.. Statist. Numen'cal Analysis New York: Prentice Hall. P. NJ: Princeton University Press.4. P." Ann. J. DE GROOT.. S. Dispersion. 1969. . H. Reading.." Ann. WALLACE.. ANDERSON. 3." Ann. P. AND E. BICKEL. M. HUBER. "Measures of Location and AsymmeUy. 1974. BOHLMANN. M. LEHMANN.. AND N. 1970. Jroo T(x) [~p(X' 0)] dx roo Joo 8(J 00 00 00 for all (J whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in (4) The finiteness of Var8(T(X)) and f(O) imply that 1/J'(0) is finite by the covariance interpretation given in (3. Families.. 10381044 (l975a). 1974. APoSTOL. MA: AddisonWesley. Statist. 1985. (2) Note that this inequality is true but uninteresting if f(O) = 00 (and 1/J'(0) is finite) or if Var.. TUKEY. R. BJORK. 15231535 (1969). 4. Jroo T(x) :>. M. 3. ROGERS. L... Location. A.. 40. BICKEL.
and Economics Reading." Statistical Science." Ann. • NORBERG. StatiSl. B. "Adaptive Robust Procedures. Stlltist. SAVAGE. Theory ofPoint Estimation. Y.." Scand. 13. L. FREEMAN. 538542(1973). E. M.. 909927 (1974). . AND P. and Probability. "Robust Statistics: A Review. LINDLEY.." J.. E. The Foundations afStatistics New York: J.. S. MA: AddisonWesley. 43. L. RONCHEro. • • 1 ." Statistica Sinica. Part I: Probability. 136141 (1998).222 (1986). 49.. "Decision Analysis and BioequivaIence Trials. YU. Math. R. Math. S. W. WUSMAN.. Robust Statistics New York: Wiley. 69. 1948. H. Part II: Inference. Assoc." Ann. R. HUBER. 1. J.. "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution.212 Measures of Performance Chapter 3 HAMPEL." J. University of California Press. Stu/ist. F. Statist. 1954. . LEHMANN. (2000).. E. 1.. RoyalStatist.10411067 (1972). Robust Statistics: The Approach Based on Influence Functions New York: J. Programming. Cambridge University Press. London: Oxford University Press. Soc. 1998. "The Influence Curve and Its Role in Robust Estimation. D.. P. D. "Stochastic Complexity (With Discussions):' J. STAHEL. CASELLA. 1986. Statist." Proc. 69. 383393 (1974). London. 1959. R. 223239 (1987). B. AND G. Soc." Ann. Wiley & Sons. "Boostrap Estimate of KullbackLeibler Information for Model Selection. E. A. Mathematical Methods and Theory in Games. WALLACE.. J. SHIBATA. "Model Selection and the Principle of Mimimum Description Length. 1981. "On the Attainment of the CramerRao Lower Bound. MA: AddisonWesley. 240251 (1987)." J. L.. L. HANSEN. Introduction to Probability and Statistics from a Bayesian Point of View. 2Q4. "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification. 197206 (1956). Royal Statist. 7. C.. Third Berkeley Symposium on Math. 1972. 2nd ed.. Amer. KARLIN. HOGG. 1986. AND W.. ROUSSEUW. HAMPEL. Stall'. 49. "Estimation and Inference by Compact Coding (With Discussions). j I JEFFREYS. Exploratory Data Analysis Reading. R.. LINDLEY. Theory ofProbability.. Actuarial J. Amer. "Robust Estimates of Location. . 1965." J. H. Testing Statistical Hypotheses New York: Springer. StrJtist.. STEIN. 10201034 (1971). LEHMANN.375394 (1997). HUBER... P. AND B.V.. Wiley & Sons. 42. RISSANEN. JAECKEL. 2nd ed.. Math.. Assoc. P. TuKEY...n. Amer... C. New York: Springer. A. R. . Assoc. I.
Accepting this model provisionally. The design of the experiment may not be under our control. Here are two examples that illustrate these issues. M(n. Sex Bias in Graduate Admissions at Berkeley.Nfb the numbers of admitted male and female applicants. As we have seen. Nfo) by a multinomial.Nf1 . 3. parametrically. we are trying to get a yes or no answer to important questions in science. respectively.PmllPmO. in examples such as 1. If n is the total number of applicants. where is partitioned into {80 . and the corresponding numbers N mo . Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial. But this model is suspect because in fact we are looking at the population of all applicants here. Usually.3 the questions are sometimes simple and the type of data to be gathered under our control. the situation is less simple. They initially tabulated Nm1. PI or 8 0 . and what 8 0 and 8 1 correspond to in tenns of the stochastic model may be unclear.1. This framework is natural if. o E 8.1. pUblic policy. modeled by us as having distribution P(j. Nmo.Pfl.3 we defined the testing problem abstractly. and 3. e Example 4.1.PfO).3. what does the 213 . medicine. to answering "no" or "yes" to the preceding questions. where Po. Nfo of denied applicants. it might be tempting to model (Nm1 . and indeed most human activities.2. or more generally construct an experiment that yields data X in X C Rq. whether (j E 8 0 or 8 1 jf P j = {Pe : () E 8 j }. and we have data providing some evidence one way or the other. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. conesponding. distribution. not a sample. treating it as a decision theory problem in which we are to decide whether P E Po or P l or.1 INTRODUCTION In Sections 1.Chapter 4 TESTING AND CONFIDENCE REGIONS: BASIC THEORY 4. the parameter space e. peIfonn a survey. 8d with 8 0 and 8. respectively. what is an appropriate stochastic model for the data may be questionable. 8 1 are a partition of the model P or. as is often the case.
. • . the natural model is to assume. that N AA. where N m1d is the number of male admits to department d.214 Testing and Confidence Regions Chapter 4 hypothesis of no sex bias correspond to? Again it is natural to translate this into P[Admit I Male] = Pm! Pml +PmO = P[Admit I Female] = P fI Pil +PjO But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability Pml Pi! Pml + PmO PIl +PiO [n fact.. . as is discussed in a paper by Bickel. . . ~.. Pfld. OUf multinomial assumption now becomes N ""' M(pmld' PmOd. I . d = 1.1. m P [ NAA . . the same data can lead to opposite conclusions regarding these hypothesesa phenomenon called Simpson's paradox. D).. for d = 1. NjOd. D)." then the data are naturally decomposed into N = (Nm1d. 0 I • . I .. It was noted by Fisher corresponds to H : p = ~ with the alternative K : p as reported in Jeffreys (1961) that in this experiment the observed fraction ':: was much closer to 3.n 3 t I . The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). ! . . If departments "use different coins. • I I . ... N mOd . d = 1.=7xlO 5 ' n 3 . I I 1] Fisher conjectured that rather than believing that such a very extraordinary event occurred it is more likely that the numbers were made to "agree with theory" by an overzealous assistant. and so on.. either N AA cannot really be thought of as stochastic or any stochastic I . In a modem formulation. has a binomial (n.2. The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis. and O'Connell (1975).: Example 4. PfOd.< .than might be expected under the hypothesis that N AA has a binomial. This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate. That is. Mendel's Peas. distribution.. if there were n dominant offspring (seeds). . The hypothesis of dominant inheritance ~. Mendel crossed peas heterozygous for a trait with two alleles.! . • Pml + P/I + PmO + Pio • I . ~). i:. . admissions are petfonned at the departmental level and rates of admission differ significantly from department to department. if the inheritance ratio can be arbitrary. Hammel. . NIld.D. In these tenns the hypothesis of "no bias" can now be translated into: H: Pml Pmld PmOd Pfld Pild + + PfOd . one of which was dominant.p) distribution. pml +P/I !. . . B (n. • In fact. . the number of homozygous dominants. In one of his famous experiments laying the foundation of the quantitative theory of genetics.
0 E 8 0 .Xn ). In situations such as this one . K : () > Bo Ifwe allow for the possibility that the new drug is less effective than the old. administer the new drug.3. is point mass at . our discussion of constant treatment effect in Example 1.3.1. (1 . P = Po. where Xi is 1 if the ith patient recovers and 0 otherwise. In science generally a theory typically closely specifies the type of distribution P of the data X as. (}o] and eo is composite. What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug.1 loss l(B. for instance. say k. say. Now 8 0 = {Oo} and H is simple. = Example 4. I} or critical region C {x: Ii(x) = I}. then 8 0 = [0 . the number of recoveries among the n randomly selected patients who have been administered the new drug. Thus.1 Introduction 215 model needs to pennit distributions other than B( n. .Section 4.1. p). If we suppose the new drug is at least as effective as the old. It is convenient to distinguish between two structural possibilities for 8 0 and 8 1 : If 8 0 consists of only one point. (2) If we let () be the probability that a patient to whom the new drug is administered recovers and the population of (present and future) patients is thought of as infinite. we reject H if S exceeds or equals some integer. The set of distributions corresponding to one answer. a) = 0 if BE 8 a and 1 otherwise. and accept H otherwise. If the theory is false.!. 11 and K is composite. is better defined than the alternative answer 8 1 . and we are then led to the natural 0 . To investigate this question we would have to perform a random experiment. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. acceptance and rejection can be thought of as actions a = 0 or 1. suppose we observe S = EXi . B) distribution.11. for instance. it's not clear what P should be as in the preceding Mendel example. Suppose that we know from past experience that a fixed proportion Bo = 0. 8 1 is the interval ((}o. in the e . and then base our decision on the observed sample X = (X J. Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug. It will turn out that in most cases the solution to testing problem~ with 80 simple also solves the composite 8 0 problem. These considerations lead to the asymmetric formulation that saying P E Po (e E 8 0 ) corresponds to acceptance of the hypothesis H : P E Po and P E PI corresponds to rejection sometimes written as K : P E PJ . where 00 is the probability of recovery usiog the old drug.(1) As we have stated earlier. Most simply we would sample n patients. the set of points for which we reject. say 8 0 . Moreover. where 1 ~ E is the probability that the assistant fudged the data and 6!. then S has a B(n. In this example with 80 = {()o} it is reasonable to reject IJ if S is "much" larger than what would be expected by chance if H is true and the value of B is eo. When 8 0 contains more than one point. n 3' 0 What the second of these examples suggests is often the case.1. That a treatment has no effect is easier to specify than what its effect is. See Remark 4. The same conventions apply to 8] and K.. We illustrate these ideas in the following example.3 recover from the disease with the old drug. see. That is. Thus.€)51 + cB( nlP). 8 0 and H are called compOSite. then = [00 . we call 8 0 and H simple. 
we shall simplify notation and write H : () = eo. recall that a decision procedure in the case of a test is described by a test function Ii: x ~ {D.. .
when H is false. of the two errors.3. testing techniques are used in searching for regions of the genome that resemble other regions that are known to have significant biological activity. but if this view is accepted.. asymmetry is often also imposed because one of eo. as we shall see in Sections 4. . 0 > 00 . and later chapters. . . No one really believes that H is true and possible types of alternatives are vaguely known at best. There is a statistic T that "tends" to be small. To detennine significant values of these statistics a (more complicated) version of the following is done. In that case rejecting the hypothesis at level a is interpreted as a measure of the weight of evidence we attach to the falsity of H. : I I j ! 1 I . and large. We will discuss the fundamental issue of how to choose T in Sections 4..3.. (5 > k) < k). . • . 1.1. Note that a test statistic generates a family of possible tests as c varies. Thresholds (critical values) are set so that if the matches occur at random (i. Given this position. It has also been argued that.e. our critical region C is {X : S rule is Ok(X) = I{S > k} with > k} and the test function or PI ~ probability of type I error = Pe.216 Testing and Confidence Regions Chapter 4 tenninology of Section 1.2 and 4. (Other authors consider test statistics T that tend to be small. if H is false.3 this asymmetry appears reasonable. The value c that completes our specification is referred to as the critical value of the test. We do not find this persuasive. The Neyman Pearson Framework The Neyman Pearson approach rests On the idea that. one can be thought of as more important. but computation under H is easy..3.) We select a number c and our test is to calculate T(x) and then reject H ifT(x) > c and accept H otherwise.1 . matches at one position are independent of matches at other positions) and the probability of a match is ~. generally in science. is much better defined than its complement and/or the distribution of statistics T under eo is easy to compute.. One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. 0 PH = probability of type II error ~ P. e 1 . 4.1. how reasonable is this point of view? In the medical setting of Example 4. We now tum to the prevalent point of view on how to choose c. (5 The constant k that determines the critical region is called the critical value.T would then be a test statistic in our sense.1 and4. then the probability of exceeding the threshold (type I) error is smaller than Q". announcing that a new phenomenon has been observed when in fact nothing has happened (the socalled null hypothesis) is more serious than missing something new.2.that has in fact occurred. The Neyman Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then. if H is true. 1 i . As we noted in Examples 4. it again reason~bly leads to a Neyman Pearson fonnulation. suggesting what test statistics it is best to use. ~ : I ! j . For instance. We call T a test statistic.1. In most problems it turns out that the tests that arise naturally have the kind of structure we have just described.. By convention this is chosen to be the type I error and that in tum detennines what we call H and what we call K. ! : . .2.
Then restrict attention to tests that in fact have the probability of rejection less than or equal to a for all () E 8 0 . It can be thought of as the probability that the test will "detect" that the alternative 8 holds. whereas if 8 E power against (). if our test statistic is T and we want level a. Begin by specifying a small number a > 0 such that probabilities of type I error greater than a arc undesirable.2. In that case.3 with O(X) = I{S > k}.3 (continued). 80 = 0. 8 0 = 0. Example 4.Section 4. e t . Thus. if 0 < a < 1.1.05 critical value 6 and the test has size . even though there are just two actions.1. 0) is just the probability of type! errot.(6) = Pe. That is.0473. . The power is a function of 8 on I. It is referred to as the level a critical value. the approach of Example 3. the power is 1 minus the probability of type II error. we can attach. Definition 4. 0) is the {3(8. k = 6 is given in Figure 4. Such tests are said to have level (o/significance) a. it is convenient to give a name to the smallest level of significance of a test. if we have a test statistic T and use critical value c. Once the level or critical value is fixed. Indeed. even nominally.8)n. Specifically.05 are commonly used in practice.Ok) = P(S > k) A plot of this function for n ~ t( j=k n ) 8i (1 .j J = 10. we find from binomial tables the level 0. Both the power and the probability of type I error are contained in the power function. This quantity is called the size of the test and is the maximum probability of type I error. Ll) Nowa(e) is nonincreasing in c and typically a(c) r 1 as c 1 00 and a(e) 1 0 as c r 00. such as the quality control Example 1. Because a test of level a is also of level a' > a.1.1. Here are the elements of the Neyman Pearson story. The power of a test against the alternative 0 is the probability of rejecting H when () is true. By convention 1 . then the probability of type I error is also a function of 8.. Finally.1.01 and 0.1. In Example 4. If 8 0 is composite as well. the probabilities of type II error as 8 ranges over 8 1 are determined. Here > cJ.2(b) is the one to take in all cases with 8 0 and 8 1 simple.P [type 11 error] is usually considered.3 and n = 10. numbers to the two losses that are not equal and/or depend on 0. and we speak of rejecting H at level a.3. it is too limited in any situation in which.1 Introduction 217 There is an important class of situations in which the Neyman Pearson framework is inappropriate. in the Bayesian framework with a prior distribution on the parameter.(S > 6) = 0.9. our test has size a(c) given by a(e) ~ sup{Pe[T(X) > cJ: 8 E eo}· (4. This is the critical value we shall use. See Problem 3.1. there exists a unique smallest c for which a(c) < a.0) = Pe[Rejection] = Pe[o(X) = 1] = Pe[T(X) If 8 E 8 0 • {3(8. {3(8. The values a = 0.2.1. which is defined/or all 8 E 8 by e {3(8) = {3(8.
Let Xi be the difference between the time slept after administration of the drug and time slept without administration of the drug by the ith patient. :1 • • Note that in this example the power at () = B1 > 0.3 is the probability that the level 0.0 0 ]. then the drug effect is measured by p.0473.3770. I I I I I 1 .0 0. The power is plotted as a function of 0.3 for the B(lO.2 0.6 0.3 to (it > 0..7 0.:"::':::'.1 0. Example 4. (The (T2 unknown case is treated in Section 4._~:.1 it appears that the power function is increasing (a proof will be given in Section 4. .[T(X) > k]. 0 Remark 4.. for each of a group of n randomly selected patients. When (}1 is 0. j 1 ..) We want to test H : J. Xn ) is a sample from N (/'.2 is known. I i j .05 test will detect an improvement of the recovery fate from 0. We return to this question in Section 4.9 1.8 06 04 0.5.3. For instance.1. Suppose that X = (X I. I j a(k) = sup{Pe[T(X) > k]: 0 E eo} = Pe.05 onesided test c5k of H : () = 0.0 Figure 4.3).1.3 04 05 0. a 67% improvement. .. . From Figure 4. and variance (T2. We might.3.t < 0 versus K : J. this probability is only . OneSided Tests for the Mean ofa Normal Distribution with Known Variance..2 o~31:.5.1. .3 versus K : B> 0.4. Power function of the level 0. What is needed to improve on this situation is a larger sample size n. This problem arises when we want to compare two treatments or a treatment and control (nothing) and both treatments are administered to the same subject. and H is the hypothesis that the drug has no effect or is detrimental. .t > O. record sleeping time without the drug (or after the administration of a placebo) and then after some time administer the drug and record sleeping time again. If we assume XI.8 0..1. 0) family of distributions. That is.218 Testing and Confidence Regions Chapter 4 1. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. suppose we want to see if a drug induces sleep.. k ~ 6 and the size is 0.:. It follows that the level and size of the test are unchanged if instead of 80 = {Oo} we used eo = [0. ! • i.2) popnlation with .~==/++_++1'+ 0 o 0. whereas K is the alternative that it has some positive effect.1. _l X n are nonnally distributed with mean p.
• T(X(B)). Because (J(jL) a(e) ~ is increasing.. In all of these cases.(T(X)) doesn't depend on 0 for 0 E 8 0 . This occurs if 8 0 is simple as in Example 4. 1) distribution.o versus p.2 and 4.a) is an integer (Problem 4.. T(X(lI). (12) observations with both parameters unknown (the t tests of Example 4.1.(jo) = (~ (jo)+ where y+ = Y l(y > 0). < T(B+l) are the ordered T(X).co . then the test that rejects iff T(X) > T«B+l)(la)). #.3. S) inf{d(x. However.1. and we have available a good estimate (j of (j.~I. It is convenient to replace X by the test statistic T(X) = .3.1.9).V. if we generate i.5).P. (Tn) for (j E 8 0 is usually invariance The key feature of situations in which under the action of a group of transformations_ See Lehmann (1997) and Volume II for discussions of this property.Section 4. where d is the Euclidean (or some equivalent) distance and d(x. it is clear that a reasonable test statistic is d((j.nX/ (J. in any case. .3. In Example 4.1. In Example 4.. critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods. where T(l) < .eo) = IN~A . Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward. But it occurs also in more interesting situations such as testing p. has level a if £0 is continuous and (B + 1)(1 . sup{(J(p) : p < OJ ~ (J(O) = ~(c). N~A is the MLE of p and d(N~A.a) quantile oftheN(O.. That is.i.I') ~ ~ (c + V.d.   = . The power function of the test with critical value c is p " [Vii (X !") > e _ Vii!"] (J (J 1.a) is the (1.1 and Example 4. = P. T(X(B)) from £0.1 Introduction 219 Because X tends to be larger under K than under H.~ (c . The smallest c for whieh ~(c) < C\' is obtained by setting q:. T(X(l)). . .( c) = C\' or e = z(a) where z(a) = z(1 ..1. £0.co =.1. which generates the same family of critical regions.1. ~ estimates(jandd(~. p ~ P[AA].2. The task of finding a critical value is greatly simplified if £. o The Heuristics of Test Construction When hypotheses are expressed in terms of an estimable parameter H : (j E eo c RP. eo).p) because ~(z) (4. has a closed form and is tabled. Rejecting for large values of this statistic is equivalent to rejecting for large values of X.5. This minimum distance principle is essentially what underlies Examples 4.1. it is natural to reject H for large values of X. the common distribution of T(X) under (j E 8 0 . .~(z).y) : YES}.o if we have H(p.. Given a test statistic T(X) we need to determine critical values and eventually the power of the resulting tests.2) = 1..
. (T2 = VarF(Xtl.4.1 J) where x(l) < . where U denotes the U(O. thus. and hn(t) = t/( Vri + 0.12 + OJI/Vri) close approximations to the size" critical values ka are h n (L628). . Suppose Xl.1. As x ranges over R.1 El{Fo(Xi ) < Fo(x)} n.7) that Do.1 El{Xi ~ U(O.U.. .. .Fa.. Proof..n {~ n Fo(x(i))' FO(X(i)) _ .01.Xn be i. . F o (Xi)..u[ O<u<l . U. ). What is remarkable is that it is independ~nt of which F o we consider.   Dn = sup IF(x) . which is called the Kolmogorov statistic. . In particular. This statistic has the following distributionjree property: 1 Proposition 4. 1).220 Testing and Confidence Regions Chapter 4 1 . We can proceed as in Example 4. the order statistics. for n > 80. F and the hypothesis is H : F = ~ (' (7'/1:) for some M. < Fo(x)} = U(Fo(x)) .. Un' where U denotes the empirical distribution function of Ul u = Fo(x) ranges over (0. as X . .. • Example 4.(i_l. < x(n) is the ordered observed sample. Consider the problem of testing H : F = Fo versus K : F i. The distribution of D n has been thoroughly smdied for finite and large n. Let F denote the empirical distribution and consider the sup distance between the hypothesis Fo and the plugin estimate of F.1. This is again a consequence of invariance properties (Lehmann.6. 1997).. h n (L358). The natural estimate of the parameter F(!. + (Tx) is .10 respectively.1.. . The distribution of D n under H is the same for all continuous Fo. X n are ij. as a tcst statistic .d. In particular. Let X I. where F is continuous. Goodness of Fit Tests. then by Problem B.1 El{U. can be wriHen as Dn =~ax max tI. and the result follows.i. :i' D n = sup [U(u) ..05.3..L5 rewriting H : F(!' + (Tx) = <p(x) for all x where !' = EF(X . P Fo (D n < d) = Pu (D n < d).1.. n. 1).. o Note that the hypothesis here is simple so that for anyone of these hypotheses F = F o• the distribution can be simulated (or exhibited in closed fonn). that is. 1) distribution. 0 Example 4. .... and h n (L224) for" = . the empirical distribution function F. which is evidently composite. ..1. . Set Ui ~ j . • . Goodness of Fit to the Gaussian Family. Also F(x) < x} = n.d.Fo(x)[.)} n (4. F. x It can be shown (Problem 4.5. and .
<l'(x)1 where G is the empirical distribution of (L~l"'" Lln ) with Lli (Xi . if X = x. otherwise.d.(iix u > z(a) or upon applying <l' to both sides if. whereas experimenter II insists on using 0' = 0. . Experimenter] may be satisfied to reject the hypothesis H using a test with size a = 0. Now the Monte Carlo critical value is the I(B + 1)(1 . . In). T nB .X)/ii. If the two experimenters can agree on a common test statistic T. That is.1 Introduction 221 (12. 1 < i < n. this difficulty may be overcome by reporting the outcome of the experiment in tenus of the observed size or pvalue or significance probability of the test. and only if. for instance. 1.X) . . H is not rejected. a satisfies T(x) = .' ..a) + IJth order statistic among Tn. Example 4. T nB .05. If we observe X = x = (Xl.4. . This quantity is a statistic that is defined as the smallest level of significance 0' at which an experimenter using T would reject on the basis ofthe observed outcome x.01. if the experimenter's critical value corresponds to a test of size less than the p~value. Zn arc i. H is rejected. . . (Sec Section 8. the joint distribution of (Dq .. . and only if.<l'(x)1 sup IG(x) . 1). N(O.Section 4.4) Considered as a statistic the pvalue is <l'( y"nX /u). the pvalue is <l'( T(x)) = <l' (..d. It is then possible that experimenter I rejects the hypothesis H.1.. Consider. {12 and is that of ~  (Zi .. where X and 0'2 are the MLEs of J1 and we obtain the statistic F(X + ax) Applying the sup distance again. Tn sup IF'(X x x + iix) .. Therefore. .2. ~n) doesn't depend on fl. (4. under H. whatever be 11 and (12. I). then computing the Tn corresponding to those Zi. But. observations Zi. · · . and the critical value may be obtained by simulating ij. Tn has the same distribution £0 under H. where Z I.) Thus..3. Tn!.i. We do this B times independently. thereby obtaining Tn1 . I < i < n... . we would reject H if.. whereas experimenter II accepts H on the basis of the same outcome x of an experiment. from N(O.(3) 0 The pValue: The Test Statistic as Evidence o Different individuals faced with the same testing problem may have different criteria of size.Z) / (~ 2::7 ! (Zi  if) • . a > <l'( T(x)).
222 Testing and Confidence Regions Cnapter 4 In general.• see f [' Hedges and Olkin. that is. The pvalue is a(T(X)). let X be a q dimensional random vector. T(x) > c. U(O. .1). . but K is not. for miu{ nOo. Thus. 1).. ~ H fJ. It is possible to use pvalues to combine the evidence relating to a given hypothesis H provided by several different independent experiments producing different kinds of data. In this context. a(Tr ). H is rejected if T > Xla where Xl_a is the 1 . Then if we use critical value c.1.. when H is well defined.1.1. I.1. 80). n(1 . ''The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p. More generally.3. these kinds of issues are currently being discussed under the rubric of datafusion and metaanalysis (e. •• 1. 1985). The normal approximation is used for the pvalue also.6). to quote Fisher (1958). But the size of a test with critical value c is just a(c) and a(c) is decreasing in c. distribution (Problem 4.1.. if r experimenters use continuous test statistics T 1 .2. Various melhods of combining the data from different experiments in this way are discussed by van Zwet and Osterhoff (1967). For example. We have proved the following. This is in agreement with (4. (4. We will show that we can express the pvalue simply in tenns of the function a(·) defined in (4.. .1.1. so that type II error considerations are unclear.ath quantile of the X~n distribution. The statistic T has a chisquare distribution with 2n degrees of freedom (Problem 4. a(T) is on the unit interval and when H is simple and T has a continuous distribution. The pvalue can be thought of as a standardized version of our original statistic.6) I • to test H. ' I values a(T. we would reject H if. I T r to produce p .1. Similarly in Example 4. Thus.4). • T = ~2 I: loga(Tj) j=l ~ ~ r (4. . and the pvalue is a( s) where s is the observed value of X. • i~! <:1 im _ .. aCT) has a uniform.). !: The pvalue is used extensively in situations of the type we described earlier.<I> ( [nOo(1 ~ 0 )1' 2 s~l~nOo) 0 . the smallest a for which we would reject corresponds to the largest c for which we would reject and is just a(T(x)). Thus. a(8) = p.5). then if H is simple Fisher (1958) proposed using l j :i.g. Suppose that we observe X = x.Oo)} > 5.. the largest critical value c for which we would reject is c = T(x). Thus. and only if. Proposition 4.(S > s) '" 1..5) ..
or equivalently.3). (4. where eo and e 1 are disjoint subsets of the parameter space 6. 00 ) = 0. Such a test and the corresponding test statistic are called most poweiful (MP). L(x./00)5[(1.) which is large when S true. power function. For instance. 4. We introduce the basic concepts and terminology of testing statistical hypotheses and give the NeymanPearson framework. that is. type II error.00)ln~5 [0. the highest possible power. we test whether the distribution of X is different from a specified Fo. p(x. OIl > 0. (null) hypothesis H and alternative (hypothesis) K. Summary. and.01 )/(1. o:(Td has a U(O. under H. type I error. 0. and pvalue. (0. subject to this restriction.Otl/(1 .0. In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by L(x. equals 0 when both numerator and denominator vanish. 00. we specify a small number 0: and conStruct tests that have at most probability (significance level) 0: of rejecting H (deciding K) when H is true. by convention.2 and 3. critical regions.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. 00 . I) = EXi is large.00)/00 (1 . that is. We introduce the basic concepts of simple and composite hypotheses. significance level. in the binomial example (4. In particular. we try to maximize the probability (power) of rejecting H when K is true. we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter () belongs to 80 or e 1. We start with the problem of testing a simple hypothesis H : () = ()o versus a simple alternative K : 0 = 01. This is an instance of testing goodness offit. The statistic L is reasonable for testing H versuS K with large values of L favoring K over H. power. is measured in the NeymanPearson theory. a given test statistic T.Section 4. 0) is the density or frequency function of the random vector X.1. test functions.Oo)]n. Typically a test statistic is not given but must be chosen on the basis of its perfonnance. size. 1) distribution. test statistics. The statistic L takes on the value 00 when p(x. and S tends to be large when K : () = 01 > 00 is . then. In Sections 3.2 CHOOSING A TEST STATISTIC: THE NEYMANPEARSON LEMMA We have seen how a hypothesistesting problem is defined and how perfonnance of a given test b.2.2 Choosing a Test Statistic The NeymanPearson lemma 223 The preceding paragraph gives an example in which the hypothesis specifies a distribution completely. OIl where p(x. In this section we will consider the problem of finding the level a test that ha<. 0) 0 p(x. OIl = p (x.)1 5 [(1. In the NeymanPearson framework. (I .
p is a level a test. (b) For each 0 < a < 1 there exists an MP size 0' likelihood ratio test provided that randomization is permitted. 0 < <p(x) < 1.3. To this end consider (4.'P(X)] 1{P(X. i' ~ E. k] . 1. using (4.3 with n = 10 and Bo = 0.4) 1 j 7 .05 . we choose 'P(x) = 0 if S < 5. Note that a > 0 implies k < 00 and.Bd >k <k with 'Pdx) any value in (0. Eo'P(X) < a.k is < 0 or > 0 according as 'Pk(X) is 0 or 1.3. (See also Section 1.'P(X)] a. then ~ .Bo.[CPk(X) .'P(X)] > O.2.) .7l'). Bo. B .['Pk(X) . B .2.P(S > 5)I!P(S = 5) = . Note (Section 3. which are tests that may take values in (0. Such randomized tests are not used in practice. and because 0 < 'P(x) < 1. Finally.2) that 'Pk is a Bayes rule with k ~ 7l' /(1 .2.'P(X)] [:~~:::l.) .05 in Example 4. (c) If <p is an MP level 0: test.224 Testing and Confidence Regions Chapter 4 We call iflk a likelihood ratio or NeymanPearsoll (NP) test ifunction) if for some 0 k < OJ we can write the test function Yk as 'Pk () = X < 1 o if L( x.1.('Pk(X) . and suppose r.3).1). then it must be a level a likelihood ratio test. 'Pk(X) = 1 if p(x. if want size a ~ . Bo) = O. the interpretation is that we toss a coin with probability of heads cp(x) and reject H iff the coin shows heads.) For instance.'P(X)] = Eo['Pk{X) . They are only used to show that with randomization. (4.Bo. Proof. and 'P(x) = [0.2. where 7l' denotes the prior probability of {Bo}. we have shown that I j I i E. 'Pk is MP for level E'c'PdX). BI )  . i = 0. If L(x. B. lOY some x.. It follows that II > O. there exists k such that (4.1) if equality occurs. .. We show that in addition to being Bayes optimal.'P(X)] > kEo['PdX) .. likelihood ratio tests are unbeatable no matter what the size a is. Theorem 4.2) forB = BoandB = B.2. EO'PdX) We want to show E. where 1= EO{CPk(X)[L(X. then <{Jk is MP in the class oflevel a tests.3) 1 : > O.'P(X)[L(X.1. BIl o .B. Because we want results valid for all possible test sizes Cl' in [0. (a) Let E i denote E8. (a) If a > 0 and I{Jk is a size a likelihood ratio test.k] + E'['Pk(X) . that is. then I > O. thns. If 0 < 'P(x) < 1 for the observation vector x. Bo) = O} = I + II (say). (NeymanPearson Lemma). o Because L(x.0262 if S = 5.1]. we consider randomized tests '1'. 'P(x) = 1 if S > 5.kEo['Pk(X) .'P(X)] .kJ}.
O. Therefore. 00 .1. The same argument works for x E {x : p(x. OJ. 0.) + 1Tp(X. 'Pk(X) = 1 and !. 'P(X) > a with equality iffp(" 60 ) = P(·. 0. (c) Let x E {x: p(x.Oo. then E9. 2 (7 T(X) ~.) > k and have 'P(x) = 'Pk(X) ~ 0 when L(x..) > then to have equality in (4. 0.2 Choosing a Test Statistic: The NeymanPearson lemma " 225 =:cc (b) If a ~ 0.0.) > k] Po[L(X. 0.1T)p(x. Now 'Pk is MP size a. 00 . k = 00 makes 'Pk MP size a. We found nv2} V n L(X.OI)' Proof.3. Example 4. k = 0 makes E. 00 ) > and 0 = 00 . If a = 1. = 'Pk· Moreover./f 'P is an MP level a test.1T)L(x. tben. Because Po[L(X.) .L. 0 It follows from the NeymanPearson lemma that an MP test has power at least as large as its level.1. 00 . for 0 < a < 1. X n ) is a sample of n N(j. denotes the Bayes procedure of Exarnple 3. Let 1T denote the prior probability of 00 so that (I 1T) is the prior probability of (it. or 00 according as 1T(01 Ix) is larger than or smaller than 1/2. 0..2.) + 1T' (4. Part (a) of the lemma can. 'Pk(X) =a . .(X.v) + nv ] 2(72 x .2. 0. is 1T(O.O. Remark 4.I. 00 . decides 0.) = k j. I x) = (1 1T)p(X. 60 .2. If not.2.2.) < k. also be easily argued from this Bayes property of 'Pk (Problem 4. where v is a known signal.Section 4. OJ! = ooJ ~ 0.k] on the set {x : L(x. that is.2.2.2..10).. we conc1udefrom (4.) (1 . then there exists k < 00 such that Po[L(X. OJ Corollary 4. Let Pi denote Po" i = 0..pk is MP size a.u 2 ) random variables with (72 known and we test H : 11 = a versus K : 11 = v. 0.v)=exp 2LX'2 . { (7 i=l 2(7 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. when 1T = k/(k + 1).7.fit [ logL(X. 00 .1.) = k] = 0.O. . 00 ) (11T)L(x. Consider Example 3.2 where X = (XI. It follows that (4.2..2) holds forO = 0.5) thatthis 0. Next consider 0 < Q: < 1. OJ! > k] < a and Po[T. define a. 00 .fit. 00 . OJ! > k] > If Po[L(X.Oo. See Problem 4. Here is an example illustrating calculation of the most powerful level a test <{Jk.5) If 0.2(b). Then the posterior probability of (). 0. 0.4) we need tohave'P(x) ~ 'Pk(X) = I when L(x. 0. then 'Pk is MP size a.) (1 .2. = v. 00 .Po[L(X.
. . l. 0.1. ). then. itl = ito + )"6. Eo are known. Such a test is called uniformly most powerful (UMP).9). The following important example illustrates.1. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Flmction. E j ). 0 An interesting feature of the preceding example is that the test defined by (4. O)T and Eo = I.226 Testing and Confidence Regions Chapter 4 is also optimal for this problem. a UMP (for all ). It is used in a classification context in which 9 0 . Particularly important is the case Eo = E 1 when "Q large" is equivalent to "F = (ttl . then this test rule is to reject H if Xl is large. j = 0.' Rejecting H for L large is equivalent to rejecting for I 1 . (Jj = (Pj. If ~o = (1.2.8)./ii/(J)) = 13 for n and find that we need to take n = ((J Iv j2[z(la) + z(I3)]'.2. Example 4. if ito.6) has probability of type I error 0:. itl' However if. this is no longer the case (Problem 4. <I>(z(a) + (v./iil (J)).2). say. 0.Jto)E01X large. We return to this in Volume II. Thus. E j ). large. T>z(la) (4. for the two popnlations are known.2.. The likelihood ratio test for H : (J = 6 0 versus K : (J = (h is based on .4. and only if.. In this example we bave assnmed that (Jo and (J. Suppose X ~ N(Pj. they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population CQvanances. I I ." The function F is known as the Fisher discriminant function..7) where c ~ z(1 .2. the test that rejects if.2. (JI correspond to two known populations and we desire to classify a new observation X as belonging to one or the other. > 0 and E l = Eo. however. This is the smallest possible n for any size Ct test. We will discuss the phenomenon further in the next section.0. .6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > O. by (4. But T is the test statistic we proposed in Example 4. then we solve <1>( z(a) + (v. If this is not the case. .2. if Eo #. The power of this test is.) test exists and is given by: Reject if (4.90 or . • • . .I. . among other things.2. From our discussion there we know that for any specified Ct. 6.. that the UMP test phenomenon is largely a feature of onedimensional parameter problems. By the NeymanPearsoo lemma this is the largest power available with a level Ct test.a)[~6Eo' ~oJ! (Problem 4. if we want the probability of detecting a signal v to be at least a preassigned value j3 (say..95). Note that in general the test statistic L depends intrinsically on ito.
if a match in a genetic breeding experiment can result in k types.OkO' Usually the alternative to H is composite.2) where . 0) P n! nn.3. The NeymanPearson lemma. n offspring are observed. r lUI nt···· nk· ". . here is the general definition of UMP: Definition 4.3.. Testing for a Multinomial Vector. . .2 that UMP tests for onedimensional parameter problems exist. _ (nl.. .. . 0 < € < 1 and for some fixed (4. This phenomenon is not restricted to the Gaussian case as the next example illustrates. is established. 0" . . nk are integers summing to n. Ok = OkO. which states that the size 0:' SLR test is uniquely most powerful (MP) in the class of level a tests. For instance. .. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : (j = ()o versus the simple alternative K : 0 = ()1. Nk) .... Example 4.. . Suppose that (N1 . .u k nn...3..)N' i=l to Here is an interesting special case: Suppose OjO integer I with 1 < I < k > 0 for all}. . and N i is the number of offspring of type i.Section 4. then (N" . Before we give the example. 4. Two examples in which the MP test does not depend on 0 1 are given. ..1.1) test !. 'P') > (3(0. .'P) for all 0 E 8" for any other level 0:' (4.M( n. k are given by Ow. We note the connection of the MP test to the Bayes procedure of Section 3. there is sometimes a simple alternative theory K : (}l = (}ll. With such data we often want to test a simple hypothesis H : (}I = Ow.2 for deciding between 00 and Ol."" (}k = Oklo In this case.···.3 UNIFORMLY MOST POWERFUL TESTS ANO MONOTONE LIKELIHOOD RATIO MODELS We saw in the two Gaussian examples of Section 4.3. (}l. However..3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 227 Summary.. 1 (}k) distribution with frequency function.. Ok)' The simple hypothesis would correspond to the theory that the expected proportion of offspring of types 1. where nl.nk.p. . .1. . A level a test 'P' is wtiformly most powerful (UMP) for H : 0 E versus K : 0 E 8 1 if eo (3(0. N k ) has a multinomial M(n. the likelihood ratio L 's L~ rr(:. Such tests are said to be UMP (unifonnly most powerful).
OW and the model is by (4. . if and only if.o)n[O/(1 . equals the likelihood ratio test 'Ph(t) and is MP.(X) = '" > 0.3) with 6.2. Then L = pnN1£N1 = pTl(E/p)N1.3. set s then = l:~ 1 Xi.228 Testmg and Confidence Regions Chapter 4 That is. (2) If E'o6.3 continned). is of the form (4.nu)!".3.1 is of this (. J Definition 4. 00 . Note that because l can be any of the integers 1 . p(x. :. in the case of a real parameter... .2) with 0 < f < I}..00). 0) ~. Theorem 4. it is UMP at level a = EOodt(X) for testing H : e = eo versus K : B > Bo in fact.. Thus. is UMP level".1. we have seen three models where. B( n.(x) any value in (0. is an MLRfamily in T(x).6..2.3.(X) is increasing in 0. for testing H : 8 < /:/0 versus K:O>/:/I' ..2 (Example 4. (1) For each t E (0.. then 6. .: 0 E e}. for". 6. Oil = h(T(x)) for some increasing function h.3. Example 4. . : 0 E e} with e c R is said to be a monotone likelihood ratio (MLR) family if for (it < O the distributions POl and P0 2 are distinct and 2 the ratio p(x. 0 Typically the MP test of H : () = eo versus K : () = (}1 depends on (h and the test is not UMP. . we conclude that the MP lest rejects H.. Oil is an increasing function ofT(x). Moreover. under the alternative. Suppose {P. Bernoulli case. then this family is MLR. If {P. = .nx/u and ry(!") Define the NeymanPearson (NP) test function 6 ( ) _ 1 ifT(x) > t . Because dt does not depend on ()l.3. = (I .3.o)n. the power function (3(/:/) = E. this test is UMP fortesting H versus K : 0 E 8 1 = {/:/ : /:/ . we get radically different best tests depending on which Oi we assume to be (ho under H. Because f. This is part of a general phenomena we now describe. . then L(x. Ow) under H. X ifT(x) < t ° (4.1) MLR in s. N 1 < c. there is a statistic T such that the test with critical region {x : T(x) > c} is UMP. If 1J(O) is strictly increasing in () E e. However. . . is an MLRfamily in T(x). type I is less frequent than under H and the conditional probabilities of the other types given that type I has not occurred are the same under K as they are under H. where u is known. Consider the problem of testing H : 0 = 00 versus K: 0 ~ 01 with 00 < 81. e c R. < 1 implies that p > E.O) = 0'(1 . : 8 E e}. 0 .3. e c R.2.1. Example 4. Consider the oneparameter exponential family mode! o p(x. The family of models {P. k. = h(x)exp{ry(O)T(x) ~ B(O)}. . = P( N 1 < c). 02)/p(X.1) if T(x) = t..d. Critical values for level a are easily determined because Nl .i. In this i. 0 form with T(x) . Example 4.
0 Proof (I) follows from b l = iPh(t) The following useful result follows immediately. Suppose tha~ as in Example l. If the inspector making the test considers lots with bo = N(Jo defectives or more unsatisfactory..l. d t is UMP for H : (J < (Jo versus K : (J > (Jo. (X) < <> and b t is of level 0: for H : (J :S (Jo. Ifwe write ~= Uo t i''''l (Xi _1')2 000 we see that Sju5 has a X~ distribution.Section 4. distribution. . If the distributionfunction Fo ofT(X) under X"" POo is continuous and ift(1a) isasolution of Fo(t) = 1 . Thus.bo > n. where 11 is a known standard.5.l. H.l) yields e L( 0 0 )=b. Quality Control. she formulates the hypothesis H as > (Jo. where U O represents the minimum tolerable precision.. X < h(a).. . N . distribution. e c p(x. the alternative K as (J < (Jo.1.3. . where xn(a) is the ath quantile of the X. N. . _1')2. and specifies an 0' such that the probability of rejecting H (keeping a bad lot) is at most 0'. Because the class of tests with level 0' for H : (J < (Jo is contained in the class of tests with level 0' for H : (J = (Jo. 20' 2  This is a oneparameter exponential family and is MLR in T = So The UMP level 0' test rejects H if and only if 8 < s(a) where s(a) is such that Pa .4. Suppose {Po: 0 E Example 4. 0. 0 Example 4. Suppose X!. Testing Precision.2.(X). we test H : u > Uo l versus K : u < uo. 1 bo(bol) (blX+l)(Nbl) (box+1)(Nbo) (Nbln+x+l) (Nbon+x+l)' .3. For simplicity suppose that bo > n. Then. is an MLRfamily in T(r). then by (1). the critical constant 8(0') is u5xn(a).. If a is a value taken on by the distribution of X. where h(a) is the ath quantile of the hypergeometric. Because the most serious error is to judge the precision adequate when it is not.o. is UMP level a.1 by noting that for any (it < (J2' e5 t is MP at level Eo. (8 < s(a)) = a.3 Uniformly Most Po~rful Tests and Monotone likelihood Ratio Models 229 and Corollary 4. and only if. recall that we have seen that e5 t maximizes the power for testing II : (J = (Jo versus K : (J > (Jo among the class of tests with level <> ~ Eo.(bl1) x. (l.(X) for testing H: 0 = 01 versus J( : 0 = 0. X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives.<>. For instance.Xn is a sample from a N(I1.<» is lfMP level afor testing H: (J < (Jo versus K : (J > (Jo. Corollary 4. n). then the test/hat rejects H if and only ifT(r) > t(1 .( NOo.U 2 ) population. R. Eoo. Let S = l:~ I (X. then e}. where b = N(J.. we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard.o.l of the measurements Xl. we now show that the test O· with reject H if.3. 0) = exp {~8 ~ IOg(27r0'2)} . If 0 < 00 .l.Xn . and we are interested in the precision u. if N0 1 = b1 < bo and 0 < x < b1 . and because dt maximizes the power over this larger class. To show (2).
we want guaranteed power as well as an upper bound on the probability of type I error. This equation is equiValent to = {3 z(a) + . we would also like large power (3(0) when () E 8 1 ... 1 I . It follows that 8* is UMP level Q. I' .1. However.:(x:. 0 Power and Sample Size In the NeymanPearson framework we choose the test whose size is small.(b o . In our nonnal example 4.ntI") for sample size n.I. ° ° . By CoroUary 4.. forO <:1' < b1 1.230 NotethatL(x. we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small.3.) is increasing. L is decreasing in x and the hypergeometric model is an MLR family in T( x) = r.OIl box (Nn+I)(blx) .1.a.B 1 ) =0 forb l Testing and Confidence Regions Chapter 4 < X <n.4 because (3(p. This continuity of the power shows that not too much significance can be attached to acceptance of H.2). . Note that a small signaltonoise ratio ~/ a will require a large sample size n.n + I) .. not possible for all parameters in the alternative 8. Therefore.O'"O. In both these cases. 0) continuous in 0. in (O . we want the probability of correctly detecting an alternative K to be large. In our example this means that in addition to the indifference region and level a. This is not serious in practice if we have an indifference region.'. if all points in the alternative are of equal significance: We can find > 00 sufficiently close to 00 so that {3( 0) is arbitrarily close to {3(00) = a. For such the probability of falsely accepting H is almost 1 . This is possible for arbitrary /3 < 1 only by making the sample size n large enough.c:O". we specify {3 close to I and would like to have (3(!') > (3 for aU !' > t.x) <I L(x.(}o.L. ~) for some small ~ > 0 because such improvements are negligible..1..1 and formula (4.' . the appropriate n is obtained by solving i' . The critical values for the hypergeometric distribution are available on statistical calculators and software.Oo.. as seen in Figure 4.4 we might be uninterested in values of p. ". (0. This is a subset of the alternative on which we are willing to tolerate low power. in general.1.Il = (b l . On the other hand. . In Example 4. That is..+. and the powers are continuous increasing functions with limOlO.. this is.1. Off the indifference region.. H and K are of the fonn H : () < ()o and K : () > ()o. Thus.ntI" = z({3) whose solution is il . this is a general phenomenon in MLR family models with p( x.x) (N . {3(0) ~ a. that is. I i (3(t) ~ 'I>(z(a) + . i. ~) would be our indifference region. Thus.
4 this would mean rejecting H if. We solve For instance.282 x 0. As a further example and precursor to Section 5. Bd. They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is 0:.80 ) . This problem arises particularly in goodnessoffit tests (see Example 4. we can have very great power for alternatives very close to O. (h = 00 + Dt.4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 231 Dual to the problem of not having enough power is that of having too much.3. and only if.Section 4. (3 = . First.1. we fleed n ~ (0.3. 1.) = (3 for n and find the approximate solution no+ 1 .6 (Example 4." The reason is that n is so large that unimportant small discrepancies are picked up. Formula (4. Suppose 8 is a vector. 00 = 0.4. we find (3(0) = PotS > so) = <I> ( [nO(l ~ 0)11/ 2 Now consider the indifference region (80 . Thus. > O.3 continued).. far from the hypothesis. we solve (3(00 ) = Po" (S > s) for s using (4.3 to 0.35 and n = 163 is 0.35(0. and 0.35.0. Example 4. Again using the nonnal approximation. where (3(0.05)2{1. Often there is a function q(B) such that H and K can be formulated as H : q(O) <: qo and K : q(O) > qo.645 x 0. Dt.4) and find the approximate critical value So ~ nOo 1 + 2 + z(l. the size . In Example 4.05 binomial test of H : 8 = 0. 0 ° Our discussion can be generalized.1.35.5).2) shows that.1.86.90 of detecting the 17% increase in 8 from 0. test for = . = 0. when we test the hypothesis that a very large sample comes from a particular distribution. There are various ways of dealing with this problem. if Oi = . using the SPLUS package) for the level . The power achievable (exactly.55)}2 = 162. that is. Such hypotheses are often rejected even though for practical purposes "the fit is good enough.00 )] 1/2 .3 requires approximately 163 observations to have probability .1.4.05. Our discussion uses the classical normal approximation to the binomial distribution. ifn is very large and/or a is small. It is natural to associate statistical significance with practical significance so that a very low pvalue is interpreted as evidence that the alternative that holds is physically significant.Oi) [nOo(1 .7) + 1. to achieve approximate size a. Now let . we next show how to find the sample size that will "approximately" achieve desired power {3 for the size 0: test in the binomial example.3(0.90.
However. The theory we have developed demonstrates that if C.7.4. I . > qo be a value such that we want to have power fJ(O) at least fJ when q(O) > q.Bo). Although H : 0.4) j j 1 .3.. then rejecting for large values of Tn is UMP among all tests based on Tn. In general.. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(O. for 9. .9. the critical value for testing H : 0.5 that a particular test statistic can have a fixed distribution £0 under the hypothesis. To achieve level a and power at least (3.1. I}..1 by taking q( 0) equal to the noncentrality parameter governing the distribution of the statistic under the alternative. and also increases to 1 for fixed () E 6 1 as n .(0) = o} aild 9. = (tm.+ 00.2 only. This procedure can be applied.(Tn ) is an MLR family.3 that this test is UMP for H : (1 > (10 versus K : 0..2. a reasonable class of loss functions are those that satisfy : 1 i j 1 1(0. 0) = (B . = {O : ). Suppose that j3(O) depends on () only through q( (J) and is a continuous increasing function of q( 0).= 00 is now composite.2 is (}'2 = ~ E~ 1 (Xi . we may consider l(O.. We have seen in Example 4.= (10 versus K : 0. is detennined by a onedimensional parameter ). a E A = {O. J. For each n suppose we have a level a test for H versus I< based on a suitable test statistic T.l' independent of IJ. Reducing the problem to choosing among such tests comes from invariance consideration that we do not enter into until Volume II. the distribution of Tn ni:T 2/ (15 is X.0) > ° I(O. . Thus.3. first let Co be the smallest number c such that Then let n be the smallest integer such that P" IT > col > fJ where 00 is such that q( 00 ) = qo and 0.(0) > O} and Co (Tn) = C'(') (Tn) for all O. that are not 01. to the F test of the linear model in Section 6. (4.3. for instance.Xf as in Example 2. Suppose that in the Gaussian model of Example 4. Implicit in this calculation is the assumption that POl [T > col is an increasing function ofn. a). 00). Testing Precision Continued..(0) so that 9 0 = {O : ). .< (10 and rejecting H if Tn is small. Then the MLE of 0.1) 1(0.< (10 among all tests depending on <. For instance. 0 E 9. is such that q( Oil = q.232 Testing and Confidence Regions Chapter 4 q. It may also happen that the distribution of Tn as () ranges over 9. we illustrate what can happen with a simple example. 0 E 9. Example 4. when testing H : 0 < 00 versus K : B > Bo..O)<O forB < 00 forB>B o. The set {O : qo < q(O) < ql} is our indjfference region. is the a percentile of X~l' It is evident from the argument of Example 4.£ is unknown.l)l(O.3. 0 = Complete Families of Tests The NeymanPearson framework is based on using the 01 loss function.
4 Confidence Bounds. If E. In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from H. (4. AND REGIONS We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter {J i" a . locally most powerful (LMP) tests in some cases can be found. In the following the decision procedures arc test functions. hence.(2) a v R(O.O)} E. 1) 1(O. when UMP tests do not exist. Theorem 4.3.o.3. and Regions 233 if for any decision rule 'P The class D of decision procedures is said to be there exists E such that complete(I).3.3. the risk of any procedure can be matched or improved by an NP test. then any procedure not in the complete class can be matched or improved at all () by one in the complete class. e 4.(X)) = 1.(x)) .(I"(X))) < 0 for 8 > 8 0 . (4.I"(X) 0 for allO then O=(X) clearly satisfies (4. for some 00 • E"o.) .E.(X) = ".O)]I"(X)} Let o.12) and.3. is similarly UMP for H : 0 > 8 0 versus K : 0 < 00 (Problem 4.4 CONFIDENCE BOUNDS. if the model is correct and loss function is appropriate. For () real. 0 Summary.3. INTERVALS.(X) > 1. Thus. a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing ()o versus 8 1 is an increasing function of a statistic T( x) for every ()o < ()1.5) holds for all 8. 1") = (1(0.I") ~ = EO{I"(X)I(O. We consider models {Po : () E e} for which there exist tests that are most powerful for every () in a composite alternative t (UMP tests). (4. then the class of tests of the form (4.O) + [1(0. = R(O. Thus.3. is an MLR family in T( x) and suppose the loss function 1(0.E. 1) + [1 1"(X)]I(O.0. Now 0. O))(E.3.Section 4. e c R. Intervals.o. 1") for all 0 E e. the test that rejects H : 8 < 8 0 for large values of T(x) is UMP for K : 8 > 8 0 . Proof. Finally.2.(X) = E"I"(X) > O.(Io. The risk function of any test rule 'P is R(O.6) But 1 .I"(X) for 0 < 00 . We also show how. Suppose {Po: 0 E e}.o) < R(O.(X) be such that. 0 < " < 1. a) satisfies (4. the class of NP tests is complete in the sense that for loss functions other than the 01 loss function.E.5) That is.3) with Eo.R(O.{1(8. hence. we show that for MLR models.1 and. For MLR models. E.5).(o. 1) 1(0.3. it isn't worthwhile to look outside of complete classes. is UMP for H : 0 < 00 versus K: 8 > 8 0 by Theorem 4. is complete.4).
Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability $1 - \alpha$. As an illustration, suppose $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known, $X = (X_1, \ldots, X_n)$, and suppose that $\mu$ represents the mean increase in sleep among patients administered a drug. Then we can use the experimental outcome $X$ to establish a lower bound $\underline\mu(X)$ for $\mu$ with a prescribed probability $(1-\alpha)$ of being correct. In the non-Bayesian framework, $\mu$ is a constant, and we look for a statistic $\underline\mu(X)$ that satisfies $P(\underline\mu(X) \le \mu) = 1 - \alpha$ with $1 - \alpha$ equal to .95 or some other desired level of confidence. In our example this is achieved by writing
$$P\left(\sqrt n(\bar X - \mu)/\sigma \le z(1-\alpha)\right) = 1 - \alpha.$$
By solving the inequality inside the probability for $\mu$, we find
$$P\left(\bar X - \sigma z(1-\alpha)/\sqrt n \le \mu\right) = 1 - \alpha,$$
and $\underline\mu(X) = \bar X - \sigma z(1-\alpha)/\sqrt n$ is a lower bound with $P(\underline\mu(X) \le \mu) = 1 - \alpha$. We say that $\underline\mu(X)$ is a lower confidence bound with confidence level $1 - \alpha$.

Similarly, we may be interested in an upper bound on a parameter. In the $N(\mu, \sigma^2)$ example this means finding a statistic $\bar\mu(X)$ such that $P(\bar\mu(X) \ge \mu) = 1 - \alpha$, and a solution is $\bar\mu(X) = \bar X + \sigma z(1-\alpha)/\sqrt n$. Here $\bar\mu(X)$ is called an upper level $(1-\alpha)$ confidence bound for $\mu$.

Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find $a$ such that the probability that the interval $[\bar X - a, \bar X + a]$ contains $\mu$ is $1 - \alpha$. We find such an interval by noting
$$P\left(|\sqrt n(\bar X - \mu)/\sigma| \le z(1-\tfrac12\alpha)\right) = 1 - \alpha,$$
which gives the level $(1-\alpha)$ confidence interval $[\mu_-(X), \mu_+(X)]$ for $\mu$ with
$$\mu_\pm(X) = \bar X \pm \sigma z(1-\tfrac12\alpha)/\sqrt n.$$

In general, if $\nu = \nu(P)$, $P \in \mathcal P$, is a parameter, and $X \sim P$, $X \in R^q$, it may not be possible for a bound or interval to achieve exactly probability $(1-\alpha)$ for a prescribed $(1-\alpha)$ such as .95. In such situations, we settle for a probability at least $(1-\alpha)$.
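The known-variance bounds and interval above are easy to compute. The following is a minimal sketch in Python; numpy and scipy are our choice of tools here, not the text's, and the data are simulated for illustration.

```python
import numpy as np
from scipy.stats import norm

def normal_known_sigma(x, sigma, alpha=0.05):
    """Level (1 - alpha) bounds and interval for mu when sigma is known."""
    n, xbar = len(x), np.mean(x)
    z1 = norm.ppf(1 - alpha)          # z(1 - alpha)
    z2 = norm.ppf(1 - alpha / 2)      # z(1 - alpha/2)
    lower = xbar - sigma * z1 / np.sqrt(n)    # lower confidence bound
    upper = xbar + sigma * z1 / np.sqrt(n)    # upper confidence bound
    interval = (xbar - sigma * z2 / np.sqrt(n),
                xbar + sigma * z2 / np.sqrt(n))
    return lower, upper, interval

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=25)   # simulated sleep-gain data
print(normal_known_sigma(x, sigma=2.0))
```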
Definition 4.4.1. A statistic $\underline\nu(X)$ is called a level $(1-\alpha)$ lower confidence bound for $\nu$ if for every $P \in \mathcal P$,
$$P[\underline\nu(X) \le \nu] \ge 1 - \alpha.$$
Similarly, $\bar\nu(X)$ is a level $(1-\alpha)$ upper confidence bound for $\nu$ if for every $P \in \mathcal P$,
$$P[\bar\nu(X) \ge \nu] \ge 1 - \alpha.$$
Moreover, the random interval $[\underline\nu(X), \bar\nu(X)]$ formed by a pair of statistics $\underline\nu(X), \bar\nu(X)$ is a level $(1-\alpha)$ or $100(1-\alpha)\%$ confidence interval for $\nu$ if, for all $P \in \mathcal P$,
$$P[\underline\nu(X) \le \nu \le \bar\nu(X)] \ge 1 - \alpha.$$

The quantities on the left are called the probabilities of coverage and $(1-\alpha)$ is called a confidence level. For a given bound or interval, the confidence level is clearly not unique because any number $(1-\alpha)' \le (1-\alpha)$ will be a confidence level if $(1-\alpha)$ is. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level. Note that in the case of intervals this is just $\inf\{P[\underline\nu(X) \le \nu \le \bar\nu(X)] : P \in \mathcal P\}$ (i.e., the minimum probability of coverage). For the normal measurement problem we have just discussed, the probability of coverage is independent of $P$ and equals the confidence coefficient.

In the preceding discussion we used the fact that $Z(\mu) = \sqrt n(\bar X - \mu)/\sigma$ has a $N(0,1)$ distribution to obtain a confidence interval for $\mu$ by solving $-z(1-\frac12\alpha) \le Z(\mu) \le z(1-\frac12\alpha)$ for $\mu$. In this process $Z(\mu)$ is called a pivot. In general, finding confidence intervals (or bounds) often involves finding appropriate pivots.

Example 4.4.1. The (Student) t Interval and Bounds. Let $X_1, \ldots, X_n$ be a sample from a $N(\mu, \sigma^2)$ population. Now we turn to the $\sigma^2$ unknown case and propose the pivot $T(\mu)$ obtained by replacing $\sigma$ in $Z(\mu)$ by its estimate $s$, where
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$$
To obtain the distribution of $T(\mu)$, note that $Z(\mu) = \sqrt n(\bar X - \mu)/\sigma$ has a $N(0,1)$ distribution and is, by Theorem B.3.3, independent of $V = (n-1)s^2/\sigma^2$, which has a $\chi^2_{n-1}$ distribution. We conclude from the definition of the (Student) t distribution in Section B.3.1 that
$$T(\mu) = \frac{\sqrt n(\bar X - \mu)}{s} = \frac{Z(\mu)}{\sqrt{V/(n-1)}}$$
has the t distribution $\mathcal T_{n-1}$, whatever be $\mu$ and $\sigma^2$. Let $t_k(p)$ denote the $p$th quantile of the $\mathcal T_k$ distribution. Then
$$P\left(-t_{n-1}(1-\tfrac12\alpha) \le T(\mu) \le t_{n-1}(1-\tfrac12\alpha)\right) = 1 - \alpha.$$
Solving the inequality inside the probability for $\mu$, we find
$$P\left[\bar X - s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt n \le \mu \le \bar X + s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt n\right] = 1 - \alpha. \qquad (4.4.1)$$
The shortest level $(1-\alpha)$ confidence interval of the type $[\bar X - sc/\sqrt n,\ \bar X + sc/\sqrt n]$ is, thus, (4.4.1). Similarly,
$$\bar X - s\,t_{n-1}(1-\alpha)/\sqrt n \quad \text{and} \quad \bar X + s\,t_{n-1}(1-\alpha)/\sqrt n \qquad (4.4.2)$$
are natural lower and upper confidence bounds with confidence coefficients $(1-\alpha)$.

To calculate the coefficients $t_{n-1}(1-\frac12\alpha)$ and $t_{n-1}(1-\alpha)$, we use a calculator, computer software, or Tables I and II. For instance, if $n = 9$ and $\alpha = 0.01$, we enter Table II to find that the probability that a $\mathcal T_{n-1}$ variable exceeds 3.355 is .005. Hence, $t_{n-1}(1-\frac12\alpha) = 3.355$ and $[\bar X - 3.355s/3,\ \bar X + 3.355s/3]$ is the desired level 0.99 confidence interval. From the results of Section B.7 we see that as $n \to \infty$ the $\mathcal T_{n-1}$ distribution converges in law to the standard normal distribution; for the usual values of $\alpha$, $t_{n-1}(p)$ is fairly close to the standard normal quantile $z(p)$, and we can reasonably replace $t_{n-1}(p)$ by $z(p)$, for $n > 120$.

Up to this point, we have assumed that $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$. It turns out that the distribution of the pivot $T(\mu)$ is fairly close to the $\mathcal T_{n-1}$ distribution if the $X$'s have a distribution that is nearly symmetric and whose tails are not much heavier than the normal. In this case the interval (4.4.1) has confidence coefficient close to $1 - \alpha$. On the other hand, for very skew distributions such as the $\chi^2$ with few degrees of freedom, or very heavy-tailed distributions such as the Cauchy, the coverage probability of (4.4.1) can differ substantially from $1 - \alpha$. If we assume $\sigma^2 < \infty$, the interval will have probability $(1-\alpha)$ in the limit as $n \to \infty$. The properties of confidence intervals such as (4.4.1) in non-Gaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5. □
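As a quick illustration, here is a hedged sketch of the interval (4.4.1) and the one-sided bounds (4.4.2) in Python; again the scipy calls are our choice of tools, and the function name is ours.

```python
import numpy as np
from scipy.stats import t

def t_interval(x, alpha=0.05):
    """Student t interval (4.4.1) and one-sided bounds (4.4.2) for mu."""
    n = len(x)
    xbar, s = np.mean(x), np.std(x, ddof=1)   # ddof=1 gives the divisor n-1
    half = s * t.ppf(1 - alpha / 2, df=n - 1) / np.sqrt(n)
    lcb = xbar - s * t.ppf(1 - alpha, df=n - 1) / np.sqrt(n)
    ucb = xbar + s * t.ppf(1 - alpha, df=n - 1) / np.sqrt(n)
    return (xbar - half, xbar + half), lcb, ucb

rng = np.random.default_rng(1)
print(t_interval(rng.normal(size=9), alpha=0.01))  # uses t_8(0.995) = 3.355
```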
Example 4.4.2. Confidence Intervals and Bounds for the Variance of a Normal Distribution. Suppose that $X_1, \ldots, X_n$ is a sample from a $N(\mu, \sigma^2)$ population. By Theorem B.3.3, $V(\sigma^2) = (n-1)s^2/\sigma^2$ has a $\chi^2_{n-1}$ distribution and can be used as a pivot. Thus, if we let $x(\beta)$ denote the $\beta$th quantile of the $\chi^2_{n-1}$ distribution, and if $\alpha_1 + \alpha_2 = \alpha$, then
$$P(x(\alpha_1) \le V(\sigma^2) \le x(1-\alpha_2)) = 1 - \alpha.$$
By solving the inequality inside the probability for $\sigma^2$ we find that
$$\left[(n-1)s^2/x(1-\alpha_2),\ (n-1)s^2/x(\alpha_1)\right]$$
is a confidence interval with confidence coefficient $(1-\alpha)$. The length of this interval is random, and there is no unique choice of $\alpha_1$ and $\alpha_2$ which uniformly minimizes expected length among all intervals of this type. It may be shown that for $n$ large, taking $\alpha_1 = \alpha_2 = \frac12\alpha$ is not far from optimal (Tate and Klett, 1959). The pivot $V(\sigma^2)$ similarly yields the respective lower and upper confidence bounds $(n-1)s^2/x(1-\alpha)$ and $(n-1)s^2/x(\alpha)$.

In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for $\sigma^2$ do not have confidence coefficient $1 - \alpha$ even in the limit as $n \to \infty$. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution. In Problem 1.16 we give an interval with correct limiting coverage probability. □
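A minimal sketch of the equal-tailed variance interval of Example 4.4.2, under the same caveat that numpy/scipy and the function name are our additions:

```python
import numpy as np
from scipy.stats import chi2

def normal_variance_interval(x, alpha=0.05):
    """Equal-tailed chi-square interval for sigma^2 (Example 4.4.2)."""
    n = len(x)
    s2 = np.var(x, ddof=1)            # (n-1)^{-1} * sum (x_i - xbar)^2
    lo = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)
    hi = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)
    return lo, hi
```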
The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in $n$ Bernoulli Trials. If $X_1, \ldots, X_n$ are the indicators of $n$ Bernoulli trials with probability of success $\theta$, then $\bar X$ is the MLE of $\theta$. There is no natural "exact" pivot based on $\bar X$ and $\theta$. However, by the De Moivre-Laplace theorem, $\sqrt n(\bar X - \theta)/\sqrt{\theta(1-\theta)}$ has approximately a $N(0,1)$ distribution, which typically is all we need. If we use this function as an "approximate" pivot and let $\approx$ denote "approximate equality," we can write
$$P\left[-z(1-\tfrac12\alpha) \le \frac{\sqrt n(\bar X - \theta)}{\sqrt{\theta(1-\theta)}} \le z(1-\tfrac12\alpha)\right] \approx 1 - \alpha.$$
Let $k_\alpha = z(1-\frac12\alpha)$ and observe that this is equivalent to
$$P[g(\theta, \bar X) \le 0] \approx 1 - \alpha, \qquad (4.4.3)$$
where
$$g(\theta, \bar X) = \left(1 + \frac{k_\alpha^2}{n}\right)\theta^2 - \left(2\bar X + \frac{k_\alpha^2}{n}\right)\theta + \bar X^2.$$
For fixed $0 < \bar X < 1$, $g(\theta, \bar X)$ is a quadratic polynomial with two real roots. In terms of $S = n\bar X$, they are(1)
$$\underline\theta(X) = \left\{S + \tfrac12 k_\alpha^2 - k_\alpha\sqrt{[S(n-S)/n] + k_\alpha^2/4}\right\} / (n + k_\alpha^2),$$
$$\bar\theta(X) = \left\{S + \tfrac12 k_\alpha^2 + k_\alpha\sqrt{[S(n-S)/n] + k_\alpha^2/4}\right\} / (n + k_\alpha^2). \qquad (4.4.4)$$
Because the coefficient of $\theta^2$ in $g(\theta, \bar X)$ is greater than zero, $\{\theta : g(\theta, \bar X) \le 0\} = [\underline\theta(X), \bar\theta(X)]$, so that $[\underline\theta(X), \bar\theta(X)]$ is an approximate level $(1-\alpha)$ confidence interval for $\theta$. We can similarly show that the endpoints of the level $(1-2\alpha)$ interval are approximate upper and lower level $(1-\alpha)$ confidence bounds.

Note that in this example we can determine the sample size needed for desired accuracy. To see this, note that the length, say $l$, of the interval is
$$l = 2k_\alpha\left\{[S(n-S)/n] + k_\alpha^2/4\right\}^{1/2}(n + k_\alpha^2)^{-1}.$$
Now use the fact that $(S - \frac12 n)^2 \ge 0$ implies $S(n-S)/n \le \frac14 n$ to conclude that
$$l \le k_\alpha\sqrt{n + k_\alpha^2}\,(n + k_\alpha^2)^{-1} = k_\alpha(n + k_\alpha^2)^{-1/2} \le k_\alpha/\sqrt n. \qquad (4.4.5)$$
Thus, to bound $l$ above by $l_0$, we can choose
$$n = k_\alpha^2/l_0^2. \qquad (4.4.6)$$

For instance, consider the market researcher whose interest is the proportion $\theta$ of a population that will buy a product. He draws a sample of $n$ potential customers, calls willingness to buy success, and uses the preceding model. He can then determine how many customers should be sampled so that (4.4.4) has length at most 0.02 with confidence level approximately 0.95. Taking $1 - \frac12\alpha = 0.975$, so $k_\alpha = 1.96$, we can achieve the desired length 0.02 by choosing $n$ so that
$$n = \left(\frac{1.96}{0.02}\right)^2 = 9{,}604.$$
This formula for the sample size is very crude because (4.4.5) is used and it is only good when $\theta$ is near 1/2. Better results can be obtained if one has upper or lower bounds on $\theta$ such as $\theta \le \theta_0 < \frac12$ or $\theta \ge \theta_1 > \frac12$ (Problem 4.4.16).

Another approximate pivot for this example is $\sqrt n(\bar X - \theta)/\sqrt{\bar X(1-\bar X)}$. This leads to the simple interval
$$\bar X \pm k_\alpha\sqrt{\bar X(1-\bar X)/n}. \qquad (4.4.7)$$
See Brown, Cai, and DasGupta (2000) for a discussion. These intervals and bounds are satisfactory in practice for the usual levels if the smaller of $n\theta$, $n(1-\theta)$ is at least 6. For small $n$, it is better to use the exact level $(1-\alpha)$ procedure developed in Section 4.5. □
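The roots (4.4.4) and the crude sample-size formula (4.4.6) are easy to compute. A sketch, with scipy as our assumed tool and both function names ours:

```python
import numpy as np
from scipy.stats import norm

def approx_binomial_interval(S, n, alpha=0.05):
    """Roots (4.4.4) of g(theta, Xbar) = 0: the interval of Example 4.4.3."""
    k = norm.ppf(1 - alpha / 2)
    center = S + k**2 / 2
    half = k * np.sqrt(S * (n - S) / n + k**2 / 4)
    return (center - half) / (n + k**2), (center + half) / (n + k**2)

def sample_size_for_length(l0, alpha=0.05):
    """Smallest n with k_alpha / sqrt(n) <= l0, as in (4.4.6)."""
    return int(np.ceil((norm.ppf(1 - alpha / 2) / l0) ** 2))

print(approx_binomial_interval(S=60, n=100))
print(sample_size_for_length(0.02))   # 9604, the market-research example
```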
Confidence Regions for Functions of Parameters

We can define confidence regions for a function $q(\theta)$ as random subsets of the range of $q$ that cover the true value of $q(\theta)$ with probability at least $(1-\alpha)$. Note that if $C(X)$ is a level $(1-\alpha)$ confidence region for $\theta$, then
$$q(C(X)) = \{q(\theta) : \theta \in C(X)\}$$
is a level $(1-\alpha)$ confidence region for $q(\theta)$. However, if $q$ is not 1-1, this technique is typically wasteful; that is, we can find confidence regions for $q(\theta)$ entirely contained in $q(C(X))$ with confidence level $(1-\alpha)$. For instance, if $q(\theta) = \theta_1$, we will later give confidence regions $C(X)$ for pairs $\theta = (\theta_1, \theta_2)$ such that $q(C(X))$ is larger than the confidence set obtained by focusing on $\theta_1$ alone.

Example 4.4.4. Let $X_1, \ldots, X_n$ denote the number of hours a sample of internet subscribers spend per week on the Internet. Suppose $X_1, \ldots, X_n$ is modeled as a sample from an exponential distribution with $F(x) = 1 - \exp\{-x/\theta\}$, and suppose we want a confidence interval for the population proportion $P(X \ge x)$ of subscribers that spend at least $x$ hours per week on the Internet. Here $q(\theta) = 1 - F(x) = \exp\{-x/\theta\}$. By Problem B.3.4, $2n\bar X/\theta$ has a chi-square, $\chi^2_{2n}$, distribution. By using $2n\bar X/\theta$ as a pivot we find the $(1-\alpha)$ confidence interval
$$2n\bar X/x\left(1-\tfrac12\alpha\right) \le \theta \le 2n\bar X/x\left(\tfrac12\alpha\right)$$
where $x(\beta)$ denotes the $\beta$th quantile of the $\chi^2_{2n}$ distribution. Let $\underline\theta$ and $\bar\theta$ denote the lower and upper boundaries of this interval. Then
$$\exp\{-x/\underline\theta\} \le q(\theta) \le \exp\{-x/\bar\theta\}$$
is a confidence interval for $q(\theta)$ with confidence coefficient $(1-\alpha)$. □
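A sketch of the two-step construction in Example 4.4.4 (pivot interval for $\theta$, then monotone transformation), with scipy assumed and the function name ours:

```python
import numpy as np
from scipy.stats import chi2

def exponential_tail_interval(x, x0, alpha=0.05):
    """CI for q(theta) = P(X >= x0) = exp(-x0/theta), via the pivot 2n*xbar/theta."""
    n, xbar = len(x), np.mean(x)
    theta_lo = 2 * n * xbar / chi2.ppf(1 - alpha / 2, df=2 * n)
    theta_hi = 2 * n * xbar / chi2.ppf(alpha / 2, df=2 * n)
    # q is increasing in theta, so the endpoints map monotonically
    return np.exp(-x0 / theta_lo), np.exp(-x0 / theta_hi)
```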
Confidence Regions of Higher Dimension

We can extend the notion of a confidence interval for one-dimensional functions $q(\theta)$ to $r$-dimensional vectors $q(\theta) = (q_1(\theta), \ldots, q_r(\theta))$. Suppose $\underline q_j(X)$ and $\bar q_j(X)$ are real-valued. Then the $r$-dimensional random rectangle
$$I(X) = \{q(\theta) : \underline q_j(X) \le q_j(\theta) \le \bar q_j(X),\ j = 1, \ldots, r\}$$
is said to be a level $(1-\alpha)$ confidence region if the probability that it covers the unknown but fixed true $(q_1(\theta), \ldots, q_r(\theta))$ is at least $(1-\alpha)$. We write this as $P[q(\theta) \in I(X)] \ge 1 - \alpha$.

Note that if $I_j(X) = [\underline q_j(X), \bar q_j(X)]$ is a level $(1-\alpha_j)$ confidence interval for $q_j(\theta)$ and if the pairs $(\underline q_1, \bar q_1), \ldots, (\underline q_r, \bar q_r)$ are independent, then the rectangle $I(X) = I_1(X) \times \cdots \times I_r(X)$ has level
$$\prod_{j=1}^r (1-\alpha_j). \qquad (4.4.8)$$
Thus, if we choose $\alpha_j = 1 - (1-\alpha)^{1/r}$, then $I(X)$ has confidence level $(1-\alpha)$; that is, an $r$-dimensional confidence rectangle is in this case automatically obtained from the one-dimensional intervals.

An approach that works even if the $I_j$ are not independent is to use Bonferroni's inequality (A.2.7). According to this inequality,
$$P[q(\theta) \in I(X)] \ge 1 - \sum_{j=1}^r P[q_j(\theta) \notin I_j(X)] \ge 1 - \sum_{j=1}^r \alpha_j.$$
Thus, if we choose $\alpha_j = \alpha/r$, $j = 1, \ldots, r$, then $I(X)$ has confidence level $(1-\alpha)$.

Example 4.4.5. Confidence Rectangle for the Parameters of a Normal Distribution. Suppose $X_1, \ldots, X_n$ is a $N(\mu, \sigma^2)$ sample and we want a confidence rectangle for $(\mu, \sigma^2)$. From Example 4.4.1,
$$I_1(X) = \bar X \pm s\,t_{n-1}(1-\tfrac14\alpha)/\sqrt n$$
is a confidence interval for $\mu$ with confidence coefficient $(1-\frac12\alpha)$. From Example 4.4.2,
$$I_2(X) = \left[\frac{(n-1)s^2}{x_{n-1}(1-\tfrac14\alpha)},\ \frac{(n-1)s^2}{x_{n-1}(\tfrac14\alpha)}\right]$$
is a reasonable confidence interval for $\sigma^2$ with confidence coefficient $(1-\frac12\alpha)$. Thus, $I_1(X) \times I_2(X)$ is a level $(1-\alpha)$ confidence rectangle for $(\mu, \sigma^2)$. It is possible to compute the exact confidence coefficient; see Problem 4.4.15. □

Example 4.4.6. The method of pivots can also be applied to infinite-dimensional parameters such as $F$. Suppose $X_1, \ldots, X_n$ are i.i.d. as $X \sim P$, where $\mathcal P$ is the set of $P$ with $P(-\infty, t]$ continuous in $t$, and we are interested in the distribution function $F(t) = P(X \le t)$; that is, $\nu(P) = F(\cdot)$. We assume that $F$ is continuous, in which case (Proposition 4.1.1) the distribution of
$$D_n(F) = \sup_{t \in R}|\hat F(t) - F(t)|$$
does not depend on $F$ and is known. Thus, $D_n(F)$ is a pivot. Let $d_\alpha$ be chosen such that $P_F(D_n(F) \le d_\alpha) = 1-\alpha$. Then by solving $D_n(F) \le d_\alpha$ for $F$, we find that a simultaneous in $t$ size $1-\alpha$ confidence region $C(x)(\cdot)$ is the confidence band which, for each $t \in R$, consists of the interval
$$C(x)(t) = \left(\max\{0, \hat F(t) - d_\alpha\},\ \min\{1, \hat F(t) + d_\alpha\}\right).$$
We have shown
$$P\left(C(X)(t) \ni F(t)\ \text{for all } t \in R\right) = 1-\alpha \quad \text{for all } P \in \mathcal P. \ \square$$
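A sketch of the band of Example 4.4.6. The critical value $d_\alpha$ is a quantile of the Kolmogorov statistic; we assume here that scipy's kstwo distribution (its exact one-sample null law, available in recent scipy versions) can supply it, and the function name is ours.

```python
import numpy as np
from scipy.stats import kstwo

def ks_confidence_band(x, alpha=0.05):
    """Simultaneous (1 - alpha) band for F, evaluated at the order statistics."""
    xs = np.sort(x)
    n = len(xs)
    d = kstwo.ppf(1 - alpha, n)            # critical value d_alpha
    Fhat = np.arange(1, n + 1) / n         # empirical cdf at the order statistics
    lower = np.clip(Fhat - d, 0.0, 1.0)
    upper = np.clip(Fhat + d, 0.0, 1.0)
    return xs, lower, upper
```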
We can apply the notions studied in the preceding examples to give confidence regions for scalar or vector parameters in nonparametric models.

Example 4.4.7. A Lower Confidence Bound for the Mean of a Nonnegative Random Variable. Suppose $X_1, \ldots, X_n$ are i.i.d. as $X$ and that $X$ has a density $f(t) = F'(t)$, which is zero for $t < 0$ and nonzero for $t \ge 0$. By integration by parts, if
$$\mu = \mu(F) = \int_0^\infty t f(t)\,dt$$
exists, then
$$\mu(F) = \int_0^\infty [1 - F(t)]\,dt. \qquad (4.4.9)$$
Let $F^-(t)$ and $F^+(t)$ be the lower and upper simultaneous confidence boundaries of Example 4.4.6. Then a $(1-\alpha)$ lower confidence bound for $\mu$, given by (4.4.9), is
$$\underline\mu = \inf\{\mu(F) : F \in C(X)\} = \mu(F^+),$$
because $\sup\{\mu(F) : F \in C(X)\} = \mu(F^-) = \infty$ (see Problem 4.4.19). Intervals for the case $F$ supported on an interval (see Problem 4.4.18) arise in accounting practice (see Bickel, 1992), where such bounds are discussed and shown to be asymptotically strictly conservative. □

Summary. In a parametric model $\{P_\theta : \theta \in \Theta\}$, a level $1-\alpha$ confidence region for a parameter $q(\theta)$ is a set $C(x)$ depending only on the data $x$ such that the probability under $P_\theta$ that $C(X)$ covers $q(\theta)$ is at least $1-\alpha$ for all $\theta \in \Theta$. For a nonparametric class $\mathcal P = \{P\}$ and parameter $\nu = \nu(P)$, we similarly require $P(C(X) \ni \nu) \ge 1-\alpha$ for all $P \in \mathcal P$. We define lower and upper confidence bounds (LCBs and UCBs), confidence intervals, and more generally confidence regions. We derive the (Student) t interval for $\mu$ in the $N(\mu, \sigma^2)$ model with $\sigma^2$ unknown, and we derive a confidence interval for the binomial parameter. In a nonparametric setting we derive a simultaneous confidence interval for the distribution function $F(t)$ and the mean of a positive variable $X$.

4.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS

Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least $1-\alpha$. Acceptance regions of statistical tests are, for a given hypothesis $H$, subsets of the sample space with probability of accepting $H$ at least $1-\alpha$ when $H$ is true. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses. We begin by illustrating the duality in the following example.

Example 4.5.1. Two-Sided Tests for the Mean of a Normal Distribution. Suppose that an established theory postulates the value $\mu_0$ for a certain physical constant. A scientist has reasons to believe that the theory is incorrect and measures the constant $n$ times, obtaining measurements $X_1, \ldots, X_n$.
in Example 4.1 (1.2.. the postulated value JLo is a member of the level (1 0' ) confidence interval [X  Slnl (1 ~a) /vn. generating a family of level a tests.I n .a.4. then it is reasonable to formulate the problem as that of testing H : fl = Jlo versus K : fl i. generated a family of level a tests {J(X.2) we obtain the confidence interval (4.Q) confidence interval (4. This test is called twosided because it rejects for both large and small values of the statistic T. Let v = v{P) be a parameter that takes values in the set N. /l) ~ O. (4.I n _ 1 (1. .00).242 Testing and Confidence Regions Chapter 4 measurements Xl .1 (1 . .l other than flo is a possible alternative..4.a)s/vn > /l..1 (1 . Because the same interval (4. t n .. by starting with the test (4. (4.. /l = /l(P) takes values in N = (00.a)s/ vn and define J'(X.4. in Example 4. all PEP. Because p"IITI = I n. if and only if. Evidently.00).l (1. We accept H. O{X. We can base a size Q' test on the level (1 .!a) < T < I n.a.5'[) If we let T = vn(X . i:. = 1. in fact. where F is the distribution function of Xi' Here an example of N is the class of all continuous distribution functions. and in Example 4..5. For instance.6. • j n . if we start out with (say) the level (1 .~a).1) is used for every flo we see that we have.2) takes values in N = (00 .~a)] = 0 the test is equivalently characterized by rejecting H when ITI > tnl (1 . We achieve a similar effect.. /l)} where lifvnIX.2(p) takes values in N = (0.2 = . Let S = S{X) be a map from X to subsets of N. 00) X (0. that is P[v E S(X)] > 1 . as in Example 4.5.00).fLo)/ s.1) by finding the set of /l where J(X. If any value of J.flO. (/" . if and only if. 1..a) confidence region for v if the probability that S(X) contains v is at least (1 . and only if. Xl!_ Knowledge of his instruments leads him to assume that the Xi are independent and identically distributed normal random variables with mean {L and variance a 2 .~1 >tn_l(l~a) ootherwise. In contrast to the tests of Example 4. .5. consider v(P) = F.1) we constructed for Jl as follows.5. Consider the general framework where the random vector X takes values in the sample space X C Rq and X has distribution PEP. X .4.1. X + Slnl (1  ~a) /vn]. /l) to equal 1 if. Jlo) being of size a only for the hypothesis H : Jl = flo· Conversely. o These are examples of a general phenomenon. then our test accepts H.2) These tests correspond to different hypotheses. it has power against parameter values on either side of flo. For a function space example.~Ct).a) LCB X .4.a). then S is a (I ..5.4.
Next consider the testing framework where we test the hypothesis $H = H_{\nu_0} : \nu = \nu_0$ for some specified value $\nu_0$. Suppose we have a test $\delta(X, \nu_0)$ with level $\alpha$. Formally, for each specified $\nu_0 \in \mathcal N$, let $\mathcal P_{\nu_0} = \{P : \nu(P) = \nu_0\}$. Then the acceptance region
$$A(\nu_0) = \{x : \delta(x, \nu_0) = 0\}$$
is a subset of $\mathcal X$ with probability at least $1-\alpha$ under each $P \in \mathcal P_{\nu_0}$. Consider the set of $\nu_0$ for which $H_{\nu_0}$ is accepted,
$$S(X) = \{\nu_0 \in \mathcal N : X \in A(\nu_0)\};$$
for some specified $\nu_0$, $H$ may be accepted, for others, $H$ may be rejected. This is a random set contained in $\mathcal N$ with probability at least $1-\alpha$ of containing the true value of $\nu(P)$ whatever be $P$. We have the following.

Duality Theorem. Let $S(X) = \{\nu_0 \in \mathcal N : X \in A(\nu_0)\}$. Then $P[X \in A(\nu_0)] \ge 1-\alpha$ for all $P \in \mathcal P_{\nu_0}$ if and only if $S(X)$ is a $1-\alpha$ confidence region for $\nu$. Conversely, if $S(X)$ is a level $1-\alpha$ confidence region for $\nu$, then the test that accepts $H_{\nu_0}$ if and only if $\nu_0$ is in $S(X)$ is a level $\alpha$ test for $H_{\nu_0}$.

We next give connections between confidence bounds, acceptance regions, and p-values for MLR families. Let $t$ denote the observed value $t = T(x)$ of $T(X)$ for the datum $x$. By Theorem 4.3.1, the acceptance region of the UMP size $\alpha$ test of $H : \theta = \theta_0$ versus $K : \theta > \theta_0$ can be written
$$A(\theta_0) = \{x : T(x) \le t_{\theta_0}(1-\alpha)\}$$
where $t_{\theta_0}(1-\alpha)$ is the $1-\alpha$ quantile of $F_{\theta_0}$, the distribution function of $T$ under $P_{\theta_0}$.

Theorem 4.5.1. Suppose $X \sim P_\theta$ where $\{P_\theta : \theta \in \Theta\}$ is MLR in $T = T(X)$ and suppose that the distribution function $F_\theta(t)$ of $T$ under $P_\theta$ is continuous in each of the variables $t$ and $\theta$ when the other is fixed. If the equation $F_\theta(t) = 1-\alpha$ has a solution $\underline\theta_\alpha(t)$ in $\Theta$, then $\underline\theta_\alpha(T)$ is a lower confidence bound for $\theta$ with confidence coefficient $1-\alpha$. Similarly, any solution $\bar\theta_\alpha(T)$ of $F_\theta(T) = \alpha$ is an upper confidence bound for $\theta$ with coefficient $(1-\alpha)$. Moreover, if $\alpha_1 + \alpha_2 < 1$, then $[\underline\theta_{\alpha_1}, \bar\theta_{\alpha_2}]$ is a confidence interval for $\theta$ with confidence coefficient $1 - (\alpha_1 + \alpha_2)$.

Proof. By Corollary 4.3.1, the power function $P_\theta(T \ge t) = 1 - F_\theta(t)$ for a test with critical constant $t$ is increasing in $\theta$; that is, $F_\theta(t)$ is decreasing in $\theta$. By applying $F_\theta$ to both sides of $t \le t_\theta(1-\alpha)$ we find
$$S(t) = \{\theta \in \Theta : F_\theta(t) \le 1-\alpha\}.$$
It follows that $F_\theta(t) \le 1-\alpha$ iff $\theta \ge \underline\theta_\alpha(t)$, and $S(t) = [\underline\theta_\alpha(t), \infty)$. By the duality theorem, $\underline\theta_\alpha(T)$ is a lower confidence bound for $\theta$ with confidence coefficient $1-\alpha$. The proofs for the upper confidence bound and interval follow by the same type of argument. □
In general, let $\alpha(t, \nu_0)$ denote the p-value of a test $\delta(T, \nu_0) = 1\{T \ge c\}$ of $H : \nu = \nu_0$ based on a statistic $T = T(X)$ with observed value $t = T(x)$. Then the set
$$C = \{(t, \nu) : \alpha(t, \nu) \ge \alpha\} = \{(t, \nu) : \delta(t, \nu) = 0\}$$
gives the pairs $(t, \nu)$ such that, for the given $t$, $\nu$ will be accepted, and, for the given $\nu$, $t$ is in the acceptance region. We call $C$ the set of compatible $(t, \nu)$ points. In the $(t, \theta)$ plane, vertical sections of $C$ are the confidence regions $S(t)$ whereas horizontal sections are the acceptance regions $A^*(\nu) = \{t : \delta(t, \nu) = 0\}$.

Corollary 4.5.1. Under the conditions of Theorem 4.5.1, for $\alpha \in (0,1)$, let
$$\alpha(t, \theta_0) = P_{\theta_0}(T \ge t) = 1 - F_{\theta_0}(t)$$
denote the p-value for the UMP size $\alpha$ test of $H : \theta = \theta_0$ versus $K : \theta > \theta_0$, and let $A^*(\theta) = T(A(\theta)) = \{T(x) : x \in A(\theta)\}$. Then
$$S(t) = \{\theta : \alpha(t, \theta) > \alpha\}, \qquad A^*(\theta) = \{t : \alpha(t, \theta) > \alpha\}.$$

Proof. We have seen in the proof of Theorem 4.5.1 that $1 - F_\theta(t)$ is increasing in $\theta$. Because $F_\theta(t)$ is a distribution function, $1 - F_\theta(t)$ is decreasing in $t$. The result follows. □

We illustrate these ideas using the example of testing $H : \mu = \mu_0$ when $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. Let $T = \bar X$. Then
$$C = \{(t, \mu) : |t - \mu| \le \sigma z(1-\tfrac12\alpha)/\sqrt n\}.$$
Figure 4.5.1 shows the set $C$, a confidence region $S(t_0)$, and an acceptance set $A^*(\mu_0)$ for this example.

Example 4.5.2. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. Let $X_1, \ldots, X_n$ be the indicators of $n$ binomial trials with probability of success $\theta$, and let $S = \sum_{i=1}^n X_i$. We shall use some of the results derived in Example 4.1.3. We seek reasonable exact level $(1-\alpha)$ upper and lower confidence bounds and confidence intervals for $\theta$, for $\alpha \in (0,1)$, $\theta_0 \in (0,1)$.

To find a lower confidence bound for $\theta$ our preceding discussion leads us to consider level $\alpha$ tests for $H : \theta \le \theta_0$, which reject when $S \ge k(\theta_0, \alpha)$, where $k(\theta_0, \alpha)$ denotes the critical constant of a level $\alpha$ test of $H$. The corresponding level $(1-\alpha)$ confidence region is given by
$$C(X_1, \ldots, X_n) = \{\theta : S < k(\theta, \alpha)\}.$$
To analyze the structure of the region we need to examine $k(\theta, \alpha)$. We claim that

(i) $k(\theta, \alpha)$ is nondecreasing in $\theta$;

(ii) $k(\theta, \alpha) \to k(\theta_0, \alpha)$ if $\theta \downarrow \theta_0$;

(iii) $k(\theta, \alpha)$ increases by exactly 1 at its points of discontinuity;

(iv) $k(0, \alpha) = 1$ and $k(1, \alpha) = n + 1$.
Figure 4.5.1. The shaded region is the compatibility set $C$ for the two-sided test of $H_{\mu_0} : \mu = \mu_0$ in the normal model. $S(t_0)$ is a confidence interval for $\mu$ for a given value $t_0$ of $T$, whereas $A^*(\mu_0)$ is the acceptance region for $H_{\mu_0}$.

To prove (i) note that it was shown in Theorem 4.3.1(i) that $P_\theta[S \ge j]$ is nondecreasing in $\theta$ for fixed $j$. Clearly, it is also nonincreasing in $j$ for fixed $\theta$. Therefore, $\theta_1 < \theta_2$ and $k(\theta_1, \alpha) > k(\theta_2, \alpha)$ would imply that
$$\alpha \ge P_{\theta_2}[S \ge k(\theta_2, \alpha)] \ge P_{\theta_1}[S \ge k(\theta_2, \alpha)] \ge P_{\theta_1}[S \ge k(\theta_1, \alpha) - 1] > \alpha,$$
a contradiction.

The assertion (ii) is a consequence of the following remarks. If $\theta_0$ is a discontinuity point of $k(\theta, \alpha)$, let $j$ be the limit of $k(\theta, \alpha)$ as $\theta \downarrow \theta_0$. Then $P_\theta[S \ge j] \le \alpha$ for all $\theta > \theta_0$ and, hence, $P_{\theta_0}[S \ge j] \le \alpha$. On the other hand, $P_{\theta_0}[S \ge j - 1] \ge \alpha$, so that $j = k(\theta_0, \alpha)$. The claims (iii) and (iv) are left as exercises.

From (i), (ii), (iii), and (iv) we see that, if we define
$$\underline\theta(S) = \inf\{\theta : k(\theta, \alpha) = S + 1\},$$
then
$$C(X) = \begin{cases} (\underline\theta(S), 1] & \text{if } S > 0 \\ [0, 1] & \text{if } S = 0 \end{cases}$$
and $\underline\theta(S)$ is the desired level $(1-\alpha)$ LCB for $\theta$.(2) From our discussion, when $S > 0$, we find $\underline\theta(S)$ as the unique solution of the equation
$$\sum_{r=S}^n \binom{n}{r}\theta^r(1-\theta)^{n-r} = \alpha.$$
When $S = 0$, $\underline\theta(S) = 0$.

Similarly, inverting the level $\alpha$ tests of $H : \theta \ge \theta_0$, which reject when $S \le j(\theta_0, \alpha)$, we define
$$\bar\theta(S) = \sup\{\theta : j(\theta, \alpha) = S - 1\}.$$
Then $\bar\theta(S)$ is a level $(1-\alpha)$ UCB for $\theta$ and, when $S < n$, $\bar\theta(S)$ is the unique solution of
$$\sum_{r=0}^S \binom{n}{r}\theta^r(1-\theta)^{n-r} = \alpha.$$
When $S = n$, $\bar\theta(S) = 1$.

Putting the bounds $\underline\theta(S)$, $\bar\theta(S)$ together we get the confidence interval $[\underline\theta(S), \bar\theta(S)]$ of level $(1-2\alpha)$. These intervals can be obtained from computer packages that use algorithms based on the preceding considerations. As might be expected, if $n$ is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3. □

Figure 4.5.2. Plot of $k(\theta, \alpha)$ for $n = 2$, $\alpha = 0.16$.
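Numerically, the two equations defining $\underline\theta(S)$ and $\bar\theta(S)$ can be solved with a root finder or, using the standard identity relating binomial tail probabilities to the incomplete beta function, expressed through beta quantiles. The following sketch takes the latter route; scipy and the function name are our additions.

```python
from scipy.stats import beta

def exact_binomial_bounds(S, n, alpha=0.05):
    """Level (1 - alpha) LCB and UCB for theta from Example 4.5.2.
    Uses P_theta(S' >= S) = I_theta(S, n - S + 1) (regularized incomplete beta)."""
    lower = beta.ppf(alpha, S, n - S + 1) if S > 0 else 0.0
    upper = beta.ppf(1 - alpha, S + 1, n - S) if S < n else 1.0
    return lower, upper   # [lower, upper] has level 1 - 2*alpha

print(exact_binomial_bounds(S=7, n=10, alpha=0.05))
```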
Applications of Confidence Intervals to Comparisons and Selections

We have seen that confidence intervals lead naturally to two-sided tests. However, two-sided tests seem incomplete in the sense that if $H : \theta = \theta_0$ is rejected in favor of $K : \theta \ne \theta_0$, we usually want to know whether $\theta > \theta_0$ or $\theta < \theta_0$. The problem of deciding whether $\theta = \theta_0$, $\theta < \theta_0$, or $\theta > \theta_0$ is an example of a three-decision problem and is a special case of the decision problems in Section 1.3. Here we consider the simple solution suggested by the level $(1-\alpha)$ confidence interval $I$:

1. Make no judgment as to whether $\theta < \theta_0$ or $\theta > \theta_0$ if $I$ contains $\theta_0$;

2. Decide $\theta < \theta_0$ if $I$ is entirely to the left of $\theta_0$; and

3. Decide $\theta > \theta_0$ if $I$ is entirely to the right of $\theta_0$.

In Section 4.4 we considered the level $(1-\alpha)$ confidence interval $\bar X \pm \sigma z(1-\frac12\alpha)/\sqrt n$ for $\mu$. Using this interval and the preceding rule, we obtain the following three-decision rule based on $T = \sqrt n(\bar X - \mu_0)/\sigma$:

Do not reject $H : \mu = \mu_0$ if $|T| \le z(1-\tfrac12\alpha)$.

Decide $\mu < \mu_0$ if $T < -z(1-\tfrac12\alpha)$.

Decide $\mu > \mu_0$ if $T > z(1-\tfrac12\alpha)$. $\qquad (4.5.3)$

For this three-decision rule, the probability of falsely claiming significance of either $\mu < \mu_0$ or $\mu > \mu_0$ is bounded above by $\frac12\alpha$. To see this, consider first the case $\mu \ge \mu_0$. Then the wrong decision "$\mu < \mu_0$" is made when $T < -z(1-\frac12\alpha)$. This event has probability
$$\Phi\left(-z(1-\tfrac12\alpha) - \sqrt n(\mu - \mu_0)/\sigma\right) \le \tfrac12\alpha.$$
Similarly, when $\mu \le \mu_0$, the probability of the wrong decision "$\mu > \mu_0$" is at most $\frac12\alpha$. Thus, the two-sided test can be regarded as the first step in the decision procedure where if $H$ is not rejected, we make no claims of significance, but if $H$ is rejected, we decide whether this is because $\theta$ is smaller or larger than $\theta_0$.

For example, suppose $\theta$ is the expected difference in blood pressure when two treatments, A and B, are given to high blood pressure patients. Because we do not know whether A or B is to be preferred, we test $H : \theta = 0$ versus $K : \theta \ne 0$. If $H$ is rejected, it is natural to carry the comparison of A and B further by asking whether $\theta < 0$ or $\theta > 0$. If we decide $\theta < 0$, then we select A as the better treatment, and vice versa. Thus, by using this kind of procedure in a comparison or selection problem, we can control the probabilities of a wrong selection by setting the $\alpha$ of the parent test or confidence interval. We can use the two-sided tests and confidence intervals introduced in later chapters in similar fashions; a sketch of the rule (4.5.3) follows.
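A minimal sketch of the three-decision rule (4.5.3), with the usual caveat that the Python tooling and function name are ours:

```python
import numpy as np
from scipy.stats import norm

def three_decision(x, mu0, sigma, alpha=0.05):
    """Rule (4.5.3): each wrong-direction claim has probability at most alpha/2."""
    T = np.sqrt(len(x)) * (np.mean(x) - mu0) / sigma
    c = norm.ppf(1 - alpha / 2)
    if T > c:
        return "decide mu > mu0"
    if T < -c:
        return "decide mu < mu0"
    return "no judgment"
```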
.a) LCB for if and only if 0* is a unifonnly most accurate level (1 . 0 I i. they are both very likely to fall below the true (J. We next show that for a certain notion of accuracy of confidence bounds. Suppose X = (X" . Definition 4. I' i' .u2(X) and is.a) LCB 0* of (J is said to be more accurate than a competing level (1 .Q).5.1 continued). for 0') UCB e is more accurate than a competitor e .if. where X(1) < X(2) < . The dual lower confidence bound is 1'.Ilo)/er > z(1 . then the lest that accepts H . If S(x) is a level (1 .2) for all competitors. we say that the bound with the smaller probability of being far below () is more accurate. Using Problem 4. less than 80 .2) Lower confidence bounds e* satisfying (4. B(n. optimality of the tests translates into accuracy of the bounds. ~).248 Testing and Confidence Regions Chapter 4 0("'. which reveals that (4. I' > 1'0 rejects H when ..n.2.' .6.6.1) Similarly. This is a consequence of the following theorem. which is connected to the power of the associated onesided tests..6.Xn and k is defined by P(S > k) = 1 .0') lower confidence bounds for (J. va) ~ 0 is a level (1 . If (J and (J* are two competing level (1 .0) confidence region for va. for any fixed B and all B' < B.z(1 . We also give a connection between confidence intervals. (X) = X . unifonnly most accurate in the N(/L.~'(X) < B'] < P. e. j i I ... where 80 is a specified value.Q for a binomial.1) for all competitors are called uniformly most accurate as are upper confidence bounds satisfying (4.I (X) is /L more accurate than . Formally. Which lower bound is more accurate? It does tum out that . random variable S. a level (1 any fixed B and all B' > B.6. .1) is nothing more than a comparison of (X)wer functions. A level (1.Q) LCB B if. (4. But we also want the bounds to be close to (J. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. . 4.Q)er/. .6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. and only if. for X E X C Rq. P. P. A level a test of H : /L = /La vs K ..6.6.2 and 4. in fact.Xn ) is a sample of a N (/L. and the threedecision problem of deciding whether a parameter 1 l e is eo. Note that 0* is a unifonnly most accurate level (1 .1..1 t! Example 4.3. v = Va when lIa E 8(:1') is a level 0: test...Q) UCB for B. or larger than eo.Q) con· fidence region for v. (72) model.n(X . . the following is true. twosided tests. Thus. and only if.1 (Examples 3.6. Ii n II r . < X(n) denotes the ordered Xl.6. .IB' (X) I • • < B'] < P.[B(X) < B']. we find that a competing lower confidence bound is 1'2(X) = X(k). (4. 0"2) random variables with (72 known.[B(X) < B'].
Theorem 4.6.1. Let $\underline\theta^*$ be a level $(1-\alpha)$ LCB for $\theta$ such that for each $\theta_0$ the associated test whose critical function $\delta^*(x, \theta_0)$ is given by
$$\delta^*(x, \theta_0) = \begin{cases} 1 & \text{if } \underline\theta^*(x) > \theta_0 \\ 0 & \text{otherwise} \end{cases}$$
is UMP level $\alpha$ for $H : \theta = \theta_0$ versus $K : \theta > \theta_0$. Then $\underline\theta^*$ is uniformly most accurate at level $(1-\alpha)$.

Proof. Let $\underline\theta$ be any other level $(1-\alpha)$ LCB. Define $\delta(x, \theta_0) = 1$ if $\underline\theta(x) > \theta_0$, and 0 otherwise. Then $\delta(X, \theta_0)$ is a level $\alpha$ test for $H : \theta = \theta_0$ versus $K : \theta > \theta_0$. Because $\delta^*(X, \theta_0)$ is UMP level $\alpha$ for $H : \theta = \theta_0$ versus $K : \theta > \theta_0$, for $\theta_1 > \theta_0$ we must have
$$E_{\theta_1}(\delta(X, \theta_0)) \le E_{\theta_1}(\delta^*(X, \theta_0)),$$
or
$$P_{\theta_1}[\underline\theta(X) > \theta_0] \le P_{\theta_1}[\underline\theta^*(X) > \theta_0].$$
Identify $\theta_0$ with $\theta'$ and $\theta_1$ with $\theta$ in the statement of Definition 4.6.1 and the result follows. □

If we apply the result to Example 4.6.1, we find that $\underline\mu_1(X) = \bar X - z(1-\alpha)\sigma/\sqrt n$ is uniformly most accurate. However, $X_{(k)}$ does have the advantage that we don't have to know $\sigma$ or even the shape of the density $f$ of $X_i$ to apply it, and the robustness considerations of Section 3.5 favor $X_{(k)}$.

Most accurate upper confidence bounds are defined similarly. We can extend the notion of accuracy to confidence bounds for real-valued functions of an arbitrary parameter. We define $\underline q^*$ to be a uniformly most accurate level $(1-\alpha)$ LCB for $q(\theta)$, a real parameter, if, and only if, for any other level $(1-\alpha)$ LCB $\underline q$,
$$P_\theta[\underline q^* \le q(\theta')] \le P_\theta[\underline q \le q(\theta')] \quad \text{whenever } q(\theta') < q(\theta).$$

Uniformly most accurate (UMA) bounds turn out to have related nice properties. For instance, they have the smallest expected "distance" to $\theta$:

Corollary 4.6.1. Suppose $\underline\theta^*(X)$ is a UMA level $(1-\alpha)$ lower confidence bound. Let $\underline\theta(X)$ be any other $(1-\alpha)$ lower confidence bound. Then for all $\theta$,
$$E_\theta(\theta - \underline\theta^*(X))^+ \le E_\theta(\theta - \underline\theta(X))^+$$
where $a^+ = a$, if $a \ge 0$, and 0 otherwise.

Example 4.6.2. Bounds for the Probability of Early Failure of Equipment. Let $X_1, \ldots, X_n$ be the times to failure of $n$ pieces of equipment where we assume that the $X_i$ are independent $\mathcal E(\lambda)$ variables. We want a uniformly most accurate level $(1-\alpha)$ upper confidence bound $\bar q^*$ for $q(\lambda) = 1 - e^{-\lambda t_0}$, the probability of early failure of a piece of equipment.
We begin by finding a uniformly most accurate level $(1-\alpha)$ UCB $\bar\lambda^*$ for $\lambda$; then, because $q$ is strictly increasing in $\lambda$, it follows that $q(\bar\lambda^*)$ is a uniformly most accurate level $(1-\alpha)$ UCB for the probability of early failure. To find $\bar\lambda^*$ we invert the family of UMP level $\alpha$ tests of $H : \lambda = \lambda_0$ versus $K : \lambda < \lambda_0$. By Problem 4.6.4, the UMP test accepts $H$ if
$$\sum_{i=1}^n X_i \le x_{2n}(1-\alpha)/2\lambda_0 \qquad (4.6.3)$$
or equivalently if
$$\lambda_0 \le \frac{x_{2n}(1-\alpha)}{2\sum_{i=1}^n X_i}$$
where $x_{2n}(1-\alpha)$ is the $(1-\alpha)$ quantile of the $\chi^2_{2n}$ distribution. Therefore, $\bar\lambda^* = x_{2n}(1-\alpha)/2\sum_{i=1}^n X_i$ is, by Theorem 4.6.1, a uniformly most accurate level $(1-\alpha)$ UCB for $\lambda$ and, hence, $q(\bar\lambda^*)$ is a uniformly most accurate level $(1-\alpha)$ UCB for the probability of early failure. □

Discussion

We have only considered confidence bounds. Considerations of accuracy lead us to ask that, subject to the requirement that the confidence level is $(1-\alpha)$, the confidence interval be as short as possible. The situation with confidence intervals is more complicated: the length $\bar t - \underline t$ is random, and it can be shown that in most situations there is no confidence interval of level $(1-\alpha)$ that has uniformly minimum length among all such intervals. There are, however, some large sample results in this direction (see Wilks, 1962, pp. 374-376). If we turn to the expected length $E_\theta(\bar t - \underline t)$ as a measure of precision, the situation is still unsatisfactory because, in general, there does not exist a member of the class of level $(1-\alpha)$ intervals that has minimum expected length for all $\theta$. However, as in the estimation problem, we can restrict attention to certain reasonable subclasses of level $(1-\alpha)$ intervals for which members with uniformly smallest expected length exist. Neyman defines unbiased confidence intervals of level $(1-\alpha)$ by the property that
$$P_\theta[\underline t \le q(\theta') \le \bar t] \le P_\theta[\underline t \le q(\theta) \le \bar t]$$
for every $\theta$, $\theta'$. That is, the interval must be at least as likely to cover the true value of $q(\theta)$ as any other value. Pratt (1961) showed that in many of the classical problems of estimation there exist level $(1-\alpha)$ confidence intervals that have uniformly minimum expected length among all level $(1-\alpha)$ unbiased confidence intervals. In particular, the intervals developed in Example 4.4.1 have this property. Confidence intervals obtained from two-sided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes. These topics are discussed in Lehmann (1997).

Summary. By using the duality between one-sided tests and confidence bounds we show that confidence bounds based on UMP level $\alpha$ tests are uniformly most accurate (UMA) level $(1-\alpha)$ in the sense that, in the case of lower bounds, they are less likely than other level $(1-\alpha)$ lower confidence bounds to fall below any value $\theta'$ below the true $\theta$. Most accurate upper confidence bounds are defined similarly.
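A sketch of the early-failure bound of Example 4.6.2, with scipy assumed and the function name ours:

```python
import numpy as np
from scipy.stats import chi2

def early_failure_ucb(x, t0, alpha=0.05):
    """UMA level (1 - alpha) UCB for q(lambda) = 1 - exp(-lambda * t0),
    using lambda* = chi2_{2n}(1 - alpha) / (2 * sum x_i) from Example 4.6.2."""
    n = len(x)
    lam_star = chi2.ppf(1 - alpha, df=2 * n) / (2 * np.sum(x))
    return 1 - np.exp(-lam_star * t0)
```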
Definition 4. N(fL. no probability statement can be attached to this interval. Thus.7. a Definition 4.. (j].1.a) credible bounds and intervals are subsets of the parameter space which are given probability at least (1 .3.a. Then. the posterior distribution of fL given Xl. :s: alx) ~ 1 . 75). a E e c R. the interpretation of a 100(1".d. with ~ a6) a6 fLB = ~ nx+l/1 ~t""o n ~ + I ~ ~2 .a)% confidence interval. A consequence of this approach is that once a numerical interval has been computed from experimental data.2 and 1.a.a)% of the intervals would contain the true unknown parameter value.. Instead.7.4.Section 4.a) by the posterior distribution of the parameter given the data. In the Bayesian formulation of Sections 1.7 FREQUENTIST AND BAYESIAN FORMULATIONS We have so far focused on the frequentist formulation of confidence bounds and intervals where the data X E X c Rq are random while the parameters are fixed but unknown. then Ck = {a: 7r(lx) .1.a lower and upper credible bounds for fL are  fL = fLB  ao Zla Vn (1 + . then 100(1 .a)% confidence interval is that if we repeated an experiment indefinitely each time computing a 100(1 . it is natural to consider the collec tion of that is "most likely" under the distribution II(alx).aB = n ~ + 1 I ~ It follows that the level 1 .A)2 nro ao 1 Ji = liB + Zla Vn (1 + ..a) lower and upper credible bounds for if they respectively satisfy II(fl.Xn is N(liB. Let 7r('lx) denote the density of agiven X = x. then Ck will be an interval of the form [fl. If 7r( alx) is unimodal. then fl. Let II( 'Ix) denote the posterior probability distribution of given X = x.6.7 Frequentist and Bayesian Formulations 251 4.::: 1 .2. from Example 1.a) credible region for e if II( C k Ix) ~ 1  a . . X has distribution P e. with known. and {j are level (1 . Suppose that given fL.1. and that a has the prior probability distribution II.)2 nro 1 . We next give such an example.i. Example 4.'" . with fLo and 75 known.12.7. Xl.Xn are i. given a. and that fL rv N(fLo. II(a:s: t9lx) . a a Turning to Bayesian credible intervals and regions. what are called level (1 '. Suppose that. .::: k} is called a level (1 . a~).
while the level $(1-\alpha)$ credible interval is $[\hat\mu_-, \hat\mu_+]$ with
$$\hat\mu_\pm = \hat\mu_B \pm z_{1-\frac12\alpha}\frac{\sigma_0}{\sqrt n}\left(1 + \frac{\sigma_0^2}{n\tau_0^2}\right)^{-1/2}.$$
Compared to the frequentist interval $\bar X \pm z_{1-\frac12\alpha}\sigma_0/\sqrt n$, the center $\hat\mu_B$ of the Bayesian interval is pulled in the direction of $\mu_0$, where $\mu_0$ is a prior guess of the value of $\mu$, and the interval is a little narrower. See Chapter 1 for sources of such prior guesses. Note that as $\tau_0 \to \infty$, the Bayesian interval tends to the frequentist interval. However, the interpretations of the intervals are different: in the frequentist confidence interval, the probability of coverage is computed with the data $X$ random and $\theta$ fixed, whereas in the Bayesian credible interval, the probability of coverage is computed with $X = x$ fixed and $\theta$ random with probability distribution $\Pi(\theta\,|\,X = x)$. □

Example 4.7.2. Suppose that given $\theta$, $X_1, \ldots, X_n$ are i.i.d. $N(\mu_0, \sigma^2)$, where $\mu_0$ is known. Let $\lambda = \sigma^{-2}$ and suppose $\lambda$ has the gamma $\Gamma(\frac12 a, \frac12 b)$ density, where $a > 0$, $b > 0$ are known parameters. Then, by Problem 1.2.12, $(t + b)\lambda$ has a $\chi^2_{a+n}$ posterior distribution, where $t = \sum(X_i - \mu_0)^2$. Let $x_{a+n}(\alpha)$ denote the $\alpha$th quantile of the $\chi^2_{a+n}$ distribution. Then $\underline\lambda = x_{a+n}(\alpha)/(t+b)$ is a level $(1-\alpha)$ lower credible bound for $\lambda$ and
$$\bar\sigma^2 = (t+b)/x_{a+n}(\alpha)$$
is a level $(1-\alpha)$ upper credible bound for $\sigma^2$. Compared to the frequentist bound $(n-1)s^2/x_{n-1}(\alpha)$ of Example 4.4.2, $\bar\sigma^2$ is shifted in the direction of the reciprocal $b/a$ of the mean of $\mathcal W \sim \Gamma(\frac12 a, \frac12 b)$. We shall analyze Bayesian credible regions further in Chapter 5. □

Summary. In the Bayesian framework we define bounds and intervals, called level $(1-\alpha)$ credible bounds and intervals, that determine subsets of the parameter space that are assigned probability at least $(1-\alpha)$ by the posterior distribution of the parameter $\theta$ given the data $x$. In the case of a normal prior $\pi(\theta)$ and normal model $p(x\,|\,\theta)$, the level $(1-\alpha)$ credible interval is similar to the frequentist interval except it is pulled in the direction $\mu_0$ of the prior mean and it is a little narrower. However, the interpretations are different: in the frequentist confidence interval, the probability of coverage is computed with the data $X$ random and $\theta$ fixed; in the Bayesian credible interval, the probability of coverage is computed with $X = x$ fixed and $\theta$ random with probability distribution $\Pi(\theta\,|\,X = x)$.
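A sketch of the credible interval of Example 4.7.1; numpy/scipy and the function name are our additions.

```python
import numpy as np
from scipy.stats import norm

def normal_credible_interval(x, sigma0, mu0, tau0, alpha=0.05):
    """Level (1 - alpha) credible interval for mu: N(mu0, tau0^2) prior,
    N(mu, sigma0^2) observations (Example 4.7.1)."""
    n, xbar = len(x), np.mean(x)
    r = sigma0**2 / tau0**2
    mu_B = (n * xbar + r * mu0) / (n + r)   # posterior mean
    sd_B = sigma0 / np.sqrt(n + r)          # posterior standard deviation
    z = norm.ppf(1 - alpha / 2)
    return mu_B - z * sd_B, mu_B + z * sd_B
```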
Moreover.1). which is assumed to be also N(p" cr 2 ) and independent of Xl. fnl (1  ~a) for Vn. X is the optimal estimator.1)1 L~(Xi .1.X n +1 '" N(O.d.3.Section 4. by the definition of the (Student) t distribution in Section B.8 Prediction Intervals 253 future GPA of a student or a future value of a portfolio.l + 1]cr2 ).Ct) prediction interval as an interval [Y. 8 2 = (n . Let Y = Y(X) denote a predictor based on X = (Xl. [n.4. ~ We next use the prediction error Y . Y ::. By solving tnl Y. let Xl.. which has a X.a) prediction interval Y = X± l (1 ~Ct) ::. . ::. Then Y and Y are independent and the mean squared prediction error (MSPE) of Y is Note that Y can be regarded as both a predictor of Y and as an estimate of p" and when we do so.+ 18tn l (1  (4. in this case.1) Note that Tp(Y) acts as a prediction interval pivot in the same way that T(p. We define a predictor Y* to be prediction unbiased for Y if E(Y* .3 and independent of X n +l by assumption.i... . it can be shown using the methods of . As in Example 4. Y] based on data X such that P(Y ::. We want a prediction interval for Y = X n +1. We define a level (1.1.l distribution. It follows that.3. Thus. In fact.X) is independent of X by Theorem B. Also note that the prediction interval is much wider than the confidence interval (4. The problem of finding prediction intervals is similar to finding confidence intervals using a pivot: Example 4. .8. The (Student) t Prediction Interval... It follows that Z (Y) p  Y vnY l+lcr has a N(O.8. "~). and can c~nclu~e that in the class of prediction unbiased predictors.4. the optimal estimator when it exists is also the optimal predictor. In Example 3. TnI.8.) acts as a confidence interval pivot in Example 4. Note that Y Y = X . . .4. the optimal MSPE predictor is Y = X. 1) distribution and is independent of V = (n .Xn .Y to construct a pivot that can be used to give a prediction interval. Tp(Y) !a) . we find the (1 .1)8 2 /cr 2. . AISP E(Y) = MSE(Y) + cr 2 .4. we found that in the class of unbiased estimators. where MSE denotes the estimation theory mean squared error. as X '" N(p" cr 2 ).Y) = 0.. Y) ?: 1 Ct. has the t distribution.Xn be i.1..
It can be shown using the methods of Chapter 5 that the width of the confidence interval (4.4.1) tends to zero in probability at the rate $n^{-1/2}$, whereas the width of the prediction interval (4.8.1) tends to $2\sigma z(1-\frac12\alpha)$. Moreover, the confidence level of (4.4.1) is approximately correct for large $n$ even if the sample comes from a nonnormal distribution, whereas the level of the prediction interval (4.8.1) is not $(1-\alpha)$ in the limit as $n \to \infty$ for samples from non-Gaussian distributions. □

We next give a prediction interval that is valid for samples from any population with a continuous distribution.

Example 4.8.2. Suppose $X_1, \ldots, X_n$ are i.i.d. as $X \sim F$, where $F$ is a continuous distribution function with positive density $f$ on $(a, b)$, $-\infty \le a < b \le \infty$. We want a prediction interval for $Y = X_{n+1} \sim F$, where $X_{n+1}$ is independent of the data $X_1, \ldots, X_n$. Let $X_{(1)} < \cdots < X_{(n)}$ denote the order statistics of $X_1, \ldots, X_n$. Set $U_i = F(X_i)$; then, by Problem B.2.12, $U_1, \ldots, U_{n+1}$ are i.i.d. uniform, $U(0,1)$. Let $U_{(1)} < \cdots < U_{(n)}$ denote $U_1, \ldots, U_n$ ordered; then $E(U_{(i)}) = i/(n+1)$, $i = 1, \ldots, n$. Now
$$P(U_{(j)} \le U_{n+1} \le U_{(k)}) = \int P(u \le U_{n+1} \le v \mid U_{(j)} = u, U_{(k)} = v)\,dH(u,v) = \int (v - u)\,dH(u,v) = E(U_{(k)}) - E(U_{(j)}) = \frac{k - j}{n+1},$$
where $H$ is the joint distribution of $U_{(j)}$ and $U_{(k)}$. It follows that
$$P(X_{(j)} \le X_{n+1} \le X_{(k)}) = \frac{k - j}{n+1}. \qquad (4.8.2)$$
Thus, $[X_{(j)}, X_{(k)}]$ with $k = n + 1 - j$ is a level $(n + 1 - 2j)/(n+1)$ prediction interval for $X_{n+1}$. This interval is a distribution-free prediction interval. See Problem 4.8.5 for a simpler proof of (4.8.2). □
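A sketch of the distribution-free interval (4.8.2); the function name is ours.

```python
import numpy as np

def distribution_free_pi(x, j):
    """[X_(j), X_(n+1-j)]: level (n + 1 - 2j)/(n + 1) prediction interval (4.8.2).
    Requires 1 <= j < n + 1 - j; order statistics are 1-indexed as in the text."""
    xs = np.sort(x)
    n = len(xs)
    k = n + 1 - j
    level = (k - j) / (n + 1)
    return xs[j - 1], xs[k - 1], level
```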
Bayesian Predictive Distributions

Suppose that $\theta$ is random with $\theta \sim \pi$ and that given $\theta = \theta$, $X_1, \ldots, X_{n+1}$ are i.i.d. $p(x\,|\,\theta)$. Here $X_1, \ldots, X_n$ are observable and $X_{n+1}$ is to be predicted. The posterior predictive distribution $Q(\cdot\,|\,x)$ of $X_{n+1}$ is defined as the conditional distribution of $X_{n+1}$ given $x = (x_1, \ldots, x_n)$; $Q(\cdot\,|\,x)$ has in the continuous case density
$$q(x_{n+1}\,|\,x) = \int p(x_{n+1}\,|\,\theta)\,\pi(\theta\,|\,x)\,d\theta,$$
with a sum replacing the integral in the discrete case. Now $[\underline Y_B, \bar Y_B]$ is said to be a level $(1-\alpha)$ Bayesian prediction interval for $Y = X_{n+1}$ if
$$Q(\underline Y_B \le Y \le \bar Y_B\,|\,x) \ge 1-\alpha.$$

Example 4.8.3. Consider Example 3.2.1 where $(X_i\,|\,\theta) \sim N(\theta, \sigma_0^2)$, $\sigma_0^2$ known, and $\pi(\theta)$ is $N(\eta_0, \tau^2)$, $\tau^2$ known. We want a prediction interval for $Y = X_{n+1}$, where $X_{n+1}$ is independent of $X = (X_1, \ldots, X_n)$ given $\theta$. To obtain the predictive distribution, write $X_{n+1} = (X_{n+1} - \theta) + \theta$ and note that, given $\bar X = t$,
$$E[(X_{n+1} - \theta)\theta] = E\{E[(X_{n+1} - \theta)\theta\,|\,\theta]\} = E\{\theta\,E(X_{n+1} - \theta\,|\,\theta)\} = 0.$$
Thus, given $\bar X = t$, $X_{n+1} - \theta$ and $\theta$ are uncorrelated and, being jointly normal (Theorem B.4.1), independent. It is enough, then, to derive the marginal distribution of $Y = X_{n+1}$ from the joint distribution of $X_{n+1} - \theta$ and $\theta$ given $\bar X = t$:
$$\mathcal L\{X_{n+1}\,|\,\bar X = t\} = \mathcal L\{(X_{n+1} - \theta) + \theta\,|\,\bar X = t\} = N(\hat\theta_B,\ \sigma_0^2 + \hat\sigma_B^2),$$
where, from Example 4.7.1,
$$\hat\theta_B = \frac{(\sigma_0^2/\tau^2)\eta_0 + n\bar X}{(\sigma_0^2/\tau^2) + n}, \qquad \hat\sigma_B^2 = \left(\frac{n}{\sigma_0^2} + \frac{1}{\tau^2}\right)^{-1}.$$
It follows that a level $(1-\alpha)$ Bayesian prediction interval for $Y$ is $[Y_B^-, Y_B^+]$ with
$$Y_B^\pm = \hat\theta_B \pm z\left(1 - \tfrac12\alpha\right)\sqrt{\sigma_0^2 + \hat\sigma_B^2}. \qquad (4.8.3)$$
To consider the frequentist properties of the Bayesian prediction interval (4.8.3) we compute its probability limit under the assumption that $X_1, \ldots, X_n$ are i.i.d. $N(\theta, \sigma_0^2)$. Because $\hat\sigma_B^2 \to 0$ and $\bar X \xrightarrow{P} \theta$ as $n \to \infty$, we find that the interval (4.8.3) converges in probability to $\theta \pm z(1-\frac12\alpha)\sigma_0$. This is the same as the probability limit of the frequentist interval (4.8.1). □

The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box, 1983).

Summary. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least $(1-\alpha)$. In the case of a normal sample of size $n+1$ with only $n$ variables observable, we construct the Student t prediction interval for the unobservable variable. For a sample of size $n+1$ from a continuous distribution we show how the order statistics can be used to give a distribution-free prediction interval. The Bayesian formulation is based on the posterior predictive distribution, which is the conditional distribution of the unobservable variable given the observable variables. The Bayesian prediction interval is derived for the normal model with a normal prior.
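A sketch of (4.8.3) from Example 4.8.3, with the usual caveat that numpy/scipy and the function name are our additions:

```python
import numpy as np
from scipy.stats import norm

def bayes_prediction_interval(x, sigma0, eta0, tau, alpha=0.05):
    """Level (1 - alpha) Bayesian prediction interval (4.8.3), normal-normal model."""
    n, xbar = len(x), np.mean(x)
    var_B = 1 / (n / sigma0**2 + 1 / tau**2)                  # posterior variance
    theta_B = var_B * (n * xbar / sigma0**2 + eta0 / tau**2)  # posterior mean
    half = norm.ppf(1 - alpha / 2) * np.sqrt(sigma0**2 + var_B)
    return theta_B - half, theta_B + half
```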
4.9 LIKELIHOOD RATIO PROCEDURES

4.9.1 Introduction

Up to this point, the results and examples in this chapter deal mostly with one-parameter problems in which it sometimes is possible to find optimal procedures. However, even in the case in which $\theta$ is one-dimensional, optimal procedures may not exist. For instance, if $X_1, \ldots, X_n$ is a sample from a $N(\mu, \sigma^2)$ population with $\sigma^2$ known, there is no UMP test for testing $H : \mu = \mu_0$ vs $K : \mu \ne \mu_0$. To see this, note that if $\mu_1 > \mu_0$, the MP level $\alpha$ test $\delta_\alpha(X)$ rejects $H$ for $T \ge z(1-\alpha)$, where $T = \sqrt n(\bar X - \mu_0)/\sigma$. On the other hand, if $\mu_1 < \mu_0$, the MP level $\alpha$ test $\varphi_\alpha(X)$ rejects $H$ if $T \le z(\alpha)$. Because $\delta_\alpha(x) \ne \varphi_\alpha(x)$, by the uniqueness of the NP test (Theorem 4.2.1(c)), there can be no UMP test of $H : \mu = \mu_0$ vs $K : \mu \ne \mu_0$.

In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6. We start with a generalization of the Neyman-Pearson statistic $p(x, \theta_1)/p(x, \theta_0)$.

Suppose that $X = (X_1, \ldots, X_n)$ has density or frequency function $p(x, \theta)$ and we wish to test $H : \theta \in \Theta_0$ vs $K : \theta \in \Theta_1$. The test statistic we want to consider is the likelihood ratio given by
$$L(x) = \frac{\sup\{p(x, \theta) : \theta \in \Theta_1\}}{\sup\{p(x, \theta) : \theta \in \Theta_0\}}.$$
To see that this is a plausible statistic, recall from Section 2.2 that we think of the likelihood function $L(\theta, x) = p(x, \theta)$ as a measure of how well $\theta$ "explains" the given sample $x = (x_1, \ldots, x_n)$. So, if $\sup\{p(x, \theta) : \theta \in \Theta_1\}$ is large compared to $\sup\{p(x, \theta) : \theta \in \Theta_0\}$, then the observed sample is best explained by some $\theta \in \Theta_1$, and conversely. Note that $L(x)$ coincides with the optimal test statistic $p(x, \theta_1)/p(x, \theta_0)$ when $\Theta_0 = \{\theta_0\}$, $\Theta_1 = \{\theta_1\}$.

Tests that reject $H$ for large values of $L(x)$ are called likelihood ratio tests. In particular cases, likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6. In the cases we shall consider, $p(x, \theta)$ is a continuous function of $\theta$ and $\Theta_0$ is of smaller dimension than $\Theta = \Theta_0 \cup \Theta_1$, so that the likelihood ratio equals the test statistic
$$\lambda(x) = \frac{\sup\{p(x, \theta) : \theta \in \Theta\}}{\sup\{p(x, \theta) : \theta \in \Theta_0\}} \qquad (4.9.1)$$
whose computation is often simple. We are going to derive likelihood ratio tests in several important testing problems. Although the calculations differ from case to case, the basic steps are always the same:

1. Calculate the MLE $\hat\theta$ of $\theta$.

2. Calculate the MLE $\hat\theta_0$ of $\theta$ where $\theta$ may vary only over $\Theta_0$.

3. Form $\lambda(x) = p(x, \hat\theta)/p(x, \hat\theta_0)$.

4. Find a function $h$ that is strictly increasing on the range of $\lambda$ such that $h(\lambda(X))$ has a simple form and a tabled distribution under $H$. Because $h(\lambda(X))$ is equivalent to $\lambda(X)$, we specify the size $\alpha$ likelihood ratio test through the test statistic $h(\lambda(X))$ and its $(1-\alpha)$th quantile obtained from the table.
Note that in general $\lambda(X) = \max(L(x), 1)$.

We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions, bounds, and so on. For instance, we can invert the family of size $\alpha$ likelihood ratio tests of the point hypothesis $H : \theta = \theta_0$ and obtain the level $(1-\alpha)$ confidence region
$$C(x) = \left\{\theta : p(x, \theta) \ge [c(\theta)]^{-1}\sup_{\theta'} p(x, \theta')\right\} \qquad (4.9.2)$$
where $\sup_{\theta'}$ denotes sup over $\theta' \in \Theta$ and the critical constant $c(\theta_0)$ satisfies
$$P_{\theta_0}\left[\frac{\sup_\theta p(X, \theta)}{p(X, \theta_0)} \ge c(\theta_0)\right] = \alpha.$$
It is often approximately true (see Chapter 6) that $c(\theta)$ is independent of $\theta$. In that case, $C(x)$ is just the set of all $\theta$ whose likelihood is on or above some fixed value dependent on the data.

This section includes situations in which $\theta = (\theta_1, \theta_2)$ where $\theta_1$ is the parameter of interest and $\theta_2$ is a nuisance parameter. We shall obtain likelihood ratio tests for hypotheses of the form $H : \theta_1 = \theta_{10}$, which are composite because $\theta_2$ can vary freely. The family of such level $\alpha$ likelihood ratio tests obtained by varying $\theta_{10}$ can also be inverted and yield confidence regions for $\theta_1$. An example is discussed later in this section. To see how the process works we refer to the specific examples in the subsections that follow.

4.9.2 Tests for the Mean of a Normal Distribution - Matched Pair Experiments

Suppose $X_1, \ldots, X_n$ form a sample from a $N(\mu, \sigma^2)$ population in which both $\mu$ and $\sigma^2$ are unknown. An important class of situations for which this model may be appropriate occurs in matched pair experiments. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age, diet, and other factors. In order to reduce differences due to the extraneous factors, we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. We can regard twins as being matched pairs. After the matching, the experiment proceeds as follows. In the $i$th pair one patient is picked at random (i.e., with probability $\frac12$) and given the treatment, while the second patient serves as control and receives a placebo. Response measurements are taken on the treated and control members of each pair. Studies in which subjects serve as their own control can also be thought of as matched pair experiments; that is, we measure the response of a subject when under treatment and when not under treatment. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo, sales performance before and after a course in salesmanship, mileage of cars with and without a certain ingredient or adjustment, and so on.

Let $X_i$ denote the difference between the treated and control responses for the $i$th pair. If the treatment and placebo have the same effect, the difference $X_i$ has a distribution that is symmetric about zero.
B) was solved in Example 3. However.0 is known and then evaluating p(x. B) : B E 8 0 } boils down to finding the maximum likelihood estimate a~ of a 2 when fJ.258 Testing and Confidence Regions Chapter 4 symmetric about zero. e). The problem of finding the supremum of p(x. To this we have added the nonnality assumption. TwoSided Tests We begin by considering K : fJ.0}. as discussed in Section 4.X) ." However. as representing the treatment effect. good or bad. B) at (fJ. where B=(x. =Ie fJ.0. we test H : fJ.3. We think of fJ.L ( X i .o. B) 1[1~ ="2 a 4 L(Xi . where we think of fJ. n i=l is the maximum likelihood estimate of B.0 as an established standard for an old treatment.B): B E 8} = p(x.=1 fJ.0. = E(X 1 ) denote the mean difference between the response of the treated and control subjects.L Xi 1 ~( n i=l . = fJ. Let fJ. a 2 ) : fJ. We found that sup{p(x.5. the test can be modified into a threedecision rule that decides whether there is a significant positive or negative effect.L X i ' . = fJ. which has the immediate solution ao ~2 = . Form of the TwoSided Tests Let B = (fJ" a 2 ).0) 2  a2 n] = 0.a)= ~ ~2 (1 ~ 1 ~ n i=l 2 ) .6.0 )2 . = fJ. The likelihood equation is oa 2 a logp(x. = O. This corresponds to the alternative "The treatment has some effect. Finding sup{p(x. 8 0 = {(fJ" Under our assumptions.fJ. a~). . Our null hypothesis of no treatment effect is then H : fJ. 'for the purpose of referring to the duality between testing and confidence procedures. The test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the nonnality assumption is not satisfied.
By Theorem 2.3.1,
$$\log p(x, \hat\theta) = -\frac n2\left[(\log 2\pi) + (\log\hat\sigma^2)\right] - \frac n2, \qquad \log p(x, \hat\theta_0) = -\frac n2\left[(\log 2\pi) + (\log\hat\sigma_0^2)\right] - \frac n2.$$
Therefore,
$$\log\lambda(x) = \log p(x, \hat\theta) - \log p(x, \hat\theta_0) = \frac n2\log(\hat\sigma_0^2/\hat\sigma^2).$$
Our test rule, therefore, rejects $H$ for large values of $\hat\sigma_0^2/\hat\sigma^2$. To simplify the rule further we use the following equation, which can be established by expanding both sides:
$$\frac{\hat\sigma_0^2}{\hat\sigma^2} = 1 + \frac{(\bar x - \mu_0)^2}{\hat\sigma^2}.$$
Because $\sum(x_i - \bar x)^2 = n\hat\sigma^2 = (n-1)s^2$, $\hat\sigma_0^2/\hat\sigma^2$ is a monotone increasing function of $|T_n|$ where
$$T_n = \frac{\sqrt n(\bar x - \mu_0)}{s}.$$
The statistic $T_n$ is equivalent to the likelihood ratio statistic $\lambda$ for this problem; thus, the likelihood ratio tests reject for large values of $|T_n|$. Because $T_n$ has a $\mathcal T_{n-1}$ distribution under $H$ (see Example 4.4.1), the size $\alpha$ critical value is $t_{n-1}(1-\frac12\alpha)$, and to find it we can use calculators or software that gives quantiles of the t distribution, or Table III. For instance, suppose $n = 25$ and we want $\alpha = 0.05$. Then we would reject $H$ if, and only if, $|T_n| \ge 2.064$.

One-Sided Tests

The two-sided formulation is natural if two treatments, A and B, are considered to be equal before the experiment is performed. However, if we are comparing a treatment and control, the relevant question is whether the treatment creates an improvement. Thus, the testing problem $H : \mu \le \mu_0$ versus $K : \mu > \mu_0$ (with $\mu_0 = 0$) is suggested. In the problems we argue that $P_\delta[T_n \ge t]$ is increasing in $\delta$, where $\delta = \sqrt n(\mu - \mu_0)/\sigma$. Therefore, the test that rejects $H$ for $T_n \ge t_{n-1}(1-\alpha)$ is of size $\alpha$ for $H : \mu \le \mu_0$. Similarly, the size $\alpha$ likelihood ratio test for $H : \mu \ge \mu_0$ versus $K : \mu < \mu_0$ rejects $H$ if, and only if, $T_n \le -t_{n-1}(1-\alpha)$. A proof is sketched in the problems.
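A sketch of the two-sided rule; scipy and the function name are ours, and for data we use the Cushny-Peebles differences tabulated in the data example below.

```python
import numpy as np
from scipy.stats import t

def matched_pair_t(x, mu0=0.0, alpha=0.05):
    """Two-sided size-alpha likelihood ratio (t) test of H: mu = mu0."""
    n = len(x)
    Tn = np.sqrt(n) * (np.mean(x) - mu0) / np.std(x, ddof=1)
    crit = t.ppf(1 - alpha / 2, df=n - 1)   # equals 2.064 for n = 25, alpha = 0.05
    pval = 2 * t.sf(abs(Tn), df=n - 1)
    return Tn, crit, pval, abs(Tn) >= crit

x = np.array([1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4])
print(matched_pair_t(x, mu0=0.0, alpha=0.01))   # |T_10| = 4.06
```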
If we consider alternatives of the form (/L /Lo) ~ ~. The reason is that. With this solution. we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter 8.b.1 are monotone in fo/L/ a. 1997. by making a sufficiently large we can force the noncentrality parameter 8 = fo(/L . Because E[fo(X /Lo)/a] = fo(/L ./Lo)/a) = 1. Computer software will compute n./Lo)/a and Yare ./Lo)/a .4. A Stein solution. we may be required to take more observations than we can afford on the second stage. denoted by Tk. however. 1) and X~ distributions.jV/k is given in Problem 4.260 Power Functions and Confidence 4 To discuss the power of these tests.2. ( 2) only through 8. whatever be n. We have met similar difficulties (Problem 4. and the power can be obtained from computer software or tables of the noncentral t distribution./Lo) / a has N (8. bring the power arbitrarily close to 0:.11) just as the power functions of the corresponding tests of Example 4. To derive the distribution of Tn. p. we obtain the confidence region We recognize C(X) as the confidence interval of Example 4.12. is by definition the distribution of Z / .9. say. in which we estimate a for a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with I/L.1. This distribution. Problem 17). the ratio fo(X ./Lo) / a as close to 0 as we please and.n(X ./Lol ~ ~. respectively. Note that the distribution of Tn depends on () (/L. is possible (Lehmann.. ./Lo) / a.3.b distribution. / (n1)s2 /u 2 n1 V has a 'Tn1. 1) distribution.4.jV/ k where Z and V are independent and have N(8.1.9. Similarly the onesided tests lead to the lower and upper confidence bounds of Example 4. we can no longer control both probabilities of error by choosing the sample size. thus. Thus.1)s2/a 2 are independent and that (n 1)s2/a 2 has a X. with 8 = fo(/L .4. we know that fo(X /L)/a and (n . 260. The density of Z/ . We can control both probabilities of error by selecting the sample size n large provided we consider alternatives of the form 181 ~ 81 > 0 in the twosided case and 8 ~ 81 or 8 :S 81 in the onesided cases. Likelihood Confidence Regions If we invert the twosided tests. note that from Section B.l distribution. fo( X .7) when discussing confidence intervals. The power functions of the onesided tests are monotone in 8 (Problem 4.
3 5 0.0 3. Let e = (Jlt.9. while YI . ( 2).A in sleep gained using drugs A and B on 10 patients. For quantitative measurements such as blood pressure.8 8 0. Jl2.8 2. we conclude at the 1% level of significance that the two drugs are significantly different..4 If we denote the difference as x's.6 0.8 1..513. . 8 2 = 1. 'Yn2 are the measurements on a sample given the drug. It suggests that not only are the drugs different but in fact B is better than A because no hypothesis Jl = Jl' < 0 is accepted at this level.1 0. Yn2 . height. p.1 0. .1 1.2 2 1.0 4. .9 likelihood Ratio Procedures 261 Data Example As an illustration of these procedures.. Y) = (Xl.3 4 1. A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.2.3. it is usually assumed that X I.Xnl and YI . ( 2) populations. one from each popUlation.99 confidence interval for the mean difference Jl between treatments is [0.8 9 0. Then 8 0 = {e : Jli = Jl2} and 9 1 = {e : Jli =F Jl2}' The log of the likelihood of (X..6 10 2. YI . Patient i A B BA 1 0.4 1. .58.4 4. volume. Tests We first consider the problem of testing H : Jl1 = J:L2 versus H : Jli =F Jl2. .2 1.9 1. suppose we wanted to test the effect of a certain drug on some biological variable (e. Then Xl. then x = 1.. .2 the likelihood function and its log are maximized In Problem 4.06... .Xn1 . As in Section 4. (See also (4.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations We often want to compare two populations with distribution F and G on the basis of two independent samples X I.Xn1 could be blood pressure measurements on a sample of patients given a placebo. and so forth. For instance.Y ) is n2 where n = nl + n2.4 1. length.7 1.6 0.9. Because tg(0.0 6 3. temperature. respectively. e .'" . and ITnl = 4.g. consider the following data due to Cushny and Peebles (see Fisher.32.0 7 3. . In the control versus treatment example. ( 2) and N (Jl2. this is the problem of determining whether the treatment has any effect.9. 121) giving the difference B ..25.84].5.Section 4. it is shown that = over 8 by the maximum likelihood estimate e.. blood pressure).. weight. 1958... This is a matched pair experiment with each subject serving as its own control.Xn1 and YI .6 4.6. ...2 0.. The 0.7 5. The preceding assumptions were discussed in Example 1.3).4 3 0. 'Yn2 are independent samples from N (JlI. .1.5 1.995) = 3.1 1.) 4. . .
2 X.3 Tn2 1 . where and If we use the identities ~ fCYi ?l i=l j1)2 1 n fp'i i=l y)2 + n2 (Y _ j1)2 n obtained by writing [Xi . . Y. where 2 Testing and Confidence Regions Chapter 4 When ttl tt2 = tt. . we find that the is equivalent to the test statistic ITI where and To complete specification of the size a likelihood ratio test we show that T has distribution when ttl = tt2.X) and .1 .1 I:rtl .Y) '1':" a a a i=l a j=l J I .9. Y.2 (y. our model reduces to the onesample model of Section 4. (75). By Theorem B. the maximum of p over 8 0 is obtained for f) (j1.262 (X.2 1 I:rt2 . (7 ).X) + (X  j1)]2 and expanding.2.2 (Xi .j112 log likelihood ratio statistic [(Xi . Thus. j1.3.
and only if. 1L2. this follows from the fact that.3) for 1L2 . We can show that these tests are likelihood ratio tests for these hypotheses. T( Ll) has a Tn2 distribution and inversion ofthe tests leads to the interval (4.~ Q likelihood confidence bounds. by definition. 1/nI). It is also true that these procedures are of size Q for their respective hypotheses. twosample t test rejects if.2)8 2/0.2 under H bution and is independent of (n . and that j nl n2/n(Y . we find a simple equivalent statistic IT(Ll)1 where If 1L2 ILl = Ll.Ll.2 has a X~2 distribution.2 • Therefore.X) /0. Confidence Intervals To obtain confidence intervals for 1L2 . As in the onesample case.1L2. corresponding to the twosided test.9 likelihood Ratio Procedures 263 are independent and distributed asN(ILI/o. onesided tests lead to the upper and lower endpoints of the interval as 1 .9. 1/n2). Similarly. As for the special case Ll = 0.2)8 2 /0.ILl we naturally look at likelihood ratio tests for the family of testing problems H : 1L2 .ILl #. IL 1 and for H : ILl ::. We conclude from this remark and the additive property of the X2 distribution that (n . T has a noncentral t distribution with noncentrality parameter. if ILl #. for H : 1L2 ::.N(1L2/0.Section 4.ILl. T and the resulting twosided. .has a N(O.ILl = Ll versus K : 1L2 . there are two onesided tests with critical regions. X~ll' X~2l' respectively. f"V As usual. 1) distriTn .
2 (1 . . Suppose first that aI and a~ are known. H is rejected if ITI 2 t n ..~o:).042 1. On the basis of the results of this experiment.977.. /11 = x and /12 = y for (Ji1.0:) confidence interval has probability at most ~ 0: of making the wrong selection. For instance.x = 0. the more waterproof material. Y1. Ji2) E R x R. 1952. is The MLEs of Ji1 and Ji2 are. The level 0..9. 472) x (machine 1) Y (machine 2) 1.845 1. except for an additive constant.395. If normality holds. thus. a~) samples.975) = 2. setting the derivative of the log likelihood equal to zero yields the MLE .4 The TwoSample Problem with Unequal Variances In twosample problems of the kind mentioned in the introduction to Section 4. Here y . From past experience.95 confidence interval for the difference in mean log permeability is 0.395 ± 0.8 2 = 0. The log likelihood. . consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. ai) and N(Ji2.. Again we can show that the selection procedure based on the level (1 . p.282 We test the hypothesis H of no difference in expected log permeability.264 Testing and Confidence Regions Chapter 4 Data Example As an illustration.627 2. 4. and T = 2. a treatment that increases mean response may increase the variance of the responses. it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. respectively.9. As a first step we may still want to compare mean responses for the X and Y populations. it may happen that the X's and Y's have different variances.0264. When Ji1 = Ji2 = Ji.3. Because t4(0. The results in terms oflogarithms were (from Hald.790 1. . we are lead to a model where Xl.583 1. 'Yn2 are two independent N (Ji1. we would select machine 2 as producing the smaller permeability and. . . :. thus. we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines.776.Xn1 .368. This is the BehrensFisher problem.
Section 4.x)/(nl + .y)/(nl + .. j1x = = nl . IDI/8D as a test statistic for H : J.fi? j=1 + n2(j1.y) By writing nl L(Xi .n2)' It follows that the likelihood ratio test is equivalent to the statistic IDl/aD..t1.t1 = J..n2(fi . For large nl. 2 8y 8~ . 8D=+' nl n2 It is natural to try and use D I 8 D as a test statistic for the onesided hypothesis H : J.t1.) 18 D due to Welch (1949) works well. n2. For small and moderate nI. and more generally (D ..t2.28).)18D has approximately a standard normal distribution (Problem 5.3. J.) I 8 D depends on at I a~ for fixed nl. n2 an approximation to the distribution of (D .x)2 i=1 n2 i=1 n2 L(Yi .1:::.j1)2 j=1 = L(Yi . where D = Y . y) Next we compute = exp {2~~ eM  x)2 + 2:'~ (Ii  y?}. = J.. It follows that "\(x.. BecauseaJy is unknown.fi = n2(x .t2 must be estimated. = at I a~. 1) distribution. j1. that is Yare X) + Var(Y) = nl + n2 . it Thus.1:::.X)2 + nl(j1.)18D to generate confidence procedures.9 Likelihood Ratio Procedures J 265 where. where I:::. (D 1:::. n2.X and aJy is the variance of D.n2)' Similarly. An unbiased estimate is J. by Slutsky's theorem and the central limit theorem.t2 :::. DiaD has aN(I:::.fi)2 we obtain \(x.. Unfortunately the distribution of (D 1:::.
Wang (1971) has shown the approximation to be very good for Q: = 0. TwoSided Tests . The LR procedure derived in Section 4.003. Yd. which works well if the variances are equal or nl = n2.9.9. then we end up with a bivariate random sample (XI. N(j1. fields. .3 and Problem 5. Y cholesterol level in blood.. .3. Some familiar examples are: X test score on English exam. uI 4. (Xn' Yn).05 and Q: 0. machines.3.) are sampled from a population and two numerical characteristics are measured on each case.1 When k is not an integer. our problem becomes that of testing H : p O. Y = age at death. ur Testing Independence.1. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the BehrensFisher problem.u~. can unfortunately be very misleading if =1= u~ and nl =1= n2. p). Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X.5 likelihood Ratio Procedures for Bivariate Normal Distributions If n subjects (persons. Y). Y diet.266 Testing and Confidence Regions Chapter 4 Let c = Sf/nlsb' Then Welch's approximation is Tk where k c2 [ nl 1 (1 C)2]1 + n2 . If we have a sample as before and assume the bivariate normal model for (X. Note that Welch's solution works whether the variances are equal or not.28. u~ > O. Y = blood pressure. .. See Figure 5.01. X = average cigarette consumption per day in grams. mice.J1. the critical value is obtained by linear interpolation in the t tables or using computer software.2 . X percentage of fat in score on mathematics exam. the maximum error in size being bounded by 0. Y) have a joint bivariate normal distribution.3. Confidence Intervals for p The question "Are two random variables X and Y independent?" arises in many staweight. with > 0. X test tistical studies. etc.
See Example 5. the distribution of pis available on computer packages. .8).3.o. and the likelihood ratio tests reject H for large values of [P1.5) has a Tn.. y.5. To obtain critical values we need the distribution of p or an equivalent statistic under H. . Pis called the sample correlation coefficient and satisfies 1 ::. Therefore. . We have eo = (x. 0) and the log of the likelihood ratio statistic becomes log A(X) (4.X)(Yi ~=1 fj)] /n(jl(72.9. ~ = (Yi . When p = 0. where Ui = (Xi . we can control probabilities of type II error by increasing the sample size.X . the distribution of p depends on p only. 0.Y . .. VI)"" . . p::. the power function of the LR test is symmetric about p = 0 and increases continuously from a to 1 as p goes from 0 to 1. the twosided likelihood ratio tests can be based on [Tn I and the critical values obtained from Table II.4. log A(X) is an increasing function of p2.~( Xi . Qualitatively. if we specify indifference regions. 1 (Problem 2. (Un. There is no simple form for the distribution of p (or Tn) when p #. Because (U1.3.1.2 distribution.1. and eo can be obtained by separately maximizing the likelihood of XI. Because ITn [ is an increasing function of [P1. where (jr e was given in Problem 2.(j~.Now. When p = 0 we have two independent samples. V n ) is a sample from the N(O. .Jll)/O"I. (4..4) = Thus. p) distribution. for any a. p).1. Yn .~( Yi .Section 4.9 Likelihood Ratio Procedures 267 The unrestricted maximum likelihood estimate (x. ii. (j~. then by Problem B. A normal approximation is available.13 as 2 = . 1~ )2 2 1~ )2 0"1 n i=1 n i=1 p= [t(Xi .Xn and that ofY1 .9.0"2 = . If p = 0..9.Jl2)/0"2. (jr .
Thus. by using the twosided test and referring to the tables. c(po) where P po [I)' d(po)] P po [I)'::. c(Po)] 1 ~Q. For instance. However. consider the following bivariate sample of weights Xi of young rats at a certain age and the weight increase Yi during the following week.18 and Tn 0. Therefore.1 Here p 0. 'I7 Summary.Q. we can start by constructing size Q likelihood ratio tests of H : P Po versus K : p > po. and only if p .: : d(po) or p::. Confidence Bounds and Intervals Usually testing independence is not enough and we want bounds and intervals for p giving us an indication of what departure from independence is present. p::. inversion of this family of tests leads to levell Q lower confidence bounds. These intervals do not correspond to the inversion of the size Q LR tests of H : p = Po versus K : p :::f. if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood. we obtain a commonly used confidence interval for p.9. and only if. We want to know whether there is a correlation between the initial weights and the weight increase and formulate the hypothesis H : p O.268 Testing and Confidence Regions Chapter 4 OneSided Tests In many cases. 0 versus K : P > 0 and similarly that corresponds to the likelihood ratio statistic for H : P 0 versus K : P < O. The power functions of these tests are monotone.:::: c] is an increasing function of p for fixed c (Problem 4. To obtain lower confidence bounds. by putting two level 1 .75. 370. We obtain c(p) either from computer software or by the approximation of Chapter 5.~Q bounds together. for large n the equal tails and LR confidence intervals approximately coincide with each other. These tests can be shown to be of the form "Accept if. 0 versus K : P > O. It can be shown that pis equivalent to the likelihood ratio statistic for testing H : P ::. The likelihood ratio test statistic >. we would test H : P = 0 versus K : P > 0 or H : P ::.15). we find that there is no evidence of correlation: the pvalue is bigger than 0. we obtain size Q tests for each of these hypotheses by setting the critical value so that the probability of type I error is Q when p = O. c(po)" where Ppo [p::. only onesided alternatives are of interest. We find the likelihood ratio tests and associated confidence procedures for four classical normal models: . We can similarly obtain 1 Q upper confidence bounds and. Because c can be shown to be monotone increasing in p. is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis.7 18.48. c(po)] 1 . ! Data Example A~ an illustration. We can show that Po [p. Po but rather of the "equaltailed" test that rejects if.
. respectively.Xn denote the times in days to failure of n similar pieces of equipment.L is zero.48. .XI.L2' a~) populations. the likelihood ratio test is equivalent to the test based on IY . 4. We also find that the likelihood ratio statistic is equivalent to a t statistic with n . . .Lo. what is the pvalue? 2... When a? and a~ are known. Let Mn = max(X 1 . J. .Xn ) is an £(A) sample. (a) Compute the power function of 6c and show that it is a monotone increasing function (b) In testing H : exactly 0. When a? and a~ are unknown.98 for (e) If in a sample of size n = 20.Ll' ( 2) and N(J. We test the hypothesis that X and Yare indepen dent and find that the likelihood ratio test is equivalent to the test based on IPl.Section 4.X)I SD.. Assume the model where X = (Xl.05? e :::. Mn e = ~? = 0. (d) How large should n be so that the 6c specified in (b) has power 0.X.L :::. 0 otherwise. we use (Y . (2) Twosample experiments in which two independent samples are modeled as coming from N(J. ( 2) and we test the hypothesis that the mean difference J. ~ versus K : e> ~. (3) Twosample experiments in which two independent samples are modeled as coming from N(J. Let Xl. . Consider the hypothesis H that the mean life II A = J. . (4) Bivariate sampling experiments in which we have two measurements X and Y on each case in a sample of n cases..1 1. Approximate critical values are obtained using Welch's t distribution approximation. The likelihood ratio test is equivalent to the onesample (Student) t test.. . where SD is an estimate of the standard deviation of D = Y . . . where p is the sample correlation coefficient.10 PROBLEMS AND COMPLEMENTS Problems for Section 4..2 degrees of freedom. Suppose that Xl. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the twosample (Student) t test.10 Problems and Complements 269 (1) Matched pair experiments in which differences are modeled as N(J. . respectively.L. e). what choice of c would make 6c have size (c) Draw a rough graph of the power function of 6c specified in (b) when n = 20. X n are independently and identically distributed according to the uniform distribution U(O.Ll' an and N(J. X n ) and let 1 if Mn 2:: c = of e.L2' ( 2) populations.
(b) Show that the power function of your test is increasing in 8.~llog <>(Tj ) has a X~r distribution.1. Days until failure: 315040343237342316514150274627103037 Is H rejected at level <> ~ 0. Tr are independent test statistics for the same simple H and that each Tj has a continuous distribution.12. . 7. under H. Show that if H is simple and the test statistic T has a c. j = 1. is a size Q test.. . 6. Establish (4. .3). Assume that F o and F are continuous. . j .. . i I (b) Check that your test statistic has greater expected value under K than under H. Draw a graph of the approximate power function.. . I . Hint: Use the central limit theorem for the critical value. (c) Give an approximate expression for the critical value if n is large and B not too close to 0 or 00.2. 1 using a I . X> 0. Let Xl. f(x.a) is the (1 . U(O. . . . (d) The following are days until failure of air monitors at a nuclear plant. X n be a 1'(0) sample..ontinuous distribution. T ~ Hint: See Problem B... Let Xl..r. Suppose that T 1 .. 0: (a) Construct a test of H : B = 1 versus K : B > 1 with approximate size complete sufficient statistic for this model. 1). r. where x(I . .2 L:.270 Testing and Confidence Regions Chapter 4 (a) Use the result of Problem 8.O) = (xIO')exp{x'/20'}. (Use the central limit theorem. Hint: See Problem B.<»1 VTi)J . j=l . give a normal approximation to the significance probability. (b) Give an expression of the power in tenns of the X~n distribution.4. Hint: Approximate the critical region by IX > 1"0(1 + z(1 .05? 3.Xn be a sample from a population with the Rayleigh density . (cj Use the central limit theorem to show that <l?[ (I"oz( <» II") + VTi(1"  1"0) I 1"] is an approximation to the power of the test in part (a).. . If 110 = 25. (a) Use the MLE X of 0 to construct a level <> test for H : 0 < 00 versus K : 0 > 00 .a)th quantile of the X~n distribution. 1nz t Z i . Let o:(Tj ) denote the pvalue for 1j.0 > 0.4 to show that the test with critical region IX > I"ox( 1  <» 12n]... (0) Show that.) 4. .3. distribution. " i I 5. then the pvalue <>(T) has a unifonn.3. . ...
Use the fact that T(X) is equally likely to be any particular order statistic.T..o sup.) Evaluate the bound Pp(lF(x) . 00.5 using the nonnal approximation to the binomial distribution of nF(x) and the approximate critical value in Example 4.Fo(x) 10 U¢. 1) to (0. .)(Tn ) = LN(O. Hint: If H is true T(X). 9. 80 and x = 0. Let X I.. T(X(B)) is a sample of size B + 1 from La.5. with distribution function F and consider H : F = Fo.5.Fo(x)IOdFo(x) . to get X with distribution . = (Xi .d.. .10 Problems and Complements 271 8.12(b)..2.a)/b. ... then the power of the Kolmogorov test tends to 1 as n .X(B) from F o on the computer and computing TU) = T(XU)).1. . 10. n (c) Show that if F and Fo are continuous and FiFo. That is.00)..T(X(l)). if X. let . . (a) For each of these statistics show that the distribution under H does not depend on F o. . Fo(x)1 > kol x ~ Hint: D n > !F(x) .. .p(F(x»!F(x) . Here.. T(B) ordered.Fo(x)1 > k o ) for a ~ 0.p(Fo(x))IF(x) .. (a) Show that the power PF[D n > kal of the Kolmogorov test is bounded below by ~ sup Fpl!F(x).(X') = Tn(X).1.o J J x . = ~ I. Express the Cramervon Mises statistic as a sum..p(F(x)IF(x) . (b) Suppose Fo isN(O. ~ ll.. .l) (Tn). and let a > O.o V¢.p(u) be a function from (0. 1) variable on the computer and set X ~ F O 1 (U) as in Problem B.Fo(x)IOdF(x).T(B+l) denote T. then T. and 1.10.o T¢. . B. . (In practice these can be obtained by drawing B independent samples X(I).o: is called the Cramervon Mises statistic..p(Fo(x))lF(x) .) Next let T(I). Let T(l). 1) and F(x) ~ (1 +exp( _"/7))1 where 7 = "13/.u. (b) Use part (a) to conclude that LN(". ..Fo(x)1 foteach.Section 4. is chosen 00 so that 00 x 2 dF(x) = 1. 1.1. b > 0. . T(B) be B independent Monte Carlo simulated values of T. (This is the logistic distribution with mean zero and variance 1.6 is invariant under location and scale. . (b) When 'Ij. Show that the test rejects H iff T > T(B+lm) has level a = m/(B + 1). T(l). Suppose that the distribution £0 of the statistic T = T(X) is continuous under H and that H is rejected for large values of T.. Vw.J(u) = 1 and 0: = 2.Fo(x)IO x ~ ~ sup.5. (a) Show that the statistic Tn of Example 4. . X n be d. .Fo• generate a U(O. In Example 4. . Define the statistics S¢. j = 1.
2.. Consider a population with three kinds of individuals labeled 1.272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale.0) = 20(1.) I 1 . let N I • N 2 • and N 3 denote the number of Xj equal to I. and only if. whereas the other a I ~ a . and 3 occuring in the HardyWeinberg proportions f(I. take 80 = O. (See Problem 4. ! . If each time you test. i i (b) Define the expected pvalue as EPV(O) = EeU. 2..1) satisfy Pe. Let 0 < 00 < 0 1 < 1. you intend to test the satellite 100 times. the other has v/aQ = 1.a)) = inf{t: Fo(t) > u}. i • (a) Show that if the test has level a. X n . which system is cheaper on the basis of a year's operation? 1 I Ii I il 2.1. [2N1 + N. f(3. Let T = X/"o and 0 = /" . ! .. I X n from this population.. Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1.Fe(F"l(l. = /10 versus K : J.). f(2. > cJ = a. the other $105 • One second of transmission on either system COsts $103 each. j 1 i :J . Without loss of generality. then the test that rejects H if. Show that EPV(O) = P(To > T).10. One has signalMtonoise ratio v/ao = 2. where" is known. 5 about 14% of the time. (For a recent review of expected p values see Sackrowitz and SamuelCahn. Expected pvalues. the UMP test is of the form I {T > c).. then the pvalue is U =I . 12.0)'.1. A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time. 01 ) is an increasing function of 2N1 + N. which is independent ofT. 2N1 + N. and 3. the power is (3(0) = P(U where FO 1 (u)  < a) = 1. 1999. .2. For a sample Xl. Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test. 00 .2 I i i 1. . I 1 (b) Show that if c > 0 and a E (0. you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0. Let To denote a random variable with distribution F o. ! J (c) Suppose that for each a E (0.') sample Xl. Problems for Section 4.2 and 4..0). You want to buy one of two systems..O) = 0'.3. Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t). 2. Whichever system you buy during the year.0) = (1. respectively.) . where if! denotes the standard normal distribution.. (a) Show that L(x. Consider Examples 3. . Suppose that T has a continuous distribution Fe.. > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • .05. The first system costs $106 .L > J.to on the basis of the N(p" . Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 ./"0' Show that EPV(O) = if! ( .nO /. I). 3.Fo(T)...
1. .0. A fonnulation of goodness of tests specifies that a test is best if the max.2. if . B" .17). where 11 = I:7 1 ajf}j and a 2 = I:~ 1 f}i(ai 11)2.. +akNk has approx.2. 9. 0. derive the UMP test defined by (4. two 5's are obtained. Y) belongs to population 1 if T > c... Show that if randomization is pennitted.2.imately a N(np" na 2 ) distribution. H : (J = (Jo versus K : (J = (J.~:":2:~~=C:="::===='. For 0 < a < I.. I..N. and to population 0 ifT < c.1. Bk). Upon being asked to play.imum probability of error (of either type) is as small as possible. I. MPsized a likelihood ratio tests with 0 < a < 1 have power nondecreasing in the sample size. (b) Find the test that is best in this sense for Example 4. with probability .2.<l.. 7. Y) known to be distributed either (as in population 0) according to N(O.o = (1. 6. .e.1. Hint: The MP test has power at least that of the test with test function J(x) 8.6) where all parameters are known. Hint: Use Problem 4.1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4. Y) and a critical value c such that if we use the classification rule. Nk) ~ 4. [L(X. Bd > cJ = I .2. prove Theorem 4.0196 test rejects if. .2...2. 1.6) or (as in population 1) according to N(I. Prove Corollary 4. Find a statistic T( X. . (c) Using the fact that if(N" . [L(X. then a. (X.=  Section 4. M(n. Bo.4.4 and recall (Proposition B..10 Problems and Complements 273 four numbers are equally likely to occur (i. Bo. .. and only if.7).2) that linear combinations of bivariate nonnal random variables are nonnaUy distributed.Pe. L .. find the MP test for testing 10. then the maximum of the two probabilities ofmisclassification porT > cJ.2. PdT < cJ is as small as possible...0.2. A newly discovered skull has cranial measurements (X. of and ~o '" I.0. In Example 4.. find an approximation to the critical value of the MP level a test for this problem. the gambler asks that he first be allowed to test his hypothesis by tossing the die n times.. = a. .1. (a) Show that if in testing H : f} such that = f}o versus K : f) = f)l there exists a critical value c Po.Bd > cJ then the likelihood ratio test with critical value c is best in this sense. 5.. +.. (a) What test statistic should he use if the only alternative he considers is that the die is fair? (b) Show that if n = 2 the most powerfullevel. In Examle 4.2..
(c) Suppose 1/>'0 ~ 12..1. Hint: Show that Xf . hence. the probability of your deciding to close is also < 0.01.i .Xn be the times in months until failure of n similar pieces of equipment.) 3. Show that if X I. .01. .: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 .. .<» is the (1 . a model often used (see Barlow and Proschan. .3. (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2. You want to ensure that if the arrival rate is < 10.1 achieves this? (Use the normal approximation..<»/2>'0 where X2n(1.0)"'/[1. that the Xi are a sample from a Poisson distribution with parameter 8.• n. Consider the foregoing situation of Problem 4. 0) = ( : ) 0'(1..95 at the alternative value 1/>'1 = 15. ]f the equipment is subject to wear. Here c is a known positive constant and A > 0 is the parameter of interest. 274 Problems for Section 4.. Use the normal approximation to the critical value and the probability of rejection. I i . How many days must you observe to ensure that the UMP test of Problem 4.. A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and. • I i • 4. • • " 5. (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00. Let Xl.Xn is a sample from a truncated binomial distribution with • I. r p(x.3 Testing and Confidence Regions Chapter 4 1. . . i' i . the expected number of arrivals per day. i 1 . .4.. . but if the arrival rate is > 15.. show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function. . 1965) is the one where Xl. the probability of your deciding to stay open is < 0. In Example 4. Suppose that if 8 < 8 0 it is not worth keeping the counter open. A) = c AcxC1e>'x . Find the sample size needed for a level 0. > 1/>'0.01 lest to have power at least 0. x > O..(IOn x = 1•. (a) Show that L~ K: 1/>.3. Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days. 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! .<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function.&(>'). I i .Xn is a sample from a Weibull distribution with density f(x.3.
00 < a < b < 00.:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ ."" X n denote the incomes of n persons chosen at random from a certain population.Q)th quantile of the X~n distribution. Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a.. . P(). suppose that Fo(x) has a nonzero density on some interval (a. y > 0.  .6.1.. . Let N be a zerotruncated Poisson.. (b) NabeyaMiura Alternative. . survival times with distribution Fa.. A > O. and consider the alternative with distribution function F(x.) It follows that Fisher's method for cQmbining pvalues (see 4. then P(Y < y) = 1 e  .1. then e.Fo(y)  }). where Xl_ a isthe (1 . Y n be the Ll. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo. b). (a) Lehmann Alte~p. In Problem 1.6.tive. X N }).. random variable.2 to find the mean and variance of the optimal test statistic.d. 7.10 Problems and Complements 275 then 2.8 > 80 .Section 4. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b). Let Xl. .O<B<1. .d. A > O.1.6) is UMP for testing that the pvalues are uniformly distributed. . against F(u)=u B.. which is independent of XI.5.). imagine a sequence X I. . X 2. X N e>. To test whether the new treatment is beneficial we test H : ~ < 1 versus K : . 1 ' Y > 0.XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0. In the goodnessoffit Example 4. Assume that Fo has ~ density fo.[1. (i) Show that if we model the distribution of Y as C(max{X I . .6. > po.~) = 1. 6. (a) Express mean income JL in terms of e.. x> c where 8 > 1 and c > O. Suppose that each Xi has the Pareto density f(x.O) = cBBx(l+BI. 0 < B < 1. For the purpose of modeling. Find the UMP test. survival times of a sample of patients receiving an experimental treatment.. X 2 .. > 1.Fo(y)l". ~ > O. 8.•• of Li.12. Show how to find critical values.1. we derived the model G(y. Hint: Use the results of Theorem 1.. (Ii) Show that if we model the distribotion of Y as C(min{X I .B) = Fi!(x). (b) Find the optimal test statistic for testing H : JL = JLo versus K .. and let YI . (See Problem 4.
0. 1 .L and unknown variance (T2. Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1. = . 0 = 0. Show that under the assumptions of Theorem 4.. Show that the test is not UMP.6t is UMP for testing H : 8 80 versus K : 8 < 80 .0) UeB for '.2.O)d. I Hint: Consider the class of all Bayes tests of H : () = . e e at l · 6.{Od varies between 0 and 1. where the €i are independent normal random variables with mean 0 and known variance . What would you .(O) «) L > 1 i The lefthand side equals f6". . 10. We want to test whether F is exponential. /.52. Show that the UMP test is based on the statistic L~ I Fo(Y. n announce as your level (1 .3.. with distribution function F(x).j "'" . . 0. Problem 2. every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t.). > Er • . x > 0.. B> O. we test H : {} < 0 versus K : {} > O..(O) L':x. Oo)d.276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y.(O)/ 16' I • 00 p(x.0 e6 1 0~0. L(x. I . or Weibull.  1 j . .' L(x. Let Xl.... Oo)d..2 the class of all Bayes tests is complete. Problems for Section 4. Show that under the assumptions of Theorem 4. x > 0.1 and 01 loss.01.3. 1 • ..4 1.0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi .2.'? = 2. 'i . Hint: A Bayes test rejects (accepts) H if J ..d.' " 2..i.X)' = 16. 1 X n be i. eo versus K : B = fh where I i I : 11. 00 p(x. i = 1.. Show that under the assumptions of Theorem 4.. . O)d.(O) 6 .1).X)2. J J • 9.exp( x). j 1 To see whether the new treatment is beneficial. Assume that F o has a density !o(Y). . n... (a) Show how to construct level (1 . F(x) = 1 . the denominator decreasing. .3. 0) e9Fo (Y) Fo(Y). 12. Using a pivot based on 1 (Xi .{Oo} = 1 .exp( _x B). Let Xl. . 1 X n be a sample from a normal population with unknown mean J.' (cf. F(x) = 1 . The numerator is an increasing function of T(x)..0". Let Xi (f)/2)t~ + €i.
Suppose that in Example 4. what values should we use for the t! so as to make our interval as short as possible for given a? 3.d.1. but we may otherwise choose the t! freely. has a 1. Show that if Xl. .al). find a fixed length level (b) ItO < t i < I. Use calculus.IN] . then the shortest level n (1 .10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 . It follows that (1 ..02. i = 1. with N being the smallest integer greater than no and greater than or equal to [Sot".) LCB and q(X) is a level (1.Section 4. . .1.4.(2) " .01 [X .2. .a.(al + (2)) confidence interval for q(8). where 8.~a) /d]2 .) Hint: Use (A. t.. Then take N .3 we know that 8 < O. minI ii. (Define the interval arbitrarily if q > q.1 Show that. . 5. [0.1. 6.(X) = X .3). (a) Justify the interval [ii.(2) UCB for q(8).n is obtained by taking at = Q2 = a/2 (assume 172 known).4.. X+SOt no l (1. there is a shorter interval 7.4..c. Let Xl.z(1 . Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2.1) based on n = N observations has length at most l for some preassigned length l = 2d..~a)!.q(X)] is a level (1 . Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a. Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji. n. Stein's (1945) twostage procedure is the following. Suppose we want to select a sample size N such that the interval (4.IN. < 0.7).a. What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4. then [q(X). Show that if q(X) is a level (1 . with X . ... I")/so.Xn be as in Problem 4.X are ij. 0. distribution. of the form Ji..a) interval of the fonn [ X . X + z(1 .n .1. ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0.IN(X  = I:[' lX.4. N(Ji" 17 2) and al + a2 < a.4.SOt no l (1.. X!)/~i 1tl ofB.3). 0. although N is random. Compare yonr result to the n needed for (4. .no further observations.] .l.a) confidence interval for B./N. where c is chosen so that the confidence level under the assumed value of 17 2 is 1 .~a)/ . < a.1)] if 8 given by (4..1] if 8 > 0...
Show that ~(). exhibit a fixed length level (1 .. 0") distribution and is independent of sn" 8. ..001. it is necessary to take at least Z2 (1 (12/ d 2 observations.4. (12 2 9. (a) Use (A. .~ a) upper and lower bounds. " 1 a) confidence . Let XI. T 2 I I' " I'. (12) and N(1/.3) are indeed approximate level (1 . Hence. Y n2 be two independent samples from N(IJ..O)]l < z (1. in order to have a level (1 . use the result in part (a) to compute an approximate level 0. (The sticky point of this approach is that we have no control over N.jn(X . X n1 and Yll .95 confidence interval for (J. (12. is _ independent of X no ' Because N depends only on sno' given N = k.a) confidence interval for sin 1 (v'e). sn.') populations.3. 0. 1986. X has a N(J1. (a) Show that X interval for 8. Indicate what tables you wdtdd need to calculate the interval. ./3z (1.. . . Show that these two quadruples are each sufficient..jn is an approximate level (b) If n = 100 and X = 0..0)/[0(1. ..a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a).1. (b) Exhibit a level (1 . if (J' is large. The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook. respectively.  10. Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment. i (b) What would be the minimum sample size in part (a) if (). 0) and X = 81n. Let 8 ~ B(n.: " are known. d = i II: .!a)/ 4.18) to show that sin 1( v'X)±z (1 .) I ~i .a) interval defined by (4. Suppose that it is known that 0 < ~..) (lr!d observations are necessary to achieve the aim of part (a).~a)]. > z2 (1  is not known exactly.O). 1"2.l4. but we are sure that (12 < (It. 11. 4(). = 0.. we may very likely be forced to take a prohibitively large number of observations. (a) Show that in Problem 4.jn is an approximate level (1  • 12. ±. find ML estimates of 11.051 (c) Suppose that n (12 = 5. By Theorem B.a) confidence interval for (1/1'). (J'2 jk) distribution. . (c) If (12. and the fundamental monograph of Wald.3. and.6. Show that the endpoints of the approximate level (1 . 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 . (a) If all parameters are unknown.) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi.a:) confidence interval of length at most 2d wheri 0'2 is known.0') for Jt of length at most 2d. I .fN(X .. Hint: Set up an inequality for the length and solve for n.4. v. Let 8 ~ B(n. (1  ~a) / 2. 1947.1') has a N(o. I 1 j Hint: [O(X) < OJ = [.
1) + V2( n .t ([en .' 1 (Xi . K. Show that the confidence coefficient of the rectangle of Example 4.027 13. Now use the law of large numbers.I). Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed. 15. Compare them to the approximate interval given in part (a).1)8'/0') .4 = E(Xi .9 confidence interval for {L x= + cr.4.2 is known as the kurtosis coefficient.Q confidence interval for cr 2 . = 3.3).2. is replaced by its MOM estimate.2.1 is known. (a) Show that x( "d and x{1.10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0. then BytheproofofTheoremB.3.9 confidence interval for cr.4.4.J1. XI 16. .4. In Example 4.9 confidence region for (It.) Hint: (n .n} and use this distribution to find an approximate 1 . If S "' 8(64. give a 95% confidence interval for the true proportion of cures 8 using (al (4. See A. and X2 = X n _ I (1 .1 and. (n _1)8'/0' ~ X~l = r(~(nl).~Q).5 is (1 _ ~a) 2. V(o') can be written as a sum of squares of n 1 independentN(O.)' .' Now use the central limit theorem.3. (In practice. (b) A level 0.30. Slutsky's theorem. Xl = Xn_l na).".I) + V2( n1) t z( "1) and x(1 . See Problem 5.(n .).4. .1) t z{1 ."') .) can be approximated by x(" 1) "" (n .J1.ia). ~). and (b) (4.3.4 ) . Now the result follows from Theorem 8. it equals O.". and the central limit theorem as given in Appendix A. 1) random variables.n(X . Hint: Use Problem 8. . In the case where Xi is normal.IUI. Compute the (K. cr 2 ) population.J1. Find the limit of the distribution ofn.4/0.)/01' ~ (J1.Section 4. but assume that J1. Assuming that the sample is from a /V({L. Hint: Let t = tn~l (1 . (c) Suppose Xi has a X~ distribution. Z.4 and the fact that X~ = r (k. and 10 000. 4) are independent.' = 2:.3. 14.9 confidence interval for {L. 0).<.J1. K.)4 < 00 and that'" = Var[(X. known) confidence intervals of part (b) when k = I. :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution.7). (d) A level 0. find (a) A level 0.8).)'.3.1.D andn(X{L)2/ cr 2 '" = r (4. Hint: By B. 10. Suppose that 25 measurements on the breaking strength of a certain alloy yield 11. 100.2. (c) A level 0.
we want a level (1 .6. indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4.3 for deriving a confidence interval for B = F(x). .7. verify the lower bounary I" given by (4.4. 1'1(0) < X < 1'.. Consider Example 4.1 (b) V1'(x)[l.I Bn(F) = sup vnl1'(x) . In this case nF(l') = #[X. " • j .i. Show that for F continuous where U denotes the uniform. )F(x)[l.1. < xl has a binomial distribotion and ~ vnl1'(x) .6 with .4. . • i.4.l). define . In Example 4. (c) For testing H o : F = Fo with F o continuous. i 1 . I (b) Using Example 4.95. 18. (a) For 0 < a < b < I. (a) Show that I" • j j j = fa F(x)dx + f. .u.I I 280 Testing and Confidence Regions Chapter 4 17. X n are i.F(x)1 is the approximate pivot given in Example 4. (b)ForO<o<b< I. !i . It follows that the binomial confidence intervals for B in Example 4. I' . pl(O) < X <Fl(b)}.F(x)l.11 . as X and that X has density f(t) = F' (t). Suppose Xl.05 and .4. That is.9) and the upper boundary I" = 00.3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I .. define An(F) = sup {vnl1'(X) .~u) by the value U a determined by Pu(An(U) < u) = 1.F(x)1 .r fixed.F(x)]dx. U(O. b) for some 00 < a < 0 < b < 00.F(x)1 Typical choices of a and bare . . distribution function. find a level (I . 19.F(x)1 )F(x)[1 .a) confidence interval for F(x).4. Assume that f(t) > 0 ifft E (a.1'(x)] Show that for F continuous.1.u) confidence interval for 1".d..
~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant. Give a 90% confidence interval for the ratio ~ of mean life times.2n2 distribution. X2) = 1 if and only if Xl + Xi > c.· (a) If 1(0. = 1 rejected at level a 0.al) and (1 . (b) Derive the level (1. then the test that accepts. < XI? < f(1.2 that the tests of H : . is of level a for testing H : 0 > 00 .5 1. Yn2 be independent exponential E( 8) and £(.a) LCB corresponding to Oc of parr (a). .(2) together. . . (a) Deduce from Problem 4... Hint: If 0> 00 ..1.\..0.a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 . X 2 be independent N( 01. show that [Y f( ~a) I X. Let Xl.5. 5. if and only if 8(X) > 00 . O ) is an increasing 2 function of + B~. ..Section 4.4 and 8.3..a) UCB for 0.3.05.. lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 .2). (15 = 1. of l. for H : (12 > (15. Experience has shown that the exponential assumption is warranted. "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function. (e) Suppose that n = 16. How small must an alternative before the size 0. = 0. test given in part (a) has power 0.th quantile of the . Yf (1 . (e) Similarly derive the level (1 .1"2nl. Hint: Use the results of Problems B. 1/ n J is the shortest such confidence interval. What value of c givessizea? (b) Using Problems B. M n /0. 3.3. (d) Show that [Mn .2). .2 = VCBs of Example 4. Show that if 8(X) is a level (1 . and let Il. and consider the prob2 + 8~ > 0 when (}"2 is known.) denotes the o.4.1 has size a for H : 0 (}"2 be < 00 .~a) I Xl is a confidence interval for Il.13 show that the power (3(0 1 .a..1 O? 2. . N( O . (a) Find c such that Oc of Problem 4. respectively.2 are level 0.. a (b) Show that the test with acceptance region [f(~a) for testing H : Il. x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il.10 Problems and Complements 281 Problems for Section 4. respectively. ~ 01). = 1 versns K : Il. Let Xl.5. with confidence coefficient 1 .12 and B. [8(X) < 0] :J [8(X) < 00 ].3.Xn1 and YI .90? 4.) samples. Or .
X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 .4. We are given for each possible value 1]0 of1] a level 0: test o(X. 0. L:J ~ ( . I . and 1 is continuous and positive.282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2). (b) The sign test of H versus K is given by. (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing . >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large. Thus.Xn be a sample from a population with density f (t . ry) = O}.0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses. 6. ). (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?. k . og are independentN (0. Let X l1 . (e) Show directly that P. T) where 1] is a parameter of interest and T is a nuisance parameter. O. Let C(X) = (ry : o(X. 0. I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~.8) where () and fare unknown.a) confidence region for the parameter ry and con versely that any level (1 . X.  j 1 j .00 .1 when (72 is unknown. but I(t) = I( t) for all t.~n + ~z(l . (a) Shnw that C(X) is a level (l .. (c) Show that Ok(o) (X. 1 Ct ~r (b) Find the family aftests corresponding to the level (1. 'TJo) of the composite hypothesis H : ry = ryo. .2). Show that PIX(k) < 0 < X(nk+l)] ~ .2).[XU) (f) Suppose that a = 2(n') < OJ and P. . Suppose () = ('T}. respectively. lif (tllXi > OJ) ootherwise. . . l 7. (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F. .IX(j) < 0 < X(k)] do not depend on 1 nr I.a. . N( 020g.a)y'n. Show that the conclusions of (aHf) still hold. O?. . Hint: (c) X.0:) LeB for 8 whatever be f satisfying our conditions.0:) confidence interval for 11.. of Example 4. we have a location parameter family.
7.5.1) is unbiased. Let a(S. Y. (b) Describe the confidence region obtained by inverting the family (J(X.Section 4.a).5.2.7.z(l. Y. hence. Suppose X.2a) confidence interval for 0 of Example 4. if 00 < O(S) (S is inconsistent with H : 0 = 00 ). Y ~ N(r/. let T denote a nuisance parameter. 7)).p) = = 0 ifiX .Y. (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X . ' + P2 l' z(1 1 .a) = (b) Show that 0 if X < z(1 .a) confidence interval [ry(x). Let P (p.5. ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4.2.1. 10.O ~ J(X. Show that the Student t interval (4.00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H. 1). the quantity ~ = 8(8) .a.3 and let [O(S). p.2 a) (a) Show that J(X. Then the level (1 . that suP. Let 1] denote a parameter of interest. or). Yare independent and X ~ N(v. and let () = (ry. Thus. 1) and q(O) the ray (X . a( S. alll/. Hint: You may use the result of Problem 4. 12.a). laifO>O <I>(z(1 . Define = vl1J. Show that as 0 ranges from O(S) to 8(S).a) . 1).pYI < (1 1 otherwise. . Note that the region is not necessarily an interval or ray. 11.10 Problems and Complements 283 8. 9.20) if 0 < 0 and. This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line. Let X ~ N(O. 0) ranges from a to a value no smaller than 1 . p)} as in Problem 4.5.5.z(1 . That is. 8( S)] be the exact level (1 . Po) is a size a test of H : p = Po. Establish (iii) and (iv) of Example 4. 00) is = 0'.[q(X) < 0 2 ] = 1.a))2 if X > z(1 .
(a) Show that testing H : x p < 0 versus I< : x p > 0 is equivalent to testing H' : P(X > 0) < (I . Suppose we want a distributionfree confidence region for x p valid for all 0 < p < 1. . (See Section 3. 14. Thus. Show that P(X(k) < x p < X(nl)) = 1. . I • (c) Let x· be a specified number with 0 < F(x') < 1. .1 and Fu 1.) Suppose that p is specified. . lOOx. and F+(x) be as in Examples 4. where ! 1 1 heal c: n(1  p) + zl_QVnp(1  pl. P(xp < x p < xp for all p E (0.7. We can proceed as follows.4.284 Testing and Confidence Regions Chapter 4 13... (b) The quantile sign test Ok of H versus K has critical region {x : L:~ 1 1 [Xi > 0] > k}.p) versus K' : P(X > 0) > (1 . Construct the interval using F.p). Let x p = ~ IFI (p) + Fi! I (P)]. That is. = ynIP(xp) ..05 could be the fifth percentile of the duration time for a certain disease. " II [I .95 could be the 95th percentile of the salaries in a certain profession l or lOOx .p)"j. it is distribution free. k+ ' i . Let F.b) (a) Show that this statement is equivalent to = 1.6 and 4.a. X n be a sample from a population with continuous distribution F.. That is. Then   P(P(x) < F(x) < P+(x)) for all x E (a. (g) Let F(x) denote the empirical distribution. 1 . (e) Let S denote a B(n. L. Show that Jk(X I x'.1» .. In Problem 13 preceding we gave a disstributionfree confidence interval for the pth quantile x p for p fixed. 0 < p < 1. F(x). Detennine the smallest value k = k(a) such that Jk(Q) has level a for H and show that for n large. k(a) .p) variable and choose k and I such that 1 . (X(k)l X(n_I») is a level (1 . Simultaneous Confidence Regions for Quantiles.5. Show that the interval in parts (e) and (0 can be derived from the pivot T(x ) p  . ! . iI . il. vF(xp) [1 F(xp)] " Hint: Note that F(x p ) = p. (t) Show that k and 1in part (e) can be approximated by h (~a) and h (1  ~a) where ! heal is given in part (b).4.h(a). .a = P(k < S < n I + 1) = pi(1.a) LeB for x p whatever be f satisfying our conditions.F(x p)]. Let Xl. be the pth quantile of F. Confidence Regions for QlIalltiles. .a. X n x*) is a level a test for testing H : x p < x* versus K : x p > x*.a) confidence interval for x p whatever be F satisfying our conditions.. i j (d) Deduce that X(n_k(Q)+I) (XU) is the jth order statistic of the sample) is a level (1 .
L i==I n I IFx (Xi) . where A is a placebo.. . VF(p) = ![x p + xlpl. The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R.z.4. F~x) has the same distribution Show that if F x is continuous and H holds. n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X..U with ~ U(O.j < F_x(x)] See also Example 4.17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p . X n and . LetFx and F x be the empirical distributions based on the i.1. 15. that is. That is... . ~ ~ (a) Consider the test statistic x .1. Give a distributionfree level (1 . .r" ~ inf{x: a < x < b. F(x) > pl.(x)] < Fx(x)] = nF1_u(Fx(x)). TbeaitemativeisthatF_x(t) IF(t) forsomet E R. where p ~ F(x). F f (.Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}. Suppose that X has the continuous distribution F. then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U ). Suppose X denotes the difference between responses after a subject has been given treatments A and B..i.u are the empirical distributions of U and 1 .x. (b) Express x p and ..r p in terms of the critical value of the Kolmogorov statistic and the order statistics. Express the band in terms of critical values for An(F) and the order statistics. .. Hint: Let t.T: a <.X n ..) < p} and.(x) = F ~(Fx(x» .d.13(g) preceding. (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X.r < b. where F u and F t . We will write Fx for F when we need to distinguish it from the distribution F_ x of X.X I.p.10 Problems and Complements 285 where x" ~ SUp{. X I. Note the similarity to the interval in Problem 4.xp]: 0 < l' < I}. (c) Show how the statistic An(F) of Problem 4. then L i==l n IIXi < X + t. the desired confidence region is the band consisting of the collection of intervals {[. I). .5.. .Section 4.
where do:. where :F is the class of distribution functions with finite support.(F(x)) .) < Fx(x)) = nFv(Fx(x)).a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(. and by solving D( Fx.286 Testing and Confidence Regions ~ Chapter 4 '1.. i=I n 1 i t . To test the hypothesis H that the two treatments are equally effective...Fy ): 0 < p < 1}. Moreover.d. Fx. vt = O<p<l sup vt(p) O<p<l where [V. we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R.. Show that for given F E F.. Fenstad and Aaberge (1977). let Xl. nFy(t) = 2. for ~..'" 1 X n be Li.) is a location parameter} of all location parameter values at F.) ~ < F x (")] ~ ~ = nFu(Fx(x)).p)] = ~[Fxl(p) + F~l:(p)].1. . he a location parameter as defined in Problem 3. then D(Fx.:~ I1IFx(X. i I f F. nFx(x) we set ~ 2. where x p and YP are the pth quan tiles of F x and Fy. It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H. F y ). Hint: nFx(t) = 2. Let t (c) A Distdbution and ParameterFree Confidence Interval.. Also note that x = H.x = 2VF(F(x))..) = D(Fu . I I I 1 1 D(Fx.x p . .U ). (p). It follows that if c. then D(Fx ..) : :F VF ~ inf vF(p). then  nFy(x + A(X)) = L: I[Y. <aJ Show that if H holds. Hint: Define H by H.a) simultaneous confidence band for A(x) = F~l:(Fx(x)) .x (': + A(x)).x. . Let 8(. Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ . where F u and F v are independent U(O.a) simultaneous confidence band 15.Fy ) = maxIFy(t) 'ER "" ..d. < F y I (Fx(x))] i=I n = L: l[Fy (Y.Jt: 0 < p < 1) for the curve (5 p (Fx.". F y ) .:~ II[Fx(X. the probability is (1 . ~ ~ ~ F~x.F x l (l . . We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y ...F~x.(x) ~ R. we get a distributionfree level (1 . "" = Yp . = 1 i 16. 1 Yn be i.. i ..17.3.) < do. The result now follows from the properties of B(·). Properties of this and other bands are given by Doksum. (b) Consider the parameter tJ p ( F x. Then H is symmetric about zero. Fx(t)l· . treatment B responses.. Give a distributionfree level (1.l (p) ~ ~[Fxl(p) . treatment A (placebo) responses and let Y1 . F y ) has the same distribntion as D(Fu . vt(p)) is the band in part (b). is the nth quantile of the distribution of D(Fu l F 1 .) < Fy(x p )] = nFv (Fx (t)) under H.i.t::.5.1) empirical distributions.) < Fx(t)] = nFu(Fx(t)). Hint: Let A(x) = FyI (Fx(x)) . .2A(x) ~ HI(F(x)) l 1 + vF(F(x)).:~ 1 1 [Fy(Y. As in Example 1.F1 _u).
) I i=l L tt . (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T.). 0 < p < 1. F y. F x ) ~ a and YI > Y.a) that the interval [8. Suppose X I. O(Fx.10 Problems and Complements 287 Moreover. :F x :F O(Fx. x' 7 R. Show that n p is known and 0 O' ~ (2 2~A X. t.Section 4.2.E(X). if I' = 1I>'. (d) Show that E(Y) .    Problems for Section 4... F y ) is in [6. the probability is (1 .. we find a distributionfree level (1 . tt Hint: Both 0 and 8* are nonnally distributed.) .(x) = Fy(x thea D(Fx. T n n = (22:7 I X. Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976). F y ) and b _maxo<p<l 6p(Fx . (b) Consider the unbiased estimate of O. Now apply the axioms.3. . Fy ) . F y ). then B( Fx . F y ) > O(Fx " Fy). It follows that if we set F\~•.(x)) .4. 6.(P). nFx ("') = nFu(Fx(x)).) = F (p) . Let do denote a size a critical value for D(Fu . Fy ).2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8. = 22:7 I Xf/X2n( a) is . 2. thea It a unifonnly most accurate level 1 . where is nnknown. then by solving D(F x..(X). ~ D(Fu •F v ). (a) Consider the model of Problem 4. ~ ~ = (e) A Distribution and ParameterFree Confidence Interval. F y ) < O(Fx . B(.) < do: for Ll. where :F is the class of distri butions with finite support.6]. Q.)I L ti . Fy. then Y' = Y. moreover X + 6 < Y' < X + 6. F y ) E F x F..2uynz(1 i=! i=l all L i=l n ti is also a level (1. . Let6.6. Fx +a) = O(Fx a. Show that if 0(·. 3.0:) confidence bound forO. .(Fx . Show that for the model of Problem 4. Show that B = (2 L X.6+] contains the shift parameter set {O(Fx.). F y.('. ~) distribution. is called a shift parameter if O(Fx.6 1..Fy)... Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('.= minO<p<l 6. Exhibit the UMA level (1 .·) is a shift parameter. ) is a shift parameter} of the values of all shift parameters at (Fx .(.T.L.) 12:7 I if.a) UCB for O.0: UCB for J. .Xn is a sample from a r (p. Fv ).4.. 6+ maxO<p<1 6+(p). and 8 are shift parameters.) ~ + t.F y ' (p) = 6. Show that for given (Fx . Let 6 = minO<1'<16p(Fx.0:) simultaneouS coofidence band for t. t Hint: Set Y' = X + t.
6. '" J. Xl. 0']. Pa( c. OJ. Prove Corollary 4. Establish the following result due to Pratt (1961).( 0' Hint: E. 3.Bj has the F distribution F 2r . ~.B). Show that if F(x) < G(x) for all x and E(U). • . 5.. [B. 1I"(t) Xl. e> 0. V be random variables with d. I'. . .W < BI = 7.B(n.B). then )" = sB(r(1 . Let U. s). s > 0. . (b) Suppose that given B = B. distribution with r and s positive integers. satisfying the conditions of Problem 8.0 upper and lower confidence bounds for J1 in the model of Problem 4. i .B') < E."" X n are i.. B). t) is the joint density of (8.2" Hint: See Sections 8. U(O. where s = 80+ 2n and W ~ X.aj LCB such that I .) and that.. • • • = t) is distributed as W( s. t)dsdt = J<>'= P. (a) Show that if 8 has a beta.3.l2(b). g.2 and B.d.1. and that B has the = se' (t.B)+. Hint: By Problem B.[B < u < O]du. (B" 0') have joint densities. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for. (0 .a. then E(U) > E(V).(O . Suppose that given B Pareto.X. E(U) = .d. In Example 4.E(V) are finite.f:s F.B')+.6.1. Problems for Section 4. where Show that if (B. s). .6. (1: dU) pes.i.W < B' < OJ for allB' l' B.3).7 1 . respectively.12 so that F.\ =. Suppose [B'.6 to V = (B . Hint: Use Examples 4. f3( T. G corresponding to densities j. P(>. Hint: Apply Problem 4.\. n = 1.2. density = B.2. Construct unifoffilly most accurate level 1 . F1(tjdt..1 are well defined and strictly increasing. . ..C. then E.. X has a binomial. U = (8 . and for (). establish that the UMP test has acceptance region (4. s) distribution with T and s integers.3 and 4.3.288 Testing and Confidence Regions Chapter 4 4. p.a) upper and lower credible bounds for A. Suppose that B' is a uniformly most accurate level (I . 2. J • 6.2. 1. j .. where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t. Suppose that given. .O] are two level (1 . t > e.a) confidence intervals such that Po[8' < B' < 0'] < p. 0).6. Poisson. 8.Xn are Ll.4..B) = J== J<>'= p( s.\ is distributed as V /050. /3( T. .' with "> ~.4. > 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I . distribution and that B has beta. ·'"1 1 .l. uniform. . .6 for c fixed.3.
x)1f(/1. (a) Let So = E(x.Xn + 1 be i. Hint: p( 0 I x. respectively. 1f(/11 1r. as X.1. /11 and /12 are independent in the posterior distribution p(O X. = So I (m + n  2). Let Xl. Problems for Section 4.a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B.x) is aN(x.Xm.... and Y1. . (c) Give a level (1 . y) is obtained by integrating out Tin 7f(Ll.y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2. where (75 is known...0:) confidence interval for B. 1 r.. . ... Xl.i.2 degrees of freedom.Section 4.0:) prediction interval for Xnt 1. Here Xl.y.y) is 1f(r 180)1f(/11 1 r. (a) Give a level (1 . y) and that the joint density of. N(J.. r 1x. Yn are two independent N (111 \ 'T) and N(112 1 'T) samples. Show thatthe posterior distribution 7f( t 1 x. Hint: 7f(Ll 1x.112.1. (b) Find level (1 . (b) Show that given r. Suppose that given 8 = (ttl' J.0"5). ..rlm) density and 7f(/12 1r.y.x)' + E(Yj y)'.10 Problems and Complements 289 (a)LetM = Sf max{Xj.y. y) is aN(y. y) of is (Student) t with m + n . .Xn ). .d.l. Show fonnaUy that the posteriof?r(O 1 x.1 )) distribution..a) upper and lower credible bounds for O. r > O.. r) where 7f(Ll I x . In particular consider the credible bounds as n + 00.8 1. (d) Compare the level (1. Show that (8 = S IM ~ Tn) ~ Pa(c'.r)p(y 1/12.. 'T). .2.y) = 7f(r I sO)7f(LlI x .Xn is observable and X n + 1 is to be predicted. r( TnI (c) Set s2 + n. Suppose 8 has the improper priof?r(O) = l/r.m} and + n..r). 'P) is the N(x . (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll. T) = (111. . r In) density. 4. = PI ~ /12 and'T is 7f(Ll.s') withe' = max{c. .r 1x. y) is proportional to p(8)p(x 1/11. y). ". ..
2n distribution. 1 X n are i. .s+n. distribution. 00 < a < b (1 . Then the level of the frequentist interval is 95%.Q) distribution free lower and upper prediction bounds for X n + 1 • < 00.. q(ylx)= ( .5. as X where X has the exponential distribution F(x I 0) =I . has a B(m. )B(r+x+y.d.' .12..s+nx+my)/B(r+x. give level 3.9.05. 0 > 0.2) hy using the observation that Un + l is equally likely to be any of the values U(I).8. . :[ = . (This q(y distribution. 0) distribution given (J = 8.... CJ2) with (J2 unknown.i.• ..8. That is.t for M = 5. b). . which is not observable.9.Xn are observable and X n + 1 is to be predicted.nl. Let X have a binomial.d..x . In Example 4. 4. .a).d. .i.2. Un + 1 ordered.) Hint: First show that . x > 0. I x) is sometimes called the Polya q(y I x) = J p(y I 0).. . Present the results in a table and a graph. Let Xl.0'5) with 0'5 known.'.Xn + 1 be i.a) lower and upper prediction bounds for X n + 1 . • . . Suppose that Y.8.a) prediction interval for X n + l .·) denotes the beta function. 0). i! Problems for Section 4.. where Xl.e.. (b) If F is N(p" bounds for X n + 1 . < u(n+l) denote Ul .8).i. X is a binomial. . " u(n+l).290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4. Suppose XI.5. X n+ l are i. Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X . suppose Xl. n = 100. F. and that (J has a beta. let U(l) < .. ! " • Suppose Xl. random variable. 0'5)' Take 0'5 ~ 7 2 = I.'" .9 1.. Suppose that given (J = 0.(0 I x)dO. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I . (a) If F is N(Jl. give level (I . as X ..3) by doing a frequentist computation of the probability of coverage. N(Jl. Establish (4. 1 2. .. (3(r. . X n such that P(Y < Y) > 1. . distribution.10.Xn are observable and we want to predict Xn+l. B( n. and a = . Find the probability that the Bayesian interval covers the true mean j.8. 5.15. Give a level (1 . ~o = 10.11...x ) where B(·. Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 ....a (P(Y < Y) > I .0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl. give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a.10. B(n. s). . A level (1 .
. (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12. JL > Po show that the onesided..n) and >. = Xnl (1 .. Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise. ° 3.. OneSided Tests for Scale. Xn_1 (~a). Thus.Xn be a N(/L. We want to test H . if ii' /175 < 1 and = . log . = aD nO' 1". Show that (a) Likelihood ratio tests are of the form: Reject if..X) aD· t=l n > C.Q.~Q) also approximately satisfy (i) and (ii) of part (a). n Cj I L. of the X~I distribution.C2n !' 1 as n !' 00. In Problems 24. I . (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B.Section 4.~Q) approximately satisfy (i) and also (ii) in the sense that the ratio .. ~2 ..log (ii' / (5)1 otherwise. and only if. TwoSided Tests for Scale.. >. 2. 2 2 L. (ii) CI ~ C2 = n logcl/C2.f. 0'2) sample with both JL and 0'2 unknown. .( x) ~ 0. if Tn < and = (n/2) log(1 + T. (i) F(c. where F is the d. where Tn is the t statistic.3.(x) = 0.3. onesample t test is the likelihood ratio test (fo~ Q <S ~).(x) Hint: Show that for x < !n.) .F(CI) ~ 1.. 0' 4.V2nz(1 . (b) Use the nonnal approximatioh to check that CIn C'n n . We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if. (c) Deduce that the critical values of the commonly used equaltailed test.(nx)./(n . Hint: log . 0'2 < O'~ versus K : 0'2 > O'~.(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy.(x) = .~Q) n + V2nz(l. · .='7"nlog c l n /C2n CIn . XnI (1 . In testing H : It < 110 versus K . and only if.Q).I)) for Tn > 0. let X 1l .10 Problems and Complements 291 is an increasing function of (2x .(Xi .
1 I 794 2012 1800 2477 576 3498 411 2092 897 1808 I Assume the twosample normal model with equal variances. 8.. 114. . = 0 and €l..95 confidence interval for p.1'2.. I'· (c) Find a level 0. .292 Testing and Confidence Regions Chapter 4 5.95 when nl = nz = ~n and (1'1 1'2)/17 ~ ~.Xn1 and YI . 0 E e. • Xi = (JXi where X o I 1 + €i. . .1 < J. i = 1. .05. (a) Find a level 0. . 110. (a) Using the size 0: = 0.Xn are said to be serially correlated or to follow an autoregressive model if we can write ~ . e 1 ) depends on X only throngh T. respectively. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall.. . Assume the onesample normal model. I Yn2 be two independent N(J.L2 versus K : j.. 7.90 confidence interval for a by using the pivot 8 2 ja z . find the sample size n needed for the level 0. Assume a Show that the likelihood ratio statistic is equivalent to the twosample t statistic T. The nonnally distributed random variables Xl. 0'2 ~ ! corresponding to inversion of the (c) Compute a level 0.z ... Suppose X has density p(x. . and that T is sufficient for O. The control measurements (x's) correspond to 0 pounds of mulch per acre. can we conclude that the mean blood pressure in the population is significantly larger than IDO? (b) Compute a level 0. (7 2) is Section 4. The following blood pressures were obtained in a sample of size n = 5 from a certain population: 124. whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre. a 2 ) random variables.O).01 test to have power 0. 100.lIl (7 2) and N(Jl..l.tl > {t2. . Forage production is also measured in pounds per acre.3. eo. n. (b) Consider the problem aftesting H : Jl. Show that A(X.4. (X. 6.. (b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use ct = 0.05 onesample t test. 9.2' (12) samples.. 0'2). 190. (c) Using the normal approximation <l>(z(a)+ nl nz/n(1'1 I'z) /(7) to the power..9.L.9.. €n are independent N(O. Y. Let Xl. x y vi I I . .P. . where a' is as defined in < ~.90 confidence interval for the mean bloC<! pressure J. (a) Show that the MLE of 0 ~ (1'1.95 confidence interval for equaltailed tests of Problem 4.
Consider the following model. Condition on V and apply the double expectation theorem. J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint. and only if.6. . ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~.IIZI > tjvlk] is increasing in [61. i = 1. for each v > 0. Then use py. distribution.. / L:: i 10.'" p(X. (a) P.[Z > tVV/k1 is increasing in 6. (Tn. . (b) P. Hint: Let Z and V be independent and have N(o. Let Xl. The F Test for Equality of Scale.2.. (Yl. (Yl) = f PY"Y. (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl. Suppose that T has a noncentral t.OXi_d 2} i= 1 < Xi < 00. The power functions of one. .. 12. . respectively. get the joint distribution of Yl = ZI vVlk and Y2 = V. has level a and is strictly more 11.. (T~). Yn2 be two independent N(P. with all parameters assumed unknown. . = 0.. 13. has density !k .O) for 00 = (Z.IITI > tl is an increasing function of 161.. Xo = o. I).Section 4. Then.2) In exp{ (I/Z(72) 2)Xi .and twosided t tests. Show that the noncentral t distribution. From the joint distribution of Z and V...n. 1 {= xj(kl)ellx+(tVx/k'I'ldx.<» (1 . II. (An example due to C. P. Define the frequency functions p(x.. Yl.O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if. 7k.6.<»1 < e < <>. X n1 . 0) by the following table.. X powerful whatever be B. . . 7k. Stein).10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl. Fix 0 < a < and <>/[2(1 .[T > tl is an increasing function of 6. Show that. P. Let e consist of the point I and the interval [0. Y2 )dY2. samples from N(J1l. XZ distributions respectively.(t) = .
.1)]E(Y.8. x y 254 2.~ I .. Consider the problem of testing H : p = 0 versus K : p i O. Sbow.10.4. F > /(1 .7 and B. Yn) be a sampl~ from a bivariateN(O. . can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I . the continuous versiop of (B. and using the arguments of Problems B..). 1. (a) Show that the LR test of H : af = a~ versus K : ai > and only if. The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study. .. a?.nt 1 distribution. p) distribution. 1.nl1 ar is of the fonn: Reject if.. where l _ .8. .5) that 2 log '(X) V has a distribution. Let (XI.96 279 2. . 15. · . YI )..1)/(n. . i=2 i=2 n n S12 =L i=2 n U. (Xn . Finally. Un = Un.61 310 2. where a.'. . • I " and (U" V. Let R = S12/SIS" T = 2R/V1 R'.4. Vn ) is a sample from a N(O. where f( t) is the tth quantile of the F nz .Y)'/E(X. I ' LX.X)' (b) Show that (aUa~)F has an F nz obtained from the :F table. V.94 Using the likelihood ratio test for the bivariate nonnal model. Argue as in Proplem 4.a/2) or F < f( n/2). and only if. L Yj'. ! .0 has the same dis r j . Un). (11.24) implies that this is also the unconditional distribution. Hint: Use the transfonnations and Problem B.64 298 2. . i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2.68 250 2.' ! . 0.. .. (1~. (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where . .V. i 0 in xi !:. <. • .19 315 2. then P[p > c] is an increasing function of p for fixed c. and use Probl~1ll4. r> (c) Justify the twosided F test: Reject H if.1 .0 has the same distribution as R. si = L v.4. .4. . show that given Uz = Uz .~ I i " 294 Testing and Confidence Regions Chapter 4 .1I(a). .37 384 2. . 0. that T is an increasing function of R. . F ~ [(nl . .9. distribution and that critical values can be . p) distribution.9.4. vn 1 l i 16. !I " .. T has a noncentral Tnz distribution with noncentrality parameter p.'. note that . . > C. (d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4.12 337 1. . p) distribution.4) and (4.1.71 240 2. 14. . using (4. Because this conditional distribution does not depend on (U21 •.62 284 2.9. as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2.1' Sf = L U. (Un.7 to conclude that tribution as 8 12 /81 8 2 . Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model. 1 . .J J .
and C.0. 00.0. Leonard. Consider the cases Ii.4 (I) If the continuity correction discussed in Section A. AND F. R. P. and r.01. (a) Find the level" LR test for testing H : 8 E given in Problem 3. (2) In using 8(5) as a confidence hound we are using the region [8(5). REFERENCES BARLOW.9. Nntes for Section 4.01 + 0. O'CONNELL. P. 187. . "Is there a sex bias in graduate admissions?" Science. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete. Tl . Mathematical Theory of Reliability New York: J.11 NOTES Notes for Section 4.3).I\ E. E...1 (1) The point of view usually taken in science is that of Karl Popper [1968].2. !. Wiley & Sons.035.11 Notes 295 17. Notes for Section 4. i].12 1965. Because the region contains C(X)..(t) = tl(. F. (b) Compare your solution to the Bayesian solution based on a continuous loss function rro = 0. Rejection is more definitive. G. Essentially.15 is used here. The term complete is then reserved for the class where strict inequality in (4...). BICKEL. T6 .o = 0. (2) We ignore at this time some reallife inadequacies of this experiment such as the placebo effect (see Example 1. HAMMEL.3. respectively. W.. it also has confidence level (1 . Box.851.2. Editors New York: Academic Press. (3) A good approximation (Durbin.Section 4.1. 1983.9. G. i] versus K : 8 't [i. Wu. Data Analysis and Robustness. T. the class of Bayes procedures is complete. if the parameter space is compact and loss functions are bounded. 4. Consider the bioequivalence example in Problem 3. E.3) holds for some 8 if <p 't V. S in 8(X) would be replaced by 5 + and 5 in 8(X) is replaced by 5 .~.895 and t = 0.[ii) where t = 1.. Stephens. (2) The theory of complete and essentially complete families is developed in Wald (1950).398404 (975).[ii .05 and 0. AND J. Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding.3 (1) Such a class is sometimes called essentially complete. 1973. Apology for Ecumenism in Statistics and Scientific lnjerence. PROSCHAN. Box.. . 1974) to the critical value is <.819 for" = 0. 00. see also Ferguson (1967). 4.10.11.
66. AND E. A. 54. "EDF statistics for goodness of fit. K. 421434 (1976).. KLETI. j 1 . A." Biometrika. R. DAS GUPTA.. J. AND R. 36. K. D. T. Statist... "On the combination of independent test statistics. Philadelphia. 3rd ed. Assoc. Sratisr. WETHERDJ. Wiley & Sons. F. 9. 1986. FL: Academic Press. HAlO.. HEDGES. Amer." The AmencanStatistician. "Further notes on Mrs. ." Ann. 1952. ! I I I FERGUSON. 1%2. 243246 (1949)." J. AND K.. OSTERHOFF. WANG. GLAZEBROOK. Statist. Wiley & Sons. E. the Growth ofScientific Knowledge. 38. AND G. DOKSUM. WILKS. 659680 (1967). Amer. 1968. H. . A. c. 13th ed. "Plots and tests for symmetry. B. G. 1997. j j New York: 1. Assoc. . . SACKRoWITZ. Mathematical Statistics. Sequential Analysis New York: Wiley." J.. R. 730737 (1974). "P values as random variablesExpected P values. V. .. Amer. 2nd ed. "Optimal confidence intervals for the variance of a normal distribution. Amer. Pennsylvania (1973). CAl. Aspin's tables:' Biometrika. AND G. Statistical Methods for MetaAnalysis Orlando. LEHMANN. "Distribution theory for tests based on the sample distribution function. T. R. I . j 1 . AND I. Statist.• Statistical Methods for Research Workers. I . S. SIEVERS. POPPER. 1967. . A. "PloUing with confidence: Graphical comparisons of two populations. Y. 605608 (1971)." Biometrika. The Theory of Probability Oxford: Oxford University Press. 54 (2000). 1985. Statist.296 Testing and Confidence Regions Chapter 4 BROWN. 16. SAMUEL<:AHN. 549567 (1961)... M. A. Math. JEFFREYS..243258 (1945). H. R.• Conjectures and Refutations. FENSTAD.. SIAM. 69. STEIN.. . L. DURBIN. WALD. PRATI.. 1958. AND A.. G. 473487 (1977). AND 1. OLlaN. S." An" Math. Statist. Statistical Theory with Engineering Applications . STEPHENs. Sequential Methods in Statistics New York: Chapman and Hall." 1. L. A Decision Theoretic Approach New York: Academic Press. "Interval estimation for a binomial proportion. 53. Mathematical Statistics New York: J. WALD. ! . . K. L. 1947. D.. 1950. New York: Springer. "A twosample test for a linear hypothesis whose pOWer is independent of the variance. "Probabilities of the type I errors of the Welch tests.. VAN ZWET... 326331 (1999). 8.. New York: I J 1 Harper and Row. TATE. New York: Hafner Publishing Company. W... Assoc. 1961. 63. 56. DOKSUM. Statistical Decision Functions New York: Wiley.. AARBERGE." The American Statistician. W. "Length of confidence intervals. FISHER. A.. Testing Statistical Hypotheses." Regional Conference Series in Applied Math. 64." J.674682 (1959). 1. WELCH.
k Evaluation here requires only evaluation of F and a onedimensional integration. our setting for this section EFX1 and use X we can write.1 degrees of freedom..1.i. This distribution may be evaluated by a twodimensional integral using classical functions 297 .1) This is a highly informative formula..3) gn(X) =n ( 2k ) F k(x)(1 . However. telling us exactly how the MSE behaves as a function of n.1). closed fonn computation of risks in tenns of known functions or simple integrals is the exception rather than the rule. the qualitative behavior of the risk as a function of n and simple parameters of F is not discernible easily from (5.13).. . consider a sample Xl.1. In particular. . v(F) ~ F. and F has density f we can wrile (5.3). and calculable for any F and all n by a single onedimensional integration. but a different one for each n (Problem 5.1. (5.Xn ) as an estimate of the population median II(F).2 that .1 (D. consider med(X 11 •.9.F(x)k f(x).2) and (5.. from Problem (B.Xn are i. If XI.1. a 2 ) we have seen in Section 4.2) where. and most of this chapter. the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. computation even at a single point may involve highdimensional integrals.• . Even if the risk is computable for a specific P by numerical integration in one dimension. consider evaluation of the power function of the onesided t test of Chapter 4.1. Ifwe want to estimate J1(F} = (5. N(J1. Worse. 1 X n from a distribution F. Worse.1.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS Despite the many simple examples we have dealt with.fiiXIS has a noncentral t distribution with parameter 1"1 (J and n . If n is odd.Chapter 5 ASYMPTOTIC APPROXIMATIONS 5.2. To go one step further.d. if n ~ 2k + 1. .
. always refers to a sequence of statistics I. j=l • By the law of large numbers as B j. We shall see later that the scope of asymptotics is much greater. but for the time being let's stick to this case as we have until now. or the sequence Asymptotic statements are always statements about the sequence.d. Rn (F). The classical examples are.4) i RB =B LI(F. Approximately evaluate Rn(F) by _ 1 B i (5.3. X n as n + 00.I I !' 298 Asymptotic Approximations Chapter 5 (Problem 5.. .fii(Xn  EF(Xd) + N(O. ..fiit !Xi. is to approximate the risk function under study by a qualitatively simpler to understand and easier to compute function. Thus.1. S) and in general this is only representable as an ndimensional integral. Monte Carlo is described as follows.)2) } . which occupies us for most of this chapter. . In its simplest fonn.3. more specifically. 1 < j < B from F using a random number generator and an explicit fonn for F.2) and its qualitative properties are reasonably transparent.. X n + EF(Xd or p £F( . It seems impossible to determine explicitly what happens to the power function because the distribution of fiX / S requires the joint distribution of (X. .Xn):~Xi<n_l (~2 (EX. .. based on observing n i. Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or. 10=1 f=l There are two complementary approaches to these difficulties. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asyrnptotics briefly in Example 5. Xn)}n>l. .o(X1). where A= { ~ .. . • . observations Xl. VarF(X 1 ». . The first. is to use the Monte Carlo method. _. But suppose F is not Gaussian. X nj }. I j {Tn (X" . .1.i. in this context. Asymptotics. . {X 1j l ' •• . fiB ~ R n (F). Draw B independent "samples" of size n.Xnj ) . for instance the sequence of means {Xn }n>l.. i . where X n = ~ :E~ of medians. just as in numerical integration. which we explore further in later chapters. save for the possibility of a very unlikely event. The other. of distributions of statistics.n (Xl. 00. we can approximate Rn(F) arbitrarily closely. or it refers to the sequence of their distributions 1 Xi.
9) As a bound this is typically far too conservative.3). ../' is as above and . then (5. if EFIXd < 00.14.7) where 41 is the standard nonnal d.01.1. the much PFlIx n  Because IXd < 1 implies that .1. then P [vn:(X F n . and P F • What we in principle prefer are bounds. X n ) we consider are closely related as functions of n so that we expect the limit to approximate Tn(Xl~"· .VarF(Xd.1.l) (5.1. For € = ...1..1. n = 400.9. Is n > 100 enough or does it have to be n > 100.6) for all £ > O.6) gives IX11 < 1. (see A. DOD? Similarly.Xn ) or £F(Tn (Xl..1. Xn is approximately equal to its expectation.. The trouble is that for any specified degree of approximation.1.(Xn .1. .Section 5.' = 1 possible (Problem 5. For instance.1. .6) and (5.' m (5.f.9) is .. the weak law of large numbers tells us that.omes 11m'.25 whereas (5. For instance. PF[IX n _  .. (5. As an approximation.1.1. . X n ) but in practice we act as if they do because the T n (Xl. if more delicate Hoeffding bound (B. say.15. (5. this reads (5. Similarly.8) Again we are faced with the questions of how good the approximation is for given n.' 1'1 > €] < .10) is . .9) when . X n )) (in an appropriate sense).6) to fall. Further qualitative features of these bounds and relations to approximation (5.1 Introduction: The Meaning and Uses of Asymptotics 299 In theory these limits say nothing about any particular Tn (Xl. 1'1 > €] < 2exp {~n€'}.1.4. £ = ..l1) states that if EFIXd' < 00.01. Thus.1..6) does not tell us how large n has to be for the chance of the approximation ~ot holding to this degree (the lefthand side of (5. x. the right sup PF x [r..8) are given in Problem 5. the celebrated BerryEsseen bound (A.2 . 1') < z] ~ ij>(z) (5. the central limit theorem tells us that if EFIXII < 00. 7). which are available in the classical situations of (5. for n sufficiently large. below .11) .1.1') < x] _ ij>(x) < CEFIXd' v'~ 3 1/' " "n (5.5) That is.2 hand side of (5. by Chebychev's inequality.14.2 is unknown be. ' . We interpret this as saying that.1.10) < 1 with . say.1.. (5.1. if EFXf < 00.
7) says that the behavior of the distribution of X n is for large n governed (approximately) only by j.2 and asymptotic normality via the delta method in Section 5. Bounds for the goodness of approximations have been available for X n and its distribution to a much greater extent than for nonlinear statistics such as the median. even here they are not a very reliable guide.di) where '(0) = 0. . Section 5.F) > N(O 1) .12) where (T(O.300 Asymptotic Approximations Chapter 5 where C is a universal constant known to be < 33/4. F) typically is the standard deviation (SD) of J1iOn or an approximation to this SD.I '1.8) differs from the truth. The arguments apply to vectorvalued estimates of Euclidean parameters. I . for all F in the n n . B ~ O(F). (b) Their validity for the given n and Ttl for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. . Asymptotics has another important function beyond suggesting numerical approximations for specific nand F. although the actual djstribution depends on Pp in a complicated way. which is reasonable. behaves like .5) and quite generally that risk increases with (1 and decreases with n. Yet.11) suggests.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for oneparameter exponential families. We now turn to specifics.1. for any loss function of the form I(F. as we have seen. The estimates B will be consistent. : i .8) is typically much betler than (5. (5.1. It suggests that qualitatively the risk of X n as an estimate of Ji. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17  j • i.(l) The approximation (5.1. • model.11) is again much too consctvative generally. Practically one proceeds as follows: (a) Asymptotic approximations are derived.1.2 deals with consistency of various estimates including maximum likelihood. "(0) > 0. and asymptotically normal. d) = '(II' . asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. The qualitative implications of results such as are very impor~ tant when we consider comparisons between competing procedures. good estimates On of parameters O(F) will behave like  Xn does in relation to Ji.e(F)]) (1(e. (5. i . . • c F (y'n[Bn .1.1.' (0)( (1 / y'n)( v'27i') (Problem 5. Section 5. 1 i .t and 0"2 in a precise way. Consistency will be pursued in Section 5. As we mentioned. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo. (5. consistency is proved for the estimates of canonical parameters in exponential families.1. Although giving us some idea of how much (5.3. In particular. If they are simple.' I I • If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good. As we shall see. quite generally. For instance.
and other statistical quantities that are not realistically computable in closed form. A stronger requirement is (5. distributions. The stronger statement (5.d. .1) where I .lS.1. for all (5. (See Problem 5. in accordance with (A.7. by the WLLN.2 Consistency 301 in exponential families among other results.i.2.. We will recall relevant definitions from that appendix as we need them.i. We also introduce Monte Carlo methods and discuss the interaction of asymptotics.Xn from Po where 0 E and want to estimate a real or vector q(O). is a consistent estimate of p(P). 1denotes Euclidean distance.2.2. The least we can ask of our estimate Qn(X I. and probability bounds.) forsuP8 P8 lI'in . A. 'in 1] q(8) for all 8.Section 5. but we shall use results we need from A. Summary. by quantities that can be so computed.2 5.) However. (5.5 we examine the asymptotic behavior of Bayes procedures.q(8)1 > 'I that yield (5. where P is the empirical distribution.14. e Example 5. if. Finally in Section 5. 00. with all the caveats of Section 5. .1). Means. . asymptotics are methods of approximating risks. That is.1). 5.d. Most aSymptotic theory we consider leads to approximations that in the i.. which is called consistency of qn and can be thought of as O'th order asymptotics. Section 5.14. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to 00. In practice.7 without further discussion. for ..2. If Xl.1.14. 1 X n are i. > O.2) Bounds b(n. and B.2) are preferable and we shall indicate some of qualitative interest when we can. case become increasingly valid as the sample size increases. But.2) is called unifonn consistency. ' . For P this large it is not unifonnly consistent.2..7. and 8.2.2.1) and (B. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A. remains central to all asymptotic theory.2. The simplest example of consistency is that of the mean. P where P is unknown but EplX11 < 00 then.. . . . If is replaced by a smaller set K.4 deals with optimality results for likelihoodbased procedures in onedimensional parameter models. AlS. we talk of uniform cornistency over K.Xn ) is that as e n 8 E ~ e.1 CONSISTENCY PlugIn Estimates and MlEs in Exponential Family Models Suppose that we have a sample Xl. X ~ p(P) ~ ~  p = E(XJ) and p(P) = X.2. Monte Carlo.
p'l < 6}..6) = sup{lq(p)  q(p')I: Ip .4) I i (5. Binomial Variance. ..3) Evidently. Theorem 5. for all PEP.. which is ~ q(ji). 0 < p < 1.2. w( q. By the weak law of large numbers for all p. in this case. where Pi = PIXI = xi]' 1 < j < k.2. . there exists 6 «) > 0 such that p.pi > 6«)] But. xd is the range of Xl' Let N i = L~ 11(Xi = Xj) and Pj _ Njln. = Proof. Pp [IPn  pi > 6] ~ .2." is the following: I X n are LLd.p) distribution.6)! Oas6! O. P = {P : EpX'f < M < oo}. . p' E S. consider the plugin estimate p(Iji)ln of the variance of p.3) that > <} (5. If q is continuous w(q. and p = X = N In is a uniformly consistent estimate of p. · . then by Chebyshev's inequality. Because q is continuous and S is compact.2. Letw.·) is increasing in 0 and has the range [a. w(q. Ip' pi < 6«).1 < j < k.2. Then qn q(Pn) is a unifonnly consistent estimate of q(p). Suppose thnt P = S = {(Pl..b] < R+bedefinedastheinver~eofw.LJ~IPJ = I}. Then N = LXi has a B(n.2.5) I A simple and important result for the case in which X!. for every < > 0.2.1. w. Other moments of Xl can be consistently estimated in the same way.14.. sup{ Pp[liln pi > 61 : pES} < kl4n6 2 (Problem 5.I l 302 instance. by A. Evidently.. 6 >0 O. the kdimensional simplex.l : [a. Suppose that q : S + RP is continuous. . with Xi E X 1 . . and (Xl.l «) = inf{6 : w(q. l iA) E S be the empirical distribution.2. 6) It easily follows (Problem 5.b) say.ql > <] < Pp[IPn . But further. Asymptotic Approximations Chapter 5 X is uniformly consistent over P because o Example 5. Suppose the modulus ofcontinuity of q.• X n be the indicators of binomial trials with P[X I = I] = p. 0 In fact. Pn = (iiI.Pk) : 0 < Pi < 1. we can go further. Thus. i . q(P) is consistent..6.1) and the result follows. (5. 0) is defined by . . implies Iq(p')q(p) I < <Then Pp[li1n . w(q. Let X 1. where q(p) = p(1p)... 0 ~ To some extent the plugin method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plugin estimates. it is uniformly continuous on S.
variances. 11. and correlation coefficient are all consistent.2. Variances and Correlations.JL2.UV) so that E~ I g(Ui . Var(V. 1 < i < n be i. Var(U. Questions of uniform consistency and consistency when P = { DistribuCorr(Uj.v} = (u.Section 5.2.2. 1 < j < d.p). )1 < oo}.. (i) Plj [The MLE Ti exists] ~ 1.ar.2.6) if Ep[g(X.p). 1 < j < d.1. Theorem 5. EVj2 < 00. More generally if v(P) = h(E p g(X . is a consistent estimate of q(O). Vi). X n are a sample from PrJ E P. (ii) ij is consistent. Let g (91.7.lpl < 1. .2. af > o.3. let mj(O) '" E09j(X.1 .2.1 that the empirical means.then which is well defined and continuous at all points of the range of m. .2. Jl2.6.) I < that h(D n) Foreonsisteney of h(g) apply Proposition B. then Ifh=m. ai. N2(JLI.1 and Theorem 2. Then.) > 0. conclude by Proposition 5. h(D) for all continuous h.1: D n 00. where h: Y ~ RP. 0 Here is a general consequence of Proposition 5. . Then.a~. if h is continuous. ag.1. Suppose P is a canonical exponentialfamily of rank d generated by T.U 2. If we let 8 = (JLI.2 Consistency 303 Suppose Proposition 5.i.. is consistent for v(P). o Example 5. We may.d. Let TI. then vIP) h(g). Suppose [ is open. !:. .V2. [ and A(·) correspond to P as in Section 1. We need only apply the general weak law of large numbers (for vectors) to conclude that (5.) > 0. 1 are discussed in Problem 5..3.. ' .V. U implies !'. thus. Let g(u.)..2. and let q(O) = h(m(O)). = Proof.4. 'Vi) is the statistic generating this 5parameter exponential family.)1 < 1 } tions such that EUr < 00. )) and P = {P : Eplg(X .9d) map X onto Y C R d Eolgj(X')1 < 00. If Xl.Jor all 0. where P is the empirical distribution. Let Xi = (Ui .
. where to = A(1)o) = E1)o T(X 1 ).1I)11: II E 6} ~'O (5. But Tj.E1). 5. We showed in Theorem 2. 1 i . II) = where. • .2. .2.lI o): 111110 1 > e} > D(1I0 . .. D . A more general argument is given in the following simple theorem whose conditions are hard to check...j.1 that On CT the map 17 A(1]) is II and continuous on E.d.  inf{D(II. II) L p(X n. I I Proof Recall from Corollary 2..7) I . (T(Xl)) must hy Theorem 2. {t: It . . I . 0 Hence..= 1 .2. the inverse AI: A(e) ~ e is continuous on 8.3.3. the MLE. is a continuous function of a mean of Li.3. see.Xn ) exists iff ~ L~ 1 T(X i ) = Tn belongs to the interior CT of the convex support of the distribution of Tn.2. . . By definition of the interior of the convex support there exists a ball 8. and the result follows from Proposition 5.. 7 I .2 Consistency of Minimum Contrast Estimates .3. j 1 Pn(X. X n be Li.. Suppose 1 n PII sup{l. for example. Let 0 be a minimum contrast estimate that minimizes I .1I 0 ) foreverye I .. exists iff the event in (5. The argument of the the previous subsection in which a minimum contrast estimate. T(Xdl < 6} C CT' By the law of large numbers.2. I J Theorem 5. .1 that i)(Xj.3.8) I > O. is solved by 110.D(1I0 .2. E1). '1 P1)'[n LT(Xi ) E C T) ~ 1. i=l 1 n (5. Rudin (1987). .. Let Xl.1 belong to the interior of the convex support hecause the equation A(1)) = to.304 Asymptotic Approximations Chapter 5 . (5..2. i i • • ! . = 1 n p. . T(Xl). n . n LT(Xi ) i=I 21 En. if 1)0 is true.p(X" II) is uniquely minimized at 11 i=l for all 110 E 6.7) occurs and (i) follows. Po.L[p(Xi . I ! . BEe c Rd. D( 110 .lI) . as usual. which solves 1 n A(1)) = . II) 0 j =EII.9) n and Then (} is consistent.LT(Xi ).1. Note that. . i I . vectors evidently used exponential family properties.1 to Theorem 2. z=l 1 n i..d. By a classical result.
14) (ii) For some compact K PIJ.[IJ ¥ OJ] = PIJ.2. A simple and important special case is given by the following. [I ~ L:~ 1 (p(Xi .1 we need only check that (5.L(p(Xi.10) tends to O. ifIJ is the MLE.10) By hypothesis. IJ) .9) hold for p(x.5.11) implies that 1 n ~ 0 (5. E n ~ [p(Xi . (5. .IJ)1 < 00 and the parameterization is identifiable.2.IJ o)  D(IJo.lJ o): IIJ IJol > oj.IJd ).11) implies that the righthand side of (5.2. But because e is finite.8) follows from the WLLN and PIJ.IJO)) ' IIJ IJol > <} n i=l .IIIJ  IJol > c] (5. Ie ¥ OJ] ~ Ojoroll j.IJol > <} < 0] (5.. (5.p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1.1. is finite.8) and (5. Coronary 5.2.8).IJ) p(Xi.2. PIJ.2. then.12) which has probability tending to 0 by (5.8) can often failsee Problem 5.2.11) sup{l.2.Section 5. {IJ" .2.D(IJo.2 Consistency 305 proof Note that.. o Then (5. IJo)) .inf{D(IJo. IJ) = logp(x. [inf c e. IJ). for all 0 > 0.D( IJ o. 0) .2. and (5.9) follows from Shannon's lemma."'[p(X i .D(IJo. IJ j )) : 1 <j < d} > <] < d maxi PIJ. IJ) .L[p(Xi . EIJollogp(XI. _ I) 1 n PIJ o [16 .2. IJj))1 > <I : 1 < j < d} ~ O. IJ) .IJor > <} < 0] because the event in (5.2.2. IIJ .IJol > E] < PIJ [inf{ . 0 Condition (5. IJ)I : IJ K } PIJ !l o. An alternative condition that is readily seen to work more widely is the replacement of (5.D(IJo.13) By Shannon's Lemma 2.2. IJ j ) . IJj ) . IIJ .[max{[ ~ L:~ 1 (p(Xi .D(IJ o.8) by (i) For ail compact K sup In { c e. IJ)II ' IJ n i=l E e} > .2. . IJ) n ~ i=l p(X"IJ o)] .2. But for ( > 0 let o ~ ~ inf{D(IJ. PIJ.2. 2 0 (5. 1 n PIJo[inf{. /fe e =   Proof Note that for some < > 0.2.
A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems. sequence of estimates) consistency.3.AND HIGHERORDER ASYMPTOTlCS: THE DelTA METHOD WITH APPLICATIONS We have argued. we require that On ~ B(P) as n ~ 00. VariX.33.3.3 FIRST. We denote the jth derivative of h by 00 . Summary.. D . > 2. We introduce the minimal property we require of any estimate (strictly speak ing. We conclude by studying consistency of the MLE and more generally Me estimates in the case e finite and e Euclidean. . 5. Sufficient conditions are explored in the problems. J..3. in Section 5. that sup{PIiBn . We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families. X valued and for the moment take X = R. m Mi) and assume (b) IIh(~)lloo j 1 .2. When the observations are independent but not identically distributed. and assume (i) (a) h is m times differentiable on R. =suPx 1k<~)(x)1 <M < .3. Unfortunately checking conditions such as (5. ..) = <7 2 Eh(X) We have the following. see Problem 5.1 that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk. ~l Theorem 5. I ~. consistency of the MLE may fail if the number of parameters tends to infinity. If fin is an estimate of B(P).1. ! ! (ii) EIXd~ < 00 Let E(X1 ) = Il. We then sketch the extension to functions of vector means. let Ilglioo = sup{ Ig( t) I : t E R) denote the sup nonn. . Unifonn consistency for l' requires more. As usual let Xl. Let h : R ~ R.Xn be i.2.B(P)I > Ej : PEP} t 0 for all € > O. ~ 5.1 The Delta Method for Moments • • We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders.306 Asymptotic Approximations Chapter 5 We shall see examples in which this modification works in the problems. + Rm (5.14) is in general difficult.i.1) where • .Il E(X Il). . If (i) and (ii) hold.8) and (5. then = h(ll) + L (j)( ) j=l h .d.
. :i1 + .3.. t r 1 and [t] denotes the greatest integer n. rl l:: . + i r = j. bounded by (d) < t.) ~ 0. tI. then (a) But E(Xi .1'1. .3) and j odd is given in Problem 5.. .. (b) a unless each integer that appears among {i I .5.t" = t1 .. .3) (5. .3. EIX I'li = E(X _1')' Proof. If EjX Ilj < 00. so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r. . for j < n/2... . .3.1. (n . . j > 2.3. Xij)1 = Elxdj il. 'Zr J . Lemma 5.1) . .3 First. tr i/o>2 all k where tl .1 < k < r}}.3..3.'1 +... .3.[jj2] + 1) where Cj = 1 <r.ij } • sup IE(Xi .3. Let I' = E(X. 21..2) where IX' that 1'1 < IX . then there are constants C j > 0 and D j > 0 such (5.3.4) for all j and (5. . The expression in (c) is.i j appears at by Problem 5.4) Note that for j eveo. ik > 2... The more difficult argument needed for (5. +'r=j J .and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion... .Section 5.5[jJ2J max {l:: { . ...'" .3) for j even. X ij ) least twice. _ . Moreover. and the following lemma. In!. (5.2. We give the proof of (5. C [~]! n(n ..
5) can be replaced by Proof For (5.J.l) 6 + 2h(J. I 1 .6) n . 0 The two most important corollaries of Theorem 5.l)h(J. using Corollary 5.2 ).3.l)}E(X .1')3 = G(n 2) by (5. • .l)E(X . (d).l) + [h(llj'(1')}':. n (b) Next.3.1'11.3.l)h(J. I I I Corollary 5..3. i • r' Var h(X) = 02[h(11(J. apply Theorem 5.ll j < 2j EIXd j and the lemma follows.3. and (e) applied to (a) imply (5.4).3. . then G(n.5) (b) if E(Xt) < G(n.6) can be 1 Eh 2 (X) = h'(J.3f2 ) in (5.3.3.1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J. . Corollary 5. (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00. < 3 and EIXd 3 < 00. I . give approximations to the bias of h(X) as an estimate of h(J. 1 < j < 3.3.3. 1 iln .3.3.1 with m = 3.3f2 ). Because E(X .l) and its variance and MSE.l)h(1)(J..l) + [h(1)]2(J. (n li/2] + 1) < nIJf'jj and (c).3.l) + {h(21(J.1. I . By Problem 5.1') + {h(2)(J. respectively. .308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) . In general by considering Xi ..5) follows.3.1 with m = 4.1.3. Then Rm = G(n 2) and also E(X .6.1')2 = 0 2 In.3) for j even. 00 and 11hC 41 11= < then G(n.1.. If the conditions of (b) hold. EIX1 . then h(')( ) 2 0 = h(J.l) + 00 2n I' + G(n3f2).3f') in (5. if 1 1 <j (a) Ilh(J)II= < 00.JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 . (5.3.2. (5.4) for j odd and (5.. Proof (a) Write 00.5) apply Theorem 5. then I • I. I (b) Ifllh(j11l= < 00. and EXt < replaced by G(n'). 0 1 I ..llJ' + G(n3f2) (5. if I' = O.+ O(n.
Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5.5).6).h(/1) is G(n. 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5.Section 5.1) = 11..Z) (5.3. then we may be interested in the warranty failure probability (5.3.Z. as the plugin estimate of the parameter h(Jl) then.1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours.Z ".2.exp( 2It).3. We will compare E(h(X)) and Var h(X) with their approximations.3(1_ ')In + G(n.3.3. Example 5. when hit) = t(1 . T!Jus.3.3 First.3.t) and X.(h(X)) E.9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n.3. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX).10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 .d.4 ) exp( 2/t).h(/1)) h(2~(II):: + O(n.3.t.1.7) If h(t) ~ 1 . Note an important qualitative feature revealed by these approximations.3. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +. {h(l) (/1)h(ZI U')/13 (5. To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5. (5. .exp( 2') c(') = h(/1)..i.(h(X) . and.4) E(X .3.l ). the MLE of ' is X . which is neglible compared to the standard deviation of h(X).8) because h(ZI (t) = 4(r· 3 .1 with bounds on the remainders. If heX) is viewed. where /1. by Corollary 5.2 (Problem (5. as we nonnally would. Bias and Variance of the MLE of the Binomial VanOance.3.3. (1Z 5. for large n.3.  . by Corollary 5. ..and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5. Thus. the bias of h(X) defined by Eh(X) . Bias. then heX) is the MLE of 1 ./1)3 = ~~.i.1. If X [. X n are i..= E"X I ~ 11 A.11) Example 5.2 ) 2e. A qualitatively simple explanation of this important phenonemon will be given in Theorem 5. which is G(n.3.3.l / Z) unless h(l)(/1) = O.
E(X') ~ p .3. 1 < j < { Xl .p).' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2. Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi).p) {(I _ 2p)2 n Thus.[Var(X) + (E(X»2] I nI ~ p(1 ..p)(I. • .10) yields .p)] n I " . (5.j .p)J n p(1 ~ p) [I ..3.p) .p(1 .. . a Xd d} .p)(I. and will illustrate how accurate (5. asSume that h has continuous partial derivatives of order up to m. R'n p(1 ~ p) [(I . and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i. D 1 i I .9d(Xi )f.e x ) : i l + . • ~ O(n.3.2p(1 . Next compute Varh(X)=p(I. the error of approximation is 1 1 + .3 ).2. 1 and in this case (5.2p)')} + R~.p)'} + R~ p(l .P) {(1_2 P)2+ 2P(I.2t. .6p(l.5) is exact as it should be. II'.p) n . M2) ~ 2.2p).p) + . . + id = m.2p) n n +21"(1 .10) is in a situation in which the approximation can be checked.p) = p ( l . . < m.P )} (n_l)2 n nl n Because 1'3 = p(l. n n Because MII(t) = I .' . ~ Varh(X) = (1. .310 Asymptotic Approximations Chapter 5 B(l.amh i. " ! .p(1 .2p)p(l.2p)' .3.!c[2p(1 n p) . I . First calculate Eh(X) = E(X) . Let h : R d t R. Theorem 5.pl. .2p)2p(1 . D " ~ I ..3.{2(1. (5.. 0 < i.5) yields E(h(X))  ~ p(1 .p) ..2(1 .
) )' Var(Y. h : R ~ R.).5).13) Moreover.) + a. Similarly.13).3 / 2 ) in (5.and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00.(J1.3.) + ~ {~~:~ (J1.~E(X. (7'(h)) where and (7' = VariX.2..) Var(Y. as for the case d = 1.8.12) can be replaced by O(n.~ (J1. Y 12 ) 3/2). 1. We get. and (5.).3. Theorem 5. (5.Section 5.3.3. (J1.3 First.) Cov(Y". and the appropriate generalization of Lemma 5.2 The Delta Method for In law Approximations = As usual we begin with d !. Y12 ) (5.3. The proof is outlined in Problem 5. E xl < 00 and h is differentiable at (5. by (5.14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d . ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi.2 ).h(!.) Cov(Y".3.1 Then.3. and J. for d = 2. (5.3...'.14) + (g:. The results in the next subsection go much further and "explain" the fonn of the approximations we already have. 5. under appropriate conditions (Problem 5.C( v'n(h(X) . Suppose thot X = R. ..6).))) ~ N(O. then Eh(Y) This is a consequence of Taylor's expansion in d variables.) + 2 g:'. Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00. is to m = 3. then O(n.3.15) Then .3. 1 < j < d where Y iJ I = 9j(Xi ). = EY 1 . EI Yl13 < 00 Eh(Y) h(J1.3.3. 0/ The result follows from the more generally usefullernma.»)'var(Y'2)] +O(n') Approximations (5..1..3).) gx~ (J1. Lemma 5.3.11. (J1. if EIY.x.4.12) Var h(Y) ~ [(:.3.hUll)~.) Var(Y.I' < 00.)} + O(n (5.". + ~ ~:l (J1.3.3. The most interesting application. B.
u)!:. we expect (5. PIIUn €  ul < iii ~ 1 • "'.7.3.. an(Un . for every Ii > 0 (d) j. ( 2 ).u) !:. 'j But (e) implies (f) from (b). V. for every € > 0 there exists a 8 > 0 such that (a) Iv . V and the result follows. The theorem follows from the central limit theorem letting Un X. from (a).3. hence. = I 0 .3.3. Thus. by hypothesis. an = n l / 2. Consider 1 Vn = v'n(X !") !:. Formally we expect that if Vn !:.• ' " and. By definition of the derivative. j . (ii) 9 : R Then > R is d(fferentiable at u with derivative 9(1) (u). u = /}" j \ V .g(ll(U)(v  u)1 < <Iv . ~: . V for some constant u. for every (e) > O. V . ·i Note that (5.ul < Ii '* Ig(v)  g(u) . (5.1. .a 2 ). 0 .15) "explains" Lemma 5. _ F 7 .312 Asymptotic Approximations Chapter 5 (i) an(Un . then EV. j " But.N(O. ~ EVj (although this need not be true. Therefore. I (g) I ~: .32 and B. see Problems 5.3. • • . . (e) Using (e).8). I .17) .N(O.16) Proof.i. . . .ul Note that (i) (b) '* .
3.). and S2 _ n nl ( ~ X) 2) n~(X..N(O. sn!a).Section 5.s Tn =.j odd. EZJ > 0. N n (0 Aal.t2 versus K : 112 > 111.d. A statistic for testing the hypothesis H : 11. we find (Problem 5. "t" Statistics.) and a~ = Var(Y. i=l If:F = {Gaussian distributions}. Let Xl. then S t:. where g(u.3.'" . (1 . VarF(Xd = a 2 < 00.a) for Tn from the Tn . (1  A)a~ . then Tn . ••• . 1'2 = E(Y.3 First.l (1.9.28) that if nI/n _ A.> 0 .'" 1 X n1 and Y1 . Then (5. Yn2 be two independent samples with 1'1 = E(X I ).2..3 we saw that the two sample t statistic Sn vn1n2 (Y X) ' n = = n s nl + n2 has a Tn2 distribution under H when the X's and Y's are normal withar = a~. 1). £ (5.l distribution. 1 2 a P by Theorem 5.17) yields O(n J.O < A < 1. (a) The OneSample Case.3. FE :F where EF(X I ) = '".1 (1 . In Example 4.1 (Ia) + Zla butthat the t n.18) because Tn = Un!(sn!a) = g(Un . Slutsky's theorem.i. else EZJ ~ = O. we can obtain the critical value t n.). Consider testing H: /11 = j. j even = o(nJ/').a) critical value (or Zla) is approximately correct if H is true and F is not Gaussian. and the foregoing arguments.3. .. al = Var(X.18) In particular this implies not only that t n. 1). In general we claim that if F E F and H is true. (b) The TwoSample Case. v) = u!v. Example 5. But if j is even.3.X) n 2 .2 and Slutsky's theorem.+A)al + Aa~) .TiS X where S 2 = 1 nl L (X. Now Slutsky's theorem yields (5. For the proof note that by the central limit theorem.3. Let Xl./2 ). Using the central limit theorem.X" be i.= 0 versuS K : 11.and HigherOrder Asymptotics: The Delta Method with Applications 313 where Z ~ N(O.
or 50. " . The X~ distribution is extremely skew. . X~. as indicated in the plot.. i I . 316.3..5 . Other distributions should also be tried.2 shows that when t n _2(10:) critical value is a very good approximation even for small n and for X. then the critical value t..02 . ur ....95) approximation is only good for n > 10 2 . and in this case the t n . The simulations are repeated for different sample sizes and the observed significance levels are plotted. Figure 5. 0) for Sn is Monte Carlo Simulation As mentioned in Section 5. and the true distribution F is X~ with d > 10.3.05.5 1 1. approximations based on asymptotic results should be checked by Monte Carlo simulations..1..1 (0..1 shows that for the onesample t test. when 0: = 0. oL.. the For the twosample t tests. . Chisquare data I 0. Figure 5. . 10.1.314 Asymptotic Approximations Chapter 5 It follows that if 111 = 11. each lime computing the value of the t statistics and then giving the proportion of times out of M that the t statistics exceed the critical values from the t table.3.'.2 or of a~. We illustrate such simulations for the preceding t tests by generating data from the X~ distribution M times independently. . i 2(1 approximately correct if H is true and the X's and Y's are not normal. Here we use the XJ distribution because for small to moderate d it is quite different from the normal distribution. the asymptotic result gives a good approximation when n > 10 1. 20.. One sample: 10000 Simulations.. .5 3 Log10 sample size j i Figure 5.5 2 2. where d is either 2..000 onesample t tests using X~ data. 32..I = u~ and nl = n2..5 .c:'::c~:____:_J 0. Each plotted point represents the results of 10.. Y ..
(})) do not have approximate level 0.h(JL)] ". 1 (ri . To test the hypothesis H : h(JL) = ho versus K : h(JL) > ho the natural test statistic is T. even when the X's and Y's have different X~ distributions. when both nl I=.n2 and at I=. 2:7 ar Two sample.' 0.12 0.a~. Y . In this case Monte Carlo studies have shown that the test in Section 4. }.4 based on Welch's approximation works well. For each simulation the two samples are the same size (the size indicated on the xaxis). D Next. Moreover. Each plotted point represents the results of 10. By Theoretn 5.000 twosample t tests. as we see from the limiting law of Sn and Figure 5. 10. in the onesample situation.Xd.2 (1 .02 d'::'. N(O.3..hoi n s[h(l)(X)1 .a~ show that as long as nl = n2.ll)(!. in this case. let h(X) be an estimate of h(J._' 0. 10000 Simulations. scaled to have the same means.9. and the data are X~ where d is one of 2. and a~ = 12af..and HigherOrder Asymptotics: The Delta Method with Applications 315 1 This is because. then the twosample t tests with critical region 1 {S'n > t n . or 50.2.5 3 log10 sample size Figure 5. Equal Variances 0. Other Monte Carlo runs (not shown) with I=.0' H<I o 0.3. _ y'ii:[h(X) .L) where h is con tinuously differentiable at !'.X = ~.n2 and at = a'). ChiSquare Dala. y'ii:[h(X) .Section 5_3 First.) i O.o'[h(l)(JLJf).Xi have a symmetric distribution. and Yi .3. However. the t n 2(1 . the t n _2(O.3.cCc":'::~___:.95) approximation is good for nl > 100. af = a~.5 1 1.0:) approximation is good when nl I=.3.5 2 2.
Variance Stabilizing Transfonnations Example 5.I _ __ 0 I +__ I :::.3. Combining Theorem 5. '! 1 . and beta. 0. " 0. From (5.316 Asymptotic Approximations Chapter 5 Two Sample. +2 + 25 J O'. then the sample mean X will be approximately normally distributed with variance 0"2 In depending on the parameters indexing the family considered.5 1 1.12 .3. Tn ..£.3..oj:::  6K . as indicated in the plot. if H is true .5 Log10 (smaller sample size) l Figure 5.6. o.4. we see that here. The xaxis denotes the size of the smaller of the two samples.3. and 9.3 and Slutsky's theorem.. The data in the first sample are N(O. .02""~".c:~c:_~___:J 0. I: . which are indexed by one or more parameters.3. Poisson. Unequal Variances.J~ f).". N(O. In Appendices A and B we encounter several important families of distributions.. too.~'ri i _ .10000 Simulations: Gaussian Data.3.' OIlS .0. called variance stabilizing. gamma. 1) so that ZI_Q is the asymptotic critical value. Each plotted point represents the results of IO. £l __ __ 0. ..6) and . For each simulation the two samples differ in size: The second sample is two times the size of the first..{x)() twosample t tests. 1) and in the second they are N(Ola 2) where a 2 takes on the values 1. such that Var heX) is approximately independent of the parameters indexing the family we are considering. It turns out to be useful to know transformations h. If we take a sample from a member of one of these families. such as the binomial. 2nd sample 2x bigger 0. +___ 9... We have seen that smooth transformations heX) are also approximately normally distributed.
6) we find Var(X)' ~ 1/4n and Vi'((X) . sample. Edgeworth Approximations The normal approximation to the distribution of X utilizes only the first two moments of X. Thus..13) we see that a first approximation to the variance of h( X) is a' [h(l) (/.(A)') has approximately aN(O.. 1976. Some further examples of variance stabilizing transformations are given in the problems.\ + d..Xn are an i. h(t) = Ii is a variance stabilizing transformation of X for the Poisson family of distributions. See Problems 5. I X n is a sample from a P(A) family. yX± r5 2z(1 .and HigherOrder Asymptotics: The Delta Method with Applications 317 (5.. Thus. Substituting in (5. If we require that h is increasing.1 0 ) Vi' ' is an approximate 1. in the preceding P( A) case. Suppose further that Then again.3 First.16. a variance stabilizing transformation h is such that Vi'(h('Y) ..Section 5. To have Varh(X) approximately constant in A.Ct confidence interval for J>.. .)C.5. Suppose.15 and 5. p.. which has as its solution h(A) = 2.6. this leads to h(l)(A) = VC/J>. . . Major examples are the generalized linear models of Section 6. is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. c) (5. Under general conditions (Bhattacharya and Rao. Also closely related but different are socaBed normalizing transformations. which varies freely.3.3. X n) is an estimate of a real parameter! indexing a family of distributions from which Xl. As an example. suppose that Xl.19) for all '/.. by their definition. . In this case a'2 = A and Var(X) = A/n. finding a variance stabilizing transfonnation is equivalent to finding a function h such that for all Jl and (J appropriate to our family. See Example 5. The notion of such transformations can be extended to the following situation. 1n (X I.3. h must satisfy the differential equation [h(ll(A)j2A = C > 0 for some arbitrary c > O.)] 2/ n . .' .. where d is arbitrary. Thus.3. In this case (5.3. A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models.3. 0 One application of variance stabilizing transformations.d. 538) one can improve on .3. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II.i.hb)) + N(o. A > 0.19) is an ordinary differential equation. Such a function can usually be found if (J depends only on fJ. 1/4) distribution.
3.9500 'll ." • [3 n .0105 0.61 0.ססoo 0.72 .9990 0.35 0. i = 1.0481 1.ססOO I . 1) distribution.1.0100 0.15 0. (5. " I. Edgeworth(2) and nonna! approximations EA and NA to the X~o distribution.85 0 0. "J 1 '~ • 'YIn = E(Vn? (2n)1 E(V . According to Theorem 8.9506 0.2.0001 0 0.38 0.8000 0.0050 0.. .5421 0.6000 0.4999 0.04 1.. Example 5.<p(x) ' .1254 0. Table 5.0284 1.1 gives this approximation together with the exact distribution and the nonnal approximation when n = 10.1.0877 0.9984 0.3. 4 0.I .77 0.1051 0. Fn(x) ~ <!>(x) .40 0..9097 o.3. • " .0553 0.0397 0.9996 1.. 0 x Exacl 2.8008 0.4000 0.0208 ~1.9999 1.2706 0.51 1.!.6999 0.7000 0.79 0. Edgeworth Approximations to the X 2 Distribution.3. .0287 0. P(Tn < x). xro I • I I .95 :' 1.1) + 2 (x 3 . Suppose V rv X~.9997 4.20) is called the Edgeworth expansion for Fn . Exact to' EA NA {. 0. ii.4415 0.1964 1.3000 0.1000 0. I' 0.0032 ·1. It follows from the central limit theorem that Tn = (2::7 I Xi n)/V2ri = (V . ~ 2. H 3 (x) = x3  3x.4000 0.6548 4.9684 0.4 to compute Xl.66 0.7792 5. TABLE 5. V has the same distribution as E~ 1 where the Xi are independent and Xi N(O.0250 0.5999 0.9876 0.' •• . We can use Problem B.0500 ! ! . .JL) / a and let lIn and 1'211 denote the coefficient of skewness and kurtosis of Tn.2024 EA NA x 0.38 0.2000 0.0005 0 0.IOx 3 + 15x)] I I y'n(X n 9n ! I i !.ססoo 1.86 0.0655 O./2 2 .9750 0.318 Asymptotic Approximations Chapter 5 the normal approximation by utilizing the third and fourth moments.9724 0.5000 0.5.9995 0.15 0. where Tn is a standardized random variable.75 0.34 2.n)' _ 3 (2n)2 = 12 n 1 I .0010 ~: EA NA x Exact .n)1 V2ri has approximately aN(O.II 0.91 0.0254 0. Let F'1 denote the distribution of Tn = vn( X .9900 0. Hs(x) = xS  IOx 3 + 15x.3006 0.40 0.n..9029 0.3x) + _(xS .9943 0.3.9999 1.21) The expansion (5.95 3.ססoo 0. Then under some conditionsY) where Tn tends to zero at a rate faster than lin and H 2 • H 3 • and H s are Hermite polynomials defined by H 2 (x) ~ x 2  I. Therefore. To improve on this approximation.3513 0. 0.3.9950 0. 1). we need only compute lIn and 1'2n..9905 0.
2.3. then by the 2 vn(U . . where g(UI > U2.j.5. Example 5. 4f. UillL2U~) = (2p.9) that vn(C ..3 and (B.1. vn(i7f .0. >'2.. = Var(Xkyj) and Ak.3. E).= (Xi .and HigherOrder Asymptotics: The Delta Method with Applications 319 The Multivariate Case Lemma 5.2.2 extends to the dvariate case. Let (X" Y.p). we can show (Problem 5.2.6) that vn(r 2 N(O.). Yn ) be i. Lemma 5. 0< Ey4 < 00.2.l = Cov(XkyJ.0 . _p2. Y) where 0 < EX 4 < 00.0. = ai = 1.1.5 that in the bivariate normal case the sample correlation coefficient r is the MLE of the population correlation coefficient p and that the likelihood ratio test of H : p = is based on Irl._p2). Y)/(T~ai where = Var X.0. Then Proof. as (X. .22) 2 g(l)(u) = (2U. .I. It follows from Lemma 5. we can . aiD.0 >'1.u) ~~ V dx1 forsome d xl vector ofconstants u.2 + p'A2.2.1.3.o}. Because of the location and scale invariance of p and r.d.1) jointly have the same asymptotic distribution as vn(U n .J11)/a.ij~ where ar Recall from Section 4.n lEX.n1EY/) ° ~ ai ~ and u = (p.3.u) ~ N(O. and Yj = (Y .2./U2U3.3. and let r 2 = C2 /(j3<. Ui/U~U3.i.3.2 >'2. . Suppose {Un} are ddimensional random vectors and that for some sequence of constants {an} with an !' (Xl as n Jo (Xl..1) and vn(i7~ .a~) : n 3 !' R. (i) un(U n .J 2 2 + 4 2 4 2 Tn P 120 + P T 02 +2{ _2 p3Al.3 First.0.0 2 (5.2 >'1. U3) = Ui/U2U3..2rJ >".0 >'1.2. We can write r 2 = g(C. (ii) g. Let p2 = Cov 2(X.9.2 T20 .6.m. 1).u) where Un = (n1EXiYi. a~ = Var Y. Using the central limit and Slutsky's theorems.3.3. xmyl). Let central limit theorem T'f.1. R d !' RP has a differential g~~d(U) at u.0. >'1.1.0 TO . P = E(XY). The proof follows from the arguments of the proof of Lemma 5. with  r?) is asymptotically nonnal.ar. E Next we compute = T11 .J12) / a2 to conclude that without use the transformations Xi j loss of generality we may assume J11 = 1k2 = 0. 1. (X n .Section 5.0.
320 When (X.1) is achieved hy choosing h(P)=!log(1+ P ).3. (5.3.24) I Proof.~a)/vn3} where tanh is the hyperbolic tangent.3. . we see (Problem 5. (5.h= (h i . N(O.Y) ~ N(/li.mil £ ~ hey) = h(rn) and (b) so that . Asymptotic Approximations Chapter 5 = 4p2(1_ p2)'.23) • .1).rn) + o(ly . 1938) that £(vn . EY 1 = rn.8. it gives approximations to the power of these tests. j f· ~.5 (a) ! . has been studied extensively and it has been shown (e. .. o Here is an extension of Theorem 5.p).g. that is.UI.3[h(c) . . 2 1. and it provides the approximate 100(1 .hp )andhhasatotaldifferentialh(1)(rn) f • . . .9) Refening to (5. per < c) '" <I>(vn .u 2.3. (1 _ p2)2). N(O. ! f This expression provides approximations to the critical value of tests of H : p = O.3(h(r) .3. ': I = Ilgki(rn)11 pxd.a)% confidence interval of fixed length. .10) that in the bivariate nonnal case a variance stahilizing transformation h(r) with .1'2. Argue as before using B. Then ~.fii(Y . . .l) distribution.fii[h(r) . .p) !:. Var Y i = E and h : 0 ~ RP where 0 is an open subset ofRd. then u5 . which is called Fisher's z.rn) N(o.h(p)]).h(p)J !:. c E (1. and (Prohlem 5.19).4.3.3. Suppose Y 1.fii(r .. . Theorem 5. p=tanh{h(r)±z (1.P The approximation based on this transfonnation. E) = .: ! + h(i)(rn)(y ..3. David. .h(p)) is closely approximated by the NCO. . I Y n are independent identically distributed d vectors with ElY Ii 2 < 00.
' = Var(X. D Example 5. is given as the :FIO .1 in which the density of Vlk.37 for the :F5 .25) has an :Fk. we first note that (11m) Xl is the average ofm independent xi random variables. 00. i=k+1 Using the (b) part of Slutsky's theorem. We do not require the Xi to be normal. only that they be i.k. if . See also Figure B.26) l..3. then P[T5 ..l) distribution. gives the quantiles of the distribution of V/k.~1.m distribution can be approximated by the distribution of Vlk. we conclude that for fixed k. where V .m distribution. when k = 10. the mean ofaxi variable is E(Z2). if k = 5 and m = 60. which is labeled m = 00. Then.k+m L::l Xl 2 (5.0 distribution and 2.1.05 and the respective 0.1. This row. where Z ~ N(O.3.21 for the distribution of Vlk. > 0 is left to the problems.3.Xn is a sample from a N(O. . with EX l = 0.m' To show this. 1 Xl has a x% distribution. as m t 00.x%.). xi and Normal Approximation to the Distribution of F Statistics. the:F statistic T _ k. L::+. when the number of degrees of freedom in the denominator is large.i=k+l Xi ". Thus.15. the :Fk.m  (11k) (11m) L.. By Theorem B.7. EX.1.3.3. Suppose that n > 60 so that Table IV cannot be used for the distribution ofTk. check the entries of Table IV against the last row. Suppose that Xl.7) implies that as m t 00.m.05 quantiles are 2.37] = P[(vlk) < 2. By Theorem B. .d. we can use Slutsky's theorem (A.:l ~ m k+m L X..:: .3.0 < 2. 14. We write T k for Tk. I). (5. But E(Z') = Var(Z) = 1.3 First..m. The case m = )'k for some ).and HigherOrder Asymptotics: The Delta Method with Applications 321 (c) jn(h(Y) .211 = 0.i. Then according to Corollary 8.3.= density.9) to find an approximation to the distribution of Tk.Section 5. Suppose for simplicity that k = m and k . we can write.h(m)) = y'nh(l)(m)(Y  m) + op(I). Now the weak law of large numbers (A. > 0 and EXt < 00. Next we turn to the normal approximation to the distribution of Tk. To get an idea of the accuracy of this approximation. For instance. When k is fixed and m (or equivalently n = k + m) is large. where k + m = n..
0 ! j i · 1 where K = Var[(X. .2Var(Yll))' In particular if X. . = E(X. By Theorem 5. i = 1.3.a) '" 1 + is asymptotically incorrect.5.o) k) K (5. Specifically. I)T and h(u.).29) is unknown. I' Theorem 5. i.28) .a). i.m(1.' !k .27) where 1 = (1. .k(tl)] '" 'fI() :'.3. a'). 4).'. a').m 1) can be ap proximated by a N(O. Equivalently Tk = hey) where Y i ~ (li" li. When it can be eSlimated by the method of moments (Problem 5. l (: An interesting and important point (noted by Box. Suppose P is a canonical exponential family of rank d generated by T with [open. the distribution of'. and a~ = Var(X. if it exists and equal to c (some fixed value) otherwise. !1 .3. X n are a sample from PTJ E P and ij is defined as the MLE 1 .1)/12 j 2(m + k) mk Z. 2) distribution. In general. 1'.28) satisfies Zto '" 1 I I:. .4. the F test for equality of variances (Problem 5. v) identity.k(Tk. which by (5.3. where J is the 2 x 2 v'n(Tk 1) ""N(0. ifVar(Xf) of 2a'. _.8(d)). ~ Ck. Thus (Problem 5. 1)T. . ! i l _ .. j m"': k (/k.)T.m = (1+ jK(:: z. the upper h.3.k (t 1 (5. In general (Problem 53.3. ~ N (0. 1 5. when rnin{k.3. h(i)(u.I.7). when Xi ~ N(o.)/ad 2 .. v) ~ ~. as k ~ 00. = (~. v'n(Tk ._o l or ~. P[Tk. E(Y i ) ~ (1. • • 1)/12) .m(1 .3.3...»)T and ~ = Var(Yll)J.m 1) <) :'.m • < tl P[) :'.k(T>. :. (5.). We conclude that xl jer}. 1'.a) .m} t 00.8(c» one has to use the critical value .m(l.m critical value fk. Then if Xl.3 Asymptotic Normality of the Maximum likelihood Estimate in Exponential Families Our final application of the 8method follows.1) "" N(O. k. 1953) is that unlike the t test. :f.8(a» does not have robustness of level. 322 Asymptotic Approximations Chapter 5 where Yi1 = and Yi2 = Xf+i/a2.3.
Now Vii11.O.3. Then T 1 = X and 1 T2 = n. hence.".A(T/)) + oPT/ (n .3. and.32) Hence. = 1/20'. (ii) follows from (5. of Theorems 5.24).l4. where.3. and 1). (I" +0')] .I(T/) (5. by Corollary 1.'l 1 (ii) LT/(Vii(iiT/)).A 1(71))· Proof. (5. iiI = X/O".Nd(O.3 First.4. Thus.38) on the variance matrix of y'n(ij .. are sufficient statistics in the canonical model. For (ii) simply note that.).3. thus. PT/[T E A(E)] ~ 1 and.1.3. by Example 2.71) for any nnbiased estimator ij.N(o.31) . We showed in the proof ofTheorem 5.30) But D A . Example 5.3. the asymptotic variance matrix /1(11) of . in our case.5. therefore. if T ~ I:7 I T(X. 0').33) 71' 711 Here 711 = 1'/0'.2.2. PT/[ii = AI(T)I ~ L Identify h in Theorem 5.3.and HigherOrder Asymptotics: The Delta Method with Applications 323 (i) ii = 71 + ~ I:71 A •• • I (T/)(T(X. Ih our case.3.2.8. as X witb X ~ Nil'. X n be i.2 that.1. The result is a consequence.4 with AI and m with A(T/). = (5.) .11) eqnals the lower bound (3.T.2 where 0" = T.EX. Thus. Note that by B.3. I'.  (TIl' .4..3.ift = A(T/).. T/2 By Theorem 5. = 1/2. o Remark 5.6.3.i. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.d. Let Xl>' .2 and 5. = A by definition and.1.23).In(ij .4.8. (i) follows from (5. . (5. Recall that A(T/) ~ VarT/(T) = 1(71) is the Fisher information.Section 5.3.
7). We focus first on estimation of O. 5.d. Secondorder asyrnptotics provides approximations to the difference between the error and its firstorder approximation. . . Consistency is Othorder asymptotics. variables are explained in tenns of similar stochastic approximations to h(Y) . We begin in Section 5. il' . we can use (5.. Eo) .. I i I .4. Following Fisher (1958)'p) we develop the theory first for the case that X""" X n are i. These "8 method" approximations based on Taylor's fonnula and elementary results about moments of means of Ll. which lead to a result on the asymptotic nonnality of the MLE in multiparameter exponential families. . . Firstorder asymptotics provides approximations to the difference between a quantity tending to a limit and the limit. for instance.. and confidence bounds.s where Eo = diag(a 2 )2a 4 ). ..1 Estimation: The Multinomial Case .1.h(JL) where Y 1. under Li. The moment and in law approximations lead to the definition of variance stabilizing transfonnations for classical onedimensional exponential families. 5.1 by studying approximations to moments and central moments of estimates. . and il' = T.(T. N = (No. and h is smooth. . . We consider onedimensional parametric submodels of S defined by P = {(p(xo. as Y. . Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal.4 to find (Problem 5.324 Asymptotic Approximations Chapter 5 Because X = T. EY = J. i 7 •• . and so on. .L.)'.. is twice differentiable for 0 < j < k. . 0 . Nk) where N j . 8 open C R (e. 0). J J . Thus. Summary.i. and PES. Finally.Pk) where (5. Higherorder approximations to distributions (Edgeworth series) are discussed briefly. . Y n are i. Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean.15). 0). the difference between a consistent estimate and the parameter it estimates. the (k+ I)dimensional simplex (see Example 1.g.4.1. i In this section we define and study asymptotic optimality for estimation.P(Xk' 0)) : 0 E 8}.6.3. These stochastic approximations lead to Gaussian approximations to the laws of important statistics. sampling.d.4 and Problem 2.i.".3. when we are dealing with onedimensional smooth parametric models. .') N(O. .d. < 1.d. see Example 2.26) vn(X /". .3.4 ASYMPTOTIC THEORY IN ONE DIMENSION I: I " ! .: CPo.L~ I l(Xi = Xj) is sufficient. stochastic approximations in the case of vector statistics and parameters are developed.. . I I I· < P..1) i . In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families. taking values {xo.33) and Theorem 5. Xk} only so that P is defined by p . . 0.. I . testing.3. Assume A : 0 ~ pix....
0). 8))T. > rl(O) M a(p(8)) P.4.4. .8) .4. Next suppose we are given a plugin estimator h (r..2) is twice differentiable and g~ (X I) 8) is a well~defined.8) logp(X I .1.8).:) (see (2.11)) of (J where h: S satisfies ~ R h(p(8» = 8 for all 8 E e > (5.3) Furthermore (See'ion 3. 88 (Xl> 8) and =0 (5.4. if A also holds..2(0.Section 5.5) As usual we call 1(8) the Fisher infonnation.4..8)1(X I =Xj) )=00 (5.4. h) is given by (5. h) with eqUality if and only if. Consider Example where p(8) ~ (p(xo. . 8) is similarly bounded and well defined with (5. Under H.7) where . Theorem 5.4.4.1.6) 1. Assume H : h is differentiable.4.1. (0) 80(Xj. Then we have the following theorem.9) . (5. fJl E.11).Jor all 8.2). Moreover. Many such h exist if k 2.4) g. 0 <J (5.4.8) ~ I)ogp(Xj. bounded random variable (5.l (Xl.p(Xk.4.4. < k.2 (8.4 Asymptotic Theory in One Dimension 325 Note that A implies that k [(X I . for instance.. pC') =I 1 m . (5.
but also its asymptotic variance is 0'(8. Taking expectations we get b(8) = O.10).4.4.4.4.8) ) +op(I). for some a(8) i' 0 and some b(8) with prob~bility 1.9). 8). &8(X.8)) = a(8) &8 (X" 8) +b(O) &h Pl (5.8) with equality iff.11) • (&h (p(O)) ) ' p(xj.h)I8) (5. 2: • j=O &l a:(p(8))(I(XI = Xj) .8)&Pj 2: • &h a:(p(0))p(x. 0)] p(Xj. whil. we obtain &l 1 <0 .4. : h • &h &Pj (p(8)) (N . Noting that the covariance of the right' anp lefthand sides is a(8). using the correlation inequality (A.. By (5.II.8) = 1 j=o PJ or equivalently. I (x j.6). ( = 0 '(8.4.4. h k &h &pj(p(8)) (N p(xj. we obtain 2: a:(p(8)) &0(xj. h) = I . (8.p(x). • • "'i. (5.p(xj. by (5.8) gives a'(8)I'(8) = 1.h)Var.')  h(p(8)) } asymptotically normal with mean 0.4. ~ir common variance is a'(8)I(8) = 0' (0. (5. I h 2 (5... not only is vn{h (.2 noting that N) vn (h (. which implies (5.4.12) [t.14) . we s~~ iliat equality in (5. h).8) ) ~ . (5.4.3.13).10) Thus.4.4. h(p(8)) ) ~ vn Note that. kh &h &Pj (p(8))(I(Xi = Xj) .4.8)).12).8) j=O 'PJ Note that by differentiating (5..326 Asymptotic Approximations Chapter 5 Proof. by noting ~(Xj. Apply Theorem 5. using the definition of N j .I n.l6) as in the proof of the information inequality (3.15) i • .16) o .p(xj. 0) = • &h fJp (5..13) I I · ./: .
4. the MLE of B where It is defined implicitly ~ by: h(p) is the value of O.2 Asymptotic Normality of Minimum Contrast and MEstimates o e o We begin with an asymptotic normality theorem for minimum contrast estimates. .(vn(B .3 that the information bound (5. J(~)) with the asymptotic variance achieving the infonnation bound Jl(B). In the next two subsections we consider some more general situations.2. Write P = {P..4. Then Theo(5.19) The binomial (n.. OneParameter Discrete Exponential Families. Note that because n N T = n . 0 Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena. which (i) maximizes L::7~o Nj logp(xj..4.8) is.3.I L::i~1 T(Xi) = " k T(xJ )".1. 5.o(p(X"O)  p(X"lJo )) . . As in Theorem 5. achieved by if = It (r.. then. .Xn are tentatively modeled to be distributed according to Po.i.0 (5.0)) ~N (0.4.3) L. Example 5.A(O)}h(x) where h(x) = l(x E {xo.5 applies to the MLE 0 and ~ e is open.18) and k h(p) = [A]I LT(xj)pj )".::7 We shall see in Section 5.3.d..)."j~O (5.17) L...3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check. is a canonical oneparameter exponential family (supported on {xo. Xl.4. Let p: X x ~ R where e D(O..8) = exp{OT(x) . by ( 2. 0 E e. Suppose p(x. 0).Section 5A Asymptotic Theory in One Dim"e::n::'::io::n ~_ __'3::2:.4. Suppose i. Xk}) and rem 5.4. 0) and (ii) solvesL::7~ONJgi(Xj. : E Ell.Oo) = E. . if it exists and under regularity conditions. . E open C R and corresponding density/frequency functions p('. .xd)...0)=0. p) and HardyWeinberg models can both be put into this framework with canonical parameters such as B = log ( G) in the first case.
21) i J A2: Ep.4. 8. As we saw in Section 2. n i=l _ 1 n Suppose AO: . Under AOA5.pix.' '. { ~ L~ I (~(Xi.. # o. .=1 n ! (5.. i' On where = O(P) +. A4: sup..L . • is uniquely minimized at Bo. ~i  i .4.p(.p(X" On) n = O.20) In what follows we let p. . (5. J is well defined on P. 0 E e. f 1 .328 Asymptotic Approximations Chapter 5 .p2(X.p(Xi. O(P))) . 1 n i=l _ . PEP and O(P) is the unique solution of(5.0(P)) l.2.LP(Xi. (5.21) and. That is.p E p 80 (X" O(P») A3: . ~ (X" 0) has a finite expectation and i ! . parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for..1/ 2) n '. < co for all PEP.23) I.p(x. On is consistent on P = {Pe : 0 E e}. O)ldP(x) < co. . j.O). O(Pe ) = O.O(p)))I: It . Theorem 5. o if En ~ O. rather than Pe. O(P).4. as pointed out later in Remark 5.O(P)) +op(n. ~I AS: On £.~(Xi.p(x. L. denote the distribution of Xi_ This is because. P) = . . 0) is differentiable. Let On be the minimum contrast estimate On ~ argmin .4. .O(P)I < En} £. That is.4. O)dP(x) = 0 (5.3.!:. under regularity conditions the properties developed in this section are valid for P ~ {Pe : 0 E e}.t) . Suppose AI: The parameter O(P) given hy the solution of .4.1.4.1. hence. O(P» / ( Ep ~~ (XI.p(x.22) J i I .p = Then Uis well defined. We need only that O(P) is a parameter as defined in Section 1.
4.1j2 ).4.425)(5.. 1 1jJ(Xi .26) and A3 and the WLLN to conclude that (5.O(P)) / ( El' ~t (X" O(P))) o while E p 1jJ2(X"p) = (T2(1jJ.4. By expanding n.25) where 18~  8(P)1 < IiJn .4.O(P)) = n. Let On = O(P) where P denotes the empirical probability.' " 1jJ(Xi .4.27) Combining (5.L 1jJ(X" 8(P)) = Op(n. we obtain.O(P)) ~ .O(P)) Ep iJO (Xl.p) = E p 1jJ2(Xl .28) .O(P)I· Apply AS and A4 to conclude that (5. (iJ1jJ ) In (On . A2.l   2::7 1 iJ1jJ. Next we show that (5.24) follows from the central limit theorem and Slutsky's theorem.4.29) ..21). (5.4. 8(P)) + op(l) = n ~ 1jJ(Xi .4.4. But by the central limit theorem and AI. en) around 8(P).8(P)) I n n n~ t=1 n~ t=1 e (5. .24) proof Claim (5.22) because .4.27) we get. and A3. where (T2(1jJ. applied to (5..20).20) and (5. using (5.On)(8n .4.' " iJ (Xi .4._ . n.Section 5.4 Asymptotic Theory in One Dimension 329 Hence. O(P)).lj2 and L 1jJ(X" P) + op(l) i=1 n EI'1jJ(X 1.O(P))) p (5. t=1 1 n (5.p) < 00 by AI.22) follows by a Taylor expansion of the equations (5.4.Ti(en .O(P)) 2' (E '!W(X.
Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i.! .1. e) is as usual a density or frequency function. A2.4. 0) Covo (:~(Xt.. If further Xl takes on a finite set of values. it is easy to see that A6 is the same as (3. 1 •• I ! . 0) l' P = {Po: or more generally.O) = O.i. We conclude by stating some sufficient conditions.: 1.. If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x.4. essentially due to Cramer (1946). n I _ .4. 0) = logp(x.2 may hold even if ?jJ is not differentiable provided that . I 'iiIf 1 I • I I .O(P) Ik = n n ?jJ(Xi . and we define h(p) = 6(xj )Pj. This is in fact truesee Problem 5. for Mestimates.4. This extension will be pursued in Volume2. .4. we see that A6 corresponds to (5. Remark 5.4. for A4 and A6.22) follows from the foregoing and (5.?jJ(XI'O)). written as J i J 2::. 0 • .12). O)dp. and that the model P is regular and letl(x.30) suggests that ifP is regular the conclusion of Theorem 5.4. en j• .=0 ?jJ(x.29). • Theorem 5. A4 is found. .O). =0 (5.3.4.8).4. I o Remark 5.'. B) where p('.2 is valid with O(P) as in (I) or (2). 0) = 6 (x) 0. Identity (5. that 1/J = ~ for some p).30) Note that (5. (2) O(P) solves Ep?jJ(XI. Suppose lis differen· liable and assume that ! • &1 Eo &O(X I . and A3 are readily checkable whereas we have given conditions for AS in Section 5.2.20) are called M~estimates. {xo 1 ••• 1 Xk}.28) we tinally obtain On .4.1.2.O)?jJ(X I .4. P but P E a}. Remark 5.4. Our arguments apply to Mestimates.31) for all O..e. O(P)) ) and (5. Our arguments apply even if Xl. O(P)) + op (I k n n ?jJ(Xi . 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5. O)p(x.Eo (XI.30) is formally obtained by differentiating the equation (5.4. Solutions to (5. B» and a suitable replacement for A3. Conditions AO. A6: Suppose P = Po so that O(P) = 0. Z~ (Xl.21). O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I .(x) (5. AI.d.4. X n are i.4.4.4. 0) is replaced by Covo(?jJ(X" 0). An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po. .
Pe) > 1(0) 1 (5. B). 5.4) but A4' and A6' are not necessary (Problem 5. where iL(X) is the dominating measure for P(x) defined in (A. 0') is defined for all x. 0) (5.34) Furthermore. lO.0'1 < J(O)} 00. Theorem 5.4. In this case is the MLE and we obtain an identity of Fisher's.4. AOA6.4.O) = l(x.O) and P ~ ~ Pe.8) is a continuous function of8 for all x.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5. . 0) < A6': ~t (x. s) diL(X )ds < 00 for some J ~ J(O) > 0.p. (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl.4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI.3. then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (. 10' ..4. 0) and . then the MLE On (5.4.4.4.:d in Section 3. < M(Xl.O) 10gp(x.O). 0) obeys AOA6.p(x.Section 5.20) occurs when p(x.eo (Xl. 0') . where EeM(Xl..01 < J( 0) and J:+: JI'Ii:' (x.O) = l(x. That is.35) with equality iff. "dP(x) = p(x)diL(x).p = a(0) g~ for some a 01 o.4.32) where 1(8) is t~e Fisher information introduq. = en en = &21 Ee e02 (Xl. 0) . w.4. 0) g~ (x. We also indicate by example in the problems that some conditions are needed (Problem 5..4.33) so that (5. satisfies If AOA6 apply to p(x." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems. We can now state the basic result on asymptotic normality and e~ciency of the MLE.1).
2. .36) Because Eo 'lj.Xn be i.4. Then X is the MLE of B and it is trivial to calculate [( B) _ 1. 5..'(0) = 0 < l(~)' The phenomenon (5. {X 1.34) follow directly by Theorem 5.. 442.4.1).'(B) = I = 1(~I' B I' 0.37) .. I. 1. 0 1 . (5. Hodges's Example.n(Bn .4. 0 because nIl' . and.3 is not valid without some conditions on the esti· mates being considered.nB). cross multiplication shows that (5. claim (5. B) is an MLR family in T(X). 1 .35). . Then . .4. for some Bo E is known as superefficiency.4 Testing ! I i .ne . If B = 0. B) with J T(x) . 1.r 332 Asymptotic Approximations Chapter 5 Proof Claims (5.35) is equivalent to (5. 1 I I.439) where .<1>( _n '/4 .. . }.1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise.4. .. ..4. However... PIIZ + . thus.4.30) and (5. 1). PolBn = Xl . if B I' 0. Therefore.. The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(.. PoliXI < n 1/4 ] . 1998.4.3 generalizes Example 5.36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!. By (5.4.B).d.4.. we know that all likelihood ratio tests for simple BI e e eo. .4.4.i . .39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo). I I Example 5.A'(B). 0 e en e..1 once we identify 'IjJ(x... and Polen = OJ . (5.4.2. see Lehmann and Casella.38) Therefore.".(X B). B) = 0. . j 1 " Note that Theorem 5. For this estimate superefficiency implies poor behavior of at values close to 0. We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n.. 00. N(B. .. PoliXI < n1/'1 .1/4 I (5.1/4 X if!XI > n.4.33) and (5.nBI < nl/'J <I>(n l/ ' .l'1 Let X" .nB) . for higherdimensional the phenomenon becomes more disturbing and has important practical consequences. The optimality part of Theorem 5.i.. We next compule the limiting distribution of . 1 .4. Let Z ~ N(O.• •. p.4. . .. . We discuss this further in Volume II. Consider the following competitor to X: B n I 0 if IXI < n.
4.4.4.22) guarantees that sup IPoolvn(Bn . and (5. "Reject H if B > c(a. Suppose (A4') holds as well as (A6) and 1(0) < oofor all O.4. as well as the likelihood ratio test for H versus K.0:c":c'=D. 00) . Then ljO > 00 .d.Oo) = 00 + ZIa/VnI(Oo) +0(n. PoolBn > 00 + zl_a/VnI(Oo)] = POo IVnI(Oo) (Bn  00) > ZI_a] ~ a. X n distributed according to Po..[Bn > c(a.(a.45) which implies that vnI(Oo)(c. e e e e. (5.(a.44) But Polya's theorem (A.. ljO < 00 . PolO n > c. (5. It seems natural in general to study the behavior of the test. [o(vn(Bn 0)) ~N(O. That is.43) Property (5. b).(1 1>(z))1 ~ 0. 1) distribution.4.. Xl.. eo) denote the critical value of the test using the MLE en based On n observations. Theorem 5. < >"0 versus K : >. If pC e) is a oneparameter exponential family in B generated by T(X).4. 00) ..2 apply to '0 = g~ and On.Oo)] PolOn > cn(a. 0 E e} is such thot the conditions of Theorem 5. 00)] = PolvnI(O)(Bn  0) > VnI(O)(cn(a. "Reject H for large values of the MLE T(X) of >. Thus. • • where A = A(e) because A is strictly increasing.. . On the (5. a < eo < b.. 00 )" where P8 .h the same behavior.l4.Oo)] = a and B is the MLE n n of We will use asymptotic theory to study the behavior of this test when we observe ij.(a.:c".. are of the form "Reject H for T( X) large" with the critical value specified by making the probability of type I error Q at eo. > AO. ~ 0..00) > z] .3 versus simple 2 .. B E (a.:.42) is sometimes called consistency of the test against a fixed alternative.42)  ~ o.m='=":c'.46) PolBn > cn(a.4.4.4. Let en (0:'.:cO"=~ ~ 3=3:. the MLE.I / 2 ) (5.a quantile o/the N(O. derive an optimality property. (5.::.4 Asymptotic Theo:c'Y:c.Section 5. Zla (5. . this test can also be interpreted as a test of H : >. Suppose the model P = {Po .40) > Ofor all O.00) other hand.r'(O)) where 1(0) (5.0)]. . 1 < z . Proof. The test is then precisely.41) where Zlct is the 1 .00) > z] ~ 11>(z) by (5.4. and then directly and through problems exhibit other tests wi!. 00 )]  ~ 1.41) follows.4. The proof is straightforward: PoolvnI(Oo)(Bn .4.40). Then c.4.
then by (5.4.4.4.017".1/ 2 ))].jnI(80) +0(n.47) lor.8) + 0(1) .48) and (5. In either case.4.jnI(8)(80 .jnI(8)(Bn .4. j (5. .jnI(O)(Bn .80 1 < «80 )) I J .8) > .jnI(80) + 0(n. then (5. ~! i • Note that (5.50) can be interpreted as saying that among all tests that are asymptotically level 0' (obey (5.4. Write Po[Bn > cn(O'.4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to Ct from the left at 00 and continues rising steeply to 1 to the right of 80 .jnI(8)(Cn(0'.4. 80 )  8) . 00 if 8 < 80 .1 / 2 )) .5. 80)J 1 P.80 ). X n ) is any sequence of{possibly randomized) critical (test) functions such that .k'Pn(X1.41).8) < z] .50) i. these statements can only be interpreted as valid in a small neighborhood of 80 because 'Y fixed means () + B .jI(80)) ill' > 0 > llI(zl_a ")'.2 and (5.4.8 + zl_a/.···. the test based on 8n is still asymptotically MP.4.  i.4.80) tends to infinity.[. o if "m(8 .48) . iflpn(X1 .43) follow. In fact. (3) = Theorem 5.. 0.40) hold uniformly for (J in a neighborhood of (Jo. (5.. .4. X n ) i.4.jnI(8)(Bn .jnI(8)(80 ..[.4. .49)) the test based on rejecting for large values of 8n is asymptotically uniformly most powerful (obey (5.50)) and has asymptotically smallest probability of type I error for B < Bo. f. assume sup{IP. .8 + zl_o/. j .4.8) > . the power of the test based on 8n tends to I by (5.(1 lI(z))1 : 18 .48). Furthennore.jnI(8)(80 .51) I 1 I I . LetQ) = Po. then limnE'o+.80 ) tends to zero.(80) > O. That is. . If "m(8 . Claims (5. I .42) and (5. the power of tests with asymptotic level 0' tend to 0'.49) ! i . . < 1 lI(Zl_a ")'. Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ")' "m( 8 . (5. . ~ " uniformly in 'Y. Theorem 5.80) ~ 8)J Po[. Suppose the conditions afTheorem 5.jI(80)) ill' < O. 1 ~. 00 if8 > 80 0 and .")' ~ "m(8  80 ).4.jnI(8)(Cn(0'. (5.   j Proof.4.4.J 334 Asymptotic Approximations Chapter 5 By (5. On the other hand.4. I.50)..
4. k n (80 ' Q) ~ if OWn (Xl.4. > dn (Q.jn(I(8o) + 0(1))(80 . . The test li8n > en (Q. is O.j1i(8  8 0 ) is fixed. 1.Section 5.53) where p(x.7). for Q < ~.4 Asymptotic Theory in One Dimension 335 If.?." + 0(1) and that It may be shown (Problem 5. ..8 0 ) ..4. ~ .53) tends to the righthand side of (5.5.<P(Zla(1 + 0(1)) + . 0 (5. if I + 0(1)) (5.5 in the future will be referred to as a Wald test.8) that.Xn) is the critical function of the LR test then.4. € > 0 rejects for large values of z1. Q). Finally. X n .50) and. i=1 P 1. These are the likelihood ratio test and the score or Rao test. It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form "Reject if L log[p(X" 8n )!p(X i=1 n i ... note that the Neyman Pearson LR test for H : 0 = 00 versus K : 00 + t. " Xn. [5In (X" ... .[logPn( Xl..4.Xn ) = 5Wn (X" .4.54) establishes that the test 0Ln yields equality in (5.4.48) follows. There are two other types of test that have the same asymptotic behavior. To prove (5. Further Taylor expansion and probabilistic arguments of the type we have used show that the righthand side of (5.4. X.54) Assertion (5.. P.. . . hence.o + 7.50) note that by the NeymanPearson lemma. . The details are in Problem 5. Thus. 1(8) = 1(80 + + fi) ~ 1(80) because our uniformity assumption implies that 0 1(0) is continuous (Problem 5.O+.4.4..4 and 5. 0 n p(Xi ... . .jI(Oo)) + 0(1)) and (5. 00)]1(8n > 80 ) > kn(Oo.53) is Q if. 80)] of Theorems 5.4.8) 1.)] ~ L (5.<P(Zl_a .52) > 0..4. t n are uniquely chosen so that the right.4.OO+7n) +EnP.OO)J .80 ) < n p (Xi. hand side of (5. Llog·· (X 0) =dn (Q.Xn) is the critical function of the Wald test and oLn (Xl. 0) denotes the density of Xi and dn ..4.. 00 + E) logPn(X l E I " . 0 The asymptotic!esults we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n.4. 8 + fi) 0 P'o+:. .' L10g i=1 p (X 8) 1. for all 'Y. is asymptotically most powerful as well.50) for all. .
336
Asymptotic Approximations
Chapter 5
where PlI(X 1, ... ,Xn ,8) is the joint density of Xl, ... ,Xn . Fort: small. n fixed, this is approximately the same as rejecting for large values of a~o logPn(X 11 • • • 1 X n ) eo).
,
• ,
The preceding argument doesn't depend on the fact that Xl" .. , X n are i.i.d. with common density or frequency function p{x, 8) and the test that rejects H for large values of a~o log Pn (XI, ... ,Xn , eo) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming
"Reject H iff
t
i=I
iJ~
: ,
logp(X i ,90) > Tn (a,9 0)."
0
I I
(5.4.55)
It is easy to see (Problem 5.4.15) that
Tn(a, 90 ) = Z10 VnI(90) + o(n I/2 )
and that again if G (Xl, ... , X n ) is the critical function of the Rao test then
nn
,
•
.,
,•
Po,+' WRn (X t, ... ,Xn ) = Ow n(X t , ... ,Xn)J ~ 1, rn
(5.4.56)
(Problem 5.4.8) and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(90 ), which d' may require numerical integration, can be replaced by _n l d021n(Bn) (Problem 5.4.10).
5.4,5
Confidence Bounds
Q
We define an asymptotic Levell that
lower confidence bound (LCB) On by the requirement
(5.4.57)
r I I i ,
., '
.'
for all () and similarly define asymptotic level!  a DeBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:
(i) By using a natural pivot.
, f.
(
.,
(ii) By inverting the testing regions derived in Section 5.4.4.
,. "',
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (AO)(A6), (A4'), and I(9) finite for all it follows (Problem 5.4.9) that
e.
Co(
V n  e)) ~ N(o, 1) nI(lin)(li
Z1a/VnI(lin ).
(5.4.58)
for all () and, hence. an asymptotic level!  a lower confidence bound is given by
9~ = lin e~,
(5.4.59)
Turning tto method (ii), inversion of 8Wn gives fonnally
= inf{9: en(a, 9) > 9n }
(5.4.60)
,
=
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
337
or if we use the approximation C (0, e) ~ n
e+ zlQ/vnI(iJ), (5,4,41),
e) > en}'
~~, = inf{e , Cn(C>,

(5,4,61)
In fact neither e~I' or e~2 properly inverts the tests unless cn(Q, e) and Cn (Q, e) are increasing in The three bounds are different as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, e~l is preferable because this bound is not only approximately but genuinely level 1 Q. But computationally it is often hard to implement because cn(Q, 0) needs, in general, to be computed by simulation for a grid of values. Typically, (5.4.59) or some equivalent alternatives (Problem 5,4,10) are preferred but Can be quite inadequate (Problem 5,4,1 I), These bounds e~, O~I' e~2' are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5,4,12 and 5,4,13),
e.
e
Summary. We have defined asymptotic optimality for estimates in oneparameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of in a onedimensional subfamily of the multinomial distributions, showed that the MLE fonnally achieves this bound, and made the latter result sharp in the context of oneparameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M estimates, generalizations of the MLE, along the lines of Huber (1967), The asymptotic formulae we derived are applied to the MLE both under the mooel that led to it and tmder an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980),
e
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as n t 00 in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4, i.i.d. observations from a regular madel in which is open C R or = {e 1 , , , , , e,} finite, and e is identifiable, Most of the questions we address and answer are under the assumption that fJ = 0, an arbitrary specified value, or in frequentist tenns, that 8 is true.
e
e
Consistency The first natural question is whether the Bayes posterior distribution as n + 00 concentrates all mass more and more tightly around B. Intuitively this means that the data that are coming from Po eventually wipe out any prior belief that parameter values not close to are likely, Formalizing this statement about the posterior distribution, II(· I X It •.• , X n ), which is a functionvalued statistic, is somewhat subtle in general. But for = {O l ,' .. , Ok} it is
e
e
i
338
straightforward. Let
Asymptotic Approximations Chapter 5
I.
11(8 i XI,···, Xn)
Then we say that II(·
=PIO = 8 I Xl,···, Xn].
(5.5.1)
e, P,li"(8 I Xl, ... , Xn)  11 > ,] ~ 0 for all f. > O. There is a slightly stronger definition: rIC I XI,' .. ,Xn ) is a.S. iff for all 8 E e, 11(8 I Xl, ... , Xn) ~ 1 a.s. P,.
is consistent iff for all 8 E
General a.s. consistency is not hard to formulate:
I Xl, ... , Xn)
(5.5.2)
consistent
(5.5.3)
11(· I X), ... , Xn)
=}
OJ'} a.s. P,
(5.5.4)
where::::} denotes convergence in law and <5{O} is point mass at satisfactory result for finite.
e
e.
There is a completely
, ,
,
Theorem 5.5.1. Let 1rj  p[e = Bj ], j = 1, ... 1 k denote the prior distribution 0/8. Then II(· I Xl, ... ,Xn ) is consistent (a.s. consistent) iff 7fj > afor j = I, ... , k.
Proof. Let p(., B) denote the frequency or limit j function of X. The necessity of the condition is immediate because 1["] = 0 for some j implies that 1f(Bj I Xl, ... ,Xn ) = 0 for all Xl, .. . , X n because, by (1.2.8),
.,
,J,
11(8j
I Xl, ... ,Xn)
PIO = 8j I Xl, ... ,Xn ] 11j Ir~l p(Xi , 8il
L:.~l 11. ni~l p(Xi , 8.)
k
n .
(5.5.5)
,
,
, ,,
Intuitively, no amount of data can convince a Bayesian who has decided a priori that OJ is impossible. On the other hand, suppose all 71" j are positive. If the true is (J j or equivalently 8 = (J j, then
e
log
11(8.IXl, ... ,Xn) =n 11(8 j I X), ... ,Xn)
(11og+ L.. og P(Xi,8.)) . 11. 1{f.,1 n
7fj
n i~l
p(Xi ,8j)
,
By the weak (respectively strong) LLN, under POi'
.,i
1{f.,log p(Xi ,8.)  Lni~l
p(Xi ,8j )
+
E
OJ
(I
I
P(XI ,8.)) og p(X I ,8j )
i
I
in probability (respectively a.s.). But Eo;
(log: ~:::;))
~
< 0, by Shannon's inequality, if
Ba
• .,
=I=
Bj' Therefore,
11(8.IXI, ... ,Xn) 1 og 11(8j I X), ... , X n )
00
,
in the appropriate sense, and the theorem follows.
o
i ~,
h
_
Section 55
Asymptotic Behavior and Optimality of the Posterior Distribution
339
e
Remark 5.5.1. We have proved more than is stated. Namely. that for each I XI, . .. ,Xn ]  a exponentially.
e E e. Po[O =l0
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution
Under conditions AOA6 for p(x, B) that if B is the MLE,

= lex, Bj
=logp(x, B), we showed in Section 5.4
(5.5.6)
Ca(y'n(e  B» ~ N(O,rl(B)).
Consider C( ..;ii((}  B) I Xl, ... , X n ), the posterior probability distribution of y'n((} B( Xl, ... , X n )), where we emphasize that (j depends only on the data and is a constant given XI, ... , X n . For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in Po probability by convergence a.s. p•. We also add,



A7: For all (), and all 0> o there exists t(o,(})
p. [sup
> 0 such that
{~ t[I(Xi,B') /(Xi,B)]: 18'  BI > /j} < '(0, B)] ~ I.
e such that 1r(') is continuous and positive
AS: The prior distribution has a density 1f(') On at all B. Remarkably,
Theorem 55.2 (UBernsteinlvon Mises"). If conditions ADA3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold. then
C(y'n((}(})
I X1, ... ,Xn )
~N(O,l
1
(B»)
(5.5.7)
a.s. under P%ralle.
We can rewrite (5.5.7) more usefully as
sup IP[y'n((}  e) < x I Xl, ... , X n]  of>(xVI(B»)j ~ 0
x
(5.5.8)
for all a.s. Po and, of course, the statement holds for our usual and weaker convergence in Po probability also. From this restatement we obtain the important corollary. Corollary 5.5.1. Under the conditions of Theorem 5.5.2,
e
sup IP[y'n(O  e) < x j Xl, ... , XnJ  of>(xVl(e)1
x
~0
(5.5.9)
a.s. P%r all B.
1 , ,
340
Asymptotic Approximations Chapter 5
Remarks
(I) Statements (5,5.4) and (5,5,7)(5,5,9) are, in fact, frequentist statements about the
asymptotic behavior of certain functionvalued statistics.
(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in P, probability if A4 and
A5 are used rather than their strong formssee Problem 5.5.7.
(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and
identifiability guarantees consistency of Bin a regular model.

Proof We compute the posterior density of .,fii(O  B) as
(5.5.10)
where en = en(X!, . .. ,Xn) is given by

Divide top and bottom of (5.5.10) by
;,'
II7
1 p(Xi ,
B) to obtain
(5.5.11)

where l(x,B)
= 10gp(x,B) and
,
,
We claim that
for all B. To establish this note that (a) sup { 11" + 1I"(B) : ItI < M} tent and 1T' is continuous. (b) Expanding, (5.5.13)
(e In) 
~ 0 a.s.
for all M because
eis a.s. consis
I
I
1
i
1 ! , ,
I
p
J
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
341
where
Ie  Bit)1 < )n.
We use I:~
1
g~ (Xi, e) ~ 0 here. By A4(a.s.), A5(a.s.),
1
n
sup { n~[}B,(Xi,B'(t))n~[}B,(Xi,B):ltl<M
In [}'I
[}'l
}
~O,
for all M, a.s. Po. Using (5.5.13). the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),
Po
[dnqn(t)~1f(B)exp{Eo:;:(Xl,B)~}
forallt] =1.
(5.5.14)
Using A6 we obtain (5.5.12).
Now consider
dn =
I:
r
+y'n
1f(e+
;")exp{~I(Xi,9+;,,) 1(Xi,e)}ds
(5.5.15)
dnqn(s)ds
J1:;I<o,fii
J
1f(t) exp
{~(l(Xi' t) 1(X
i,
9)) } l(lt 
el > o)dt
By AS and A7,
Po [sup { exp
{~(l(Xi,t) 1(Xi , e») } : It  el > 0} < e"'("O)] ~ 1
(5.5.16)
for all 0 so that the second teon in (5.5.14) is bounded by y'ne"'("O) ~ 0 a.s. Po for all 0> O. Finally note that (Problem 5.5.4) by arguing as for (5.5.14), tbere exists o(B) > 0 such that
Po [dnqn(t) < 21f(8) exp {~ Eo (:;: (Xl, B))
By (5.5.15) and (5.5.16), for all 0
~}
for all It 1 < 0(8)y'n]
~ I.
(5.5.17)
> 0,
(5.5.18)
Po [dn 
r dnqn(s)ds ~ 0] = I. J1:;I<o,fii
exp {_ 8'I(B)} ds
2
Finally, apply the dominated convergence theorem, Theorem B.7.5, to dnqn(sl(lsl < 0(8)y'n)), using (5.5.14) and (5.5.17) to conclude that, a.s. Po,
d ~ 1f(B)
n
r= L=
= 1f(8)v'21i'.
JI(B)
(5.5.19)
,
I
342
Hence, a.S. Po,
Asymptot'lc Approximations Chapter 5
qn(t) ~ V1(e)<p(tvI(e))
where r.p is the standard Gaussian density and the theorem follows from Scheffe's Theorem B.7.6 and Proposition B.7.2. 0 Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N{ (), ( 2 ) distribution with a 2 known and we put aN ('TJ, 7 2 ) prior on 8. Then the posterior
distribution of8 isN(Wln7J!W2nX,
(~I r12)1) where
,,2
W2n
.,
I
• •
WIn
= nT 2 +U2'
= !WIn
(5.5.20)
,
'"
,
r
, .,,
I
Evidently, as n + 00, WIn + 0, X + 8, a.s., if () = e, and (~I T\) 1 + O. That is, the posterior distribution has mean approximately (j and variance approximately 0, for n large, or equivalently the posterior is close to point mass at as we vn(O  9) has posterior distribution expect from Theorem 5.5.1. Because 9 =
;
N ( .,!nw1n(ry  X), n (~+
;'»
1).
x,
e
Now, vnW1n
=
O(n 1/ 2) ~ 0(1) and
0
n (;i + ~ ) 1
rem 5.5.2.
+ (12 =
II (8) and we have directly established the conclusion of Theo
I ,
I
I
Example 5.5.2. Posterior Behavior in the Binomial~Beta Model. (Example 3.2.3 continued). If we observe Sn with a binomial, B(n, 8), distribution, or equivalently we observe X" ... , X n Li.d. Bernoulli (I, e) and put a beta, (3(r, s) prior on e, then, as in Example 3.2.3, (J has posterior (3(8n +r, n+s  8 n ). We have shown in Problem 5.3.20 that if Ua,b has a f3(a, b) distribution, then as a + 00, b + 00,
I
If 0
a) £ (a+b)3]1( Ua,b a+b ~N(o,I). [ ab
< B<
(5.5.21)
j I ,
i j
1 is true, Sn/n ~. () so that Sn + r + 00, n + s  Sn + 00 a.s. Po. By identifying a with Sn + r and b with n + s  Sn we conclude after some algebra that because 9 = X,
vn((J  X)!:' N(O,e(l e))
a.s. Po, as claimed by Theorem 5.5.2.
o
Bayesian optimality of optimal frequentist procedures and frequentist optimality of
Bayesian procedures
•
Theorem 5.5.2 has two surprising consequences. (a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.
,
1 ,
I
I
t
hz _
I
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
343
(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions. As an example of this phenomenon consider the following.
~
Theorem 5.~.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let B be the MLE ofB and let B* be the median ofthe posterior distribution ofB. Then
(i)
(5.5.22)
a.s. Pe for all
e. Consequently,
~, _ I ~ 1 az. I (e)ae(X" e) +op,(n 1/2 ) e  e+ n L.
l=l
(5.5.23)
and LO( .,fii(rr  e)) ~ N(o, rl(e)).
(ii)
(5.5.24)
E( .,fii(111 
el11111 Xl,'"
,Xn) = mjn E(.,fii(111  dl 
1(11) 1Xl.··· ,Xn) + op(I).
(5.5.25)
Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions In (e, d) = .,fii(18 dlIell· But the Bayes estimatesforl n and forl(e, d) = 18dl must agree whenever E(11111 Xl, ... , Xn) < 00. (Note that if E(1111 I Xl, ... , X n ) = 00, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.l4.22)
sup IP[.,fii(O  e)
< x I Xl,'" ,Xu) 1>(xy'""'I(""'e))1 ~
Oa.s. Po.
(5.5.26)
But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.1 I). Thus, any median of the posterior distribution of .,fii(11  e) tends to 0, the median of N(O, II (~)), a.s. Po. But the median of the posterior of .,fii(0  (1) is .,fii(e'  e), and (5.5.22) follows. To prove (5.5.24) note that
~ ~ ~
and, hence, that
E(.,fii(IOelll1e'l) IXl, .. ·,Xn) < .,fiile
e'l ~O
(5.5.27)
a.s. Po, for all B. Because a.s. convergence Po for all B implies. a.s. convergence P (B.?). claim (5.5.24) follows and, hence,
E( .,fii(10 
01 101) I h ... , Xn)
= E( .,fii(10 
0'1  101) I X" ... , X n ) + op(I).
(5.5.28)
344
~
Asymptotic Approximations
Chapter 5
Because by Problem 1.4.7 and Proposition 3.2.1, B* is the Bayes estimate for In(e,d), (5.5.25) and the theorem follows. 0 Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
Bayes credible regions
~
There is another result illustrating that the frequentist inferential procedures based on f) agree with Bayesian procedures to first order.
Theorem 5.5.4. Suppose the conditions afTheorem 5.5.2 are satisfied. Let
where en is chosen so that 1l"(Cn I Xl, ... ,Xn) = 1  0', be the Bayes credible region defined in Section 4.7. Let Inh) be the asymptotically level 1  'Y optimal interval based on B, given by
~
where dn(y)
i •
= z (! !)
JI~). ThenJorevery€ >
0, 0,
~ 1.
P.lIn(a + €) C Cn(X1 , .•. ,Xn ) C In(a  €)J
(5.5.29)
I
I
I
The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of Jii( IJ  0) converges to the N(O,Il (0)) density nnifonnly over compact neighborhoods of 0 for each fixed 0, is sketched in Problem 5.5.6. The message of the theorem should be clear. Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis both in this case and in estimation reveals that any approximations to Bayes procedures on a scale finer than n 1j2 do involve the prior. A particular choice, the Jeffrey's prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n 1 order (see Schervisch, 1995).
Thsting
! ,
•
Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of 00 given X I, ... ,Xn if H is false is of a different magnitude than the pvalue for the same data. For more on this socalled Lindley paradox see Berger (1985) and Schervisch (1995). However, if instead of considering hypothesis specifying one points 00 we consider indifference regions where H specifies [00 + D.) or (00  D., 00 + D.), then Bayes and freqnentist testing procedures agree in the limit. See Problem 5.5.2. Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a prior possible. Second. we established
i
! I
I b
II
!
_
j
TI
Section 5.6
Problems and Complements
345
the socalled Bernsteinvon Mises theorem actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the socalled Bernstein~von Mises theorem and frequentist contjdence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose Xl, ... , X n are i.i.d. as X ous case density.
rv
F, where F has median F 1 (4) and a continu
(a) Show that, if n
= 2k + 1,
, Xn)
EFmed(X),
n (
_
~
)
l'
k (1  t)kdt F' (t)t
EF
med (X"
2
,Xn )
n( 2;) [1P'(t)f tk (1t)k dt
= 1, 3,
(b) Suppose F is unifonn, U(O, 1). Find the MSE of the sample median for n and 5.
2. Suppose Z ~ N(I', 1) and V is independent of Z with distribution X;'. Then T
Z/
=
(~)!
is said to have a noncentral t distribution with noncentrality J1 and m degrees
of freedom. See Section 4.9.2. (a) Show that
where fm(w) is the x~ density, and <P is the nonnal distribution function.
(b) If X" ... ,Xn are i.i.d.N(I',<T2 ) show that y'nX /
(,.~, L:(Xi
_X)2)! has a
noncentral t distribution with noncentrality parameter .fiiJ1/IT and n  1 degrees of freedom. (c) Show that T 2 in (a) has a noncentral :FI,m distribution with noncentrality parameter J12. Deduce that the density of T is
p(t)
= 2L
i=O
00
P[R = iJ . hi+,(f)[<p(t 1')1(t > 0)
+ <p(t + 1')1(t < 0)1
where R is given in Problem B.3.12.
, " I I. ,
346
Hint: Condition on
Asymptotic Approximations
Chapter 5
ITI.
1, then Var(X) < 1 with equality iff X
3. Show that if P[lXI < 11
±1 with
probability! .
Hint: Var(X) < EX 2
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and f. through ..jiif..
(a) Show that the ratio of the Hoeffding function h( VilE) to the Chebychev function e( Jii€) tends to as Jii€ ~ 00 so that he) is arbitrarily better than en in the tails.
°
I,.,'
i~
(b) Show that the normal approximation 24> (
V;€)  1 gives lower results than h in
00.
,
the tails if P[lXI < 1] = 1 because, if ,,2 < 1. 1  <p(t)  <p(t)lt as t ~ Note: Hoeffding (1963) exhibits better bounds for known a 2 .
R has .\(0) ~ 0, is bounded, and has a hounded second derivative .\n. Show that if Xl, ... , X n are i.i.d., EX l = f.L and Var Xl = 02 < 00, then
5. Suppose.\ : R
~
E.\(X 
1') = .\'(0)
;'/!; +
0
(~)
as n >
00.
= E>.'(O)JiiIX  1'1 + E (";' (X  I')(X I'?) where IX  1'1 < Ix  1'1· The last term is < suPx I>." (x) 1,,2 In and the first tends to
Hint: JiiE(.\(IX 1,1) .\(0))
I
",
1! "
.\'(0)" f== Izl<p(z)dz by Remark B.7. 1(2).
Problems for Section 5,2
1. Using the notation of Theorern 5.2.1, show that
"
2. Let X" ... ,Xn be ij,d. N(I',,,2), Show that for all n
Sup p(•• u) [IX
u
>
1, all €
>
°
/,1 > ,] ~ 1.
Hint: Let (J
)
00.
•
3, Establish (5.2.5). Hint: Iiln  q(p)1
> € =} IPn  pi > w (€).
l
J
4. Let (Ui , V;), 1 < i < n, be i.i.d.  PEP.
(a) Let y(P)
= PIU, > 0, V, > OJ. Show that if P = N(O, 0,1,1, p), then
p
~ sin21l' (Y(P)  ~).
\} .. Eo.n 1 (XiJ. 5.. then is a consistent estimate of p.2 rII.lO)2)) .\} c U s(ejJ(e j=l T J) {~ t n . . for each e eo.p(Xi .d. e') . and the dominated convergence theorem.Xn are i. {p(X" 0) . VarpUVarpV > O}. Or of sphere centers such that K Now inf n {e: Ie . t e. Hint: K can be taken as [A. e)1 (ii) Eo. eo) : Ie  : Ie . (a) Show that condition (5. and < > 0 there is o{0) > 0 such that e. e) is continuous. (1 (J. (Wald) Suppose e ~ p( X.2.•. Hint: From continuity of p. Prove that (5.6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution. e) .e'l < . 5_0  lim Eo. Suppose Xl. (i). Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent. .p(X. inf{p(X. sup{lp(X.2.eol > .sup{lp(X.14)(i). By compactness there is a finite number 8 1 .o)} =0 where S( 0) is the 0 ball about Therefore. e E Rand (i) For some «eo) >0 Eo. . A].l)2 _ + tTl = o:J. n Lt= ". eol > A} > 0 for some A < 00. eo) : e' E S(O.(eo)} < 00. N (/J.8) fails even in this simplest case in which X ~ /J is clear.p(X.. Show that the maximum contrast estimate 8 is consistent.~ .Section 5. ". V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00. inf{p(X.2.e')p(X.lJ. Hint: sup 1. . e') .eol > . 7. by the basic property of maximum contrast estimates. i (e))} > <. where A is an arbitrary positive and finite constant. (c) Suppose p{P) is defined generally as Covp (U. (Ii) Show that condition (5.e)1 : e' E 5'(e. . 05) where ao is known.i.p(X. 6.(ii) holds.14)(i) add (ii) suffice for consistency. eo)} : e E K n {I : 1 IB .
8.I'lj < EIX . and take the values ±1 with probability ~. + id = m E II (Y.11). (ii) If ti are ij. < < 17 < 1/<... 1/(J. 1 + .k / 2 .) 2] .. . en are constants.3. < Mjn~ E (... . .i..3.[. I 10. J (iii) Condition on IXi  X. Establish (5. for some constants M j .O'l n i=l p(X"Oo)}: B' E S(Bj.)2]' < t Mjn~ E [. P > 1.X~ are i.l m < Cm k=I n.L. . Hint: See part (a) of the proof of Lemma 5. 2. For r fixed apply the law of large numbers.xW) < I'lj· 3.X'i j .d. Hint: Taylor expand and note that if i . The condition of Problem 7(ii) can also fail.i. 4.X... II I . li . with the same distribution as Xl. Compact sets J( can be taken of the form {II'I < A.. and if CI. Problems for Section 5. . then by Jensen's inequality. (72). i .d. and let X' = n1EX.3.3 I. Establish Theorem 5..)I' < MjE [L:~ 1 (Xi . . L:~ 1(Xi .3."d' k=l d < mm+1 L d ElY.3.O(BJll}.9) in the exponential model of Example 5. in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x. Establish (5.2. Establish (5. Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K. Let X~ be i. . .d.3) for j odd as follows: . n. • • (i) Suppose Xf.348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi.1'. L:!Xi .. Then EIX .1. . i . ~ . 9.. < > OJ. Extend the result of Problem 7 to the case () E RP.. Show that the log likelihood tends to 00 as a + 0 and the condition fails.X.3. . . = 1..1.Xn but independent of them. . . .
Xn1 bei.Tn) + 1 .andsupposetheX'sandY's are independent. R valued with EX 1 = O. i.d.Fk. if a < EXr < 00. . .(X. 0'1 = Var(Xd· Xl /LI Ck.ad)]m < m L aj j=1 m <mmILaj. replaced by its method of moments estimate.i. G.\k for some . (b) Show that when F and G are nonnal as in part (a). 1 < j < m. .(l'i . .Section 5.3.I) Hint: By the iterated expectation theorem EIX. .m distribution with k = nl .r'j". 2 = Var[ ( ) /0'1 ]2 .d. PH(st! s~ < Ck.1) +E{IX. Show that if EIXlii < 00.. respectively.d.m 00.) I : i" . km K. j > 2. LetX1"".28. j = 1. then (sVaf)/(s~/a~) has an .1 and 'In = n2 . .. ~ iLli I IX. ~ xl".i.6 Problems and Complements 349 Suppose ad > 0.l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck.. ~ iLli < 2 i EIX.m) Then.1)'2:7' . under H : Var(Xtl ~ Var(Y..a as k t 00. Let XI. (d) Let Ck. .. Show that under the assumptions of part (c).. where s1 = (n. . Show that if m = . theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~...\ > 0 and = 1 + JI«k+m) ZIa. P(sll s~ < ~ 1~ a as k ~ 00. (a) Show that if F and G are N(iL"af) and N(iL2. Show that ~ sup{ IE( X.m with K.pli ~ E{IX.c> ."" X n be i. ~ iLl' I IXli > liLl}P(lXd > 11.i. Establish 5. 6. s~ ~ (n2 . FandYi""'Yn2 bei.).aD. then EIX. )=1 5. 8.. X./LI = E(Xt}. I < liLl}P(IXd < liLl)· 7..I.a~d < [max(al"" . L~=1 i j = m then m a~l. 1)'2:7' . n} EIX.m be Ck.
Wnte (l 1pP . t N }. 7 (1 . N where the t:i are i. suppose i j = j." •.x) ~ N(O. "i.2" ( 11. In particular.350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g. .00. suppose in the context of Example 3.1 that we use Til. show that i' ~ . . . p). jn((C . (b) If (X. . from {t I.". Show that jn(XRx) ~N(0.\ < 1..d.XC) wHere XC = N~n Et n+l Xi.xd..6.3. Et:i = 0.A») where 7 2 = Var(XIl.. In survey sampling the modelbased approach postulates that the population {Xl._ J .I). "1 = "2 = I. • op(n'). (a) jn(X . if p i' 0. = = 1.p) ~ N(O. . i = I.lIl) + 1 . i = 1. if 0 < EX~ < 00 and 0 < Eyl8 < 00.Ii > 0. .4.. as N 2 t Show that. Po.2< 7'. i (b) Suppose Po is such that X I = bUi+t:i.I' ill "rI' and "2 = Var[( X.XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters.6. . jn(r . . Y) ~ N(I'I..(I_A)a2). then jn(r .m (0 Let 9. Instead assume that 0 < Var(Y?) < x. (cj Show that if p = 0. .3..p). .) =a 2 < 00 and Var(U) Hint: (a) X .Il2. . (b) Use the delta method for the multivariate case and note (b opt  b)(U . Without loss of generality. () E such that T i = ti where ti = (ui.p') ~ N(O.xd.. to estimate Ii 1 < j < n. Consider as estimates e = ') (I X = x.I))T has the same asymptotic distribution as n~ [n.a: as k + 00.. .+xn n when T. JLl = JL2 = 0.m) + 1 . + I+p" ' ) I I I II.4..\ 00." . .N. Under the assumptions of part (c). X) U t" (ii) X R = bopt(U ~  iL) as in Example 3. there exists T I . ... • . Show that under the assumptions of part (e). Hint: Use the centra] limit theorem and Slutsky's theorem.m (depending on ~'l ~ Var[ (X I .p')') and. In Example 5. ! . if ~ then .. then P(sUs~ < qk. that is. . 0 < . = (1.". (UN. .i.1. • 10. Tin' which we have sampled at random ~ E~ 1 Xi..d.+. and if EX. . ~ ] . . 1). then jn(r' . iik..1 EX. (a) If 1'1 = 1'2 ~ 0. (iJi . err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5.l EXiYi . < 00 (in the supennodel). n. .1JT. Var(€. ' . .I.m with KI and K2 replaced by their method of moment estimates. TN i.. 4p' (I . Without loss of generality.~) (X . H En. .. . (I _ P')').iL) • .a as k .1 · /.t .p.1.. be qk. use a normal approximation to tind an approximate critical value qk. n. = ("t .XN} or {(uI.p) ~ N(O. (iJl. .i.1'2)1"2)' such that PH(SI!S~ < Qk.l EY? .
Suppose X I I • • • l X n is a sample from a peA) distrihution.6 Problems and Complements 351 12. Normalizing Transformation for the Poisson Distribution. The following approximation to the distribution of Sn (due to Wilson and Hilferty.y'ri has approximately aN (0. Hint: If U is independent of (V. (b) Let Sn ...n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO. . 1931) is found to be excellent Use (5.99.90..3 < 00. (a) Suppose that Ely.3.. 16.Section 5. n = 5. then E(UVW) = O.14 to explain the numerical results of Problem 5. Suppose X 1l . 14.3' Justify fonnally j1" variance (72.12). Let Sn have a X~ distribution.3.3. Suppose XI. Show that IE(Y.. X = XO.3. X n are independent. each with HardyWeinberg frequency function f given by .i'c) I < Mn'.. EU 13. 15. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x .14).i'b)(Yc .6) to explain why. .. X~.i'a)(Yb . E(WV) < 00.. (a) Show that if n is large.E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d. (a) Use this fact and Problem 5.3.s.: . and Hint: Use (5. . This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl). It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes. !) distribution. X n is a sample from a population with mean third central moment j1. = 0... is known as Fisher's approximation. (a) Show that the only transformations h that make E[h(X) ...10. W). . (h) Deduce fonnula (5.25. X.. . (h) Use (a) to justify the approximation 17.13(c). Here x q denotes the qth quantile of the distribution.
(x. Cov(X. E(Y) = 1'2.y) = a ayh(x. 1'2)h2(1'1.. Y) ~ p!7IC72.. Var(Y) = C7~. 1'2)(Y  + O( n '). 19.n P . Y) where (X" Yi). . and h'(t) > for all t. (1'1. Variance Stabilizing Transfo111U1tion for the Binomial Distribution. 0< a < I. . Var Bmnm + n +Rmn = 'm+n ' ' where Rm. Show that the only variance stabilizing transformation h such that h(O) = 0. which are integers. Y" . 1'2)]2C7n + O(n 2) i . X n be the indicators of n binomial trials with probability of success B. (b) Find an approximation to P[JX' < t] in terms of 0 and t. . 1'2) Hint: h(X.a) E(Bmn ) = .y). 1'2)(X . where I I h. where I' = (a) Find an approximation to P[X < E(X. is given by h(t) = (2/1r) sin' (Yt).. Y) ..y) = a axh(x. Show directly using Problem B. 20.n = (mX/nY)i1 independent standard eXJX>nentials. Justify fonnally the following expressions for the moments of h(X. . i 21. Yn are < < .. 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t. 1'2) = h. V)) '" . (1'1.n tends to zero at the rate I/(m + n)2.h(l'l.1'2)]2 177 n +2h.n  m/(m + n») < x] 1 > 'li(x).I'll + h2(1'1.2. . Bm.y). Var(X) = 171. (Xn .. 1'2)pC7.a) Hmt: Use Bm. (e) What is the approximate distribution of Vr'( X .1') + X 2 . (a) (b) Var(h(X. Yn) is a sample from a bivariate population with E(X) ~ 1'1.a tends to zero at the rate I/(m + n)2.. h(l) = I.• • • ~{[hl (1".. [vm +  n (Bm. . Show that ifm and n are both tending to oc in such a way that m/(m + n) > a.. if m/(m + n) . h 2 (x. + (mX/nY)] where Xl.Xm . Let then have a beta distribution with parameters m and n.)? 18. Let X I. 172 + [h 2(1'l. ° .5 that under the conditions of the previous problem. then I m a(l. . • j b _ . v'a(l. I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1.
(e) Fiud the limiting laws of y'n(X . ° while n[h(X h(J1.) + ": + Rn where IRnl < M 1(J1.2) using the delta method.J1. Let Sn "' X~.)] !:.l = ~.J1.l) = o. XI. (a) Show that y'n[h(X) .2V where V ~ 00.. find the asymptotic distribution of y'n(T.h(J1. k > I.l4 is finite. Show that Eh(X) 3 h(J1. = E(X) "f 0. .3 1/6n2 + M(J1. Suppose that Xl.4 to give a directjustification of where R n / yin 0 as in n Recall Stirling's approximation: + + 00.) + ~h(2)(J1.)] ~ as ~h(2)(J1.) 23. ()'2 = Var(X) < 00. and that h(1)(J. = ~.J1.J1.6 Problems and Complements 353 22.2)/24n2 Hint: Therefore.J1.2. 24.X) in tenns of the distribution function when J.4 + 3o. _. find the asymptotic distrintion of nT nsing P(nT y'nX < Vi). (It may be shown but is not required that [y'nR" I is bounded.)] is asymptotically distributed (b) Use part (a) to show that when J1. (a) When J1.. < t) = p(Vi < (b) When J1..X2 .. J1. . X n is a sample from a population and that h is a realvalued function 01 X whose derivatives of order k are denoted by h(k). .).)1J1. Usc Stirling's approximation and Problem 8..(1. Let Xl. n[X(I . = 0. 1 X n be a sample from a population with mean J.. . xi 25.)2 and n(X .Section 5. xi.2V with V "' Give an approximation to the distribution of X (1 . Let Xl>"" X n be a sample Irom a population with and let T = X2 be an estimate of 112 . Suppose IM41 (x)1 < M for all x and some constant M and suppose that J.)2. Compare your answer to the answer in part (a).l and variance (J2 < Suppose h has a second derivative h<2) continuous at IJ.
4.9. 27.. 28.I I (a) Show that T has asymptotically a standard Donnal distribution as and I I . . (a) Show that P[T(t.\ > 1 . cyi . a~. . Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X . I i t I . 1 Xn.2a 4 ). We want to obtain confidence intervals on J1.3 independent samples with III = E(XIl. b l .\ < 1.3.. What happens? Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution.3.t.') < 00.\a'4)I'). .4. . whereas the situation is reversed if the sample size inequalities and variance inequalities agree. the intervals (4.9. Hint: (a). Yd. .2 . Yn2 are as in Section 4.. . XII are ij. 29.}. .a.a. and SD are as defined in Section 4. .(j ' a ) N(O. n2 + 00.9.354 26.\ul + (1 ~ or al . so that ntln ~ .3) and the intervals based on the pivot ID . E In] > 1 .9..3).9./".. (c) Show that if IInl is the length of the interval In. (d) Make a comparison of the asymptotic length of (4.3.) (b) Deduce that if. 'I :I ! t. .9. Let T = (D . n2 + 00. Let n _ 00 and (a) Show that P[T(t.) < (b) Deduce that if p tl ~ <P (t [1. We want to study the behavior of the twosample pivot T(Li. . . p) distribution. (c) Apply Slutsky's theorem.9. Suppose (Xl.33) and Theorem 5.\.4. Suppose nl + 00 . 1 X n1 and Y1 . Assume that the observations have a common N(jtl. .)/SD where D.3). if n}. that E(Xt) < 00 and E(Y. 2pawz)z(1  > 0 and In is given by (4.•. Viillnl ~ 2vul + a~z(l . and a~ = Var(Y1 ).\)a'4)/(1 .d.Xi we treat Xl.\ probability of coverage. Y1 ..o. . al = Var(XIl. .(~':~fJ'). Hint: Use (5. Suppose Xl. (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment. N(Pl (}2). Suppose that instead of using the onesample t intervals based on the differences Vi .9.3) have correct asymptotic (c) Show that if a~ > al and ..1/ S D where D and S D are as in Section 4.t2.. Eo) where Eo = diag(a'.\)al + .3) has asymptotic probability of coverage < 1 . the interval (4.. Show that if Xl" .\.) of Example 4. = = az. " Yn as separate samples and use the twosample t intervals (4.t. < tJ ~ <P(t[(.JTi times the length of the onesample t interval based on the differences. . I . . 0 < . .Jli = ~. /"' = E(YIl. then limn PIt.~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of .9. .
8Z = (n . ().Z).d.I)z(a)..." . Tn . Let XI.3. j = I.3 using the Welch test based on T rather than the twosample t test based on Sn. (c) Let R be the method of moment estimaie of K.:~l IXjl. (a) SupposeE(X 4 ) < 00.1. (a) Show that the MLEs of I'i and ().d.a)..I) + y'i«n ..i. k) be independent with Xi} ~ N(l'i. 32.xdf and Ixl is Euclidean distance. (d) Find or write a computer program that carries out the Welch test.. then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31. Hint: See Problems B.3. Show that k ~ ex:: as 111 ~ 00 and 112 t 00. where I._I' Find approximations to P(Yn and P{Vn < XnI (I . where tk(1.9 and 4.. Let .3 by showing that if Y 1.3.3. Carry out a Monte Carlo study such as the one that led to Figure 5. Plot your results. (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l . and let Va ~ (n .z = Var(X).I)sz/(). Vn = (n . . .z are Iii = k. X n be i.z has a X~I distribution when F is theN(fJl (72) distribution. x = (XI. . Show that as n t 00. X.1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" . . 30. 33. Show that if 0 < EX 8 < 00.p.~ t (Xi .1)1 L.£. . then for all integers k: where C depends on d.Section 5. but Var(Tn ) + 00. I< = Var[(X 1')/()'jZ. as X ~ F and let I' = E(X). Let X ij (i = I. < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X.a)) and evaluate the approximations when F is 7.9. T and where X is unifonn. vectors and EIY 1 1k < 00. EIY.I k and k only.6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4..£.a) is the critical value using the Welch approximation. I is the Euclidean norm..1).X)" Then by Theorem B. Y nERd are Li.4. U( 1. has asymptotic level a. Generalize Lemma 5. . It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. .4. Hint: If Ixll ~ L. then lim infn Var(Tn ) > Var(T). then P(Vn < va<) t a as 11 t 00. . ().16.. .
.. 1 X n. I i .(0) Varp!/!(X.1)cr'/k.. . Ep.....i. .0) ~ N Hint: P( .L~. N(O.On) < 0).) Hint: Show that Ep1/J(X1  8) is nonincreasing in B. N(O) = Cov(. That is the MLE jl' is not consistent (Neyman and Scott. is continuous and lbat 1(0) = F'(O) exists..0) !:. Set (.p(x) (b) Suppose lbat for all PEP. Let Xl.0).0) = O. .\(On)] !:. . 1948). (c) Assume the conditions in (a) and (b).. 00 (iii) I!/!I(x) < M < for all x.nCOn ..On) . then < t) = P(O < On) = P (.. Let i !' X denote the sample median.356 Asymptotic Approximations Chapter 5 . O(P) is lbe unique solution of Ep.O(P)) > 0 > Ep!/!(X.Xn be i. . .. 1/).p(X.4 1. I ! 1 i . then 8' !. n 1 F' (x) exists. _ . .. Use the bounded convergence lbeorem applied to . . . (c) Give a consistent estimate of (]'2. [N(O))' . I ..)). (a) Show that (i). .. . Assmne lbat '\'(0) < 0 exists and lbat . 1) for every sequence {On} wilb On 1 n = 0 + t/. (b) Show that if k is fixed and p ~ 00. • 1 = sgn(x).... (e) Suppose lbat the d.n for t E R.n7(0) ~[.. Show lbat if I(x) ~ (d) Assume part (c) and A6. . .p(X.0).p(X. .f. Let On = B(P). aliI/ > O(P) is finite. . (k ...d. .0) and 7'(0) . Show that On is consistent for B(P) over P. (ii). Show that _ £ ( . .. I(X..n(On . .p( 00) as 0 ~ 00. (Use . . \ . and (iii) imply that O(P) defined (not uniquely) by Ep. I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 ..p(X. F(x) of X.n(X .p(X.. ! Problems for Section 5.p(X. Show that. Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique. 1/4/'(0)).. .0) !.. N(O. Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) . .0))  7'(0)) 0.. under the conditions of (c).p(Xi . where P is the empirical distribution of Xl. random variables distributed according to PEP.
' .0) ~ N(O..O)p(x.9)dl"(x) = J te(1J. denotes the N(O. lJ :o(1J. x E R.B) < xl = 1.c)'P. .20 and note that X is more efficient than X for these gross error cases.. (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2).jii(Oj . Thus.O) is defined with Pe probability I but < x. ( 2 ).d.(x.O).B)) ~ [(I/O).0.b)p(x. Show that A6' implies A6.O))dl"(x) ifforalloo < a < b< 00.(x. .15 and 0. .2. 'T = 4.5  and '1'. 2. the asymptotic 2 . Show that if (g) Suppose Xl has the gross error density f. X) as defined in (f). evaluate the efficiency for € = .. U(O.. Find the efficiency ep(X.'" . not O! Hint: Peln(B .y.O)p(x.Xn ) is the MLE.b)dl"(x) J 1J.6 Problems and Complements 357 (0 For two estImates 0.0) (see Section 3.(x) = (I .7. This is compatible with I(B) = 00.0)p(x. 82 ) Pis N(I".5) where f.. and O WIth . X n be i.X) =0.. not only does asymptotic nonnality not hold but 8 converges B faster than at rate n. If (] = I. 1979) which you may assume. 3.(x). 0 «< 0.(x. then ep(X.i.4. ..(x . ~"'. 0» density. oj).(x.(x) + <'P.(x.I / 2 .a)dl"(x). Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley. 0) = . J = 1.~ for 0 > x and is undefined for 0 g~(X. 0 > 0.05.    .  = a?/o. (a) Show that g~ (x.. Show that   ep(X.10. then 'ce(n(B . Hint: teJ1J.exp(x/O)."' c .a)p(x.O)dl"(x)) = J 1J.(1:e )n ~ 1. X) = 1r/2. Hint: Apply A4 and the dominated convergence theorem B.0. 4.Section 5. Conclude that (b) Show that if 0 = max(XI. " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 . Let XI. Show that assumption A4' of this section coupled with AGA3 implies assumption A4.
0). ~log (b) Show that n p(Xi .[ L:.4) to g.4. j.In) + .80 + p(X . Show that the conclusions of PrQple~ 5.00) . and A6 hold for.) P"o (X . I 1 .in) 1 is replaced by the likelihood ratio statistic 7.80 p(X" 80) +. 0 for any sequence {en} by using (b) and Polyii's theorem (A.0'1 < En}'!' 0 .i'''i~l p(Xi.5 continue to hold if .22). 8) is continuous and I if En sup {:. Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+".p ~ 1(0) < 00."log . ~ L. .in) .Oo) p(Xi .. &l/&O so that E.~ (X. 0 ~N ~ (. ) n en] . Suppose A4'.in.. I .i (x.i. I I j 6. 80 +..80 + p(Xi .4.5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi. O. I . ) ! .In ~ "( n & &8 10gp(Xi.. Show that 0 ~ 1(0) is continuous.g.·_t1og vn n j = p( x"'o+.O) 1(0) and Hint: () _ g.14. 8 ) i .. A2. i..in) ~ . . Apply the dominated convergence theorem (B. (8) Show that in Theorem 5. 1(80 ). 8)p(x.O') .< ~log n p(X.358 5. 8 ) 0 . under F oo ' and conclude that .i(X.18 . i I £"+7. I (c) Prove (5. .50)..7.4."('1(8) .(X. .
7).14. (a) Show that.4. Then (5. ~. gives B Compare with the exact bound of Example 4.5 hold. f is a is an asymptotic lower confidence bound for fJ. if X = 0 or 1. which agrees with (4.4.7.5. (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ).4.2. (a) Show that under assumptions (AO)(A6) for all Band (A4').:vell . (Ii) Deduce that g. Compare Theorem 4. 11. hence.6 and Slutsky's theorem. We say that B >0 nI Show that 8~I and.4.4. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5.3).56). Hint: Use Problems 5. 10. = X.4.4. (Ii) Suppose the conditions of Theorem 5. Let B be two asymptotic l. Hint: Use Problem 5.6 Problems and Complements 359 8. B' _n + op (n 1/') 13. ~ 1 L i=I n 8'[ ~ 8B' (Xi. which is just (4. all the 8~i are at least as good as any competitors. Hint.4.B lower confidence bounds.3.4. . 9.59).4. Consider Example 4. Let B~ be as in (5.2. Establish (5. B). (Ii) Compare the bounds in Ca) with the bound (5.4. setting a lower confidence bound for binomial fJ. A.59) coincide. at all e and (A4').59). (a) Establish (5.2.5 and 5.54) and (5.4.57) can be strengthened to: For each B E e.a. for all f n2 nI n2 .4.Q" is asymptotically at least as good as B if.4.58). the bound (5.4. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 .61).4.61) for X ~ 0 and L 12. and give the behavior of (5.6.Section 5.4. Let [~11. B' nj for j = 1.
1.•. 2. variables with mean zero and variance 1(0).01 .. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : . Consider testing H : j.i. 1) distribution.5 ! 1. (a) Show that the test that rejects H for large values of v'n(X .. 1. (b) Suppose that I' ). That is. A exp . . I' has aN(O. ! I • = '\0 versus K : . ! (a) Show that the posterior probability of (OJ is where m n (.I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox").i..2.2.=2[1  <1>( v'nIXI)] has a I U(O.. Hint: By (3.. (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation. J1. That is. Establish (5. phas a U(O. T') distribution.'" XI! are ij. N(Jt. if H is false. 00. Now use the central limit theorem. inverse Gaussian with parameters J. where Jl is known.d. N(I'.4.riX) = T(l + nr 2)1/2rp (I+~~.)) and that when I' = LJ.1. > ~.L and A.2.. where .d. is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0. 1 15.. the evidence I ..21J2 2x A} . 1.)l/2 Hint: Use Examples 3.LJ.6.\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO. " " I I . Show that (J l'. Let X 1> .I). X > 0. 1). (d) Find the Wald test for testing H : A = AO versus K : A < AO. I .4.4.. . each Xi has density ( 27rx3 A ) 1/2 {AX + J..\ > o. is a given number. Suppose that Xl. LJ.\ < Ao. = O.A 7" 0.i. Show that (J/(J l'.d. 7f(1' 7" 0) = 1 .) has pvalue p = <1>(v'n(X . 0 given Xl. 1) distribution.. 1 .\ l . . (e) Find the Rao score test for testing H : . X n be i.1 and 3.t = 0 versus K : J.d.10) and (3. the pvalue fJ .11).360 1 Asymptotic Approximations Chapter 5 14.LJ.] versus K . Consider the Bayes test when J1. X n i.55). Consider the problem of testing H : I' E [0. . (c) Suppose that I' = 8 > O.5. Problems for Section 5. Jl > OJ . the test statistic is a sum of i. By Problem 4. .1. =f:.
Show that the posterior probability of H is where an = n/(n + I).052 100 .s. Establish (5.4. ~ L N(O.6 Problems and Complements 361 (b) Suppose that J. L M M tqn(t)dt ~ 0 a. Suppose that in addition to the conditions of Theorem 5. Hint: By (5.I).5.)}.2. 4. i ~. Extablish (5.01 < 0 } continuThen 00.s.13) and the SLLN..Section 5.2. oyn.050 logdnqn(t) = ~ {I(O).058 20 .034 .) sufficiently large. Hint: In view of Theorem 5. 1) and p ~ L U(O. (e) Verify the following table giving posterior probabilities of 1.for M(.5.. ifltl < ApplytheSLLN and 0 ous at 0 = O..sup{ :. By Theorem 5.2 and the continuity of 7f( 0).. Hint: 1 n [PI .( Xt.Oo»)+log7f(O+ J. yn(E(9 [ X) .5. [0.~~(:.17).042 ..8) ~ 0 a.5..(Xi. ~ E.0'): iO' .5.} dt < .645 and p = 0. for all M < 00.5...S..046 .:.i It Iexp {iI(O)'. J:J').054 50 .l has a N(O.5. P.05.1. Apply the argnment used for Theorem 5. 1) prior. ~l when ynX = n 1'>=0. yn(anX   ~)/ va. 10 .i Itlqn(t)dt < J:J').1 is not in effect.0 3.O') : 100'1 < o} 5. By (5.2 it is equivalent to show that J 02 7f (0)dO < a.' " .I(Xi. (Lindley's "paradox" of Problem 5.029 . all .) (d) Compute plimn~oo p/pfor fJ. Fe.O'(t)). (e) Show that when Jl ~ ~..1 I'> = 1. 0' (i» n L 802 i=l < In n ~sup {82 8021(X i .(Xi.17).14).
Pn where Pn a or 1. J I . Suppose that in Theorem 5. .7 NOTES Notes for Section 5. for all 6. .5.) ~ 0 and I jtl7r(t)dt < co. Notes for Section 5. von Misessee Stigler (1986) and Le Cam and Yang (1990).29).5. dn Finally. (0» (b) Deduce (5. . (O)<p(tI.1 .1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 . ~ II• '.5. A. I: .5. I : ItI < M} O. L . " . 5. .S. 7. The sets en (c) {t .. For n large these do not correspond to distributions one typically faces. Fisher (1925). ~ 0 a.) by A4 and A5. . convergence replaced by convergence in Po probability.4 (1) This result was first stated by R.) and A5(a..".5. . . (I) If the rightband side is negative for some x.9) hold with a.s.d) for some c(d).5 " I i .I Notes for Section 5. ! ' (2) Computed by Winston Cbow. Show tbat (5.s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) .s. See Bhattacharya and Ranga Rao (1976) for further discussion. . . Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5. .s. . .16) noting that vlnen. • . I i.f. (1) This famous result appears in Laplace's work and was rediscovered by S.8) and (5.I(Xi}»} 7r(t)dt. . A proof was given by Cramer (1946). 'i roc J. (a) Show that sup{lqn (t) . i=l }()+O(fJ) I Apply (5. Finally. qn (t) > c} are monotone increasing in c.1. all d and c(d) / in d.!. Bernstein and R. Fn(x) is taken to be O. jo I I .3 ).2 we replace the assumptions A4(a. • Notes for Section 5. 362 f Asymptotic Approximations Chapter 5 > O.5.
5.8 REFERENCES

BERGER, J., Statistical Decision Theory and Bayesian Analysis New York: Springer-Verlag, 1985.

BHATTACHARYA, R. N., AND R. RANGA RAO, Normal Approximation and Asymptotic Expansions New York: Wiley, 1976.

BILLINGSLEY, P., Probability and Measure New York: J. Wiley & Sons, 1979.

BOX, G. E. P., "Non-normality and tests on variances," Biometrika, 40, 318-324 (1953).

CRAMER, H., Mathematical Methods of Statistics Princeton, NJ: Princeton University Press, 1946.

DAVID, F. N., Tables of the Correlation Coefficient Cambridge: Cambridge University Press, 1938; reprinted in Biometrika Tables for Statisticians (1966), Vol. I, 3rd ed., H. O. Hartley and E. S. Pearson, Editors.

FERGUSON, T. S., A Course in Large Sample Theory New York: Chapman and Hall, 1996.

FISHER, R. A., "Theory of statistical estimation," Proc. Camb. Phil. Soc., 22, 700-725 (1925).

FISHER, R. A., Statistical Methods and Scientific Inference New York: Hafner, 1958.

HAMMERSLEY, J. M., AND D. C. HANDSCOMB, Monte Carlo Methods London: Methuen & Co., 1964.

HOEFFDING, W., "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13-30 (1963).

HUBER, P., The Behavior of the Maximum Likelihood Estimator Under Non-Standard Conditions, Proc. Vth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I Berkeley, CA: University of California Press, 1967.

LE CAM, L., AND G. L. YANG, Asymptotics in Statistics, Some Basic Concepts New York: Springer, 1990.

LEHMANN, E. L., Elements of Large-Sample Theory New York: Springer-Verlag, 1999.

LEHMANN, E. L., AND G. CASELLA, Theory of Point Estimation New York: Springer-Verlag, 1998.

NEYMAN, J., AND E. L. SCOTT, "Consistent estimates based on partially consistent observations," Econometrica, 16, 1-32 (1948).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

RUDIN, W., Mathematical Analysis, 3rd ed. New York: McGraw-Hill, 1987.

SCHERVISH, M., Theory of Statistics New York: Springer, 1995.

SERFLING, R. J., Approximation Theorems of Mathematical Statistics New York: J. Wiley & Sons, 1980.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.

WILSON, E. B., AND M. M. HILFERTY, "The distribution of chi square," Proc. Nat. Acad. Sci., U.S.A., 17, 684 (1931).
Chapter 6

INFERENCE IN THE MULTIPARAMETER CASE

6.1 INFERENCE FOR GAUSSIAN LINEAR MODELS

Most modern statistical questions involve large data sets, the modeling of whose stochastic structure involves complex models governed by several, often many, real parameters and frequently even more semi- or nonparametric models. In this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates, tests, and confidence regions in regular one-dimensional parametric models for d-dimensional models {P_θ : θ ∈ Θ}, Θ ⊂ R^d. We have presented several such models already, for instance, the multinomial (Examples 1.6.7, 2.3.3), multiple regression models (Examples 1.1.4, 2.1.1), and more generally have studied the theory of multiparameter exponential families (Sections 1.6.2, 2.2, 2.3). However, with the exception of Theorems 5.2.2 and 5.3.3, in which we looked at asymptotic theory for the MLE in multiparameter exponential families, we have not considered asymptotic inference, testing, confidence regions, and prediction in such situations.

We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible. We shall show how the exact behavior of likelihood procedures in this model corresponds to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular d-dimensional parametric models, and we shall illustrate our results with a number of important examples.

There is, however, an important aspect of practical situations that is not touched by the approximation: the fact that d, the number of parameters, and n, the number of observations, are often both large and commensurate or nearly so. This chapter is a lead-in to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non- and semiparametric models. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for function-valued statistics, the properties of nonparametric MLEs, curve estimates, the bootstrap, and efficiency in semiparametric models. The inequalities of Vapnik-Chervonenkis, Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II.
6.1.1 The Classical Gaussian Linear Model

Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Y_i among n independent observations has a distribution that depends on known constants z_i1, ..., z_ip. In the classical Gaussian (normal) linear model this dependence takes the form

    Y_i = Σ_{j=1}^p z_ij β_j + ε_i,  i = 1, ..., n    (6.1.1)

where ε_1, ..., ε_n are i.i.d. N(0, σ²). In vector and matrix notation, we write

    Y = Zβ + ε    (6.1.2)

and

    ε ~ N(0, σ²J)    (6.1.3)

where Z = (z_ij)_{n×p} and J is the n × n identity matrix. Here Y_i is called the response variable, the z_ij are called the design values, and Z is called the design matrix. In this section we will derive exact statistical procedures under the assumptions of the model (6.1.1)-(6.1.3). These are among the most commonly used statistical techniques. It turns out that they are sensible and useful outside the narrow framework of the model; in Section 6.6 we will investigate the sensitivity of these procedures to the assumptions of the model.

Notational Convention: In this chapter we will, when there is no ambiguity, let expressions such as β refer to both column and row vectors.

Here is Example 1.1.2(4) in this framework.

Example 6.1.1. The One-Sample Location Problem. We have n independent measurements Y_1, ..., Y_n from a population with mean β_1 = E(Y). The model is

    Y_i = β_1 + ε_i,  i = 1, ..., n    (6.1.4)

where ε_1, ..., ε_n are i.i.d. N(0, σ²). Here p = 1 and Z_{n×1} = (1, ..., 1)^T.

Example 6.1.2. Regression. We consider experiments in which n cases are sampled from a population, and for each case, say the ith case, we have a response Y_i and a set of p - 1 covariate measurements denoted by z_i2, ..., z_ip. We are interested in relating the mean of the response to the covariate values. The regression framework of Examples 1.1.4 and 2.1.1 is also of the form (6.1.3). The normal linear regression model is

    Y_i = β_1 + Σ_{j=2}^p z_ij β_j + ε_i,  i = 1, ..., n    (6.1.5)

where β_1 is called the regression intercept and β_2, ..., β_p are called the regression coefficients.
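In computational terms, the model (6.1.1)-(6.1.3) is easy to set up and simulate. The following minimal Python sketch builds a design matrix of the regression form (6.1.5) and draws a response vector; the sample size, the covariates (a hypothetical dose and age, echoing the drug illustration later in this section), and the coefficient values are invented for illustration and are not from the text.

    import numpy as np

    rng = np.random.default_rng(0)                # arbitrary seed
    n, sigma = 50, 2.0                            # illustrative values
    dose = rng.uniform(0.0, 10.0, n)              # hypothetical covariate z_i2
    age = rng.uniform(20.0, 70.0, n)              # hypothetical covariate z_i3
    Z = np.column_stack([np.ones(n), dose, age])  # design matrix, with z_i1 = 1
    beta = np.array([1.0, 0.5, -0.2])             # hypothetical true coefficients
    eps = rng.normal(0.0, sigma, n)               # errors, i.i.d. N(0, sigma^2)
    Y = Z @ beta + eps                            # the linear model (6.1.2)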
. and so on.3.•. Yn1 correspond to the group receiving the first treatment. . then the notation (6.. Then for 1 < j < p. Ynt +n2 to that getting the second.3 and Section 4./.4.6) is an example of what is often called analysis ojvariance models.3. Yn1 + 1 .1.2) and (6. 1 < k < p.Way Layout.~) random variables. TWosample models apply when the design values represent a qualitative factor taking on only two values. we are interested in qualitative factors taking on several values.9.6) where Y kl is the response of the lth subject in the group obtaining the kth treatment. one from each population. n.. If we are comparing pollution levels. we arrive at the one~way layout or psample model.. /3p are called the regression coefficients.. . 0 Example 6.1fwe set Zil = 1. If the control and treatment responses are independent and nonnally distributed with the same variance a 2 .3 we considered experiments involving the comparisons of two population means when we had available two independent samples.. We treat the covariate values Zij as fixed (nonrandom). The random design Gaussian linear regression model is given in Example 1. . To fix ideas suppose we are interested in comparing the performance of p > 2 treatments on a population and that we administer only one treatment to each subject and a sample of nk subjects get treatment k.1. To see that this is a linear model we relabel the observations as Y1 . · · .1.Yn . Generally. .6) is often reparametrized by introducing ( l ~ pl I:~~1 (3k and Ok = 13k ."i = 1.. The model (6.. and so on. (6. We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional distribution of Y given a set of observed covariate values.0: because then Ok represents the difference between the kth and average treatment . The model (6.. + n p = n.5) is called the fixed design norrnallinear regression model. .1. (6.1. where YI. In this case. The pSample Problem Or One. if no = 0.1 Inference for Gaussian Linear Models 367 where (31 is called the regression intercept and /32. we often have more than two competing drugs to compare. .Section 6. and the €kl are independent N(O. this terminology is commonly used when the design values are qualitative. . In Example I.3) applies. . the design matrix has elements: 1 if L jl nk + 1<i < L nk k=1 j k=l ootherwise and z= o 0 ••• o o Ip where I j is a column vector of nj ones and the 0 in the "row" whose jth member is I j is a column vector of nj zeros.1. nl + . Frequently. we want to do so for a variety of locations.1. 13k is the mean response to the kth treatment.
The model (6.1.6) is often reparametrized by introducing α = p^{-1} Σ_{k=1}^p β_k and δ_k = β_k - α, because then δ_k represents the difference between the kth and the average treatment effects. In terms of the new parameter β* = (α, δ_1, ..., δ_p)^T, the linear model is

    Y = Z*β* + ε,  ε ~ N(0, σ²J)

where Z*_{n×(p+1)} = (1, Z) and 1_{n×1} is the vector with n ones. Note that Z* is of rank p and that β* is not identifiable for β* ∈ R^{p+1}. However, β* is identifiable in the p-dimensional linear subspace {β* ∈ R^{p+1} : Σ_{k=1}^p δ_k = 0} of R^{p+1} obtained by adding the linear restriction Σ_{k=1}^p δ_k = 0 forced by the definition of the δ_k's. This type of linear model, with the number of columns d of the design matrix larger than its rank r, and with the parameters identifiable only once d - r additional linear restrictions have been specified, is common in analysis of variance models.

The Canonical Form of the Gaussian Linear Model

The linear model can be analyzed easily using some geometry. The parameter set for β is R^p and the parameter set for μ is

    ω = {μ = Zβ : β ∈ R^p}.

Note that ω is the linear space spanned by the columns

    c_j = (z_1j, ..., z_nj)^T,  j = 1, ..., p,

of the design matrix; that is, μ = Zβ = Σ_{j=1}^p β_j c_j. Let r denote the number of linearly independent c_j; then r is the rank of Z and ω has dimension r. We assume that n > r. Even if β is not a parameter (is unidentifiable), the vector of means μ of Y always is. It follows that the parametrization (β, σ²) is identifiable if and only if r = p (Problem 6.1.17).

Because dim ω = r, there exists (e.g., by the Gram-Schmidt process; see Section B.3.2) an orthonormal basis v_1, ..., v_n for R^n such that v_1, ..., v_r span ω. Recall that orthonormal means v_i^T v_j = 0 for i ≠ j and v_i^T v_i = 1; when v_i^T v_j = 0, we call v_i and v_j orthogonal. Note that any t ∈ R^n can be written

    t = Σ_{i=1}^n (v_i^T t) v_i

and that

    t ∈ ω  ⟺  t = Σ_{i=1}^r (v_i^T t) v_i  ⟺  v_i^T t = 0, i = r + 1, ..., n.    (6.1.7)

We now introduce the canonical variables and means

    U_i = v_i^T Y,  η_i = E(U_i) = v_i^T μ,  i = 1, ..., n,

which are sufficient for (μ, σ²).
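Numerically, an orthonormal basis v_1, ..., v_n with v_1, ..., v_r spanning ω can be produced in several ways; the following minimal sketch uses the left singular vectors of Z (the text's Gram-Schmidt construction gives another valid basis). The function name is ours.

    import numpy as np

    def canonical_coordinates(Z, Y):
        """Rank r of Z and the canonical variables U = A Y, where the rows of A
        form an orthonormal basis of R^n whose first r vectors span omega."""
        V, svals, _ = np.linalg.svd(Z, full_matrices=True)   # columns of V: v_1, ..., v_n
        r = int(np.sum(svals > 1e-10 * max(svals[0], 1.0)))  # numerical rank of Z
        return r, V.T @ Y                                    # A = V^T is orthogonal

    # By (6.1.7): mu-hat = sum_{i<=r} U_i v_i and |Y - mu-hat|^2 = sum_{i>r} U_i^2.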
6.1.2 Estimation

We first consider the σ² known case.

Theorem 6.1.1. In the canonical form of the Gaussian linear model with σ² known:

(i) T = (U_1, ..., U_r)^T is sufficient for η = (η_1, ..., η_r)^T.

(ii) U_1, ..., U_r is the MLE of η_1, ..., η_r.

(iii) U_i is the UMVU estimate of η_i, i = 1, ..., r.

(iv) If c_1, ..., c_r are constants, then the MLE of α = Σ_{i=1}^r c_i η_i is α̂ = Σ_{i=1}^r c_i U_i, and α̂ is also UMVU for α.

(v) The MLE of μ is μ̂ = Σ_{i=1}^r v_i U_i, and μ̂_i is UMVU for μ_i, i = 1, ..., n.

Proof. Let A_{n×n} be the orthogonal matrix with rows v_1^T, ..., v_n^T. Then we can write U = AY and

    η = Aμ,  μ = A^{-1}η = A^T η,  Y = A^{-1}U = A^T U,    (6.1.8)

so observing U and Y is the same thing and μ and η are equivalently related. By Theorem B.3.2,

    Var(U) = A(σ²J)A^T = σ²J,    (6.1.9)

so U_1, ..., U_n are independent normal with common variance σ² and means

    η_i = E(U_i) = v_i^T μ, i = 1, ..., n, with η_i = 0 for i = r + 1, ..., n    (6.1.10)

because μ ∈ ω, while (η_1, ..., η_r)^T varies freely over R^r. The log likelihood ℓ(η, σ²) based on U is

    ℓ(η, σ²) = -(1/2σ²) Σ_{i=1}^n u_i² + (1/σ²) Σ_{i=1}^r η_i u_i - (1/2σ²) Σ_{i=1}^r η_i² - (n/2) log(2πσ²).    (6.1.11)

(i) is clear by observation: (6.1.11) is an exponential family with natural sufficient statistic T. (ii) follows because (6.1.11), as a function of η, is a sum of quadratics -(u_i - η_i)²/2σ² and is maximized by setting η̂_i = u_i, i = 1, ..., r. (iii) follows from the theory of UMVU estimation in exponential families (Section 3.4). For (iv), assume that at least one c is different from zero (otherwise the claim is trivial) and, rescaling, assume without loss of generality that Σ_{i=1}^r c_i² = 1. By Gram-Schmidt orthogonalization, there exists an orthonormal basis w_1, ..., w_n of R^n with w_1 = c = (c_1, ..., c_r, 0, ..., 0)^T ∈ R^n. Let W_i = w_i^T U; then W ~ N(ξ, σ²J), the distribution of W is an exponential family, and W_1 = α̂ is sufficient for ξ_1 = α and is UMVU for its expectation α. By the invariance of the MLE, α̂ is also the MLE of α. (v) follows from (iv). ∎

Next we consider the case in which σ² is unknown and assume n ≥ r + 1.
Theorem 6.1.2. In the canonical Gaussian linear model with σ² unknown:

(i) T = (U_1, ..., U_r, Σ_{i=r+1}^n U_i²)^T is sufficient for (η_1, ..., η_r, σ²)^T.

(ii) The MLE of σ² is n^{-1} Σ_{i=r+1}^n U_i².

(iii) s² = (n - r)^{-1} Σ_{i=r+1}^n U_i² is an unbiased estimator of σ² and is UMVU for σ².

(iv) U_1, ..., U_r are the MLEs of η_1, ..., η_r, and the conclusions (ii)-(v) of Theorem 6.1.1 are still valid.

Proof. (i) By (6.1.11), the distribution of U is an exponential family with sufficient statistic (U_1, ..., U_r, Σ_{i=1}^n U_i²). But because Σ_{i=1}^n U_i² = Σ_{i=1}^r U_i² + Σ_{i=r+1}^n U_i², this statistic is equivalent to T, and (i) follows. (We could also apply the theory of canonical exponential families directly to (6.1.11).) To show (ii), recall that the maximum of (6.1.11) over η for fixed σ² is attained at η̂_i = U_i, i = 1, ..., r, so that we need to maximize

    -(n/2) log(2πσ²) - (1/2σ²) Σ_{i=r+1}^n U_i²

as a function of σ². The maximizer is easily seen to be n^{-1} Σ_{i=r+1}^n U_i². (iii) Because U_{r+1}, ..., U_n are i.i.d. N(0, σ²), (n - r)^{-1} Σ_{i=r+1}^n U_i² is an unbiased estimator of σ²; the UMVU property follows from completeness of the sufficient statistic in this exponential family. (iv) is clear from the argument for (ii) together with the proofs in Theorem 6.1.1. ∎
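As a quick numerical companion to Theorem 6.1.2, the two variance estimates can be computed from the residual sum of squares, which equals Σ_{i=r+1}^n U_i² in the canonical coordinates; this anticipates the projection formulas derived next. A minimal sketch, with names of our choosing:

    import numpy as np

    def variance_estimates(Z, Y):
        """MLE and unbiased estimates of sigma^2 from the residual sum of squares."""
        beta_hat, _, r, _ = np.linalg.lstsq(Z, Y, rcond=None)
        rss = float(np.sum((Y - Z @ beta_hat) ** 2))  # |Y - mu-hat|^2 = sum_{i>r} U_i^2
        n = len(Y)
        return rss / n, rss / (n - r)                 # (MLE of sigma^2, s^2)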
Projections

We next express μ̂ in terms of Y, obtain the MLE β̂ of β, and give a geometric interpretation of μ̂. To this end, define the norm |t| of a vector t ∈ R^n by |t|² = Σ_{i=1}^n t_i².

Definition 6.1.1. The projection Y_0 = π(Y | ω) of a point Y ∈ R^n on ω is the point

    Y_0 = arg min {|Y - t|² : t ∈ ω}.

We have

Theorem 6.1.3. In the Gaussian linear model:

(i) μ̂ is the unique projection of Y on ω and is given by

    μ̂ = Σ_{i=1}^r v_i U_i = Zβ̂;    (6.1.12)

(ii) μ̂ is orthogonal to Y - μ̂;

(iii) s² = |Y - μ̂|²/(n - r);    (6.1.13)

(iv) if p = r, then β is identifiable, the MLE = LSE of β is unique and given by

    β̂ = (Z^T Z)^{-1} Z^T Y,    (6.1.14)

and β̂ = (Z^T Z)^{-1} Z^T μ̂;

(v) β̂_j is the UMVU estimate of β_j, j = 1, ..., p, and μ̂_i is the UMVU estimate of μ_i, i = 1, ..., n.

Proof. The MLE μ̂ maximizes

    log p(Y, β, σ) = -(1/2σ²)|Y - Zβ|² - (n/2) log(2πσ²)

over μ = Zβ ∈ ω, or, equivalently,

    β̂ = arg min {|Y - Zβ|² : β ∈ R^p}.

That is, the MLE of β equals the least squares estimate (LSE) of β defined in Example 2.1.1 and Section 2.2.1. (i) is clear because μ̂ = Σ_{i=1}^r v_i U_i by Theorem 6.1.1(v), and, by (6.1.7), this is the projection of Y on ω. (ii) and (iii) are also clear from Theorems 6.1.1 and 6.1.2 because Y - μ̂ = Σ_{i=r+1}^n v_i U_i. To show (iv), note that the space ω⊥ of vectors s orthogonal to ω can be written as

    ω⊥ = {s ∈ R^n : s^T(Zβ) = 0 for all β ∈ R^p}.

Thus β^T(Z^T s) = 0 for all β ∈ R^p, which implies Z^T s = 0 for all s ∈ ω⊥. Because Y - μ̂ ∈ ω⊥, Z^T(Y - μ̂) = 0 and hence Z^T Y = Z^T μ̂ = Z^T Zβ̂. Because Z has full rank p, Z^T Z is nonsingular, β is identifiable, and

    β̂ = (Z^T Z)^{-1} Z^T Y = (Z^T Z)^{-1} Z^T μ̂,

which gives (6.1.14). (v) β̂_j and μ̂_i are UMVU because, by Theorems 6.1.1 and 6.1.2, any linear combination of U's is a UMVU estimate of its expectation, and any linear combination of Y's is also a linear combination of U's. ∎

Note that in Example 2.2.2 we gave an alternative derivation of (6.1.14) via the normal equations (Z^T Z)β = Z^T Y. Suppose we are given a value of the covariate vector z at which a value Y following the linear model (6.1.3) is to be taken. The best MSPE predictor of Y if β is known as well as z is E(Y) = z^T β, and its best (UMVU) estimate not knowing β is Ŷ = z^T β̂. Taking z = z_i, we obtain μ̂_i = z_i β̂, 1 ≤ i ≤ n. In this method of "prediction" of Y_i, it is common to write Ŷ_i for μ̂_i, the ith component of the fitted value μ̂. That is,

    Ŷ = HY, where H = Z(Z^T Z)^{-1} Z^T.

The matrix H is the projection matrix mapping R^n into ω. In statistics it is also called the hat matrix because it "puts the hat on Y." As a projection matrix, H is necessarily symmetric and idempotent,

    H^T = H,  H² = H.

The estimate μ̂ = Zβ̂ of μ is called the fitted value, and ε̂ = Y - μ̂ is called the residual from this fit. The goodness of the fit is measured by the residual sum of squares (RSS) |Y - μ̂|². Example 2.2.2 illustrates this terminology in the context of Example 6.1.2 with p = 2: there the points ŷ_i = β̂_1 + β̂_2 z_i, i = 1, ..., n, lie on the regression line fitted to the data {(z_i, y_i) : 1 ≤ i ≤ n}, and the |y_i - (β̂_1 + β̂_2 z_i)| are the vertical distances from the points to the fitted line.

Next note that the residuals can be written as

    ε̂ = Y - Ŷ = (J - H)Y,    (6.1.15)

where J = J_{n×n} is the identity matrix. It follows from this that

    Var(Ŷ) = σ²H,  Var(ε̂) = σ²(J - H).    (6.1.16)

We can now conclude the following.

Corollary 6.1.1. In the Gaussian linear model:

(i) the fitted values Ŷ = μ̂ and the residual ε̂ are independent;

(ii) Ŷ ~ N(μ, σ²H);

(iii) ε̂ ~ N(0, σ²(J - H));

(iv) if p = r, β̂ ~ N(β, σ²(Z^T Z)^{-1}).
Proof. (Ŷ, ε̂) is a linear transformation of U and, hence, jointly Gaussian. The independence in (i) follows from the identification of μ̂ and ε̂ in terms of the U_i in Theorem 6.1.3: μ̂ depends only on U_1, ..., U_r and ε̂ only on U_{r+1}, ..., U_n. (ii) and (iii) restate (6.1.16); and Var(β̂) = σ²(Z^T Z)^{-1} follows because β̂ = (Z^T Z)^{-1} Z^T Y is a linear function of Y. ∎

We now return to our examples.

Example 6.1.1. One Sample (continued). Here μ̂ = Ȳ1, β̂ = β̂_1 = Ȳ, and ε̂ = Y - μ̂. Thus, in the Gaussian model, the unbiased estimator s² of σ², which we have seen before, is Σ_{i=1}^n (Y_i - Ȳ)²/(n - 1).

Example 6.1.2. Regression (continued). If the design matrix Z has rank p, then the MLE = LSE estimate is β̂ = (Z^T Z)^{-1} Z^T Y, as seen before in Example 2.1.1 and Section 2.2.1. We now see that the MLE of μ is μ̂ = Zβ̂ and that, in the Gaussian case, β̂_j and μ̂_i are UMVU for β_j and μ_i, j = 1, ..., p, i = 1, ..., n, respectively. The error variance σ² = Var(ε_1) can be unbiasedly estimated by s² = (n - p)^{-1}|Y - μ̂|².

Example 6.1.3. The One-Way Layout (continued). In this example the normal equations (Z^T Z)β = Z^T Y become

    n_k β_k = Σ_{l=1}^{n_k} Y_kl,  k = 1, ..., p,

where n = n_1 + ... + n_p, and we can write the least squares estimates as

    β̂_k = Ȳ_k.,  k = 1, ..., p.

At this point we introduce an important notational convention in statistics: if {c_ijk...} is a multiple-indexed sequence of numbers or variables, then replacement of a subscript by a dot indicates that we are considering the average over that subscript. Thus, in the Gaussian model, β̂_k = Ȳ_k. is the UMVU estimate of β_k, k = 1, ..., p. Moreover,

    α̂ = β̂. = p^{-1} Σ_{k=1}^p β̂_k  (not Ȳ.., in general)

is the UMVU estimate of the average effect α of all the treatments, and the UMVU estimate of the incremental effect δ_k = β_k - α of the kth treatment is

    δ̂_k = Ȳ_k. - α̂,  k = 1, ..., p.

The variances of β̂ and Ŷ are given in Corollary 6.1.1.

Remark 6.1.1. An alternative approach to the MLEs for the normal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors ε_1, ..., ε_n in (6.1.1) have the Laplace distribution with density

    f(x) = (1/2τ) exp{-|x|/τ},  x ∈ R, τ > 0.

In this case the estimates of β and μ are least absolute deviation estimates (LADEs), obtained by minimizing the absolute deviation distance Σ_{i=1}^n |Y_i - z_i^T β|. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEs; see Stigler (1986). However, the LADEs are obtained fairly quickly by modern computing methods; see Koenker and D'Orey (1987) and Portnoy and Koenker (1997). The LSEs are preferred because of ease of computation and their geometric properties.

6.1.3 Tests and Confidence Intervals

The most important hypothesis-testing questions in the context of a linear model correspond to restriction of the vector of means μ to a linear subspace of the space ω. For instance, in the context of Example 6.1.2, in a study to investigate whether a drug affects the mean of a response such as blood pressure, we may consider a regression equation of the form

    mean response = μ_i = β_1 + β_2 z_i2 + β_3 z_i3,  i = 1, ..., n,    (6.1.17)

where z_i2 is the dose level of the drug given the ith patient, z_i3 is the age of the ith patient, and the matrix ||z_ij||_{n×3} with z_i1 = 1 has rank 3. Now we would test H : β_2 = 0 versus K : β_2 ≠ 0. Thus, under H, the mean vector is an element of the space {μ : μ_i = β_1 + β_3 z_i3, i = 1, ..., n}, which is a two-dimensional linear subspace of the full model's three-dimensional linear subspace of R^n given by (6.1.17).

Next consider the p-sample model of Example 6.1.3 with β_k representing the mean response for the kth population. The first inferential question is typically "Are the means equal or not?" Thus we test H : β_1 = ... = β_p = β for some β ∈ R versus K : "the β's are not all equal." Under H, μ_i = β ∈ R, i = 1, ..., n, which is a one-dimensional subspace of R^n, whereas for the full model μ is in a p-dimensional subspace of R^n.

In general, we let ω correspond to the full model with dimension r and let ω_0 be a q-dimensional linear subspace over which μ can range under the null hypothesis H, q < r.
l9) then. 2 log . .20) i=q+I 210g. when H holds. ..21 ) where fLo is the projection of fL on woo In particular. V r span wand set .. by Theorem 6. by Theorem 6. then r. But if we let A nx n be an orthogonal matrix with rows vi.'" (}r)T (see Problem B.q {(}2) for this distribution.1. .Section 6_1 Inference for Gaussian Linear Models 375 for testing H : fL E Wo versus j{ : JL E W  woo Because (6. o . .. 1) distribution with OJ = 'Ida. Note that (uda) has a N(Oi.2(v).\(Y) Proof We only need to establish the second equality in (6._q' = AJL where A L 'I. = IJL i=q+l r JLol'.1. . 2 log .IY .J X.. Proposition 6. We have shown the following..iio 2 1 } where i1 and flo are the projections of Yon wand wo.1..iii' .q( (}2) distribution with 0 2 =a 2 ' \ '1]i L.. We write X.1. Write 'Ii is as defined in (6. In this case the distribution of L:~q+1 (Uda? is called a chisquare distribution with r .\(Y) has a X. i=q+l r = a 21 fL  Ito [2 (6.\(Y) = L i=q+l r (ud a )'.1. v~' such that VI.1..I8) then. span Wo and VI.. .1.\(Y) ~ exp {  2t21Y .\(Y) It follows that = exp 1 2172 L '\' Ui' (6.1.1.l2).3.q degrees offreedom and noncentrality parameter Ej2 = 181 2 = L:=q+ I where 8 = «(}q+l. V q (6. In the Gaussian linear model with 17 2 known..21). respectively.19).1.1. . (}r.
{io.'I/L.376 Inference in the Multiparameter Case Chapter 6 Next consider the case in which a 2 is unknown. IIY . we can write . For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >. .nr distribution.l) {Ir . In particular.L n where p{y.5) that if we introduce the variance equal likelihood ratio w'" .m{O') for this distrihution where k = r .1./L. We know from Problem 6._q(02) distribution with 0' = ..itol 2 = L~ q+ 1 (Ui /0) 2./to I' ' n J respectively.. Thus.1 suppose the assumption "0.1. is poor compared to the fit under the general model..• j .r){r .21ii./L.1.q and n . has the noncentral F distribution F r _q. In Proposition 6.max{p{y. it can be shown (Problem 6.I}. We have shown the following. T has the representation .'}' YJ. We write :h.1.n_r(02) where 0 2 = u.iT'): /L E w} . statistic :\(y) _ max{p{y. T = n .liol' IY r _q IY _ iii' iii' (r .3.1.'IY _ iii' = L~ r+l (Ui /0)2. J • T = (noncentral X:.14).2. T is called the F statistic for the general linear hypothesis.r.1 that 021it .22) Because T = {n . Proposition 6. I . which is equivalent to the likelihood ratio statistic for H : Ii. E In this case.0..r IY . = aD IIY . The distribution of such a variable is called the noncentral F distribution with noncentrality parameter 0 2 and r . We have seen in PmIXlsition 6. ]t consists of rejecting H when the fit.23) .E Wo for K : Jt E W .22). Remark 6. (6.18).itol 2 have a X.1. . In the Gaussian linear model the F statistic defined by (6.liol' (n _ r) 'IY _ iii' .') denotes the righthand side of (6./Lol'. .2 is the same under Hand K and estimated by the MLE 0:2 for /. /L.iT') : /L E wo} (6. we obtain o A(Y) = P Y. T is an increasing function of A{Y) and the two test statistics are equivalent.1.1.JLo. which has a X~r distribution and is independent of u. as measured by the residual sum of squares under the model specified by H. Substituting j).2 is known" is replaced by '.q and m = n . and 86 into the likelihood ratio statistic.1 that the MLEs of a 2 for It E wand It E wQ are ..wo. 8 2 .19).I. By the canonicaL representation (6.1.1..2 1JL .J. T has the (central) Frq..q variable)/df (central X~T variahle)/df with the numerator and denominator independent. when H I lwlds.a:.r degrees affreedam (see Prohlem B.q)'{[A{Y)J'/n . The resulting test is intuitive.JtoI 2 .~I..q)'Ili . . .a = ~(Y'E.L a =  n I' and .2.(Y).1.
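Proposition 6.1.2 translates directly into a computing recipe: project Y onto the two subspaces, form T as in (6.1.22), and refer it to the F_{r-q,n-r} distribution. A minimal sketch (names ours), assuming scipy is available:

    import numpy as np
    from scipy.stats import f as f_dist

    def f_test(Z_full, Z_null, Y):
        """F statistic (6.1.22) and p-value for H: mu in omega_0 vs K: mu in omega - omega_0,
        where omega and omega_0 are the column spaces of Z_full and Z_null."""
        def project(Z):
            b, _, rank, _ = np.linalg.lstsq(Z, Y, rcond=None)
            return Z @ b, rank
        mu_hat, r = project(Z_full)
        mu0_hat, q = project(Z_null)
        n = len(Y)
        rss_full = float(np.sum((Y - mu_hat) ** 2))
        rss_null = float(np.sum((Y - mu0_hat) ** 2))
        T = ((rss_null - rss_full) / (r - q)) / (rss_full / (n - r))
        return T, float(f_dist.sf(T, r - q, n - r))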
1. Y3 y Iy . The canonical representation (6. See Pigure 6.Section 6.IO.24) Remark 6.1.iLol' = IY .\(Y) = .1.1. One Sample (continued). In this = (Y 1'0)2 (nl) lE(Y. (6.1'0 I' y.1 and Section B.1 Inference for Gaussian Linear Models ~ 377 then >.2.1. q = 0.(Y) equals the likelihood ratio statistic for the 0'2.1. r = 1 and T = 1'0 versus K : (3 'f 1'0.iLol'.22).19) made it possible to recognize the identity IY .r)ln central X~_cln where T is the F statistic (6.9.q noncentral X? q 2Iog. 0 . The projections il and ilo of Yon w and Wo.3.iLl' + liL . Example 6. Yl = Y2 Figure 6. and the Pythagorean identity.~T = (n . We next return to our examples.1.1. We test H : (31 case wo = {I'o}.Y)2' which we recognize as t 2 / n.iLl' ~++Yl I' I' 1 . where t is the onesample Student t statistic of Section 4. (6.25) which we exploited in the preceding derivations.1. This is the Pythagorean identity.1. It follows that ~ (J2 known case with rT 2 replaced by r.
q covariates in multiple regression has no effect after fitting the first q. Without loss of generality we ask whether the last p - q covariates have an effect after fitting the first q. To formulate this question, we partition the design matrix Z by writing it as Z = (Z_1, Z_2), where Z_1 is n × q and Z_2 is n × (p - q), and we partition β as β^T = (β_1^T, β_2^T), where β_2 is a (p - q) × 1 vector of main (e.g., treatment) effect coefficients and β_1 is a q × 1 vector of "nuisance" (e.g., age, economic status) coefficients. Now the linear model can be written as

    Y = Z_1 β_1 + Z_2 β_2 + ε.    (6.1.26)

We test H : β_2 = 0 versus K : β_2 ≠ 0. In this case β̂ = (Z^T Z)^{-1} Z^T Y and β̂_0 = (Z_1^T Z_1)^{-1} Z_1^T Y are the MLEs of the coefficients under the full model (6.1.26) and H, respectively. Using (6.1.22), we can write the F statistic version of the likelihood ratio test in the intuitive form

    F = [(RSS_H - RSS_F)/(df_H - df_F)] / [RSS_F/df_F]    (6.1.27)

where RSS_F = |Y - μ̂|² and RSS_H = |Y - μ̂_0|² are the residual sums of squares under the full model and H, respectively, and df_F = n - p and df_H = n - q are the corresponding degrees of freedom. The F test rejects H if F is large when compared to the 1 - α quantile of the F_{p-q,n-p} distribution. Under the alternative, F has a noncentral F_{p-q,n-p}(δ²) distribution with noncentrality parameter (Problem 6.1.7)

    δ² = σ^{-2} β_2^T [Z_2^T Z_2 - Z_2^T Z_1 (Z_1^T Z_1)^{-1} Z_1^T Z_2] β_2.

In the special case that Z_1^T Z_2 = 0, so the variables in Z_1 are orthogonal to the variables in Z_2, δ² simplifies to σ^{-2} β_2^T (Z_2^T Z_2) β_2, which only depends on the second set of variables and coefficients. However, in general δ² depends on the sample correlations between the variables in Z_1 and those in Z_2. This issue is discussed further in Example 6.2.1.

Example 6.1.3. The One-Way Layout (continued). For the one-way layout we test H : β_1 = ... = β_p, the hypothesis that the p treatment means are equal. Recall that the least squares estimates of β_1, ..., β_p are Ȳ_1., ..., Ȳ_p.. Under H all the observations have the same mean, so that

    μ̂_0 = Ȳ.. 1

and

    |μ̂ - μ̂_0|² = Σ_{k=1}^p Σ_{l=1}^{n_k} (Ȳ_k. - Ȳ..)² = Σ_{k=1}^p n_k (Ȳ_k. - Ȳ..)².

Substituting in (6.1.22), we obtain the F statistic for the hypothesis H in the one-way layout:

    T = [(n - p)/(p - 1)] [Σ_{k=1}^p n_k (Ȳ_k. - Ȳ..)²] / [Σ_{k=1}^p Σ_{l=1}^{n_k} (Y_kl - Ȳ_k.)²].
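The one-way F statistic just displayed can be computed directly from the group samples; a minimal sketch (names ours):

    import numpy as np
    from scipy.stats import f as f_dist

    def one_way_anova(groups):
        """SS_B, SS_W, the F statistic T, and its p-value for H: beta_1 = ... = beta_p."""
        groups = [np.asarray(g, dtype=float) for g in groups]
        n, p = sum(len(g) for g in groups), len(groups)
        grand = np.concatenate(groups).mean()
        ss_b = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
        ss_w = sum(float(((g - g.mean()) ** 2).sum()) for g in groups)
        T = (ss_b / (p - 1)) / (ss_w / (n - p))
        return ss_b, ss_w, T, float(f_dist.sf(T, p - 1, n - p))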
n. (3p)T and its projection 1'0 = (ij. 888 =L k=I P nk(Yk.fJ. .)'. . The sum of squares in the numerator.np distribution. (3" .p) of the (n 1) degrees offreedom of SST /0" as "cooling" from S8w/a'. .30) can also be viewed stochastically.···..Section 6. identifying 0' and (p .)' . This information as well as S8B/(P .29) Thus. which is their ratio. We assume the oneway layout is valid. we have a decomposition of the variability of the whole set of data...25) 88T = 888 + 88w .2 1M . [f we define the total sum of squares as 88T =L l' L(Y nk k'  Y.. . T has a noncentral Fp1. sum of squares in the denominator. As an illustration. . . the total sum of squares. Ypnp' The 88w ~ L L(Y Y p "' k'  k )'.fLo 12 for the vector Jt (rh. are not all equal. respectively. T has a Fpl.1 and 6. (31.. Yi nl . SSB..1) and 8Sw /(n . the between groups (or treatment) sum of squares and SSw. k=1 1=1 which measures the variability of the pooled samples. into two constituent components.np distribution with noncentrality parameter (6.1) degrees of freedom and noncentrality parameter 0'.. and the F statistic.p).1) and (n .1.1.1.1. are often summarized in what is known as an analysis a/variance (ANOVA) table. (3p.. SST/a 2 is a (noncentral) X2 variable with (n . is a measure of variation between the p samples YII . k=l 1=1 measures variation within the samples. Note that this implies the possibly unrealistic . .p) degrees of freedom. . Ypl .. compute a. . and III with I being the "high" end... .1) degrees offreedom as "coming" from SS8/a' and the remaining (n .Y. . the unbiased estimates of 02 and a 2 . . ijjT = There is an interesting way of looking at the pieces of infonnation summarized by the F statistic. See Tables 6...1 Inference for Gaussian Linear Models 379 When H holds..3. To derive IF. consider the following data(I) giving blood cholesterol levels of men in three different socioeconomic groups labeled I. then by the Pythagorean identity (6.1. .. SST. (6. Because 88B/0" and SSw / a' are independent X' variables with (p . we see that the decomposition (6. If the fJ.28) where j3 = n I :2:=~"" 1 nifJi.1. the within groups (or residual) sum of squares.
TABLE 6.1.1. ANOVA table for the one-way layout

  Source            Sum of squares                              d.f.    Mean squares           F-value
  Between samples   SS_B = Σ_k n_k (Ȳ_k. - Ȳ..)²                p - 1   MS_B = SS_B/(p - 1)    MS_B/MS_W
  Within samples    SS_W = Σ_k Σ_l (Y_kl - Ȳ_k.)²               n - p   MS_W = SS_W/(n - p)
  Total             SS_T = Σ_k Σ_l (Y_kl - Ȳ..)²                n - 1
TABLE 6.1.2. Blood cholesterol levels

   I     II    III
  403   312   403
  311   222   244
  269   302   353
  336   420   235
  259   420   319
        386   260
        286
        290
        353
        210
We assume the one-way layout is valid. Note that this implies the possibly unrealistic assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol levels of the three groups. Here p = 3, n_1 = 5, n_2 = 10, n_3 = 6, n = 21, and we compute Table 6.1.3.
TABLE 6.1.3. ANOVA table for the cholesterol data

  Source            SS         d.f.   MS        F-value
  Between groups     1,202.5    2       601.2   0.126
  Within groups     85,750.5   18     4,763.9
  Total             86,953.0   20
From F tables, we find that the p-value corresponding to the F-value 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups.
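The computation behind Table 6.1.3 can be checked with the one_way_anova sketch given earlier. The assignment of the values of Table 6.1.2 to the three groups below is our reconstruction of the table and should be verified against the original; with it, the sums of squares and F-value agree with Table 6.1.3 up to rounding.

    g1 = [403, 311, 269, 336, 259]                           # group I   (n_1 = 5)
    g2 = [312, 222, 302, 420, 420, 386, 286, 290, 353, 210]  # group II  (n_2 = 10)
    g3 = [403, 244, 353, 235, 319, 260]                      # group III (n_3 = 6)
    ss_b, ss_w, T, p_val = one_way_anova([g1, g2, g3])
    # ss_b is about 1202, ss_w about 85751, T about 0.126, p_val about 0.88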
,
,I "
,
,
,
Confidence Intervals and Regions

We next use our distributional results and the method of pivots to find confidence intervals for μ_i, 1 ≤ i ≤ n, for β_j, 1 ≤ j ≤ p, and, in general, for any linear combination

    ψ = ψ(μ) = Σ_{i=1}^n a_i μ_i = a^T μ

of the μ's. If we set

    ψ̂ = Σ_{i=1}^n a_i μ̂_i = a^T μ̂

and

    σ²(ψ̂) = Var(ψ̂) = σ² a^T H a,

where H is the hat matrix, then (ψ̂ - ψ)/σ(ψ̂) has a N(0, 1) distribution. Moreover,

    (n - r)s²/σ² = |Y - μ̂|²/σ² = Σ_{i=r+1}^n (U_i/σ)²

has a χ²_{n-r} distribution and is independent of ψ̂. Let

    s(ψ̂) = s (a^T H a)^{1/2}

be an estimate of the standard deviation σ(ψ̂) of ψ̂. This estimated standard deviation is called the standard error of ψ̂. By referring to the definition of the t distribution, we find that the pivot

    T(ψ) = (ψ̂ - ψ)/s(ψ̂)

has a T_{n-r} distribution. Let t_{n-r}(1 - α/2) denote the 1 - α/2 quantile of the T_{n-r} distribution; then by solving |T(ψ)| ≤ t_{n-r}(1 - α/2) for ψ, we find that

    ψ = ψ̂ ± t_{n-r}(1 - α/2) s(ψ̂)

is, in the Gaussian linear model, a 100(1 - α)% confidence interval for ψ.

Example 6.1.1. One Sample (continued). Consider ψ = μ. We obtain the interval

    μ = Ȳ ± t_{n-1}(1 - α/2) s/√n,

which is the same as the interval of Example 4.4.1 and Section 4.9.2.

Example 6.1.2. Regression (continued). Assume that p = r. First consider ψ = β_j for some specified regression coefficient β_j. The 100(1 - α)% confidence interval for β_j is

    β_j = β̂_j ± t_{n-p}(1 - α/2) s {[(Z^T Z)^{-1}]_jj}^{1/2},
where [(Z^T Z)^{-1}]_jj is the jth diagonal element of (Z^T Z)^{-1}. Computer software computes (Z^T Z)^{-1} and labels s{[(Z^T Z)^{-1}]_jj}^{1/2} as the standard error of the (estimated) jth regression coefficient.

Next consider ψ = μ_i = mean response of the ith case, 1 ≤ i ≤ n. The level (1 - α) confidence interval is

    μ_i = μ̂_i ± t_{n-p}(1 - α/2) s √h_ii,

where h_ii is the ith diagonal element of the hat matrix H. Here s√h_ii is called the standard error of the (estimated) mean of the ith case.

Next consider the special case in which p = 2 and

    Y_i = β_1 + β_2 z_i2 + ε_i,  i = 1, ..., n.

If we use the identity

    Σ_{i=1}^n (z_i2 - z̄.2)(Y_i - Ȳ) = Σ_{i=1}^n (z_i2 - z̄.2) Y_i,

we obtain from Example 2.2.2 that

    β̂_2 = Σ_{i=1}^n (z_i2 - z̄.2) Y_i / Σ_{i=1}^n (z_i2 - z̄.2)².    (6.1.30)

Because Var(Y_i) = σ², we obtain

    Var(β̂_2) = σ² / Σ_{i=1}^n (z_i2 - z̄.2)²,

and the 100(1 - α)% confidence interval for β_2 has the form

    β_2 = β̂_2 ± t_{n-2}(1 - α/2) s / [Σ_{i=1}^n (z_i2 - z̄.2)²]^{1/2}.

The confidence interval for β_1 is given in Problem 6.1.10. Similarly, in the p = 2 case, it is straightforward (Problem 6.1.10) to compute

    h_ii = 1/n + (z_i2 - z̄.2)² / Σ_{i=1}^n (z_i2 - z̄.2)²,

and the confidence interval for the mean response μ_i of the ith case has a simple explicit form.
Example 6.1.3. OneWay Layout (continued). We consider 'I/J = 13k. 1 < k .6k = Yk. ~ N(.6k, (J'lnk), we find the 100(1  a)% confidence interval
~
< p. Because
•
,
I
= 7J. ± t n  p (1 ~a) slj'nj; where 8' = SSwl(np). The intervals for I' = .6. and the incremental effect Ok =.6k1'
.6k
are given in Problem 6.1.11. 0
j
j
"I
I
i
,
I
I
Do
l
Section 6.2
Asymptotic Estimation Theory in p Dimensions
383
Joint Confidence Regions
We have seen how to find confidence intervals for each individual (3j, 1 < j < p. We next consider the problem of finding a confidence region C in RP that covers the vector /3 with prescribed probability (1  0:). This can be done by inverting the likelihood ratio test or equivalently the F test That is, we let C be the collection of f3 0 that is accepted when the level (1  a) F test is used to test H : (3 ~ (30' Under H, /' = /'0 = Z(3o; and the numerator of the F statistic (6.1.22) is based on
Iii  /'01'
C "
=
Izi3 
Z(3ol' =
(13  (30)T(Z T Z)(i3 (30)
(30)
Thus, using (6.1.22), the simultaneous confidence region for
f3 is the ellipse
= {(30 .. (13  (30)T(Z 2 Z)(i3 rs
T
< f r,ni"" (1 _ 1 20
'J}
. .T
(6.1.31)
where fr,nr
(1  40:)
is the 1  !o quantile of the :Fr,nr distribution.
Example 6.1.2. Regression (continued). We consider the case p = r and as in (6.1.26) write Z = (ZJ, Z2) and /3T = (f3j, {3f), where f32 is a vector of main effect coefficients and f3 1is a vector of "nuisance" coefficients. Similarly, we partition {3 as {3 = ({31 ,(32 ) where (31 is q x 1 and (3, is (p  q) x 1. By Corollary 6.1.1, O"(ZTZ) is the variancecovariance matrix of {3. It follows that if we let 8 denote the lower right (p  q) x (p ~ q) comer of (ZTZ)I, then (728 is the variancecovariance matrix of 132' Thus, a joint 100(1  0:)% confidence region for {32 is the p  q dimensional ellipse
~ ~ ~
.T ",T
C={(3
0' .
. (i3,(302)TSl(i3,_(3o,) <f
() 2
pqs

pq,np
(11
20:·
l}
o
Summary. We consider the classical Gaussian linear model in which the resonse Yi for (3jZij of the ith case in an experiment is expressed as a linear combination J1i = covariates plus an error fi, where Ci, ... 1 f n are i.i.d. N (0, (72). By introducing a suitable orthogonal transfonnation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transfonnation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients {(3j}, the resJXlnse means {J1i}, and linear combinations of these.
LJ=l
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for onedimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
384'
f ",'c:,c:""""ce=:::;'=''''h':.M=,'"';,,pa::.'::'m'''e::.t::"...:C::':::",..::Cc::h,:,p:::'e::'~6
6.2.1
Estimating Equations
OUf assumptions are as before save that everything is made a vector: X!, ... , X n are i.i.d. Pwhere P E Q, a model containing P = {PO: 0 E e} such that
(i)
e open C RP.
e.
(ii) Densities of P, are pC 0),9 E
1 , I
The following result gives the general asymptotic behavior of the solution of estimating equations.
AO. 'I'
=(,p" ... , ,pp)T where,p) ~ g:., is well defined and
 L >I'(X n.
1=1
1
n
..
i,
On) = O.
(6.2.1)
A solution to (6.2.1) is called an estimating equation estimate or an M estimate.
AI. The parameter 8( P) given by the solution of (the nonlinear system of p equations in p
unknowns):
J
A2. Epl'l'(X" 0(P)1 2
>I'(x, O)dP(x)
=0
(6.2.2)
, , ,
I
is well defined on Q so that O(P) is the unique solution of (6.2.2). Necessarily O(PO) because Q => P.
=0
I , ,
< 00 where I· I is the Euclidean nonn.
I I:
.' , ,
,
I
A3. 'l/Ji{', 8), 1 < i < P. have firstorder partials with respect to all coordinates and using the notation of Section B.8,
I
where
': I
,
~
.,
i
~~
l:iI ~
is nonsingular.
A4. sup
..
{I ~ L~ 1 (D>I'(X"
p
t)  D>I'(Xi , O(P))) I: It  O(P)I < En} ~ 0 if En ~ O.
, .I ,
,
I
I
AS. On ~ O(P) for all P E Q.
Theorem 6.2.1. Under AOA5 ofthis section
8n = O(P) + where
L iii(X n.
t=l
1
n
i,
O(P))
+ op(n 1 / 2 )
(6.2.3)
i..I , ,
• •
iii(x,o(p))
= (EpD>I'(X" 0(P)W 1 >1'(x, O(P)).
(6.2.4)
b
Section 6.2
Asymptotic Estimation Theory in p Dimensions
385
Hence,
(6.2.5)
where
E(q" P) ~ J(O. p)Eq,q,T(X" O(p))F (0, P)
and
81/J~
J
(6.2.6)
r'(o,p)
= EpDq,(X"O(P))
~
E p 011 (X"O(P))
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
1
 n
2.:= q,(Xi , O(P)) = n 2.:= Dq,(Xi,0~)(9n
i=l
n
1
n
 O(P)).
(6.2.7)
i=l
Note that the lefthand side of (6.2.7) is a p x 1 vector, the right is the product of a p x p matrix and a p x 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular p x p matrices, when viewed as vectors, is an open , subset of RP , representable, for instance, as the Set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that with probability tending to 1, ~ l::~ I Dq,(Xi , 6~) is nonsingular. Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of On is motivated by p, the behavior in (6.2.3) is guaranteed for P E Q, which can include P <1c P. In fact, typically Q is essentially the set of P's for which O(P) can be defined uniquely by (6.2.2). We can again extend the assumptions of Section 5.4.2 to: A6. If 1(,0) is differentiabLe
EODq,(X"O)

EOq,(X" O)DI(X" 0) CovO(q,(X" 0), DI(X, , 0))
(6.2.8)
defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the onedimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A61 extend to the multivariate case readily. Note that consistency of On is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however. be shown that with probability tending to 1, a rootfinding algorithm starting at a consistent estimate 6~ will find a solution On of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
386
Inference in the Multiparameter Case
Chapter 6
6,2,2
comes
Asymptotic Normality and Efficiency of the MLE
[fwe take p(x,O)
=
l(x,O)
= 10gp(x,0), and >I>(x,O)
obeys AOA6, then (62,8) be.
T
l
EODl (X" O)D l( X 1,0» VarODl(X"O)
(62,9)
where
j 1 ,
, ,
is the Fisher information matrix I(e) introduced in Section 3.4. If p: e _ R, e c R d , is a scalar function, the matrix)! 8~~P(}J (e) is known as the Hessian or curvature matrix of the sutface p. Thus, (6.2.9) stateS that the expected value of the Hessian of l is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
Theorem 6.2.2. If AOA6 holdfor p(x, 0)
,1, , ,
I
j
=10gp(x, 0), then the MLE  satisfies On
i,
(62,10)
(62,11)
On ~ 0+  :Lr'(O)DI(Xi,O) + op(n"/')
n
i=1
1
n
,,
so that
is a minimum contrast estimate with p and 'f/J satisfying AOA6 and corresponding asymptotic variance matrix E('I1, Pe), then
"
"
If en
"
, ,
E(>I>,PO)
On
> r'(O)
(62,12)
in the sense of Theorem 3.4.4 with equality in (6,2,12) for 0
,
= 0 0 iff, undRr 0 0 ,
(6,2,13)
~
On + op(n 1/2 ),
, i ,
•
,
Proof. The proofs of (6,2,10) and (6,2,11) parallel those of (5.4.33) and (5.4,34) exactly, The proof of (6,2,12) parallels that of Theorem 3.4.4, For completeness we give it Note that hy (6,2,6) and (6,2,8)
I
" :I I.
1" ,
where U >I>(Xt,O). V Var(U T , VT)T nonsingular Var(V)
=
E(>I>,PO)
= CovO'(U, V)VarO(U)CovO'(V, U)
~
(62,14)
DI(X1,0),
But hy (B,lO.8), for any U,V with
(6,2.15)
> Cov(U, V)Var1(U)Cov(V, U),
Taking inverses of both sides yields
r1(0)
= Var01(V) < E(>I>,O).
(6.2.16)
• • !
Section 6.2
Asymptotic Estimation Theory in p Dimensions
387
Equality holds in (6.2.15) by (B. 10.2.3) iff for some b U
= b(O)
(6.2.17)
= b + Cov(U, V)Var1(V)V with probability 1. This means in view of Eow = EODl = 0 that
w(X"O) = b(0)Dl(X1,0).
In the case of identity in (6.2.16) we must have
[EODw(X 1, OW'W(X 1 , 0)
=
r1(0)DI(X" 0).
(6.2.18)
Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds.
o
apxl, a T 8 n
~
We see that, by the theorem, the MLE Is efficient in the sense that for any has asymptotic bias o(n1/2) and asymptotic variance nlaT ]1(8)a, which is n.? larger than that of any competing minimum contrast estimate. Further any competitor 8 n such that aTO n has the same asymptotic behavior as a T 8 n for all a in fact agrees with On to ordern 1/2


A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6
on the asymptotic nonnality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example. Example 6.2.1. The Linear Model with Stochastic Covariates. Let Xi = (Zr, Yi)T. 1 < i < n, be ij.d. as X = (ZT, Y) T where Z is a p x 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:
(il
(6.2.19) where, is distributed as N(O, (),2) independent of Z and E(Z) Z, Y has aN (a + ZT [3, (),2) distrihution.
= O.
That is, given
(ii) The distribution Ho of Z is known with density h o and E(ZZT) is nonsingular.
The second assumption is unreasonable but easily dispensed with. It readily follows (Prohlem 6.2.6) that the MLE of [3 is given by (with probability 1) [3 = [Z(n}Z(n}]
T
~
lT
Zen} Y.
(6.2.20)
Here Zen) is the n x p matrix IIZij ~ Z.j II where z.j = ~ 1 Zij. We used subscripts (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Zen) = (Zl, .. , 1 Zn)T is referred to as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of a and ()'2 are
p
2:7
Ci
=Y
 ~ Zj{3j, (j
J=l
"

2
.
1 ~2 = IY  (Ci + Z(n)[3)1 .
n
(6.2.21 )
I
388
~
Inference in the Multiparameter Case
Chapter 6
Note that although given ZI," " Zn, (3 is Gaussian, this is not true of the marginal distribution of {3. It is not hard to show that AOA6 hold in this case because if H o has density k o and if 8 denotes (a,{3T,a 2 )T. then
~
j
1
j
I(X,8) Dl(X,8)
and
 20'[Y  (a + Z T 13W
(
1
)
2(logo'
1
)
+ log21T) + logho(z)
(6.2.22)
;2'
1
i,
Z ;" 20 4
(.2
o o
1)
I
.,,
a 2
1(8) =
I
0 0' E(ZZT)
0
,
o o
20 4
(6.2.23)
i ,
,
1
so that by Theorem 6.2.2
iI
I'
L(y'n(ii 
a,(3  13,8'  0 2»
~ N(O,diag(02,02[E(ZZ T W',20 4 ».
(6.2.24)
1
!
I
!
,,
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of H o 2 known plays no role in the limiting result for a,j3,Ci • Of course, these will only be the MLEs if H o depends only on parameters other than (a, f3,( 2 ). In this case we can estimate E(ZZT) by ~ L~ 1 ZiZ'[ and give approximate confidence intervals for (3j. j = 1 .. ,po " An interesting feature of (6.2.23) is that because 1(8) is a block diagonal matrix so is II (6) and, consequently, f3 and 0'2 are asymptotically independent. In the classical linear model of Section 6.1 where we perfonn inference conditionally given Zi = Zi, 1 < i < n, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew 0 2 , the MLE would still be and its asymptotic varianc~ optimal for this model. If we knew a and 13. ;;2 would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, 0=2 would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model P = {P(9,"} : 8 E e, '/ E £}
~
j,
1 ,
l
I ,
1 ,
1
13
. ,
:
~
•
L
.' ,
we say we can estimate B adaptively at 'TJO if the asymptotic variance of the MLE (J (or more generally, an efficient estimate of e) in the pair (e, iiJ is lhe same as that of e(,/o), the efficient estimate for 'P'I7o = {P(9,l'jo) : E 8}. The possibility of adaptation is in fact rare. though it appears prominently in this way in the Gaussian linear model. In particular consider estimating {31 in the presence of a, ({32 ... , /3p ) with
~ ~
e
(i) a, (3" ... , {3p known. (ii)
13 arbitrary.
/32
... = {3p = O. LetZi
In case (i), we take, without loss of generality, a = (ZiI, ... , ZiP) T. then the efficient estimate in case (i) is
,
•• "
I
;;n _ L~l Zit Yi Pi n 2
Li=l Zil
(6,2.25)
, , ,
, I
.
I
Section 6.2
Asymptotic Estimation Theory in p Dimensions
389
with asymptotic variance (T2[EZlJ1. On the other hand, [31 is the first coordinate of {3 given by (6.2.20). Its asymptotic variance is the (1,1) element of O"'[EZZTjl, which is strictly bigger than ,,' [EZfj1 unless [EZZTj1 is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate [31 adaptively if {3z, .. . , (3p are regarded as nuisance parameters. What is happening can be seen by a representation of [Z'fn) Z(n)ll Zen) Y and [11(11) where ['(II) Ill'j(lI) II· We claim that


=
2i
fJl
= 2:~_l(Z"
",n
 Zill))y; (Z _ 2(11 ),
tl
t
(6.2.26)
L....t=1
where Z(1) is the regression of (Zu, ... , Z1n)T on the linear space spanned by (Zj1,' . " Zjn)T, 2 < j < p. Similarly.
[11 (II)
= 0"'/ E(ZlI
 I1(ZlI
1
Z'l,"" Zp,»)'
(6.2.27)
where II(Z11 I ZZ1,' .. , Zpl) is the projection of ZII on the linear span of Z211 .. , , Zpl (Problem 6.2.11). Thus, I1(ZlI I Z'lo"" Zpl) = 2:;=, Zjl where (ai, ... ,a;) minimizes E(Zll  2: P _, ajZjl)' over (a" ... , ap) E RPI (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing [3" ... , f3p when the variables Z2, .. . ,Zp are in any way correlated with Z1 and the price is measured by
a;
[E(Zll  I1(Zll I Z'lo"" Zpl)'jl = ( _ E(I1(Zll 1 Z'l, ... ,ZPI)),)I '''''';:;f;c;o,~''"'''''1 , E(ZII) E(ZII)
(6.2.28) In the extreme case of perfect collinearity the price is 00 as it should be because (31 then becomes unidentifiable. Thus, adaptation corresponds to the case where (Z2, . .. , Zp) have no value in predicting Zl linearly (see Section 1.4). Correspondingly in the Gaussian linear model (6.1.3) conditional on the Zi, i = 1, ... , n, (31 is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if E(Zu  I1(ZlI I Z'l, ... , Zpt})' = O. 0

Example 6.2.2. M Estimates Generated by Linear Models with General Error Structure. Suppose that the €i in (6.2.19) are ij.d. but not necessarily Gaussian with density ~fo (~). for instance, e x
fo(x)
~
=
(1 + eX)"
the logistic density. Such error densities have the often more realistic, heavier tails(l) than the Gaussian density. The estimates {301 0'0 now solve
and
390
Inference in the Multiparameter Case
Chapter 6
f' ~ ~ where1/; = 'X;,X(Y) =  (iQ)~ Yfo(y)+1 ,(30 (fito, ... ,ppO)T The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if
=
(i) log fa is strictly concave, i.e.,
*
is strictly decreasing.
(ii) (log 10)" exists and is bonnded.
Then, if further fa is symmetric about 0,
I(IJ)
cr"I((3T, I)
= cr"
where
Cl
(C 1 Z ) E(t
~)
(6.2.29)
, ,
= J (f,(x»)' lo(x)dx, c, = J (xf,(x) + I)' 10 (x)dx.
~
Thus, ,Bo, ao are opti
mal estimates of {3 and (l in the sense of Theorem 6.2.2 if fo is true. Now suppose fa generating the estimates Po and (T~ is symmetric and satisfies (i) and (ii) but the true error distribution has density f possibly different from fo. Under suitable conditions we can apply Theorem 6.2.1 with
1 ,
,
i
where
i
1f;j(Z,y,(3,(J)
,
;
"
~ 1f; (y  L~1 Zkpk )
, I
<j < p
(6.2.30)
!
,
•
> (y  L~1
to conclude that
where (301 ao solve
I 1
j
Zkpk )
1
I
I
I
L:o( y'n(,Bo  (30)) ~ N(O, I: ('I', P») L:( y'n(a  cro) ~ N(O, (J'(P))
1 •
,
(6.2.31)
• ,
J'I'(y zT(30)dP=O
• ,
I
p
II . ,.I'
ii
I:
and E('I', P) is as in (6.2.6). What is the relation between (30' (Jo and (3, (J given in the Gaussian model (6.2.19)? If 10 is symmetric about 0 and the solution of (6.2.31) is unique, . then (30 = (3. But (Jo = c(fo)q for some, c(to) typically different from one. Thus, (30 can be used for estimating (3 althougll if t1j. true distribution of the is N(O, (J') it should ' perform less well than (3. On the Qther hand, o is an estimate of (J only if normalized by a constant depending on 10. (See Problem 6.2.5.) These are issues of robustness, that
~

a
'i
1:
is, to have a bounded sensitivity curve (Section 3~5. Problem 3.5.8), we may well wish to use a nonlinear bounded '.Ii = ('!f;11'" l ,pp)T to estimate f3 even though it is suboptimal when € rv N(O, (12). and to use a suita~ly nonnalized version of {To for the same purpose. One effective choice of.,pj is the Hu~er function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. 0
•
Section 6. typically there is an attempt at "exact" calculation. . Wald tests (a generalization of pivots). and Rao's tests. . A5(a.s. say. However.3.5. This approach is refined in Kass. under PeforallB. The problem arises also when.19).32) a.2. A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems. Ifthe multivariate versions of AOA3. fJ p vary freely. All of these will be developed in Section 6.. The consequences of Theorem 6. the equivalence of Bayesian and frequentist optimality asymptotically. When the estimating equations defining the M estimates coincide with the likelihood . The three approaches coincide asymptotically but differ substantially. .2.2.2. We have implicitly done this in the calculations leading up to (5.2. as is usually the case. Op). See Schervish (1995) for some of the relevant calculations. and interpret I . for fixed n. because we then need to integrate out (03 . in perfonnance and computationally. the likelihood ratio principle. up to the proportionality constant f e . Confidence regions that parallel the tests will also be developed in Section 6.(t) n~ 1 p(Xi . .3 are the same as those of Theorem 5.19) (Laplace's method). A4(a. 1T(8) rr~ 1 P(Xi1 8).I as the Euclidean nonn in conditions A7 and A8. we are interested in the posterior distribution of some of the parameters..s. We simply make 8 a vector. Again the two approaches differ at the second order when the prior begins to make a difference. A new major issue that arises is computation.3.3 The Posterior Distribution in the Multiparameter Case The asymptotic theory of the posterior distribution parallels that in the onedimensional case exactly. 2 The asymptotic theory we have developed pennits approximation to these constants by the procedure used in deriving (5. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as H : fJ 1 < 0 versus K : fh > 0 where fJ 2 .).2 Asymptotic Estimation Theory in p Dimensions 391 Testing and Confidence Bounds There are three principal approaches to testing hypotheses in multiparameter models. Kadane. 6. O ). and Tierney (1989). .. Summary.3. ~ (6. say (fJ 1.. These methods are beyond the scope of this volume but will be discussed briefly in Volume II.s.8 we obtain Theorem 6.5. t)dt. Although it is easy to write down the posterior density of 8. the latter can pose a fonnidable problem if p > 2. ife denotes the MLE.5.) and A6A8 hold then. Using multivariate expansions as in B. We defined minimum contrast (Me) and M estimates in the case of pdimensional parameters and established their convergence in law to a nonnal distribution.
However. 6. X II . Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic i.. I H I! ~. I j I I .f/ : () E 8.9) .1 .s. 9 E 8" 8 1 = 8 .392 Inference in the Multiparameter Case Chapter 6 . . PO'  . to the N(O. A(X) simplified and produced intuitive tests whose critical values can be obtained from the Student t and :F distributions.d. In Section 6.B} has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of () in the family of models {PO. Wald and Rao large sample tests.1 we developed exact tests and confidence regions that are appropriate in re~ gression and anaysis of variance (ANOVA) situations when the responses are normally distributed. 9 E 8 0 } for testing H : 9 E 8 0 versus K . We present three procedures that are used frequently: likelihood ratio. In linear regression. 1 . covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to he appropriate approximations. .9). . We use an example to introduce the concept of adaptation in which an estimate f) is called adaptive for a model {PO. this result gives the asymptotic distribution of the MLE. ry E £} if the asymptotic distribution of In(() . 170 specified. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MD or AIestimate if we know the correct model P = {Po : BEe} and use the MLE for this model. and confidence procedures.110 : () E 8}. confidence regions. in many experimental situations in which the likelihood ratio test can be used to address important questions. Another example deals with M estimates based on estimating equations generated by linear models with nonGaussian error distribution. .3.i.I .denotes the MLE for PO' then the posterior distribution of yn(O if 0 r' (9)) distribution. under P. I .4.. '(x) = sup{p(x.•• .1 we considered the likelihood ratio test statistic.4 to vectorvalued parameters. However. as in the linear model.8 0 .4.". These were treated for f) real in Section 5. In these cases exact methods are typically not available. and other methods of inference. Finally we show that in the Bayesian framework where given 0. adaptive estimation of 131 is possible iff Zl is uncorre1ated with every linear function of Z2) . X n are i. 9 E 8} sup{p(x.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal.9 and 6. and we tum to asymptotic approximations to construct tests. 1 '~ J . 1 Zp. In this section we will use the results of Section 6. . the exact critical value is not available analytically. I . ! I In Sections 4. In such .3 • . we need methods for situations in which. LARGE SAMPLE TESTS AND CONFIDENCE REGIONS ~ 0) converges a.I . We shall show (see Section 6. and showed that in several statistical models involving normal distributions.4.2 to extend some of the results of Section 5. equations. "' ~ 6.
l. 0) = IT~ 1 p(Xi. 1. Y n be nn. exists and in Example 2.1 exp{ iJx )/qQ).l.randXi = Uduo. x ~ > 0. as X where X has the gamma. Q > 0..1.Xn are i.v'[. and we transform to canonical form by setting Example 6.. This is not available analytically.n.. 0 = (iX. Let YI .d.3. we conclude that under H. Using Section 6.2 we showed that the MLE. .. ... ~ X."" V r spanw.'··' J.3.i = 1. =6r = O..2 we showed how to find B as a nonexplicit solution of likelihood equations. 0 It remains to find the critical value. Wilks's approximation is exact. . As in Section 6. r and Xi rv N(O. such that VI.4. . qQ. = (J.1. .3..Wo where w is an rdimensional linear subspace of Rn and W ::> Wo.3.\(x) is available as p(x... In this u 2 known example. . E W ..1). Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations. . the X~_q distribution is an approximation to £(21og .i = 1. The MLE of f3 under H is readily seen from (2. V q span Wo and VI.i.3. 0) ~ iJ"x·. ~ In Example 2. •. .. .. We next give an example that can be viewed as the limiting situation for which the approximation is exact: independent with Yi rv N(Pi. where A nxn is an orthogonal matrix with rows vI. q < r. 0). iJo) is the denominator of the likelihood ratio statistic.n.\(Y) = L X. when testing whether a parameter vector is restricted to an open subset of Rq or R r . iJ > O. 2 log .. Suppose we want to test H : Q = 1 (exponential distribution) versus K : a i. Moreover. iJ). ~ ~ ~ ~ ~ The approximation we shall give is based on the result "2 log '\(X) .\(X) based on asymptotic theory. iJ). .1. Here is an example in which Wilks's approximation to £('\(X)) is useful: Example 6.q' i=q+1 r Wilks's theorem states that. 1).. under regularity conditions. SetBi = TU/uo. .tl.3 we test whether J.5) to be iJo = l/x and p(x. . . which is usually referred to as Wilks's theorem or approximation. . <1~) where {To is known. i = r+ 1.. Then Xi rvN(B i .1. i = 1.Section 6. distribution with density p(X. . .3 Large Sample Tests and Confidence Regions 393 cases we can turn to an approximation to the distribution of . 0 We iHustrate the remarkable fact that X~q holds as an approximation to the null distribution of 2 log A quite generally when the hypothesis is a nice qdimensional submanifold of an rdimensional parameter space with the following.£ X~" for degrees of freedom d to be specified later. the hypothesis H is equivalent to H: 6 q + I = . Thus. Suppose XL X 2 .\(Y)). The Gaussian Linear Model with Known Variance.2.tn)T is a member of a qdimensional linear subspace of woo versus the alternative that J. the numerator of ..
(Jol.3. where x E Xc R'.3. case with Xl.(J) Uk uJ rxr c'· • 1 . Then.. "'" (6. we can conclude arguing from A.q' Finally apply Lemma 5.7 to Vn = I:~_q+lXl/nlI:~ r+IX.1. ". = 2[ln«(Jn) In((Jo)] ~ ~ ~ £ X. c = and conclude that 210g .2. Because . I I i Example 6. !.2.3.\(Y) £ X. 0)..i..3.(X) ~ 1 . hy Corollary B..(Jo) In«(Jn)«(Jn .d.3.(Y) defined in Remark ° 6. : . an expansion of lnUI) about en evaluated at 8 = 0 0 gives 2[ln«(Jn) In((Jo)] = n«(Jn .3. X. If Yi are as in = (IL.(Jol. B). Apply Example 5.6.2 unknown case.2 but (T2 is unknown then manifold whereas under H.2 with 9(1) = log(l + I). and conclude that Vn S X. . ! .· . I1«(Jo)). . I(J~ . .. N(o. where DO is the derivative with respect to e. j .logp(X. o .Xn a sample from p(x. Hence.Onl + IOn  (Jol < 210n .3. I . In Section 6.. i\ I . .1 ((Jo)). .2. (6. 1 • ! We tirst consider the simple hypothesis H : 6 = 8 0 _ Theorem 6.(Jol < I(J~ .(Jo) for some (J~ with [(J~ .. ·'I1 . I( I In((J) ~ n 1 n L i=l & & &" &". Suppose the assumptions of Theorem 6.~ L H .394 Inference in the Multiparameter Case Chapter 6 1 .. Note that for J.2) V ~ N(o. . 2 log >'(Y) = Vn ". . y'ri(On . 0 Consider the general i." ._q as well. The Gaussian Linear Model with Unknown Variance. (72) ranges over an r + Idimensional Example 6.1) " I.1. where Irxr«(J) is the Fisher information matrix. ranges over a q + Idimensional manifold.. "'" T. V T I«(Jo)V ~ X. Write the log likelihood as e In«(J) ~ L i=I n logp(X i .. under H .(In[ < l(Jn . and (J E c W.2. .i=q+l 'C'" I=r+l Xi' ) X.3.2 are satisfied.3 and AA that that In((J~) "'" 2[ln((Jn) In((Jo)] £r ~ V I((Jo)V. 21og>.2.(Jo) ". The result follows because.3. we derived e e 2log >'(Y) ~ n log ( 1 + 2::" L. . Eln«(Jo) ~ I«(Jo)... an = n. I.. By Theorem 6.. Here ~ ~ j.1.. Proof Because On solves the likelihood equation DOln(O) = 0. (J = (Jo. .q also in the 0.
Let Po be the model {Po: 0 E eo} with corresponding parametn"zation 0(1) = ~(1) (1) ~(1) (8 1.n) /n(Oo)l.2[1. distribution.10) and (6. where e is open and 8 0 is the set of 0 E e with 8j = 80 . ~ {Oo : 2[/ n (On) /n(OO)] < x.3. (JO. OT ~ (0(1). 0(2) = (8. where x r (1.. 0 E e.3 Large Sample Tests and Confidence Regions 395 = As a consequence of the theorem.3..0:'.+l.j.)1'.(9 0 . and {8 0 .6) .4). SupposetharOo istheMLEofO underH and that 0 0 satisfiesA6for Po. Let 0 0 E 8 0 and write 210g >'(X) ~ = 2[1n(9 n )  In(Oo)] . (6. has approximately level 1. ~(11 ~ (6.Section 6. Examples 6. Furthennore.0(1) = (8 1" ". 10 (0 0 ) = VarO. the test that rejects H : () 2Iog>.4) It is easy to see that AOA6 for 10 imply AOA5 for Po. . Then under H : 0 E 8 0. Suppose that the assumptions of Theorem 6.0(2».n and (6.(X) 8 0 when > x. 0b2) = (80"+1"..0) is the 1 and C!' quantile of the X.j} are specified values.. Let 00.(] . By (6.)T.3. Next we tum to the more general hypothesis H : 8 E 8 0 .".2.(Ia).T.0 0 ).2.3.1) applied to On and the corresponding argument applied to 8 0 .3. T (1) (2) Proof. for given true 0 0 in 8 0 • '7=M(OOo) where. .n = (80 . 8 2)T where 8 1 is the first q coordinates of8.8.3) is a confidence region for 0 with approximat~ coverage probability 1 .35) where 8(00) = n 1/2 L Dl(X" 0) i=1 n and 8 = (8 1 .' ". Theorem 6.8.)T. 0)..1 and 6.. We set d ~ r ..81(00).3.a)) (6.2 hold for p(x.q.2 illustrate such 8 0 . dropping the dependence on 6 0 • M ~ PI 1 / 2 (6. j ~ q + 1. Make a change of parameter.a.8q ).3..3.2.80.
3. • 1 y{X) = sup{P{x. rejecting if .9).1> . 0 Note that this argument is simply an asymptotic version of the one given in Example 6. a piece of 8..1 because /1/2 Ao is the intersection of a q dirnensionallinear subspace of R r with JI/2{8 .2. This asymptotic level 1 . J) distribution by (6.In( 00 . Moreover.7). I I eo~ ~ ~ {(O. .3..(X) > x r . . under the conditions of Theorem 6.1 '1)..Tf(O)T .all (6..1 '1ll/sup{p(x.2. Such P exists by the argument given in Example 6. if A o _ {(} .0.II). Tbe result follows from (6.. ~ 'I.9) . with ell _ ..l.2.6) and (6.11) .8) that if T('1) then = 1 2 n. : • Var T(O) = pT r ' /2 II.00 ..8 0 : () E 8}.10) LTl(o) . Now write De for differentiation with respect to (J and Dry for differentiation with respect to TJ. (6.1ITD III(x. because in tenus of 7].(X) = y{X) where ! 1 1 .3.l.(1 .1I 0 + M.3. ••.1I0 + M. '1 E Me}. 1e ).(l . = 0. = ..+1 ~ .8.+I> . Me : 1]q+l = TJr = a}. 18q acting as r nuisance parameters. • ..I '1): 11 0 + M.396 Inference in the Multiparameter Case Chapter 6 and P is an orthogonal matrix such that. .0: confidence region is i . then by 2 log "(X) TT(O)T(O) .1/ 2 p = J.1I0 + M.1'1 '1 E eo} and from (B.00 : e E 8 0} MAD = {'1: '/.13) D'1I{x. which has a limiting X~q distribution by Slutsky's theorem because T(O) has a limiting Nr(O. (O) r • + op(l) (6. Note that. Or)J < xr_.a) is an asymptotically level a test of H : 8 E Of equal importance is that we obtain an asymptotic confidence region for (e q+ 1.LTl(o) + op(l) i=l i=l L i=q+l r Ti2 (0) + op(l)./ L i=I n D'1I{X" liD + M..+ 1 . His {fJ E applying (6.3. .... Thus.3.3..1 '1) = [M.3.3. IIr ) : 2[ln(lI n ) . by definition. ). . We deduce from (6.• I .5) to "(X) we obtain. is invariant under reparametrization .3..
More sophisticated examples are given in Problems 6..e:::S:::.12) The extension of Theorem 6.3. v r span " R T then..g.. Wilks's theorem depends critically On the fact that nOt only is open but that if giveu in (6.2 falls under this schema with 9. 2 e eo Ii o ~ (Xl + X 2) 2 + ~ 1 _ (X.:::m.. l B. We only need note that if WQ is a linear space spanned by an orthogonal basis v\. Suppose the MLE hold (JO. themselves depending on 8 q + 1 ...3. 8q o assuming that 8q + 1 .n under H is consistent for all () E 8 0. if A(X) is ~ the likelihood ratio statistic for H : 0 E 8 0 given in (6.13) Evideutly. More complicated linear hypotheses such as H : 6 . The formulation of Theorem 6.3..3.. Here the dimension of 8 0 and 8 is the same but the boundary of 8 0 has lower dimension. 2' 2 + X 2 )) 2 and 210g A(X) ~ xf but if 81 + 82 < 1 clearly 2 log A(X) ~ op(I).2 is still inadequate for most applications. J).3. Then. R.. which require the following general theorem.14) a hypothesis of the fonn (6. B2 .3. if 0 0 is true.'! are the MLEs. 9j .::. . will appear in the next section.o''C6::.2.. 8.5 and 6. Defiue H : 0 E 8 0 with e 80 = {O E 8 : g(O) = o}. (6.. Examples such as testing for independence in contingency tables.13) then the set {(8" .3). q + 1 < j < r.. . of f)l" . N(B 1 .( 0 ) ~ o} (6.8 0 E Wo where Wo is a linear space of dimension q are also covered.i.3.3.3. where J is the 2 x 2 identity matrix aud 8 0 = {O : B1 + O < I}. A(X) behaves asymptotically like a test for H : 0 E 8 00 where 800 = {O E 8: Dg(Oo)(O . If 81 + B2 = 1.8r are known. X.2 to this situation is easy and given in Problem 6.. q + 1 <j < r}. t.cTe:::..) be i... .d..3. It can be extended as follows. As an example ofwhatcau go wrong..(0) ~ 8j .p:::'e:.'':::':::"::d_C:::o:::.. "'0 ~ {O : OT Vj = 0.2)(6.q at all 0 E 8. The proof is sketched iu Problems (6. Suppose the assumptions of Theorem 6. q + 1 < j < r written as a vector g.. Theorem 6. Vl' are orthogonal towo and v" .13). .3..3...80."de"'::'"e:::R::e"g.2 and the previously conditions on g.::Se::.3. 8q)T : 0 E 8} is opeu iu Rq. (6.13)..:::f.3. . Theorem 6. . " ' J v q and Vq+l " .3. .... X~q under H.l:::. .o:::"'' 397 CC where 00. 210g A(X) ".t.3.. let (XiI. We need both properties because we need to analyze both the numerator and denominator of A(X).3. such that Dg(O) exists and is of rauk r . The esseutial idea is that. Suppose H is specified by: There exist d functions....6.3.
(6. the lower diagonal block of the inverse of .17) ~ ~ e ~ en 1 • I I Wn(IJ~2)) ~ n(9~2) 1J~2)l[I"(9nJrl(9~) _1J~2») where I 22 (IJ) is the lower diagonal block of II (IJ) written as I2 I 1 _ (Ill(lJ) (IJ) I21(IJ) I (1J)) I22(IJ) with diagonal blocks of dimension q x q and d x d. respectively.6. More generally. (6.16) n(9 n _IJ)TI(9 n )(9 n IJ) !:.4.9).lJ n ) where IJ n statistic as ~(1) ~(2) ~(1) = (0" .Nr(O..31).398 Inference in the Multiparameter Case Chapter 6 6. 1(8) continuous implies that [1(8) is continuous and. Then . 0 ""11 F . r 1 (1J)) where.3.rl(IJ» asn ~ p ~ L 00. ~ Theorem 6.3.0. in particular .X. I . i favored because it is usually computed automatically with the MLE.2..3.: b.3.q' ~(2) (6. ..3.3..18) i 1 Proof.2 Wald's and Rao's Large Sample Tests The Wald Test Suppose that the assumptions of Theorem 6. . the Hessian (problem 6. But by Theorem 6.Or) and define the Wald (6.2.. It follows that the Wald test that rejects H : (J = 6 0 in favor of K : () • i= 00 when . [22 is continuous.a)} is an ellipsoid in W easily interpretable and computablesee (6. . The last Hessian choice is ~ ~ . y T I(IJ)Y.. hence.3. for instance. I 22 (9 n ) is replaceable by any consistent estimate of J22( 8).2 hold.~ D 2 1n ( IJ n).2. . i . it follows from Proposition B. X. More generally /(9 0 ) can be replaced by any consistent estimate of I( 1J0 ). j Wn(IJ~2)) !:.15) and (6.fii(lJn (2) L 1J 0 ) ~ Nd(O.2. I22(lJo)) if 1J 0 E eo holds. according to Corollary B. y T I(IJ)Y . ~ ~ 00.3.fii(1J IJ) ~N(O. It and I(8 n ) also have the advantage that the confidence region One generates {6 : Wn (6) < xp(l .) and IJ n ~ ~ ~(2) ~ (Oq+" . If H is true.2.~ D 2ln (IJ n ).16). Y ..3.~ D 2l n (1J 0 ) or I (IJ n ) or . Under the conditions afTheorem 6. .15) Because I( IJ) is continuous in IJ (Problem 6.7.7.1.l(a) that I(lJ n ) ~ I(IJ) asn By Slutsky's theorem B. (6. I.3.10). theorem completes the proof. For the more general hypothesis H : (} E 8 0 we write the MLE for 8 E as = ~ ! I: (IJ n . . Slutsky's it I' I' .2. has asymptotic level Q.2.
This test has the advantage that it can be carried out without computing the MLE. X. 2 Wn(Oo ) ~(2) where .nl]  2  1 .21) where III is the upper left q x q block of the r x r infonnation matrix I (80 ).3 Large Sample Tests and Confidence Regions 399 The Wold test.6.(00 )  121 (0 0 )/..n W (80 .. un.3.0:) is called the Rao score test. as (6. = 2 log A(X) + op(l) (6. the two tests are equivalent asymptotically..3.Section 6.3. and the convergence Rn((}o) ~ X. Let eo '1'n(O) = n. The Rao Score Test For the simple hypothesis H that.. What is not as evident is that.. a consistent estimate of .8) E(Oo) = 1. 112 is the upper right q x d block. The Rao test is based on the statistic Rn(Oo ) ~n'1'n(Oo.19) indicates.2 that under H.ln(Oo.ln(8 0 . The test that rejects H when R n ( (}o) > x r (1. (6. as n _ CXJ. the asymptotic variance of .Q). The Wald test leads to the Wald confidence regions for (B q +] . . in practice they can be very different.. It can be shown that (Problem 6. therefore.:El ((}o) is ~ n 1 [D.3. is. Furthermore.1/ 2 D 2 In (0) where D 1 l n represents the q x 1 gradient with respect to the first q coordinates and D 2 l n the d x 1 gradient with respect to the last d. which rejects iff HIn (9b ») > x 1_ q (1 ..n under H.22) . The argument is sketched in Problem 6. ~ B ) T given by {8(2) : r W n (0(2)) < x r _ q (1 These regions are ellipsoids in R d Although. It follows from this and Corollary B.n) ~  where :E is a consistent estimate of E( (}o).9.3. Rn(Oo) = nt/J?:(Oo)r' (OO)t/Jn(OO) !. under H.20) vnt/Jn(OO) !. N(O..3.n) under n H. (J = 8 0 .' (0 0 )112 (00) (6. asymptotically level Q. the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n.n) 2  + D21 ln (00.3.nlE ~ (2) _ T  .n)[D....3.n] D"ln(Oo. 1(0 0 )) where 1/J n = n . and so on.. The extension of the Rao test to H : (J E runs as follows.I Dln ((Jo) is the likelihood score vector. requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests. ~1 '1'n(OO. (Problem 6. Rao's SCOre test is based on the observation (6.9) under AOA6 and consistency of (JO.19) : (J c: 80.\(X) is the LR statistic for H Thus. by the central limit theorem.
. X~' 2 j > xd(1 . Power Behavior of the LR. .5. . I . We also considered a quadratic fonn. for the Wald test neOn ._q(A T I(Oo)t. and showed that this quadratic fonn has limiting distribution q . . where X~ (')'2) is the noncentral chi square distribution with m degrees of freedom and noncentrality parameter "'? It may he shown that the equivalence (6. which stales that if A(X) is the LR statistic. Finally. and Wald Tests It is possible as in the onedimensional case to derive the asymptotic power for these tests for alternatives of the form On = ~ eo + ~ where 8 0 E 8 0 . under regularity conditions. 2 log A(X) has an asymptotic X..400 Inference in the Multiparameter Case Chapter 6 where D~ is the d x d matrix of second partials of ill with respect to matrix of mixed second partials with respect to e(l). On the other hand.q distribution under H. it shares the disadvantage of the Wald test that matrices need to be computed and inverted.0 0 ) = T'" "" (vn(On . Under H : (J E A6 required only for Po eo and the conditions ADAS a/Theorem 6. 0(2).) . Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score testssee Rao (1973) for more on this. We established Wilks's theorem. D 21 the d x d Theorem 6.On) + t..8 0 where is an open subset of R: and 8 0 is the collection of 8 E 8 with the last r ~ q coordinates 0(2) specified.2 but with Rn(lJb 2 1 1 » !:. I I .3. X. and so on. We considered the problem of testing H : 8 E 80 versus K : 8 E 8 . . .. Rao. The asymptotic distribution of this quadratic fonn is also q• I e X: X: 6. e(l).3.2. which measures the distance between the hypothesized value of 8(2) and its MLE.a)}. The analysis for 8 0 = {Oo} is relatively easy.19) holds under On and that the power behavior is unaffected and applies to all three tests. we introduced the Rao score test.)I(On)( vn(On  On) + A) !:.a)} and • The Rao large sample critical and confidence regions are {R n (Ob » {0(2) : Rn (0(2) < xd(l. Summary. For instance. . then. .4 LARGE SAMPLE METHODS FOR DISCRETE DATA In this section we give a number of important applications of the general methods we have developed to inference for discrete data. In particular we shall discuss problems of .0 0 ) I(On)(On . The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under H. which is based on a quadratic fonn in the gradient of the log likelihood. b . called the Wald statistic.
. treated in more detail in Section 6. j = 1. we need the information matrix I = we find using (2. j=l i=1 The second term on the right is kl 2 n Thus. . k ." .]2100.OOi)(B. k1.nOo. .3. consider i.3. trials in which Xi = j if the ith trial prcx:luces a result in the jth category.4.8 we found the MLE 0.E~. and 2.. Ok Thus. In.) /OOk = n(Bk .2. the Wald statistic is klkl Wn(Oo) = n LfB. . Because ()k = 1 . + n LL(Bi .i.6. = L:~ 1 1{Xi = j}. k Wn(Oo) = L(N.4. . j = 1..00k)2lOOk.OO. k. j=1 To find the Wald test. I ij [ ..)/OOk.) > Xkl(1. j=1 Do.7. log(N..32) thaI IIIij II. we consider the parameter 9 = (()l. .+ ] 1 1 0. It follows that the large sample LR rejection region is ~ k 2IogA(X) = 2 LN. j = 1.)2InOo. .4 Large Sample Methods for Discret_e_D_a_ta ~_ _ 401 goodnessoffit and special cases oflog linear and generalized linear models (GLM). L(9. 6.00..d.()k_l)T and test the hypothesis H : ()j = ()OJ for specified (JOj.33) and (3. .5. Thus. 2. Pearson's X2 Test As in Examples 1. For i. where N.Section 6. with ()Ok =  1 1 E7 : kl j=l ()OJ.0).. ()j.lnOo. .2. ~ N. Ok if i if i = j.1. j=l . In Example 2. we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution.8.. # j. or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory.2.1 GoodnessofFit in a Multinomial Model.. Let ()j = P(Xi = j) be the probability of the jth category.
because kl L(Oj .80j ) = (Ok .4..Nk_d T and. we could invert] or note that by (6.2.1) X where the sum is over categories and "expected" refers to the expected frequency E H (Nj ).1) of Pearson's X2 will reappear in other multinomial applications in this section. 8k 80 k = {[8 0k (8 j  80j )]  [8 0j (8 k ~  80k ) ] } .. It is easily remembered as 2 = SUM (Observed . by A..13.80k).~) 8 80j 80 k ~ ~ ~ ] 2 0j = n [~ 80 k ~ 1] 2 To simplify the first term on the right of (6. The second term on the right is n [ j=1 (!.4..4. . where N = (N1 . . ~ = Ilaijll(kl)X(kl) with Thus. .2). Then.4. 80j 80k 1 and expand the square keeping the square brackets intact.402 Inference in the Multiparameter Case Chapter 6 The term on the right is called Pearson's chisquare (X 2 ) statistic and is the statistic that is typically used for this multinomial testing problem.. we write ~ ~  8j 80j .. The general form (6. with To find ]1. To derive the Rao test.~ 8 8 0j 0k  2 80j ) ( n LL _ _ L kl kl kl ( J=1 z=1 ~ 8_Z 80i ~) 8k 80k (~ 8 _J 80J ~) ) 8k .2..L . . the Rao statistic is .8.1S. '" .L . note that from Example 2.11).. j=1 . ]1(9) = ~ = Var(N).2) .Expected)2 Expected (6. n ( L kl ( j=1 ~ ~) !. ~ .. 80i 80j 80k (6.
3.75)2 = 04 34.1.4). n3 = 108. Contingency Tables Suppose N = (NI . If we assume the seeds are produced independently. which has a pvalue of 0. n4 = 32. LOi=l}.25 + (3.4. However. and we want to test whether the distribution of types in the n = 556 trials he performed (seeds he observed) is consistent with his theory. Example 6. l::. and (4) wrinkled green. j.fh. i=1 For example. 04 = 1/16. Then. in the HardyWeinberg model (Example 2. 4 as above and associated probabilities of occurrence fh. (3) round green. n020 = n030 = 104. Possible types of progeny were: (1) round yellow.4 Large Sample Methods for Discrete Data 403 the first term on the right of (6. this value may be too small! See Note 1. Here testing the adequacy of the HardyWeinberg model means testing H : 8 E 8 0 versus K : 8 E . Nk)T has a multinomial. distribution.2 GoodnessofFit to Composite Multinomial Models.. n2 = 101. In experiments on pea breeding. fh = 03 = 3/16. Mendel's theory predicted that 01 = 9/16. 2. 8 0 . We will investigate how to test H : 0 E 8 0 versus K : 0 ¢:.75 + (3.1.k. Testing a Genetic Theory. Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds.i::.25 + (2.2) becomes It follows that the Rao statistic equals Pearson 's X2.75. 04.9 when referred to a X~ table.1) dimensional parameter space k 8={8:0i~0.25)2 312. k = 4 X 2 = (2. Mendel observed nl = 315. 02.48 in this case. nOlO = 312. where 8 0 is a composite "smooth" subset of the (k .Section 6. 8).4. For comparison 210g A = 0. . • 6.4. (2) wrinkled yellow. we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1. 7.75.: .75 .25.25)2 104. which is a onedimensional curve in the twodimensional parameter space 8. . n040 = 34.. M(n.75)2 104. There is insufficient evidence to reject Mendel's hypothesis.
We suppose that we can describe 8 0 parametrically as where 11 = ('r/1. .. to test the HardyWeinberg model we set e~ = e1 . and the map 11 ~ (e 1(11).l now leads to the Rao statistic R (8("") n 11 =~ ~ j=l [Ni .("") X J 11 I where the righthand side is Pearson's X2 as defined in general by (6.404 Inference in the Multiparameter Case Chapter 6 8 1 where 8 1 = 8 . . a 11 'r/J .4. thus. . The algebra showing Rn(8 0 ) = X2 in Section 6.. .1). 8(11)) : 11 E £}.r for specified eOj . 8(11)) = 0.. . 8 0 • Let p( n1. e~) T ranges over an open subset of Rq and ej = eOj .ne j (ij)]2 = 2 ne.. involve restrictions on the e obtained by specifying independence assumptions on i classifications of cases into different categories. If a maximizing value. r. . .r... ij satisfies l a a'r/j logp(nl.. nk. i=l k If 11 ~ e( 11) is differentiable in each coordinate. However. ek(11))T takes £ into 8 0 . Then we can conclude from Theorem 6. we obtain the Rao statistic for the composite multinomial hypothesis by replacing eOj in (6."" 'r/q) T.q distribution for large n. by the algebra of Section 6. nk. ij = (iiI.3) Le." For instance.3. nk. the Wald statistic based on the parametrization 8 (11) obtained by replacing e by e (ij).q) exists. To apply the results of Section 6.. which will be pursued further later in this section. 8) for 8 E 8 0 is the same as maximizing p( n1. 8(11)) for 11 E £. . £ is open.q' Moreover. then it must solve the likelihood equation for the model.1. . i=l t n.4. . .. under H.nk.3).. approximately X.4.X approximately has a X..2) by ej(ij)... Other examples.. That is.X approximately has a xi distribution under H. Maximizing p(n1. and ij exists. j Oj is.~) and test H : e~ = O.4. .8 0 .. Consider the likelihood ratio test for H : e E 8 0 versus K : e 1:. 2 log .2~(1 . j = 1. The Wald statistic is only asymptotically invariant under reparametrization. The Rao statistic is also invariant under reparametrization and. also equal to Pearson's X2 • . . e~ = e2 ... the log likelihood ratio is given by log 'x(nb . j = q + 1.1. .4. To avoid trivialities we assume q < k . . If £ is not open sometimes the closure of £ will contain a solution of (6.. . nk) = L ndlog(ni/n) log ei(ij)]. where 9j is chosen so that H becomes equivalent to "( e~ . 8) denote the frequency function of N.4.(t)~ei(11)=O. . is a subset of qdimensional space...3 and conclude that. . 1 ~ j ~ q or k (6.. . l~j~q. we define ej = 9j (8).3 that 2 log . {p(..
1958.4. . (3) starchywhite. is or is not a smoker. AB. and so on. ( 2n3 + n2) 2) T 2n 2n o Example 6. To study the relation between the two characteristics we take a random sample of size n from the population.TJ). AB. T() test the validity of the linkage model we would take 8 0 {G(2 + TJ). (2nl + n2)(2 3 + n2) . The results are assembled in what is called a 2 x 2 contingency table such as the one shown. p. We found in Example 2. (}4) distribution. The Fisher Linkage Model.6 that Thus. Denote the probabilities of these types by {}n. iTJ) : TJ :. The only root of this equation in [0. (6. 9(ij) ((2nl + n2) 2n 2n 2. we obtain critical values from the X~ tables. {}12..4. 1] is the desired estimate (see Problem 6. . 301).5. ij {}ij = (Bil + Bi2 ) (Blj + B2j ). nl (2+TJ) (n2 + n3) n4 (1TJ) +:ry 0.4 Methods for Discrete Data 405 Example 6.. Then a randomly selected individual from the population can be one of four types AB.1). TJ). The likelihood equation (6.4. specifies that where TJ is an unknown number between 0 and 1. . i(1 . respectively. Because q = 1. green base leaf versus white base leaf) leads to four possible offspring types: (1) sugarywhite.3) becomes i(1 °:. An individual either is or is not inoculated against a disease.2. HardyWeinberg. do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic A and A and of the second Band B. (h. k 4. Independent classification then means that the events [being an A] and [being a B] are independent or in terms of the B .4. We often want to know whether such characteristics are linked or are independent.4) which reduces to a quadratic equation in if. (2) sugarygreen. AB. A selfcrossing of maize heterozygous on two characteristics (starchy versus sugary. 0 Testing Independence of Classifications in Contingency Tables Many important characteristics have only two categories. then (Nb .• . For instance.4. is male or female.. N 4 ) has a M(n. H is rejected if X2 2:: Xl (1 . 1} a "onedimensional curve" of the threedimensional parameter space 8.4. A linkage model (Fisher. (4) starchygreen. If Ni is the number of offspring of type i among a total of n offspring. .Section 6. {}22. {}21.a) with if (2nl + n2)/2n.
406 Inference in the Multiparameter Case Chapter 6 l A 11 The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column. By our theory if H is true. Then. For () E 8 0 . respectively.2). I}. N 21 . C j = N 1j + N 2j is the jth column sum. X2 has approximately a X~ distribution. We test the hypothesis H : () E 8 0 versus K : 0 f{. Thus.RiCj/n) are all the same in absolute value and. 0 ::.4.'r/1) (1 . for example N12 is the number of sampled individuals who fall in category A of the first characteristic and category B of the second characteristic.3) become . 0 12 .~ + n12) iiI (nll + n2d (nll i72 (n21 + n22) (1 . This suggests that X2 may be written as the square of a single (approximately) standard normal variable. ( 22 ). (6. 021 .5) whose solutions are Til 'r/2 = (n11+ n 12)/n (nll + n21)/n.Ti2) (6. q = 2.4. 0 11 .7) where Ri = Nil + Ni2 is the ith row sum.4. where 8 0 is a twodimensional subset of 8 given by 8 0 = {( 'r/1 'r/2. 'r/1 (1 .Tid (n12 + n22) (1 . where z tt [R~jl1 2=1 J=l . In fact (Problem 6. if N = (Nll' N 12 . 8 0 . 'r/2 to indicate that these are parameters.4. the likelihood equations (6. 'r/2 (1 . N 22 )T. the (Nij .4. which vary freely. (1 .'r/1). we have N rv M(n.'r/2)) : 0 ::.011 + 021 as 'r/1. 'r/2 ::.6) the proportions of individuals of type A and type B. Pearson's statistic is then easily seen to be (6. Here we have relabeled 0 11 + 0 12 . 'r/1 ::. These solutions are the maximum likelihood estimates. because k = 4. 1.'r/2).
. i :::.peA I B)] [~(B) ~(~)ll/2 peA) peA) where P is the empirical distribution and where we use A. .. The X2 test is equivalent to rejecting (twosidedly) if. . . TJj2 are nonnegative and 2:~=1 = 2:~=1 TJj2 = 1. if and only if. B.. B..... b ~ 2 (e. j = 1. that A is more likely to occur in the presence of B than it would in the presence of B). Next we consider contingency tables for two nonnumerical characteristics having a and b states.. . peA I B) = peA I B). if X2 measures deviations from independence. it is reasonable to use the test that rejects.4. The N ij can be arranged in a a x b contingency table.g.8) Thus. (Jij : 1 :::. a. a. then {Nij : 1 :::.Section 6. 1). that is. . j :::.. B.4.4. b). 1 :::.. b NIb Rl a Nal C1 C2 . B to denote the event that a randomly selected individual has characteristic A. . Z = v'n[P(A I B) .. peA I B)) versus K : peA I B) > peA I B).. If we take a sample of size n from a population and classify them according to each characteristic we obtain a vector N ij ... j :::. a. Nab Cb Ra n . 1 Nu 2 N12 . If (Jij = P[A randomly selected individual is of type i for 1 and j for 2]. b where N ij is the number of individuals of type i for characteristic 1 and j for characteristic 2.3) that if A and B are independent.4 Large Sample Methods for Discrete Data 407 An important altemative form for Z is given by (6..4. . It may be shown (Problem 6. hair color).a) as a level a onesided test of H : peA I B) = peA I B) (or peA I B) :::. 1 :::. i :::. Thus. b} "J M(n. a... a. j :::. eye color. Positive values of Z indicate that A and B are positively associated (i.e. (Jij TJil The hypothesis that the characteristics are assigned independently becomes H : TJil TJj2 for 1 :::.. . Therefore. and only if. Z indicates what directions these deviations take. 1 :::. b where the TJil. then Z is approximately distributed as N(O. Z ~ z(l. . respectively. i = 1. i :::.
"" 7rk)T based on X = (Xl"'" Xk)T is t. we observe the number of successes Xi = L. approximately normally distributed ~ for known constants {Zij} and and whose means are modeled as J.4.9) which has approximately a X(al)(bl) distribution under H. a simple linear representation zT ~ for 7r(.4. (6. The argument is left to the problems as are some numerical applications.{3p. i :::.. log(l "il] + t. log ( ~: ) .) over the whole range of z is impossible. we obtain what is called the logistic linear regression model where .4. we observe independent Xl' .. In this section we will consider Bernoulli responses Y that can only take on the values 0 and 1. 1) d.408 Inference in the Multiparameter Case Chapter 6 with row and column sums as indicated. Examples are (1) medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0). Thus.10.f..8 as the canonical parameter zT TJ = g ( 7r) = log [7r / (1 . k.~ =1 Zij {3j = unknown parameters {31.4. or (3) market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0).11) When we use the logit transfonn g(7r)." We assume that the distribution of the response Y depends on the known covariate vector ZT. As is typical.6. B(mi' 7ri)' where 7ri = 7r(Zi) is the probability of success for a case with covariate vector Zi. 6. . (6. Because 7r(z) varies between 0 and 1. Next we choose a parametric model for 7r(z) that will generate useful procedures for analyzing experiments with binary responses. Instead we turn to the logistic transform g( 7r). . ' XI. 1 :::.1 we considered linear models that are appropriate for analyzing continuous responses {Yi} that are. perhaps after a transfonnation."F~1 Yij where Yij is the response on the jth of the mi trials in block i. (2) election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0). In this section we assume that the data are grouped or replicated so that for each fixed i. Maximum likelihood and dimensionality calculations similar to those for the 2 x 2 table show that Pearson's X2 for the hypothesis of independence is given by (6.3 Logistic Regression for Binary Responses In Section 6.Li = L. with Xi binomial. and the loglog transform g2(7r) = 10g[log(1 .. [Xi C :i"J log + m. Other transforms. usually called the log it.7r)] are also used in practice.7r)]. whicll we introduced in Example 1. we call Y = 1 a "success" and Y = 0 a "failure. The log likelihood of 7r = (7rl. such as the probit gl(7r) = <I>1(7r) where <I> is the N(O.
Tp) T. Zi)T is the logistic regression model of Problem 2.. .2 can be used to compute the MLE of f3. mi 2mi Here the adjustment 1/2mi is used to avoid log 0 and log 00. p (Jj Tj  ~ mi loge1+ exp{Zif3} ) + ~ log '.16) 1fi) 1]. ( 1 .14) where W = diag{ mi1fi(l1fi)}kxk.[ i(l7r iW')· . Note that IN(f3) is the log likelihood of a pparameter canonical exponential model with parameter vector f3 and sufficient statistic T = (T1' .3. the likelihood equations are just ZT(X .+ ..4.Section 6.l. Zi The log likelihood l( 7r(f3)) = == k (1.. The coordinate ascent iterative procedure of Section 2.3.4. Because f3 = (ZTz) 1 ZT 'T] and TJi = 10g[1fi (1 1fi) in TJi has been replaced by 730 is a plugin estimate of f3 where 1fi and (1 1f. The condition is sufficient but not necessary for existencesee Problem 2.12) where T j = 2:7=1 ZijXi and we make the dependence on N explicit. (6.13) By Theorem 2.1 applies and we can conclude that if 0 < Xi < mi and Z has rank p..1. in (6.J) SN(O.4 the Fisher information matrix is I(f3) = ZTWZ (6. It follows that the NILE of f3 solves E f3 (Tj ) = T . j Thus.4. . where Z = IIZij Ilrnxp is the design matrix..3.14).f3p )T is. j = 1. Alternatively with a good initial value the NewtonRaphson algorithm can be employed. .15) Vi = log X+. k ( ) (6.4 Large Sample Methods for Discrete Data 409 The special case p = 2.3.+ . Theorem 2.1fi)*}' + 00.(1.. W is estimated using Wo = diag{mi1f.1fi )* = 1 .L) = O.4. that if m 1fi > 0 for 1 ::.3. we can guarantee convergence with probability tending to 1 as N + 00 as follows.. the solution to this equation exists and gives the unique MLE 73 of f3. Although unlike coordinate ascent NewtonRaphson need not converge.~ Xi 2 1 +"2 ' (6.! ) ( mi . k m} (V. mi 2mi Xi 1 Xi 1 = .log(l:'.. i ::. or Ef3(Z T X) = ZTX.4. 7r Using the 8method. .Li = E(Xi ) = mi1fi.4..J. Similarly. Then E(Tj ) = 2:7=1 Zij J.1 and by Proposition 3.Li.3. As the initial estimate use (6. the empirical logistic transform. We let J.p. .4. it follows by Theorem 5. IN(f3) of f3 = (f31..4. if N = 2:7=1 mi. I N(f3) = f.
Because Z has rank p, it follows (Problem 6.4.14) from (6.4.11) and (6.4.12) that β̂₀ is consistent.

Testing

In analogy with Section 6.1, linear subhypotheses are important. As before, we let

    ω = {η : η_i = Σ_{j=1}^p z_ij β_j, β ∈ R^p}

and let r be the dimension of ω. We want to contrast ω with the case where there are no restrictions on η; that is, we set Ω = R^k and consider η ∈ Ω. To get expressions for the MLEs of π and μ, recall from Example 1.6.8 that the inverse of the logit transform g is the logistic distribution function. Thus, for η ∈ ω, the MLE of π_i is π̂_i = g⁻¹(Σ_{j=1}^p z_ij β̂_j).

The LR statistic 2 log λ for testing H : η ∈ ω versus K : η ∈ Ω - ω is denoted by D(X, μ̂), where μ̂ is the MLE of μ for η ∈ ω; D(X, μ̂) measures the distance between the fit μ̂ based on the model ω and the data X. From (6.4.11) and (6.4.12),

    D(X, μ̂) = 2 Σ_{i=1}^k [ X_i log(X_i/μ̂_i) + X_i' log(X_i'/μ̂_i') ]   (6.4.17)

where X_i' = m_i - X_i and μ̂_i' = m_i - μ̂_i. D(X, μ̂) has asymptotically a χ²_{k-r} distribution for η ∈ ω as m_i → ∞, i = 1, ..., k, k < ∞; see Problem 6.4.13.

If ω₀ is a q-dimensional linear subspace of ω with q < r, then we can form the LR statistic for H : η ∈ ω₀ versus K : η ∈ ω - ω₀,

    2 log λ = 2 Σ_{i=1}^k [ X_i log(μ̂_i/μ̂₀_i) + X_i' log(μ̂_i'/μ̂₀_i') ]   (6.4.18)

where μ̂₀ is the MLE of μ under H and μ̂₀_i' = m_i - μ̂₀_i. As before, 2 log λ has an asymptotic χ²_{r-q} distribution as m_i → ∞. Here is a special case.

Example. The Binomial One-Way Layout. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of m_i patients and recording the number X_i of patients that recover. For a second example, suppose we want to compare k different locations with respect to the percentage that have a certain attribute, such as the intention to vote for or against a certain proposition. We obtain k independent samples, one from each location, and for the ith location count the number X_i among m_i that has the given attribute. The samples are collected independently and we observe X_1, ..., X_k independent with X_i ~ B(m_i, π_i), i = 1, ..., k. In this case the likelihood is a product of independent binomial densities, and the MLEs of π_i and μ_i under Ω are X_i/m_i and X_i.

This model corresponds to the one-way layout of Section 6.1, and as in that section, an important hypothesis is that the populations are homogeneous. Thus, we test H : π_1 = π_2 = ... = π_k = π, π ∈ (0, 1), versus the alternative that the π's are not all equal. Using Z as given for the one-way layout in Section 6.1, under H the log likelihood in canonical exponential form is

    β T - N log(1 + exp{β}) + Σ_{i=1}^k log( m_i choose X_i )

where T = Σ_{i=1}^k X_i, N = Σ_{i=1}^k m_i, and β = log[π/(1 - π)]. It follows from Theorem 2.3.1 that if 0 < T < N the MLE exists and is the solution of (6.4.13); we find that the MLE of π under H is π̂ = T/N. The LR statistic is given by (6.4.18) with μ̂₀_i = m_i π̂. The Pearson statistic

    χ² = Σ_{i=1}^k (X_i - m_i π̂)² / [ m_i π̂(1 - π̂) ]

is a Wald statistic, and the χ² test is equivalent asymptotically to the LR test (Problem 6.4.13).
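As a quick numerical check, here is a short sketch (ours; the recovery counts are hypothetical) that computes the LR statistic (6.4.18) with μ̂₀_i = m_i π̂ and the Pearson statistic for the homogeneity hypothesis:

```python
import numpy as np
from scipy.stats import chi2

def homogeneity_tests(X, m):
    """LR (deviance) and Pearson chi-square tests of H: pi_1 = ... = pi_k."""
    k = len(X)
    pi_hat = X.sum() / m.sum()            # MLE of pi under H: T/N
    mu0 = m * pi_hat                      # fitted success counts under H
    # LR statistic (6.4.18) with mu_i = X_i (saturated fit), mu0_i = m_i*pi_hat
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = X * np.log(X / mu0) + (m - X) * np.log((m - X) / (m - mu0))
    lr = 2 * np.nansum(terms)             # 0*log(0) terms contribute 0
    pearson = np.sum((X - mu0) ** 2 / (mu0 * (1 - pi_hat)))
    df = k - 1
    return lr, pearson, chi2.sf(lr, df), chi2.sf(pearson, df)

X = np.array([18, 24, 31])                # recoveries (hypothetical)
m = np.array([50, 50, 50])                # patients per treatment
print(homogeneity_tests(X, m))
```

As the theory predicts, the two statistics and their approximate p-values are close when the m_i are moderately large.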
Summary. We used the large sample testing results of Section 6.3 to find tests for important statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's χ²," which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H. When the hypothesis is that the multinomial parameter is in a q-dimensional subset of the (k-1)-dimensional parameter space Θ, the Rao statistic is again of the Pearson χ² form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, the Pearson statistic is shown to have a simple intuitive form. Finally, we considered logistic regression for binary responses, in which the logit transformation of the probability of success is modeled as a linear function of covariates. We derived the likelihood equations, discussed algorithms for computing MLEs, and gave the LR test. In the special case of testing equality of k binomial parameters, we gave the MLEs and the χ² test explicitly.

6.5 GENERALIZED LINEAR MODELS

In Sections 6.1 and 6.4.3 we considered experiments in which the mean μ_i of a response Y_i is expressed as a function of a linear combination

    ξ_i = z_i^T β = Σ_{j=1}^p z_ij β_j

of covariate values. In particular, in the case of a Gaussian response, μ_i = ξ_i; in the binary response case of Section 6.4.3, if μ_i = E Y_ij, then μ_i = π_i = g₁(ξ_i), where g₁(y) is the logistic distribution function. McCullagh and Nelder (1983, 1989) synthesized a number of previous generalizations of the linear model, most importantly the log linear model developed by Goodman and Haberman. See Haberman (1974).

The generalized linear model with dispersion depending only on the mean. The data consist of an observation (Z, Y), where Y = (Y_1, ..., Y_n)^T is n x 1 and Z^T_{p×n} = (z_1, ..., z_n) with z_i = (z_i1, ..., z_ip)^T nonrandom, and Y has density p(y, η) given by

    p(y, η) = exp{ Σ_{i=1}^n η_i y_i - A(η) } h(y),             (6.5.1)

an n-parameter canonical exponential family. As we know from Corollary 1.6.1, the mean μ of Y is related to η via μ = Ȧ(η). Note that if Ȧ is one-to-one, μ determines η and thereby Var(Y) = Ä(η). Typically, η is not free to range over E, the natural parameter space of the n-parameter canonical exponential family (6.5.1), but lies in a subset of E obtained by restricting η_i to be of the form η_i = h(Σ_{j=1}^p z_ij β_j), where h is a known function.

We assume that there is a one-to-one transform g(μ) of μ, called the link function, such that

    g(μ) = Σ_{j=1}^p β_j z^(j) = Zβ

where z^(j) = (z_1j, ..., z_nj)^T is the jth column vector of the design matrix Z and g(μ) is interpreted coordinatewise, g(μ) = (g(μ_1), ..., g(μ_n))^T.

Special cases are:

(i) The linear model with known variance. The Y_i are independent Gaussian with known variance 1 and means μ_i = Σ_{j=1}^p z_ij β_j. The model is a GLM with canonical link g(μ) = μ, the identity.

(ii) Log linear models, developed below.

Canonical links. The most important case corresponds to the link being canonical, that is, g = Ȧ⁻¹, or

    η = Σ_{j=1}^p β_j z^(j) = Zβ.

In this case, the GLM is the canonical subfamily of the original exponential family generated by Z^T Y, which is p x 1. Typically, A(η) = Σ_{i=1}^n A₀(η_i) for some A₀, in which case μ_i = A₀'(η_i).
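As a concrete illustration of the canonical-link recipe (a worked aside on two families that recur in this chapter):

    Bernoulli:  A₀(η) = log(1 + e^η),  so  μ = A₀'(η) = e^η/(1 + e^η)
                and the canonical link is the logit, g(μ) = log[μ/(1 - μ)];

    Poisson:    A₀(η) = e^η,  so  μ = A₀'(η) = e^η
                and the canonical link is the log, g(μ) = log μ,

the latter yielding the log linear models discussed next.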
Log linear models. Suppose (Y_1, ..., Y_p)^T is M(n; θ_1, ..., θ_p), θ_j > 0, Σ_{j=1}^p θ_j = 1. If we take g(μ) = (log μ_1, ..., log μ_p)^T, the link is canonical and, as seen in Section 1.6, η_j = log θ_j, 1 ≤ j ≤ p, are canonical parameters. The models obtained by postulating a linear model for η are called log linear; see Haberman (1974) for an extensive treatment. An important hypothesis, for example, arises in a two-way classification: suppose Y = ||Y_ij||, 1 ≤ i ≤ a, 1 ≤ j ≤ b, so that Y_ij counts cases in classification i on characteristic 1 and j on characteristic 2, with θ = ||θ_ij||, 1 ≤ i ≤ a, 1 ≤ j ≤ b. The hypothesis of independence is θ_ij = θ_{i+}θ_{+j}, where θ_{i+} = Σ_{j=1}^b θ_ij and θ_{+j} = Σ_{i=1}^a θ_ij; it corresponds to the log linear model

    log θ_ij = β_i + β_j',  1 ≤ i ≤ a, 1 ≤ j ≤ b,

where the β_i, β_j' are free (unidentifiable) parameters. The log linear label is also attached to models obtained by taking the Y_i independent Bernoulli(θ_i), 0 < θ_i < 1, with canonical link g(θ) = log[θ(1 - θ)⁻¹]. This is just the logistic linear model of Section 6.4.3.

Algorithms

If the link is canonical, then, by Theorem 2.3.1, if maximum likelihood estimates β̂ exist, they necessarily uniquely satisfy the equation

    Z^T Y = Z^T E_β̂ Y = Z^T Ȧ(Zβ̂)                             (6.5.2)

or

    Z^T (Y - μ(β̂)) = 0.                                        (6.5.3)

It is interesting to note that (6.5.2) can be interpreted geometrically in somewhat the same way as in the Gaussian linear model: the "residual" vector Y - μ(β̂) is orthogonal to the column space of Z. But, in general, μ(β̂) is not a member of that space.

The coordinate ascent algorithm can be used to solve (6.5.2) (or ascertain that no solution exists). With a good starting point β̂₀ one can achieve faster convergence with the Newton-Raphson algorithm of Section 2.4.3. In this situation, and more generally even for noncanonical links, Newton-Raphson coincides with Fisher's method of scoring described in Problem 6.5.1. Specifically, let Δ_{m+1} ≡ β̂_{m+1} - β̂_m, which satisfies the equation

    [ Z^T Ä(Zβ̂_m) Z ] Δ_{m+1} = Z^T ( Y - Ȧ(Zβ̂_m) ).          (6.5.4)

That is, the correction Δ_{m+1} is given by the weighted least squares formula (2.2.20) when the data are the residuals from the fit at stage m, the variance-covariance matrix is W_m ≡ Ä(Zβ̂_m), and the regression is on the columns of W_m Z; see the problems. In this context, the algorithm is also called iterated weighted least squares. One can show that if β̂₀ is close enough to the true value of β, then with probability tending to 1 as n → ∞ the algorithm converges to the MLE if it exists.

Testing in GLM

Testing hypotheses in GLM is done via the LR statistic. As in the linear model, we can define the biggest possible GLM M of the form (6.5.1), for which p = n. In that case the MLE of μ is μ̂_M = (Y_1, ..., Y_n)^T (assume that Y is in the interior of the convex support of {y : p(y, η) > 0}). Write η(·) for Ȧ⁻¹. We can think of the test statistic

    2 log λ = 2[ l(Y, η(Y)) - l(Y, η(μ₀)) ]

for the hypothesis that μ = μ₀ within M as a "measure" of (squared) distance between Y and μ₀. This quantity,

    D(Y, μ₀) ≡ 2[ l(Y, η(Y)) - l(Y, η(μ₀)) ],                  (6.5.5)

called the deviance between Y and μ₀, is always ≥ 0. For the Gaussian linear model with known variance σ₀², D(y, μ₀) = |y - μ₀|²/σ₀² (see the problems). This name stems from the following interpretation. The LR statistic for H : μ ∈ ω₀ is just D(Y, μ̂₀) ≡ inf{D(Y, μ) : μ ∈ ω₀}, where μ̂₀ is the MLE of μ in ω₀. The LR statistic for H : μ ∈ ω₀ versus K : μ ∈ ω₁ - ω₀ with ω₁ ⊃ ω₀ is D(Y, μ̂₀) - D(Y, μ̂₁), where μ̂₁ is the MLE under ω₁. If ω₀ ⊂ ω₁ we can write

    D(Y, μ̂₀) = [ D(Y, μ̂₀) - D(Y, μ̂₁) ] + D(Y, μ̂₁),           (6.5.6)

a decomposition of the deviance between Y and μ̂₀ as the sum of two nonnegative components, the deviance of Y to μ̂₁ and Δ(μ̂₀, μ̂₁) ≡ D(Y, μ̂₀) - D(Y, μ̂₁), each of which can be thought of as a squared distance between its arguments. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6.1. Unfortunately Δ ≠ D generally, except in the Gaussian case.
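The decomposition (6.5.6) is easy to exhibit numerically. The following sketch (ours; the grouped data are hypothetical, and the fitter is the same Newton-Raphson iteration sketched earlier) computes the binomial deviance for a one-covariate logistic model ω₁ and its intercept-only submodel ω₀:

```python
import numpy as np

def fit_logistic(Z, X, m, n_iter=30):
    """Newton-Raphson for a grouped logistic GLM; returns fitted means."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        pi = 1 / (1 + np.exp(-Z @ beta))
        W = m * pi * (1 - pi)
        beta += np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (X - m * pi))
    return m / (1 + np.exp(-Z @ beta))

def deviance(X, m, mu):
    """Binomial deviance D(X, mu), cf. (6.4.17); 0*log(0) terms are 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        t = X * np.log(X / mu) + (m - X) * np.log((m - X) / (m - mu))
    return 2 * np.nansum(t)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])      # hypothetical covariate
m = np.full(5, 30)
X = np.array([3, 8, 14, 22, 27])
Z1 = np.column_stack([np.ones(5), z])          # omega_1: intercept + slope
Z0 = np.ones((5, 1))                           # omega_0: intercept only
mu1, mu0 = fit_logistic(Z1, X, m), fit_logistic(Z0, X, m)
D0, D1 = deviance(X, m, mu0), deviance(X, m, mu1)
# (6.5.6): D(Y, mu0) = [D(Y, mu0) - D(Y, mu1)] + D(Y, mu1)
print(D0, D1, D0 - D1)
```

The component D0 - D1 is the LR statistic for the submodel within ω₁ and is referred to a χ² distribution as described next.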
Formally, if ω₀ is a GLM of dimension p and ω₁ ⊃ ω₀ is one of dimension q with canonical links, then Δ(μ̂₀, μ̂₁) is thought of as being asymptotically χ²_{q-p} under H. This can be made precise for stochastic GLMs, obtained by conditioning on Z_1, ..., Z_n in the sample (Z_1, Y_1), ..., (Z_n, Y_n).

Asymptotic theory for estimates and tests

If (Z_1, Y_1), ..., (Z_n, Y_n) can be viewed as a sample from a population and the link is canonical, the theory of Sections 6.2 and 6.3 applies straightforwardly in view of the general smoothness properties of canonical exponential families. For instance, if we take Z_i as having marginal density q₀, then (Z_i, Y_i) has density

    p(z, y, β) = exp{ y z^T β - A₀(z^T β) } h(y) q₀(z).        (6.5.8)

This is not unconditionally an exponential family, in view of the A₀(z^T β) term. However, there are easy conditions under which the conditions of Theorems 6.2.2 and 6.3.3 hold (Problem 6.5.3). Thus, if we assume the covariates in logistic regression with canonical link to be stochastic, the MLE β̂ exists with probability tending to 1, is consistent, and satisfies

    √n(β̂ - β) → N(0, I₁⁻¹(β)).

What is I₁(β)? The efficient score function (∂/∂β) log p(z, y, β) is (y - A₀'(z^T β)) z, and so we obtain

    I₁(β) = E[ A₀''(Z^T β) Z Z^T ],                            (6.5.9)

which, in order to obtain approximate confidence procedures, can be estimated by its empirical version built from the sample covariates, n⁻¹ Σ_{i=1}^n A₀''(Z_i^T β̂) Z_i Z_i^T.

If we wish to test hypotheses such as H : β_1 = ... = β_d = 0, d < p, we can calculate

    2 log λ_n = 2 Σ_{i=1}^n [ l(Z_i, Y_i, β̂) - l(Z_i, Y_i, β̂_H) ]   (6.5.10)

where β̂_H is the (p x 1) MLE for the GLM with β_{p×1} = (0, ..., 0, β_{d+1}, ..., β_p)^T, and can conclude that the statistic of (6.5.10) is asymptotically χ²_d under H. Similar conclusions follow for the Wald and Rao statistics. Note that these tests can be carried out without knowing the density q₀ of Z_1. These conclusions remain valid for the usual situation in which the z_i are not random, but their proof depends on asymptotic theory for independent non-identically distributed variables, which we postpone to Volume II.

General link functions

Links other than the canonical one can be of interest. For instance, if in the binary data regression model of Section 6.4.3 we take g(μ) = Φ⁻¹(μ), so that π_i = Φ(z_i^T β), we obtain the so-called probit model. Cox (1970) also considers the variance stabilizing transformation, which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. As he points out, the results of analyses with these various transformations over the range 0.1 ≤ π ≤ 0.9 are rather similar. From the point of view of analysis for fixed n, noncanonical links can cause numerical problems because the models are now curved rather than canonical exponential families. Existence of MLEs and convergence-of-algorithm questions all become more difficult, and so canonical links tend to be preferred.

The generalized linear model

The GLMs considered so far force the variance of the response to be a function of its mean. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6.5.1) depend on an additional scalar parameter τ. It is customary to write the model as, for c(τ) > 0,

    p(y, η, τ) = exp{ c⁻¹(τ)(η^T y - A(η)) } h(y, τ).          (6.5.11)

Because ∫ p(y, η, τ) dy = 1,

    A(η)/c(τ) = log ∫ exp{ c⁻¹(τ) η^T y } h(y, τ) dy.          (6.5.12)

The left-hand side of (6.5.12) is of product form A(η)[1/c(τ)], whereas the right-hand side cannot always be put in this form. However, when it can, it is easy to see that

    E(Y) = Ȧ(η)                                                (6.5.13)
    Var(Y) = c(τ) Ä(η)                                         (6.5.14)

so that the variance can be written as the product of a function of the mean and a general dispersion parameter. Important special cases are the N(μ, σ²) and gamma (p, λ) families. For further discussion of this generalization see McCullagh and Nelder (1983, 1989).
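As a check on (6.5.13) and (6.5.14), take the familiar N(μ, σ²) case (a worked aside): with η = μ, A(η) = η²/2, and c(τ) = σ², the density factors as

    p(y, η, τ) = exp{ (ηy - η²/2)/σ² } h(y, σ²),  h(y, σ²) = (2πσ²)^{-1/2} exp{-y²/2σ²},

which is indeed the N(μ, σ²) density, and

    E(Y) = Ȧ(η) = η = μ,   Var(Y) = c(τ) Ä(η) = σ² · 1 = σ²,

as (6.5.13) and (6.5.14) require.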
Summary. We considered generalized linear models, defined as canonical exponential models in which the mean vector of the vector Y of responses can be written as a function, called the link function, of a linear predictor of the form Σ_j β_j z^(j), where the z^(j) are observable covariate vectors and β is a vector of regression coefficients. We considered the canonical link function, which corresponds to the model in which the canonical exponential model parameter equals the linear predictor. We discussed algorithms for computing MLEs of β. In the random design case, we used the asymptotic results of the previous sections to develop large sample estimation results, confidence procedures, and tests.

6.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS

As we most recently indicated in Example 6.2.2, the distributional and implicit structural assumptions of parametric models are often suspect. In Example 6.2.2, we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian. We found that if we assume an error density f₀, which is symmetric about 0, the resulting MLEs for β, optimal under f₀, continue to estimate β as defined by (6.2.19) even if the true errors are N(0, σ²) or, in fact, have any distribution symmetric around 0 (see the problems). That is, if we consider the semiparametric model

    P₁ = {P : (Z^T, Y)^T ~ P given by (6.2.19) with ε_i i.i.d. with density f for some f symmetric about 0},

these estimates remain consistent and asymptotically normal for β throughout P₁. There is another semiparametric model,

    P₂ = {P : (Z^T, Y)^T ~ P that satisfies E_P(Y | Z) = Z^T β, E_P Y² < ∞, E_P Z^T Z nonsingular},

for which, roughly speaking, the right thing to do if one assumes the (Z_i, Y_i) are i.i.d. is to "act as if the model were the one given in Example 6.2.2." For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II. Furthermore, if P₃ is the nonparametric model where we assume only that (Z, Y) has a joint distribution and if we are interested in estimating the best linear predictor μ_L(Z) of Y given Z, then the LSE of β is, in fact, still a consistent, asymptotically normal estimate of β and, under further mild conditions on P, so is any estimate solving the equations based on (6.2.31) with f₀ symmetric about 0. However, for estimating μ_L(Z) in a submodel of P₃ with

    Y = μ_L(Z) + σ₀(Z)ε                                        (6.6.1)

where ε is independent of Z but σ₀(Z) is not constant and σ₀ is assumed known, the LSE is not the best estimate of β. These issues, whose further discussion we postpone to Volume II, have a fixed covariate exact counterpart, the Gauss-Markov theorem, which is discussed below.

Robustness in Estimation

We drop the assumption that the errors ε_1, ..., ε_n in the linear model (6.1.3) are normal. Instead we assume the Gauss-Markov linear model where

    Y = Zβ + ε                                                 (6.6.2)

where Z is an n x p matrix of constants, β is p x 1, Y and ε are n x 1, and

    E(ε_i) = 0,  Var(ε_i) = σ²,  Cov(ε_i, ε_j) = 0, i ≠ j.

Many of the properties stated for the Gaussian case carry over to the Gauss-Markov case:

Proposition 6.6.1. If we replace the Gaussian assumptions on the errors ε_1, ..., ε_n in the linear model with the Gauss-Markov assumptions, then μ̂, β̂, and ε̂ are still unbiased and

    Var(μ̂) = σ²H,  Var(β̂) = σ²(Z^T Z)⁻¹,  Var(ε̂) = σ²(I - H).

Moreover, the conclusions (1) and (2) of Theorem 6.1.4 are still valid, and if p = r, so are those for β̂.

Theorem 6.6.1. (Gauss-Markov). Suppose the Gauss-Markov linear model (6.6.2) holds. Then, for any parameter of the form α = Σ_{i=1}^n a_i μ_i for some constants a_1, ..., a_n, the estimate α̂ = Σ_{i=1}^n a_i μ̂_i has uniformly minimum variance among all unbiased estimates linear in Y_1, ..., Y_n.

Proof. Let ᾱ = Σ_{i=1}^n c_i Y_i stand for any estimate linear in Y_1, ..., Y_n. Because E(ε_i) = 0, E(ᾱ) = Σ_{i=1}^n c_i μ_i, and

    Var_GM(ᾱ) = Σ_{i=1}^n c_i² Var(ε_i) + 2 Σ_{i<j} c_i c_j Cov(ε_i, ε_j) = σ² Σ_{i=1}^n c_i²,

where Var_GM refers to the variance computed under the Gauss-Markov assumptions. This computation shows that Var_GM(ᾱ) = Var_G(ᾱ), where Var_G(ᾱ) stands for the variance computed under the Gaussian assumption that ε_1, ..., ε_n are i.i.d. N(0, σ²). By Theorems 6.1.2(iv) and 6.1.3(iv), α̂ is unbiased and Var_G(α̂) ≤ Var_G(ᾱ) for all unbiased ᾱ. Because Var_G = Var_GM for all linear estimators, the result follows. □

Note that the preceding result and proof are similar to the result in Section 1.4, where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case.

Example 6.6.1. One Sample (continued). Here μ = β_1 and μ̂ = β̂_1 = Ȳ. The Gauss-Markov theorem shows that Ȳ, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with EY_i² < ∞. Moreover, our current μ̂ coincides with the empirical plug-in estimate of the optimal linear predictor (1.4.14). □
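A quick simulation (ours; any error law with the stated first two moments would do, here a centered exponential) illustrates the theorem in the one-sample model: Ȳ beats an arbitrary linear unbiased competitor even though the errors are far from normal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, mu = 20, 200_000, 1.0

# Two linear unbiased estimates of mu in the one-sample model:
w = np.ones(n) / n                            # Ybar (the LSE)
c = np.linspace(0.5, 1.5, n); c /= c.sum()    # another linear unbiased estimate

# Gauss-Markov errors that are not normal: centered exponential.
eps = rng.exponential(1.0, size=(reps, n)) - 1.0
Y = mu + eps
print(np.var(Y @ w), np.var(Y @ c))           # Ybar has the smaller variance
```

The comparison matches the proof: for linear estimates the variance is σ² Σ c_i², minimized by c_i = 1/n.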
However, a major weakness of the Gauss-Markov theorem is that it only applies to linear estimates. Consider the Laplace density

    (1/2λ) exp{-λ|y - μ|},  λ > 0, μ ∈ R, y ∈ R.

For this density and all symmetric densities, Ȳ is unbiased, but for n large Ȳ has a larger variance than the nonlinear estimate, the sample median (see the problems). Another weakness is that the theorem only applies to the homoscedastic case where Var(Y_i) is the same for all i. Suppose we have a heteroscedastic version of the linear model where E(ε) = 0 but Var(ε_i) depends on i. If the variances are known, we can use the Gauss-Markov theorem to conclude that the weighted least squares estimates of Section 2.2 are UMVU among linear estimates; when the variances are unknown, the LSE is not optimal even in the class of linear estimates.

Robustness of Tests

In Section 5.3 we investigated the robustness of the significance levels of t tests for the one- and two-sample problems using asymptotic and Monte Carlo methods. Now we will use asymptotic methods to investigate the robustness of levels more generally. The asymptotic behavior of the LR, Wald, and Rao tests depends critically on the asymptotic behavior of the underlying MLEs θ̂ and θ̂₀. From the theory developed in Section 6.2 we know that if Ψ(·, θ) = Dl(·, θ) and the observations X_i come from a distribution P, which does not belong to the model, we expect that θ̂_n → θ(P), where θ(P) is the unique solution of

    ∫ Ψ(x, θ) dP(x) = 0,                                       (6.6.3)

and that

    √n(θ̂ - θ(P)) → N(0, Σ(Ψ, P))                              (6.6.4)

with Σ(Ψ, P) as given in Section 6.2. Now consider H : θ = θ₀ and the Wald test statistic

    T_W = n(θ̂ - θ₀)^T I(θ₀)(θ̂ - θ₀).

If θ(P) ≠ θ₀, evidently T_W → ∞. But if θ(P) = θ₀, then T_W → V^T I(θ₀) V, where V ~ N(0, Σ(Ψ, P)). Because Σ(Ψ, P) ≠ I⁻¹(θ₀) in general, the asymptotic distribution of T_W is not χ². This observation holds for the LR and Rao tests as well, but see the problems. There is an important special case where all is well: the linear model we have discussed in Example 6.2.1.

Example 6.6.2. The Linear Model with Stochastic Covariates. Suppose (Z_1, Y_1), ..., (Z_n, Y_n) are i.i.d. as (Z, Y), where Z is a (p x 1) vector of random covariates and we model the relationship between Z and Y as

    Y = Z^T β + ε                                              (6.6.5)

with the distribution P of (Z, Y) such that ε and Z are independent, E_P ε = 0, E_P ε² < ∞. If β(P) is specified by (6.6.3), then β(P) equals β in (6.6.5) and σ²(P) = Var_P(ε). We consider H : β = 0, or more generally H : β_{q+1} = β⁰_{q+1}, ..., β_p = β⁰_p, or β ∈ L₀ + β₀, a q-dimensional affine subspace of R^p. When ε is N(0, σ²) (see Sections 6.1 and 6.2), the LR, Wald, and Rao tests all are equivalent to the F test: Reject if

    T_n = β̂^T Z_{(n)}^T Z_{(n)} β̂ / (p s²) ≥ f_{p,n-p}(1 - α)  (6.6.6)

where β̂ is the LSE, Z_{(n)} = (Z_1, ..., Z_n)^T, and

    s² = (1/(n - p)) Σ_{i=1}^n (Y_i - Ŷ_i)² = |Y - Z_{(n)}β̂|²/(n - p).

Now, by the law of large numbers,

    n⁻¹ Z_{(n)}^T Z_{(n)} = n⁻¹ Σ_{i=1}^n Z_i Z_i^T → E(ZZ^T)   (6.6.7)

and s² → σ²(P), so that

    √n(β̂ - β) → N(0, σ²(P)[E(ZZ^T)]⁻¹).                        (6.6.8)

It follows by Slutsky's theorem that, under H : β = 0, the limiting distribution of p T_n is χ²_p even if ε is not Gaussian. Because p f_{p,n-p}(1 - α) → x_p(1 - α), the (1 - α) quantile of χ²_p, the test (6.6.6) has asymptotic level α even if the errors are not Gaussian. Moreover, it is still true that confidence procedures based on approximating Var(β̂) by

    (Z_{(n)}^T Z_{(n)})⁻¹ s²                                    (6.6.9)

are still asymptotically of correct level. This kind of robustness holds for H : β_{q+1} = β⁰_{q+1}, ..., β_p = β⁰_p as well. It is intimately linked to the fact that even though the parametric model on which the test was based is false, (i) the set of P satisfying the hypothesis remains the same, and (ii) the (asymptotic) variance of β̂ is the same as under the Gaussian model. □
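The level robustness claimed in Example 6.6.2 is easy to check by simulation. The following sketch (ours; it assumes numpy and scipy, with centered exponential errors as one non-Gaussian choice) estimates the rejection rate of the F test (6.6.6) under H : β = 0:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)
n, p, alpha, reps = 50, 3, 0.05, 20_000
crit = f_dist.ppf(1 - alpha, p, n - p)

rejections = 0
for _ in range(reps):
    Z = rng.normal(size=(n, p))                 # stochastic covariates
    eps = rng.exponential(1.0, size=n) - 1.0    # non-Gaussian, mean-zero errors
    Y = eps                                     # H: beta = 0 is true
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]
    s2 = np.sum((Y - Z @ beta) ** 2) / (n - p)
    T = beta @ (Z.T @ Z) @ beta / (p * s2)      # the F statistic (6.6.6)
    rejections += T >= crit
print(rejections / reps)                        # close to alpha = 0.05
```

The estimated level is near 0.05, as the asymptotics predict, because conditions (i) and (ii) above hold here.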
If the first of these conditions fails, then it is not clear what H : θ = θ₀ means anymore. If it holds but the second fails, the theory goes wrong. We illustrate with two final examples.

Example 6.6.3. The Linear Model with Stochastic Covariates with ε and Z Dependent. Suppose E(ε | Z) = 0, so that the parameters β of (6.6.5) are still meaningful, but ε and Z are dependent. That is, we assume the variances are heteroscedastic: let the distribution of ε given Z = z be that of σ(z)ε', where ε' is independent of Z, and suppose without loss of generality that Var(ε') = 1, so that

    Var(ε | Z) = σ²(Z).                                        (6.6.10)

Then H is still meaningful. The conditions of Theorem 6.2.1 clearly hold, and

    √n(β̂ - β) → N(0, Q)                                        (6.6.11)

where

    Q = [E(ZZ^T)]⁻¹ E(σ²(Z) ZZ^T) [E(ZZ^T)]⁻¹,                 (6.6.12)

but

    Q ≠ σ²(P)[E(ZZ^T)]⁻¹                                       (6.6.13)

in general, so that (6.6.7) fails to identify the asymptotic variance with its Gaussian-model value and the test (6.6.6) does not have asymptotic level α in general. An important special case is the stochastic version of the d-sample model of Section 6.1, in which Z is the vector of indicators (1[I = 1], ..., 1[I = d])^T and (I_1, ..., I_d) has a multinomial (λ_1, ..., λ_d) distribution, 0 < λ_j < 1, Σ_{j=1}^d λ_j = 1. It is easy to see that our tests of H : β_2 = ... = β_d = 0 fail to have correct asymptotic levels in general unless σ²(Z) is constant or λ_1 = ... = λ_d = 1/d (see the problems). A solution is to replace Z_{(n)}^T Z_{(n)}/s² in (6.6.6) by Q̂⁻¹, where Q̂ is a consistent estimate of Q. For d = 2 this is just the asymptotic solution of the Behrens-Fisher problem, the two-sample problem with unequal sample sizes and variances, discussed in Section 4.9.4. □

Example 6.6.4. The Two-Sample Scale Problem. Suppose our model is that X = (X^(1), X^(2)), where X^(1) and X^(2) are independent exponential, E(θ_1) and E(θ_2), the lifetimes of paired pieces of equipment. If we take H : θ_1 = θ_2, the standard Wald test is

    "Reject iff n(θ̂_1 - θ̂_2)²/σ̂² ≥ x_1(1 - α)"               (6.6.14)

where θ̂_j = X̄^(j) and σ̂ = √2 (1/2n) Σ_{i=1}^n (X_i^(1) + X_i^(2)). However, suppose now that X^(1)/θ_1 and X^(2)/θ_2 are identically distributed but not exponential. Then H is still meaningful, but σ̂ → √2 E_P(X^(1)), which differs from [2 Var_P X^(1)]^{1/2} in general, and the test (6.6.14) does not have the correct level. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general: simply replace σ̂² by

    σ̃² = (1/n) Σ_{i=1}^n { (X_i^(1) - X̄)² + (X_i^(2) - X̄)² },  X̄ = (X̄^(1) + X̄^(2))/2.   (6.6.15)

□

The simplest solution, at least for Wald tests, is to use as an estimate of [Var √n θ̂]⁻¹ not I(θ̂) or -n⁻¹ D² l_n(θ̂) but the inverse of the so-called "sandwich estimate" (Huber, 1967),

    Σ̂ = [-n⁻¹ D² l_n(θ̂)]⁻¹ [ n⁻¹ Σ_{i=1}^n Dl(X_i, θ̂) Dl^T(X_i, θ̂) ] [-n⁻¹ D² l_n(θ̂)]⁻¹.   (6.6.16)
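For the linear model, Dl(X_i, θ) is proportional to ε_i Z_i, so (6.6.16) reduces to the familiar heteroscedasticity-consistent variance estimate. A minimal sketch (ours; the heteroscedastic data are illustrative) for the LSE:

```python
import numpy as np

def sandwich_var(Z, Y):
    """Sandwich estimate of Var(beta_hat) for the LSE in Y = Z beta + eps.

    A^{-1} B A^{-1} with A = Z^T Z / n (minus the Hessian of the Gaussian
    log likelihood, up to scale) and B = sum of e_i^2 z_i z_i^T / n.
    """
    n = len(Y)
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]
    e = Y - Z @ beta                             # residuals
    A_inv = np.linalg.inv(Z.T @ Z / n)
    B = (Z * (e ** 2)[:, None]).T @ Z / n
    return A_inv @ B @ A_inv / n                 # estimates Var(beta_hat)

rng = np.random.default_rng(2)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = 0.5 + np.abs(Z[:, 1])                    # heteroscedastic sigma(z)
Y = Z @ np.array([1.0, 2.0]) + sigma * rng.normal(size=n)
print(np.sqrt(np.diag(sandwich_var(Z, Y))))      # robust standard errors
```

Wald tests and intervals built from these standard errors retain their asymptotic level even though σ²(Z) is not constant.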
To summarize: if hypotheses remain meaningful when the model is false, then, in general, the LR, Wald, and Rao tests need to be modified to continue to be valid asymptotically.

Summary. We considered the behavior of estimates and tests when the model that generated them does not hold. The Gauss-Markov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i.i.d. N(0, σ²) assumption on the errors is replaced by the assumption that the errors have mean zero and identical variances and are uncorrelated, provided that we restrict the class of estimates to linear functions of Y_1, ..., Y_n. In the linear model with a random design matrix, we showed that the MLEs and tests generated by the model with i.i.d. N(0, σ²) errors are still reasonable when the true error distribution is not Gaussian; in particular, the confidence procedures derived from the normal model of Section 6.1 are still approximately valid, as are the LR, Wald, and Rao tests. We also demonstrated that when either the hypothesis H or the variance of the MLE is not preserved when going to the wider model, the MLE and LR procedures for the narrower model will fail asymptotically in the wider model, and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.

6.7 PROBLEMS AND COMPLEMENTS

Problems for Section 6.1

1. For the canonical linear Gaussian model with σ² unknown, show that (i) the MLE does not exist if n = r, and (ii) if n ≥ r + 1, the MLE (η̂_1, ..., η̂_r, σ̂²)^T of (η_1, ..., η_r, σ²)^T is (U_1, ..., U_r, n⁻¹ Σ_{i=r+1}^n U_i²)^T.

2. In the canonical linear Gaussian model with both η and σ² unknown, use Theorem 3.4.3 to compute the information lower bound on the variance of an unbiased estimator of σ². Compare this bound to the variance of s². You may use Var(U_i²) = 2σ⁴. Specialize to the case Z = (1, ..., 1)^T.

3. Show that μ̂ of Example 6.1.2 with p = r coincides with the empirical plug-in estimate of μ_L = (μ_L1, ..., μ_Ln)^T, where μ_Li = μ_Y + (z_i' - μ_z)β, μ_Y = β_1, and β and μ_z are as defined in (1.4.14). Here the empirical plug-in estimate is based on (z_i', Y_i), i = 1, ..., n, with z_i' = (z_i2, ..., z_ip)^T.

4. Show that λ̃(Y) defined in Remark 6.1.2 coincides with the likelihood ratio statistic λ(Y) for the σ² known case with σ² replaced by σ̂².

5. Let Y_i denote the response of a subject at time i and suppose that Y_i satisfies the model

    Y_i = θ + ε_i,  ε_i = c e_{i-1} + e_i,  i = 1, ..., n,  e_0 = 0,

where c ∈ [0, 1] is a known constant and e_1, ..., e_n are independent and identically distributed with mean zero and variance σ² (the ε_i are called moving average errors). Let θ̃ denote the weighted least squares estimate of θ.
(a) Find θ̃ explicitly.
(b) Show that if the e_i are N(0, σ²), then θ̃ is the MLE of θ.
(c) Show that Ȳ and θ̃ are unbiased.
(d) Show that Var(θ̃) ≤ Var(Ȳ).
(e) Show that Var(θ̃) < Var(Ȳ) unless c = 0.

6. Consider the model (see Example 1.1.5) Y_i = θ + e_i, i = 1, ..., n, where e_i = c e_{i-1} + ε_i, e_0 = 0, and the ε_i are as in Problem 5 (autoregressive errors). Find the MLE θ̂ of θ when the ε_i are N(0, σ²).

7. Derive the formula (6.1.28) for the noncentrality parameter θ² in the regression example.

8. Derive the formula (6.1.29) for the noncentrality parameter θ² in the one-way layout.

9. Show that in the regression example with p = r = 2, the 100(1 - α)% confidence interval for β_1 in the Gaussian linear model is

    β̂_1 ± s t_{n-2}(1 - α/2) [ 1/n + z̄_2² / Σ_{i=1}^n (z_i2 - z̄_2)² ]^{1/2}

and that a level (1 - α) confidence interval for σ² is given by

    (n - 2)s²/x_{n-2}(1 - α/2) ≤ σ² ≤ (n - 2)s²/x_{n-2}(α/2).

10. Show that if p = r = 2 in Example 6.1.2, then the hat matrix H = (h_ij) is given by

    h_ij = 1/n + (z_i2 - z̄_2)(z_j2 - z̄_2) / Σ_{k=1}^n (z_k2 - z̄_2)².

11. In the one-way layout,
(a) Show that level (1 - α) confidence intervals for linear functions of the form β_j - β_i are given by

    (β̂_j - β̂_i) ± s t_{n-p}(1 - α/2) (1/n_i + 1/n_j)^{1/2}.

(b) Find confidence intervals for ψ = ½(β_2 + β_3) - β_1.

12. Show that for the estimates α̂ and δ̂_k in the one-way layout,
(a) Var(α̂) = (σ²/p²) Σ_{k=1}^p 1/n_k and

    Var(δ̂_k) = σ² [ (p - 1)²/(p² n_k) + Σ_{i≠k} 1/(p² n_i) ].

(b) If n is fixed and divisible by p, then Var(α̂) is minimized by choosing n_i = n/p.
(c) If n is fixed and divisible by 2(p - 1), then Var(δ̂_k) is minimized by choosing n_k = n/2 and n_i = n/2(p - 1), i ≠ k.

13. Let Y_1, ..., Y_n be a sample from a population with mean μ and variance σ², where n is even. Consider the three estimates T_1 = Ȳ, T_2 = (1/2n) Σ_{i=1}^{n/2} Y_i + (3/2n) Σ_{i=n/2+1}^n Y_i, and T_3, the sample median.
(a) Why can you conclude that T_1 has a smaller MSE (mean square error) than T_2?
(b) Which estimate has the smallest MSE for estimating μ?

14. Assume the linear regression model with p = r = 2. We want to predict the value of a future observation Y to be taken at the point z. Note that Y is independent of Y_1, ..., Y_n.
(a) Find a level (1 - α) confidence interval for the best MSPE predictor E(Y) = β_1 + β_2 z.
(b) Find a level (1 - α) prediction interval for Y, that is, statistics t(Y_1, ..., Y_n) and t̄(Y_1, ..., Y_n) such that P[t ≤ Y ≤ t̄] ≥ 1 - α.

15. Often a treatment that is beneficial in small doses is harmful in large doses. The following model is useful in such situations. Consider a covariate x, which is the amount or dose of a treatment, and a response variable Y, which is yield or production. Suppose a good fit is obtained by the equation

    Y = e^{β_1} e^{β_2 x} x^{β_3},

that is, by the model

    log Y_i = β_1 + β_2 x_i + β_3 log x_i + ε_i,  i = 1, ..., n,

where the ε_i are independent N(0, σ²).
(a) For the following data (from Hald, 1952, p. 653), giving yield Y and amount of nitrogen x, compute level 0.95 confidence intervals for β_1, β_2, and β_3:

    Y (yield): …        x (nitrogen): …

(b) Plot (x_i, Y_i) and (x_i, Ŷ_i), where Ŷ_i is the fitted value. Find the value of x that maximizes the estimated yield Ŷ = e^{β̂_1} e^{β̂_2 x} x^{β̂_3}.
Hint: Do the regression for μ_i = β_1 + β_2 z_i1 + β_3 z_i2, where z_i1 = x_i - x̄ and z_i2 = log x_i - n⁻¹ Σ log x_i. You may use x̄ = 0.0289.

16. In the Gaussian linear model, show that the parametrization (β, σ²)^T is identifiable if and only if r = p.

17. Show that if C is an n x r matrix of rank r, then the r x r matrix C^T C is of rank r and, hence, nonsingular.
Hint: Because C^T is of rank r, xC^T = 0 implies x = 0 for any r-vector x. But xC^T C = 0 implies ||xC^T||² = xC^T C x^T = 0, hence xC^T = 0.

Problems for Section 6.2

1. Check A0-A6 when θ = (μ, σ²) and p(x, θ) = φ_{(μ,σ²)}(x), that is, P = {N(μ, σ²) : μ ∈ R, σ² > 0}, where φ_{(μ,σ²)} is the N(μ, σ²) density.

2. Let θ, P, and p be as in Problem 6.2.1 and let Q be the class of distributions with densities of the form

    (1 - ε) φ_{(μ,σ²)}(x) + ε φ_{(μ,τ²)}(x),  τ² > 0, 0 < ε < 1.

(a) Show that A0-A4 and A6 hold for the model Q₀ with densities of the form ½ φ_{(μ,σ²)}(x) + ½ φ_{(μ,τ²)}(x), τ > 0.
(b) Show that for Q the MLE of (μ, σ², τ²) does not exist, so that A6 does not hold.

3. Establish (6.2.24) directly as follows:
(a) Show that, given Z_{(n)}, √n(β̂ - β) has a multivariate normal distribution with mean 0 and variance σ²(n⁻¹ Z_{(n)}^T Z_{(n)})⁻¹, and that σ̂² is independent of this vector with nσ̂²/σ² having a χ²_{n-p} distribution.
(b) Apply the law of large numbers to conclude that

    n⁻¹ Z_{(n)}^T Z_{(n)} → E(ZZ^T) in probability.

(c) Apply Slutsky's theorem to conclude that √n(β̂ - β) → N(0, σ²[E(ZZ^T)]⁻¹) and, hence,
(d) that (β̂ - β)^T Z_{(n)}^T Z_{(n)} (β̂ - β) is bounded in probability.
(e) Show that σ̂² is unconditionally independent of β̂.
(f) Combine (a)-(e) to establish (6.2.24).

4. In Example 6.2.1, show that the assumptions of Theorem 6.2.2 hold if (i) and (ii) hold.

5. Fill in the details of the proof of Theorem 6.2.1.

6. In some cases it is possible to find T(X) such that the distribution of X given T(X) = t is Q_θ, which depends on θ but not on H ∈ H, for models {P_{(θ,H)} : θ ∈ Θ, H ∈ H} with Θ Euclidean and H abstract. The MLE of θ based on the conditional distribution of X given T(X) = t is then called a conditional MLE.
(a) Show that in Example 6.2.1, if we identify X = (Z_{(n)}, Y) and T(X) = Z_{(n)}, then (β̂, σ̂²) are conditional MLEs.
Hint: f_X(x) = f_{Y|Z}(y | z) f_Z(z).
(b) Show that the conditional MLE of β minimizes |Y - Z_{(n)}β|².

7. Suppose that the distribution of Z is not known, so that the model is semiparametric. Show that

    ([E ZZ^T]⁻¹)_{(1,1)} ≥ [E Z_1²]⁻¹

with equality if and only if E(Z_1 Z_j) = 0 for j > 1.

8. In Example 6.2.2, show that c(f₀) = σ₀/σ is 1 if f₀ is normal and is different from 1 if f₀ is logistic.

9. Let Y_1, ..., Y_n be independent identically distributed with

    Y_i = μ + σ ε_i

where μ ∈ R and σ > 0 are unknown and ε has known density f > 0 such that if ρ(x) = -log f(x), then ρ'' > 0 and, hence, ρ is strictly convex. Examples are f Gaussian and f logistic, f(x) = e^{-x}(1 + e^{-x})^{-2}.
(a) Show that if σ = σ₀ is assumed known, a unique MLE μ̂ for μ exists and uniquely solves

    Σ_{i=1}^n ψ((Y_i - μ)/σ₀) = 0,  ψ = ρ'.

(b) Show that if μ = μ₀ is assumed known, a unique MLE σ̂ for σ exists and uniquely solves the corresponding likelihood equation.
(c) Construct a method of moments estimate θ̃ of θ = (μ, σ) based on the first two moments that is √n consistent.
(d) Deduce from Problem 6.2.10 that the estimate obtained as the limit of Newton-Raphson iterates starting from θ̃ is efficient.

10. Suppose A0-A4 hold and θ̂'_n is √n consistent, that is, θ̂'_n = θ₀ + O_p(n^{-1/2}).
(a) Let θ̃_n be the first iterate of the Newton-Raphson algorithm for solving (6.2.1) starting at θ̂'_n,

    θ̃_n = θ̂'_n - [ n⁻¹ Σ_{i=1}^n Dψ(X_i, θ̂'_n) ]⁻¹ [ n⁻¹ Σ_{i=1}^n ψ(X_i, θ̂'_n) ].

Show that θ̃_n satisfies (6.2.3).
Hint:

    n⁻¹ Σ_{i=1}^n ψ(X_i, θ̂'_n) = n⁻¹ Σ_{i=1}^n ψ(X_i, θ₀) + [ n⁻¹ Σ_{i=1}^n Dψ(X_i, θ₀) + o_p(1) ](θ̂'_n - θ₀).

(b) Show that under A0-A4 there exists ε > 0 such that, with probability tending to 1, Σ_{i=1}^n ψ(X_i, θ) = 0 has a unique solution θ̂_n in S(θ₀, ε), the ε-ball about θ₀.
Hint: You may use a uniform version of the inverse function theorem: if g_n : R^d → R^d are such that
(i) sup{ |Dg_n(θ) - Dg(θ)| : |θ - θ₀| ≤ ε } → 0,
(ii) g_n(θ₀) → g(θ₀),
(iii) Dg(θ₀) is nonsingular,
(iv) Dg(θ) is continuous at θ₀,
.6). •. log Ai = ElI + (hZi. Suppose responses YI . Wald.. i . P(Ai). . P I .3. .2. converges to the unique root On described in (b) and that On satisfies 1 1 • j. t• where {31. ~. LJ j=2 Differentiate with respect to {31. and ~ . Hint: Write n J • • • i I • DY. > 0 such that gil are 1 .3.1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1. = 131 (Zil ..3. 1 Zw Find the asymptotic likelihood ratio. .27). 0 < ZI < . Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n . and Rao tests for testing H : Ih.(Z" . Zip)) + L j=2 p "YjZij + €i J ...' " 13.(Z"  ~(') z. N(O. Suppose that Wo is given by (6.2.(j) l..12)./3)' = L i=1 1 CjZiU) n p . i2... . I '" LJ i=l ~(1) Y.. . •.2 holds for eo as given in (6. 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}. Zi. i=l • Z. .O)..Zi )13. Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution.26) and (6. Similarly compute the infonnation matrix when the model is written as Y. a 2 ). Establish (6.3 i 1. for "II sufficiently large. Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj.l hold for p(x. = versus K : O oF 0.. iteration of the NewtonRaphson algorithm starting at 0.d.. Y. Thus. f. Problems for Section 6.ip range freely and Ei are i. < Zn 1 for given covariate values ZI.i. Inference in the Multiparameter Case Chapter 6 > O. {Vj} are orthonormal. II. that Theorem 6. 'I) = p(.II(Zil I Zi2.3). q(. )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3. . . hence. 2 ° 2. then it converges to that solution.12) and the assumptions of Theorem (6.. . 'I) : 'I E 3 0 } and. Show that if 3 0 = {'I E 3 : '1j = 0.~_.3. there exists a j and their image contains a ball S(g(Oo). • (6.. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X.O E e. . CjIJtlZ. 1 Yn are independent Poisson variables with Yi .. minimizing I:~ 1 (Ii 2 i.428 then.2.
hence. OJ}..(X" .X.3. Oil + Jria(00 . . Assume that Pel of Pea' and that for some b > 0. Let (Xi. Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo. 0 ») is nK(Oo.n 7J(Bo... .(I(X i . N(Oz.. (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0).2. (ii) E" (a) Let ).. .gr. Consider testing H : (j = 00 versus K : B = B1 . > 0 or IJz > O.(X1 .) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('..) where k(Bo. q + 1 <j< rl.IJ) where q(.Section 6.n:c"' 4c:2=9 3.o log (X 0) PI.Xn ) be the likelihood ratio statistic.d. X n Li. Show that under H. p . = ~ 4. respectively.'1 8 0 = and. even if ~ b = 0. N(IJ" 1).n) satisfy the conditions of Problem 6. IJ.. \arO whereal. Suppose 8)" > 0.2. j = 1. 1)..aq are orthogonal to the linear span of ~(1J0) .3.'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio.7 Problems and C:com"p"":c'm"'. < 00.3 is valid. . B). with density p(.. There exists an open ball about 1J0. Consider testing H : 8 1 = 82 = 0 versus K : 0.. pXd . {IJ E S(lJo): ~j(lJ) = 0. . Let e = {B o . q + 1 <j < r S(lJo) .. Vi). =  p(·. Xi.) 1(Xi .3. (b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ . with Xi and Y. 1 5. 210g.l. . 1 < i < n. aTB. (Adjoin to 9q+l. IJ. . Suppose that Bo E (:)0 and the conditions of Theorem 6.'" . Oil and p(X"Oo) = E. . Deduce that Theorem 6..3 hold.. S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo). 'I) S(lJo)} then. Testing Simple versus Simple.) O.d. independent. . be ii.'1) and Tin '1(lJ n) and. q('..
(h) : 0 < 0.. ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0. 1988). (b) Suppose Xil Yi e 4 (e) Let (X" Y. under H. • i . I . XI and X~ but with probabilities ~ . be i. ii. 0. Sucb restrictions are natural if. /:1. for instance. In the model of Problem 5(a) compute the MLE (0.2 for 210g. 1 < i < n. Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson.. > OJ. 0.. 2 log >'( Xi.3. 1) with probability ~ and U with the same distribution. = (T20 = 1 andZ 1= X 1.. Let Bi l 82 > 0 and H be as above.i.0. . O > 0. which is a mixture of point mass at O. Yi : 1 and X~ with probabilities ~. Wright. Then xi t. 7. Po) distribution and (Xi.\(X).3.\( Xi. j ~ ~ 6.. (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular.) have an N. Show that (6.19) holds.) if 0.0.2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6.0"10' O"~o. Exhibit the null distribution of 2 log . Po .d.)) ~ N(O.~.  0.. =0. = 0..0).=0 where U ~ N(O. are as above with the same hypothesis but = {(£It. and ~ where sin. . < . li). > 0. = v'1~c2' 0 <.430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n. > 0. Show that 2log. = a + BB where .1.) under the model and show that (a) If 0. and Dykstra. mixture of point mass at 0. (iii) Reparametrize as in Theorem 6.6. 2 e( y'n(ii. < cIJ"O.3. Z 2 = poX1Y v' .0.0.1. we test the efficacy of a treatment on the basis of two correlated responses per individual. (d) Relate the result of (b) to the result of Problem 4(a).\(Xi. ±' < i < n) is distributed as a respectively.li : 1 <i< n) has a null distribution. (0.6. Hint: ~ (i) Show that liOn) can be replaced by 1(0).  OJ. ' · H In!: Consl'defa1O . (b) If 0. Hint: By sufficiency reduce to n = 1. Yi : l<i<n).
8. Show that under A0-A5, and A6 for θ̂_n^(1),

    √n(θ̂_n^(1) - θ₀^(1)) → N(0, Σ_1(θ₀))

where Σ_1(θ₀) is given by (6.3.21).
Hint: Write the estimating equation for θ̂_n^(1) and apply Theorem 6.2.1.

9. Show that under A2, A3, and A6, (6.3.22) is a consistent estimate of Σ_1(θ₀), given that I(θ) is continuous.

10. Under conditions A0-A6 for (a), and A0-A6 with A6 for θ̂^(1) for (b), establish that [-n⁻¹ D² l_n(θ̂_n)]⁻¹ is a consistent estimate of I⁻¹(θ₀).
Hint: Argue as in the corresponding one-dimensional result of Chapter 5, applying Theorem 6.2.2.

Problems for Section 6.4

1. In the 2 x 2 contingency table model, let X_i = 1 or 0 according as the ith individual sampled is an A or Ā, and let Y_i = 1 or 0 according as the ith individual sampled is a B or B̄.
(a) Show that the correlation of X_1 and Y_1 is

    ρ = [ P(A ∩ B) - P(A)P(B) ] / [ P(A)(1 - P(A)) P(B)(1 - P(B)) ]^{1/2}.

(b) Show that the sample correlation coefficient r studied in Example 5.3.6 is related to Z of (6.4.8) by Z = √n r.
(c) Conclude that if A and B are independent, 0 < P(A) < 1, 0 < P(B) < 1, then Z has a limiting N(0, 1) distribution.

2. Exhibit the two solutions of (6.4.4) explicitly and find the one that corresponds to the maximizer of the likelihood.

3. (a) Show that for any 2 x 2 contingency table, the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero.
(b) Deduce that χ² = Z², where Z is given by (6.4.8).
(c) Derive the alternative form

    Z = √n (N_11 N_22 - N_12 N_21) / (R_1 R_2 C_1 C_2)^{1/2}

for Z of (6.4.8).

4. Let N_ij be the entries of an a x b contingency table with associated probabilities θ_ij, and let η_i1 = Σ_{j=1}^b θ_ij, η_j2 = Σ_{i=1}^a θ_ij. Consider the hypothesis H : θ_ij = η_i1 η_j2 for all i, j.
(a) Show that the maximum likelihood estimates of η_i1, η_j2 under H are given by

    η̂_i1 = R_i/n,  η̂_j2 = C_j/n,

where R_i = Σ_j N_ij and C_j = Σ_i N_ij.
Hint: Consider the likelihood as a function of η_i1, η_j2 only.
(b) Deduce that Pearson's χ² is given by (6.4.9) and has approximately a χ²_{(a-1)(b-1)} distribution under H.

5. Let (N_11, N_12, N_21, N_22) ~ M(n; θ_11, θ_12, θ_21, θ_22) as in the contingency table, and let R_i = N_i1 + N_i2, C_j = N_1j + N_2j.
(a) Show that, given R_1 = r_1 and R_2 = r_2 = n - r_1, N_11 and N_21 are independent with

    N_11 ~ B(r_1, θ_11/(θ_11 + θ_12)),  N_21 ~ B(r_2, θ_21/(θ_21 + θ_22)).

(b) Show that θ_11/(θ_11 + θ_12) = θ_21/(θ_21 + θ_22) if and only if R_1 and C_1 are independent.
(c) Show that under independence the conditional distribution of N_11 given R_1 = r_1, C_1 = c_1 is hypergeometric, H(c_1, n, r_1).

6. Fisher's Exact Test.
(a) From the result of Problem 6.4.5, deduce that if j(α) (depending on r_1, c_1) can be chosen so that

    P[ N_11 > j(α) | R_1 = r_1, C_1 = c_1 ] = α,

then the test that rejects (conditionally on R_1 = r_1, C_1 = c_1) if N_11 > j(α) is exact level α. This is known as Fisher's exact test.
(b) How would you, in principle, use this result to construct a test of H with probability of type I error independent of η_i1, η_j2? It may be shown (see Volume II) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5.4.54).
Hint: (b) It is easier to work with N_22. Argue that the Fisher test is equivalent to rejecting H if N_22 ≥ q_2 + n - (r_1 + c_1) or N_22 ≤ q_1 + n - (r_1 + c_1), where, given R_1 = r_1, C_1 = c_1, N_22 is conditionally hypergeometric.

7. (a) If A, B, C are three events, consider the assertions
(i) P(A ∩ B | C) = P(A | C)P(B | C)    (A, B independent given C);
(ii) P(A ∩ B | C̄) = P(A | C̄)P(B | C̄)  (A, B independent given C̄);
(iii) P(A ∩ B) = P(A)P(B)             (A, B independent),
where C̄ is the complement of C. Show that (i) and (ii) imply (iii) if A and C are independent or B and C are independent.
(b) Construct an experiment and three events for which (i) and (ii) hold but (iii) does not.
(c) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. Test in both cases whether the events [being a man] and [being admitted] are independent. Then combine the two tables into one and perform the same test on the resulting table. Give p-values for the three cases.

              Admit   Deny                   Admit   Deny
    Men        235      35   270     Men      122      93   215
    Women       38       7    45     Women    103      69   172
               273      42           n = 315  225     162   n = 387

(d) Relate your results to the phenomenon discussed in (a) and (b).

8. The following table gives the number of applicants to the graduate program of a small department of the University of California, classified by sex and admission status.

              Admit   Deny
    Men         19      …
    Women        …      …

Would you accept or reject the hypothesis of independence at the 0.05 level (a) using the χ² test with approximate critical value? (b) using Fisher's exact test of Problem 6.4.6?

9. Establish (6.4.14).

10. Suppose that we know that β_1 = 0 in the logistic model ξ_i = β_1 + β_2 z_i, with z_1, ..., z_k not all equal, and that we wish to test H : β_2 ≤ β_2⁰ versus K : β_2 > β_2⁰. Show that, for suitable α, there is a UMP level α test, which rejects if and only if Σ_{i=1}^k z_i N_i ≥ k(α), where k(α) satisfies P_{β_2⁰}[ Σ_{i=1}^k z_i N_i ≥ k(α) ] = α.

11. (a) Compute the Rao test for H : β_2 = β_2⁰ in the model of Problem 6.4.10 and show that it agrees with the test of that problem.
(b) Suppose that β_1 is unknown. Compute the Rao test statistic for H : β_2 = β_2⁰ in this case.
(c) By conditioning on Σ_{i=1}^k X_i and using the approach of Problem 6.4.6, construct an exact test (with level independent of β_1).

12. In the binomial one-way layout, show that the LR test is asymptotically equivalent to Pearson's χ² test in the sense that

    2 log λ - χ² → 0 in probability under H.

13. Let X_1, ..., X_k be independent with X_i ~ N(θ_i, σ²), 1 ≤ i ≤ k, where either σ² = σ₀² (known) and θ_1, ..., θ_k vary freely, or θ_i = θ_i0 (known), i = 1, ..., k, and σ² is unknown. Show that the likelihood ratio test of H : θ_1 = θ_10, ..., θ_k = θ_k0, σ² = σ₀² is of the form: Reject if

    (1/σ₀²) Σ_{i=1}^k (X_i - θ_i0)² ≥ k_2 or ≤ k_1.

This is an approximation (for large k, n) to, and a simplification of, a model under which (N_1, ..., N_k) ~ M(n, θ_10, ..., θ_k0) under H, but under K may be either multinomial with θ ≠ θ_0 or may have E(N_i) = nθ_i0 but Var(N_i) < nθ_i0(1 - θ_i0) ("cooked data").

14. Suppose the z_i in Problem 6.4.10 are obtained as realizations of i.i.d. Z_1, ..., Z_k so that the (Z_i, X_i) are i.i.d. with (X_i | Z_i = z) ~ B(m, π(z)).
(a) Show that, if the design matrix has rank p, then β̂₀ as defined by (6.4.15) is consistent.
(b) Show that the conditions A0-A6 hold when Z_1 ∈ {z^(1), ..., z^(k)}. Use this to imitate the argument of Theorem 6.2.2, which is valid for the i.i.d. case.

Problems for Section 6.5

1. Fisher's Method of Scoring. The following algorithm for solving likelihood equations was proposed by Fisher; see Rao (1973). Given an initial value θ̂₀, define iterates

    θ̂_{m+1} = θ̂_m + I⁻¹(θ̂_m) Dl(θ̂_m).

Show that for GLMs with canonical link this method coincides with the Newton-Raphson method of Section 2.4.3.

2. Verify that the correction Δ_{m+1} of (6.5.4) is as claimed by the weighted least squares formula (2.2.20) for the regression described after (6.5.4).

3. Let Y_1, ..., Y_n be independent responses and suppose the distribution of Y_i depends on a covariate vector z_i. Assume that there exist functions h(y, τ), b(θ), and c(τ) such that the model for Y_i can be written as

    p(y, θ_i, τ) = h(y, τ) exp{ [θ_i y - b(θ_i)]/c(τ) }

where τ is known, and b' and g are monotone. Set μ_i = E(Y_i), η_i = g(μ_i) = z_i^T β, and v(μ) = Var(Y)/c(τ) = b''(θ).
(a) Show that the likelihood equations are

    Σ_{i=1}^n (y_i - μ_i) z_ij / [ v(μ_i) g'(μ_i) ] = 0,  j = 1, ..., p.

Hint: By the chain rule,

    ∂l/∂β_j (y, θ) = (∂l/∂θ)(dθ/dμ)(dμ/dη)(∂η/∂β_j).

(b) Show that the Fisher information is

    I(β) = Z_D^T W Z_D

where Z_D = ||z_ij|| is the design matrix and W = diag(w_1, ..., w_n), w_i = w(μ_i) = 1/[v(μ_i)(dη_i/dμ_i)²].
(c) Suppose that (Z_1, Y_1), ..., (Z_n, Y_n) are i.i.d. as (Z, Y) and that, given Z = z, Y follows the model p(y, θ(z), τ), where θ(z) solves b'(θ) = g⁻¹(z^T β). Under appropriate conditions, give the asymptotic distribution of √n(β̂ - β).
(d) Gaussian GLM. Suppose Y_i ~ N(μ_i, σ₀²), σ₀² known. Give h(y, τ), b(θ), c(τ), and v(μ). Show that when g is the canonical link, your result in (c) coincides with (6.5.9).
(e) Suppose that Y_i has the Poisson, P(μ_i), distribution. Find the canonical link function, give h(y, τ), b(θ), c(τ), and v(μ), and show that when g is the canonical link, the result of (c) coincides with (6.5.9).

4. Show that the conditions A0-A6 hold for P = P_{β₀} ∈ P (where q₀ is assumed known), provided that:
(a) the convex support of the conditional distribution of Y_1 given Z_1 = z^(j) contains an open interval about μ_j for j = 1, ..., k;
(b) the linear span of {z^(1), ..., z^(k)} is R^p;
(c) P[Z_1 = z^(j)] > 0 for all j.
Hint: Show that if (a) holds, then the convex support of the conditional distribution of Σ_{j=1}^k λ_j Y_j z^(j) given Z_j = z^(j), j = 1, ..., k, contains an open ball about Σ_{j=1}^k λ_j μ_j z^(j) in R^p.

5. Show that for the Gaussian linear model with known variance σ₀², the deviance is D(y, μ₀) = |y - μ₀|²/σ₀².

Problems for Section 6.6

1. Suppose X_1, ..., X_n are i.i.d. as X ~ P with density f, and let ν(P) denote the unique median of P.
(a) Show that if f is symmetric about μ, then ν(P) = μ.
(b) Show that if P has a positive density f at ν(P), then the sample median X̃ satisfies

    √n(X̃ - ν(P)) → N(0, σ²(P)),  σ²(P) = 1/[4f²(ν(P))].

(c) Show that if f is the N(μ, σ²) density, then σ²(P) = πσ²/2 > σ² = Var_P(X_1) (so that σ²/σ²(P) = 2/π), whereas if f_λ(x) = (λ/2) exp{-λ|x - μ|}, λ > 0, then σ²(P) < σ².

2. Consider the Rao test for H : θ = θ₀ for the model P = {P_θ : θ ∈ Θ} for which A0-A6 hold. Suppose that the true P does not belong to P but that θ(P), defined by (6.6.3), equals θ₀.
(a) Show that if Var_P Dl(X_1, θ₀) is estimated by I(θ₀), then the Rao test does not in general have the correct asymptotic level.
(b) Show that if instead the estimate n⁻¹ Σ_{i=1}^n Dl(X_i, θ₀) Dl^T(X_i, θ₀) is used, the resulting test has asymptotic level α.

3. Show that the LR, Wald, and Rao tests are still asymptotically equivalent in the sense that if 2 log λ_n, W_n, and R_n are the corresponding test statistics, then, under H,

    W_n = 2 log λ_n + o_p(1),  R_n = W_n + o_p(1),

even if P does not belong to the model, provided θ(P) = θ₀.
Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under the parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation.

4. Establish the limiting result (6.6.11) of Example 6.6.3 by verifying the conditions of Theorem 6.2.1 under this model and verifying the formula (6.6.12).

5. Show that the standard Wald test for the two-sample scale problem of Example 6.6.4 is as given in (6.6.14).

6. Show that σ̃² given in (6.6.15) is a consistent estimate of 2 Var_P(X^(1)) in Example 6.6.4, so that, replacing σ̂² by σ̃² in (6.6.14) creates a valid asymptotic level α test.

7. Consider the linear model of Example 6.6.2 and the hypothesis H : β_{q+1} = β⁰_{q+1}, ..., β_p = β⁰_p. Show that the LR, Wald, and Rao tests have correct asymptotic level under the sole assumption that E_P(ε) = 0 and 0 < Var_P(ε) < ∞.

8. In the binary data regression model of Section 6.4.3, let π_i = s(z_i^T β), t ∈ R, where s(t) is the continuous distribution function of a random variable symmetric about 0.
(a) Show that π_i can be written in this form for both the probit and the logit models, and identify s in each case.
(b) Suppose that the covariates Z_i are i.i.d., that Z_1 is bounded with probability 1, and let β̂_L be the MLE for the logit model. Show that if the correct model has π_i given by s as above and β = β₀, then β̂_L is not a consistent estimate of β₀ unless s(t) is the logistic distribution function. But if β_L is defined as the solution of

    E[ Z_1 s(Z_1^T β₀) ] = Q(β),  Q(β) = E[ Z_1 Ȧ(Z_1^T β) ]  (p x 1),

then √n(β̂_L - β_L) has a limiting normal distribution with mean 0 and variance

    Q_1⁻¹(β_L) Var[ Z_1(Y_1 - Ȧ(Z_1^T β_L)) ] Q_1⁻¹(β_L)

where Q_1(β) = E[ Ä(Z_1^T β) Z_1 Z_1^T ] is p x p and necessarily nonsingular.

9. (Model Selection) Consider the classical Gaussian linear model (6.1.1)

    Y_i = μ_i + ε_i,  μ_i = z_i^T β,  i = 1, ..., n,

where the ε_i are i.i.d. N(0, σ²), σ² is known, and the z_i are d-dimensional vectors of covariate (factor) values. Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d - p don't matter, that is, β_{p+1} = ... = β_d = 0. Let β̂(p) = (β̂_1, ..., β̂_p, 0, ..., 0)^T be the LSE under this assumption and Ŷ_i(p) = z_i^T β̂(p) the corresponding fitted value. A natural goal to entertain is to obtain new values Y_1*, ..., Y_n* at z_1, ..., z_n and evaluate the performance of Ŷ(p) by the (average) expected prediction error

    EPE(p) = n⁻¹ Σ_{i=1}^n E(Y_i* - Ŷ_i(p))².

Here Y_1*, ..., Y_n* are independent of Y_1, ..., Y_n and Y_i* is distributed as Y_i, i = 1, ..., n. Let RSS(p) = Σ_{i=1}^n (Y_i - Ŷ_i(p))² be the residual sum of squares. Model selection consists in selecting p̂ to minimize an estimate of EPE(p) and then using Ŷ(p̂) as a predictor (Mallows, 1973).
(a) Show that

    EPE(p) = n⁻¹ E(RSS(p)) + 2σ²p/n.

(b) Show that

    n⁻¹ E(RSS(p)) = (1 - p/n)σ² + n⁻¹ Σ_{i=1}^n (μ_i - μ_i(p))²

where μ(p) = (μ_1(p), ..., μ_n(p))^T is the projection of μ on the linear span of the first p columns of the design matrix.
(c) Deduce that

    EPE-hat(p) = n⁻¹ RSS(p) + 2σ²p/n

is an unbiased estimate of EPE(p).
(d) Show that (a)-(c) continue to hold if we assume the Gauss-Markov model.
16). i=1 Derive the result for the canonical model.L .9 REFERENCES Cox. i 6.' . which are not multinomial. D.i2} such that the EPE in case (i) is smaller than in case (d) and vice versa. Use 0'2 = 1 and n = 10. 3rd 00.1 (1) From the L A. this makes no sense for the model we discussed in this section.". . 6. Heart Study after Dixon and Massey (1969).2 (1) See Problem 3.438 (e) Suppose p ~ Inference in the Multiparameter Case Chapter 6 2 and 11(Z) ~ Ii. MASSEY. The Analysis of Binary Data London: Methuen. Note for Section 6. ..  J . t12 and {Z/I. New York: McGrawHill. 1970. For instance. but it is reasonable. (b) The result depends only on the mean and covariance structure of the i=l"". ! .4.. )I'i). 1 I . The moral of the story is that the practicing statisticians should be on their guard! For more on this theme see Section 6.9 for a discussion of densities with heavy tails.8 NOTES Note for Section 6. ti. AND F. Hint: (a) Note that ". .n. A. Y. + Evaluate EPE for (i) ~i ~ "IZ. 1 n n L (I'. . DIXON. EPE(p) n R SS( p) = .) . Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good.z. To guard against such situations he argued that the test should be used in a twotailed fashion and that we should reject H both for large and for small values of X2 • Of course. .~I'. if we consider alternatives to H. 1 "'( i=1 . Give values of /31. and (ii) 'T}i = /31 Zil + f32zi2. W.. Note for Section 6. Introduction to Statistical Analysis.6. ~(p) n . .5. Y/" j1~. we might envision the possibility that an overzealous assistant of Mendel "cooked" the data. R.4 (1) R. 1969. LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of X2 as well as large ones (Problem 6.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.

GRAYBILL, F. A., An Introduction to Linear Statistical Models, Vol. I. New York: McGraw-Hill, 1961.

HABERMAN, S. J., The Analysis of Frequency Data. Chicago: University of Chicago Press, 1974.

HALD, A., Statistical Theory with Engineering Applications. New York: Wiley, 1952.

HUBER, P. J., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob., I, 221-233. Berkeley: Univ. of California Press, 1967.

KOENKER, R., AND V. D'OREY, "Computing regression quantiles," Applied Statistics, 36, 383-393 (1987).

LAPLACE, P. S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (1789). (Reprinted in Oeuvres Complètes, 11, 475-558. Paris: Gauthier-Villars.)

MALLOWS, C., "Some comments on Cp," Technometrics, 15, 661-675 (1973).

McCULLAGH, P., AND J. A. NELDER, Generalized Linear Models, 2nd ed. London: Chapman and Hall, 1989.

PORTNOY, S., AND R. KOENKER, "The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROBERTSON, T., F. T. WRIGHT, AND R. DYKSTRA, Order Restricted Statistical Inference. New York: Wiley, 1988.

SCHEFFÉ, H., The Analysis of Variance. New York: Wiley, 1959.

SCHERVISH, M. J., Theory of Statistics. New York: Springer, 1995.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Harvard University Press, 1986.

TIERNEY, L., R. E. KASS, AND J. B. KADANE, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need. The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics. Therefore, we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply. The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that to be called an experiment such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize. This
long-term relative frequency $n_A/n$, where $n_A$ is the number of times the possible outcome $A$ occurs in $n$ repetitions, is to many statisticians, including the authors, the operational interpretation of the mathematical concept of probability. Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, whether it is conceptually repeatable or not. In this sense, almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment." For a discussion of this approach and further references, the reader may wish to consult Savage (1954), Savage (1962), Raiffa and Schlaiffer (1961), Lindley (1965), de Groot (1970), and Berger (1985). We now turn to the mathematical abstraction of a random experiment, the probability model. A random experiment is described mathematically in terms of the following quantities.

A.1.1 The sample space is the set of all possible outcomes of a random experiment. We denote it by $\Omega$.

A.1.2 A sample point is any member of $\Omega$ and is typically denoted by $\omega$.

A.1.3 Subsets of $\Omega$ are called events. We denote events by $A$, $B$, $C$, and so on, or by a description of their members. If $\omega \in \Omega$, $\{\omega\}$ is called an elementary event. If $A$ contains more than one point, it is called a composite event. $\Omega$ is interpreted as the sure event and $\emptyset$, the null set, as the impossible event. The relation between the experiment and the model is given by the correspondence "$A$ occurs if and only if the actual outcome of the experiment is a member of $A$." The set operations we have mentioned have interpretations also. For example, the relation $A \subset B$ between sets considered as events means that the occurrence of $A$ implies the occurrence of $B$. In this section and throughout the book, we presume the reader to be familiar with elementary set theory and its notation at the level of Chapter 1 of Feller (1968) or Chapter 1 of Parzen (1960). We shall use the symbols $\cup$, $\cap$, $^c$, $-$, $\subset$ for union, intersection, complementation, set-theoretic difference, and inclusion, as is usual in elementary set theory. The complement of $A$ is denoted by $A^c$.

A.1.4 We will let $\mathcal{A}$ denote a class of subsets of $\Omega$ to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability $P$ to every subset of $\Omega$. $\mathcal{A}$ is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complementation (cf. Chung, 1974; Grimmett and Stirzaker, 1992; Loève, 1977).

A.1.5 A probability distribution or measure is a nonnegative function $P$ on $\mathcal{A}$ having the following properties:

(i) $P(\Omega) = 1$;

(ii) if $A_1, A_2, \dots$ are pairwise disjoint sets in $\mathcal{A}$, then
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$

Recall that $\bigcup_{i=1}^{\infty} A_i$ is just the collection of points that are in any one of the sets $A_i$, and that two sets are disjoint if they have no points in common.
A.1.6 The three objects $\Omega$, $\mathcal{A}$, and $P$ together describe a random experiment mathematically. We shall refer to the triple $(\Omega, \mathcal{A}, P)$ either as a probability model or identify the model with what it represents as a (random) experiment. For convenience, when we refer to events we shall automatically exclude those that are not members of $\mathcal{A}$.

References: Gnedenko (1967) Chapter I, Sections 1-3; Grimmett and Stirzaker (1992) Sections 1.1-1.3; Hoel, Port, and Stone (1971) Sections 1.1-1.3; Parzen (1960) Chapter 1, Sections 1-5; Pitman (1993) Sections 1.2 and 1.3.

A.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS

The following are consequences of the definition of $P$.

A.2.1 If $A \subset B$, then $P(B - A) = P(B) - P(A)$.

A.2.2 $P(A^c) = 1 - P(A)$; in particular, $P(\emptyset) = 0$.

A.2.3 If $A \subset B$, then $P(B) \ge P(A)$.

A.2.4 $0 \le P(A) \le 1$.

A.2.5 $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

A.2.6 For any events $A_1, A_2, \dots$,
$$P\Big(\bigcup_{n=1}^{\infty} A_n\Big) \le \sum_{n=1}^{\infty} P(A_n), \qquad P\Big(\bigcap_{i=1}^{n} A_i\Big) \ge 1 - \sum_{i=1}^{n} P(A_i^c) \quad \text{(Bonferroni's inequality).}$$

A.2.7 If $A_1 \subset A_2 \subset \dots \subset A_n \subset \dots$, then $P\big(\bigcup_{n=1}^{\infty} A_n\big) = \lim_{n \to \infty} P(A_n)$.

References: Gnedenko (1967) Chapter I, Section 8; Grimmett and Stirzaker (1992) Section 1.3; Hoel, Port, and Stone (1971) Sections 1.2 and 1.3; Parzen (1960) Chapter 1, Sections 4 and 5; Pitman (1993) Section 1.3.

A.3 DISCRETE PROBABILITY MODELS

A.3.1 A probability model is called discrete if $\Omega$ is finite or countably infinite and every subset of $\Omega$ is assigned a probability. That is, we can write $\Omega = \{\omega_1, \omega_2, \dots\}$ and $\mathcal{A}$ is the collection of all subsets of $\Omega$. In this case, by axiom (ii) of (A.1.5), we have for any event $A$,
$$P(A) = \sum_{\omega \in A} P(\{\omega\}). \tag{A.3.2}$$
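For computation, (A.3.2) translates directly into summing elementary-event probabilities. A minimal sketch, not part of the text, using a fair die as the discrete model (Python assumed; exact arithmetic via fractions):

```python
from fractions import Fraction

# A discrete model: probability of each sample point of a fair die.
p = {w: Fraction(1, 6) for w in range(1, 7)}

def prob(event):
    """P(A) = sum of P({w}) over w in A, per (A.3.2)."""
    return sum(p[w] for w in event)

assert prob({2, 4, 6}) == Fraction(1, 2)   # P(even)
assert prob(p.keys()) == 1                 # P(Omega) = 1, axiom (i)
```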
A.3.3 An important special case arises when $\Omega$ has a finite number of elements, say $N$, all of which are equally likely. Then $P(\{\omega\}) = 1/N$ for every $\omega \in \Omega$, and
$$P(A) = \frac{\text{number of elements in } A}{N}. \tag{A.3.3}$$

A.3.4 Suppose that $\omega_1, \dots, \omega_N$ are the members of some population (humans, machines, guinea pigs, flowers, etc.). Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another, selecting at random, is an experiment leading to the model of (A.3.3). Such selection can be carried out if $N$ is small by putting the "names" of the $\omega_i$ in a hopper, shaking well, and drawing. For large $N$, a random number table or computer can be used.

References: Gnedenko (1967) Chapter I, Sections 4 and 5; Parzen (1960) Chapter 1, Sections 6 and 7; Pitman (1993) Section 1.1.

A.4 CONDITIONAL PROBABILITY AND INDEPENDENCE

Given an event $B$ such that $P(B) > 0$ and any other event $A$, we define the conditional probability of $A$ given $B$, which we write $P(A \mid B)$, by
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \tag{A.4.1}$$
If $P(A)$ corresponds to the frequency with which $A$ occurs in a large number of repetitions of the experiment, then $P(A \mid B)$ corresponds to the frequency of occurrence of $A$ relative to the class of trials in which $B$ does occur. From a heuristic point of view, $P(A \mid B)$ is the chance we would assign to the event $A$ if we were told that $B$ has occurred.

In fact, for fixed $B$ as before, the function $P(\cdot \mid B)$ is a probability measure on $(\Omega, \mathcal{A})$, which is referred to as the conditional probability measure given $B$: if $A_1, A_2, \dots$ are (pairwise) disjoint events, then
$$P\Big(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\Big) = \sum_{i=1}^{\infty} P(A_i \mid B). \tag{A.4.2}$$
Transposition of the denominator in (A.4.1) gives the multiplication rule,
$$P(A \cap B) = P(B)\, P(A \mid B). \tag{A.4.3}$$
If $B_1, \dots, B_n$ are (pairwise) disjoint events of positive probability whose union is $\Omega$, the identity $A = \bigcup_{j=1}^{n} (A \cap B_j)$ and (A.4.3) yield
$$P(A) = \sum_{j=1}^{n} P(A \mid B_j)\, P(B_j). \tag{A.4.4}$$

If $P(A)$ is positive, we can combine (A.4.3) and (A.4.4) and obtain Bayes rule,
$$P(B_i \mid A) = \frac{P(A \mid B_i)\, P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j)\, P(B_j)}. \tag{A.4.5}$$

The conditional probability of $A$ given $B_1, \dots, B_n$ is written $P(A \mid B_1, \dots, B_n)$ and defined by
$$P(A \mid B_1, \dots, B_n) = P(A \mid B_1 \cap \dots \cap B_n) \tag{A.4.6}$$
whenever $P(B_1 \cap \dots \cap B_n) > 0$. Simple algebra leads to the multiplication rule,
$$P(B_1 \cap \dots \cap B_n) = P(B_1)\, P(B_2 \mid B_1)\, P(B_3 \mid B_1, B_2) \cdots P(B_n \mid B_1, \dots, B_{n-1}), \tag{A.4.7}$$
whenever $P(B_1 \cap \dots \cap B_{n-1}) > 0$.

Two events $A$ and $B$ are said to be independent if
$$P(A \cap B) = P(A)\, P(B). \tag{A.4.8}$$
If $P(B) > 0$, the relation (A.4.8) may be written
$$P(A \mid B) = P(A). \tag{A.4.9}$$
In other words, $A$ and $B$ are independent if knowledge of $B$ does not affect the probability of $A$.

The events $A_1, \dots, A_n$ are said to be independent if
$$P(A_{i_1} \cap \dots \cap A_{i_k}) = \prod_{j=1}^{k} P(A_{i_j}) \tag{A.4.10}$$
for any subset $\{i_1, \dots, i_k\}$ of the integers $\{1, \dots, n\}$. If all the $P(A_i)$ are positive, the relation (A.4.10) is equivalent to requiring that
$$P(A_j \mid A_{i_1}, \dots, A_{i_k}) = P(A_j) \tag{A.4.11}$$
for any $j$ and $\{i_1, \dots, i_k\}$ such that $j \notin \{i_1, \dots, i_k\}$.

References: Gnedenko (1967) Chapter I, Section 9; Grimmett and Stirzaker (1992) Section 1.4; Hoel, Port, and Stone (1971) Sections 1.4 and 1.5; Parzen (1960) Chapter 2, Sections 1-4, and Chapter 3; Pitman (1993) Section 1.4.
A.5 COMPOUND EXPERIMENTS

There is an intuitive notion of independent experiments. For example, if we toss a coin twice, the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. On the other hand, it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip, and if we do not replace the first chip drawn before the second draw, then the probability of a given chip in the second draw will depend on the outcome of the first draw. To be able to talk about independence and dependence of experiments, we introduce the notion of a compound experiment. Informally, a compound experiment is one made up of two or more component experiments. There are certain natural ways of defining sigma fields and probabilities for these experiments. These will be discussed in this section. The reader not interested in the formalities may skip to Section A.6, where examples of compound experiments are given.

A.5.1 Recall that if $A_1, \dots, A_n$ are sets, the Cartesian product $A_1 \times \dots \times A_n$ is by definition $\{(\omega_1, \dots, \omega_n) : \omega_i \in A_i,\ 1 \le i \le n\}$. If we are given $n$ experiments (probability models) $\mathcal{E}_1, \dots, \mathcal{E}_n$ with respective sample spaces $\Omega_1, \dots, \Omega_n$, then the sample space $\Omega$ of the $n$-stage compound experiment is by definition $\Omega_1 \times \dots \times \Omega_n$. The ($n$-stage) compound experiment consists in performing the component experiments $\mathcal{E}_1, \dots, \mathcal{E}_n$ and recording all $n$ outcomes. The interpretation of the sample space $\Omega$ is that $(\omega_1, \dots, \omega_n)$ is a sample point in $\Omega$ if and only if $\omega_1$ is the outcome of $\mathcal{E}_1$, $\omega_2$ is the outcome of $\mathcal{E}_2$, and so on. To say that $\mathcal{E}_i$ has had outcome $\omega_i^0$ corresponds to the occurrence of the compound event (in $\Omega$)
$$\Omega_1 \times \dots \times \Omega_{i-1} \times \{\omega_i^0\} \times \Omega_{i+1} \times \dots \times \Omega_n = \{(\omega_1, \dots, \omega_n) \in \Omega : \omega_i = \omega_i^0\}.$$
More generally, if $A_i \in \mathcal{A}_i$, the sigma field corresponding to $\mathcal{E}_i$, then $A_i$ corresponds to $\Omega_1 \times \dots \times \Omega_{i-1} \times A_i \times \Omega_{i+1} \times \dots \times \Omega_n$ in the compound experiment.

If we want to make the $\mathcal{E}_i$ independent, we should have all classes of events $A_1, \dots, A_n$ with $A_i \in \mathcal{A}_i$ independent; that is, if $P$ is the probability measure defined on the sigma field $\mathcal{A}$ of the compound experiment, we should have
$$P([A_1 \times \Omega_2 \times \dots \times \Omega_n] \cap [\Omega_1 \times A_2 \times \dots \times \Omega_n] \cap \dots) = P(A_1 \times \Omega_2 \times \dots \times \Omega_n)\, P(\Omega_1 \times A_2 \times \dots \times \Omega_n) \cdots \tag{A.5.2}$$
Since $A_1 \times \dots \times A_n = [A_1 \times \Omega_2 \times \dots \times \Omega_n] \cap \dots \cap [\Omega_1 \times \dots \times \Omega_{n-1} \times A_n]$, this leads to
$$P(A_1 \times \dots \times A_n) = P_1(A_1) \cdots P_n(A_n). \tag{A.5.3}$$
If we are given probabilities $P_1$ on $(\Omega_1, \mathcal{A}_1), \dots, P_n$ on $(\Omega_n, \mathcal{A}_n)$, then (A.5.3) defines $P$ for $A_1 \times \dots \times A_n$ with $A_i \in \mathcal{A}_i$. It may be shown (Billingsley, 1995; Chung, 1974; Loève, 1977) that if $P$ is defined by (A.5.3) for events $A_1 \times \dots \times A_n$, it can be uniquely extended to the sigma field $\mathcal{A}$ specified in note (1) at the end of this appendix. We shall speak of independent experiments $\mathcal{E}_1, \dots, \mathcal{E}_n$ if the $n$-stage compound experiment has its probability structure specified by (A.5.3). In the discrete case (A.5.3) holds provided that
$$P(\{(\omega_1, \dots, \omega_n)\}) = P_1(\{\omega_1\}) \cdots P_n(\{\omega_n\}) \quad \text{for all } \omega_i \in \Omega_i,\ 1 \le i \le n. \tag{A.5.4}$$
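A small sanity check of the discrete product rule (A.5.4) for two independent component experiments; the toy coin and die probabilities below are made up for illustration (Python):

```python
from itertools import product

# Component experiments: a biased coin and a fair die.
P1 = {"H": 0.6, "T": 0.4}
P2 = {k: 1 / 6 for k in range(1, 7)}

# Compound discrete model via (A.5.4): P({(w1, w2)}) = P1({w1}) P2({w2}).
P = {(w1, w2): P1[w1] * P2[w2] for w1, w2 in product(P1, P2)}

# (A.5.3) for the rectangle A1 x A2 with A1 = {H}, A2 = {5, 6}:
lhs = sum(P[(w1, w2)] for (w1, w2) in P if w1 == "H" and w2 in {5, 6})
assert abs(lhs - P1["H"] * (P2[5] + P2[6])) < 1e-12
```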
Specifying $P$ when the $\mathcal{E}_i$ are dependent is more complicated. In the discrete case we know $P$ once we have specified $P(\{(\omega_1, \dots, \omega_n)\})$ for each $(\omega_1, \dots, \omega_n)$ with $\omega_i \in \Omega_i$. By the multiplication rule (A.4.7) we have, in the discrete case,
$$P(\{(\omega_1, \dots, \omega_n)\}) = P(\mathcal{E}_1 \text{ has outcome } \omega_1)\, P(\mathcal{E}_2 \text{ has outcome } \omega_2 \mid \mathcal{E}_1 \text{ has outcome } \omega_1) \cdots P(\mathcal{E}_n \text{ has outcome } \omega_n \mid \mathcal{E}_1 \text{ has outcome } \omega_1, \dots, \mathcal{E}_{n-1} \text{ has outcome } \omega_{n-1}).$$
The probability structure is determined by these conditional probabilities, and conversely. Other examples will appear naturally in what follows.

References: Grimmett and Stirzaker (1992) Section 1.6; Hoel, Port, and Stone (1971) Section 1.5; Parzen (1960) Chapter 3.

A.6 BERNOULLI AND MULTINOMIAL TRIALS, SAMPLING WITH AND WITHOUT REPLACEMENT

A.6.1 Suppose that we have an experiment with only two possible outcomes, which we shall denote by $S$ (success) and $F$ (failure). If we assign $P(\{S\}) = p$, we shall refer to such an experiment as a Bernoulli trial with probability of success $p$. The simplest example of such a Bernoulli trial is tossing a coin with probability $p$ of landing heads (success). If we repeat such an experiment $n$ times independently, we say we have performed $n$ Bernoulli trials with success probability $p$. If $\Omega$ is the sample space of the compound experiment, any point $\omega \in \Omega$ is an $n$-dimensional vector of $S$'s and $F$'s and, in the discrete case,
$$P(\{\omega\}) = p^{k(\omega)} (1-p)^{n-k(\omega)}, \tag{A.6.2}$$
where $k(\omega)$ is the number of $S$'s appearing in $\omega$. If $A_k$ is the event [exactly $k$ $S$'s occur], then
$$P(A_k) = \binom{n}{k} p^k (1-p)^{n-k}, \tag{A.6.3}$$
where
$$\binom{n}{k} = \frac{n!}{k!(n-k)!}.$$
The formula (A.6.3) is known as the binomial probability.

A.6.4 More generally, if an experiment has $q$ possible outcomes $\omega_1, \dots, \omega_q$ and $P(\{\omega_i\}) = p_i$, we refer to such an experiment as a multinomial trial with probabilities $p_1, \dots, p_q$. If the experiment is performed $n$ times independently, the compound experiment is called $n$ multinomial trials with probabilities $p_1, \dots, p_q$. If $\Omega$ is the sample space of this experiment and $\omega \in \Omega$, then
$$P(\{\omega\}) = p_1^{k_1(\omega)} \cdots p_q^{k_q(\omega)}, \tag{A.6.5}$$
where $k_i(\omega)$ is the number of times $\omega_i$ appears in the sequence $\omega$. If $A_{k_1, \dots, k_q}$ is the event (exactly $k_1$ $\omega_1$'s are observed, exactly $k_2$ $\omega_2$'s are observed, ..., exactly $k_q$ $\omega_q$'s are observed), then
$$P(A_{k_1, \dots, k_q}) = \frac{n!}{k_1! \cdots k_q!}\, p_1^{k_1} \cdots p_q^{k_q}, \tag{A.6.6}$$
where the $k_i$ are natural numbers adding up to $n$.

A.6.7 If we perform an experiment given by $(\Omega, \mathcal{A}, P)$ independently $n$ times, we shall sometimes refer to the outcome of the compound experiment as a sample of size $n$ from the population given by $(\Omega, \mathcal{A}, P)$. When $\Omega$ is finite, the term with replacement is added to distinguish this situation from that described in (A.6.8).
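A direct transcription of (A.6.3), checked against a simulation of $n$ Bernoulli trials; as the long-run frequency interpretation suggests, the agreement is only approximate for a finite number of repetitions (Python; the trial counts below are arbitrary):

```python
import random
from math import comb

def binom_prob(n, k, p):
    """P(A_k) = C(n, k) p^k (1-p)^(n-k), formula (A.6.3)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p, trials = 10, 0.3, 100_000
random.seed(1)
counts = [0] * (n + 1)
for _ in range(trials):
    k = sum(random.random() < p for _ in range(n))  # n Bernoulli trials
    counts[k] += 1

for k in (0, 3, 6):
    print(k, counts[k] / trials, binom_prob(n, k, p))  # close for large `trials`
```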
. That is. } are equivalent.. However.1If (al. 1 <i <k}anopenkrectallgle. dx n • is called a density function.3. . Recall that the integral on the right of (A. xd'.S) is by definition r JR' 1A(X)P(x)dx where 1A(x) ~ 1 if x E A..bk ) are k open intervals.7. It may be shown that a function P so defined satisfies (AlA). Riemann integrals are adequate.Xn .bd x '" (ak. Integrals should be interpreted in the sense of Lebesgue.. An important special case of (A.S) is given by (A.Section A.EA pix. . (A7. . ...7. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely. We will write R for R1 and f3 for f31.8 are usually called absolutely continuous.b k ) = {(XI"". P(A) is the volume of the "cylinder" with base A and height p(x) at x. Thefrequency function p of a discrete distribution is defined on Rk by n pix) = P({x»).bd.. A.6 A nonnegative function p on R k • which is integrable and which has r p(x)dx = 1. Geometrically.7. (A7A) Conversely. .I) because the study of this model and that of the model that has = {Xl.7. where ( )' denotes transpose..5) A. JR' where dx denotes dX1 .7 Probabilities on Euclidean Space 449 We shall use the notation R k of k~dimensional Euclidean space and denote members of Rk by symbols such as x or (Xl. . X n . Any subset of R k we might conceivably be interested in turns out to be a member of f3k. and 0 otherwise. This definition is consistent with (A.2 The Borelfield in R k . P defined by A.. .7. A.3 A discrete (probability) distribution on R k is a probability measure P such that L:~ I P( {Xi}) = 1 for some sequence of points {xd in R k .... any nonnegative function pon R k vanishing except on a sequence {Xl. A. which we denote by f3k.7.7. ..7. .7 A continuous probability distn'bution on Rk is a probability P that is defined by the relation P(A) = L p(x)d x =1 (A7.).S) for some density function P and all events A. x A. for practical purposes.7.. is defined to be the smallest sigma field having all open k rectangles as members. only an Xi can occur as an outcome of the experiment. } of vectors and that satisfies L:~ I P(Xi) = 1 defines a unique discrete probability distribution by the relation P(A) = L x.Xk) :ai <Xi <bi. .(ak.9) . we shall call the set (aJ.
A.7 PROBABILITIES ON EUCLIDEAN SPACE

Random experiments whose outcomes are real numbers play a central role in theory and practice. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space. We shall use the notation $R^k$ for $k$-dimensional Euclidean space and denote members of $R^k$ by symbols such as $\mathbf{x}$ or $(x_1, \dots, x_k)^T$, where $(\ )^T$ denotes transpose. We will write $R$ for $R^1$ and $\mathcal{B}$ for $\mathcal{B}^1$.

A.7.1 If $(a_1, b_1), \dots, (a_k, b_k)$ are $k$ open intervals, we shall call the set $(a_1, b_1) \times \dots \times (a_k, b_k) = \{(x_1, \dots, x_k) : a_i < x_i < b_i,\ 1 \le i \le k\}$ an open $k$-rectangle.

A.7.2 The Borel field in $R^k$, which we denote by $\mathcal{B}^k$, is defined to be the smallest sigma field having all open $k$-rectangles as members. Any subset of $R^k$ we might conceivably be interested in turns out to be a member of $\mathcal{B}^k$.

A.7.3 A discrete (probability) distribution on $R^k$ is a probability measure $P$ such that $\sum_{i=1}^{\infty} P(\{\mathbf{x}_i\}) = 1$ for some sequence of points $\{\mathbf{x}_i\}$ in $R^k$. The frequency function $p$ of a discrete distribution is defined on $R^k$ by
$$p(\mathbf{x}) = P(\{\mathbf{x}\}). \tag{A.7.4}$$
This definition is consistent with (A.3.1) because the study of this model and that of the model that has $\Omega = \{\mathbf{x}_1, \mathbf{x}_2, \dots\}$ are equivalent; only an $\mathbf{x}_i$ can occur as an outcome of the experiment. Conversely, any nonnegative function $p$ on $R^k$ vanishing except on a sequence $\{\mathbf{x}_1, \mathbf{x}_2, \dots\}$ of vectors and satisfying $\sum_{i=1}^{\infty} p(\mathbf{x}_i) = 1$ defines a unique discrete probability distribution by the relation
$$P(A) = \sum_{\mathbf{x}_i \in A} p(\mathbf{x}_i). \tag{A.7.5}$$

A.7.6 A nonnegative function $p$ on $R^k$, which is integrable and which has
$$\int_{R^k} p(\mathbf{x})\, d\mathbf{x} = 1,$$
is called a density function.

A.7.7 A continuous probability distribution on $R^k$ is a probability $P$ that is defined by the relation
$$P(A) = \int_A p(\mathbf{x})\, d\mathbf{x} \tag{A.7.8}$$
for some density function $p$ and all events $A$, where $d\mathbf{x}$ denotes $dx_1 \cdots dx_k$. Integrals should be interpreted in the sense of Lebesgue; however, for practical purposes, Riemann integrals are adequate. Recall that the integral on the right of (A.7.8) is by definition $\int_{R^k} 1_A(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$, where $1_A(\mathbf{x}) = 1$ if $\mathbf{x} \in A$, and $0$ otherwise. Geometrically, $P(A)$ is the volume of the "cylinder" with base $A$ and height $p(\mathbf{x})$ at $\mathbf{x}$. Probabilities $P$ defined by (A.7.8) are usually called absolutely continuous. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely.

It turns out that a continuous probability distribution determines the density that generates it "uniquely."(1) Although in a continuous model $P(\{x\}) = 0$ for every $x$, the density function has an operational interpretation close to that of the frequency function. For instance, if $p$ is a continuous density on $R$, $x_0$ and $x_1$ are in $R$, and $h$ is close to $0$, then by the mean value theorem,
$$P([x_0 - h, x_0 + h]) \approx 2hp(x_0) \quad \text{and} \quad \frac{P([x_0 - h, x_0 + h])}{P([x_1 - h, x_1 + h])} \approx \frac{p(x_0)}{p(x_1)}. \tag{A.7.10}$$
The ratio $p(x_0)/p(x_1)$ can thus be thought of as measuring approximately how much more or less likely we are to obtain an outcome in a neighborhood of $x_0$ than one in a neighborhood of $x_1$.

A.7.11 The distribution function (d.f.) $F$ is defined by
$$F(x_1, \dots, x_k) = P((-\infty, x_1] \times \dots \times (-\infty, x_k]). \tag{A.7.12}$$
The d.f. defines $P$ in the sense that if $P$ and $Q$ are two probabilities with the same d.f., then $P = Q$. When $k = 1$, $F$ is a function of a real variable characterized by the following properties:
$$x \le y \Rightarrow F(x) \le F(y) \quad \text{(monotone)} \tag{A.7.13}$$
$$x_n \downarrow x \Rightarrow F(x_n) \to F(x) \quad \text{(continuous from the right)} \tag{A.7.14}$$
$$\lim_{x \to \infty} F(x) = 1, \qquad \lim_{x \to -\infty} F(x) = 0. \tag{A.7.15}$$
We always have
$$F(x) - F(x - 0)^{(2)} = P(\{x\}). \tag{A.7.16}$$
Thus, $F$ is continuous at $x$ if and only if $P(\{x\}) = 0$. It may be shown that any function $F$ satisfying (A.7.13)-(A.7.15) defines a unique $P$ on the real line.

References: Gnedenko (1967) Chapter 4, Sections 21 and 22; Hoel, Port, and Stone (1971) Chapters 3 and 5; Parzen (1960) Chapter 4, Sections 1-4; Pitman (1993) Sections 3.4, 4.1, and 4.5.
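For $k = 1$, (A.7.8) says the d.f. is the integral of the density. A quick numerical check for a made-up density $p(x) = 2x$ on $(0, 1)$, whose d.f. is $F(x) = x^2$ (Python; the midpoint-rule helper `F` is an illustration, not a library routine):

```python
def p(x):
    return 2 * x if 0 < x < 1 else 0.0   # a density: integrates to 1

def F(x, steps=10_000):
    """Riemann (midpoint) approximation of F(x) = integral of p over (-inf, x]."""
    lo, hi = 0.0, min(max(x, 0.0), 1.0)
    h = (hi - lo) / steps
    return sum(p(lo + (i + 0.5) * h) for i in range(steps)) * h

assert abs(F(0.5) - 0.25) < 1e-6   # matches F(x) = x^2
assert abs(F(1.0) - 1.0) < 1e-6    # total mass 1
```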
A.8 RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS

Although sample spaces can be very diverse, the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. For example, we measure the weight of pigs drawn at random from a population, the yield per acre of a field of wheat in a given year, the concentration of a certain pollutant in the atmosphere, the time to breakdown and length of repair time for a randomly chosen machine, and so on. In the probability model, these quantities will correspond to random variables and vectors.

A.8.1 A random variable $X$ is a function from $\Omega$ to $R$ such that the set $\{\omega : X(\omega) \in B\} = X^{-1}(B)$ is in $\mathcal{A}$ for every $B \in \mathcal{B}$.(1)

A.8.2 A random vector $\mathbf{X} = (X_1, \dots, X_k)^T$ is a $k$-tuple of random variables, or equivalently a function from $\Omega$ to $R^k$ such that the set $\{\omega : \mathbf{X}(\omega) \in B\} = \mathbf{X}^{-1}(B)$ is in $\mathcal{A}$ for every $B \in \mathcal{B}^k$.(1) For $k = 1$, random vectors are just random variables.

The probability distribution of a random vector $\mathbf{X}$ is, by definition, the probability measure $P_{\mathbf{X}}$ in the model $(R^k, \mathcal{B}^k, P_{\mathbf{X}})$ given by
$$P_{\mathbf{X}}(B) = P[\mathbf{X} \in B]. \tag{A.8.3}$$
The event $\mathbf{X}^{-1}(B)$ will usually be written $[\mathbf{X} \in B]$, and $P([\mathbf{X} \in B])$ will be written $P[\mathbf{X} \in B]$.

A.8.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete. Similarly, we will refer to the frequency function, density, d.f., and so on of a random vector when we are, in fact, referring to those features of its probability distribution. The subscript $\mathbf{X}$ or $X$ will be used for densities, d.f.'s, and so on to indicate which vector or variable they correspond to, unless the reference is clear from the context, in which case it will be omitted. The probability of any event that is expressible purely in terms of $\mathbf{X}$ can be calculated if we know only the probability distribution of $\mathbf{X}$. In the discrete case this means we need only know the frequency function, and in the continuous case the density. Thus, when we are interested in particular random variables or vectors, we will describe them purely in terms of their probability distributions, without any further specification of the underlying sample space on which they are defined.

A.8.5 The study of real- or vector-valued functions of a random vector $\mathbf{X}$ is central in the theory of probability and of statistics. Here is the formal definition of such transformations. Let $g$ be any function from $R^k$ to $R^m$, $k, m \ge 1$, such that(2) $g^{-1}(B) = \{\mathbf{y} \in R^k : g(\mathbf{y}) \in B\}$ is in $\mathcal{B}^k$ for every $B \in \mathcal{B}^m$.
Then the random transformation $g(\mathbf{X})$ is defined by
$$g(\mathbf{X})(\omega) = g(\mathbf{X}(\omega)). \tag{A.8.6}$$
An example of a transformation often used in statistics is $g = (g_1, g_2)^T$ with $g_1(\mathbf{X}) = k^{-1} \sum_{i=1}^{k} X_i = \bar{X}$ and $g_2(\mathbf{X}) = k^{-1} \sum_{i=1}^{k} (X_i - \bar{X})^2$. Another common example is $g(X, Y) = (\min\{X, Y\}, \max\{X, Y\})$.

The probability distribution of $g(\mathbf{X})$ is completely determined by that of $\mathbf{X}$ through
$$P[g(\mathbf{X}) \in B] = P[\mathbf{X} \in g^{-1}(B)]. \tag{A.8.7}$$
If $\mathbf{X}$ is discrete with frequency function $p_{\mathbf{X}}$, then $g(\mathbf{X})$ is discrete and has frequency function
$$p_{g(\mathbf{X})}(\mathbf{t}) = \sum_{\{\mathbf{x} : g(\mathbf{x}) = \mathbf{t}\}} p_{\mathbf{X}}(\mathbf{x}). \tag{A.8.8}$$
Suppose that $X$ is continuous with density $p_X$ and $g$ is real-valued and one-to-one(3) on an open set $S$ such that $P[X \in S] = 1$. Furthermore, assume that the derivative $g'$ of $g$ exists and does not vanish on $S$. Then $g(X)$ is continuous with density given by
$$p_{g(X)}(t) = \frac{p_X(g^{-1}(t))}{|g'(g^{-1}(t))|} \tag{A.8.9}$$
for $t \in g(S)$, and $0$ otherwise. This is called the change of variable formula. If $g(X) = aX + b$, $a \ne 0$, then
$$p_{g(X)}(t) = \frac{1}{|a|}\, p_X\Big(\frac{t - b}{a}\Big). \tag{A.8.10}$$

These notions generalize to the case $\mathbf{Z} = (\mathbf{X}, \mathbf{Y})$, a random vector obtained by putting two random vectors together. If $(X, Y)^T$ is a discrete random vector with frequency function $p_{(X,Y)}(x, y)$, then the frequency function of $X$, known as the marginal frequency function, is given by(4)
$$p_X(x) = \sum_{y} p_{(X,Y)}(x, y). \tag{A.8.11}$$
Similarly, if $(X, Y)^T$ is continuous with density $p_{(X,Y)}(x, y)$, it may be shown (as a consequence of (A.7.8)) that $X$ is continuous with (marginal) density given by(5)
$$p_X(x) = \int_{-\infty}^{\infty} p_{(X,Y)}(x, y)\, dy. \tag{A.8.12}$$
The (marginal) frequency function or density of $\mathbf{X}$ is found as in (A.8.11) and (A.8.12) by summing or integrating out over $\mathbf{y}$ in $p_{(\mathbf{X}, \mathbf{Y})}$. From (A.8.11) and (A.8.12) it follows that the distribution of $\mathbf{X}$ is determined by that of $(\mathbf{X}, \mathbf{Y})$.
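A simulation check of the change of variable formula (A.8.9) for $X$ uniform on $(0, 1)$ and $g(x) = x^2$: (A.8.9) gives $p_{g(X)}(t) = 1/(2\sqrt{t})$ on $(0, 1)$, whose integral over $(0, t]$ is $\sqrt{t}$ (Python; sample size and seed are arbitrary):

```python
import random
from math import sqrt

random.seed(3)
n = 200_000
t = 0.25

# Empirical P[g(X) <= t] for g(x) = x^2, X uniform on (0, 1).
emp = sum(random.random() ** 2 <= t for _ in range(n)) / n

# The density from (A.8.9) integrates to sqrt(t) over (0, t];
# compare with the simulation.
print(emp, sqrt(t))   # both roughly 0.5
```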
" X n ) is continuous.1 Two random variables X I and X 2 are said to be independent if and only if for sets A and B in E. . .7.9.6..4 Parzen (1960) Chapter 7.. X n with X (XI>"" Xn)' .9.. it is common in statistics to work with continuous distributions.9. Then the random variables Xl.7). . the events [XI E AI and IX. .9...13 A convention: We shan write X = Y if the probability of the event IX fe YI is O.3. 6. either ofthe (A. . the events [X I E All.S.. Sections 2124 Grimmett and Stirzaker (1992) Section 4. Theorem A. . the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A.9 Independence of Random Variables and Vectors 453 In practice.9. Suppose X (Xl. if (Xl. whatever be g and h. .4) (A. and Stone (1971) Sections 3..1.2. then X = (Xl.6IftheXi are all continuous and independent. if X and Yare independent. S... and only if. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical modeL Or else. X 2 ) and (Yt> Y2 ) are independent. A. (X"XIX.) and Y" and so on. A. For example.[Xn E An] are independent.S. A.7 Hoel. E BI are independent. . Sections 15. so are Xl + X2 and YI Y2 . 1 X n are independent if.9.An in S.4 A. all random variables are discrete because there is no instrument that can measure with perfect accuracy.Section A. .Xn are said to be (mutually) independent if and only if for any sets AI. 9 Pitman (1993) Section 4. so are g(X) and h(Y)..7 The preceding equivalences are valid for random vectors XlJ .1.l5 and B. The justification for this may be theoretical or pragmatic. To generalize these definitions to random vectors Xl. Nevertheless. Port. which may be easier to deal with... 5.9 INDEPENDENCE OF RANDOM VARIABLES AND VECTORS A.9.5) = following two conditions hold: A. References Gnedenko (1967) Chapter 4. " X n ) is either a discrete or continuous random vector.. ' .Xn (not necessarily of the same dimensionality) we need only use the events [Xi E Ail where Ai is a set in the range of Xi' A.9.3 By (A..2 The random variables X I. . .
A.9.8 More generally, if $\mathbf{X}_1, \dots, \mathbf{X}_n$ are independent identically distributed $k$-dimensional random vectors with d.f. $F_{\mathbf{X}}$ or density (frequency function) $p_{\mathbf{X}}$, then $\mathbf{X}_1, \dots, \mathbf{X}_n$ is called a random sample of size $n$ from a population with d.f. $F_{\mathbf{X}}$ or density (frequency function) $p_{\mathbf{X}}$. In statistics, such a random sample is often obtained by selecting $n$ members at random in the sense of (A.3.4) from a population and measuring $k$ characteristics on each member.

A.9.9 If we perform $n$ Bernoulli trials with probability of success $p$ and we let $X_i$ be the indicator of the event [success on the $i$th trial], then the $X_i$ form a sample from a distribution that assigns probability $p$ to $1$ and $(1-p)$ to $0$. Such samples will be referred to as the indicators of $n$ Bernoulli trials with probability of success $p$.

References: Gnedenko (1967) Chapter 4, Sections 6 and 7; Grimmett and Stirzaker (1992) Sections 3.2 and 4.2; Hoel, Port, and Stone (1971) Section 3.4; Parzen (1960) Chapter 7; Pitman (1993) Sections 2.5 and 5.3.

A.10 THE EXPECTATION OF A RANDOM VARIABLE

Let $X$ be the height of an individual sampled at random from a finite population. Then a reasonable measure of the center of the distribution of $X$ is the average height of an individual in the given population. If $x_1, \dots, x_q$ are the only heights present in the population, it follows that this average is given by $\sum_{i=1}^{q} x_i P[X = x_i]$, where $P[X = x_i]$ is just the proportion of individuals of height $x_i$ in the population. The same quantity arises (approximately) if we use the long-run frequency interpretation of probability and calculate the average height of the individuals in a large sample from the population in question. In line with these ideas we develop the general concept of expectation as follows.

If $X$ is a nonnegative, discrete random variable with possible values $\{x_1, x_2, \dots\}$, we define the expectation or mean of $X$, written $E(X)$, by
$$E(X) = \sum_{i=1}^{\infty} x_i\, p_X(x_i). \tag{A.10.1}$$
(Infinity is a possible value of $E(X)$. Take, for instance, $p_X(i) = \frac{1}{i(i+1)}$, $i = 1, 2, \dots$.)

If $A$ is any event, we define the random variable $1_A$, the indicator of the event $A$, by
$$1_A(\omega) = 1 \text{ if } \omega \in A, \qquad 1_A(\omega) = 0 \text{ otherwise.}$$
9.p.c. Otherwise. . Here are some properties of the expectation that hold when X is discrete.rn". 00 455 or L."5'.) < 00.'"":. . I 0.IO..... Otherwise. Those familiar with Lebesgue integration will realize that this leads to E(X) = 1: xpx(x)dx whenever (A.'_""o"_of_'""R_'_"d"o:.3) = IA (cf (A. 10. 10. If X is a constant.I). If X is a continuous random variable.. Jo= xpx(x)dx or A. we leave E(X) undefined.6) Taking g(x) = L:~ 1 O:'iXi we obtain the fundamental relationship (A. an are constants and E(!Xil) < 00.J:. then E(X) = P(A).5) As a consequence of this result. and if E(lg(X) I) < 00. i = 1.9)).. If X (A. we define E(X) unambiguously by (A."_"_b_.7) if Qt.ES( x.IO.lO. .9) as the definition of the expectation or mean of X f~oo xpx(x)dx is finite. 10.)Px (. then E(X) = c. if 9 is a realvalued function on Rn.tO A random variable X is said to be integrable if E(IXI) < 00.. . E(Y) are defined.v') = c for all w.8From (A.V. . (A.. . It may be shown that if X is a continuous k~dimensional random vector and g(X) is any random variable such that JR' { Ig(x)IPx(x)dx < 00.l"O_T_h''CEx. lOA) If X is an ndimensional random vector. it is natural to attempt a definition of the expectation via approximation from the discrete case. then it may be shown that 00 E(g(X)) = Lg(Xi)PX(Xi) i=l (A. 10."_0_"_A". 10. E(X) is left undefined. . X(t. n. A.7) it follows that if X < Y and E(X). we have 00 E(IXI) ~ L i=l IXiIPx(Xi). then E(X) < E(Y). (A.
In general, it is possible to define the expectation of a random variable using discrete approximations. The interested reader may consult an advanced text such as Chung (1974). The formulae (A.10.5) and (A.10.6) are both sometimes written as
$$E(g(\mathbf{X})) = \int_{R^k} g(\mathbf{x})\, dF(\mathbf{x}) \quad \text{or} \quad \int_{R^k} g(\mathbf{x})\, dP(\mathbf{x}), \tag{A.10.9}$$
where $F$ denotes the distribution function of $\mathbf{X}$ and $P$ is the probability distribution of $\mathbf{X}$ defined by (A.8.3). A convenient notation is $dP(\mathbf{x}) = p(\mathbf{x})\, d\mu(\mathbf{x})$, which means
$$\int g(\mathbf{x})\, dP(\mathbf{x}) = \sum_{i} g(\mathbf{x}_i)\, p(\mathbf{x}_i) \quad \text{in the discrete case,}$$
$$\int g(\mathbf{x})\, dP(\mathbf{x}) = \int g(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \quad \text{in the continuous case.}$$
We refer to $\mu = \mu_P$ as the dominating measure for $P$. In the discrete case $\mu$ assigns weight one to each of the points in $\{\mathbf{x} : p(\mathbf{x}) > 0\}$ and is called counting measure. In the continuous case $d\mu(\mathbf{x}) = d\mathbf{x}$ and $\mu$ is called Lebesgue measure. We will often refer to $p(\mathbf{x})$ as the density of $\mathbf{X}$ in the discrete case as well as the continuous case.

References: Chung (1974) Chapter 3; Gnedenko (1967) Chapter 5, Section 26; Grimmett and Stirzaker (1992) Sections 3.3 and 4.3; Hoel, Port, and Stone (1971) Sections 4.1 and 4.2; Parzen (1960) Chapters 5 and 8; Pitman (1993) Sections 3.3 and 3.4.

A.11 MOMENTS

A.11.1 If $k$ is any natural number and $X$ is a random variable, the $k$th moment of $X$ is defined to be the expectation of $X^k$. We assume that all moments written here exist. By (A.10.5) and (A.10.6),
$$E(X^k) = \sum_{x} x^k\, p_X(x) \quad \text{if } X \text{ is discrete,}$$
$$E(X^k) = \int_{-\infty}^{\infty} x^k\, p_X(x)\, dx \quad \text{if } X \text{ is continuous.} \tag{A.11.2}$$
In general, the moments depend on the distribution of $X$ only.
A.11.3 The distribution of a random variable is typically uniquely specified by its moments. This is the case, for example, if the random variable possesses a moment generating function (cf. Section A.12).

A.11.4 The $k$th central moment of $X$ is by definition $E[(X - E(X))^k]$ and is denoted by $\mu_k$.

A.11.5 The second central moment is called the variance of $X$ and will be written $\operatorname{Var} X$. The nonnegative square root of $\operatorname{Var} X$ is called the standard deviation of $X$. The standard deviation measures the spread of the distribution of $X$ about its expectation. It is also called a measure of scale. Another measure of the same type is $E(|X - E(X)|)$, which is often referred to as the mean deviation. The variance of $X$ is finite if and only if the second moment of $X$ is finite (cf. (A.11.14)).

A.11.6 If $a$ and $b$ are constants, then by (A.10.7),
$$\operatorname{Var}(aX + b) = a^2 \operatorname{Var} X. \tag{A.11.7}$$

A.11.8 If $\operatorname{Var} X = 0$, then $X = E(X)$ (a constant).

A.11.9 If $E(X^2) = 0$, then $X = 0$.

If $X$ is any random variable with well-defined (finite) mean and variance, the standardized version or $Z$-score of $X$ is the random variable $Z = (X - E(X))/\sqrt{\operatorname{Var} X}$. By (A.10.7) and (A.11.7), $E(Z) = 0$ and $\operatorname{Var} Z = 1$.

A.11.10 The third and fourth central moments are used in the coefficient of skewness $\gamma_1$ and the kurtosis $\gamma_2$, which are defined by
$$\gamma_1 = \frac{\mu_3}{\sigma^3}, \qquad \gamma_2 = \frac{\mu_4}{\sigma^4} - 3,$$
where $\sigma^2 = \operatorname{Var} X$. If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $\gamma_1 = \gamma_2 = 0$. If $Y = a + bX$ with $b > 0$, then the coefficient of skewness and the kurtosis of $Y$ are the same as those of $X$. These descriptive measures are useful in comparing the shapes of various frequently used densities. See also Section A.12, where $\gamma_1$ and $\gamma_2$ are expressed in terms of cumulants.
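These definitions are easy to check numerically. A sketch computing $\operatorname{Var} X$, the $Z$-score properties, and $\gamma_1$, $\gamma_2$ for a small made-up discrete distribution (Python; the probabilities are hypothetical):

```python
# A made-up discrete distribution on a few points.
p = {0: 0.5, 1: 0.3, 3: 0.2}

def ev(g):
    return sum(g(x) * px for x, px in p.items())

mu = ev(lambda x: x)
var = ev(lambda x: (x - mu) ** 2)                 # second central moment
sd = var ** 0.5
gamma1 = ev(lambda x: (x - mu) ** 3) / sd**3      # skewness
gamma2 = ev(lambda x: (x - mu) ** 4) / sd**4 - 3  # kurtosis

# The standardized version Z = (X - mu)/sd has mean 0 and variance 1:
assert abs(ev(lambda x: (x - mu) / sd)) < 1e-12
assert abs(ev(lambda x: ((x - mu) / sd) ** 2) - 1) < 1e-12
```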
A.11.11 It is possible to generalize the notion of moments to random vectors. For simplicity we consider the case $k = 2$. If $X_1$ and $X_2$ are random variables and $i$, $j$ are natural numbers, then the product moment of order $(i, j)$ of $X_1$ and $X_2$ is, by definition, $E(X_1^i X_2^j)$. The central product moment of order $(i, j)$ of $X_1$ and $X_2$ is, again by definition, $E[(X_1 - E(X_1))^i (X_2 - E(X_2))^j]$. The central product moment of order $(1, 1)$ is called the covariance of $X_1$ and $X_2$ and is written $\operatorname{Cov}(X_1, X_2)$. The covariance is defined whenever $X_1$ and $X_2$ have finite variances, and in that case, by expanding the product $(X_1 - E(X_1))(X_2 - E(X_2))$ and using (A.10.3) and (A.10.7), we obtain the relations
$$\operatorname{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2) \tag{A.11.12}$$
and
$$\operatorname{Cov}(aX_1 + bX_2,\, cX_3 + dX_4) = ac \operatorname{Cov}(X_1, X_3) + bc \operatorname{Cov}(X_2, X_3) + ad \operatorname{Cov}(X_1, X_4) + bd \operatorname{Cov}(X_2, X_4). \tag{A.11.13}$$
If we put $X_1 = X_2 = X$ in (A.11.12), we get the formula
$$\operatorname{Var} X = E(X^2) - [E(X)]^2. \tag{A.11.14}$$
If $X_1'$ and $X_2'$ are distributed as $X_1$ and $X_2$, and $(X_1', X_2')$ is independent of $(X_1, X_2)$, then
$$\operatorname{Cov}(X_1, X_2) = \tfrac{1}{2} E[(X_1 - X_1')(X_2 - X_2')]. \tag{A.11.15}$$

The correlation of $X_1$ and $X_2$, denoted by $\operatorname{Corr}(X_1, X_2)$, is defined whenever $X_1$ and $X_2$ are not constant and the variances of $X_1$ and $X_2$ are finite by
$$\operatorname{Corr}(X_1, X_2) = \frac{\operatorname{Cov}(X_1, X_2)}{\sqrt{\operatorname{Var} X_1 \operatorname{Var} X_2}}. \tag{A.11.16}$$
The correlation of $X_1$ and $X_2$ is the covariance of the standardized versions of $X_1$ and $X_2$. The correlation inequality is equivalent to the statement
$$\operatorname{Cov}^2(X_1, X_2) \le \operatorname{Var} X_1 \operatorname{Var} X_2, \tag{A.11.17}$$
that is, $-1 \le \operatorname{Corr}(X_1, X_2) \le 1$, with equality holding if and only if (1) $X_1$ or $X_2$ is a constant, or (2) $X_2 = a + bX_1$, $b \ne 0$. It may be obtained from the Cauchy-Schwartz inequality,
$$[E(Z_1 Z_2)]^2 \le E(Z_1^2)\, E(Z_2^2), \tag{A.11.18}$$
valid for any two random variables $Z_1$, $Z_2$ such that $E(Z_1^2) < \infty$ and $E(Z_2^2) < \infty$. Equality holds in (A.11.18) if and only if one of $Z_1$, $Z_2$ equals $0$ or $Z_1 = aZ_2$ for some constant $a$. A proof of the Cauchy-Schwartz inequality is given in Remark 1.4.1. The correlation inequality corresponds to the special case $Z_1 = X_1 - E(X_1)$, $Z_2 = X_2 - E(X_2)$.
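A numerical illustration of (A.11.12) and the correlation inequality for a toy joint frequency function; all numbers below are hypothetical (Python):

```python
# Joint frequency function of (X1, X2) on a 2x2 grid.
pj = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def ev(g):
    return sum(g(x1, x2) * p for (x1, x2), p in pj.items())

m1, m2 = ev(lambda a, b: a), ev(lambda a, b: b)
cov = ev(lambda a, b: a * b) - m1 * m2            # (A.11.12)
v1 = ev(lambda a, b: a * a) - m1**2
v2 = ev(lambda a, b: b * b) - m2**2
corr = cov / (v1 * v2) ** 0.5

assert abs(corr) <= 1                             # correlation inequality
print(cov, corr)                                  # 0.1 and roughly 0.408
```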
A.11.19 If $X_1, \dots, X_n$ have finite variances, then
$$\operatorname{Var}\Big(\sum_{i=1}^{n} X_i\Big) = \sum_{i=1}^{n} \operatorname{Var} X_i + 2 \sum_{i < j} \operatorname{Cov}(X_i, X_j). \tag{A.11.20}$$
This may be checked directly from (A.11.13).

If $X_1$ and $X_2$ are independent and integrable, then
$$E(X_1 X_2) = E(X_1)\, E(X_2), \tag{A.11.21}$$
or, in view of (A.11.12),
$$\operatorname{Cov}(X_1, X_2) = \operatorname{Corr}(X_1, X_2) = 0 \quad \text{when } \operatorname{Var}(X_i) > 0,\ i = 1, 2. \tag{A.11.22}$$
It is not true in general that $X_1$ and $X_2$ that satisfy (A.11.22) (i.e., are uncorrelated) need be independent. As a consequence of (A.11.20) and (A.11.22), we see that if $X_1, \dots, X_n$ are independent with finite variances, then
$$\operatorname{Var}(X_1 + \dots + X_n) = \sum_{i=1}^{n} \operatorname{Var} X_i. \tag{A.11.23}$$
The correlation coefficient roughly measures the amount and sign of linear relationship between $X_1$ and $X_2$. It is $1$ or $-1$ in the case of perfect relationship ($X_2 = a + bX_1$, $b > 0$ or $b < 0$, respectively). See also Section 1.4.

References: Gnedenko (1967) Chapter 5, Sections 27, 28, and 30; Hoel, Port, and Stone (1971) Sections 4.3 and 4.4; Parzen (1960) Chapter 5, Sections 1-4, and Chapter 8; Pitman (1993) Section 6.4.

A.12 MOMENT AND CUMULANT GENERATING FUNCTIONS

A.12.1 If $E(e^{s_0 |X|}) < \infty$ for some $s_0 > 0$, then
$$M_X(s) = E(e^{sX})$$
is well defined for $|s| \le s_0$ and is called the moment generating function of $X$. By (A.10.5) and (A.10.6),
$$M_X(s) = \sum_{i=1}^{\infty} e^{s x_i}\, p_X(x_i) \quad \text{if } X \text{ is discrete,}$$
$$M_X(s) = \int_{-\infty}^{\infty} e^{sx}\, p_X(x)\, dx \quad \text{if } X \text{ is continuous.} \tag{A.12.2}$$

A.12.3 If $M_X$ is well defined in a neighborhood $\{s : |s| \le s_0\}$ of zero, all moments of $X$ are finite and
$$M_X(s) = \sum_{k=0}^{\infty} E(X^k)\, \frac{s^k}{k!}, \qquad |s| \le s_0. \tag{A.12.3}$$
A.12.4 The moment generating function $M_X$ determines the distribution of $X$ uniquely and is itself uniquely determined by the distribution of $X$. By (A.12.3), $M_X$ has derivatives of all orders at $s = 0$ and
$$\frac{d^k M_X(s)}{ds^k}\Big|_{s=0} = E(X^k). \tag{A.12.5}$$

A.12.6 If $X_1, \dots, X_n$ are independent random variables with moment generating functions $M_{X_1}, \dots, M_{X_n}$, then $X_1 + \dots + X_n$ has moment generating function given by
$$M_{(X_1 + \dots + X_n)}(s) = \prod_{i=1}^{n} M_{X_i}(s). \tag{A.12.6}$$
This follows by induction from the definition and (A.11.21). For a generalization of the notion of moment generating function to random vectors, see Section B.5.

A.12.7 The function
$$K_X(s) = \log M_X(s) \tag{A.12.7}$$
is called the cumulant generating function of $X$. If defined, $K_X$ can be represented by the convergent Taylor expansion
$$K_X(s) = \sum_{j=0}^{\infty} c_j\, \frac{s^j}{j!}, \tag{A.12.8}$$
where
$$c_j = \frac{d^j}{ds^j} K_X(s)\Big|_{s=0} \tag{A.12.9}$$
is called the $j$th cumulant of $X$, $j \ge 1$. If $X$ and $Y$ are independent, then $c_j(X + Y) = c_j(X) + c_j(Y)$. For $j \ge 2$ and any constant $a$, $c_j(X + a) = c_j(X)$. The first cumulant $c_1$ is the mean $\mu$ of $X$; $c_2$ and $c_3$ equal the second and third central moments $\mu_2$ and $\mu_3$ of $X$, and $c_4 = \mu_4 - 3\mu_2^2$. The coefficients of skewness and kurtosis (see (A.11.10)) can be written as $\gamma_1 = c_3 / c_2^{3/2}$ and $\gamma_2 = c_4 / c_2^2$. If $X$ is normally distributed, $c_j = 0$ for $j \ge 3$. See Problem B.3.8.

References: Hoel, Port, and Stone (1971) Chapter 8, Sections 2 and 3; Parzen (1960) Chapter 5, Section 3, and Chapter 8; Rao (1973) Section 2b.4.
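A numerical check of (A.12.5) and (A.12.6) for a Bernoulli($p$) variable, whose mgf is $M(s) = 1 - p + pe^s$: a central second difference of $M$ at $0$ approximates $E(X^2) = p$, and the mgf of a sum of $n$ independent copies is $M(s)^n$ (Python; the step size `h` is an arbitrary choice):

```python
from math import exp

p = 0.3
M = lambda s: 1 - p + p * exp(s)       # mgf of Bernoulli(p)

h = 1e-4
second_deriv = (M(h) - 2 * M(0.0) + M(-h)) / h**2  # ~ E(X^2), per (A.12.5)
assert abs(second_deriv - p) < 1e-6

# (A.12.6): mgf of the sum of n independent Bernoulli(p)'s, i.e. B(n, p):
n = 10
M_sum = lambda s: M(s) ** n
assert abs((M_sum(h) - M_sum(-h)) / (2 * h) - n * p) < 1e-6  # mean n*p
```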
A.13 SOME CLASSICAL DISCRETE AND CONTINUOUS DISTRIBUTIONS

In this section we introduce certain families of distributions, which arise frequently in probability and statistics. The symbol $p$ as usual stands for a frequency or density function; frequency and density functions are specified only on the set where they are positive. When a d.f. is given only on part of the line, it is assumed to be zero to the "left" of the set and one to the "right" of the set.

Discrete Distributions

The binomial distribution with parameters $n$ and $\theta$: $\mathcal{B}(n, \theta)$,
$$p(k) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}, \quad k = 0, 1, \dots, n. \tag{A.13.1}$$
The parameter $n$ can be any positive integer, and $\theta$ any number in $[0, 1]$. If $X$ has a $\mathcal{B}(n, \theta)$ distribution, then
$$E(X) = n\theta, \qquad \operatorname{Var} X = n\theta(1 - \theta). \tag{A.13.2}$$
These may be obtained by using (A.12.6) in conjunction with (A.12.5).

The hypergeometric distribution with parameters $D$, $N$, and $n$: $\mathcal{H}(D, N, n)$,
$$p(k) = \binom{D}{k} \binom{N - D}{n - k} \Big/ \binom{N}{n}, \quad \max(0,\, n - (N - D)) \le k \le \min(n, D). \tag{A.13.3}$$
This is the distribution of the number of "special" individuals in a sample of size $n$ drawn at random without replacement from a population of $N$ individuals of whom $D$ are special (cf. (A.6.10)).

A.13.4 If $X_1$ and $X_2$ are independent with $\mathcal{B}(n_1, \theta)$ and $\mathcal{B}(n_2, \theta)$ distributions, then $X_1 + X_2$ has a $\mathcal{B}(n_1 + n_2, \theta)$ distribution; this follows from (A.12.6). We shall write "$\mathcal{B}(n, \theta)$" for "the binomial distribution wi