Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
1'1"(.'nl icc
Hall
.. ~
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Bickel, Peter J.
  Mathematical statistics : basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
    p. cm.
  Includes bibliographical references and index.
  ISBN 0-13-850363-X (v. 1)
  1. Mathematical statistics. I. Doksum, Kjell A. II. Title.

QA276.B47 2001
519.5 dc21    00-031377
Acquisition Editor: Kathleen Boothby Sestak
Editor in Chief: Sally Yagan
Assistant Vice President of Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Production Editor: Bob Walters
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Marketing Assistant: Vince Jansen
Director of Marketing: John Tweeddale
Editorial Assistant: Joanne Wendelken
Art Director: Jayne Conte
Cover Design: Jayne Conte
©2001, 1977 by Prentice-Hall, Inc., Upper Saddle River, New Jersey 07458

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
." i
~I
"I
~ !
,'
,
I,
,.~~
_.

..
CONTENTS

PREFACE TO THE SECOND EDITION: VOLUME I    xiii

PREFACE TO THE FIRST EDITION    xvii

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA    1
  1.1 Data, Models, Parameters, and Statistics    1
      1.1.1 Data and Models    1
      1.1.2 Parametrizations and Parameters    6
      1.1.3 Statistics as Functions on the Sample Space    8
      1.1.4 Examples, Regression Models    9
  1.2 Bayesian Models    12
  1.3 The Decision Theoretic Framework    16
      1.3.1 Components of the Decision Theory Framework    17
      1.3.2 Comparison of Decision Procedures    24
      1.3.3 Bayes and Minimax Criteria    26
  1.4 Prediction    32
  1.5 Sufficiency    41
  1.6 Exponential Families    49
      1.6.1 The One-Parameter Case    49
      1.6.2 The Multiparameter Case    53
      1.6.3 Building Exponential Families    56
      1.6.4 Properties of Exponential Families    58
      1.6.5 Conjugate Families of Prior Distributions    62
  1.7 Problems and Complements    66
  1.8 Notes    95
  1.9 References    96

2 METHODS OF ESTIMATION    99
  2.1 Basic Heuristics of Estimation    99
      2.1.1 Minimum Contrast Estimates; Estimating Equations    99
      2.1.2 The Plug-In and Extension Principles    102
  2.2 Minimum Contrast Estimates and Estimating Equations    107
      2.2.1 Least Squares and Weighted Least Squares    107
      2.2.2 Maximum Likelihood    114
  2.3 Maximum Likelihood in Multiparameter Exponential Families    121
  *2.4 Algorithmic Issues    127
      2.4.1 The Method of Bisection    127
      2.4.2 Coordinate Ascent    129
      2.4.3 The Newton-Raphson Algorithm    132
      2.4.4 The EM (Expectation/Maximization) Algorithm    133
  2.5 Problems and Complements    138
  2.6 Notes    158
  2.7 References    159

3 MEASURES OF PERFORMANCE    161
  3.1 Introduction    161
  3.2 Bayes Procedures    161
  3.3 Minimax Procedures    170
  *3.4 Unbiased Estimation and Risk Inequalities    176
      3.4.1 Unbiased Estimation, Survey Sampling    176
      3.4.2 The Information Inequality    179
  *3.5 Nondecision Theoretic Criteria    188
      3.5.1 Computation    188
      3.5.2 Interpretability    189
      3.5.3 Robustness    190
  3.6 Problems and Complements    197
  3.7 Notes    210
  3.8 References    211

4 TESTING AND CONFIDENCE REGIONS    213
  4.1 Introduction    213
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma    223
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models    227
  4.4 Confidence Bounds, Intervals, and Regions    233
  4.5 The Duality Between Confidence Regions and Tests    241
  *4.6 Uniformly Most Accurate Confidence Bounds    248
  *4.7 Frequentist and Bayesian Formulations    251
  4.8 Prediction Intervals    252
  4.9 Likelihood Ratio Procedures    255
      4.9.1 Introduction    255
      4.9.2 Tests for the Mean of a Normal Distribution - Matched Pair Experiments    257
      4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations    261
      4.9.4 The Two-Sample Problem with Unequal Variances    264
      4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions    266
  4.10 Problems and Complements    269
  4.11 Notes    295
  4.12 References    295

5 ASYMPTOTIC APPROXIMATIONS    297
  5.1 Introduction: The Meaning and Uses of Asymptotics    297
  5.2 Consistency    301
      5.2.1 Plug-In Estimates and MLEs in Exponential Family Models    301
      5.2.2 Consistency of Minimum Contrast Estimates    304
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications    306
      5.3.1 The Delta Method for Moments    306
      5.3.2 The Delta Method for In Law Approximations    311
      5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families    322
  5.4 Asymptotic Theory in One Dimension    324
      5.4.1 Estimation: The Multinomial Case    324
      *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates    327
      *5.4.3 Asymptotic Normality and Efficiency of the MLE    331
      *5.4.4 Testing    332
      *5.4.5 Confidence Bounds    336
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution    337
  5.6 Problems and Complements    345
  5.7 Notes    362
  5.8 References    363

6 INFERENCE IN THE MULTIPARAMETER CASE    365
  6.1 Inference for Gaussian Linear Models    365
      6.1.1 The Classical Gaussian Linear Model    366
      6.1.2 Estimation    369
      6.1.3 Tests and Confidence Intervals    374
  *6.2 Asymptotic Estimation Theory in p Dimensions    383
      6.2.1 Estimating Equations    384
      6.2.2 Asymptotic Normality and Efficiency of the MLE    386
      6.2.3 The Posterior Distribution in the Multiparameter Case    391
  *6.3 Large Sample Tests and Confidence Regions    392
      6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic    392
      6.3.2 Wald's and Rao's Large Sample Tests    398
  *6.4 Large Sample Methods for Discrete Data    400
      6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's Chi-Square Test    401
      6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables    403
      6.4.3 Logistic Regression for Binary Responses    408
  *6.5 Generalized Linear Models    411
  *6.6 Robustness Properties and Semiparametric Models    417
  6.7 Problems and Complements    422
  6.8 Notes    438
  6.9 References    438

A A REVIEW OF BASIC PROBABILITY THEORY    441
  A.1 The Basic Model    441
  A.2 Elementary Properties of Probability Models    443
  A.3 Discrete Probability Models    443
  A.4 Conditional Probability and Independence    444
  A.5 Compound Experiments    446
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement    447
  A.7 Probabilities on Euclidean Space    448
  A.8 Random Variables and Vectors: Transformations    451
  A.9 Independence of Random Variables and Vectors    453
  A.10 The Expectation of a Random Variable    454
  A.11 Moments    456
  A.12 Moment and Cumulant Generating Functions    459
  A.13 Some Classical Discrete and Continuous Distributions    460
  A.14 Modes of Convergence of Random Variables and Limit Theorems    466
  A.15 Further Limit Theorems and Inequalities    468
  A.16 Poisson Process    472
  A.17 Notes    474
  A.18 References    475

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS    477
  B.1 Conditioning by a Random Variable or Vector    477
      B.1.1 The Discrete Case    477
      B.1.2 Conditional Expectation for Discrete Variables    479
      B.1.3 Properties of Conditional Expected Values    480
      B.1.4 Continuous Variables    482
      B.1.5 Comments on the General Case    484
  B.2 Distribution Theory for Transformations of Random Vectors    485
      B.2.1 The Basic Framework    485
      B.2.2 The Gamma and Beta Distributions    488
  B.3 Distribution Theory for Samples from a Normal Population    491
      B.3.1 The Chi-Square, F, and t Distributions    491
      B.3.2 Orthogonal Transformations    494
  B.4 The Bivariate Normal Distribution    497
  B.5 Moments of Random Vectors and Matrices    502
      B.5.1 Basic Properties of Expectations    502
      B.5.2 Properties of Variance    503
  B.6 The Multivariate Normal Distribution    506
      B.6.1 Definition and Density    506
      B.6.2 Basic Properties. Conditional Distributions    508
  B.7 Convergence for Random Vectors: O_P and o_P Notation    511
  B.8 Multivariate Calculus    516
  B.9 Convexity and Inequalities    518
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory    519
      B.10.1 Symmetric Matrices    519
      B.10.2 Order on Symmetric Matrices    520
      B.10.3 Elementary Hilbert Space Theory    521
  B.11 Problems and Complements    524
  B.12 Notes    538
  B.13 References    539

C TABLES    541
  Table I The Standard Normal Distribution    542
  Table I' Auxiliary Table of the Standard Normal Distribution    543
  Table II t Distribution Critical Values    544
  Table III Chi-Square Distribution Critical Values    545
  Table IV F Distribution Critical Values    546

INDEX    547
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared, statistics has changed enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.
(2) The generation of enormous amounts of data: terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once been unthinkable. The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data. As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with an "infinite" number of parameters.
(2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.
(3) The use of methods of inference involving simulation as a key element, such as the bootstrap and Markov Chain Monte Carlo.
and probability inequalities as well as more advanced topics in matrix theory and analysis. been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. each to be only a little shorter than the first edition. problems. then go to parameters and parametric models stressing the role of identifiability. our focus and order of presentation have changed. There have. From the beginning we stress functionvalued parameters. Volume I. (6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories. covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing concluSIons. The reason for these additions are the changes in subject matter necessitated by the current areas of importance in the field. Appendix B is as selfcontained as possible with proofs of mOst statements. However. Hilbert space theory is not needed. and references to the literature for proofs of the deepest results such as the spectral theorem. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on Rd as well as an elementary introduction to Hilbert space theory. which includes more advanced topics from probability theory such as the multivariate Gaussian distribution. However. some methods run quickly in real time. reflecting what we now teach our graduate students. These will not be dealt with in OUr work.and semipararnetric models. Our one long book has grown to two volumes. which we present in 2000. we assume either a discrete probability whose support does not depend On the parameter set. As a consequence our second edition." That is. of course. such as the density. such as the empirical distribution function. instead of beginning with parametrized models we include from the start non.xiv Preface to the Second Edition: Volume I (4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious. weak convergence in Euclidean spaces. We . Chapter 1 now has become part of a larger Appendix B. Despite advances in computing speed. but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis. (5) The study of the interplay between numerical and statistical considerations. we do not require measure theory but assume from the start that our models are what we call "regular. Specifically. or the absolutely continuous case with a density. Others do not and some though theoretically attractive cannot be implemented in a human lifetime. Volume [ covers the malerial of Chapters 16 and Chapter 10 of the first edition with pieces of Chapters 710 and includes Appendix A on basic probability theory. As in the first edition. is much changed from the first. and functionvalued statistics.
the Wald and Rao statistics and associated confidence regions. including some optimality theory for estimation as well and elementary robustness considerations. including a complete study of MLEs in canonical kparameter exponential families. Other novel features of this chapter include a detailed analysis including proofs of convergence of a standard but slow algorithm for computing MLEs in muitiparameter exponential families and ail introduction to the EM algorithm. There is more material on Bayesian models and analysis.Preface to the Second Edition: Volume I xv also. if somewhat augmented. The conventions established on footnotes and notation in the first edition remain. are an extended discussion of prediction and an expanded introduction to kparameter exponential families. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the mOre advanced topics of Volume II. These objects that are the building blocks of most modem models require concepts involving moments of random vectors and convexity that are given in Appendix B. there is clearly much that could be omitted at a first reading that we also star. Robustness from an asymptotic theory point of view appears also. inference in the general linear model. which parallels Chapter 2 of the first edition. Wilks theorem on the asymptotic distribution of the likelihood ratio test. Chapter 5 of the new edition is devoted to asymptotic approximations. Also new is a section relating Bayesian and frequentist inference via the Bemsteinvon Mises theorem. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Generalized linear models are introduced as examples. Almost all the previous ones have been kept with an approximately equal number of new ones addedto correspond to our new topics and point of view. we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. Nevertheless. from the start. Chapters 14 develop the basic principles and examples of statistics. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs). include examples that are important in applications. and some parailels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. Although we believe the material of Chapters 5 and 6 has now become fundamental. such as regression experiments. Save for these changes of emphasis the other major new elements of Chapter 1. Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions. Chapter 2 of this edition parallels Chapter 3 of the first artd deals with estimation. Finaliy. As in the first edition problems playa critical role by elucidating and often substantially expanding the text. There are clear dependencies between starred . The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage. Included are asymptotic normality of maximum likelihood estimates. Chapter 6 is devoted to inference in multivariate (multiparameter) models. One of the main ingredients of most modem algorithms for inference.
Topics to be covered include permutation and rank tests and their basis in completeness and equivariance.berkeley.5 I. 6.edn Kjell Doksnm doksnm@stat. j j Peter J. Jianhna Hnang.3 ~ 6. will be studied in the context of nonparametric function estimation. Fujimura. Bickel bickel@stat. encouragement. density estimation. Examples of application such as the Cox model in survival analysis. A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo. We also thank Faye Yeager for typing.2 ~ 6.6 Volume II is expected to be forthcoming in 2003.2 ~ 5.3 ~ 6. Michael Ostland and Simon Cawley for producing the graphs. For the first volume of the second edition we would like to add thanks to new colleagnes.berkeley.XVI • Pref3ce to the Second Edition: Volume I sections that follow. elementary empirical process theory. greatly extending the material in Chapter 8 of the first edition. and Prentice Hall for generous production support. convergence for random processes. Semiparametric estimation and testing will be considered more generally.edn . taken on a new life. in part in appendices. We also expect to discuss classification and model selection using the elementary theory of empirical processes. particnlarly Jianging Fan.4 ~ 6.4. are weak. • with the field. and active participation in an enterprise that at times seemed endless. Michael Jordan. and Carl Spruill and the many students who were guinea pigs in the basic theory course at Berkeley. The basic asymptotic tools that will be developed or presented. Ying Qing Chen. Nancy Kramer Bickel and Joan H. and the functional delta method. in part in the text and. and the classical nonparametric k sample and independence problems will be included. Yoram Gat for proofreading that found not only typos but serious errors. The topic presently in Chapter 8. With the tools and concepts developed in this second volume students will be ready for advanced research in modem statistics. 5.4. and our families for support. other transformation models. appeared gratifyingly ended in 1976 but has. Last and most important we would like to thank our wives.
The work of Rao. However. (2) Give careful proofs of the major "elementary" results such as the NeymanPearson lemma. Although there are several good books available for tbis purpose. the Lehmann5cheffe theorem. (4) Show how the ideas aod results apply in a variety of important subfields such as Gaussian linear mOdels. 2nd ed.. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). PREFACE TO THE FIRST EDITION This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. Our appendix does give all the probability that is needed. the treatment is abridged with few proofs and no examples or problems. statistics. Be cause the book is an introduction to statistics. and the GaussMarkoff theorem. covers most of the material we do and much more but at a more abstract level employing measure theory. and Stone's Introduction to Probability Theory. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Our book contains more material than can be covered in tw~ qp. and the structure of both Bayes and admissible solutions in decision theory. multinomial models. In the twoquarter courses for graduate students in mathematics. the physical sciences. we feel that none has quite the mix of coverage and depth desirable at this level.arters. which go from modeling through estimation and testing to linear models. Finally. 3rd 00. and engineering that we have taught we cover the core Chapters 2 to 7.. and nonparametric models. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior. the information inequality. Introduction to Mathematical Statistics. (3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated. Linear Statistical Inference and Its Applications. we need probability theory and expect readers to have had a course at the level of. Port. for instance. Hoel. We feel such an introduction should at least do the following: (1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice. we select topics from xvii .
6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day. (i) Various notational conventions and abbreviations are used in the text. Chen. L. and friends who helped us during the various stageS (notes. densities. I . J. Conventions: (i) In order to minimize the number of footnotes we have added a section of comments at the end of each Chapter preceding the problem section. A special feature of the book is its many problems. X. Pyke's careful reading of a nexttofinal version caught a number of infelicities of style and content Many careless mistakes and typographical errors in an earlier version were caught by D. distribution functions. 2 for the second. Drew. W. Quang. The comments contain digressions. and additional references. Bickel Kjell Doksum Berkeley /976 : . Cannichael. respectively. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text.5 was discovered by F. preliminary edition. E. R. We would like to acknowledge our indebtedness to colleagues. Gray. Without Winston Chow's lovely plots Section 9. (iii) Basic notation for probabilistic objects such as random variables and vectors. Chou. A serious error in Problem 2.xviii Preface to the First Edition Chapter 8 on discrete data and Chapter 9 on nonpararnetric models. I for the first. enthusiastic. Minassian who sent us an exhaustive and helpful listing. These comments are ordered by the section to which they pertain. Lehmann's wise advice has played a decisive role at many points. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. or it may be included at the end of an introductory probability course that precedes the statistics course. in proofreading the final version. and so On. C. They need to be read only as the reader's curiosity is piqued.2. reservations. U. Among many others who helped in the same way we would like to mention C. G. and moments is established in the appendix. Gupta. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers. . It may be integrated with the material of Chapters 27 as the course proceeds rather than being given at the start. Scholz. The foundation of oUr statistical knowledge was obtained in the lucid. final draft) through which this book passed. students. S. We would also like tn thank tlle colleagues and friends who Inspired and helped us to enter the field of statistics. Later we were both very much influenced by Erich Lehmann whose ideas are strongly rellected in this hook. Chapter 1 covers probability theory rather than statistics. and A Samulon. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. caught mOre mistakes than both authors together. i . Peler J. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text. and stimUlating lectures of Joe Hodges and Chuck Bell. P.
Mathematical Statistics Basic Ideas and Selected Topics Volume I Second Edition .
and/or characters. Data can consist of: (1) Vectors of scalars. scientific or industrial. Chapter 1. In any case all our models are generic and. MODELS.Chapter 1 STATISTICAL MODELS. trees as in evolutionary phylogenies.1 DATA. AND PERFORMANCE CRITERIA 1.2. but we will introduce general model diagnostic tools in Volume 2.1. large scale or small. A generic source of trouble often called grf?ss errors is discussed in greater detail in the section on robustness (Section 3. are to draw useful information from data using everything that we know. we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. produce data whose analysis is the ultimate object of the endeavor.3). Subject matter specialists usually have to be principal guides in model formulation. functions as in signal processing.4 and Sections 2. which statisticians share. in particUlar. A detailed discussion of the appropriateness of the models we shall discuss in particUlar situations is beyond the scope of this book. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically. and so on. A 1 . (4) All of the above and more. PARAMETERS AND STATISTICS Data and Models Most studies and experiments. for example.5. as usual. GOALS. ''The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. for example. Moreover. measurements.1.1 1. (3) Arrays of scalars and/or characters as in contingency tablessee Chapter 6or more generally multifactor IDultiresponse data on a number of individuals. a single time series of measurements. digitized pictures or more routinely measurements of covariates and response on a set of n individualssee Example 1. The goals of science and society. (2) Matrices of scalars and/or characters.1 and 6.1.
We begin this in the simple examples that follow and continue in Sections 1. An unknown number Ne of these elements are defective.. reducing pollution. The data gathered are the number of defectives found in the sample. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. testing. and diagnostics are discussed in Volume 2. learning a maze. For instance. if we observe an effect in our data. starting with tentative models: (I) We can conceptualize the data structure and our goals more precisely. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population.3 and continue with optimality principles in Chapters 3 and 4. (c) An experimenter makes n independent detenninations of the value of a physical constant p. (5) We can be guided to alternative or more general descriptions that might fit better.! I 2 Statistical Models. we can assign two drugs. . have the patients rated qualitatively for improvement by physicians. for modeling purposes. In this manner. Hierarchies of models are discussed throughout. Random variability " : . we obtain one or more quantitative or qualitative measures of efficacy from each experiment. and so on. and so on. (b) We want to study how a physical or economic feature. a shipment of manufactured items. and B to n. (4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. and more general procedures will be discussed in Chapters 24. treating a disease. It is too expensive to examine all of the items. Chapter I. plus some random errors. robustness. for instance. e. For instance.21. "Models of course. A to m. Here are some examples: (a) We are faced with a population of N elements. give methods that assess the generalizability of experimental results. is distributed in a large population. in the words of George Box (1979). and Performance Criteria Chapter 1 priori. (d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee. An exhaustive census is impossible so the study is based on measurements and a sample of n individuals drawn at random from the population.5 and throughout the book. confidence regions. We begin this discussion with decision theory in Section 1. producing energy. randomly selected patients and then measure temperature and blood pressure. are never true but fortunately it is only necessary that they be usefuL" In this book we will study how. The population is so large that. in particular. height or income. So to get infonnation about a sample of n is drawn without replacement and inspected. (2) We can derive methods of extracting useful information from data and. Goals.. Goodness of fit tests. we approximate the actual process of sampling without replacement by sampling with replacement. His or her measurements are subject to random fluctuations (error) and the data can be thought of as p. to what extent can we expect the same effect more generally? Estimation. (3) We can assess the effectiveness of the methods we propose. for example.
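The sampling-inspection setting of example (a) is easy to simulate directly. The sketch below is illustrative only and is not from the text; the shipment size N, the number of defectives, and the sample size n are arbitrary choices made for the example.

```python
import numpy as np

# Illustrative simulation of example (a): a shipment of N items contains
# n_defective defectives; we inspect a sample of size n drawn without
# replacement and record the number of defectives found.
rng = np.random.default_rng(0)
N, n_defective, n = 100, 10, 19                                   # hypothetical values

shipment = np.array([1] * n_defective + [0] * (N - n_defective))  # 1 = defective
sample = rng.choice(shipment, size=n, replace=False)              # without replacement
print("defectives found in sample:", int(sample.sum()))
```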
OneSample Models.". n). Parameters..t + (i..l. we observe XI. What should we assume about the distribution of €. Sampling Inspection.2) where € = (€ I . . On this space we can define a random variable X given by X(k) ~ k.. That is... If N8 is the number of defective items in the population sampled.l) if max(n .. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not... ... . .1. n.d. N. . .Section 1.. 1. which together with J1.. . Fonnally. where .. 1 €n) T is the vector of random errors.. as Xi = I. in principle. and also write that Xl. H(N8.. anyone of which could have generated the data actually observed. .. First consider situation (a). as X with X . n corresponding to the number of defective items found. identically distributed (i. .. I. . which are modeled as realizations of X I. 1 <i < n (1.1.xn . The sample space consists of the numbers 0. X n ? Of course. . Here we can write the n determinations of p. F. . The same model also arises naturally in situation (c). X n as a random sample from F. We often refer to such X I..."" €n are independent. . we postulate (l) The value of the error committed on one determination does not affect the value of the error at other times. €I./ ' stands for "is distributed as.. . that depends on how the experiment is carried out. although the sample space is well defined. we cannot specify the probability structure completely but rather only give a family {H(N8. The mathematical model suggested by the description is well defined.Xn independent.1. can take on any value between and N.6) (J. Given the description in (c). so that sampling with replacement replaces sampling without.. X n are ij.) random variables with common unknown distribution function F.8).d.. The main difference that our model exhibits from the usual probability model is that NO is unknown and. n) distribution. N. It can also be thought of as a limiting case in which N = 00. k ~ 0. completely specifies the joint distribution of Xl. Sample from a Population. . X has an hypergeometric. Models.. which we refer to as: Example 1.2.1.." The model is fully described by the set :F of distributions that we specify. 0 ° Example 1.i. So. .. if the measurements are scalar. A random experiment has been perfonned. n)} of probability distributions for X. We shall use these examples to arrive at out formulation of statistical models and to indicate some of the difficulties of constructing such models. then by (AI3. and Statistics 3 here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.N(I .0) < k < min(N8.1 Data. Thus.
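In Example 1.1.1 the number of defectives X in the sample has the hypergeometric distribution H(N*theta, N, n), and Example 1.1.2 notes that when N is large, sampling with replacement (a binomial model) is a good approximation. The following sketch checks this numerically with scipy; the particular values of N, theta, and n are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import hypergeom, binom

# Compare the hypergeometric frequency function of Example 1.1.1 with the
# binomial approximation suggested in Example 1.1.2 (sampling with replacement).
N, theta, n = 1000, 0.1, 20                 # illustrative values
D = int(N * theta)                          # defectives in the population
k = np.arange(n + 1)

p_hyper = hypergeom.pmf(k, N, D, n)         # scipy order: (k, population, successes, draws)
p_binom = binom.pmf(k, n, theta)

print("largest discrepancy:", float(np.max(np.abs(p_hyper - p_binom))))
```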
whatever be J. We let the y's denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo.1. G E Q} where 9 is the set of all allowable error distributions that we postulate.Xn are a random sample and. . . Thus. Example 1. cr > O} where tP is the standard normal distribution. Goals. Then if F is the N(I". (2) Suppose that if treatment A had been administered to a subject response X would have been obtained. . then G(·) F(' . response y = X + ~ would be obtained where ~ does not depend on x.4 Statistical Models.. so that the model is specified by the set of possible (F. Heights are always nonnegative. We call the y's treatment observations. or alternatively all distributions with expectation O. if we let G be the distribution function of f 1 and F that of Xl..Yn." Now consider situation (d). E}. if drug A is a standard or placebo.. 1 X Tn a sample from F. Often the final simplification is made. we refer to the x's as control observations. YI. we have specified the Gaussian two sample model with equal variances. cr 2 ) distribution. will have none of this. .. By convention. •.t. 0'2). where 0'2 is unknown. 1 X Tn . patients improve even if they only think they are being treated. respectively. . The classical def~ult model is: (4) The common distribution of the errors is N(o... (3) The distribution of f is independent of J1.. Equivalently Xl.. This implies that if F is the distribution of a control. To specify this set more closely the critical constant treatment effect assumption is often made.2) distribution and G is the N(J. . All actual measurements are discrete rather than continuous. and Performance Criteria Chapter 1 (2) The distribution of the error at one determination is the same as that at another.. It is important to remember that these are assumptions at best only approximately valid. the Xi are a sample from a N(J. '.. There are absolute bounds on most quantitiesIOO ft high men are impossible. That is. G) : J1 E R. Let Xl. .. Commonly considered 9's are all distributions with center of symmetry 0.t and cr.~). (3) The control responses are normally distributed. tbe set of F's we postulate..1. 1 Yn a sample from G. 0 ! I = . that is. The Gaussian distribution.3) and the model is alternatively specified by F. 0 This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations.t + ~. A placebo is a substance such as water tpat is expected to have no effect on the disease and is used to correct for the welldocumented placebo effect. or by {(I". (72) population or equivalently F = {tP (':Ji:) : Jl E R. G) pairs.3. and Y1. £n are identically distributed. . . Natural initial assumptions here are: (1) The x's and y's are realizations of Xl.. then F(x) = G(x  1") (1. heights of individuals or log incomes. for instance. We call this the shift model with parameter ~. TwoSample Models. be the responses of m subjects having a given disease given drug A and n other similarly diseased subjects given drug B. Then if treatment B had been administered to the same subject instead of treatment A.
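A quick way to picture the shift model G(x) = F(x - Delta) and the Gaussian two-sample model with equal variances is to simulate control and treatment responses; the parameter values in the sketch below are purely illustrative and not from the text.

```python
import numpy as np

# Illustrative simulation of the Gaussian two-sample model with a constant
# treatment effect: X_1,...,X_m ~ N(mu, sigma^2) (controls) and
# Y_1,...,Y_n ~ N(mu + Delta, sigma^2) (treatments), so G(x) = F(x - Delta).
rng = np.random.default_rng(1)
mu, Delta, sigma = 5.0, 1.5, 2.0            # hypothetical parameter values
m, n = 50, 60

x = rng.normal(mu, sigma, size=m)           # control observations
y = rng.normal(mu + Delta, sigma, size=n)   # treatment observations
print("Y mean - X mean =", round(float(y.mean() - x.mean()), 3), "(close to Delta)")
```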
In this situation (and generally) it is important to randomize. Parameters.6. we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated. In Example 1.1. The danger is that. For instance. we can be reasonably secure about some aspects. there is tremendous variation in the degree of knowledge and control we have concerning experiments.Section 1. have been assigned to B. our analyses. P is the . the methods needed for its analysis are much the same as those appropriate for the situation of Example 1. Models. The number of defectives in the first example clearly has a hypergeometric distribution. For instance.2. As our examples suggest. It is often convenient to identify the random vector X with its realization. ~(w) is referred to as the observations or data. the data X(w). we observe X and the family P is that of all bypergeometric distributions with sample size n and population size N.'" . Using our first three examples for illustrative purposes. We are given a random experiment with sample space O. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise. When w is the outcome of the experiment. in comparative experiments such as those of Example 1. Fortunately.1. G are assumed arbitrary.3 when F.2 is that. if (1)(4) hold. may be quite irrelevant to the experiment that was actually performed.1). However. we know how to combine our measurements to estimate JL in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4. and Statistics 5 How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. for instance. we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients.2. This will be done in Sections 3. in Example 1. if they are true.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease. we now define the elements of a statistical model. All the severely ill patients might.1. equally trained observers with no knowledge of each other's findings. For instance. The advantage of piling on assumptions such as (I )(4) of Example 1.1. but not others. Since it is only X that we observe. if they are false.4. A review of necessary concepts and notation from probability theory are given in the appendices.1.1. That is. we need only consider its probability distribution.3 and 6. the number of a particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed. P is referred to as the model. though correct for the model written down.5. Statistical methods for models of this kind are given in Volume 2. In others.Xn ).1. In some applications we often have a tested theoretical model and the danger is small. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. Experiments in medicine and the social sciences often pose particular difficulties. in Example 1. This distribution is assumed to be a member of a family P of probability distributions on Rn. 
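The randomization step discussed above, choosing which m of the m + n available patients receive drug A as a sample without replacement, can be carried out with any reliable random mechanism. A minimal sketch using numpy's random permutation follows; the group sizes are arbitrary.

```python
import numpy as np

# Randomly assign m of the m + n available patients to drug A and the rest to
# drug B, so that the A group is a sample without replacement from the m + n
# patients (the randomization discussed in the text).
rng = np.random.default_rng(2)
m, n = 10, 12
patients = np.arange(m + n)                 # patient labels 0, 1, ..., m+n-1

order = rng.permutation(patients)
group_A, group_B = order[:m], order[m:]
print("drug A:", sorted(group_A.tolist()))
print("drug B:", sorted(group_B.tolist()))
```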
we can ensure independence and identical distribution of the observations by using different. On this sample space we have defined a random vector X = (Xl.1 Data.
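One way to make the idea of a statistical model indexed by a parameter, that is, a map theta -> P_theta, concrete is to code the model as a function from a parameter value to a distribution object. The sketch below does this for the hypergeometric family of Example 1.1.1; the values of N and n are arbitrary, and the code is an illustration rather than anything prescribed by the text.

```python
from scipy.stats import hypergeom

# A parametrization is a map theta -> P_theta.  Here the model of Example
# 1.1.1 is coded as such a map: theta ranges over {0, 1/N, ..., 1} and
# P_theta is the hypergeometric distribution H(N*theta, N, n).
N, n = 100, 19                              # illustrative values

def P(theta):
    """Return the distribution of X (defectives in the sample) under theta."""
    return hypergeom(N, int(round(N * theta)), n)   # (population, successes, draws)

print("P_0.1(X = 2) =", P(0.1).pmf(2))
print("E_0.1[X]     =", P(0.1).mean())
```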
Of even greater concern is the possibility that the parametrization is not onetoone.2)).X l .2. Finally. G) : I' E R. knowledge of the true Pe. G with density 9 such that xg(x )dx = O} and p(".1.3 that Xl. the parameter space e..2) suppose that we pennit G to be arbitrary. in Example 1. we can take = {(I'. in senses to be made precise later... we will need to ensure that our parametrizations are identifiable.1 we can use the number of defectives in the population. models P are called parametric. Ghas(arbitrary)densityg}.1. Thus.2 ) distribution. j..2 Parametrizations and Parameters i e To describe P we USe a parametrization. .l. X rn are independent of each other and Yl . in Example = {O. The critical problem with such parametrizations is that ev~n with "infinite amounts of data. or equivalently write P = {Pe : BEe}.. on the other hand. moreover. tt = to the same distribution of the observations a~ tt = 1 and N (~ 1. () is the fraction of defectives. tt is the unknown constant being measured.Xn are independent and identically distributed with a common N (IL. parts of fJ remain unknowable. . such that we can have (}1 f..G) : I' E R. we may parametrize the model by the first and second moments of the normal distribution of the observations (i.1.. that is. Thus. If. . () t Po from a space of labels. 1) errors lead parametrization is unidentifiable because. 02 ). The only truly nonparametric but useless model for X E R n is to assume that its (joint) distribution can be anything. When we can take to be a nice subset of Euclidean space and the maps 0 + Po are smooth. under assumptions (1)(4). G) into the distribution of (Xl. NO. Yn . What parametrization we choose is usually suggested by the phenomenon we are modeling. the first parametrization we arrive at is not necessarily the one leading to the simplest analysis. We may take any onetoone function of 0 as a new parameter.1. .. . by (tt. N.2 with assumptions (1)(4) we have = R x R+ and.3 with only (I) holding and F.6 Statistical Models. For instance.1. In Example 1. However. Now the and N(O. I 1.1 we take () to be the fraction of defectives in the shipment.. and Performance Criteria Chapter 1 family of all distributions according to which Xl.1.2 with assumptions (1)(3) are called semiparametric." that is. . For instance. Pe the distribution on Rfl with density implicitly taken n~ 1 . as we shall see later. X n ) remains the same but = {(I'. If.. to P. 1) errors. Pe 2 • 2 e ° .I} and 1. we know we are measuring a positive quantity in this model. lh i O => POl f. Goals. for eXample. .. we only wish to make assumptions (1}(3) with t:. It's important to note that even nonparametric models make substantial assumptionsin Example 1. Pe the H(NB. in (l.G taken to be arbitrary are called nonparametric. e j .e. Models such as that of Example 1. if) (X'. 02 and yet Pel = Pe 2 • Such parametrizations are called unidentifiable.JL) where cp is the standard normal density. .G) has density n~ 1 g(Xi 1'). . that is. . 1 • • • . that is. n) distribution. we have = R+ x R+ . Note that there are many ways of choosing a parametrization in these and all other problems. ...1.t2 + k e e e J . a map. . . Then the map sending B = (I'.. having expectation 0. as a parameter and in Example 1.Xrn are identically distributed as are Y1 . 0. Yn . if () = (Il. .1. still in this example. models such as that of Example 1.
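The lack of identifiability discussed here can be seen numerically: if Y = mu + Delta + epsilon with epsilon standard normal, the two labels (mu, Delta) = (0, 1) and (1, 0) produce exactly the same distribution for Y, so (mu, Delta) is not identifiable, while the sum mu + Delta is. The small check below is purely illustrative.

```python
import numpy as np
from scipy.stats import norm

# Unidentifiability: with Y = mu + Delta + eps, eps ~ N(0, 1), the labels
# (mu, Delta) = (0, 1) and (1, 0) give the same distribution for Y,
# so (mu, Delta) is not a parameter; only mu + Delta is identifiable.
y = np.linspace(-4.0, 6.0, 201)
density_1 = norm.pdf(y, loc=0 + 1, scale=1)   # (mu, Delta) = (0, 1)
density_2 = norm.pdf(y, loc=1 + 0, scale=1)   # (mu, Delta) = (1, 0)
print("densities identical:", bool(np.allclose(density_1, density_2)))
```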
For instance. More generally. if POl = Pe').2 again in which we assume the error € to be Gaussian but with arbitrary mean~. from to its range P iff the latter map is 11.1 Data. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter (). A parameter is a feature v(P) of the distribution of X. X n independent with common distribution. or the midpoint of the interquantile range of P. implies 8 1 = 82 . make 8 + Po into a parametrization of P. Implicit in this description is the assumption that () is a parameter in the sense we have just defined. Then P is parametrized by 8 = (fl1~' 0'2). if the errors are normally distributed with unknown variance 0'2. there are also usually nuisance parameters. But given a parametrization (J + Pe.2 where fl denotes the mean income and. a map from some e to P. where 0'2 is the variance of €..1. Models.3 with assumptions (1H2) we are interested in~. and observe Xl.. v. For instance. (2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable). (J is a parameter if and only if the parametrization is identifiable. thus. or the median of P.1. formally a map. the focus of the study.2 with assumptions (1)(4) the parameter of interest fl.1. which indexes the family P. we can start by making the difference of the means.1.2 is now well defined and identifiable by (1. For instance. In addition to the parameters of interest.Section 1.2. But = Var(X.1. is that of a parameter. in Example 1. that is. ~ Po.. For instance. Then "is identifiable whenever flx and flY exist. ." = f1Y .1. E(€i) = O. ) evidently is and so is I' + C>.1.3. which can be thought of as the difference in the means of the two populations of responses. our interest in studying a population of incomes may precisely be in the mean income.flx. As we have seen this parametrization is unidentifiable and neither f1 nor ~ arc parameters in the sense we've defined. For instance. say with replacement. Here are two points to note: (1) A parameter can have many representations. the fraction of defectives () can be thought of as the mean of X/no In Example 1. consider Example 1. . . from P to another space N. then 0'2 is a nuisance parameter. Parameters. implies q(BJl = q(B2 ) and then v(Po) q(B). Similarly. as long as P is the set of all Gaussian distributions. When we sample. which correspond to other unknown features of the distribution of X. in Example 1. and Statistics 7 Dual to the notion of a parametrization. instead of postulating a constant treatment effect ~. we can define (J : P + as the inverse of the map 8 + Pe.1. in Example 1. The (fl. it is natural to write e e e .M(P) can be characterized as the mean of P. a function q : + N can be identified with a parameter v( P) iff Po. G) parametrization of Example 1.1. in Example 1. that is.3) and 9 = (G : xdG(x) = J O}. or more generally as the center of symmetry of P. Formally. Sometimes the choice of P starts by the consideration of a particular parameter.
3 Statistics as Functions on the Sample Space I j • . Goals.. x E Ris F(X " ~ I . . Informally. These issues will be discussed further in Volume 2. is used to decide what estimate of the measure of difference should be employed (ct. Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. the fraction defective in the sample. this difference depends on the patient in a complex manner (the effect of each drug is complex). Models and parametrizations are creations of the statistician. however. a statistic T is a map from the sample space X to some space of values T.1.. we can draw guidelines from our numbers and cautiously proceed. It estimates the function valued parameter F defined by its evaluation at x E R. X n ) are a sample from a probability P on R and I(A) is the indicator of the event A. This statistic takes values in the set of all distribution functions on R. in Example 1. a statistic we shall study extensively in Chapter 2 is the function valued statistic F. usually a Euclidean space. Next this model. statistics. For future reference we note that a statistic just as a parameter need not be real or Euclidean valued.• 1 X n ) = X ~ L~ I Xi. Our aim is to use the data inductively. and Performance Criteria Chapter 1 1. Mandel. For instance. For instance.1. < xJ. called the empirical distribution function. consider situation (d) listed at the beginning of this section. Xn)(x) = n L I(X n i < x) i=l where (X" . which now depends on the data.X)" L i=l I n How we use statistics in estimation and other decision procedures is the subject of the next section.2 a cOmmon estimate of J1. we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure. F(P)(x) = PIX.2 is the statistic I i i = i i 8 2 = n1 "(Xi .. for example. is the statistic T(X 11' . In this volume we assume that the model has . then our attention naturally focuses on estimating this constant If. a common estimate of 0. Nevertheless. which evaluated at ~ X and 8 2 are called the sample mean and sample variance..1. to narrow down in useful ways our ideas of what the "true" P is. hence. . The link for us are things we can compute. Thus. Databased model selection can make it difficult to ascenain or even assign a meaning to the accuracy of estimates or the probability of reaChing correct conclusions. Formally. can be related to model formulation as we saw earlier. T( x) is what we can compute if we observe X = x.1. T(x) = x/no In Example 1.8 Statistical Models. 1964). but the true values of parameters are secrets of nature. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient. Deciding which statistics are important is closely connected to deciding which parameters are important and.
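The statistics mentioned in this subsection, the sample mean, the sample variance, and the function-valued empirical distribution function, are simple to compute. A minimal sketch on simulated data follows; the simulated values merely stand in for observations and are not from the text.

```python
import numpy as np

# Compute the statistics discussed in this subsection on simulated data:
# the sample mean, the sample variance, and the empirical distribution
# function F_hat(x) = (1/n) * #{i : X_i <= x}.
rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=25)   # stand-in observations

x_bar = x.mean()
s2 = x.var(ddof=0)                            # divisor n, as defined here (ddof=1 gives n-1)

def F_hat(t, data=x):
    """Empirical distribution function evaluated at t."""
    return float(np.mean(data <= t))

print("sample mean:", round(float(x_bar), 3))
print("sample variance:", round(float(s2), 3))
print("F_hat(2.0):", F_hat(2.0))
```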
). .). Regression Models We end this section with two further important examples indicating the wide scope of the notions we have introduced.. Moreover. Regression Models.1. 0). See A..lO. the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other. ) that is independent of 0 such that 1 P(Xi' 0) = 1 for all O.3. There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. we shall denote the distribution corresponding to any particular parameter value 8 by Po. The experimenter may.1. the number of patients in the study (the sample size) is random. in the study. in Example 1.1. Regular models. Notation. are continuous with densities p(x. Yn )..4 Examples. Distribution functions will be denoted by F(·. . For instance.3 we could take z to be the treatment label and write onr observations as (A. in situation (d) again. after a while. density and frequency functions by p(. 8).. (A.. X m ). (B. Zi is a d dimensional vector that gives characteristics such as sex. In the discrete case we will use both the tennsfrequency jimction and density for p(x. (B. 1990). Problems such as these lie in the fields of sequential analysis and experimental deSign. This is the Stage for the following.1 Data.. Y n ) where Y11 ••• . Parameters. Yn are independent. 0). 'L':' Such models will be called regular parametric models. Thus. height. (2) All of the P e are discrete with frequency functions p(x.X" . Expectations calculated under the assumption that X rv P e will be written Eo. for example. Y. We observe (ZI' Y. 0). In most studies we are interested in studying relations between responses and several other variables not just treatment or control as in Example 1. assign the drug that seems to be working better to a higher proportion of patients. . This selection is based on experience with previous similar experiments (cf. and so On of the ith subject in a study. 0). (zn. Example 1. It will be convenient to assume(1) from now on that in any parametric model we consider either: (I) All of the P. . However. and there exists a set {XI. When dependence on 8 has to be observed. and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients. We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more infonnation. weight. For instance. age. and Statistics 9 been selected prior to the current experiment. Models. This is obviously overkill but suppose that. Thus. patients may be considered one at a time. drugs A and B are given at several . 1. The distribution of the response Vi for the ith subject or case in the study is postulated to depend on certain characteristics Zi of the ith subject. sequentially. Lehmann. these and other subscripts and arguments will be omitted where no confusion can arise. Xl). assign the drugs alternatively to every other patient in the beginning and then.4.Section 1.. They are not covered under our general model and will not be treated in this book.1.
1.2 with assumptions (1)(4). I3d) T of unknowns. . We usually ueed to postulate more.10 Statistical Models.2) with .. 0 . Example 1. Treatment Dose Level) for patient i. ...Z~)T and Jisthen x n identity. .. .. semiparametric as with (I) and (2) with F arbitrary. and Performance Criteria Chapter 1 dose levels. In fact by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems..treat the Zi as a label of the completely unknown distributions of Yi. The most common choice of 9 is the linear fOlltl.1. So is Example 1. (3) g((3. in Example 1. A eommou (but often violated assumption) is (I) The ti are identically distributed with distribution F. Often the following final assumption is made: (4) The distributiou F of (I) is N(O. Clearly.8.L(z) denote the expected value of a response with given covariate vector z. then we can write. . See Problem 1. Then we have the classical Gaussian linear model.. Here J. If we let f(Yi I zd denote the density of Yi for a subject with covariate vector Zi.L(z) is an unknown function from R d to R that we are interested in. In the two sample models this is implied by the constant treatment effect assumption. . Goals.1.~II3Jzj = zT (3 so that (b) becomes (b') This is the linear model. On the basis of subject matter knowledge and/or convenience it is usually postulated that (2) 1'(z) = 9((3. (b) where Ei = Yi .(z) only.2 uuknown. Then.. .1. then the model is zT (a) P(YI. n. . and nonparametric if we drop (1) and simply .Yn) ~ II f(Yi I Zi). (c) whereZ nxa = (zf. d = 2 and can denote the pair (Treatment Label.. By varying the assumptions we obtain parametric models as with (I). z) where 9 is known except for a vector (3 = ((31. That is.E(Yi ). which we can write in vector matrix form. i=l n If we let J. (3) and (4) above. For instance.. the effect of z on Y is through f. i = 1. [n general. I'(B) = I' + fl. Zi is a nonrandom vector of values called a covariate vector or a vector of explanatory variables whereas Yi is random and referred to as the response variable or dependent variable in the sense that its distribution depends on Zi.3(3) is a special case of this model. .1..3 with the Gaussian twosample model I'(A) ~ 1'.z) = L.
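The linear model in vector-matrix form, Y = Z beta + epsilon with Gaussian errors, can be simulated in a few lines. The design matrix and coefficients below are arbitrary illustrations, and the least squares fit at the end (a topic taken up in Section 2.2.1) is included only as a sanity check on the simulation, not as part of the model itself.

```python
import numpy as np

# Simulate the Gaussian linear model in vector-matrix form: Y = Z beta + eps,
# with eps ~ N(0, sigma^2 I).  Design and coefficients are illustrative.
rng = np.random.default_rng(5)
n, sigma = 100, 0.7
beta = np.array([2.0, -1.0, 0.5])                       # hypothetical coefficients

Z = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(0, 1, size=n)])
eps = rng.normal(0.0, sigma, size=n)
Y = Z @ beta + eps

# Least squares fit (see Section 2.2.1), only as a check that the simulation behaves.
beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print("beta_hat:", np.round(beta_hat, 3))
```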
.(1 .(3)I'). . Measurement Model with Autoregressive Errors. and the associated probability theory models and inference for dependent data are beyond the scope of this book. Xl = I' + '1. ... .{3Xj1 j=2 n (1. .5.. p(e n I end f(edf(c2 .. It is plausible that ei depends on Cil because long waves tend to be followed by long waves.(3cd··· f(e n Because €i =  (3e n _1). x n ). . .cn _d p(edp(e2 I edp(e3 I e2) . X n is P(X1. . and Statistics 11 Finally.. In fact we can write f.. Consider the model where Xi and assume = J.. . .p(e" I e1. Xi .t+ei. . eo = a Here the errors el. .. Models.. save for a brief discussion in Volume 2.. . i = 1.1 Data.t = E(Xi ) be the average time for an infinite series of records. the conceptual issues of stationarity. Xi ~ 1.. we give an example in which the responses are dependent.. we have p(edp(C21 e1)p(e31 e"e2)".t.Il. en' Using conditional probability theory and ei = {3ei1 + €i.n. . Parameters.Section 1. Let j. . . the model for X I. A second example is consecutive measurements Xi of a constant It made by the same observer who seeks to compensate for apparent errors. is that N(O..(3) + (3X'l + 'i. Of course.. An example would be... . we start by finding the density of CI... 0 . X n spent above a fixed high level for a series of n consecutive wave records at a point on the seashore. 0'2) density. ergodicity. n. . at best an approximation for the wave example. . Example 1. To find the density p(Xt. .n ei = {3ei1 + €i. model (a) assumes much more but it may be a reasonable first approximation in these situations. say. i = l.. The default assumption.xn ) ~ f(x1 I') II f(xj .. C n where €i are independent identically distributed with density are dependent as are the X's. the elapsed times X I. i ~ 2. However.1. Then we have what is called the AR(1) Gaussian model J is the We include this example to illustrate that we need not be limited by independence. Let XI... . ' . . . X n be the n determinations of a physical constant J.
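The measurement model with AR(1) errors of Example 1.1.5 is easy to simulate, and its joint density factors as p(x_1,...,x_n) = f(x_1 - mu) * prod_{j>=2} f(x_j - mu(1 - beta) - beta x_{j-1}). The sketch below, with arbitrary mu, beta, and Gaussian f as in the default assumption, generates such a series and evaluates the log of this density; it is an illustration, not code from the text.

```python
import numpy as np
from scipy.stats import norm

# Simulate the measurement model with AR(1) errors of Example 1.1.5:
#   X_i = mu + e_i,  e_i = beta * e_{i-1} + eps_i,  e_0 = 0,  eps_i iid N(0, sigma^2),
# and evaluate the joint log density
#   log p(x_1,...,x_n) = log f(x_1 - mu)
#                        + sum_{j>=2} log f(x_j - mu*(1 - beta) - beta*x_{j-1}).
rng = np.random.default_rng(6)
mu, beta, sigma, n = 10.0, 0.6, 1.0, 200        # illustrative values

e = np.zeros(n)
for i in range(n):
    prev = e[i - 1] if i > 0 else 0.0
    e[i] = beta * prev + rng.normal(0.0, sigma)
x = mu + e

log_p = norm.logpdf(x[0] - mu, scale=sigma)
log_p += norm.logpdf(x[1:] - mu * (1 - beta) - beta * x[:-1], scale=sigma).sum()
print("joint log density at the true parameters:", round(float(log_p), 2))
```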
Summary. In this section we introduced the first basic notions and formalism of mathematical statistics: vector observations X with unknown probability distributions P ranging over models P. The general definition of parameters and statistics is given and the connection between parameters and parametrizations elucidated. This is done in the context of a number of classical examples, the most important of which is the workhorse of statistics, the regression model. The notions of parametrization and identifiability are introduced. We view statistical models as useful tools for learning from the outcomes of experiments and studies. They are useful in understanding how the outcomes can be used to draw inferences that go beyond the particular experiment. Models are approximations to the mechanisms generating the observations. How useful a particular model is, is a complex mix of how good the approximation is and how much insight it gives into drawing inferences.

1.2 BAYESIAN MODELS

Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data. There are situations in which most statisticians would agree that more can be said. For instance, in the inspection Example 1.1.1, it is possible that, in the past, we have had many shipments of size N that have subsequently been distributed. If the customers have provided accurate records of the number of defective items that they have found, we can construct a frequency distribution {π_0, ..., π_N} for the proportion θ of defectives in past shipments. That is, π_i is the frequency of shipments with i defective items, i = 0, ..., N. Now it is reasonable to suppose that the value of θ in the present shipment is the realization of a random variable θ with distribution given by

P[θ = i/N] = π_i,  i = 0, 1, ..., N.   (1.2.1)

Our model is then specified by the joint distribution of the observed number X of defectives in the sample and the random variable θ. We know that, given θ = i/N, X has the hypergeometric distribution H(i, N, n). Thus,

P[X = k, θ = i/N] = π_i \binom{i}{k} \binom{N−i}{n−k} / \binom{N}{n}.   (1.2.2)

This is an example of a Bayesian model.
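The specification (1.2.1)-(1.2.2) is straightforward to compute with. The sketch below is only an illustration: the binomial prior it uses (each item defective with probability 0.1) anticipates the numerical example later in this section and is an assumption made for the sketch, not part of the general model.

```python
import numpy as np
from scipy.stats import binom, hypergeom

N, n = 100, 19                 # shipment size and sample size (illustrative values)
i = np.arange(N + 1)
prior = binom.pmf(i, N, 0.1)   # pi_i = P[theta = i/N]; one possible prior

def posterior(k):
    # joint: P[X = k, theta = i/N] = pi_i * hypergeometric pmf of k given i defectives
    joint = prior * hypergeom.pmf(k, N, i, n)
    return joint / joint.sum()

post = posterior(k=10)
print("P[100*theta >= 20 | X = 10] =", post[i >= 20].sum())
```

The exact posterior probability printed here can be compared with the normal approximation used later in this section.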
1)'(0. let us turn again to Example 1.Section 1.2. In the "mixed" cases such as (J continuous X discrete.9)'00'. = ( I~O ) (0.2 Bayesian Models 13 Thus. Suppose that we have a regular parametric model {Pe : (J E 8}. 1.. (8. 100. Before the experiment is performed. In this section we shall define and discuss the basic clements of Bayesian models. we can obtain important and useful results and insights. the information about (J is described by the posterior distribution. select X according to Pe. . the information or belief about the true value of the parameter is described by the prior distribution. For a concrete illustration. However.3) Because we now think of p(x.2. .. which is called the posterior distribution of 8. Eqnation (1.1. 1. De Groot (1969). The function 7r represents our belief or information about the parameter (J before the experiment and is called the prior density or frequency function. (1962).3). and Berger (1985).4) for i = 0. by giving (J a distribution purely as a theoretical tool to which no subjective significance is attached. The joint distribution of (8.2. X) is that of the outcome of a random experiment in which we first select (J = (J according to 7r and then. To get a Bayesian model we introduce a random vector (J. There is an even greater range of viewpoints in the statistical community from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of (J has an objective interpretation in tenus of frequencies. with density Or frequency function 7r. We shall return to the Bayesian framework repeatedly in our discussion. whose range is contained in 8. (1. (1) Our own point of view is that SUbjective elements including the views of subject matter experts arc an essential element in all model building. the resulting statistical inference becomes subjective.3). (J) as a conditional density or frequency function given 8 = we will denote it by p(x I 0) for the remainder of this section. x) = ?T(O)p(x. If both X and (J are continuous or both are discrete. Before sampling any items the chance that a given shipment contains . This would lead to the prior distribution e. ?T. (1.1 of being defective independently of the other members of the shipment. given (J = (J. the joint distribution is neither continuous nor discrete. Lindley (1965). insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). suppose that N = 100 and that from past experience we believe that each item has probability .O). For instance. Raiffa and Schlaiffer (1961).l.2. I(O. Savage (1954).1. We now think of Pe as the conditional distribution of X given (J = (J. After the value x has been obtained for X. However.2) is an example of (1. then by (B. The theory of this school is expounded by L. The most important feature of a Bayesian model is the conditional distribution of 8 given X = x. An interesting discussion of a variety of points of view on these questions may be found in Savage et al. X) is appropriately continuous or discrete with density Or frequency function.
20 or more bad items is, by the normal approximation with continuity correction,

P[100θ ≥ 20] = P[(100θ − 10)/√(100(0.1)(0.9)) ≥ (19.5 − 10)/3] ≈ 1 − Φ(3.17) = 0.001.   (1.2.5)

Now suppose that a sample of 19 has been drawn in which 10 defective items are found. This leads to

P[100θ ≥ 20 | X = 10] ≈ 0.30.   (1.2.6)

To calculate the posterior probability given in (1.2.6) we argue loosely as follows: If before the drawing each item was defective with probability .1 and good with probability .9 independently of the other items, this will continue to be the case for the items left in the lot after the 19 sample items have been drawn. Thus, 100θ − X, the number of defectives left after the drawing, is independent of X and has a B(81, 0.1) distribution. Therefore,

P[100θ ≥ 20 | X = 10] = P[100θ − X ≥ 10 | X = 10]
= P[(100θ − X − 8.1)/√(81(0.1)(0.9)) ≥ (9.5 − 8.1)/2.7]
≈ 1 − Φ(0.52) = 0.30.   (1.2.7)

In general, to calculate the posterior, some variant of Bayes' rule can be used. Specifically:

(i) The posterior distribution is discrete or continuous according as the prior distribution is discrete or continuous.

(ii) If we denote the corresponding (posterior) frequency function or density by π(θ | x), then

π(θ | x) = π(θ)p(x | θ) / Σ_t π(t)p(x | t)   if θ is discrete,
π(θ | x) = π(θ)p(x | θ) / ∫ π(t)p(x | t) dt   if θ is continuous.   (1.2.8)

In the cases where θ and X are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of (θ, X) given by (1.2.3). Here is an example.

Example 1.2.1. Bernoulli Trials. Suppose that X_1, ..., X_n are indicators of n Bernoulli trials with probability of success θ, where 0 < θ < 1. If we assume that θ has a priori distribution with density π, we obtain by (1.2.8) the posterior density

π(θ | x_1, ..., x_n) = π(θ)θ^k(1 − θ)^{n−k} / ∫_0^1 π(t)t^k(1 − t)^{n−k} dt   (1.2.9)
for 0 < θ < 1, where x_i = 0 or 1, i = 1, ..., n, and k = Σ_{i=1}^n x_i. Note that the posterior density depends on the data only through the total number of successes, Σ_{i=1}^n x_i. We also obtain the same posterior density if θ has prior density π and we only observe Σ_{i=1}^n X_i, which has a B(n, θ) distribution given θ = θ (see the problems). We can thus write π(θ | k) for π(θ | x_1, ..., x_n).

To choose a prior π, we need a class of distributions that concentrate on the interval (0, 1). One such class is the two-parameter beta family. This class of distributions has the remarkable property that the resulting posterior distributions are again beta distributions. As Figure B.2.2 indicates, the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions, though by no means all. For instance, non-U-shaped bimodal distributions are not permitted.

Suppose, for instance, we are interested in the proportion θ of "geniuses" (IQ > 160) in a particular city. To get information we take a sample of n individuals from the city. If n is small compared to the size of the city, we may assume that the number X of geniuses observed has approximately a B(n, θ) distribution. Now we may either have some information about the proportion of geniuses in similar cities of the country or we may merely have prejudices that we are willing to express in the form of a prior distribution on θ. We may want to assume that θ has a density with maximum value at 0 such as that drawn with a dotted line in Figure B.2.2. Or else we may think that π(θ) concentrates its mass near a small number, say 0.05. The result might be a density such as the one marked with a solid line in Figure B.2.2. Then we can choose r and s in the β(r, s) density so that the mean is r/(r + s) = 0.05 and its variance is very small. If we were interested in some proportion about which we have no information or belief, we might take θ to be uniformly distributed on (0, 1), which corresponds to using the beta distribution with r = s = 1.

For instance, upon substituting the β(r, s) density (B.2.11) in (1.2.9) we obtain

π(θ | x_1, ..., x_n) = c θ^{k+r−1}(1 − θ)^{n−k+s−1},   (1.2.10)

where k = Σ_{i=1}^n x_i. The proportionality constant c^{−1}, which depends on k, r, and s only, must (see (B.2.11)) be B(k + r, n − k + s), where B(·, ·) is the beta function, and the posterior distribution of θ given Σ X_i = k is β(k + r, n − k + s). □

A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family. Such families are called conjugate. Evidently the beta family is conjugate to the binomial. Another bigger conjugate family is that of finite mixtures of beta distributions; see Problem 1.2.16. We return to conjugate families in Section 1.6.

Summary. We present an elementary discussion of Bayesian models, introduce the notions of prior and posterior distributions, and give Bayes rule. We also by example introduce the notion of a conjugate family of distributions.
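As a quick check of the beta-binomial conjugacy just summarized, here is a two-line sketch; the prior parameters are chosen only to match the prior mean 0.05 of the genius illustration, and the data are made up.

```python
from scipy.stats import beta

r, s = 2.0, 38.0        # beta(r, s) prior; mean r/(r + s) = 0.05 as in the illustration
n, k = 100, 9           # n trials, k successes (made-up data)

posterior = beta(k + r, n - k + s)   # beta(k + r, n - k + s), as in (1.2.10)
print("posterior mean:", posterior.mean())
print("P(theta > 0.10 | data):", posterior.sf(0.10))
```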
the fraction defective B in Example 1. For instance. Po or Pg is special and the general testing problem is really one of discriminating between Po and Po. Thus.1. as it's usually called. say a 50yearold male patient's response to the level of a drug.1. shipments.1." (} > Bo.l < Jio or those corresponding to J.. and as we shall see fonnally later. placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to pennit the marketing of drugs that do no good.4.l > J. sex. Intuitively. the information we want to draw from data can be put in various forms depending on the purposes of our analysis. For instance. z) we can estimate (3 from our observations Y. Given a statistical model. In other situations certain P are "special" and we may primarily wish to know whether the data support "specialness" or not. one of which will be announced as more consistent with the data than others. Thus.lo as special. i we can try to estimate the function 1'0. say. Unfortunately {t(z) is unknown. If J. i.1 contractual agreement between shipper and receiver may penalize the return of "good" shipments.L in Example 1. with (J < (Jo. and Performance Criteria Chapter 1 1. We may have other goals as illustrated by the next two examples. As the second example suggests. say. For instance.1. Ranking." In testing problems we.2.1 do we use the observed fraction of defectives . whereas the receiver does not wish to keep "bad.lo means the universe is expanding forever and J. the expected value of Y given z.3. However. in Example 1. Zi) and then plug our estimate of (3 into g. state which is supported by the data: "specialness" or.1. 0 In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function.1 or the physical constant J. as in Example 1. if we believe I'(z) = g((3.3. drug dose) T that can be used for prediction of a variable of interest Y.3 THE DECISION THEORETIC FRAMEWORK I .e. Prediction. if we have observations (Zil Yi).lo is the critical matter density in the universe so that J. A consumer organization preparing (say) a report on air conditioners tests samples of several brands. P's that correspond to no treatment effect (i.16 Statistical Models.1. We may wish to produce "best guesses" of the values of important parameters. in Example 1. Goals.1.l > JkO correspond to an eternal alternation of Big Bangs and expansions. Note that we really want to estimate the function Ji(')~ our results will guide the selection of doses of drug for future patients. at a first cut. such as. for instance. there are k! possible rankings or actions. A very important class of situations arises when. there are many problems of this type in which it's unclear which oftwo disjoint sets of P's. These are estimation problems. 1 < i < n. of g((3.2. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not pennitted). There are many possible choices of estimates.3. 0 Example 1. Example 1. the receiver wants to discriminate and may be able to attach monetary Costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment. if there are k different brands.l < J. we have a vector z. (age. "hypothesis" or "nonspecialness" (or alternative). In Example 1. Making detenninations of "specialness" corresponds to testing significance. 
then depending on one's philosophy one could take either P's corresponding to J. a reasonable prediction rule for an unseen Y (response of a new patient) is the function {t(z).
1. Thus. for instance. (2) point to what the different possible actions are. in testing whether we are right or wrong. That is.1. (2. Thus. accuracy.1. 1. If we are estimating a real parameter such as the fraction () of defectives. . In any case.1. . I} with 1 corresponding to rejection of H. Ranking. or p.1. 2. Thus. in Example 1. even if. in ranking what mistakes we've made. most significantly. I)}. The answer will depend on the model and.1. We usnally take P to be parametrized. . X = ~ L~l Xi. is large. in Example 1. or combine them in some way? In Example 1. : 0 E 8). defined as any value such that half the Xi are at least as large and half no bigger? The same type of question arises in all examples.1. 3). there are 3! = 6 possible rankings.3. A new component is an action space A of actions or decisions or claims that we can contemplate making.. Action space. On the other hand. For instance. in Example 1.3. A = {O. (3. if we have three air conditioners. it is natural to take A = R though smaller spaces may serve equally well.. or the median. . (3. Here quite naturally A = {Permntations (i I . and so on.6. k}}. in estimation we care how far off we are. 3. we begin with a statistical model with an observation vector X whose distribution P ranges over a set P. whatever our choice of procedure we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we're doing. Testing.3 even with the simplest Gaussian model it is intuitively clear and will be made precise later that. a large a 2 will force a large m l n to give us a good chance of correctly deciding that the treatment effect is there. (1. (2. 3). (4) provide guidance in the choice of procedures for analyzing outcomes of experiments.2 lO estimate J1 do we use the mean of the measurements. 1. ik) of {I. (3) provide assessments of risk. once a study is carried out we would probably want not only to estimate ~ but also know how reliable our estimate is. # 0. In designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter.1. and reliability of statistical procedures. in Example 1.. ~ 1 • • • l I} in Example 1. Intuitively. 2.1.1. A = {( 1. P = {P. Here are action spaces for our examples. 2).Section 1..2. 1). By convention. These examples motivate the decision theoretic framework: We need to (I) clarify the objectives of a study. 2).3 The Decision Theoretic Framework 17 X/n as our estimate or ignore the data and use hislOrical infonnation on past shipments. A = {a. we need a priori estimates of how well even the best procedure can do. .1 Components of the Decision Theory Framework As in Section 1. taking action 1 would mean deciding that D. 3. we would want a posteriori estimates of perfomlance. Here only two actions are contemplated: accepting or rejecting the "specialness" of P (or in more usual language the hypothesis H : P E Po in which we identify Po with the set of "special" P's). Estimation. on what criteria of performance we use.
This loss expresses the notion that all errors within the limits ±d are tolerable and outside these limits equally intolerable. In estimating a real valued parameter v(P) or q(6') if P is parametrized the most commonly used loss function is. in the prediction example 1. l(P. As we shall see. Other choices that are. a) 1(0. . which penalizes only overestimation and by the same amount arises naturally with lower confidence bounds as discussed in Example 1.. M) would be our prediction of response or no response for a male given treatment B. a) = [v( P) .. Closely related to the latter is what we shall call confidence interval loss. a) = min {(v( P) .3. Although estimation loss functions are typically symmetric in v and a. Goals. .. Evidently Y could itself range over an arbitrary space Y and then R would be replaced by Y in the definition of a(·). Quadratic Loss: I(P. = max{laj  vjl.)2 = Vj [ squared Euclidean distance/d absolute distance/d 1. 1'() is the parameter of interest. and truncated quadraticloss: I(P. they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient. a) = 0.3." that is.. and Performance Criteria Chapter 1 Prediction.18 Statistical Models. . sometimes can genuinely be quantified in economic terms. Q is the empirical distribution of the Zj in .2. if Y = 0 or 1 corresponds to. and z = (Treatment. the probability distribution producing the data. Loss function.a)2.ai. Far more important than the choice of action space is the choice of loss function defined as a function I : P X A ~ R+. say. a) if P is parametrized. a) I d . [v(P) . The interpretation of I(P. a) = 1 otherwise. the expected squared error if a is used. as the name suggests.: la. or 1(0. less computationally convenient but perhaps more realistically penalize large errors less are Absolute Value Loss: l(P.1).3. 1 d} = supremum distance. l( P.V. For instance. [f Y is real. d'}. r I I(P. If. ' . r " I' 2:)a. is the nonnegative loss incurred by the statistician if he or she takes action a and the true "state of Nature. . Estimation. a) = (v(P) .. is P. and z E Z. For instance.. a) = J (I'(z) . ~ 2. a is a function from Z to R} with a(z) representing the prediction we would make if the new unobserved Y had covariate value z.Vd) = (q1(0). examples of loss functions are 1(0.al < d.i = We can also consider function valued parameters." respectively. For instance. a) 1(0. .a) ~ (q(O) . then a(B. a).qd(lJ)) and a ~ (a""" ad) are vectors. as we shall see (Section 5.a(z))2dQ(z). "does not respond" and "responds. a) = l(v < a). A = {a . asymmetric loss functions can also be of importance. Here A is much larger. say. If we use a(·) as a predictor and the new z has marginal distribution Q then it is natural to consider.a)2 (or I(O. Sex)T.a)2). although loss functions. I(P. If v ~ (V1.
I loss: /(8. For instance. We define a decision rule or procedure to be any function from the sample space taking its values in A. is a partition of (or equivalently if P E Po or P E P. L(I'(Zj) _0(Z.Section 1. We next give a representation of the process Whereby the statistician uses the data to arrive at a decision. (1.1 suppose returning a shipment with °< 1(8. in the measurement model. 0. a) ~ 0 if 8 E e a (The decision is correct) l(0.3. I) /(8. then a reasonable rule is to decide Ll = 0 if our estimate x . Using 0 means that if X = x is observed.. .). . this leads to the commonly considered I(P.2) I if Ix ~ 111 (J >c .3 we will show how to obtain an estimate (j of a from the data.3 with X and Y distributed as N(I' + ~. a(zn) jT and the vector parameter (I'( z... where {So. Then the appropriate loss function is 1.1'(zn))T Testing.1.y is close to zero.).2). In Example 1. In Section 4. a) = 1 otherwise (The decision is wrong).9.1 loss function can be written as ed. is a or not..1.. Otherwise. the statistician takes action o(x).2) and N(I'. . This 0 . other economic loss functions may be appropriate.3 The Decision Theoretic Framework 19 the training set (Z 1. The decision rule can now be written o(x. ]=1 which is just n. .. if we are asking whether the treatment effect p'!1'ameter 6. that is. relative to the standard deviation a.1) Decision procedures. . (Zn.a) = I n n.. and to decide Ll '# a if our estimate is not close to zero. respectively.3. The data is a point X = x in the outcome or sample space X...). we implicitly discussed two estimates or decision rules: 61 (x) = sample mean x and 02(X) = X = sample median.0) sif8<80 Oif8 > 80 rN8. a Estimation. e ea. For the problem of estimating the constant IJ.. Testing.))2. I) 1(8. If we take action a when the parameter is in we have made the correct decision and the loss is zero. Y). y) = o if Ix::: 111 (J <c (1. . We ask whether the parameter () is in the subset 6 0 or subset 8 1 of e.1 times the squared Euclidean distance between the prediction vector (a(z. Y n).. in Example 00 defectives results in a penalty of s dollars whereas every defective item sold results in an r dollar replacement cost. the decision is wrong and the loss is taken to equal one. Of course. Here we mean close to zero relative to the variability in the experiment.
where c is a positive constant called the critical value. How do we choose c? We need the next concept of the decision theoretic framework, the risk or risk function.

The risk function. If δ is the procedure used, θ is the true value of the parameter, and X = x is the outcome of the experiment, then the loss is l(P, δ(x)). We do not know the value of the loss because P is unknown. Moreover, we typically want procedures to have good properties not at just one particular x, but for a range of plausible x's. Thus, we turn to the average or mean loss over the sample space. That is, we regard l(P, δ(X)) as a random variable and introduce the risk function

R(P, δ) = E_P[l(P, δ(X))]

as the measure of the performance of the decision rule δ(X). For each δ, R(·, δ) maps P or Θ to R⁺. R(·, δ) is our a priori measure of the performance of δ. We illustrate computation of R and its a priori use in some examples.

Estimation. Suppose ν ≡ ν(P) is the real parameter we wish to estimate and ν̂(X) is our estimator (our decision rule). If we use quadratic loss, our risk function is called the mean squared error (MSE) of ν̂ and is given by

MSE(ν̂) = R(P, ν̂) = E_P(ν̂(X) − ν(P))²,   (1.3.3)

where for simplicity dependence on P is suppressed in MSE. The MSE depends on the variance of ν̂ and on what is called the bias of ν̂,

Bias(ν̂) = E(ν̂) − ν,

which can be thought of as the "long-run average error" of ν̂. A useful result is

Proposition 1.3.1. MSE(ν̂) = (Bias ν̂)² + Var(ν̂).

Proof. Write the error as ν̂ − ν = [ν̂ − E(ν̂)] + [E(ν̂) − ν]. If we expand the square of the right-hand side keeping the brackets intact and take the expected value, the cross term will be zero because E[ν̂ − E(ν̂)] = 0. The other two terms are (Bias ν̂)² and Var(ν̂). (If one side is infinite, so is the other and the result is trivially true.) □

We next illustrate the computation and the a priori and a posteriori use of the risk function.

Example 1.3.1. (Continued). Estimation of μ. Suppose X_1, ..., X_n are i.i.d. measurements of μ with N(0, σ²) errors, and assume quadratic loss. If we use the mean X̄ as our estimate of μ, then

Bias(X̄) = 0,  Var(X̄) = (1/n²) Σ_{i=1}^n Var(X_i)
= σ²/n, and, by Proposition 1.3.1,

MSE(X̄) = R(μ, X̄) = σ²/n,   (1.3.4)

which doesn't depend on μ. Suppose that the precision of the measuring instrument σ² is known and equal to σ₀², or, more realistically, is known only to be at most σ₀². Then (1.3.4) can be used for an a priori estimate of the risk of X̄: if we want to be guaranteed MSE(X̄) ≤ ε², we can do it by taking at least n₀ = σ₀²/ε² measurements. If we have no idea of the value of σ², such planning is not possible, but having taken n measurements we can then estimate σ², for instance by σ̂² = n⁻¹ Σ_{i=1}^n (X_i − X̄)² or nσ̂²/(n − 1). The a posteriori estimate of risk σ̂²/n is, of course, itself subject to random error.

Suppose that instead of quadratic loss we used the more natural(1) absolute value loss. Then, if the ε_i are N(0, σ²),

R(μ, X̄) = E|X̄ − μ| = (σ/√n) ∫_{−∞}^{∞} |t| φ(t) dt = √(2/π) (σ/√n).   (1.3.5)

This harder calculation already suggests why quadratic loss is favored. In fact, computational difficulties arise even with quadratic loss as soon as we think of estimates other than X̄. For instance, if X̃ = median(X_1, ..., X_n) (and we, in general, write ã for a median of {a_1, ..., a_n}), then for quadratic loss R(μ, X̃) = E(X̃ − μ)² can only be evaluated by numerical and/or Monte Carlo computation, or approximated asymptotically. If we only assume, as we did in Example 1.1.2, that the ε_i are i.i.d. with mean 0 and variance σ²(P), then for quadratic loss R(P, X̄) = σ²(P)/n still, but (1.3.5) for absolute value loss is only approximate.

Next suppose we are interested in the mean μ of the same measurement for a certain area, say area A, of the United States. Let μ₀ denote the mean of a certain measurement included in the U.S. census, for instance, age or income. If we have no data for area A, a natural guess for μ would be μ₀, whereas if we have a random sample of measurements X_1, X_2, ..., X_n from area A, a natural estimate is X̄ = n⁻¹ Σ_{i=1}^n X_i. However, we may want to combine μ₀ and X̄ into an estimator, for instance,

μ̂ = 0.2μ₀ + 0.8X̄,

an estimate we can justify later. The choice of the weights 0.2 and 0.8 can only be made on the basis of additional knowledge about demography or the economy. We shall derive estimates of this type in Section 1.6 through a formal Bayesian analysis using a normal prior to illustrate a way of bringing in additional knowledge. We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1.3.1 is useful for evaluating the performance of competing estimators.
Here we compare the performances of μ̂ and X̄ as estimators of μ using MSE. We easily find

Bias(μ̂) = 0.2μ₀ + 0.8μ − μ = 0.2(μ₀ − μ),
Var(μ̂) = (0.8)² Var(X̄) = 0.64σ²/n,

and, thus,

MSE(μ̂) = 0.04(μ₀ − μ)² + 0.64σ²/n.

Figure 1.3.1 gives the graphs of MSE(μ̂) and MSE(X̄) as functions of μ.

[Figure 1.3.1. The mean squared errors of X̄ and μ̂, plotted as functions of μ.]

If μ is close to μ₀, the risk R(μ, μ̂) of μ̂ is smaller than the risk R(μ, X̄) = σ²/n of X̄, with the minimum relative risk inf{MSE(μ̂)/MSE(X̄); μ ∈ R} being 0.64 when μ = μ₀. The two MSE curves cross at μ = μ₀ ± 3σ/√n. Thus, neither estimator can be proclaimed as being better than the other. However, if we use as our criterion the maximum (over μ) of the MSE (the minimax criterion), then X̄ is optimal (see Section 3.3).

Testing. The test rule (1.3.2) for deciding between Δ = 0 and Δ ≠ 0 takes only the two values 0 and 1, and in the case of the 0-1 loss, which equals 0 if the decision is correct and 1 otherwise, the risk is

R(Δ, δ) = 0 · P[δ(X, Y) = 0] + 1 · P[δ(X, Y) = 1] = P[δ(X, Y) = 1]  if Δ = 0,
R(Δ, δ) = P[δ(X, Y) = 0]  if Δ ≠ 0.

In the general case X and Θ denote the outcome and parameter space, and we are to decide whether θ ∈ Θ₀ or θ ∈ Θ₁, where Θ = Θ₀ ∪ Θ₁, Θ₀ ∩ Θ₁ = ∅.
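Both entries of the risk just displayed can be computed in closed form; the sketch below assumes the two-sample Gaussian model with known σ and the rule of (1.3.2) that rejects when |X̄ − Ȳ|/σ ≥ c. The sample sizes, c, σ, and the alternative value of Δ are arbitrary illustrative choices.

```python
from math import sqrt
from scipy.stats import norm

def error_probs(m, n, sigma, c, delta):
    # Under the two-sample Gaussian model, (Xbar - Ybar)/sigma ~ N(-delta/sigma, 1/m + 1/n).
    tau = sqrt(1.0 / m + 1.0 / n)
    type_I = 2 * norm.sf(c / tau)                     # P(reject | delta = 0)
    mean_w = -delta / sigma
    type_II = norm.cdf((c - mean_w) / tau) - norm.cdf((-c - mean_w) / tau)  # P(accept | delta != 0)
    return type_I, type_II

print(error_probs(m=25, n=25, sigma=1.0, c=0.55, delta=1.0))
```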
1 lead to (Prohlem 1. confidence bounds and intervals (and more generally regions). on the probability of Type I error.3." in Example 1.05.Section 1.3. For instance. an accounting finn examining accounts receivable for a finn on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed.01 or less. Such a v is called a (1 . and then trying to minimize the probability of a Type II error. For instance.8) sP.6) E(6(X)) ~ P(6(X) ~ I) if8 E 8 0 (1.1) and tests 15k of the fonn.o) 0. whereas if <5(X) = 0 and we decide () E 80 when in fact 8 E 8 b we call the error a Type II error. 8 > 80 . I(P.IX > k] + rN8P. we call the error committed a Type I error. [n the NeymanPearson framework of statistical hypothesis testing. Finding good test functions corresponds to finding critical regions with smaIl probabilities of error.8) for all possible distrihutions P of X. Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter v. we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding ~ =1= 0 when ~ = 0).6) P(6(X) = 0) if 8 E 8 1 Probability of Type II error. usually .1. say . This is not the only approach to testing.3. in the treatments A and B example.a (1. the focus is on first providing a small bound.[X < k]. "Reject the shipment if and only if X > k. Thus. the loss function (1. R(8. 8 < 80 rN8P. that is. If <5(X) = 1 and we decide 8 E 8 1 when in fact E 80. For instance.05 or .3 The Decision Theoretic Framework 23 function is a decision rule <5(X) that equals 1 On a set C C X called the critical region and equals 0 on the complement of C. a < v(P) .[X < k[.3. the risk of <5(X) is e R(8. a > v(P) I. where 1 denotes the indicator function.7) Confidence Bounds and Intervals Decision theory enables us to think clearly about an important hybrid of testing and estimation. Here a is small. If (say) X represents the amount owed in the sample and v is the unknown total amount owed.18). This corresponds to an a priori bound on the risk of a on v(X) viewed as a decision procedure with action space R and loss function. and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding Do = 0 when Do ¥ 0). (1.a) upper coofidence houod on v. 6(X) = IIX E C].3. it is natural to seek v(X) such that P[v(X) > vi > 1.6) Probability of Type [error R(8.
and a3. I I (Oil) (No oil) Bj B2 0 12 10 I 5 6 \ . an asymmetric estimation type loss function. A has three points.a ·1 for all PEP. Suppose the following loss function is decided on TABLE 1. aJ. What is missing is the fact that.1 nature makes it resemble a testing loss function and. we could drill for oil. are available. or sell partial rights. though upper bounding is the primary goal. it is customary to first fix a in (1. as we shall see in Chapter 4. 1.8) and then see what one can do to control (say) R( P. Ii) = E(Ii(X) v(P))+.a>v(P) .5. The loss function I(B. e Example 1.3.3. where x+ = xl(x > 0).2 . In the context of the foregoing examples. and the risk of all possible decision procedures can be computed and plotted. a 2 certain location either contains oil or does not. thai this fonnulation is inadequate because by taking D 00 we can achieve risk O. the connection is close. Goals. We shall illustrate some of the relationships between these ideas using the following simple example in which has two members. Suppose we have two possible states of nature. Suppose that three possible actions. For instance. we could leave the component in. general criteria for selecting "optimal" procedures. replace it. . or repair it. a patient either has a certain disease or does not. Typically. We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model. It is clear.24 Statistical Models. The 0. We next tum to the final topic of this section. The same issue arises when we are interested in a confidence interval ~(X) l v(X) 1for v defined by the requirement that • P[v(X) < v(P) < Ii(X)] > 1 . a) (Drill) aj (Sell) a2 (Partial rights) a3 ..: Comparison of Decision Procedures In this section we introduce a variety of concepts used in the comparison of decision procedures. sell the location. . administer drugs.a)  a~v(P) . c ~1 . in fact it is important to get close to the truthknowing that at most 00 doJlars are owed is of no use. which we represent by Bl and B . we could operate.3. rather than this Lagrangian form. or wait and see. a2.1. a component in a piece of equipment either works or does not work. We shall go into this further in Chapter 4. though. The decision theoretic framework accommodates by adding a component reflecting this. and so on.3. for some constant c > O. and Performance Criteria Chapter 1 1 .a<v(P). For instance = I(P.
Next, an experiment is conducted to obtain information about θ, resulting in the random variable X with possible values coded as 0, 1 and frequency function p(x, θ) given by the following table.

TABLE 1.3.2. Rock formation x

              x = 0   x = 1
(Oil)    θ1    0.3     0.7
(No oil) θ2    0.6     0.4

Here X may represent a certain geological formation; when there is oil, it is known that formation 0 occurs with frequency 0.3 and formation 1 with frequency 0.7, whereas when there is no oil, formations 0 and 1 occur with frequencies 0.6 and 0.4.

We list all possible decision rules in the following table.

TABLE 1.3.3. Possible decision rules δ_i(x)

       i    1   2   3   4   5   6   7   8   9
x = 0      a1  a1  a1  a2  a2  a2  a3  a3  a3
x = 1      a1  a2  a3  a1  a2  a3  a1  a2  a3

Here, δ1 represents "Take action a1 regardless of the value of X," δ2 corresponds to "Take action a1 if X = 0, take action a2 if X = 1," and so on.

The risk of δ at θ is

R(θ, δ) = E[l(θ, δ(X))] = l(θ, a1)P[δ(X) = a1] + l(θ, a2)P[δ(X) = a2] + l(θ, a3)P[δ(X) = a3].

For instance,

R(θ1, δ2) = 0(0.3) + 10(0.7) = 7,
R(θ2, δ2) = 12(0.6) + 1(0.4) = 7.6.

If Θ is finite and has k members, we can represent the whole risk function of a procedure δ by a point in k-dimensional Euclidean space, (R(θ_1, δ), ..., R(θ_k, δ)), and if k = 2 we can plot the set of all such points obtained by varying δ. The risk points (R(θ1, δ_i), R(θ2, δ_i)) are given in Table 1.3.4 and graphed in Figure 1.3.2.

TABLE 1.3.4. Risk points (R(θ1, δ_i), R(θ2, δ_i))

i             1    2     3    4    5    6    7    8    9
R(θ1, δ_i)    0    7    3.5    3   10   6.5  1.5  8.5   5
R(θ2, δ_i)   12   7.6   9.6  5.4    1   3.0  8.4  4.0   6

It remains to pick out the rules that are "good" or "best." Criteria for doing this will be introduced in the next subsection.
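The entries of Table 1.3.4 follow mechanically from the loss function of Table 1.3.1 and the frequency function of Table 1.3.2; here is a short sketch that recomputes the nine risk points.

```python
import numpy as np

loss = np.array([[0.0, 10.0, 5.0],    # l(theta1, a1..a3): oil
                 [12.0, 1.0, 6.0]])   # l(theta2, a1..a3): no oil
p_x = np.array([[0.3, 0.7],           # p(x=0 | theta1), p(x=1 | theta1)
                [0.6, 0.4]])          # p(x=0 | theta2), p(x=1 | theta2)

# The nine nonrandomized rules delta_i: (action if x = 0, action if x = 1), actions coded 0, 1, 2.
rules = [(a0, a1) for a0 in range(3) for a1 in range(3)]

def risk(theta, rule):
    a0, a1 = rule
    return p_x[theta, 0] * loss[theta, a0] + p_x[theta, 1] * loss[theta, a1]

for i, rule in enumerate(rules, start=1):
    print(i, risk(0, rule), risk(1, rule))
```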
Researchers have then sought procedures that improve all others within the class. Goals.2 . R(8. if we ignore the data and use the estimate () = 0.) I • 3 10 . if J and 8' are two rules. Extensions of unbiasedness ideas may be found in Lehmann (1997.4 .) < R( 8" S6) but R( 8" S. Symmetry (or invariance) restrictions are discussed in Ferguson (1967).7 . or level of significance (for tests).Si) Figure 1..S) < R(8.6 .2. For instance. . Usually. for instance. and only if. Consider. We say that a procedure <5 improves a procedure Of if. unbiasedness (for estimates and tests).3. It is easy to see that there is typically no rule e5 that improves all others.3 Bayes and Minimax Criteria The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing.3..5).S.S') for all () with strict inequality for some (). Si).9. .9 . we obtain M SE(O) = (}2. We shall pursue this approach further in Chapter 3. i (1) Narrow classes of procedures have been proposed using criteria such as con siderations of symmetry. R( 8" Si)). Here R(8 S.5 0 0 5 10 R(8 j . and S6 in our example. in estimating () E R when X '"'' N((). a5). S. The risk points (R(8 j . 1.) > R( 8" S6)' The problem of selecting good decision " procedures has been attacked in a variety of ways.   . Section 1.8 5 . and Performance Criteria Chapter 1 R(8 2 . i = 1. (2) A second major approach has been to compare risk functions by global crite . The absurd rule "S'(X) = 0" cannot be improved on at the value 8 ~ 0 because Eo(S'(X)) = 0 ifand only if O(X) = O.26 Statistical Models. neither improves the other.
ria rather than on a pointwise basis. We shall discuss the Bayes and minimax criteria.

Bayes: The Bayesian point of view leads to a natural global criterion. Recall that in the Bayesian model θ is the realization of a random variable or vector θ and that P_θ is the conditional distribution of X given θ = θ. In this framework R(θ, δ) is just E[l(θ, δ(X)) | θ = θ], the expected loss if we use δ and θ = θ. If we adopt the Bayesian point of view, we need not stop at this point, but can proceed to calculate what we expect to lose on the average as θ varies. This quantity, which we shall call the Bayes risk of δ and denote r(δ), is then

r(δ) = E[R(θ, δ)] = E[l(θ, δ(X))].   (1.3.9)

The second preceding identity is a consequence of the double expectation theorem (B.1.20) in Appendix B. If θ is discrete with frequency function π(θ),

r(δ) = Σ_θ R(θ, δ)π(θ),

and

r(δ) = ∫ R(θ, δ)π(θ) dθ

if θ is continuous with density π(θ). In the Bayesian framework δ is preferable to δ' if, and only if, it has smaller Bayes risk. If there is a rule δ* that attains the minimum Bayes risk, that is,

r(δ*) = min_δ r(δ),

then it is called a Bayes rule.

To illustrate, suppose that in the oil drilling example an expert thinks the chance of finding oil is 0.2. Then we treat the parameter as a random variable θ with possible values θ1, θ2 and frequency function π(θ1) = 0.2, π(θ2) = 0.8. The Bayes risk of δ is, therefore,

r(δ) = 0.2R(θ1, δ) + 0.8R(θ2, δ).   (1.3.10)

Table 1.3.5 gives r(δ_1), ..., r(δ_9) specified by (1.3.10), together with the maximum risks of the procedures of Table 1.3.3.

TABLE 1.3.5. Bayes and maximum risks of the procedures of Table 1.3.3

i                      1      2      3      4      5      6      7      8      9
r(δ_i)                9.6   7.48   8.38   4.92   2.8    3.7    7.02   4.9    5.8
max_θ R(θ, δ_i)       12    7.6    9.6    5.4    10     6.5    8.4    8.5    6

From Table 1.3.5 we see that rule δ5 is the unique Bayes rule for our prior. The method of computing Bayes procedures by listing all available δ and their Bayes risks is impracticable in general. We postpone the consideration of posterior analysis, the only reasonable computational method, to Section 3.2.
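Continuing the risk-point sketch above, the Bayes risks of (1.3.10) and the maximum risks can be recomputed and the best rules read off directly; the maximum-risk column anticipates the minimax criterion discussed next.

```python
# Continues the sketch after Table 1.3.4 (reuses rules, risk, np).
prior = (0.2, 0.8)   # pi(theta1) = 0.2 (oil), pi(theta2) = 0.8 (no oil)

bayes, maxrisk = [], []
for rule in rules:
    r1, r2 = risk(0, rule), risk(1, rule)
    bayes.append(prior[0] * r1 + prior[1] * r2)   # r(delta) as in (1.3.10)
    maxrisk.append(max(r1, r2))

print("Bayes rule: delta_%d, r = %.2f" % (int(np.argmin(bayes)) + 1, min(bayes)))
print("Minimax (nonrandomized): delta_%d, max risk = %.2f" % (int(np.argmin(maxrisk)) + 1, min(maxrisk)))
```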
This criterion of optimality is very conservative.3. J). if V is the class of all decision procedures (nonrandomized).75 if 0 = 0. Nevertheless l in many cases the principle can lead to very reasonable procedures. .(2) We briefly indicate "the game of decision theory. For instance. It aims to give maximum protection against the worst that can happen. Of course. R(O. Goals.3.75 is strictly less than that of 04.J). Player II then pays Player I. R(02. we toss a fair coin and use 04 if the coin lands heads and 06 otherwise.5. From the listing ofmax(R(O" J). if and only if. a randomized decision procedure can be thought of as a random experiment whose outcomes are members of V. the sel of all decision procedures. This is. sup R(O. Nature's intentions and degree of foreknowledge are not that clear and most statisticiaqs find the minimax principle too conservative to employ as a general rule. J'). J)) we see that J 4 is minimax with a maximum risk of 5. we prefer 0" to 0"'.5 we might feel that both values ofthe risk were equally important. 2 The maximum risk 4. in Example 1. J) A procedure 0*. 8)]. but only ac.4. The maximum risk of 0* is the upper pure value of the game. which makes the risk as large as possible. The principle would be compelling. e 4.28 Statistical Models. The criterion comes from the general theory of twoperson zero sum games of von Neumann. a weight function for averaging the values of the function R( B. which has . = infsupR(O. For instance." Nature (Player I) picks a independently of the statistician (Player II). . < sup R(O. It is then natural to compare procedures using the simple average ~ [R( fh 10) + R( fh. supR(O. Nature's choosing a 8. To illustrate computation of the minimax rule we tum to Table 1. Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to emplOy. 6). in Example 1.20 if 0 = O .4. But this is just Bayes comparison where 1f places equal probability on f}r and ()2. For simplicity we shall discuss only randomized .J') . and Performance Criteria Chapter 1 if (J is continuous with density rr(8). 4. Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk. Such comparisons make sense even if we do not interpret 1r as a prior density or frequency. Our expected risk would be. .3. if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used. I Randomized decision rules: In general. suppose that. is called minimax (minimizes the maximum risk). who picks a decision procedure point 8 E J from V.
A)R(02. this point is (10. l5q of nOn randomized procedures. given a prior 7r on q e.r2):r. i = 1 . the Bayes risk of l5 (1.14) .Ji )] i=l A randomized Bayes procedure d. then all rules having Bayes risk c correspond to points in S that lie on the line irl + (1 ..5.r).R(O. . A randomized minimax procedure minimizes maxa R(8. (1. R(O .3. All points of S that are on the tangent are Bayes. As in Example 1. including randomiZed ones.12) r(J) = LAiEIR(IJ.13) As c varies.5.3.3. (1.\. 9 (Figure 2 1.'Y) that is tangent to S al the lower boundary of S.Ji)' r2 = tAiR(02.3.3.J): J E V'} where V* is the set of all procedures.. 1 q. Jj . J)) and consider the risk sel 2 S = {(RrO"J).).2.Section 1. We will then indicate how much of what we learn carries over to the general case.3. A point (Tl. (1.3. This is thai line with slope .(02 ).d) anwng all randomized procedures. .. R(O .).J) ~ L. when ~ = 0. If .. T2) on this line can be written AR(O"Ji ) + (1. i = 1. Ji )).3). If the randomized procedure l5 selects l5i with probability Ai. That is.11) Similarly we can define.R(02. . S is the convex hull of the risk points (R(0" Ji ). Ei==l Al = 1. TWo cases arise: (I) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule.1).10). 0 < i < 1.13) defines a family of parallel lines with slope i/(1 .. For instance.i)r2 = c.3.3. which is the risk point of the Bayes rule Js (see Figure 1. we represent the risk of any procedure J by the vector (R( 01 . ~Ai=1}. J). Jj ). By (1.R(OI. we then define q R(O. Finding the Bayes rule corresponds to finding the smallest c for which the line (1.A)R(O"Jj ). We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1.3.13) intersects S.i/ (1 . . Ji ) + (1 .(0 d = i = 1 . minimizes r(d) anwng all randomized procedures.3).3.. S= {(rI.J. i=l (1. AR(02.3 The Decision Theoretic Framework 29 procedures that select among a finite set (h 1 • • • . (2) The tangent is the line connecting two unonrandomized" risk points Ji ..J.. Ai >0. = ~A.
See Figure 1. = 0).>'). The convex hull S of the risk poiots (R(B!.3. Let c' be the srr!~llest c for which Q(c) n S i 0 (i. < 1.30 Statistical Models.16) touches S is the risk point of the minimax rule.3. ranges from 0 to 1./(1 .11) corresponds to the values Oi with probability>.3. oil. (\)). where 0 < >. as >.13). OJ with probability (1 . (1.3.3. the first square that touches S). (1.16) whose diagonal is the line r. .3. and. Goals.)') of the line given by (1. 0 < >. = r2. by (1.15) Each one of these rules.9. and Performance Criteria Chapter 1 r2 I 10 5 Q(c') o o 5 10 Figure 1. Then Q( c*) n S is either a point or a horizontal or vertical line segment. the first point of contact between the squares and S is the .. ..3.3. is Bayes against 7I". Because changing the prior 1r corresponds to changing the slope .• all points on the lowet boundary of S that have as tangents the y axis or lines with nonpositive slopes).3.. It is the set of risk points of minimax rules because any point with smaller maximum risk would belong to Q(c) n S with c < c* contradicting the choice of c*. R(B . 2 The point where the square Q( c') defined by (1. i = 1. We can choose two nonrandomized Bayes rules from this class. the set B of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of S (Le. < 1. In our example. thus. To locate the risk point of the nilhimax rule consider the family of squares. namely Oi (take'\ = 1) and OJ (take>.e.
R( 81 .. we can define the risk set in general as s ~ {( R(8 . if and only if. all admissible procedures are either Bayes procedures or limits of . y) in S such that x < Tl and y < T2.3.4 < 7. To gain some insight into the class of all admissible procedures (randomized and nOnrandomized) we again use the risk set.5 can be shown to hold generally (see Ferguson.14).Section 1. for instance. Y < T2} has only (TI 1 T2) in common with S.3. 0. The following features exhibited by the risk set by Example 1. (e) If a Bayes prior has 1f(Oi) to 1f is admissible. In fact.)).59. they are Bayes procedures. (d) All admissible procedures are Bayes procedures. > a for all i. the minimax rule is given by (1. thus. However. or equivalently.3 The Decision Theoretic Framework 31 intersection between TI = T2 and the line connecting the two points corresponding to <54 and <56 . for instance).\ + 6. Naturally. there is no (x. all rules that are not inadmissible are called admissible. e = {fh. R(8" 0)) : 0 E V'} where V" is the set of all randomized decision procedures. if and only if.4 we can see.3. {(x.3.) and R( 8" 04) = 5.4'\ + 3(1 . 04) = 3 < 7 = R( 81 . the set of all lower left boundary points of S corresponds to the class of admissible rules and. l Ok}.4. From the figure it is clear that such points must be on the lower left boundary.\ the solution of From Table 1. (b) The set B of risk points of Bayes procedures consists of risk points on the lower boundary of S whose tangent hyperplanes have normals pointing into the positive quadrant. .. under some conditions. which yields . There is another important concept that we want to discuss in the context of the risk set.e.\) = 5. (3) Randomized Bayes procedures are mixtures of nonrandomized ones in the Sense of (1. j = 6 and . Using Table 1. ..\). (a) For any prior there is always a nonrandomized Bayes procedure.V) : x < TI. .\ ~ 0.5(1 . A decision rule <5 is said to be inadmissible if there exists another rule <5' such that <5' improves <5. 1967. if there is a randomized one.3. A rule <5 with risk point (TIl T2) is admissible. 1 .0). (c) If e is finite(4) and minimax procedures exist. If e is finite.14) with i = 4. that <5 2 is inadmissible because 64 improves it (i..6 ~ R( 8" J.. Thus. then any Bayes procedure corresponding If e is not finite there are typically admissible procedures that are not Bayes. this equation becomes 3. agrees with the set of risk points of Bayes procedures.
We want to find a function 9 defined on the range of Z such that g(Z) (the predictor) is "close" to Y. The frame we shaH fit them into is the following. Bayes procedures (in various senses). A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio. Section 3. • 1. loss function..2 presented important situations in which a vector z of 00variates can be used to predict an unseen response Y. at least when procedures with the same risk function are identified. An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson. We introduce the decision theoretic foundation of statistics inclUding the notions of action space. The joint distribution of Z and Y can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal.32 .g(Z») = E[g(Z) ~ YI 2 or its square root yE(g(Z) .3. Goals. confidence bounds. A government expert wants to predict the amount of heating oil needed next winter. which is the squared prediction error when g(Z) is used to predict Y.Yj'. he wants to predict the firstyear grade point averages of entering freshmen on the basis of their College Board scores. Similar problems abound in every field. and prediction. Using this information. ranking. ]967.I 'jlI . testing. Summary. For example.y)2. We assume that we know the joint probability distribution of a random vector (or variable) Z and a random variable Y. in the college admissions situation.4). at least in their original fonn. We stress that looking at randomized procedures is essential for these conclusions. The basic biasvariance decomposition of mean square error is presented. and Performance Criteria Chapter 1 '. Next we must specify what close means. These remarkable results. and risk through various examples including estimation. Here are some further examples of the kind of situation that prompts our study in this section. we tum to the mean squared prediction error (MSPE) t>2(Y. The basic global comparison criteria Bayes and minimax are presented as well as a discussion of optimality by restriction and notions of admissibility. Since Y is not known. . Z is the information that we have and Y the quantity to be predicted. we referto Blackwell and Girshick (1954) and Ferguson (1967). A college admissions officer has available the College Board scores at entrance and firstyear grade point averages of freshman classes for a period of several years. In terms of our preceding discussion. The MSPE is the measure traditionally used in the . although it usually turns out that all admissible procedures of interest are indeed nonrandomized.4 PREDICTION . decision rule. which include the admissible rules. A meteorologist wants to estimate the amount of rainfall in the coming spring. I The prediction Example 1. Z would be the College Board score of an entering freshman and Y his or her firstyear grade point average. One reasonable measure of "distance" is (g(Z) . Other theorems are available characterizing larger but more manageable classes of procedures. Statistical Models. For more information on these topics. I II . are due essentially to Waldo They are useful because the property of being Bayes is ea·der to analyze than admissibility.
Just how widely applicable the notions of this section are will become apparent in Remark 1.1 assures us that E[(Y . when EY' < 00.Section 1.2 where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss.L and the lemma follows.4.YI) (Problems 1. E(Y .g(Z))' I Z = z] = E[(Y .1) < 00 if and only if E(Y . Proof. exists.3. Lemma 1.2) I'(z) = E(Y I Z = z).1) follows because E(Y p.4. (1. E(Y .c)2 is either oofor all c or is minimized uniquely by c In fact.c) implies that p.25.l.g(z))' IZ = z] = E[(Y I'(z))' I Z = z] + [g(z) I'(z)]'. or equivalently.4.4. The class Q of possible predictors 9 may be the nonparametric class QN P of all 9 : R d Jo R or it may be to some subset of this class. Grenander and Rosenblatt. g(Z)) such as the mean absolute error E(lg( Z) . The method that we employ to prove our elementary theorems does generalize to other measures of distance than 6.4.1. see Example 1. EY' = J.4. In this situation all predictors are constant and the best one is that number Co that minimizes B(Y . By the substitution theorem for conditional expectations (B.5 and Section 3. we can find the 9 that minimizes E(Y .g(Z))' (1. we can conclude that Theorem 1.c)' < 00 for all c. (1. Because g(z) is a constant. In this section we bjZj .4.I'(Z))' < E(Y . We see that E(Y . given a vector Z.4.c)2 as a function of c.4. that is.6.20).4 Prediction 33 mathematical theory of prediction whose deeper results (see.16).711). for example.4. 1.C)2 has a unique minimum at c = J.4) . and by expanding (1. then either E(Y g(Z))' = 00 for every function g or E(Y .3) If we now take expectations of both sides and employ the double expectation theorem (B.4. See Remark 1.1.(Y. ljZ is any random vector and Y any random variable.) = 0 makes the cross product term vanish.c = (Y 1') + (I' .L = E(Y). 1957) presuppose it.g(Z))2. see Problem 1. 0 Now we can solve the problem of finding the best MSPE predictor of Y. in which Z is a constant.4.c)' ~ Var Y + (c _ 1')'.g(z))' I Z = z].4. Let (1. EY' < 00 Y . L1=1 Lemma 1. we have E[(Y . consider QNP and the class QL of linear predictors of the form a + We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information.
and Performance Criteria Chapter 1 for every 9 with strict inequality holding unless g(Z) best MSPE predictor. then by the iterated expectation theorem.E(V)J Let € = Y . = O.5) is obtained by taking g(z) ~ E(Y) for all z. Property (1. As a consequence of (1. I'(Z) is the unique E(Y . which will prove of importance in estimation theory.7) IJVar Y • < 00 strict inequality ooids unless 1 Y = E(Y I Z) (1. Properties (b) and (c) follow from (a). we can derive the following theorem. Equivalently U and V are uncorrelated if either EV[U E(U)] = 0 or EUIV . Proof.4. (1.4.5) An important special case of (1. Suppose that Var Y < 00.6) which is generally valid because if one side is infinite.6) and that (1.I'(Z))' + E(g(Z) 1'(Z»2 ( 1.4. then Var(E(Y I Z)) < Var Y.4.g(Z)' = E(Y . If E(IYI) < 00 but Z and Y are otherwise arbitrary. then we can write Proposition 1.4. Theorem 1. let h(Z) be any function of Z. That is.E(v)IIU .4. 1. that is. To show (a).E(U)] = O. when E(Y') < 00.4.1(c) is equivalent to (1. Vat Y = E(Var(Y I Z» + Var(E(Y I Z). Infact.6).4. Var(Y I z) = E([Y .4.5) follows from (a) because (a) implies that the cross product term in the expansionof E ([Y 1'(z)J + [I'(z) g(z)IP vanishes.4. Goals. = I'(Z).5) becomes. E{h(Z)<1 E{ E[h(Z)< I Z]} E{h(Z)E[Y I'(Z) I Z]} = 0 because E[Y I'(Z) I Z] = I'(Z) I'(Z) = O. then (a) f is uncorrelated with every function ofZ (b) I'(Z) and < are uncorrelated (c) Var(Y) = Var I'(Z) + Var <. (1.4.2.1.E(Y I z)]' I z).8) or equivalently unless Y is a function oJZ.20). .Il(Z) denote the random prediction error.34 Statistical Models. and recall (B. then (1. Write Var(Y I z) for the variance of the condition distribution of Y given Z = z.4. 0 Note that Proposition 1.6) is linked to a notion that we now define: Two random variables U and V with EIUVI < 00 are said to be uncorrelated if EIV . so is the other.4.
or 3 failures among all days. I. Within any given month the capacity status does not change. The first is direct.4. Each day there can be 0.10.50 z\y 1 0 0. and only if.10 0.4 Prediction 35 Proof. p(z.Section 1.8) is true.15 3 0.E(Y I Z = z))2 p (z.'" 1 Y.45. the best predictor is E (30.lI. The row sums of the entries pz (z) (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state.4. as we reasonably might.25 2 0.10 0. . (1.10 I 0. the average number of failures per day in a given month. i=l 3 E (Y I Z = ~) ~ 2.025 0. E(Var(Y I Z)) ~ E(Y .1 2. An assembly line operates either at full.25 0. also.30 py(y) I 0. 1.4.10 0.05 0.25 0.y) pz(z) 0.05 0.9) this can hold if.4.45 ~ 1 The MSPE of the best predictor can be calculated in two ways.1.25 1 0. The following table gives the frequency function p( z. y) = 0. 2. half.885. In this case if Yi represents the number of failures on day i and Z the state of the assembly line. I z) = E(Y I Z).20. 2.E(Y I Z))' =L x 3 ~)y y=o . if we are trying to guess.025 0. or 3 shutdowns due to mechanical failure.05 0.6). E(Y . E (Y I Z = ±) = 1.4.15 I 0. and only if. The assertion (1. o Example 1. These fractional figures are not too meaningful as predictors of the natural number values of Y. But this predictor is also the right one.7) can hold if. We want to predict the number of failures for a given day knowing the state of the assembly line for the month. y) = p (Z = Z 1 Y = y) of the number of shutdowns Y and the capacity state Z of the line for a randomly chosen day.30 I 0. or quarter capacity. whereas the column sums py (y) yield the frequency of 0.7) follows immediately from (1. We find E(Y I Z = 1) = L iF[Y = i I Z = 1] ~ 2.:.E(Y I Z))' ~ 0 By (A. Equality in (1.
2 I . = z)]'pz(z) 0.4. a'k. the best predictor is just the constant f. One minus the ratio of the MSPE of the best predictor of Y given Z to Var Y.885 as before./lz).2 tells us that the conditional dis /lO(Z) Because .10) I I " The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution. Y) has a N(Jlz l f.4. E(Y .2. the MSPE of our predictor is given by. which is the MSPE of the best constant predictor. The term regression was coined by Francis Galton and is based on the following observation. The larger this quantity the more dependent Z and Yare. (1. . Therefore.9) I . Goals.p')). (1 .4. = /lY + p(uyjuz)(Z ~ /lZ). Thus. which corresponds to the best predictor of Y given Z in the bivariate normal model.6) we can also write Var /lO(Z) p = VarY . is usually called the regression (line) of Y on Z. this quantity is just rl'. for this family of distributions the sign of the correlation coefficient gives the type of dependence between Z and Y. p) distribution. and Performance Criteria Chapter 1 The second way is to use (1.4. whereas its magnitude measures the degree of such dependence.E(Y I Z p') (1.36 Statistical Models. E(Y ~ E(Y I Z))' VarY ~ Var(E(Y I Z)) E(Y') ~ E[(E(Y I Z))'] I>'PY(Y) . Because of (1. If p = 0.LY. = Z))2 I Z = z) = u~(l _ = u~(1 _ p'). Regression toward the mean. .L[E(Y I Z y . the predictor is a monotone increasing function of Z indicating that large (small) values of Y tend to be associated with large (small) values of Z. In the bivariate normal case. If (Z. p < 0 indicates that large values of Z tend to go with small values of Y and we have negative dependence.4. E((Y . can reasonably be thought of as a measure of dependence. " " is independent of z. u~.11) I The line y = /lY + p(uy juz)(z ~ /lz). Theorem B. o tribution of Y given Z = z is N City + p(Uy j UZ ) (z . Similarly. O"~.J. The Bivariate Normal Distribution. Suppose Y and Z are bivariate normal random variables with the same mean . If p > O.y as we would expect in the case of independence.4.6) writing. the best predictor of Y using Z is the linear function Example 1.4.E(Y I Z))' (1.
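The bivariate normal formulas can be checked by simulation. In the sketch below the values of the means, standard deviations, and ρ are illustrative choices, not the text's; the regression line attains MSPE σ_Y^2(1 - ρ^2), and one minus the ratio of this MSPE to Var Y is ρ^2.

```python
import numpy as np

# Simulation check of the bivariate normal case: the predictor
# mu_Y + rho*(sigma_Y/sigma_Z)*(Z - mu_Z) attains MSPE sigma_Y^2*(1 - rho^2),
# and 1 - MSPE/Var(Y) is rho^2.  The parameter values are illustrative.
mu_z, mu_y, s_z, s_y, rho = 1.0, 2.0, 1.5, 2.0, 0.6
cov = [[s_z**2, rho * s_z * s_y], [rho * s_z * s_y, s_y**2]]

rng = np.random.default_rng(1)
Z, Y = rng.multivariate_normal([mu_z, mu_y], cov, size=500_000).T

pred = mu_y + rho * (s_y / s_z) * (Z - mu_z)         # best predictor E(Y | Z)
mspe = ((Y - pred) ** 2).mean()
print(mspe, s_y**2 * (1 - rho**2))                   # nearly equal
print(1 - mspe / Y.var(), rho**2)                    # nearly equal
```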
11).6)..4. N d +1 (I'. uYYlz) where. .6." This is compensated for by "progression" toward the mean among the sons of shorter fathers and there is no paradox. Let Z = (Zl. (Zn. By (1. (1 . where the last identity follows from (1. The quadratic fonn EyzEZ"iEzy is positive except when the joint nonnal distribution is degenerate. consequently..p)1" + pZ) = p'(T'..Section 1.5 states that the conditional distribution of Y given Z = z is N(I"Y +(zl'zf. E). .4. Theorem B. or the average height of sbns whose fathers are the height Z. variance cr2. there is "regression toward the mean. is closer to the population mean of heights tt than is the height of the father. Y» T T ~ E yZ and Uyy = Var(Y). the best predictor E(Y I Z) ofY is the linear function (1. Thus. . " Zd)T be a d x 1 covariate vector with mean JLz = (ttl. The variability of the predicted value about 11. usual correlation coefficient p = UZy / crfyCT lz when d = 1.6) in which I' (I"~. coefficient ofdetennination or population Rsquared. I"Y = E(Y). . in particular in Galton's studies. Then the predicted height of the son. the distribution of (Z. be less than that of the actual heights and indeed Var((! . One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with Y. Y) is unavailable and the regression line is estimated on the basis of a sample (Zl. YI ). ttd)T and suppose that (ZT.12) with MSPE ElY l"o(Z)]' ~ E{EIY l"o(zlI' I Z} = E(uYYlz) ~ Uyy  EyzEziEzy.3. y)T has a (d + !) multivariate normal.4 Prediction 37 tt. I"Y) T. distribution (Section B.4... ..4. the MCC equals the square of the .8 = EziEzy anduYYlz ~ uyyEyzEziEzy. Ezy O"yy E zz is the d x d variancecovariance matrix Var(Z) . We shall see how to do this in Chapter 2. . these were the heights of a randomly selected father (Z) and his son (Y) from a large human population. . Yn ) from the population. . so the MSPE of !lo(Z) is smaller than the MSPE of the constant predictor J1y. Note that in practice. and positive correlation p.COV(Zd. This quantity is called the multiple correlation coefficient (MCC). tall fathers tend to have shorter sons. E zy = (COV(Z" Y). The Multivariate Normal Distribution.. We write Mee = ' ~! _ PZy ElY /lO(Z))' ~ Var l"o(Z) Var Y Var Y . Thus. 0 Example 1.p)tt + pZ. should.8. In Galton's case.
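The multivariate normal prediction formula (1.4.12) and the multiple correlation coefficient can likewise be checked by simulation. In the sketch below the mean vector and covariance matrix are illustrative choices; β = Σ_ZZ^{-1}Σ_ZY gives the predictor μ_Y + (Z - μ_Z)^T β, whose MSPE is σ_YY - Σ_YZ Σ_ZZ^{-1} Σ_ZY.

```python
import numpy as np

# Simulation sketch of the multivariate normal prediction formula (1.4.12);
# the mean and covariance below are illustrative, not from the text.
rng = np.random.default_rng(2)
mu = np.array([0.0, 0.0, 1.0])                       # components (Z1, Z2, Y)
Sigma = np.array([[1.0, 0.3, 0.5],
                  [0.3, 1.0, 0.4],
                  [0.5, 0.4, 2.0]])

X = rng.multivariate_normal(mu, Sigma, size=400_000)
Z, Y = X[:, :2], X[:, 2]

S_zz, S_zy, s_yy = Sigma[:2, :2], Sigma[:2, 2], Sigma[2, 2]
beta = np.linalg.solve(S_zz, S_zy)                   # Sigma_ZZ^{-1} Sigma_ZY
pred = mu[2] + (Z - mu[:2]) @ beta

mspe = ((Y - pred) ** 2).mean()
print(mspe, s_yy - S_zy @ beta)                      # empirical vs theoretical MSPE
print(1 - (s_yy - S_zy @ beta) / s_yy)               # multiple correlation coefficient
```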
E(Y .9% and 39. we can avoid both objections by looking for a predictor that is best within a class of simple predictors. The natural class to begin with is that of linear combinations of components of Z.1 and 2. We expand {Y .5%.38 Statistical Models. 29W· Then the strength of association between a girl's height and those of her mother and father. E(Y . Z2 = father's height).)T..bZ}' = E(Y)  b E(Z) 1· = (Y  [Z(b .2. let Y and Z = (Zl.3%.zz ~ (.1. Let us call any random variable of the form a + bZ a linear predictor and any such variable with a = 0 a zero intercept linear predictor. Goals. and E(Y _ boZ)' = E(Y') _ [E(ZY)]' E(Z') (1.. (Z~. the hnear predictor l'a(Z) and its MSPE will be estimated using a 0 sample (Zi. We first do the onedimensional case.209.4.bZ)' = E(Y') + E(Z')(b  Therefore.. The problem of finding the best MSPE predictor is solved by Theorem 1. knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33. when the distribution of (ZT.ba) + Zba]}' to get ba )' .l Proof.4. y)T is unknown. yn)T See Sections 2. The percentage reductions knowing the father's and both parent's heights are 20.13) ..y = .393.zy ~ (407. P~. respectively.3. If we are willing to sacrifice absolute excellence. l. The best linear predictor.39 l. and parents.335. What is the best (zero intercept) linear predictor of Y in the sense of minimizing MSPE? The answer is given by: Theorem 1. respectively. Suppose that E(Z') and E(Y') are finite and Z and Y are not constant. are P~. and Performance Criteria Chapter 1 For example. In words. . Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of Z and Y in order to calcujate E(Y I Z) and that the best predictor may be a complicated function of Z. p~y = .~ ~:~~). Suppose(l) that (ZT l Y) T is trivariate nonnal with Var(Y) = 6. . Y) a I VarZ.4. Y.bZ)' is uniquely minintized by b = ba. Then the unique best zero intercept linear predictor i~ obtained by taking E(ZY) b=ba = E(Z') . Z2)T be the heights in inches of a IOyearold girl and her parents (Zl = mother's height. whereas the unique best linear predictor is ILL (Z) = al + bl Z where b = Cov(Z. In practice.y = .:.E(Z')b~.
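The coefficients of the best linear and best zero-intercept linear predictors are easy to check numerically. In the sketch below, the joint law of (Z, Y) is an arbitrary illustrative choice; the point is only that b1 = Cov(Z, Y)/Var Z, a1 = E(Y) - b1 E(Z), and b0 = E(ZY)/E(Z^2) minimize the MSPE within their respective classes, so that perturbing them increases the MSPE.

```python
import numpy as np

# Numerical illustration of the best (zero-intercept) linear predictors for a
# simulated pair (Z, Y) of arbitrary illustrative law.
rng = np.random.default_rng(3)
Z = rng.exponential(size=300_000)
Y = 1.0 + 2.0 * Z + rng.normal(size=Z.size)

b1 = np.cov(Z, Y, bias=True)[0, 1] / Z.var()          # Cov(Z, Y) / Var Z
a1 = Y.mean() - b1 * Z.mean()
b0 = (Z * Y).mean() / (Z**2).mean()                   # E(ZY) / E(Z^2)

def mspe(pred):
    return ((Y - pred) ** 2).mean()

print(mspe(a1 + b1 * Z), mspe(a1 + (b1 + 0.1) * Z))   # best linear beats a perturbation
print(mspe(b0 * Z), mspe((b0 + 0.1) * Z))             # same for zero-intercept predictors
```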
whiclt corresponds to Y = boZo We could similarly obtain (A.b(Z . then a = a. (1.a .E(Z) and Y .4.E(Z)IIZ .4. Therefore. Best Multivariate Linear Predictor.1). Let Po denote = . In that example nothing is lost by using linear prediction.4. then the Unique best MSPE predictor is I'dZ) Proof.boZ)' > 0 is equivalent to the CauchySchwarz inequality with equality holding if.E(Z)f[Z . by (1. = I'Y + (Z  J.4 Prediction 39 To prove the second assertion of the theorem note that by (lA.t and covariance E of X. If EY' and (E([Z . That is.E(Y) to conclude that b. Theorem 1.E(Z)J))1 exist. b) X ~ (ZT.4.bE(Z) .bZ) + (E(Y) . On the other hand. it must coincide with the best linear predictor. whatever be b.4. I 1.bZ)2 is uniquely minimized by taking a ~ E(Y) . E(Y .4.a)'. A loss of about 5% is incurred by using the best linear predictor.E(Y)]) = EziEzy.2.4.bE(Z).1.E(Z))]'. is the unique minimizing value. This is because E(Y . We can now apply the result on zero intercept linear predictors to the variables Z .5). 0 Note that if E(Y I Z) is of the form a + bZ. 0 Remark 1.16) directly by calculating E(Y .bZ)2 we see that the b we seek minimizes E[(Y .E(Y)) .Z)'.tzf [3.. From (1.boZ)' = 0. and b = b1 • because. E(Y .Section 1.14) Yl T Ep[Y . E[Y .E(Z)J[Y .a . This is in accordance with our evaluation of E(Y I Z) in Example 1. E(Y . Substituting this value of a in E(Y . .1 the best linear predictor and best predictor differ (see Figure 1. and only if.4.b. in Example 1.E(ZWW ' E([Z .a .I1.l).a.13) we obtain tlte proof of the CauchySchwarz inequality (A. if the best predictor is linear. Note that R(a.4.I'd Z )]' = 1 05 ElY I'(Z)j' .l7) in the appendix.I'l(Z)1' depends on the joint distribution P of only through the expectation J. OUf linear predictor is of the fonn d I'I(Z) = a + LbjZj j=l = a+ ZTb [3 = (E([Z .bZ)' ~ Var(Y .
the multivariate normal, N(μ, Σ), distribution and let R0(a, b) = E_{P0}[Y - (a + Z^T b)]^2. By Example 1.4.3, R0(a, b) is minimized by (1.4.14). Because P and P0 have the same μ and Σ, R(a, b) = R0(a, b), and R(a, b) is therefore minimized by (1.4.14). □

[Figure 1.4.1: the best predictor (three dots) and the best linear predictor (line) plotted against z, with z running from 0.00 to 1.00.] The line represents the best linear predictor y = 1.05 + 1.45z. The three dots give the best predictor.

Remark 1.4.2. We could also have established Theorem 1.4.4 by extending the proof of Theorem 1.4.3 to d > 1. A third approach using calculus is given in Problem 1.4.19.

Remark 1.4.3. Our proof of Theorem 1.4.4 shows how second-moment results sometimes can be established by "connecting" them to the multivariate normal distribution.

Remark 1.4.4. In the general, not necessarily normal, case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the correlation between Y and the best linear predictor of Y, that is, ρ²_ZY = Corr²(Y, μ_L(Z)). Thus, the MCC gives the strength of the linear relationship between Z and Y. See Problem 1.4.17 for an overall measure of the strength of this relationship.

Remark 1.4.5. Suppose the model for μ(Z) is linear, that is, μ(Z) = E(Y | Z) = α + Z^T β for unknown α ∈ R and β ∈ R^d. Set Z0 = 1. By Proposition 1.4.1(a), ε = Y - μ(Z) and each of Z0, Z1, ..., Zd are uncorrelated; that is,

E(Zj[Y - (α + Z^T β)]) = 0,   j = 0, 1, ..., d.   (1.4.15)
Solving (1.4.15) for α and β gives (1.4.14). Because the multivariate normal model is a linear model, we can conclude that μ(Z) = μ_L(Z) in that case; thus, this gives a new derivation of (1.4.12).

Two functions g(Z) and h(Z) are said to be orthogonal if at least one has expected value zero and E[g(Z)h(Z)] = 0. Thus, ε = Y - μ(Z) is orthogonal to μ(Z) and to μ_L(Z). When the class G of possible predictors g with E|g(Z)| < ∞ forms a Hilbert space as defined in Section B.10 and there is a g0 ∈ G such that g0 = arg inf{Δ(Y, g(Z)) : g ∈ G}, then g0(Z) is called the projection of Y on the space G of functions of Z and we write g0(Z) = π(Y | G). Using the distance Δ and projection π notation, μ(Z) = π(Y | G_NP), μ_L(Z) = π(Y | G_L), and

Δ²(Y, μ_L(Z)) = Δ²(Y, μ(Z)) + Δ²(μ(Z), μ_L(Z))   (1.4.16)
π(Y | G_L) = π(μ(Z) | G_L).   (1.4.17)

Note that (1.4.16) is the Pythagorean identity. With these concepts the results of this section are linked to the general Hilbert space results of Section B.10.

Consider the Bayesian model of Section 1.2 and the Bayes risk (1.3.8) defined by r(δ) = E[l(θ, δ(X))]. If we identify θ with Y and X with Z, we see that r(δ) = MSPE for squared error loss l(θ, δ) = (θ - δ)². Thus, the optimal MSPE predictor E(θ | X) is the Bayes procedure for squared error loss. We return to this in Section 3.2.

Summary. We consider situations in which the goal is to predict the (perhaps in the future) value of a random variable Y. The notion of mean squared prediction error (MSPE) is introduced, and it is shown that if we want to predict Y on the basis of information contained in a random vector Z, the optimal MSPE predictor is the conditional expected value of Y given Z. The optimal MSPE predictor in the multivariate normal distribution is presented. It is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear.

1.5 SUFFICIENCY

Once we have postulated a statistical model, we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation. We begin by formalizing what we mean by "a reduction of the data" X ∈ X. Recall that a statistic is any function of the observations, generically denoted by T(X) or T. The range of T is any space of objects T, usually R or R^k, but it can also be a set of functions. If T assigns the same value to different sample points, then by recording or taking into account only the value of T(X) we have a reduction of the data.
For instance, suppose that in Example 1.1.1 we had sampled the manufactured items in order, recording at each stage whether the examined item was defective or not. We could then represent the data by a vector X = (X1, ..., Xn), where Xi = 1 if the ith item sampled is defective and Xi = 0 otherwise. The total number of defective items observed, T = Σ Xi, is a statistic that maps many different values of (X1, ..., Xn) into the same number. Similarly, T(X) = X̄ loses information about the Xi as soon as n > 1, and the vector of order statistics (X(1), ..., X(n)) loses information about the labels of the Xi. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information, in the context of a model P = {Pθ : θ ∈ Θ}.

A statistic T(X) is called sufficient for P ∈ P, or for the parameter θ, if the conditional distribution of X given T(X) = t does not involve θ. Thus, once the value of a sufficient statistic T is known, the sample X = (X1, ..., Xn) does not contain any further information about θ, or equivalently about Pθ, given that P is valid. We give a decision theory interpretation later. The most trivial example of a sufficient statistic is T(X) = X, because by any interpretation the conditional distribution of X given T(X) = x is point mass at x.

Example 1.5.1. A machine produces n items in succession. Each item produced is good with probability θ and defective with probability 1 - θ, where θ is unknown. Suppose there is no dependence between the quality of the items produced and let Xi = 1 if the ith item is good and 0 otherwise. Then X = (X1, ..., Xn) is the record of n Bernoulli trials with probability θ. It is intuitively clear that if we are interested in θ, nothing is lost in this situation by recording and using only T = Σ Xi instead of keeping track of several numbers. We prove that T is sufficient. Begin by noting that, whatever be θ,

Pθ[X1 = x1, ..., Xn = xn] = θ^t (1 - θ)^(n - t)   (1.5.1)

where each xi is 0 or 1 and t = Σ xi. Because the right-hand side depends on x only through t, the conditional distribution of X given T = Σ Xi = t is uniform over the sequences with t ones and does not involve θ. Thus, T is a sufficient statistic for θ. □

Example 1.5.2. Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) θ. Let X1 be the time of arrival of the first customer and X2 the time between the arrival of the first and second customers. Then X1 and X2 are independent and identically distributed exponential random variables with parameter θ. We prove that T = X1 + X2 is sufficient for θ. Whatever be θ, the statistics X1/(X1 + X2) and X1 + X2 are independent, and the first of these statistics has a uniform distribution on (0, 1). It follows that the conditional distribution of X1/(X1 + X2) given X1 + X2 = t is U(0, 1) whatever be t, and hence that, given X1 + X2 = t, the conditional distribution of X1 = [X1/(X1 + X2)](X1 + X2) is U(0, t); that is, (X1, X2) is conditionally distributed as (X, Y), where X is uniform on (0, t) and Y = t - X. This conditional distribution does not depend on θ, so T = X1 + X2 is sufficient. □
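The sufficiency argument of Example 1.5.2 can also be seen in a simulation: whatever the value of θ, X1/(X1 + X2) behaves like a U(0, 1) variable. The sketch below compares empirical quantiles at two illustrative rates.

```python
import numpy as np

# Simulation view of the sufficiency of T = X1 + X2 for exponential
# observations: X1/(X1 + X2) is U(0, 1) whatever the rate theta.
rng = np.random.default_rng(5)
m = 200_000

for theta in (0.5, 3.0):                              # two illustrative rates
    x1 = rng.exponential(scale=1 / theta, size=m)     # exponential with rate theta
    x2 = rng.exponential(scale=1 / theta, size=m)
    r = x1 / (x1 + x2)
    print(theta, np.round(np.quantile(r, [0.1, 0.25, 0.5, 0.75, 0.9]), 3))
    # both rows are close to (0.1, 0.25, 0.5, 0.75, 0.9), the U(0,1) quantiles
```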
Although the sufficient statistics we have obtained are "natural," it is important to notice that there are many others that will do the same job. Being told that the number of successes in five trials is three is the same as knowing that the difference between the number of successes and the number of failures is one. More generally, if T1 and T2 are any two statistics such that T1(x) = T1(y) if and only if T2(x) = T2(y), then T1 and T2 provide the same information and achieve the same reduction of the data. Such statistics are called equivalent.

In general, checking sufficiency directly is difficult because we need to compute the conditional distribution. Fortunately, a simple necessary and sufficient criterion for a statistic to be sufficient is available. This result was proved in various forms by Fisher, Neyman, and Halmos and Savage. It is often referred to as the factorization theorem for sufficient statistics.

Theorem 1.5.1. In a regular model, a statistic T(X) with range T is sufficient for θ if, and only if, there exists a function g(t, θ) defined for t in T and θ in Θ and a function h defined on X such that

p(x, θ) = g(T(x), θ) h(x)   (1.5.2)

for all x ∈ X, θ ∈ Θ.

Proof. We shall give the proof in the discrete case. The complete result is established, for instance, by Lehmann (1997, Section 2.6). Let (x1, x2, ...) be the set of possible realizations of X and let ti = T(xi). Then T is discrete and Σ_{i ≥ 1} Pθ[T = ti] = 1 for every θ. To prove the sufficiency of (1.5.2), we need only show that Pθ[X = xj | T = ti] is independent of θ on each of the sets Si = {θ : Pθ[T = ti] > 0}, i = 1, 2, .... By our definition of conditional probability in the discrete case,

Pθ[X = xj | T = ti] = Pθ[X = xj, T = ti] / Pθ[T = ti] = p(xj, θ)/Pθ[T = ti] if T(xj) = ti,
Pθ[X = xj | T = ti] = 0 if T(xj) ≠ ti.   (1.5.3)

Now, if (1.5.2) holds,

Pθ[T = ti] = Σ_{x: T(x) = ti} p(x, θ) = g(ti, θ) Σ_{x: T(x) = ti} h(x).   (1.5.4)

Applying (1.5.3) we arrive at, for θ ∈ Si,

Pθ[X = xj | T = ti] = h(xj) / Σ_{x: T(x) = ti} h(x) if T(xj) = ti, and 0 if T(xj) ≠ ti,   (1.5.5)

which is independent of θ.
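The factorization criterion can be illustrated numerically in the Bernoulli case of Example 1.5.1: because p(x, θ) = g(T(x), θ)h(x) with T(x) = Σ xi, the likelihood function is the same for any two samples with the same total. The samples below are illustrative.

```python
import numpy as np

# Numerical illustration of the factorization criterion for Bernoulli trials:
# the likelihood depends on the sample only through T(x) = sum(x).
theta = np.linspace(0.01, 0.99, 99)

def likelihood(x):
    t, n = sum(x), len(x)
    return theta**t * (1 - theta) ** (n - t)

x1 = [1, 1, 0, 1, 0]        # T = 3
x2 = [0, 1, 1, 0, 1]        # T = 3, a different sample point
x3 = [1, 0, 0, 0, 0]        # T = 1

print(np.allclose(likelihood(x1), likelihood(x2)))    # True
print(np.allclose(likelihood(x1), likelihood(x3)))    # False
```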
4. . 0 o Example 1. }][exp{ .'t n /'[exp{ .xn }. n P(Xl. > 0.. 1 X n be independent and identically distributed random variables each having a normal distribution with mean {l and variance (j2.5. both of which are unknown.n /' exp{ . Consider a population with () members labeled consecutively from I to B. . ' .9) can be rewritten as " 1 Xn . then the joint density of (X" .. .S. .. Then the density of (X" .Xn are recorded. and Performance Criteria Chapter 1 Therefore.1 to conclude that T(X 1 .1..Xn ) is given by (see (A. and h(xl •. are introduced in the next section. l X n are the interarrival times for n customers.2. 1=1 (!.Xn ) is given by I. Takeg(t..7) o Example 1. . 10 fact. T = T(x)] = g(T(x). if T is sufficient. . If X 1. . = max(xl.2.JL)'} ~=l I n I! 2 . let g(ti' 0) = porT = tiL h(x) = P[X = xIT(X) = til Then (1.. _ _ _1 ..5. and P(XI' . which admits simple sufficient statistics and to which this example belongs.9) if every Xi is an integer between 1 and B and p( X 11 • (1. . 0) by (B. we can show that X(n) is sufficient. X(n) is a sufficient statistic for O.5. 1.10) P(Xl. We may apply Theorem 1.. .. Common sense indicates that to get information about B.... Estimating the Size of a Population..5.X...Xn ) = L~ IXi is sufficient. T is sufficient. ~ Po[X = x. By Theorem 1. X n ).5. . ..Il) . .O) where x(n) = onl{x(n) < O}. . I x n ) = 1 if all the Xi are > 0. Goals. •.16.5. Let Xl.8) if all the Xi are > 0.'" . A whole class of distributions. Conversely.3)..8) = eneStift > 0.2 (continued). O} = 0 otherwise.O)=onexP[OLXil i=l (1.5..5.' (L x~ 1=1 1 n n 2JL LXi))].'I.X n JI) = 0 otherwise..3. Expression (1.' L(Xi . The population is sampled with replacement and n members of the population are observed and their labels Xl.5.44 Statistical Models. we need only keeep track of X(n) = max(X . The probability distribution of X "is given by (1.5. and both functions = 0 otherwise.6) p(x. [27". Let 0 = (JL. . 0 Example 1.[27f. O)h(x) (1.4))..~.Xn.').5.
. 2:>. 0 X Example 1. Let o(X) = Xl' Using only X.~0)}(2"r~n exp{ . we can.• Yn are independent.5..5 Sufficiency 45 I Evidently P{Xl 1 • •• .5. for any decision procedure o(x).1 we can conclude that n n Xi. o Sufficiency and decision theory Sufficiency can be given a clear operational interpretation in the decision theoretic setting. (1.' + 213 2a + 213.o) = R(O. find a randomized decision rule o'(T(X)) depending only on T(X) that does as well as O(X) in the sense of having the same risk function. Yi N(Jli. respectively. X n ) = [(lin) where I: Xi. Suppose. a<. in Example 1. 1 2 2 . with JLi following the linear regresssion model ("'J . An equivalent sufficient statistic in this situation that is frequently used IS S(X" . . X is sufficient. The first and second components of this vector are called the sample mean and the sample variance. a 2)T is identifiable (Problem 1. Specifically. Ezi 1'i) is sufficient for 6.EZiY. a 2 ). we construct a rule O'(X) with the sarne risk = mean squared error as o(X) as follows: Conditionally. Then 6 = {{31. .5. Here is an example.9) and _ (y.1. X n are independent identically N(I). X n ) = (I: Xi. that is.1.X)'].) i=l i=l is sufficient for B. T = (EYi . . that Y I . . I)) = exp{nI)(x .~ I: xn By the factorization theorem.( 2?Ta ')~ exp p {E(fJ.5.l) distributed.5. i= 1 = (lin) L~ 1 Xi. EYi 2 . I)) .Section 1. R(O. Suppose X" .4 with d = 2. [1/(n i=l n n 1)1 I:(Xi .xnJ)) is itself a function of (L~ upon applying Theorem 1. if T(X) is sufficient. {32. Thus. where we assume diat the given constants {Zi} are not all identical.··· . L~l x~) and () only and T(X 1 .} + EY. ..12) t ofT(X) and a Example 1.Zi)'} exp {Er.2a I3.6. Then pix.. By randomized we mean that o'(T(X») can be generated from the value random mechanism not depending on B. 0') for allO.
if Xl. IfT(X) is sufficient for 0. by the double expectation theorem. O)h(x) E:  . 6'(X) and 6(X) have the same mean squared error. Example 1. L~ I xl) and S(X) = (X" .12) follows along the lines of the preceding example: Given T(X) = t. it is Bayes sufficient for every 11.. . and Performance Criteria Chapter 1 given X = t. Now draw 6' randomly from this conditional distribution. the distribution of 6(X) does not depend on O. Then by the factorization theorem we can write p(x.5. Goals. the same as the posterior distribution given T(X) = L~l Xi = k. Using Section B.5.. we find E(T') ~ E[E(T'IX)] ~ E(X) ~ I" ~ E(X . But T(X) provides a greater reduction of the data. In this Bernoulli trials case. where k In this situation we call T Bayes sufficient. . Theorem }. Minimal sufficiency For any model there are many sufficient statistics: Thus. ).14. Equivalently. 0 and X are independent given T(X).5. R(O..46 Statistical Models. X n ) are hoth sufficient. Thus. then T(X) = (L~ I Xi. 0 The proof of (1. .. we can find a transformation r such that T(X) = r(S(X)).6).2. T(X) is Bayes sufficient for IT if the posterior distribution of 8 given X = x is the same as the posterior (conditional) distribution of () given T(X) = T(x) for all x.O) = g(S(x).4. We define the statistic T(X) to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic S(X). in that. This result and a partial converse is the subject of Problem 1. In Example 1. 0) as p(x. (T2) sample n > 2. This 6' (T(X)) will have the same risk as 6' (X) because.5. Let SeX) be any other sufficient statistic. (Kolmogorov).6). choose T" = 15"' (X) from the normal N(t.l and (1.6') ~ E{E[R(O..Xn is a N(p.2.6'(T))IT]} ~ E{E[R(O.1 (Bernoulli trials) we saw that the posterior distribution given X = x is 1 Xi. n~l) distribution. o Sufficiency aud Bayes models There is a natural notion of sufficiency of a statistic T in the Bayesian context where in addition to the model P = {Po : 0 E e} we postulate a prior distribution 11 for e. ) Var(T') = E[Var(T'IX)] + VarIE(T'IX)]  ~ nl n + 1 n ~ 1 = Var(X .6(X»)IT]} = R(O.1 (continued). T = I Xi was shown to be sufficient. = 2:7 Definition.
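The randomization construction described above is easy to reproduce by simulation: drawing δ* from N(X̄, (n - 1)/n) yields a rule that depends on the data only through X̄ yet has the same mean squared error as δ(X) = X1. The values of μ and n below are illustrative.

```python
import numpy as np

# Simulation of the randomized rule based on the sufficient statistic Xbar:
# its MSE matches that of delta(X) = X1 for X1,...,Xn i.i.d. N(mu, 1).
rng = np.random.default_rng(6)
mu, n, reps = 2.0, 5, 400_000
X = rng.normal(mu, 1.0, size=(reps, n))

delta = X[:, 0]                                       # the rule that uses X1 only
xbar = X.mean(axis=1)
delta_star = rng.normal(xbar, np.sqrt((n - 1) / n))   # randomize given Xbar

print(((delta - mu) ** 2).mean())                     # MSE of X1, about 1
print(((delta_star - mu) ** 2).mean())                # about the same
```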
L Xf) i=l i=l n n = (0 10 0. The formula (1. t. for example. the "likelihood" or "plausibility" of various 8. the likelihood function (1. we find OT(1 . L x (()) gives the probability of observing the point x. ~ 1/3. Thus.4 (continued). t2) because. ()) for different values of () and the factorization theorem to establish that a sufficient statistic is minimally sufficient..1).(t. = 2 log Lx(O.) = (/". for a given observed x.o)nT ~g(S(x).11) is determined by the twodimensional sufficient statistic T Set 0 = (TI. Lx is a map from the sample space X to the class T of functions {() t p(x. as a function of (). then Lx(O) = (27rO. it gives. if we set 0. In this N(/".5. (T') example. }exp ( I . if X = x. We define the likelihood function L for a given observed data vector x as Lx(O) = p(x. when we think of Lx(B) as a function of (). the statistic L takes on the value Lx.5.O).T. It is a statistic whose values are functions.1) nlog27r .8) for the posterior distribution can then be remembered as Posterior ex: (Prior) x (Likelihood) where the sign ex denotes proportionality as functions of 8. In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around x. Example 1. ()) x E X}. Now. The likelihood function o The preceding example shows how we can use p(x. However.) n/' exp {nO.) ~ (L Xi. 20.Section 1.O)h(x) forall O.5. In the discrete case. 20' 20 I t.)}. take the log of both sides of this equation and solve for T. T is minimally sufficient.5.O E e.5 Sufficiency 47 Combining this with (1. we find T = r(S(x)) ~ (log[2ng(s(x). L x (') determines (iI. (T'). for a given 8. ~ 2/3 and 0. For any two fixed (h and fh. the ratio of both sides of the foregoing gives In particular. Thus. 2/3)/g(S(x). 1/3)]}/21og 2.
we can find a randomized decision rule J'(T(X» depending only on the value of t = T(X) and not on 0 such that J and J' have identical risk functions. We show the following result: If T(X) is sufficient for 0.Oo) > OJ = Lx~6o)' Thus. SeX) becomes irrelevant (ancillary) for inference if T(X) is known but only if P is valid. . Summary. B ul the ranks are needed if we want to look for possible dependencies in the observations as in Example 1. Lehmann.1. Thus. 0) and heX) such that p(X. O)h(X).12 for a proof of this theorem of Dynkin. . hence.5.. and Performance Criteria Chapter 1 with a similar expression for t 1 in terms of Lx (O.Rn ).. L is a statistic that is equivalent to ((1' i 2) and. in Example 1. Let Ax OJ c (x: p(x.X Cn )' the order statistics.5. Suppose there exists 00 such that (x: p(x.X). 1 X n are a random sample. _' . We say that a statistic T (X) is sufficient for PEP. Ax is the function valued statistic that at (J takes on the value p x.. If T(X) is sufficient for 0. or if T(X) ~ (X~l)' . 1) (Problem 1. OJ denote the frequency function or density of X. where R i ~ Lj~t I(X. 1) and Lx (1.X)2) is sufficient. but if in fact a 2 I 1 all information about a 2 is contained in the residuals. . as in the Example 1. ..5.48 Statistical Models.5.. Let p(X. See Problem ((x:).1 (continued) we can show that T and. The factorization theorem states that T(X) is sufficient for () if and only if there exist functions g( t.. < X. (X.4.X.5. l:~ I (Xi . Goals. A sufficient statistic T(X) is minimally sufficient for () if for any other sufficient statistic SeX) we can find a ttansformation r such that T(X) = r(S(X). but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding S(X)see Problem 1. L is minimal sufficient.Xn .. the ranks. . • X( n» is sufficient. hence.13. Thus. if a 2 = 1 is postulated. . and Scheffe.5. ifT(X) = X we can take SeX) ~ (Xl . SeX)~ where SeX) is a statistic needed to uniquely detennine x once we know the sufficient statistic T(x). 0) = g(T(X). For instance. if the conditional distribution of X given T(X) = t does not involve O. 1. the residuals. Suppose that X has distribution in the class P ~ {Po : 0 E e}. Consider an experiment with observation vector X = (Xl •. We define a statistic T(X) to be Bayes sufficient for a prior 7r if the posterior distribution of f:} given X = x is the same as the posterior distribution of 0 given T(X) = T(x) for all X. 0 In fact. a statistic closely related to L solves the minimal sufficiency problem in general.5.O) > for all B. Then Ax is minimal sufficient. 0 the likelihood ratio of () to Bo.. If. then for any decision procedure J(X).17). If P specifies that X 11 •. itself sufficient. X is sufficient. a 2 is assumed unknown. S(X) = (R" . The likelihood function is defined for a given data vector of . The "irrelevant" part of the data We can always rewrite the original X as (T(X). .5. (X( I)' . or for the parameter O.. By arguing as in Example 1.). 1 X n ). it is Bayes sufficient for O.
observations X to be the function of θ defined by Lx(θ) = p(X, θ), θ ∈ Θ. If T(X) is sufficient for θ, then the likelihood ratio Λx(θ) = Lx(θ)/Lx(θ0) depends on X through T(X) only. If there is a value θ0 ∈ Θ such that {x : p(x, θ) > 0} ⊂ {x : p(x, θ0) > 0} for all θ, then Λx is a minimally sufficient statistic.

1.6 EXPONENTIAL FAMILIES

The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman, Pitman, and Darmois through investigations of this property.(1) Subsequently, many other common features of these families were discovered and they have become important in much of the modern theory of statistics. Probability models with these common features include normal, binomial, Poisson, gamma, beta, and multinomial regression models used to relate a response variable Y to a set of predictor variables. More generally, these families form the basis for an important class of models called generalized linear models. We return to these models in Chapter 2. They will reappear in several connections in this book.

1.6.1 The One-Parameter Case

The family of distributions of a model {Pθ : θ ∈ Θ}, Θ ⊂ R, is said to be a one-parameter exponential family if there exist real-valued functions η(θ) and B(θ) on Θ, and real-valued functions T and h on R^q, such that the density (frequency) functions p(x, θ) of the Pθ may be written

p(x, θ) = h(x) exp{η(θ)T(x) - B(θ)},   (1.6.1)

where x ∈ X ⊂ R^q. Note that the functions η, B, and T are not unique. In a one-parameter exponential family the random variable T(X) is sufficient for θ. This is clear because we need only identify exp{η(θ)T(x) - B(θ)} with g(T(x), θ) and h(x) with itself in the factorization theorem. We shall refer to T as a natural sufficient statistic of the family. Here are some examples.

Example 1.6.1. The Poisson Distribution. Let Pθ be the Poisson distribution with unknown mean θ. Then, for x ∈ {0, 1, 2, ...},

p(x, θ) = θ^x e^(-θ)/x! = (1/x!) exp{x log θ - θ},   θ > 0.   (1.6.2)
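It is straightforward to confirm numerically that the exponential family form (1.6.1), with h(x) = 1/x!, η(θ) = log θ, T(x) = x, and B(θ) = θ, reproduces the Poisson frequency function (1.6.2); the sketch below compares it against a library implementation at an arbitrary value of θ.

```python
from math import exp, factorial, log

from scipy.stats import poisson

# Check of the exponential family form (1.6.1) for the Poisson distribution.
def poisson_ef(x, theta):
    # h = 1/x!, eta = log(theta), T(x) = x, B(theta) = theta
    return (1 / factorial(x)) * exp(log(theta) * x - theta)

print(max(abs(poisson_ef(x, 2.5) - poisson.pmf(x, 2.5)) for x in range(25)))  # ~ 0
```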
B(O) = 0. we have p(x. 0 Example 1.6. T(x) ~ x.3. .y. n) < 0 < 1.T(x)=x. .~z.(y I z) = <p(zW1<p((y . Z and W are f(x. 0 Then. 1 X m ) considered as a random vector in Rmq and p(x. where the p.6) .. 1). . Suppose X has a B(n.h(X)=( : ) .) i=1 m B(8)] ( 1.. Suppose X = (Z.6.Z)OI) Z)'O'J} (2rrO)1 exp { (2rr)1 exp {  ~ [z' + (y  ~z' } exp { . B) are the corresponding density (frequency) functions..} .) exp[~(O)T(x...O) = f(z)f.~O'(y  z)' logO}.6.6. is the family of distributions of X = (Xl' .50 .3) I' . 0 E e.B(O)=nlog(I0). . This is a oneparameter exponential family distribution with q = 2. If {pJ=». the family of distributions of X is a oneparameter exponential family with q=I.' x. 1 (1.2. q = 1. fonn a oneparameter exponential family as in (1. B(O) = logO. .4) Therefore. 1.O) f(z.0)].6. the Pe form a oneparameter exponential family with ~~I . h(x) = (2rr)1 exp { . for x E {O. Statistical Models.T(x) = (y  z)'. . The Binomial Family.~O'.1](0) = . Specifically. + OW.6. .1).O) IIh(x.6.5) o = 2. Example 1. Here is an example where q (1.1](O)=log(I~0). suppose Xl. I I! . Goals.Xm are independent and identically distributed with common distribution Pe. o The families of distributions obtained by sampling from oneparameter exponential families are themselves oneparameter exponential families. and Performance Criteria Chapter 1 Therefore. (1.h(x) = .O) = ( : ) 0'(1_ Or" ( : ) ex p [xlog(1 ~ 0) + nlog(l. y)T where Y = Z independent N(O. 0) distribution. Then > 0. .1](0) = logO. p(x. .
1 X m). and h. then the m ) fonn a I Xi. •.8) .Xm ) = I Xi is distributed as P(mO). and pJ LT(Xi). .6. . Therefore..\) (3(r..6.Il)" .· .Section 1. For example. 1 X m ) is a vector of independent and identically distributed P(O) random variables and m ) is the family of distributions of x. . B(m)(o) ~ mB(O). · . then qCm) = mq.. 1)(0) N(I'.7) Note that the natural sufficient statistic T(m) is onedimensional whatever be m. B. 1]. We leave the proof of these assertions to the reader. fl.6. X m ) corresponding to the oneparameter exponential fam1 T(Xd.2) r(p. B.\ 1"/'" (p (s 1) 1) 1) (r T(x) x (x . and h. .1 the sufficient statistic T :m)(x1.1. then the family of distributions of the statistic T(X) is a oneparameter exponential family of discrete distributions whose frequency functions may be written h'Wexp{1)(O)t . I:::n I::n Theorem 1. pi pi I:::n TABLE 1..6. This family of Poisson distIibutions is oneparameter exponential whatever be m. oneparameter exponential family with natural sufficient statistic T(m)(x) = Some other important examples are summarized in the following table. i=l (1. . . In the discrete case we can establish the following general result. ily of distributions of a sample from any of the foregoin~ is just In our first Example 1.. Let {Pe } be a oneparameter exponential family of discrete distributions with corresponding functions T.x log x 10g(l x) logx The statistic T(m) (X I.. .. the m) fonn a oneparameter exponential family.\ fixed r fixed s fixed 1/2. If we use the superscript m to denote the corresponding T.' fixed I' fixed p fixed . h(m)(x) ~ i=l II h(xi). .. if X = (Xl.6 Exponential Families 51 where x = (x 1 .1 Family of distributions .B(O)} for suitable h * ..
Proof. We give the proof in the discrete case. By (1.6.1),

Pθ[T(X) = t] = Σ_{x: T(x) = t} p(x, θ) = exp[η(θ)t - B(θ)] { Σ_{x: T(x) = t} h(x) }.

If we let h*(t) = Σ_{x: T(x) = t} h(x), the result follows. □

A similar theorem holds in the continuous case if the distributions of T(X) are themselves continuous.

Canonical exponential families. We obtain an important and useful reparametrization of the exponential family (1.6.1) by letting the model be indexed by η rather than θ. The exponential family then has the form

q(x, η) = h(x) exp[ηT(x) - A(η)],   x ∈ X ⊂ R^q,   (1.6.9)

where A(η) = log ∫ h(x) exp[ηT(x)] dx in the continuous case and the integral is replaced by a sum in the discrete case. If θ ∈ Θ, then A(η(θ)) must be finite. Let E be the collection of all η such that A(η) is finite. Then, as we show later, E is either an interval or all of R, and the class of models (1.6.9) with η ∈ E contains the class of models with θ ∈ Θ. The model given by (1.6.9) with η ranging over E is called the canonical one-parameter exponential family generated by T and h; E is called the natural parameter space and T is called the natural sufficient statistic.

Example 1.6.1. (continued). The Poisson family in canonical form is

q(x, η) = (1/x!) exp{ηx - exp[η]},   x ∈ {0, 1, 2, ...},

where η = log θ. Here

exp{A(η)} = Σ_{x=0}^∞ e^(ηx)/x! = Σ_{x=0}^∞ (e^η)^x/x! = exp(e^η),

so A(η) = e^η and E = R. □

Here is a useful result.

Theorem 1.6.2. If X is distributed according to (1.6.9) and η is an interior point of E, the moment-generating function of T(X) exists and is given by

M(s) = exp[A(s + η) - A(η)]

for s in some neighborhood of 0.
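Theorem 1.6.2 can be checked numerically. The sketch below does so for the binomial family in canonical form, where A(η) = n log(1 + e^η): exp[A(η + s) - A(η)] matches the moment-generating function E e^{sT}, and the first two derivatives of A give the mean and variance of T. The values of n, θ, and s are illustrative.

```python
import numpy as np
from scipy.stats import binom

# Numerical sketch of Theorem 1.6.2 for the binomial family in canonical form.
n, theta, s = 12, 0.35, 0.3
eta = np.log(theta / (1 - theta))

def A(e):
    return n * np.log1p(np.exp(e))                    # A(eta) = n log(1 + e^eta)

dA = n * np.exp(eta) / (1 + np.exp(eta))              # A'(eta)  = n * theta
d2A = n * np.exp(eta) / (1 + np.exp(eta)) ** 2        # A''(eta) = n * theta * (1 - theta)

print(dA, binom.mean(n, theta))                       # equal
print(d2A, binom.var(n, theta))                       # equal
print(np.exp(A(eta + s) - A(eta)),
      binom.expect(lambda x: np.exp(s * x), args=(n, theta)))   # MGF matches E[e^{sT}]
```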
c R k . X n is a sample from a population with density p(x. We give the proof in the continuous case. if there exist realvalued functions 171. h on Rq such that the density (frequency) functions of the Po may be written as. Direct computation of these moments is more complicated. x> 0. This is known as the Rayleigh distribution. 02 ~ 1/21]. ' e k p(x. .A(1])] because the last factor. exp[A(s + 1]) . A family of distrib~tioos {PO: 0 E 8}.<·or . is one. E(T(X)) ~ A'(1]).. Var(T(X)) ~ A"(1]). More generally.O) = h(x) exp[L 1]j(O)Tj (x) . and Dannois were led in their investigations to the following family of distributions.8) ~ (x/8 2)exp(_x 2/28 2). 2 i=1 i=l n 1 n Here 1] = 1/20 2. which is naturally indexed by a kdimensional parameter and admit a kdimensional sufficient statistic..2 The Multiparameter Case Our discussion of the "natural form" suggests that oneparameter exponential families are naturally indexed by a onedimensional real parameter fJ and admit a onedimensional sufficient statistic T(x). 1. is said to be a kparameter exponential family. 0 = n log 82 and A(1]) 1 xl has mean nit} = = nlog(21]). (1. nlog0 J.17k and B of 6. Pitman. We compute M(s) = = E(exp(sT(X))) {exp[A(s ~ + 1])  A(1])]} J'" J J'" J h(x)exp[(s + 1])T(x) .8 > O. x E X j=1 c Rq.j8 2))exp(i=l n n I>U282) ~=1 = (il xi)exp[202 LX.8(0)].6. being the integral of a density.10) . The rest of the theorem follows from the momentenerating property of M(s) (see Section A. 1 Tk.6. Koopman.6 Exponential Families 53 Moreover. Proof.. It is used to model the density of "time until failure" for certain types of equipment.A(1])]dx h(x)exp[(s + 1])T(x)  A(s + 1])Jdx = . Therefore. Now p(x. 0 Here is a typical application of this result.12).Section 1. _ .6. . B(O) the natural sufficient statistic E~ 2n(j2 and variance nlt}2 4n04 .4 Suppose X 1> ••• . Example 1. and realvalued functions T 1.8) = (il(x.
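The moments obtained from A in the Rayleigh example can be confirmed by simulation: the natural sufficient statistic Σ Xi² has mean 2nθ² and variance 4nθ⁴. The values of θ and n below are illustrative.

```python
import numpy as np

# Simulation check for the Rayleigh family: mean and variance of sum(X_i^2).
rng = np.random.default_rng(7)
theta, n, reps = 1.5, 10, 200_000

X = rng.rayleigh(scale=theta, size=(reps, n))
T = (X**2).sum(axis=1)                                # natural sufficient statistic

print(T.mean(), 2 * n * theta**2)                     # nearly equal
print(T.var(), 4 * n * theta**4)                      # nearly equal
```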
In either case.. 0 Again it will be convenient to consider the "biggest" families. letting the model be Thus.A('l)}.'l) = h(x)exp{TT(x)'l.Ot = fl. . J1 < 00.. . . q(x.. The density of Po may be written as e ~ {(I". .11) (]"2. the canonical kparameter exponential indexed by 71 = (fJI.1.. 1)2(0) = 2 " B(O) " 1"' 1 " T.'" . we define the natural parameter space as t: = ('l E R k : 00 < A('l) < oo}.x ..5..'). the vector T(X) = (T. Suppose lhat P. and Performance Criteria Chapter 1 By Theorem 1.10).. The Normal Family. If we observe a sample X = (X" .' . .') population. . Again.').6.fJk)T rather than family generated by T and h is e. + log(27r"'». . i=l ..2(".. suppose X = (Xl. 8 2 = and I" 1 2' Tl(x) = x. (]"2 > O}. (1. X m ) from a N(I". . then the preceding discussion leads us to the natural sufficient statistic m m (LXi.') : 00 < f1 x2 1 J12 2 p(x.6.Xm ) where the Xi are independent and identically distributed and their common distribution ranges over a k~parameter exponential family given by (1..Tk(x)f and. Goals. h(x) = I.5.LTk(Xi )) t=1 Example 1. . It will be referred to as a natural sufficient statistic of the family. A(71) is defined in the same way except integrals over Rq are replaced by sums..2.(X) •.6. .4). I ! In the discrete case. which corresponds to a twOparameter exponential family with q = I. .TdX))T is sufficient.. in the conlinuous case. . i=l i=1 which we obtained in the previous section (Example 1.LX.(x) = x'.x E X c Rq where T(x) = (T1 (x). = N(I". 2 (". +log(27r" »)]..I' 54 Statistical Models. ..O) =exp[".3. Then the distributions of X form a kparameter exponential family with natural sufficient statistic m m TCm)(x) = (LTl(Xi).
'.~3 < OJ.j j=l = 1.T.EziY.~.Section 1. 0) ~ exp{L .5./(7'. 1/) =exp{T'f.) + log(rr/.(x) = L:~ I l[X i = j]. ~3 and t: = {(~1.. and rewriting kl q(x.5.k. Y n are independent. ~l ~ /. a 2 ). In this example.= {(~1. E R. 2. k}] with canonical parameter 0. . where m. A) = I1:~1 AJ'(X). We write the outcome vector as X = (Xl.kj. Example 1. = n'Ezr Example 1. .7.'I2.4 and 1.j = 1."k.~2)]. . is not and all c. In this example. A(1/) = 4n[~:+m'~5+z~1~'+2Iog(rr/~3)]. 0) for 1 = (1.~./(7'. . < OJ. However 0. = 1/2(7'.. Example 1. Now we can write the likelihood as k qo(x.1. ..5.5 that Y I .._l)(x)1/nlog(l where + Le"')) j=1 . as X and the sample space of each Xi is the k categories {I.(x). 0. I <j < k 1. in Examples 1. with J./Ak) = ". .~l= (3. and AJ ~ P(X i ~ j). n. . . . Linear Regression. A E A. i = 1. E R k .~·(x) . h(x) = I andt: ~ R x R..'12 ~ (3. Let T..)T. 0 + el) ~ qo(x.6.5. Multinomial Trials. .. the density of Y = (Yb ..~3): ~l E R.k../(7'. TT(x) = T(Y) ~ (EY"EY...)j.6.~3 ~ 1/2(7'. k ~ 2.ti = f31 + (32Zi. ~.. . We observe the outcomes of n independent trials where each trial can end up in one of k possible categories.. Then p(x. (N(/" (7') continued}. . and t: = Rk . ...id. remedied by considering IV ~j = log(A.n log j=l L exp(.. Yi rv N(Jli.. ... where A is the simplex {A E R k : 0 < A.6. .. j=1 k This is a kparameter canonical exponential family generated by TIl"" Tk and h(x) = II~ I l[Xi E {I. A(1/) ~ ~[(~U2~.Xn)'r where the Xi are i.(x»). I yn)T can be put in canonical form with k = 3.~2): ~l E R. . < l. we can achieve this by the reparametrization k Aj = eO] /2:~ e Oj . From Exam~ pIe 1.6 Exponential Families 55 (x. L71 Aj = I}.6. .. . Suppose a<. It will often be more convenient to work with unrestricted parameters. This can be identifiable because qo(x. x') = (T.
See Problem 1. . be independent binomial.7 and X = (X t. 1/(0») (1. as X. " .i. . . is unchanged. X n ) T where the Xi are i. Logistic Regression. the parameters 'Ii = Note that the model for X Statistical Models. 17). 0) ~ q(x... Here 'Ii = log 1\. if c Re and 1/( 0) = BkxeO C R k . is an npararneter canonical exponential family with Yi by T(Yt . Goals. < X n be specified levels and (1..6. then the resulting submodel of P above is a submodel of the exponential family generated by BTT(X) and h. let Xt =integers from 0 to ni generated Vi < n. 1]) is a k and h(x) = rr~ 1 l[xi E over.56 Note that q(x. Similarly. I < k. . More10g(P'I[X = jI!P'IIX ~ k]).d.. Let Y. 0 1. this. 1 < i < n.13) . If the Ai are unrestricted.3 Building Exponential Families Submodels A submodel of a kparameter canonical exponential family {q(x.6. 1 <j < k .6.6. and 1] is a map from e to a subset of R k • Thus. M(T) sponding to and ~ MexkT + hex" it is easy to see that the family generated by M(T(X» and h is the subfamily of P corre 1/(0) = MTO. < . 0 < Ai < 1. 1 < i < n. then all models for X are exponential families because they are submodels of the multinomial trials model. are identifiable. B(n.2.17 e for details. 11 E £ C R k } is an exponential family defined by p(x.6.. h(y) = 07 I ( ~. " Example 1. and Performance Criteria Chapter 1 1 parameter canonical exponential family generated by T (k 1) {I. ) 1(0 < However. where (J E e c R1.1. k}] with canonical parameter TJ and £ = Rk~ 1.6. from Example 1.8.·.6. if X is discrete Affine transformations from Rk to RI defined by UP is the canonical family generated by T kx 1 and hand M is the affine transformation I' . Ai).' A(1/) = L:7 1 ni 10g(l + e'·).12) taking on k values as in Example 1..). Here is an example of affine transformations of 6 and T.Y n ) ~ Y. .
Assume also: (a) No interaction between animals (independence) in relation to drug effects (b) The distribution of X in the animal population is logistic. The Yi represent the number of animals dying out of ni when exposed to level Xi of the substance. . Yn. Suppose that YI ~ . . Y.6.xn)T.a 2 ). y. Then (and only then).Xn i. are called curved exponential families provided they do not form a canonical exponential family in the 8 parametrization..x = (Xl.9.. '.x)W'.'. 8) in the 8 parametrization is a canonical exponential family.5 a 2nparamctercanonical exponential family model with fJi = P. where 1 is (1.). = '\501 and 'fJ2(0) = ~'\5B2.6..12) with the range of 1J( 8) restricted to a subset of dimension l with I < k .. > O. the 8 parametrization has dimension 2.) = L ni log(1 + exp(8. In the nonnal case with Xl.'V T(Y) = (Y" . . p(x. with B = p" we can write where T 1 = ~~ I Xi. i=I This model is sometimes applied in experiments to determine the toxicity of a substance.x and (1. .10. .6. . x). Gaussian with Fixed SignaltoNoise Ratio. o Curved exponential families Exponential families (1. a.. . N(Jl.6. }Ii N(J1.. generated by . and l1n+i = 1/2a.6. E R. + 8.6 Exponential Families 57 This is a linear transformation TJ(8) = B nxz 8 corresponding to B nx2 = (1. LocationScale Regression. then this is the twoparameter canonical exponential family generated by lYIY = (L~l. Example 1. + 8.6.00). 'fJ1(B) curved exponential family with l = 1. that is.PIX < x])) 8.i/a.. n.i.. PIX < xl ~ [1 + exp{ (8. Then.8. If each Iii ranges over R and each a. so it is not a curved family. Example 1.!t is assumed that each animal has a random toxicity threshold X such that death results if and only if a substance level on or above X is applied..i. . This is a 0 In Example 1... log(P[X < xl!(1 . suppose the ratio IJlI/a.Section 1. + 8. ..Xi)). ranges over (0.8. SetM = B T . .1)T. I ii.. is a known constant '\0 > O.14) 8..8.1... which is called the coefficient of variation or signaltonoise ratio. i = 1.13) holds..d. this is by Example 1. . ~ (1. which is less than k = n when n > 3.)T .6. Y n are independent. L:~ 1 Xl yi)T and h with A(8. T 2 = L:~ 1 X. However.
10 and 1. Bickel.g. 1 Bj(O).5.2.58 Statistical Models.6. We extend the statement of Theorem 1.E Yj c Rq. Carroll and Ruppert. 1978.8.) = (Yj.. 1.) depend on the value Zi of some covariate. n. the map 1/(0) is Because L:~ 11]i(6)Yi + L:~ i17n+i(6)Y? cannot be written in the fonn ~. Recall from Section B.10). We return to curved exponential family models in Section 2._I11. . In Example 1. ~. yn)T is modeled by the exponential family generated by T(Y) ~. but a curved exponential family model with 0 1=3. Ii. for unknown parameters 8 1 E R.8 note that (1. say.12.. and Performance Criteria Chapter 1 and h(Y) = 1.d.6 are heteroscedastic and homoscedastic models. ..4 Properties of Exponential Families Theorem 1. 1 < j < n. Section 15. 8. Supermode1s We have already noted that the exponential family structure is preserved under i. a.6. Goals. Sections 2. Thus.3.1. 1989. and Snedecor and Cochran.5 that for any random vector Tkxl. respectively. M(s) as the momentgenerating function.(O)Tj'(Y) for some 11.12) is not an exponential family model. as being distributed according to a twoparameter family generated by Tj(Y. and B(O) = ~.6. 83 > 0 (e.xjYj ) and we can apply the supennode! approach to reach the same conclusion as before.1/(O)) as defined in (6.i. and = Ee sTT . For 8 = (8 1 . Examples 1. 1 Tj(Y. Models in which the variance Var(}'i) depends on i are called heteroscedastic whereas models in which Var(Yi ) does not depend on i are called homoscedastic. sampling.(0). E R.. 1988.6.1 generalizes directly to kparameter families as does its continuous analogue.13) exhibits Y. we define I I . with an exponential family density Then Y = (Y1. TJ'(Y). 0) ~ q(y. Let Yj .6.8. then p(y. Even more is true.) and 1 h'(YJ)' with pararneter1/(O).6. Next suppose that (JLi.6. be independent.).
a)'72) < aA('7l) + (1 .T(x)).6.A('7o)} vaLidfor aLL s such that 110 + 5 E E. J u(x)v(x)h(x)dx < (J ur(x)h(x)dx)~(Jv'(x)h(x)dx)~.9. If '7" '72 E + (1 .3. h(x) > 0. (with 00 pennitted on either side). h) with corresponding natural parameter space E and function A(11). vex) = exp«1 . .6. then T(X) has under 110 a moment M(s) ~ exp{A('7o + s) .6.a)'72 E is proved in exactly the same way as Theorem 1. v(x). Suppose '70' '71 E t: and 0 < a < 1. Since 110 is an interior point this set of5 includes a baLL about O. J exp('7 TT (x))h(x)dx > 0 for all '7 we conclude from (1.a)A('72) Because (1. Under the conditions ofTheorem 1.6. S > 0 with ~ + ~ = 1. ('70) II· The corollary follows immediately from Theorem B. Proof of Theorem 1. 8".1. u(x) ~ exp(a'7.Section 1. Let P be a canonicaL kparameter exponential famiLy generated by (T.5..8".r(T) = IICov(7~.6.15) that a'7. 8A 8A T" = A('7o) f/J. 1O)llkxk.6. Substitute ~ ~ a.6.3 V.6. Then (a) E is convex (b) A: E ) R is convex (c) If E has nonempty interior in R k generating function M given by and'TJo E E.2. T.15) is finite.6. . A where A( '70) = (8".4).6. .a)'7rT(x)) and take logs of both sides to obtain. + (1 .3(c).15) t: the righthand side of (1.a.1 give a classical result in Example 1. ('70)) . A(U'71 Which is (b).1 and Theorem 1. t: and (a) follows.3.. By the Holder inequality (B. ('70). Theorem 1.A('70) = II 8".r'7o T(X) .6 Expbnential Families 59 V. for any u(x). Corollary 1.6. ~ = 1 . We prove (b) first.6. Fiually (c) 0 The formulae of Corollary 1.
ifn ~ 1.(Tj(X» = P>. P7)[L. .6. + 9. and ry.4 that follows. Sintilarly. Suppose P = (q(x.60 Statistical Models. and Performance Criteria Chapter 1 Example 1.1. Formally. p:x. o The rank of an exponential family I . .6. if we consider Y with n'> 2 and Xl < X n the family as we have seen remains of rank < 2 and is in fact of rank 2. But the rank of the family is 1 and 8 1 and 82 are not identifiable. (continued).6. we are writing the oneparameter binomial family corresponding to Yl as a twoparameter family with generating statistic (Y1 .1 is in fact its rank and this is seen in Theorem 1.7). 9" 9. (i) P is a/rank k.7.lx ~ j] = AJ ~ e"'ILe a . (ii) 7) is a parameter (identifiable).~~) < 00 for all x. .~. II ! I Ii . ajTj(X) = ak+d < 1 unless all aj are O. such that h(x) > O. T.4.(9) = 9. Tk(X) are linearly independent with positive probability.6. It is intuitively clear that k . . We establish the connection and other fundamental relationships in Theorem 1. (iii) Var7)(T) is positive definite. Goads.. Theorem 1. 7) E f} is a canonical exponential/amily generated by (TkXI ' h) with natural parameter space e such that E is open. Our discussion suggests a link between rank and identifiability of the 'TJ parameterization. f=l . Here. for all 9 because 0 < p«X. using the a: parametrization.6.7 we can see that the multinomial family is of rank at most k . An exponential family is of rank k iff the generating statistic T is kdimensional and 1. k A(a) ~ nlog(Le"') j=l and k E>.x. there is a minimal dimension. Xl Y1 ). Note that PO(A) = 0 or Po (A) < 1 for some 0 iff the corresponding statement holds i. (X)".8. However."I 11 J Evidently every kparameter exponential family is also k'dimensional with k' > k. 2' Going back to Example 1.6.4. in Example 1. However. Then the following are equivalent.
0 Corollary 1. ". '" '.1)o) : 1)0 + c(1). Let ~ () denote "(. = Proof. (b) logq(x..'t·· ... Thus. which that T implies that A"(TJ) = 0 for all TJ and. A' is constant. by Theorem 1. Note that.. hence. (iii) = (iv) and the same discussion shows that (iii) .. A'(TJ) is strictly monotone increasing and 11. ~ (ii) ~ _~ (i) (ii) = P1).A(TJI)}h(x) ~ exp{TJ2T(x) . Taking logs we obtain (TJI . . ranges over A(I'). hence.. Suppose tlult the conditions of Theorem 1. Then (a) P may be uniquely parametrized by "'(1)) E1)T(X) where".' . Apply the case k = 1 to Q to get ~ (ii) =~ (i).) with probability 1 =~(i).6. A is defined on all ofE.3. Now (iii) =} A"(TJ) > 0 by Theorem 1.T ~ a2] = 1 for al of O. all 1) = (~i) II. . for all T}.6. A"(1]O) = 0 for some 1]0 implies c.(ii) = (iii). (iv) '" (v) = (tii) Properties (iv) and (v) are equivalent to the statements holding for every Q defined as previously for arbitrary 1)0' 1). by our remarks in the discussion of rank. = Proof ofthe general case sketched I. have (i) . of 1)0· Let Q = (P1)o+o(1).2.6.Section 1. 1)0) E n· Q is the exponential family (oneparameter) generated by (1)1 . ~ (i) =~ (iii) ~ (i) = P1)[aT T = cJ ~ 1 for some a of 0.4 hold and P is of rank k. III. with probability 1. 0 .A(TJ2))h(x). Conversely." Then ~(i) > 1 is then sketched with '* P"[a. because E is open.. all 1) ~ (iii) = a T Var1)(T)a = Var1)(aT T) = 0 for some a of 0. We.. 1)) is a strictly concavefunction of1) On 1'. F!~ . This is equivalent to Var"(T) ~ 0 '*~ (iii) {=} f"V(ii) There exist T}I =I= 1J2 such that F Tll = Pm' Equivalently exp{r/.1)o)TT.T(x) . :.~> .TJ2)T(X) = A(TJ2) .6 Exponential Families 61 (iv) 1) ~ A(1)) is 11 onto (v) A is strictly convex on E.6. . We give a detailed proof for k = 1. This is just a restatement of (iv) and (v) of the theorem.(v).A(TJ.2 and. ~ P1)o some 1). The proof for k details left to a problem. thus.) is false. Proof.
10).tpx 1 and positive definite variance covariance matrix L::pxp' iff its density is f(Y. 0 ! = 1.p/2 I exp{ . (1. .. .Yn)T follows the k = p(p + 3)/2 parameter exponential family with T = (EiYi .Jl)TE.Yp. E. E) = Idet(E) 1 1 2 / .6.Jl)}..6.~p).. which is obviously a 11 function of (J1. T (and h generalizing Example 1.(9) LT. The p Van'ate Gaussian Family. Recall that Y px 1 has a p variate Gaussian distribution. ifY 1...6."5) family by E(X). By our supermodel discussion. For {N(/" "')}. and that (. Goals...6. An important exponential family is based on the multivariate Gaussian distributions of Section 8.62 Statistical Models. .2 p p (1. It may be shown (Problem 1._.17) can be rewritten ( 2: 1 <i<j~p aijYiYj + ~ 2: aii r:2 ) + L(2: aij /lj)Yi i=l i=l j=l p where E.5 Conjugate Families of Prior Distributions In Section 1. families to which the posterior after sampling also belongs. 9 ~ (Jl.4 applies. with mean J. where we identify the second element ofT. which is a p x p symmetric matrix.)] eXP{L 1/.(Y).6.6.17) The first two terms on the right in (1. E) _~yTEIy + (E1JllY 2 I 2 (log Idet(E)1 + JlT I P log".Jl) .Xn is a sample from the kparameter exponential family (1. This is a special case of conjugate families of priors. (1. B(9) = (log Idet(E) I+JlTEl Jl).Jl.ljh<i<. We close the present discussion of exponential families with the following example. and Performance Criteria Chapter 1 The relation in (a) is sometimes evident and the j. the relation in (a) may be far from obvious (see Problem 1. E(X. ..6. . then X .1 '" 11"'. and. ~iYi Vi).21).I. See Section 2.t parametrization is close to the initial parametrization of classical P...2 we considered beta prior distributions for the probability of success in n Bernoulli trials. the N(/l. However.{Y..6. Then p(xI9) = III h(x._. . Jl.5.. Suppose Xl.3.6. revealing that this is a k = p(p + 3)/2 parameter exponential family with statistics (Yi.(Xi) i=l j=l i=l n • n nB(9)}.. . E). with its distinct p(p + I) /2 entries.2 (Y . the 8(11) 0) family is parametrized by E(X). as we always do in the Bayesian context. N p (1J. Thus. write p(x 19) for p(x.16) Rewriting the exponent we obtain 10gf(Y. 0). The corollary will prove very important in estimation theory.a 2 + J12). h(Y) . so that Theorem 1.E).11. Y n are iid Np(Jl.1 (Y .6.11. E)...6. Example 1... ..a 2 ).X') = (JL. where X is the Bernoulli trial...18) _.29) that I) generate this family and that the rank of the family is indeed p(p + 3)/2. is open. ..6.
6.36).u5) sample.O<W(tJ. the parameter t of the prior distribution is updated to s = (t + a). .22) 02 p(xIO) ex exp{ 2 . We assume that rl is nonempty (see Problem 1.. .'''' I:~ 1 Tk(Xi).) . . To choose a prior distribution for model defined by (1.'" . n)T D It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way.6. be "parameters" and treating () as the variable of interest.18).6.6.(0) ~ exp{L ~j(O)fj .. ..Xn is a N(O. tk+l) E 0. let t = (tl. X n become available.6.(Olx) ex p(xlO)". . then k n ..2)' Uo 2uo Ox . (1. 0 Remark 1.21) is an updating formula in the sense that as data Xl.. . Because two probability densities that are proportional must be equal.19) o {(iJ..18) by letting 11.(lk+1 j=1 i=1 + n)B(O)) ex ".6.6.. .(Xi)+ f. .6 Exponential Families 63 where () E e... Forn = I e.fkIJ)<oo) with integrals replaced by sums in the discrete case. where a = (I:~ 1 T1 (Xi). then Proposition 1. k.6. and t j = L~ 1 Tj(xd.. ik + ~Tk(Xi)' tk+l + 11. .1. is a conjugate prior to p(xIO) given by (1.Section 1.20) where t = (fJ. j = 1. Example 1. If p(xIO) is given by (1. 1r(6jx) is the member of the exponential family (1.fk+Jl. Proof.18) and" by (1. .6. . That is..6.. .fk+IB(O) logw(t)) j=1 (1.21) where S = (SI.(0). n n )T and ex indicates that the two sides are proportional functions of ().21) and our assertion follows. Note that (1..6.1. A conjugate exponential family is obtained from (1.20).(O) ex exp{L ~J(O)(LT.6.12.20). we consider the conjugate family of the (1.. . The (k + I)parameter exponential/amity given by k ".6.6. Suppose Xl.6.20) given by the last expression in (1..6. tk+ I) T and w(t) = 1= eXP{hiJ~J(O)' _='" _= 1 k ik+JB(II)}dO J···dOk (1."" Sk+l)T = ( t} + ~TdxiL . which is k~dimensional. where u6 is known and (j is unknown.
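The beta/binomial case mentioned above can be verified directly: with a Beta(a, b) prior on θ and k successes in n Bernoulli trials, the updating produces a Beta(a + k, b + n - k) posterior. The sketch below confirms this by numerically normalizing prior times likelihood on a grid; the values of a, b, n, and k are illustrative.

```python
import numpy as np
from scipy.stats import beta, binom

# Brute-force check of beta/binomial conjugacy on a grid of theta values.
a, b, n, k = 2.0, 3.0, 10, 7
theta = np.linspace(1e-4, 1 - 1e-4, 2001)

unnormalized = beta.pdf(theta, a, b) * binom.pmf(k, n, theta)
posterior = unnormalized / np.trapz(unnormalized, theta)

print(np.max(np.abs(posterior - beta.pdf(theta, a + k, b + n - k))))   # ~ 0
```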
Example 1.6.12. Suppose X_1, ..., X_n is a N(θ, σ_0²) sample, where σ_0² is known and θ is unknown. To choose a prior distribution for θ, we consider the conjugate family of the model defined by (1.6.20). For n = 1,

p(x | θ) ∝ exp{(θ/σ_0²) x - θ²/(2σ_0²)}.   (1.6.22)

This is a one-parameter exponential family with T(x) = x, η(θ) = θ/σ_0², and B(θ) = θ²/(2σ_0²). The conjugate two-parameter exponential family given by (1.6.20) has density

π_t(θ) ∝ exp{(t_1/σ_0²) θ - (t_2/(2σ_0²)) θ²};   (1.6.23)

π_t(θ) is defined only for t_2 > 0 and all t_1, and is the N(t_1/t_2, σ_0²/t_2) density. Thus, if we start with a N(η_0, τ_0²) prior density, we must have in the (t_1, t_2) parametrization

t_1 = η_0 σ_0² / τ_0²,   t_2 = σ_0² / τ_0².   (1.6.24)

By the updating formula (1.6.21), if we observe Σ_{i=1}^n X_i = s, these parameters become t_1(s) = η_0 σ_0²/τ_0² + s and t_2(n) = σ_0²/τ_0² + n. Upon completing the square, we find that π(θ | x) is a normal density with mean μ(s, n) = t_1(s)/t_2(n) and variance τ²(n) = σ_0²/t_2(n); that is, the posterior is the

N(μ(s, n), τ²(n))   (1.6.25)

density, where

μ(s, n) = (σ_0²/τ_0² + n)^{-1} [s + η_0 σ_0²/τ_0²],   τ²(n) = σ_0² (σ_0²/τ_0² + n)^{-1}.   (1.6.26)

Note that we can rewrite (1.6.26) intuitively as

μ(s, n) = w_1(n) η_0 + w_2(n) x̄   (1.6.27)

where

w_2(n) = n τ²(n)/σ_0²,   w_1(n) = τ²(n)/τ_0² = 1 - w_2(n).   (1.6.28)

That is, the posterior mean is a weighted average of the prior mean η_0 and the sample mean x̄, with the weight on x̄ increasing toward one as n grows.
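Formulas (1.6.26) to (1.6.28) translate directly into a few lines of code. Below is a minimal sketch under the same assumptions as Example 1.6.12 (known σ_0², normal prior); the function name and the numerical values are illustrative choices of mine.

```python
import numpy as np

def normal_posterior(x, eta0, tau0_sq, sigma0_sq):
    """Posterior N(mu, tau_sq) for theta given an i.i.d. N(theta, sigma0_sq)
    sample x, under the N(eta0, tau0_sq) prior, via (1.6.26)-(1.6.28)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t2 = sigma0_sq / tau0_sq + n
    tau_sq = sigma0_sq / t2                      # posterior variance
    w2 = n * tau_sq / sigma0_sq                  # weight on the sample mean
    w1 = tau_sq / tau0_sq                        # weight on the prior mean
    mu = w1 * eta0 + w2 * x.mean()               # same as (s + eta0*sigma0_sq/tau0_sq)/t2
    return mu, tau_sq

rng = np.random.default_rng(2)
x = rng.normal(1.0, 2.0, size=25)                # sigma0 = 2 treated as known
mu, tau_sq = normal_posterior(x, eta0=0.0, tau0_sq=10.0, sigma0_sq=4.0)
print(mu, tau_sq)                                # mu near x.mean(), tau_sq near 4/25
```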
These formulae can be generalized to the case X_i i.i.d. N_p(θ, Σ_0), 1 ≤ i ≤ n, Σ_0 known, θ ~ N_p(η_0, τ_0² I), where η_0 varies over R^p, τ_0² is scalar with τ_0² > 0, and I is the p x p identity matrix (Problem 1.6.37). Moreover, it can be shown (Problem 1.6.30) that the N_p(λ, Γ) family with λ ∈ R^p and Γ symmetric positive definite is a conjugate family to N_p(θ, Σ_0), but a richer one than we've defined in (1.6.20) except for p = 1, because N_p(λ, Γ) is a p(p + 3)/2 rather than a p + 1 parameter family.

Discussion. Note that the model of Example 1.5.3 is not covered by this theory. In fact, the family of distributions in this example and the family U(0, θ) are not exponential; the natural sufficient statistic max(X_1, ..., X_n), which is one-dimensional whatever be the sample size, is not of the form Σ_{i=1}^n T(X_i). Despite the existence of classes of examples such as these, a theory has been built up, starting with Koopman, Pitman, and Darmois, that indicates that under suitable regularity conditions families of distributions, which admit k-dimensional sufficient statistics for all sample sizes, must be k-parameter exponential families. Some interesting results and a survey of the literature may be found in Brown (1986).

In the one-dimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. It is easy to see that one can construct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1.6.20). See Problems 1.6.31 and 1.6.32.

Summary. {P_θ : θ ∈ Θ}, Θ ⊂ R^k, is a k-parameter exponential family of distributions if there are real-valued functions η_1, ..., η_k and B on Θ, and real-valued functions T_1, ..., T_k, h on R^q, such that the density (frequency) function of P_θ can be written as

p(x, θ) = h(x) exp{Σ_{j=1}^k η_j(θ) T_j(x) - B(θ)},  x ∈ X ⊂ R^q.   (1.6.29)

T(X) = (T_1(X), ..., T_k(X)) is called the natural sufficient statistic of the family. The canonical k-parameter exponential family generated by T and h is

q(x, η) = h(x) exp{T^T(x) η - A(η)}

where A(η) = log ∫ ... ∫ h(x) exp{T^T(x) η} dx in the continuous case, with integrals replaced by sums in the discrete case. The set E = {η ∈ R^k : A(η) < ∞} is called the natural parameter space. The set E is convex, and the map A : E → R is convex. If E has a nonempty interior in R^k and η_0 is an interior point of E, then T(X) has, for X ~ P_{η_0}, the moment-generating function

M(s) = exp{A(η_0 + s) - A(η_0)}

for all s such that η_0 + s is in E. Moreover, E_{η_0}[T(X)] = Ȧ(η_0) and Var_{η_0}[T(X)] = Ä(η_0), where Ȧ and Ä denote the gradient and Hessian of A.
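The identities E_{η_0}[T(X)] = Ȧ(η_0) and Var_{η_0}[T(X)] = Ä(η_0) are easy to sanity-check for a concrete family. For the Poisson(θ) family, η = log θ, T(x) = x, and A(η) = e^η, so the gradient and Hessian of A both equal e^η = θ. A minimal numerical check (the family and the numbers are my illustrative choices, not taken from the text):

```python
import numpy as np

# Poisson(theta) in canonical form: eta = log(theta), T(x) = x, A(eta) = exp(eta).
eta0 = np.log(3.5)
A = np.exp                      # A(eta) = exp(eta)

# Finite-difference gradient and Hessian of A at eta0.
h = 1e-5
A_dot = (A(eta0 + h) - A(eta0 - h)) / (2 * h)
A_ddot = (A(eta0 + h) - 2 * A(eta0) + A(eta0 - h)) / h**2

rng = np.random.default_rng(3)
x = rng.poisson(3.5, size=200_000)
print(A_dot, x.mean())          # both close to 3.5
print(A_ddot, x.var())          # both close to 3.5
```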
(ii) '7 is identifiable. (a) A geologist measures the diameters of a large number n of pebbles in an old stream bed.parameter exponential family k ". If P is a canonical exponential family with E open. tk+1) 0 ~ {(t" .. Suppose that the measuring instrument is known to be biased to the positive side by 0. (iii) Var'7(T) is positive definite. State whether the model in question is parametric or nonparametric. .B(O)tk+l logw} j=l where w = 1:" 1: E exp{l:.B(O)}dO. (b) A measuring instrument is being used to obtain n independent determinations of a physical constant J. A family F of priOf distributions for a parameter vector () is called a conjugate family of priors to p(x I 8) if the posterior distribution of () given x is a member of F. tk+. and Performance Criteria Chapter 1 An exponential family is said to be of rank k if T is kdimensional and 1. .L and a 2 but has in advance no knowledge of the magnitudes of the two parameters..7 PROBLEMS AND COMPLEMENTS Problems for Section 1.. 1. He wishes to use his observations to obtain some infonnation about J. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean J. Ii II I .1 units. Give a formal statement of the following models identifying the probability laws of the data and the parameter space.29)..L and variance a 2 .66 Statistical Models. l T k are linearly independent with positive Po probability for some 8 E e.. (v) A is strictlY convex on f. and t = (t" ..1 1.6.) E R k+l : 0 < w < oo}. then the following are equivalent: (i) P is of rank k. Assume that the errors are otherwise identically distributed nonnal random variables with known variance. Goals. T" . (iv) the map '7 ~ A('7) is I ' Ion 1'.(0) = exp{L1)j(O)t) . The (k + 1). is conjugate to the exponential family p(xl9) defined in (1.L.1)) (O)t j .
. 1 Ab) restricted to the sets where l:f I (Xi = 0 and l:~~1 A.... (a) Let U be any random variable and V be any other nonnegative random variable. 0.1.2) where fLij = v + ai + Aj...2). B = (1'1.··. i=l (c) X and Y are independent N (I' I .Xp are independent with X~ ruN(ai + v. i = 1.Section 1. .for this model? (d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean A. = o.. At. . .. Show that Fu+v(t) < Fu(t) for every t. (b) The parametrization of Problem 1..2) and N (1'2. .. An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest.l(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case. Ab.1(c).Xp ). ..) < Fy(t) for every t.2 ). e (e) Same as (d) with (Qt. then X is (b) As in Problem 1. each egg has an unknown chance p of hatching and the hatChing of one egg is independent of the hatching of the others..Xpb . Each . 2. restricted to p = (all' .t. = (al. j = 1. respectively. 0.1. ... (e) The parametrization of Problem l. bare independeht with Xi. . 3.. Two groups of nl and n2 individuals. (If F x and F y are distribution functions such that Fx(t) said to be stochastically larger than Y.. . v. 4.p.. . are sampled at random from a very large population. Q p ) and (A I .1. . and Po is the distribution of X (b) Same as (a) with Q = (X ll .1 describe formaHy the foHowing model... .) (a) The parametrization of Problem 1.. . 1'2) and we observe Y X.) (a) Xl. .7 Problems and Complements 67 (c) In part (b) suppose that the amount of bias is positive but unknown.ap ) : Lai = O}. Once laid.2 ) and Po is the distribution of Xll.. ~ N(l'ij. . Which of the following parametrizations are identifiable? (Prove or disprove. Can you perceive any difficulties in making statements about f1. Q p ) {(al. .. (d) Xi."" a p . . Are the following parametrizations identifiable? (Prove or disprove.1(d). . .
I ! fl. P8[N I . Hint: IfY .  .c bas the same distribution as Y + c. • "j :1 member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour..Lk VI n". 8.1. o < J1 < 1.) (a) p.. 5..k is proposed where 0 < 71" < I.c) = PrY > c . 0'2).. 1 < i < k and (J = (J.. (e) Suppose X ~ N(I'. M k = mk] I! n.. 0 < V < 1 are unknown. (0). VI.t) fj>r all t...1. .. It is known that the drug either has no effect or lowers blood pressure.Li = 71"(1 M)iIJ. k. 0 = (1'. .P(Y < c . }. mk·r. + mk + r = n 1.. Which of the following models are regular? (Prove or disprove.1. 0 < J. . ml ...L. The simplification J. if and only if. Show that Y .. 7. . is the distribution of X when X is unifonn on (0. and Performance Criteria Chapter 1 ":.p(0..9 and they oecur with frequencies p(0. .. What assumptions underlie the simplification? 6. Let Po be the distribution of a treatment response. (a) What are the assumptions underlying this model'? (b) (J is very difficult to estimate here if k is large.4(1).t) = 1 .Li < + .LI ni + .0. . . the density or frequency function p of Y satisfies p( c + t) ~ p( c ..· . then P(Y < t + c) = P( Y < t .. Both Y and p are said to be symmetric about c.. 0 < Vi < 1.0.t).. 68 Statistical Models. Let N i be the number dropping out and M i the number graduating during year i. + nk + ml + .0)... Goals... Mk..O}. V k mk P r where J. = n11MI = mI.2. J.. Consider the two sample models of Examples 1.0'2) (d) Suppose the possible control responses in an experiment are 0.. but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown...' I . whenXis unifonnon {O. . i = 1... Vk) is unknown... 2. .p(0..LI.c has the same distribution as Y + c.2). In each of k subsequent years the number of students graduating and of students dropping out is recorded. The number n of graduate students entering a certain department is recorded. • . . I. Vi = (1 11")(1 ~ v)iI V for i = 1. e = = 1 if X < 1 and Y {J. . .··· . + Vk + P = 1.2.k + VI + . . + fl. The following model is proposed..3(2) and 1.9). Suppose the effect of a treatment is to increase the control response by a fixed amount (J. nk·mI .Nk = nk.1)..I nl . I nl. is the distribution of X e = (0. .:". I . =X if X > 1.. (b)P. Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour. . Let Y and Po is the distribution of Y.
For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1. . r(j) = = PlY = j. I(T < C)) = j] PIC = j] PIT where T. according to the distribution of X.tl). (]"2) independent. > 0.. o(x) ~ 21' + tl .. .. Let c > 0 be a constant.tl) does not imply the constant treatment effect assumption. N} are identifiable. that is. then C(·) ~ F(. . . N.."" Yn ). e) vary freely over F ~ {(I'. e) : p(j) > 0.. 1 < i < n. .1.Section 1. {r(j) : j ~ 0. eY and (c) Suppose a scale model holds for X.7 Problems and Complements 69 (a) Show that if Y ~ X + o(X). Suppose X I.zp. (b) In part (a). 10. are not collinear (linearly independent). The Lelunann 1\voSample Model. if the number of = (min(T.{3p) are identifiable iff Zl. (Yl.. ."') and o(x) is continuous..N = 0. = (Zlj. .. e(j) > 0. suppose X has a distribution F that is not necessarily normal.Xn ).i. X m and Yi.. Therefore. Sx(t) = . .. i ."'). (b) Deduce that (th. fJp) are not identifiable if n pardffieters is larger than the number of observations. Yn denote the survival times of two groups of patients receiving treatments A and B. t > O. Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0.. . ci '" N(O. Show that if we assume that o(x) + x is strictly increasing.2x and X ~ N(J1. The Scale Model. eX o. log y' satisfy a shift model? = Xc.tl) implies that o(x) tl. . Y = T I Y > j]. then satisfy a scale model with parameter e. .Xn are observed i. I} and N is known.Znj?' (a) Show that ({31. .. and (I'. then C(·) = F(.. . the two cases o(x) . . C are independent. Hint: Consider "hazard rates" for Y min(T. That is.. Does X' Y' = yc satisfy a scale model? Does log X'.. e(j). C(t} = F(tjo). Sbow that {p(j) : j = 0.6.. = 9.. p(j).. In Example 1. Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci. . Let X < 1'. N}. 12. 0 < j < N.'" .3 let Xl.. C). . 11. 0 'Lf 'Lf op(j) = I. log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl. or equivalently. .tl and o(x) ~ 21' + tl. . j = 0. o (a) Show that in this case.d. . Y. C(·) = F(· . C). . .2x yield the same distribution for the data (Xl..
l3. Sy(t) = Sg.1) Equivalently.(t). (c) Under the assumptions of (b) above. and Performance CriteriCl Chapter 1 t Ii . as T with survival function So. thus. z) zT. (c) Suppose that T and Y have densities fort) andg(t). I (3p)T of unknowns.ll) with scale parameter 5.(t) with C. t > 0. log So(T) has an exponential distribution.ho(t) is called the Cox proportional hazard model. For treatments A and E. then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll. . 12.12. 13. (b) By extending (bjn) from the rationals to 5 E (0.. t > 0. Show that hy(t) = c. Show that if So is continuous. where TI". then Yi* = g({3.7.z)}. .. Also note that P(X > t) ~ Sort).G(t). Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t). h(t I Zi) = f(t I Zi)jSy(t I Zi). Survival beyond time t is modeled to occur if the events T} > t. Tk > t all occur.. ~ a5. So (T) has a U(O.(t).{a(t).1. 7. . t > 0. 1 P(X > t) =1 (.  .(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y. The most common choice of 9 is the linear form g({3. A proportional hazard model. k = a and b. = exp{g(.. I' 70 StCitistical Models.I ~I .7. . Specify the distribution of €i.d. Then h. hy(t) = c. Moreover.) Show that Sy(t) = S'.2) and that Fo(t) = P(T < t) is known and strictly increasing. respectively.. F(t) and Sy(t) ~ P(Y > t) = 1 . (1. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t).00). The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(... Hint: See Problem 1.2) is equivalent to Sy (t I z) = sf" (t).h. " T k are unobservable and i.). Hint: By Problem 8. Zj) + cj for some appropriate €j. . .l3.) Show that (1.. we have the Lehmann model Sy(t) = si.i. z)) (1. Goals. Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy.7. I (b) Assume (1.. = = (.2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 .2..(t) if and only if Sy(t) = Sg. are called the survival functions.. Set C. I) distribution.l3. show that there is an increasing function Q'(t) such that ifYt = Q'(Y.
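The Lehmann and Cox model problems above relate the survival function of Y given z to a baseline survival function through S_Y(t | z) = S_0^{Δ(z)}(t) with Δ(z) = exp{g(β, z)}. A quick way to convince yourself of this relation is to simulate from a proportional hazards model and compare the two sides. The sketch below uses a constant baseline hazard and a single scalar covariate; these are illustrative choices of mine, not values from the problems.

```python
import numpy as np

rng = np.random.default_rng(5)
lam, beta, z = 0.5, 0.8, 1.3             # baseline hazard, coefficient, covariate
delta = np.exp(beta * z)

# With h(t | z) = lam * delta constant in t, survival times given z are
# exponential with rate lam * delta.
n = 200_000
T = rng.exponential(1.0 / (lam * delta), size=n)

t_grid = np.array([0.5, 1.0, 2.0])
S_emp = np.array([(T > t).mean() for t in t_grid])   # empirical S_Y(t | z)
S0 = np.exp(-lam * t_grid)                           # baseline survival S_0(t)
print(S_emp)
print(S0 ** delta)                                   # should match S_emp closely
```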
'1/1' > 0. Consider a parameter space consisting of two points fh and ()2. the parameter of interest can be characterl ized as the median v = F. 14. still identifiable? If not." . have a N(O.8 0.11 and 1. and suppose that for given ().i.. F. Yn be i.2 0. an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0. wbere the model {(F.. Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x. .1. 15. Find the . where () > 0 and c = 2.p(t)). G. Generally. Show how to choose () to make J.1..Section 1.12.2 1. Here is an example in which the mean is extreme and the median is not. Merging Opinions. When F is not symmetric.000 is the minimum monthly salary for state workers. 1) distribution. 'I/1(±oo) = ±oo.. .. .i. X n are independent with frequency fnnction p(x posterior1r(() I Xl.1. and Zl and Z{ are independent random variables. the median is preferred in this case.d.5) or mean I' = xdF(x) = Fl(u)du.. ) = ~. Observe that it depends only on l:~ I Xi· I 0). 1f(O2 ) = ~. what parameters are? Hint: Ca) PIXI < tl = if>(.. have a N(O. (a) Suppose Zl' Z.v arbitrarily large. (b) Suppose Zl and Z. Show that both . Yl .p and ~ are identifi· able. . (b) Suppose X" . J1 may be very much pulled in the direction of the longer tail of the density.L .p and C.2) distribution with . Problems for Section 1..2 with assumptions (t){4). .G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R. J1 and v are regarded as centers of the distribution F.7 Problems and Complements 71 Hint: See Problems 1. and for this reason.d. x> c x<c 0..Xm be i. Ca) Find the posterior frequency function 1f(0 I x).l (0.X n ). Find the median v and the mean J1 for the values of () where the mean exists.(x/c)e. . Are.O) 1 . In Example 1. Examples are the distribution of income and the distribution of wealth. Let Xl.4 1 0.6 Let 1f be the prior frequency function of 0 defined by 1f(I1.2 unknown.
8).. is used in the fonnula for p(x)? 2.'" 1rajI Show that J + a. X n be distributed as where XI. .'" . (d) Suppose Xl. 0 < x < O. Consider an experiment in which. . < 0 < 1. c(a). . " ••. .. the outcome X has density p(x OJ ~ (2x/O').[X ~ k] ~ (1.. l3(r. This is called the geometric distribution (9(0)).2.0 < 0 < 1.2.)=1. (3) Suppose 8 has prior frequency.. . what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta. 2. Find the posteriordensityof8 given Xl = Xl. . does it matter which prior. ~ = (h I L~l Xi ~ = ...Xn are natural numbers between 1 and Band e= {I. k = 0. 3. Unllormon {'13} . . Let X I.. For this convergence. 7r (d) Give the values of P(B n = 2 and 100. Then P.. . ) ~ .72 Statistical Models. 71"1 (0 2 ) = . }. (b) Find the posterior density of 8 when 71"( OJ = 302 . . X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is .25.. Let 71" denote a prior density for 8. 7l" or 11"1. ':.Xn = X n when 1r(B) = 1. . 'IT . Compare these B's for n = 2 and 100. 1. . Suppose that for given 8 = 0. (3) Find the posterior density of 8 when 71"( 0) ~ 1. " J = [2:. and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 ..0 < B < 1..... 4'2'4 (b) Relative to (a).=l1r(B i )p(x I 8i ).m+I. 0 (c) Find E(81 x) for the two priors in (a) and (b). Goals.O)kO. 1r(J)= 'u . Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O. I XI . (J. for given (J = B. Assume X I'V p(x) = l:. IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree. 4.5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. . "X n ) = c(n 'n+a m) ' J=m. Show that the probability of this set tends to zero as n t 00. . where a> 1 and c(a) . 3. X n are independent with the same distribution as X.75.
7 Problems .and Complements 73 wherern = max(xl. I Xn+k) given . is the standard normal distribution function and n _ T 2 x + n+r+s' a J. . (b) Suppose that max(xl.• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e.. Show that a conjugate family of distributions for the Poisson family is the gamma family.n. X n+ Il . .. X n+ k ) be a sample from a population with density f(x I 0). Assume that A = {x : p( x I 8) > O} does not involve O. Show that the conditional distribution of (6.2. Justify the following approximation to the posterior distribution where q. 11'0) distribution. Show in Example 1. . Let (XI. XI = XI. b) denote the posterior distribution.0 E Let 9 have prior density 1r."'tF·t'. In Example 1.2. S.... .. X n = Xn is that of (Y. a regular model and integrable as a function of O.J:n). W. n. Xn) = Xl = m for all 1 as n + 00 whatever be a. 7. . b) is the distribution of (aV loW)[1 + (aV IbW)]t.Xn is a sample with Xi '"'' p(x I (}). ..f) = [~. . Show rigorously using (1.8) that if in Example 1.Section 1.. (a) Show that the family of priors E. } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is . Show that ?T( m I Xl.··... f(x I f). ZI." . where {i E A and N E {l.2... .b> 1.1. then !3( a.2.. . where E~ 1 Xi = k."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl.. ..• X n = Xn. s).c(b. then the posterior distribution of D given X = k is that of k + Z where Z has a B(N . .(1.'" . 11'0) distribution. 10.. Suppose Xl. are independent standard exponential..1= n+r+s _ ..) = n+r+s Hint: Let!3( a.. ..1. .1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •. Next use the central limit theorem and Slutsky's theorem.. f3(r.1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta. Interpret this result. 6. D = NO has a B(N. If a and b are integers. 9. Va' W t . Xn) + 5. where VI.
) pix I 0) ~ ({" . Note that.1 < j < T.. . 82 I fL. . X 1. Find the posterior distribution 7f(() I x) and show that if>' is an integer. . j=1 '\' • = 1. and Performance Criteria Chapter 1 + nand ({. . and () = a.~l . .O the posterior density 7t( 0 Ix). (12).Xn.. (. (a) If f and 7t are the N(O. . 11.~X) '" tnI.d. T6) densities. LO. N. 01 )=1 j=l Uj < 1.d. . then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I..O. The posterior predictive distribution is the conditional distribution of X n +l given Xl. (c) Find the posterior distribution of 0". 'V ./If (Ito.d.. unconditionally. . 0> 0 a otherwise. Let N = (N l . a = (a1.. Show that if Xl. i). . 0 > O.2 is (called) the precision of the distribution of Xi. Q'j > 0.. = 1. given x.. has density r("" a) IT • fa(u) ~ n' r(. "5) and N(Oo. The Dirichlet distribution. X and 5' are independent with X ~ N(I".)T.. .... Next use Bayes rule. The Dirichlet distribution is a conjugate prior for the multinomial. . (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8. X n are i.J u.i. L. ..Xn + l arei. In a Bayesian model whereX l .. D(a). (N.. x n ).. 0 > O.i. Letp(x I OJ =exp{(xO)}.. po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { . <0<x and let7t(O) = 2exp{20}. O(t + v) has a X~+n distribution. .74 where N' ~ N Statistical Models.. compute the predictive and n + 00. 13. N(I".) be multinomial M(n. Goals. 9= (OJ. x> 0. where Xi known.. j=1 • .. Find I 12.i. . . . •• • .J t • I X."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x.) u/' 0 < cx) L". 52) of J1. . Suppose p(x I 0) is the density of i. . . .9).\ > 0. Here HinJ: Given I" and ". Xl ..~vO}. Q'r)T.") ~ . . 8 2 ) is such that vn(p. 0<0. I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}. v > 0... < 1.2) and we formally put 7t(I". 14. vB has a XX distribution. given (x.6 ""' 1r.. I(x I (J). X n . the predictive distribution is the marginal distribution of X n + l . posterior predictive distribution. (b) Discuss the behavior of the two predictive distributions as 15. 1 X n ..
. (b) Find the minimax rule among {b 1 . and the loss function l((J. Find the Bayes rule when (i) 3.3.1.1..3.) = 0.5. Compute and plot the risk points (a)p=q= . Suppose the possible states of nature are (Jl. a) is given by 0.3.Section 1. 0. . respectively and suppose .7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a).q) and let when at. (J2. then the posteriof7r(0 where n = (nt l · · · . Let the actions corresponding to deciding whether the loss function is given by (from Lehmann... (d) Suppose 0 has prior 1C(01) ~ 1'. 159 of Table 1.p) (I . Suppose that in Example 1. a3. . .. (d) Suppose that 0 has prior 1C(OIl (a). . O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x. 159 for the preceding case (a).1. the possible actions are al. . 1C(O.3. . (b)p=lq=. 1. .5. = 0.3. Find the Bayes rule for case 2. I.159 be the decision rules of Table 1. 8 = 0 or 8 > 0 for some parameter 8. See Example 1. (c) Find the minimax rule among bt. = 11'. (c) Find the minimax rule among the randomized rules. The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O.5.. a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St.3 IN ~ n) is V( a + n).1. 1957) °< °= a 0.5 and (ii) l' = 0. or 0 > a be penoted by I.. n r ).. (J) given by O\x a (I .3. I b"9 }. Problems for Section 1. 1C(02) l' = 0. a2.
(a) Compute the biases. < J' < s. where <f> = 1 <I>... variances.andastratumsamplemeanXj. . Weassurne that the s samples from different strata are independent. Stratified sampling.) r=2s=1. n = 1 and . Suppose X is aN(B. 0 be chosen to make iiI unbiased? I (b) Neyman allocation. geographic locations or age groups). 1) distribution function. are known (estimates will be used in a latet chapter). 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s. c<I>(y'n(sO))+b<I>(y'n(rO)).7. E.j = I. I . b<f>(y'ns) + b<I>(y'nr). and MSEs of iiI and 'j1z.Xnjj. < 00.. . We want to estimate the mean J1. 1 < j < S. and <I> is the N(O. J".) c<f>( y'n(r .g. Goals.8.3) . Within the jth stratum we have a sample of i.. (b) Plot the risk function when b = c = 1.i. Show that the strata sample sizes that minimize MSE(!i2) are given by (1.76 StatistiCCII Models. .I L LXij. are known. .. 1 (l)r=s=l. and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J". 1 < j < S. Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0. 11 . random variables Xlj"".j = 1. (. I I I. 0<0 0=0 0 >0 R(O. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e. iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj..0)). i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4.0)) +b<f>( y'n(s .s. How should nj.(X) 0 b b+c 0 c b+c b 0 where b and c are positive..=l iii = N. Assume that 0 < a.d.
and = 1. l X n . ~ 7. ° = 15..0 ~ + 2'.20.40. P(X ~ b) = 1 ..5(0 + I) and S ~ B(n.0 + '. . 5. that X is the median of the sample.5.. p = .bl for the situation in (i).. The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = .25. X n be a sample from a population with values 0.. U(O.bl when n = compare it to EIX . ali X '"'' F. b = '. I)..3.5.bl.5. . Hint: See Problem B. plot RR for p = .i. > 0. . (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i.= 2:. < ° P < 1. and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii). . (iii) F is normal. Hint: By Problem 1. respectively..13.3.2.15. Hint: See Problem B. (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0. We want to estimate "the" median lJ of F. Suppose that n is odd... Use a numerical integration package.0.. . Suppose that X I.7 Problems and Complements 77 Hint: You may use a Lagrange multiplier.3. (c) Same as (b) except when n ~ 1.=1 Pj(O'j where a. Let Xb and X b denote the sample mean and the sample median of the sample XI b. n ~ 1.) with nk given by (1. ~ ~  ~ 6.7. (d) Find EIX ...'.d. a = . set () = 0 without loss of generality. Also find EIX .2'.b. . [f the parameters of interest are the population mean and median of Xi . ~ ~ ~ . Hint: Use Problem 1.9.45.2. (b) Evaluate RR when n = 1. . Each value has probability .5..0 . N(o. Let X and X denote the sample mean and median. (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X).3. Next note that the distribution of X involves Bernoulli and multinomial trials. 1).b. . Let XI. ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c.1. ~ a) = P(X = c) = P.2.Section 1. and that n is odd. .4.2p. X n are i.=1 pp7J" aV.p).5. . . where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~.. show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift).I 2:. 75.2.3) is N.. > k) (ii) F is uniform.
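Several of the problems above compare the risk of the sample mean and the sample median under different population shapes. A small Monte Carlo check of this kind of comparison can complement the exact calculations; the sketch below is mine, and the distributions, sample size, and replication count are illustrative choices rather than the ones specified in the problems.

```python
import numpy as np

def mse_of_mean_and_median(sampler, n, reps=20_000, center=0.0, rng=None):
    """Monte Carlo MSE of the sample mean and sample median as estimators
    of `center`, for i.i.d. samples of size n drawn by `sampler`."""
    rng = rng or np.random.default_rng(0)
    x = sampler(rng, (reps, n))
    means, medians = x.mean(axis=1), np.median(x, axis=1)
    return ((means - center) ** 2).mean(), ((medians - center) ** 2).mean()

n = 15
normal = lambda rng, size: rng.normal(0.0, 1.0, size)
laplace = lambda rng, size: rng.laplace(0.0, 1.0, size)

print(mse_of_mean_and_median(normal, n))   # mean has smaller MSE under normality
print(mse_of_mean_and_median(laplace, n))  # median wins under these heavier tails
```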
suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i.0) = E.3(a) with b = c = 1 and n = I. 10.0(X))) < E.. then this definition coincides with the definition of an unbiased estimate of e. = c L~ 1 (Xi . 9. ... Let Xl.) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B. 8. (a) Show that s:? = (n . Goals.1)1 E~ 1 (Xi . You may use the fact that E(X i .Xn be a sample from a population with variance a 2 . . . being lefthanded). A person in charge of ordering equipment needs to estimate and uses I . the powerfunction. a) = (6 . then a test function is unbiased in this sense if. A decision rule 15 is said to be unbiased if E.8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company. . II. 0 < a 2 < 00.3. .. Let () denote the proportion of people working in a company who have a certain characteristic (e. i1 .(o(X)).(1(6. I .  . ' .2)(. 6' E e.for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100. I .X)2 has a X~I distribution.1'] . 0) (J(6'. . . If the true 6 is 60 . I I 78 Statistical Models. ~ 2(n _ 1)1.a)2.'). Show that the value of c that minimizes M S E(c/. (a)r = 8 =1 (b)r= ~s =1.L)4 = 3(12.2 ~~ I (Xi .10) + (. 10% have the characteristic. I (b) Show that if we use the 0 1 loss function in testing..X)2.1'])2. (a) Show that if 6 is real and 1(6.J.[X . Which one of the rules is the better one from the Bayes point of view? 11. e 8 ~ (.X)' = ([Xi . satisfies > sup{{J(6. In Problem 1. and Performance Criteria Chapter 1 :! . then expand (Xi . Find MSE(6) and MSE(P).0(X))) for all 6. for all 6' E 8 1..X)2 keeping the square brackets intact.3) that . defined by (J(6.(1(6'.3.4. It is known that in the state where the company is located. and only if.g. (i) Show that MSE(S2) (ii) Let 0'6 c~ . (b) Suppose Xi _ N(I'.X)2 is an unbiased estimator of u 2 • Hint: Write (Xi ..0): 6 E eo}.
we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise. if <p(x) = 1. find the set of I' where MSE(ji) depend on n. Suppose that Po.) for all B. B = . 1) and is independent of X. b. s = r = I. 0 < u < 1. In Example 1.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known. consider the estimator < MSE(X).1. by 1 if<p(X) if <p(X) >u < u. Your answer should Mw If n.Polo(X) ~ 01 = Eo(<p(X)). Convexity ofthe risk set.3. If X ~ x and <p(x) = 0 we decide 90.4. (c) Same as (b) except k = 2.. there is a randomized procedure 03 such that R(B.) + (1 . procedures. Furthersuppose that l(Bo..1'0 I· 17.7). A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1. o Suppose that U . 14.3. and possible actions at and a2. (a) find the value of Wo that minimizes M SE(Mw). wc dccide El.3. Po[o(X) ~ 11 ~ 1 .ao) ~ O. and then JT. Consider the following randomized test J: Observe U..' and 0 = II' . 18.4. then. 16.UfO.1) and let Ok be the decision rule "reject the shipment iff X > k.a )R(B. In Problem 1.1. .3. Compare 02 and 03' 19.1. Consider a decision problem with the possible states of nature 81 and 82 . Define the nonrandomized test Ju . Show that J agrees with rp in the sense that. . 03) = aR(B. 0. consider the loss function (1. In Example 1." (a) Show that the risk is given by (1. 0. use the test J u . but if 0 < <p(x) < 1. (h) If N ~ 10. For Example 1. z > 0. (B) = 0 for some event B implies that Po (B) = 0 for all B E El. Ok) as a function of B.3. (b) find the minimum relative risk of P.Section 1. 15. The interpretation of <p is the following. given 0 < a < 1. show that if c ::.. a) is .~ is unbiased 13.7 Problems and Complements 79 12.wo to X. plot R(O. Show that the procedure O(X) au is admissible. Suppose the loss function feB.3. and k o ~ 3. Show that if J 1 and J 2 are two randomized. Suppose that the set of decision procedures is finite. If U = u.
3.8 0. 4.80 Statistical Models. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z).) = 0..4. Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error.4 = 0. Let X be a random variable with probability function p(x ! B) O\x 0. !. What is 1. In Example 1. 7. The midpoint of the interval of such c is called the conventionally defined median or simply just the median.2 0.. Goals. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. Give the minimax rule among the randomized decision rules. (a) Find the best predictor of Y given Z.7 find the best predictors ofY given X and of X given Y and calculate their MSPEs. (b) Compute the MSPEs of the predictors in (a).9. In Problem B.6 (a) Compute and plot the risk points of the nonrandomized decision rules. and the ratio of its MSPE to that of the best and best linear predictors.Is Z of any value in predicting Y? = Ur + U?. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly.(0. and the best zero intercept linear predictor. its MSPE. (c) Suppose 0 has the prior distribution defined by .1 calculate explicitly the best zero intercept linear predictor. Give an example in which Z can be used to predict Y perfectly. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y. Four balls are drawn at random without replacement.. Let U I .4 I 0. U2 be independent standard normal random variables and set Z Y = U I. 6.(0. 5. al a2 0 3 2 I 0. the best linear predictor. . An urn contains four red and four black balls. Give the minimax rule among the nonrandomized decision rules. 0. and Performance Criteria Chapter 1 B\a 0.1. 0 0.) the Bayes decision rule? Problems for Section 1. !. (b) Give and plot the risk set S.l. 2.
Y)I < Ipl.2 ) distrihution. Suppose that Z has a density p. Y = Y' + Y". (b) Suppose (Z. Show that if (Z. Define Z = Z2 and Y = Zl + Z l Z2. Y" be the corresponding genetic and environmental components Z = ZJ + Z". + z) = p(c . Y'. Yare measurements on such a variable for a randomly selected father and son. 10. p) distribution and Z".Section 1. 7 2 ) variables independent of each other and of (ZI. (b) Show directly that I" minimizes E(IY . Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» . Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables. 12. that is. that is. Y) has a bivariate normal distribution. Z". Y" are N(v. which is symmetric about c and which is unimodal. a 2.cl) = oQ[lc .YI > s. (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13.z) for 11. Let Y have a N(I". Y') have a N(p" J. 9.el) as a function of c. ICor(Z. Suppose that Z has a density p. Show that c is a median of Z. Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor. p( c all z. (a) Show that P[lZ  tl < s] is maximized as a function oftfor each s > ° by t = c. p( z) is nonincreasing for z > c.7 Problems and Complements 81 Hint: If c ElY . z > 0.col < eo. a 2. y l ). exhibit a best predictor of Y given Z for mean absolute prediction error. Suppose that Z.  = ElY cl + (c  co){P[Y > co] .L. Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error. Let Zl and Zz be independent and have exponential distributions with density Ae\z.1"1/0] where Q(t) = 2['I'(t) + t<I>(t)]. Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) .PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8.t. and otherwise. which is symmetric about c. If Y and Z are any two random variables. (a) Show that E(IY . 0. where (ZI . (a) Show that the relation between Z and Y is weaker than that between Z' and y J . Let ZI. ° 14.
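The prediction problems in this part contrast the best predictor E(Y | Z) with the best linear predictor. The contrast is easy to see in a simulated example; in the sketch below (my construction, not one of the problems), the best linear predictor of Y given Z is essentially a constant, while the conditional mean predicts Y well.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
Z = rng.normal(0.0, 1.0, n)
Y = Z**2 + rng.normal(0.0, 0.5, n)        # E(Y | Z) = Z^2

# Best linear predictor a + b*Z computed from the joint moments.
b = np.cov(Z, Y)[0, 1] / Z.var()          # Cov(Z, Y) = E Z^3 = 0, so b is near 0
a = Y.mean() - b * Z.mean()
linear_pred = a + b * Z

cond_mean_pred = Z**2                     # the best (MSPE) predictor here

print(((Y - linear_pred) ** 2).mean())    # about Var(Y) = 2.25
print(((Y - cond_mean_pred) ** 2).mean()) # about Var(eps) = 0.25
```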
. = Corr'(Y. . . Show that p~y linear predictors. Goals. I.4. 2 'T ' . (c) Let '£ = Y . g(Z)) 9 where g(Z) stands for any predictor.) = Var( Z. Yo > O. and is diagnosed with a certain disease. > 0. mvanant un d ersuchh.g(Z)) where £ is the set of 17.ElY . >. (See Pearson. . (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y.4. then'fJh(Z)Y =1JZY. €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18. and Var( Z. in the linear model of Remark 1. and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease. 1). ~~y = 1 . IS.S from infection until detection. that is./L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y). Predicting the past from the present.) = E( Z.3. Show that.) (a) Show that 1]~y > piy. y> O. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z. where piy is the population multiple correlation coefficient of Remark 1. a blood cell or viral load measurement) is obtained. 1995. We are interested in the time Yo = t . Consider a subject who walks into a clinic today. Show that Var(/L(Z))/Var(Y) = Corr'(Y.1] . Let /L(z) = E(Y I Z ~ z). 1905./LdZ) be the linear prediction error. It will be convenient to rescale the problem by introducing Z = (Zo ./L(Z)) = max Corr'(Y.: (0 The best linear MSPE predictor of Y based on Z = z. . hat1s./L£(Z)) ~ maxgEL Corr'(Y. at time t. j3 > 0./L)/<J and Y ~ /3Yo/<J. . on estimation of 1]~y. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1. 82 Statisflcal Models. Let S be the unknown date in the past when the sUbject was infected.15.4. Hint: Recall that E( Z. At the same time t a diagnostic indicator Zo of the severity of the disease (e. 16.I • ! .. .exp{ >. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y.(y) = >. Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo. Hint: See Problem 1.4. . and Performance Criteria Chapter 1 .y}. IS.) ~ 1/>. and Doksurn and Samarov.g.) ~ 1/>''.
2000). x = 1.>. . + b2 Z 2 + W. Let Y be a vector and let r(Y) and s(Y) be real valued. 1990. and checking convexity.Section 1. s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}. b)...~ [y  (z . (c) Show that if Z is real. 7r(Y I z) ~ (27r) .7 and 1.4.>')1 2 } Y> ° where c = 4>(z . . see Berman.4. b) equal to zero. 20. then = E{Cov[r(Y).z). solving for (a. .4. Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2. (c) The optimal predictor of Y given X = x. need to be estimated from cohort studies. (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo. Z2 and W are independent with finite variances. density. and Nonnand and Doksum.4. s(Y)] < 00. Find Corr(Y}.(>' . 1).9. s(Y» given Z = z. where Z}.).14 by setting the derivatives of R(a. I Z]' Z}. Establish 1. N(z .s(Y)] Covlr(Y).I exp { .6) when r = s. (a) Show that ifCov[r(Y). Y2) using (a). c. (b) The MSPE of the optimal predictor of Y based on X. including the "prior" 71".7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density .g(Zo)l· Hint: See Problems 1. Let Y be the number of heads showing when X fair coins are tossed. (In practice. (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>. all the unknowns. 19.z) .. Write Cov[r(Y). 21. s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y). where X is the number of spots showing when a fair die is rolled.. Hint: Use Bayes rule. Cov[r(Y).. (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo . wbere Y j . Find (a) The mean and variance of Y. 6. This density is called the truncated (at zero) normal.>. (b) Show that (a) is equivalent to (1.
24. and suppose that Z has the beta. Y z )? (f) In model (d). Find the 22. .3.4.Xn is a sample from a population with one of the following densities. 0 > 0. and Performance Criteria Chapter 1 (e) In the preceding model (d). 0) = Ox'. (a) p(x. Oax"l exp( Ox"). Zz and ~V have the same variance (T2. z) ~ ep(y.g(z») is called weighted squared prediction error.p~y). 0 > 0.z) be a positive realvalued function. .e)' Hint: Whatever be Y and c. z) and c is the constant that makes Po a density. s).z)lw(y. Let Xi = 1 if the ith item drawn is bad.z). and (ii) w(y. z) = I.0> O. . a > O. population where e > O. p(e). a > O.6(O. 0 (b) p(x.g(z)]'lw(y. Goals. z) = z(1 . . (c) p(x.. (a) Show directly that 1 z:::: 1 Xi is sufficient for 8. n > 2.5 1. This is thebeta. 1 2 y' .. . are N(p. 1). ..z). 0) = < X < 1. we say that there is a 50% overlap between Y 1 and Yz.2eY + e' < 2(Y' + e'). 1 X n be a sample from a Poisson. density.4. 0) = Oa' Ix('H).4. Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem. x  . In Example 1.z) = 6w (y.6(r. . Show that EY' < 00 if and only if E(Y . \ (b) Establish the same result using the factorization theorem. if b1 = b2 and Zl. show that the MSPE of the optimal predictor is . < 00 for all e.e' < (Y ..15) yields (1.') and W ~ N(po.84 Statistical Models. z "5). 00 (b) Suppose that given Z = z. Assume that Eow (Y. where Po(Y. Y ~ B(n. Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad. Let Xl.14). Suppose Xl. g(Z)) < for some 9 and that Po is a density.. In this case what is Corr(Y. (a) Let w(y. Zz). Problems for Section 1. suppose that Zl and Z. Then [y . Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z). 23. .density. 1. This is known as the Weibull density.x > 0.~(1 . optimal predictor of Yz given (Yi. > a. 2. 0 < z < I.. 3.l . and = 0 otherwise.e)' = Y' . Find Po(Z) when (i) w(y. 25. Verify that solving (1.
This is known as the Pareto density. In each case, find a real-valued sufficient statistic for θ, with a fixed.
4. (a) Show that T_1 and T_2 are equivalent statistics if, and only if, we can write T_2 = H(T_1) for some 1-1 transformation H of the range of T_1 into the range of T_2. Which of the following statistics are equivalent? (Prove or disprove.)
(b) Π_{i=1}^n x_i and Σ_{i=1}^n log x_i, x_i > 0
(c) Σ_{i=1}^n x_i and Σ_{i=1}^n log x_i, x_i > 0
(d) (Σ_{i=1}^n x_i, Σ_{i=1}^n x_i²) and (Σ_{i=1}^n x_i, Σ_{i=1}^n (x_i − x̄)²)
(e) (Σ_{i=1}^n x_i, Σ_{i=1}^n x_i³) and (Σ_{i=1}^n x_i, Σ_{i=1}^n (x_i − x̄)³)
5. Let e = (e lo e,) be a bivariate parameter. Suppose that T l (X) is sufficient for e, whenever 82 is fixed and known, whereas T2 (X) is sufficient for (h whenever 81 is fixed and known. Assume that eh ()2 vary independently, lh E 8 1 , 8 2 E 8 2 and that the set S = {x: pix, e) > O} does not depend on e. (a) Show that ifT, and T, do not depend one2 and e, respectively, then (Tl (X), T2 (X)) is sufficient for e. (b) Exhibit an example in which (T, (X), T2 (X)) is sufficient for T l (X) is sufficient for 8 1 whenever 8 2 is fixed and known, but Tz(X) is not sufficient for 82 , when el is fixed and known. 6. Let X take on the specified values VI, .•. 1 Vk with probabilities 8 1 , .•• ,8k, respectively. Suppose that Xl, ... ,Xn are independently and identically distributed as X. Suppose that IJ = (e" ... , e>l is unknown and may range over the set e = {(e" ... ,ek) : e, > 0, 1 < i < k, E~ 18i = I}, Let Nj be the number of Xi which equal Vj' (a) What is the distribution of (N" ... , N k )? (b) Sbow that N = (N" ... , N k _,) is sufficient for 7. Let Xl,'"
1
e,
e.
X n be a sample from a population with density p(x, 8) given by
pix, e)
o otherwise.
Here
e = (/1, <r) with 00 < /1 < 00, <r > 0.
(a) Show that min (Xl, ... 1 X n ) is sufficient for fl when a is fixed. (b) Find a onedimensional sufficient statistic for a when J1. is fixed. (c) Exhibit a twodimensional sufficient statistic for 8. 8. Let Xl,. " ,Xn be a sample from some continuous distribution Fwith density f, which is unknown. Treating f as a parameter, show that the order statistics X(l),"" X(n) (cf. Problem B.2.8) are sufficient for f.
,
"
I
!
86
Statistical Models, Goals, and Performance Criteria
Chapter 1
9. Let Xl, ... ,Xn be a sample from a population with density
j,(x)
a(O)h(x) if 0, < x < 0,
o othetwise
where h(x)
> 0,0= (0,,0,)
with
00
< 0, < 0, <
00,
and a(O)
=
[J:"
h(X)dXr'
is assumed to exist. Find a twodimensional sufficient statistic for this problem and apply your result to the U[()l, ()2] family of distributions. 10. Suppose Xl" .. , X n are U.d. with density I(x, 8) = ~elx61. Show that (X{I),"" X(n», the order statistics, are minimal sufficient. Hint: t,Lx(O) =  E~ ,sgn(Xi  0), 0 't {X"" . , X n }, which determines X(I),
. " , X(n)'
11. Let X 1 ,X2, ... ,Xn be a sample from the unifonn, U(O,B). distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for O.
12. Dynkin, Lehmann, Scheffe's Theorem. Let P = {Po : () E e} where Po is discrete concentrated on X = {x" x," .. }. Let p(x, 0) p.[X = xl Lx(O) > on Show that f:xx(~~) is minimial sufficient. Hint: Apply the factorization theorem.
=
=
°
x,
13. Suppose that X = (XlI" _, X n ) is a sample from a population with continuous distribution function F(x). If F(x) is N(j1., ,,'), T(X) = (X, ,,'). where,,2 = n l E(Xi 1')2, is sufficient, and S(X) ~ (XCI)"" ,Xin»' where XCi) = (X(i)  1')/'" is "irrelevant" (ancillary) for (IL, a 2 ). However, S(X) is exactly what is needed to estimate the "shape" of F(x) when F(x) is unknown. The shape of F is represented hy the equivalence class F = {F((·  a)/b) : b > 0, a E R}. Thus a distribution G has the same shape as F iff G E F. For instance, one "estimator" of this shape is the scaled empirical distribution function F,(x) jln, x(j) < x < x(i+1)' j = 1, . .. ,nl
~
0, x
< XCI)
> x(n)
1, x
~
Show that for fixed x, F,((x  x)/,,) converges in prohahility to F(x). Here we are using F to represent :F because every member of:F can be obtained from F.
I
I ,
'I i,
14. Kolmogorov's Theorem. We are given a regular model with e finite.
(a) Suppose that a statistic T(X) has the property that for any prior distribution on 9, the posterior distrihution of 9 depends on x only through T(x). Show that T(X) is sufficient.
(b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the posterior distribution depends on x only through T(x).
Section 1.7
Problems and Complements
87
Hint: Apply the factorization theorem.
15. Let X h .··, X n be a sample from f(x  0), () E R. Show that the order statistics arc minimal sufficient when / is the density Cauchy Itt) ~ I/Jr(1 + t 2 ). 16. Let Xl,"" X rn ; Y1 ,· . " ~l be independently distributed according to N(p, (72) and N(TI, 7 2 ), respectively. Find minimal sufficient statistics for the following three cases:
(i) p, TI,
0", T
are arbitrary:
00
< p, TI < 00, a <
(J,
T.
(ii)
(J
=T
= TJ
and p, TI, (7 are arbitrary. and p,
0", T
(iii) p
are arbitrary.
17. In Example 1.5.4. express tl as a function of Lx(O, 1) and Lx(l, I). Problems to Sectinn 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose X I, ... , X n is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of X fonns a oneparameter exponential family. Identify 'TI, B, T, and h. 3. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability nf success O. Then P, IX = k] = (I  0)'0, k ~ 0 0 1,2, ... This is called thc geometric distribution (9 (0». (a) Show that the family of geometric distributions is a oneparameter exponential family with T(x) ~ x. (b) Deduce from Theorem 1.6.1 that if X lo '" oXn is a sample from 9(0), then the distributions of L~ 1 Xi fonn a oneparameter exponential family. (c) Show that E~
1
Xi in part (b) has a negative binomial distribution with parameters
(noO)definedbyP,[L:71Xi = kJ =
(n+~I
)
(10)'on,k~0,1,2o'"
(The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success 0.) Hint: By Theorem 1.6.1, P,[L:7 1 Xi = kJ = c.(1  o)'on. 0 < 0 < 1. If
=
' " CkW'
I
= c(;'',)::, 0 lw n
k=O
L..J
< W < I, then
4. Which of the following families of distributions are exponential families? (Prove or
disprove.) (a) The U(O, 0) fumily
88
(b) p(.", 0)
Statistical Models, Goals, and Performance Criteria
Chapter 1
(c)p(x,O)
= {exp[2Iog0+log(2x)]}I[XE = ~,xE {O.I +0, ... ,0.9+0j = 2(x +
0)/(1 + 20), 0
(0,0)1
(d) The N(O, 02 ) family, 0 > 0
(e)p(x,O)
< x < 1,0> 0
(f) p(x,9) is the conditional frequency function of a binomial, B(n,O), variable X, given that X > O.
5. Show that the following families of distributions are twoparameter exponential families and identify the functions 1], B, T, and h. (a) The beta family. (b) The gamma family. 6. Let X have the Dirichlet distribution, D( a), of Problem 1.2.15. Show the distribution of X form an rparameter exponential family and identify fJl B, T, and h.
7. Let X = ((XI, Y I ), ... , (X no Y n » be a sample from a bIvariate nonnal population.
Show that the distributions of X form a fiveparameter exponential family and identify 'TJ, B, T, and h.
8. Show that the family of distributions of Example 1.5.3 is not a one parameter eX(Xloential family. Hint: If it were. there would be a set A such that p(x, 0) > on A for all O.
°
9. Prove the analogue of Theorem 1.6.1 for discrete kparameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (XI, X 2) is a sample of size 2 from f(·, 0), then XI + X2 is sufficient for B. Show that f(·, B) corresponds to a onearameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x" X2) such that log f(x" 0) + log f(X2, 0) = g(xI + X2, 0) + h(XI, X2). Fix 00 and let r(x, 0) = log f(x, 0)  log f(x, 00), q(x, 0) = g(x,O)  g(x,Oo). Then, q(xI + X2,0) = r(xI,O) +r(x2,0), and hence, [r(x" 0) r(O, 0)1 + [r(x2, 0)  r(O, 0») = r(xi + X2, 0)  r(O, 0). 11. Use Theorems 1.6.2 and 1.6.3 to obtain momentgenerating functions for the sufficient statistics when sampling from the following distributions. (a) normal, () ~ (ll,a 2 )
(b) gamma. r(p, >.), 0
= >., p fixed
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3)
(0 gamma. r(p, >'). ()
= (p, >.).
 
,

Section 1. 7
Problems and Complements
89
12. Show directly using the definition of the rank of an ex}X)nential family that the multinomialdistribution,M(n;OI, ... ,Ok),O < OJ < 1,1 <j < k,I:~oIOj = 1, is of rank k1. 13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any (k ~ I)dimensional hyperplane. 14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of &. Show that if k = 1 and &0 oJ 0 and A, A are defined on all of &, then Theorem 1.6.3 continues to hold. 15. Let P = {P. : 0 E e} where p. is discrete and concentrated on X = {x" X2, ... }, and let p( x, 0) = p. IX = x I. Show that if P is a (discrete) canonical ex ponential family generated bi, (T, h) and &0 oJ 0, then T is minimal sufficient. Hint: ~;j'Lx('l) = Tj(X)  E'lTj(X). Use Problem 1.5.12.
16. Life testing. Let Xl,.'" X n be independently distributed with exponential density (20)l e x/2. for x > 0, and let the ordered X's be denoted by Y, < Y2 < '" < YnIt is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r aparticles have been emitted. Show that
(i) The joint distribution of Y1 , •.. , Yr is an exponential family with density
n! [ (20), (n _ r)! exp (ii) The distribution of II:: I Y;
(iii) Let
1
I::l Yi + (n 20
r)Yr]
' 0  Y,  ...  Yr·
<
<
<
+ (n 
r)Yrl/O is X2 with 2r degrees of freedom.
denote the time required until the first, second,... event occurs in a Poisson process with parameter 1/20' (see A.I6). Then Z, = YI/O', Z2 = (Y2 Yr)/O', Z3 = (Y3  Y 2)/0', ... are independently distributed as X2 with 2 degrees of freedom, and the joint density of Y1 , ••. , Yr is an exponential family with density
Yi, Yz , ...
The distribution of Yr/B' is again XZ with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burnedout tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
I ,
90
Statistical Models, Goals, and Performance Criteria Chapter 1
1)(Y; l~~l)/e (I = 1", .. ,') are independently distributed as X2 with 2 degrees of freedom, and [L~ 1 Yi + (n  7")Yr]/B = [(ii): The random variables Zi ~ (n  i
+
L::~l Z,.l
17. Suppose that (TkXl' h) generate a canonical exponential family P with parameter k 1Jkxl and E = R . Let
(a) Show that Q is the exponential family generated by IlL T and h exp{ cTT}. where IlL is the projection matrix of Tonto L = {'I : 'I = BO + c). (b) Show that ifP has full rank k and B is of rank I, then Q has full rank l. Hint: If B is of rank I, you may assume
18. Suppose Y1, ... 1 Y n are independent with Yi '" N(131 + {32Zi, (12), where Zl,'" , Zn are covariate values not all equaL (See Example 1.6.6.) Show that the family has rank 3.
Give the mean vector and the variance matrix of T.
19. Logistic Regression. We observe (Zll Y1 ), ... , (zn, Y n ) where the Y1 , .. _ , Y n are independent, Yi "' B(TIi, Ad The success probability Ai depends on the characteristics Zi of the ith subject, for example, on the covariate vector Zi = (age, height, blood pressure)T. The function I(u) ~ log[u/(l  u)] is called the logil function. In the logistic linear re(3 where (3 = ((31, ... ,/3d ) T and Zi is d x 1. gression model it is assumed that I (Ai) = Show that Y = (Y1 , ... , yn)T follow an exponential model with rank d iff Zl, ... , Zd are
zT
not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). 20. (a) In part IT of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by ('11 'Io)TT and that ~(ii) =~(i). (b) Fill in the details of part III of the proof of Theorem 1.6.4. 21. Find JJ.('I) ~ EryT(X) for the gamma,
qa, A), distribution, where e = (a, A).
I
22. Let X I, . _ . ,Xn be a sample from the k·parameter exponential family distribution (1.6.10). Let T = (L:~ 1 1 (Xi ), ... , L:~ 1Tk(X,») and let T
I
S
~
((ryl(O), ... ,ryk(O»): e E 8).
Show that if S contains a subset of k + 1 vectors Vo, .. _, Vk+l so that Vi  Vo, 1 < i are not collinear (linearly independent), then T is minimally sufficient for 8.
< k.
I .' jl,
"
23. Using (1.6.20). find a conjugate family of distributions for the gamma and beta families. (a) With one parameter fixed. (b) With both parameters free.
:
I
Section 1.7
Problems and Complements
91
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter 0 = (O!, O ) where O! = E,(X), 0, ~ l/(Var oX) (cf. Problem 1.2.12). 2 25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with (72 known. Find a conjugate family of prior distributions for (131,132) T. 26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15. 27. Let P denote the canonical exponential family genrated by T and h. For any TJo E £, set ho(x) = q(x, '10) where q is given by (1.6.9). Show that P is also the canonical exponential family generated by T and h o.
28. Exponential/amities are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by h(f)
~ E(logf(X)) =
l:IIOgf(X)I!(X)dx.
This quantity arises naturally in infonnation in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S ~ {x: f(x) > OJ. (a) Show that the canonical kparameter exponential family density
f(x, 'I)
= exp
• ryjrj(x) 1/0 + I:
j:=1
A('I)
, XES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is
f(x)dx
~ 1,
Is
f(x)rj(x)
~ aj,
1 < j < k,
where '17o, .•.• '17k are chosen so that f satisfies the constraints. Hint: You may usc Lagrange multipliers. Maximize the integrand. (b) Find the maximum entropy densities when rj(x) = x j and (i) S ~ (0,00), k = 1, at > 0; (ii) S = R, k = 2, at E R, a, > 0; (iii) S = R, k = 3, a) E R, a, > 0, a3 E R. 29. As in Example 1.6.11, suppose that Y 1, ...• Y n are Li.d. Np(f.L. E) where f.L varies freely in RP and E ranges freely over the class of all p x p symmetric positive definite matrices. Show that the distribution of Y = (Y ... , Yn ) is the p(p + 3)/2 canonical " exponential family generated by h = 1 and the p(p + 3)/2 statistics
n n
Tj
=
LYii>
i=l
1 <j <Pi
Tjl =
LJ'ijJ'iI.
i=l
1 <j< l<p
where Y i = (Yi!, ... , Yip). Show that <: is open and that this family is of rank pcp + 3)/2. Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = pcp + 3)/2 statistics Tj(Y) ~ Yj, 1 < j < p, and Tj,(Y) = YjYi, 1 <j < I < p,
generate N_p(μ, Σ). As Σ ranges over all p × p symmetric positive definite matrices, so does Σ^{-1}. Next establish that for symmetric matrices M,
∫ exp{-u^T M u} du < ∞ iff M is positive definite
by using the spectral decomposition (see B.10.1.2)
M = Σ_{j=1}^p λ_j e_j e_j^T for e_1, ..., e_p orthogonal, λ_j ∈ R.
To show that the family has full rank m, use induction on p to show that if Z_1, ..., Z_p are i.i.d. N(0, 1) and if B_{p×p} = (b_{jl}) is symmetric, then
P(Σ_{j=1}^p a_j Z_j + Σ_{j,l} b_{jl} Z_j Z_l = c) = P(a^T Z + Z^T B Z = c) = 0
unless a = 0, B = 0, c = 0. Next recall (Appendix B.6) that since Y ~ N_p(μ, Σ), then Y = SZ for some nonsingular p × p matrix S.
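As a quick numerical illustration of Problem 29 (the dimension, sample size, and parameter values below are arbitrary assumptions), one can check by simulation that the p(p + 3)/2 statistics have means n μ_j and n(Σ_{jl} + μ_j μ_l), the first moments implied by the exponential family representation.

import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 3, 20, 5000

mu = np.array([1.0, -0.5, 2.0])
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)           # an arbitrary positive definite covariance

sum_T1 = np.zeros(p)                      # accumulates T_j  = sum_i Y_ij
sum_T2 = np.zeros((p, p))                 # accumulates T_jl = sum_i Y_ij Y_il
for _ in range(reps):
    Y = rng.multivariate_normal(mu, Sigma, size=n)
    sum_T1 += Y.sum(axis=0)
    sum_T2 += Y.T @ Y

print("average T_j:  ", sum_T1 / reps)
print("n * mu:       ", n * mu)
print("average T_jl:\n", sum_T2 / reps)
print("n * (Sigma + mu mu^T):\n", n * (Sigma + np.outer(mu, mu)))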
30. Show that if Xl,'" ,Xn are d.d. N p (8,E o) given (J where ~o is known, then the Np(A, f) family is conjugate to N p(8, Eo), where A varies freely in RP and f ranges over
all p x p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(μ_j, τ_j) : 1 ≤ j ≤ k} be a given collection of pairs with μ_j ∈ R, τ_j > 0. Let (μ, τ) be a random pair with λ_j = P((μ, τ) = (μ_j, τ_j)), 0 < λ_j < 1, Σ_{j=1}^k λ_j = 1. Let θ be a random variable whose conditional distribution given (μ, τ) = (μ_j, τ_j) is normal, N(μ_j, τ_j²). Consider the model X = θ + ε, where θ and ε are independent and ε ~ N(0, σ_0²), σ_0² known. Note that θ has the prior density
π(θ) = Σ_{j=1}^k λ_j φ_{τ_j}(θ - μ_j)   (1.7.4)
where φ_τ denotes the N(0, τ²) density. Also note that (X | θ) has the N(θ, σ_0²) distribution.
(a) Find the posterior
π(θ | x) = Σ_{j=1}^k P((μ, τ) = (μ_j, τ_j) | x) π(θ | (μ_j, τ_j), x)
and write it in the form
Σ_{j=1}^k λ_j(x) φ_{τ_j(x)}(θ - μ_j(x))
for appropriate λ_j(x), τ_j(x) and μ_j(x). This shows that (1.7.4) defines a conjugate prior for the N(θ, σ_0²) distribution.
(b) Let X_i = θ + ε_i, 1 ≤ i ≤ n, where θ is as previously and ε_1, ..., ε_n are i.i.d. N(0, σ_0²). Find the posterior π(θ | x_1, ..., x_n), and show that it belongs to class (1.7.4).
Hint: Consider the sufficient statistic for p(x | θ).
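For Problem 31(a), the standard normal-normal calculation gives, within component j, a normal posterior whose precision is 1/τ_j² + 1/σ_0², and a mixture weight proportional to λ_j times the N(μ_j, τ_j² + σ_0²) density at x. The following minimal Python sketch implements these updates; the numerical inputs are illustrative assumptions only.

import numpy as np
from scipy.stats import norm

def posterior_mixture(x, lam, mu, tau, sigma0):
    # prior (1.7.4): theta ~ sum_j lam_j N(mu_j, tau_j^2);  X | theta ~ N(theta, sigma0^2)
    lam, mu, tau = (np.asarray(a, dtype=float) for a in (lam, mu, tau))
    # component j's marginal density of x is N(mu_j, tau_j^2 + sigma0^2)
    marg = norm.pdf(x, loc=mu, scale=np.sqrt(tau**2 + sigma0**2))
    lam_post = lam * marg
    lam_post /= lam_post.sum()
    # within component j, the usual normal-normal update applies
    prec = 1.0 / tau**2 + 1.0 / sigma0**2
    tau_post = np.sqrt(1.0 / prec)
    mu_post = (mu / tau**2 + x / sigma0**2) / prec
    return lam_post, mu_post, tau_post

print(posterior_mixture(x=1.3, lam=[0.4, 0.6], mu=[0.0, 2.0], tau=[1.0, 0.5], sigma0=1.0))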
32. A Hierarchical Binomial-Beta Model. Let {(r_j, s_j) : 1 ≤ j ≤ k} be a given collection of pairs with r_j > 0, s_j > 0, let (R, S) be a random pair with P(R = r_j, S = s_j) = λ_j, 0 < λ_j < 1, Σ_{j=1}^k λ_j = 1, and let θ be a random variable whose conditional density π(θ, r, s) given R = r, S = s is beta, β(r, s). Consider the model in which (X | θ) has the binomial, B(n, θ), distribution. Note that θ has the prior density
π(θ) = Σ_{j=1}^k λ_j π(θ, r_j, s_j).   (1.7.5)
Find the posterior
π(θ | x) = Σ_{j=1}^k P(R = r_j, S = s_j | x) π(θ | (r_j, s_j), x)
and show that it can be written in the form Σ_j λ_j(x) π(θ, r_j(x), s_j(x)) for appropriate λ_j(x), r_j(x) and s_j(x). This shows that (1.7.5) defines a class of conjugate priors for the B(n, θ) distribution.
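Problem 32 can be explored numerically in the same way: within component j the beta-binomial update applies, and the posterior weight of component j is proportional to λ_j times the beta-binomial marginal probability of x under (r_j, s_j). The sketch below (illustrative inputs only) computes both.

import numpy as np
from scipy.special import betaln, gammaln

def posterior_beta_mixture(x, n, lam, r, s):
    # prior (1.7.5): theta ~ sum_j lam_j Beta(r_j, s_j);  X | theta ~ Binomial(n, theta)
    lam, r, s = (np.asarray(a, dtype=float) for a in (lam, r, s))
    # log of the beta-binomial marginal P(X = x) under component j
    log_marg = (gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
                + betaln(x + r, n - x + s) - betaln(r, s))
    w = lam * np.exp(log_marg - log_marg.max())   # stabilized before normalizing
    w /= w.sum()
    # component j's posterior is Beta(r_j + x, s_j + n - x)
    return w, r + x, s + n - x

print(posterior_beta_mixture(x=7, n=10, lam=[0.5, 0.5], r=[1.0, 5.0], s=[1.0, 5.0]))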
33. Let p(x, η) be a one-parameter canonical exponential family generated by T(x) = x and h(x), x ∈ X ⊂ R, and let ψ(x) be a nonconstant, nondecreasing function. Show that E_η ψ(X) is strictly increasing in η.
Hint:
Cov_η(ψ(X), X) = ½ E{(X - X')[ψ(X) - ψ(X')]}
where X and X' are independent and identically distributed as X (see A.11.12).
34. Let (X_1, ..., X_n) be a stationary Markov chain with two states 0 and 1. That is,
P[X_i = ε_i | X_1 = ε_1, ..., X_{i-1} = ε_{i-1}] = P[X_i = ε_i | X_{i-1} = ε_{i-1}] = p_{ε_{i-1} ε_i}
where
(p_00  p_01)
(p_10  p_11)
is the matrix of transition probabilities. Suppose further that
(i) p_00 = p_11 = p, so that p_10 = p_01 = 1 - p;
(ii) P[X_1 = 0] = P[X_1 = 1] = ½.
(a) Show that if 0 < p < 1 is unknown this is a full rank, one-parameter exponential family with T = N_00 + N_11, where N_ij is the number of transitions from i to j. For example, 01011 has N_01 = 2, N_11 = 1, N_00 = 0, N_10 = 1.
(b) Show that E(T) = (n - 1)p (by the method of indicators or otherwise).
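The identity E(T) = (n - 1)p in Problem 34(b) is easy to check by simulation; in the following minimal sketch the chain length, the value of p, and the number of replications are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(2)

def simulate_chain(n, p):
    # two-state chain of Problem 34: X_1 uniform on {0, 1}, stays in its state w.p. p
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(0, 2)
    stay = rng.random(n - 1) < p
    for i in range(1, n):
        x[i] = x[i - 1] if stay[i - 1] else 1 - x[i - 1]
    return x

def T_stat(x):
    # T = N_00 + N_11: the number of transitions that remain in the same state
    return int(np.sum(x[1:] == x[:-1]))

n, p, reps = 20, 0.3, 50000
mean_T = np.mean([T_stat(simulate_chain(n, p)) for _ in range(reps)])
print("Monte Carlo mean of T:", mean_T, "   (n - 1) p =", (n - 1) * p)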
35. A Conjugate Prior for the Two-Sample Problem. Suppose that X_1, ..., X_n and Y_1, ..., Y_n are independent N(μ_1, σ²) and N(μ_2, σ²) samples, respectively. Consider the prior π for which for some r > 0, k > 0, rσ^{-2} has a χ²_k distribution and given σ², μ_1 and μ_2 are independent with N(ξ_1, σ²/k_1) and N(ξ_2, σ²/k_2) distributions, respectively, where ξ_j ∈ R, k_j > 0, j = 1, 2. Show that π is a conjugate prior.
36. The inverse Gaussian density, IG(μ, λ), is
f(x, μ, λ) = [λ/2π]^{1/2} x^{-3/2} exp{-λ(x - μ)²/2μ²x}, x > 0, μ > 0, λ > 0.
(a) Show that this is an exponential family generated by T(X) = -½(X, X^{-1})^T and h(x) = (2π)^{-1/2} x^{-3/2}.
(b) Show that the canonical parameters η_1, η_2 are given by η_1 = μ^{-2}λ, η_2 = λ, and that A(η_1, η_2) = -[½ log(η_2) + √(η_1 η_2)], E = [0, ∞) × (0, ∞).
(c) Find the moment-generating function of T and show that E(X) = μ, Var(X) = μ³λ^{-1}, E(X^{-1}) = μ^{-1} + λ^{-1}, Var(X^{-1}) = (λμ)^{-1} + 2λ^{-2}.
(d) Suppose μ = μ_0 is known. Show that the gamma family, Γ(α, β), is a conjugate prior.
(e) Suppose that λ = λ_0 is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to μ. That is, Ω defined in (1.6.19) is empty.
(f) Suppose that μ and λ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, Ω defined in (1.6.19) is empty.

37. Let X_1, ..., X_n be i.i.d. as X ~ N_p(θ, Σ_0) where Σ_0 is known. Show that the conjugate prior generated by (1.6.20) is the N_p(η_0, τ_0² I) family, where η_0 varies freely in R^p, τ_0² > 0 and I is the p × p identity matrix.
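The moment formulas in Problem 36(c) can be checked numerically by integrating the stated density; the sketch below, with arbitrary illustrative values of μ and λ, does this with scipy.

import numpy as np
from scipy.integrate import quad

def ig_density(x, mu, lam):
    # the IG(mu, lambda) density exactly as stated in Problem 36
    return np.sqrt(lam / (2 * np.pi)) * x ** -1.5 * np.exp(-lam * (x - mu) ** 2 / (2 * mu ** 2 * x))

mu, lam = 2.0, 3.0
moment = lambda k: quad(lambda x: x ** k * ig_density(x, mu, lam), 0, np.inf)[0]

m1, m2, m_inv1, m_inv2 = moment(1), moment(2), moment(-1), moment(-2)
print("E(X)     :", m1, "   mu               =", mu)
print("Var(X)   :", m2 - m1 ** 2, "   mu^3 / lambda    =", mu ** 3 / lam)
print("E(1/X)   :", m_inv1, "   1/mu + 1/lambda  =", 1 / mu + 1 / lam)
print("Var(1/X) :", m_inv2 - m_inv1 ** 2, "   1/(lambda mu) + 2/lambda^2 =", 1 / (lam * mu) + 2 / lam ** 2)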
38. Let X_i = (Z_i, Y_i)^T be i.i.d. as X = (Z, Y)^T, 1 ≤ i ≤ n, where X has the density of Example 1.6.3. Write the density of X_1, ..., X_n as a canonical exponential family and identify T, h, A, and E. Find the expected value and variance of the sufficient statistic.
39. Suppose that Y_1, ..., Y_n are independent, Y_i ~ N(μ_i, σ²), n ≥ 4.
(a) Write the distribution of Y_1, ..., Y_n in canonical exponential family form. Identify T, h, η, A, and E.
(b) Next suppose that μ_i depends on the value z_i of some covariate and consider the submodel defined by the map η : (θ_1, θ_2, θ_3)^T → (μ^T, σ²)^T where η is determined by
μ_i = exp{θ_1 + θ_2 z_i}, z_1 < z_2 < ... < z_n;  σ² = θ_3
where θ_1 ∈ R, θ_2 ∈ R, θ_3 > 0. This model is sometimes used when μ_i is restricted to be positive. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 3.
40. Suppose Y_1, ..., Y_n are independent exponentially, E(λ_i), distributed survival times, n ≥ 3.
(a) Write the distribution of Y_1, ..., Y_n in canonical exponential family form. Identify T, h, η, A, and E.
(b) Recall that μ_i = E(Y_i) = λ_i^{-1}. Suppose μ_i depends on the value z_i of a covariate. Because μ_i > 0, μ_i is sometimes modeled as
μ_i = exp{θ_1 + θ_2 z_i}, i = 1, ..., n,
where not all the z's are equal. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 2.
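The curved family of Problem 40(b) can also be explored numerically: the log likelihood Σ_i [-log μ_i - Y_i/μ_i] with μ_i = exp{θ_1 + θ_2 z_i} can be evaluated and maximized directly. The following sketch uses simulated data and a generic numerical optimizer; the data, starting values, and choice of optimizer are illustrative assumptions, and numerical maximization is only one possible approach.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

def neg_log_lik(theta, y, z):
    # Y_i ~ exponential with mean mu_i = exp(theta_1 + theta_2 z_i), as in Problem 40(b)
    mu = np.exp(theta[0] + theta[1] * z)
    return np.sum(np.log(mu) + y / mu)

# illustrative data
n = 100
z = np.linspace(0.0, 1.0, n)
theta_true = np.array([0.5, -1.0])
y = rng.exponential(scale=np.exp(theta_true[0] + theta_true[1] * z))

fit = minimize(neg_log_lik, x0=np.zeros(2), args=(y, z), method="Nelder-Mead")
print("fitted (theta_1, theta_2):", fit.x)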
1.8
NOTES
Note for Section 1.1
(1) For the measure theoretically minded we can assume more generally that the P_θ are all dominated by a σ-finite measure μ and that p(x, θ) denotes dP_θ/dμ, the Radon-Nikodym derivative.
Notes for Section 1.3
(1) More natural in the sense of measuring the Euclidean distance between the estimate θ̂ and the "truth" θ. Squared error gives much more weight to those θ̂ that are far away from θ than those close to θ.
(2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r.
Note for Section 1.4
(1) Source: Hodges, Jr., J. L., D. Krech, and R. S. Crutchfield, StatLab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975.
Notes for Section 1.6
(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particles; see Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory; see Cover and Thomas (1991).
(2) The restriction that x ∈ R^q and that these families be discrete or continuous is artificial. In general, if μ is a σ-finite measure on the sample space X, p(x, θ) as given by (1.6.1)
" Biomelika. T. I: I" Ii L6HMANN. L. • • . GRENANDER. B. I and II.. and spheres (e. r HorXJEs.." Statist. 0 . STUART. and so on. Ch. FEYNMAN. "Model Specification: The Views of Fisher and Neyman... Sands. M. E. 1961. Statistical Analysis of Stationary Time Series New York: Wi • . J. BLACKWELL.. E. THOMAS.I' L6HMANN." J.. 1954. D.96 Statistical Models. ley. 1957. Note for Section 1.547572 (1957). Science 5. The Advanced Theory of Statistics." Ann. BROWN. R. "A Theory of Some Multiple Decision Problems. Goals. CARROLL. 1. JR.266291 (1978). 5. L.. S. 1963. L. KENDALL. Optimal Statistical Decisions New York: McGrawHili. . BOX.. for instance. CRUTCHFIELD. 23.733741 (1990). Iii I' i M. P. "A Stochastic Model for the Distribution ofHIV Latency Time Based on T4 Counts.. A... L. AND M. 383430 (1979). IMS Lecture NotesMonograph Series. 1969. 22. 1985. Feynman.. ROSENBLATT." Ann. m New York: Hafner Publishing Co. M. RUPPERT. This permits consideration of data such as images. the Earth). 1991. Testing Statistical Hypotheses.7 (I) u T Mu > 0 for all p x 1 vectors u l' o.. "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression. BERMAN. M. I. H. J. Statist. G. and Later Developments. A 143." Ann. New York: Springer.g. DE GROOT. R. Eels. L. Ii . AND A. D. Elements of Information Theory New York: Wiley. . II. T. 2nd ed.. . .. Leighton. and M.• Statistical Decision Theory and Bayesian Analysis New York: Springer. 1988. Statist. A. J. P. AND COVER. Soc. Vols. Transformation and Weighting in Regression New York: Chapman and Hall. G. Royal Statist. DoKSUM. SAMAROV. 77. Math Statist. S. Statlab: An Empirical Introduction to Statistics New York: McGrawHili. v.t AND D. 40 Statistical Mechanics ofPhysics Reading. 1967. AND M. Theory ofGames and Statistical Decisions New York: Wiley.. 6. KRETcH AND R. K A AND A. Nonlinearity. J. E.9 REFERENCES BERGER. positions. "Using Residuals Robustly I: Tests for Heteroscedasticity. 1986. E. Mathematical Statistics New York: Academic Press. U. P. 160168 (1990). 14431473 (1995). 1997. L6HMANN. and Performance Criteria Chapter 1 can be taken to be the density of X with respect to fLsee Lehmann (1997). The Feynmtln Lectures on Physics. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. . Hayward.. P.. R. 1975. 1966. MA: AddisonWesley.. 125. . BICKEL. R. FERGUSON. "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion).. GIRSHICK.
LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference. London: Cambridge University Press, 1965.
MANDEL, J., The Statistical Analysis of Experimental Data. New York: J. Wiley & Sons, 1964.
NORMAND, S-L., AND K. A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Editors: S. Ahmed and N. Reid, Lecture Notes in Statistics. New York: Springer, 2000.
PEARSON, K., "On the General Theory of Skew Correlation and Non-linear Regression," Proc. Roy. Soc. London, 71, 303 (1905). (Draper's Research Memoirs, Biometrics Series II. London: Dulau & Co.)
RAIFFA, H., AND R. SCHLAIFFER, Applied Statistical Decision Theory. Boston: Division of Research, Graduate School of Business Administration, Harvard University, 1961.
SAVAGE, L. J., The Foundations of Statistics. New York: J. Wiley & Sons, 1954.
SAVAGE, L. J., ET AL., The Foundation of Statistical Inference. London: Methuen & Co., 1962.
SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989.
WETHERILL, G. B., AND K. D. GLAZEBROOK, Sequential Methods in Statistics. New York: Chapman and Hall, 1986.
1 BASIC HEURISTICS OF ESTIMATION Minimum Contrast Estimates.8) is smooth. The equations (2. if PO o were true and we knew DC 8 0 . X E X.1. Now suppose e is Euclidean C R d . X """ PEP. As a function of 8. but in a very weak sense (unbiasedness).1 2.8).2) define a special form of estimating equations. Then we expect  \lOD(Oo.Chapter 2 METHODS OF ESTIMATION 2. 8). p(X. DC8 0l 9) measures the (population) discrepancy between 8 and the true value 8 0 of the parameter. the true 8 0 is an interior point of e.1) Arguing heuristically again we are led to estimates B that solve \lOp(X.O) EOoP(X.1. In this parametric case.8) as a function of 8. usually parametrized as P = {PO: 8 E e}.O) where V denotes the gradient.1. and 8 Jo D( 8 0 . 6) ~ o. (2. Of course. In order for p to be a contrast function we require that DC 8 0l 9) is uniquely minimized for 8 = (Jo.8) is an estimate of D(8 0 .2) 99 .O). we don't know the truth so this is inoperable. how do we select reasonable estimates for 8 itself? That is. we could obtain 8 0 as the minimizer.1. =0 (2. That is. how do we find a function 8(X) of the vector observation X that in some sense "is close" to the unknown 81 The fundamental heuristic is typically the following. We consider a function that we shall call a contrast function  p:Xx8>R and define D(Oo. This is the most general fonn of the minimum contrast estimate we shall consider in the next section. Estimating Equations Our basic framework is as before. So it is natural to consider 8(X) minimizing p(X.
100 More generally.z. (3) n n<T. . suppose we are given a function W : X and define Methods of Estimation Chapter 2 X R d . further. :i = ~8g~ ~ L. Yn are independent.). Consider the parametric version of the regression model of Example 1.. we take n p(X. Here the data are X = {(Zi.Zid)T 'j • r .1. Here is an example to be pursued later. . I D({3o.1. Then {3 parametrizes the model and we can compute (see Problem 2.i. W . z. ZI).1.p(X. If.3) = 0 has 8 0 as its unique solution for all 8 0 ~ E e. I In the important linear case.10). A naiural(l) function p(X.8/3 ({3. i=l 3 I (2. (1).8) ~ E I1 ..z)l: 1{31 ~ oo} 'I = 00 I (Problem 2.)g(l3. ~ Then we say 8 solving (2. {3) to consider is the squared Euclidean distance between the vector Y of observed Yi and the vector expectation of Y.1. Least Squares. IL( z) = (g({3. ua). z. W(X.4) w(X.2) or equivalently the system of estimating eq!1ations. {3 E R d .1.4 are i.) . where the function 9 js known. Zn»T That is.1.)J i=1 ~ 2 (2.1. (3) E{3.4 R d.g({3. (2.5) Strictly speaking P is not fully defined here and this is a point we shall explore later.7) i=l J . for convenience. z) is continuous and ! I lim{lg(l3.. l'i) : 1 < i < n} where Yi.g({3.)Y. then {3 satisfies the equation (2.1.8/3 ({3. The estimate (3 is called the least squares estimate.16).i + L[g({3o. . ~ ~8g~ L.!I [" '1 L • .z. J which is indeed minimized at {3 = {3o and uniquely so if and only if the parametrization is identifiable. i=l (2.1.. suppose we postulate that the Ei of Example 1.d. .1. N(O. But. Example 2.6) . g({3. An estimate (3 that minimizes p(X.1. z). (1) Suppose V (8 0 .1. there is a substantial overlap between the two classes of estimates.' . Wd)T V(l1 o. z. . .('l/Jl. (3) exists if g({3.zd = LZij!3j j=1 andzi = (Zil.4 with I'(z) = g({3. Evidently. z) is differentiable in {3. . g({3. "'.. d g({3. z.»)'. (3) = !Y .1L1 2 = L[Yi .I1) =0 is an estimating equation estimate.
/1j converges in probability to flj(fJ).. . . u5). . Here is another basic estimating equation example.1 Basic Heuristics of Estimation 101 the system becomes (2.1 from R d to Rd.i. This very important example is pursued further in Section 2. once defined we have a method of computing a statistic fj from the data X = {(Zi 1 Xi). . The motivation of this simplest estimating equation example is the law of large numbers: For X "' Po.d. More generally. . I'd). we need to be able to express 9 as a continuous function 9 of the first d moments. 1 1 < j < d. L.d.. and then using h(p!. Method a/Moments (MOM).. I'd of X. Thus. as X ~ P8.  n n.9) where Zv IIZijllnxd is the design matrix.1. N(O.. 1 1j ~_l".i.. we obtain a MOM estimate of q( 8) by expressing q( 8) as a function of any of the first d moments 1'1. . provides a first example of both minimum contrast and estimating equation methods. ~ . 8 E R d and 8 is identifiable.Section 2.1.2.. 0 Example 2. Least squares.. d > k. In fact. We return to the remark that this estimating method is well defined even if the Ci are not i. . The method of moments prescribes that we estimate 9 by the solution of p. Suppose Xl. I'd( 8) are the first d moments of the population we are sampling from. suppose is 1 .. 1 < i < n}. thus..Xi t=l .Xn are i. Suppose that 1'1 (8). which can be judged on its merits whatever the true P governing X is.1..(8).1 <j < d if it exists. if we want to estimate a Rkvalued function q(8) of 9. lid) as the estimate of q( 8).2 and Chapter = 6. These equations are commonly written in matrix fonn (2.8) the normal equations. . . . Thus. . we assume the existence of Define the jth sample moment f1j by. = 1'. say q(8) = h(I'I. To apply the method of moments to the problem of estimating 9.
Pi is the proportion of men in the population in the ith job category and Njn is the sample proportion in this category. for instance.4 and in Chapter 6. case. .11). As an illustration consider a population of men whose occupations fall in one of five different job categories. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments. Here is some job category data (Mosteller. in particular Problem 6. 2. 1968). I.1. or 5.•. . Suppose we observe multinomial trials in which the values VI. but their respective probabilities PI. A = X/a' where a 2 = fl. in general.102 Methods of Estimation Chapter 2 For instance.1. with density [A"/r(. If we let Xl. . express () as a function of IJ" and fl3 = E(X 3 ) and obtain a method of moment estimator based on /11 and fi3 (Problem 2.. It is defined by initializing with eo. ..(X")1 dxd J isquickandM is nonsingular with high probability is the NewtonRaphson algorithm. 0 Algorithmic issues We note that.~ (I'l/a)'.. 2.. 1'1 for () gives = E(X) ~ "/ A.d.2 The PlugIn and Extension Principles We can view the method of moments as an example of what we call the plugin (or substitution) and extension principles. neither minimum contrast estimates nor estimating equation solutions can be obtained in closed fonn. as X and Ni = number of indices j such that X j = Vi. the proportion of sample values equal to Vi. There are many algorithms for optimization and root finding that can be employed. X n be Ltd. .6. . Vi = i. I.)]xO1exp{Ax}. 3.. An algorithm for estimating equations frequently used when computationofM(X.. In this case f) ~ ('" A). . Here k = 5. A>O. . 4. the method of moment estimator is not unique. . f(u.1 0) .1.·)  l~t. '\)..(1 + ")/A 2 Solving .10. and 1" ~ E(X') = .·) fli =D'I'(X. . then setting (2.X 2 • In this example..Pk are completely unknown. . l Vk of the population being sampled are known.3. two other basic heuristics particularly applicable in the i.>0. consider a study in which the survival time X is modeled to have a gamma distribution. Frequency Plugin(2) and Extension..2 and 0=2 = n 1 EX1. This algorithm and others will be discussed more extensively in Section 2. i = 1. We can. . ~ iii ~ (X/a)'.i. A = Jl1/a 2 .5. Example 2.. x>O. then the natural estimate of Pi = P[X = Vi] suggested by the law of large numbers is Njn.
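As a small numerical illustration of the method of moments calculation for the gamma model described above, where solving μ_1 = α/λ and σ² = μ_2 - μ_1² = α/λ² gives α̂ = (x̄/σ̂)² and λ̂ = x̄/σ̂², the following Python sketch applies these formulas to simulated data; the parameter values and sample size are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)

def gamma_mom(x):
    # method of moments for Gamma(alpha, lambda):
    #   mean = alpha/lambda and variance = alpha/lambda^2, so
    #   alpha_hat = (xbar / sigma_hat)^2 and lambda_hat = xbar / sigma_hat^2
    xbar = np.mean(x)
    sigma2 = np.mean((x - xbar) ** 2)
    return xbar ** 2 / sigma2, xbar / sigma2

alpha, lam = 2.5, 0.8
x = rng.gamma(shape=alpha, scale=1.0 / lam, size=5000)   # simulated survival times
print("MOM estimates (alpha_hat, lambda_hat):", gamma_mom(x), "   true:", (alpha, lam))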
13 n2::.. . . . P3 PI = 0 = (1. can be identified with a parameter v : P ~ R. Suppose . For instance. .. N 3 ) has a multinomial distribution with parameters (n.1.. . suppose that in the previous job category table.53 = 0.0 < 0 < 1...31 5 95 0.4.Pk) = (~l .Ps) = (P4 + Ps) . (2.12 3 289 0.0. v. HardyWeinberg Equilibrium. + P3). That is. . the estimate is which in our case is 0. Equivalently.}}.09.Pk) with Pi ~ PIX = 'Vi]. . P2 ~ 20(1. The frequency plugin principle simply proposes to replace the unknown population frequencies PI..11) to estimate q(Pll··. i < k. 1 ()d) and that we want to estimate a component of 8 or more generally a function q(8)... then (NIl N 2 . Now suppose that the proportions PI' .1. .12) If N i is the number of individuals of type i in the sample of size n. that is.. P3) given by (2. Next consider the marc general problem of estimating a continuous function q(Pl' . .1.0).pl . .. We would be interested in estimating q(P"". If we use the frequency substitution principle.(P2 + P3). 1 Nk/n.Pk by the observable sample frequencies Nt/n.. use (2.Xn . the multinomial empirical distribution of Xl. . let P dennte p ~ (P". and Then q(p) (p.44 .. . ." . Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type. whereas categories 2 and 3 correspond to whitecollar jobs.. Example 2. v(P) = (P4 + Ps) and the frequency plugin principle simply says to replace P = (PI.12).03 2 84 0.Section 2. ..P2. Pk) of the population proportions. we are led to suppose that there are three types of individuals whose frequencies are given by the socalled HardyWeinberg proportions 2 . .1. 1 < think of this model as P = {all probability distributions Pan {VI. in v(P) by 0 P ).Ni =708 2:::o. the difference in the proportions of bluecollar and whitecollar workers. Pk do not vary freely but are continuous functions of some ddimensional parameter 8 = ((h.41 4 217 0.1 Basic Heuristics of Estimation 103 Job Category l I Ni Pi ~ 23 0.. together with the estimates Pi = NiJn. lPk).0)2. . Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles. categories 4 and 5 correspond to bluecollar jobs..»i=1 for Danish men whose fathers were in category 3. 1~!.. If we assume the three different genotypes are identifiable.
The plugin and extension principles can be abstractly stated as follows: Plugin principle. 0 write 0 = 1 . Note. T(Xl.. F(x) < a}.4.(IJ). where a E (0. case if P is the space of all distributions of X and Xl. as X p.1. (2. the empirical distribution P of X given by    f'V 1 n P[X E Al = I(X. . that we can also Nsfn is also a plausible estimate of O.Pk are continuous functions of 8. . consider va(P) = [Fl (a) + Fu 1(a )1. however. we can usually express q(8) as a continuous function of PI.1.14) are not unique. "·1 is. . Nk) .13) Given h we can apply the extension principle to estimate q( 8) as.14) As we saw in the HardyWeinberg case.1.. We shall consider in Chapters 3 (Example 3. . the frequency of one of the alleles.15) ~ ". .. .1. then v(P) is the plugin estimate of v..Pk. . Let be a submodel of P. . in the Li. Po ~ R given by v(PO) = q(O).d. Now q(lJ) can be identified if IJ is identifiable by a parameter v .13) and estimate (2. 1  V In general. I) and ! F'(a) " = inf{x. (2.Xn ) =h N./P3 and.16) . . Then (2.13) defines an extension of v from Po to P via v. We can think of the extension principle alternatively as follows. .d. the representation (2. ( . .4) and 5 how to choose among such estimates. In particular. if X is real and F(x) = P(X < x) is the distribution function (dJ. F(x) > a}.h(p) and v(P) = v(P) for P E Po.Pk(IJ»).1. I: t=l (2.E have an estimate P of PEP such that PEP and v : P 4 T is a parameter.).. by the law of large numbers.'" . If PI. . .. that is. . Fu'(a) = sup{x. q(lJ) with h defined and continuous on = h(p.. Because f) = ~.1. P > R where v(P) . we can use the principle we have introduced and estimate by J N l In.1... E A) n.104 Methods of Estimation Chapter 2 we want to estimate fJ. (2. a natural estimate of P and v(P) is a plugin estimate of v(P) in this nonparametric context For instance. X n are Ll.. suppose that we want to estimate a continuous Rlvalued function q of e. thus. If w..
12) and to more general method of moment estimates (Problem 2. The plugin and extension principles are used when Pe.1.1.1. P 'Ie Po. With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plugin estimates for multinomial trials because I'j(8) where =L i=l k vfPi(8) = h(p(8» = viP. Suppose Po is a submodel of P and P is an element of P but not necessarily Po and suppose v: Po ~ T is a p~eter. This reasoning extends to the general i. However. Here x!. is called the sample median. (P) is called the population (2. ~ ~ . If v: P ~ T is an extension of v in the sense that v(P) = viP) on Po. ~ For a second example.' Nl = h N () ~ ~ = v(P) and P is the empirical distribution. For instance. case (Problem 2. = v(P).1. Here x ~ = median. the sample median V2(P) does not converge in probability to Ep(X). As stated. . in the multinomial examples 2. then Va. then v(P) is an extension (and plugin) estimate of viP). Remark 2.1. these principles are general.14. Let viP) be the mean of X and let P be the class of distributions of X = 0 + < where B E R and the distribution of E ranges over the class of distributions with mean zero. v(P8 ) = q(8) = h(p(8)) is a continuous map from to Rand D( P) = h(p) is a continuous map from P to R.d. In this case both VI(P) = Ep(X) and V2(P) = "median of P" satisfy v(P) = v(P).2.17) where F is the empirical dJ.Section 2. because when P is not symmetric.i.4. i=1 k /1j ~ = ~ LXI 1=1 I n .d. is a continuous map from = [0. casebut see Problem 2. ~ I ~ ~ Extension principle. P E Po. if X is real and P is the class of distributions with EIXlj < 00. then the plugin estimate of the jth moment v(P) = f.tj = E(Xj) in this nonparametric . The plugin and extension principles must be calibrated with the target parameter.13)..) h(p) ~ L vip. and v are continuous. but only VI(P) = X is a sensible estimate of v(P). = Lvi n i= 1 k . e e Remark 2.1.3 and 2.1.1. II to P.1. A natural estimate is the ath sample quantile .i. For instance. Pe as given by the HardyWeinberg p(O). context is the jth sample moment v(P) = xjdF(x) ~ 0. (P) is the ath population quantile Xa. they are mainly applied in the i. let Po be the class of distributions of X = B + E where B E R and the distribution of E ranges over the class of symmetric distributions with mean zero. 1/.1 Basic Heuristics of Estimation 105 V 1..1 I:~ 1 Xl.
It does turn out that there are "best" frequency plugin estimates. a 2 ) sample as in Example 1.) = ~ (0 + I). a special type of minimum contrast and estimating equation method. • . for large amounts of data. these estimates are likely to be close to the value estimated (consistency).I and 2X . .2 with assumptions (1)(4) holding. as we shall see in Section 2. the frequency of successes. If the model fits. Moreover. Suppose that Xl. We will make a selection among such procedures in Chapter 3.1. where Po is n. C' X n are ij. Suppose X I. ".2. = 0]. they are often difficult to compute.4.8) we are led by the first moment to the estimate.l [iX.1. The method of moments estimates of f1 and a 2 are X and 2 0 a. 2. Because we are dealing with (unrestricted) Bernoulli trials. and there are best extensions.. . Thus. This is clearly a foolish estimate if X (n) = max Xi> 2X . . " . Example 2.1 is a method of moments estimate of B.. X).4. What are the good points of the method of moments and frequency plugin? (a) They generally lead to procedures that are easy to compute and are..7. Plugin is not the optimal way to go for the Bayes. those obtained by the method of maximum likelihood. The method of moments can lead to either the sample mean or the sample variance. these are the frequency plugin (substitution) estimates (see Problem 2. .Ll). Unfortunately.5. See Section 2. valuable as preliminary estimates in algorithms that search for more efficient estimates. . For instance.LI (8) = (J the method of moments leads to the natural estimate of 8.1. we find I' = E. (X. or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. (b) If the sample size is large. Estimating the Size of a Population (continued). as we shall see in Chapter 3. When we consider optimality principles. there are often several method of moments estimates for the same q(9). Discussion. For example. .X). In Example 1. 0 = 21' .106 Methods of Estimation Chapter 2 Here are three further simple examples illustrating reasonable and unreasonable MOM estimates.4. I . the plugin principle is justified. X(l .1 because in this model B is always at least as large as X (n)' 0 As we have seen. However... therefore. . 'I I[ i['l i . X. because Po = P(X = 0) = exp{ O}. minimax. a frequency plugin estimate of 0 is Iogpo. then B is both the population mean and the population variance. Remark 2. Algorithms for their computation will be introduced in Section 2. This minimal property is discussed in Section 5. 0 Example 2.d. [ [ . •i . a saving grace becomes apparent in Chapters 5 and 6. To estimate the population variance B( 1 . O}. U {I. if we are sampling from a Poisson population with parameter B. Because f. we may arrive at different types of estimates than those discussed in this section. Example 2.6.3.3 where X" . )" I I • .5. X n is a N(f1. optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions. estimation of B real with quadratic loss and Bayes priors lead to procedures that are data weighted averages of (J values rather than minimizers of functions p( (J.. .1. .1. X n are the indicators of a set of Bernoulli trials with probability of success fJ.
For this contrast. is parametric and a vector q( 8) is to be estimated.. An extension ii of v from Po to P is a parameter satisfying v(P) = v(P). li) : I < i < n} with li independent and E(Yi ) = g((3. P.O) = O. 1 < j < k. 0).1 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS Least Squares and Weighted Least Squares Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement.. and the contrast estimating equations are V Op(X. then is called the empirical PIE. the associated estimating equations are called the normal equations and are given by Z1 Y = ZtZD{3. Method of moment estimates are empirical PIEs based on v(P) = (f. D(P) is called the extensionplug·in estimate of v(P). For the model {Pe : () E e} a contrast p is a function from X x H to R such that the discrepancy D(lJo. In the multinomial case the frequency plugin estimators are empirical PIEs based on v(P) = (Ph .2.g((3. where ZD = IIziJ·llnxd is called the design matrix. Zi). where Pj is the probability of the jth category. In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6. 0 E e. If P is an estimate of P with PEP. I < i < n.Pk). We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters. P E Po. a least squares estimate of {3 is a minimizer of p(X.. If P = PO.lj = E( XJ\ 1 < j < d. .. z) = ZT{3. For data {(Zi. . Let Po and P be two statistical models for X with Po c P. . .lJ) ~ E'o"p(X. where 9 is a known function and {3 E Rd is a vector of unknown regression coefficients.1 L:~ 1 I[X i E AI..Zi)j2. 0 E e c Rd is uniquely minimized at the true value 8 = 8 0 of the parameter.2 Minimum Contrast Estimates and Estimating Equations 107 Summary.. f. A minimum contrast estimator is a minimizer of p(X.ld)T where f. when g({3.) = q(O) and call v(P) a plugin estimator of q(6). When P is the empirical probability distribution P E defined by Pe(A) ~ n. It is of great importance in many areas of statistics such as the analysis of variance and regression theory.2 2.(3) ~ L[li ..Section 2. The plugin estimate (PIE) for a vector parameter v = v(P) is obtained by setting fJ = v( P) where P is an estimate of P. 6).ll. Suppose X . ~ ~ ~ ~ ~ v we find a parameter v such that v(p.. The general principles are shown to be related to each other.. ~ ~ ~ ~ 2.
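When g(β, z) = z^T β, the normal equations Z_D^T Y = Z_D^T Z_D β introduced above can be solved directly provided the design matrix has full rank; a minimal sketch follows, with simulated illustrative data and parameter values.

import numpy as np

rng = np.random.default_rng(5)

def least_squares(Z, Y):
    # solve the normal equations Z^T Y = Z^T Z beta (Z assumed to have full column rank)
    return np.linalg.solve(Z.T @ Z, Z.T @ Y)

# illustrative linear model y_i = beta_1 + beta_2 z_i + eps_i
n = 30
z = rng.uniform(0.0, 10.0, size=n)
Z = np.column_stack([np.ones(n), z])          # design matrix with an intercept column
beta_true = np.array([1.0, 2.0])
Y = Z @ beta_true + rng.normal(scale=0.5, size=n)

print("least squares estimate of beta:", least_squares(Z, Y))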
13) n n i=1  L Varp«.zn)f is 11. 1_' .2) Are least squares estimates still reasonable? For P in the semiparametric model P. < i < n. ..2.::'hc.) =0.z.f3d and is often applied in situations in which specification of the model beyond (2. .g(l3. z.E(Y. that is. which satisfies (but is not fully specified by) (2. . Y) ~ PEP = {Alljnintdistributions of(Z. . . then 13 has an interpretation as a parameter on p.2. Yi).i. (2.. where Ci = g(l3. 13 E R d }.zt}.=. as in (a) of Example 1. . " . . j.2.g(l3.»)2. Z)f.C=ha:::p::. (3)  E p p(X. we can compute still Dp(l3o.) i.2. I<i<n. 1 < i < n.4 with <j simply defined as Y. = 0. .Y) such thatE(Y I Z = z) = g(j3. .. I Zj = Zj).2.2. E«.2. For instance.2. E«. ..).) = g(l3... " (2..z.. i=l (2. . fj) because (2.) . The least squares method of estimation applies only to the parameters {31' . z. 13) = LIY. a 2 and H unknown. the model is semiparametric with {3. 13 = I3(P) is the miniml2er of E(Y .3) continues to be valid. If we consider this model. The contrast p(X.6) is difficult Sometimes z can be viewed as the realization of a population variable Z.) = 0.2. is modeled as a sample from a joint distribution. • .3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3.::. The estimates continue to be reasonable under the GaussMarkov assumptions. This follows "I .En) is any distribution satisfying the GaussMarkov assumptions. 1 < i < j < n 1 > 0\ .:0::d=':Of.4}{2.. Zn»)T is 1. This is frequently the case for studies in the social and biological sciences.z. Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ.d.g(l3. (Zi.o=nC. NCO.. as (Z.1.m::a. Note that the joint distribution H of (C}. 1 < j < n. . . Z could be educationallevel and Y income.(z..). aJ) and (3 ranges over R d or an open subset.. j. I) are i.:1.E:::'::'. (Z" Y.) + L[g(l3 o.1.i.z).g(l3.108_~ ~_ _~ ~ ~cMc. z. I' .d.g(l3.1.) = E(Y.2).::2 : In Example 2.6) Var( €i) = u 2 COV(fi.5) (2. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj. that is. That is.2.)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3.1 we considered the nonlinear (and linear) Gaussian model Po given by Y. and 13 ~ (g(l3.4) (2.:.) + <" I < i < n n (2.
j=1 (2. The linear model is often the default model for a number of reasons: (I) If the range of the z's is relatively small and p(z) is smooth.1. Zd. z) = zT (3. Yi) are a sample from a (d + 1)dimensional distribution and the covariates that are the coordinates of Z are continuous.4. is called the linear (multiple) regression modeL For the data {(Zi. Sections 6.1 the most commonly used 9 in these models is g({3. y)T has a nondegenerate multivariate Gaussian distribution Nd+1(Il.. (2. In this case we recognize the LSE {3 as simply being the usual plugin estimate (3(P)..zo).L{3jPj.3 and 6.2. j=O This type of approximation is the basis for nonlinear regression analysis based on local polynomials.8) d f30 = («(3" .. See also Problem 2.6).. we can approximate p(z) by p(z) ~ p(zo) +L j=1 d a: a (zo)(z . J for Zo an interior point of the domain. (2) If as we discussed earlier.Section 2. as we have seen in Section lA. n} we write this model in matrix fonn as Y = ZD(3 + €    where Zv = IIZij 11 is the design matrix. ..2.2.41. and Seber and Wild (1989).2.4.2. For nonlinear cases we can use numerical methods to solve the estimating equations (2. where P is the empirical distribution assigning mass n. = py .2 Minimum Contrast Estimates and Estimating Equations 109 from Theorem 1. i = 1.7). we are in a situation in which it is plausible to assume that (Zi. Yi). (2.5.1. which.7) (2. 2:). and Volnme n.4).. Fan and Gijbels (1996). a further modeling step is often taken and it is assumed that {Zll . in conjnnction with (2.9) . see Rnppert and Wand (1994). .5) and (2. . E1=1 g~ (zo)zo as an + 1) dimensional d p(z) = L(3jZj. . E(Y I Z = z) can be written as d p(z) where = (30 + L(3jZj j=1 (2. I)/.2. We can then treat p(zo)  nnknown (30 and identify (zo) with (3j to give an approximate (d 1 and Zj as before and linear model with Zo = t!.2.1). We continue our discussion for this important special case for which explicit fonnulae and theory have been derived.1.I to each of the n pairs (Zil Yi). In that case.2. As we noted in Example 2. {3d).
1967.2. Y is random with a distribution P(y I z).2. The nonnal equations are n i=l n ~(Yi .) ~ 0.8). 1 < i < n. < Methods of Estimation Chapter 2 =y  I'(Z) Eyz:Ez~Ezy. given Zi = 1 < i < n. the solution of the normal equations can be given "explicitly" by (2. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil. For this reason.2.(3zz. The parametrization is identifiable if and only ifZ D is of rank d or equivalently if Z'bZD is affulI rank d.il. d = 1.12) .2.1 '. we get the solutions (2. 1 64 4 71 5 54 9 81 11 76 13 23 77 93 23 95 28 109 The points (Zi' Yi) and an estimate of the line 131 + 132z are plotted in Figure 2. Example 2. . exists and is unique and satisfies the Donnal equations (2. il. In that case. we assume that for a given z. see Problem 2. 0 Ii. ~ = (lin) L:~ 1Yi = ii.2.1. We have already argued in Examp}e 2. necessarily.2.no Furthennore.1. if the parametrization {3 ) ZD{3 is identifiable. i I .il. we have a Gaussian linear regression model for Yi.2. L 1 that. whose solution is . il. Nine samples of soil were treated with different amounts Z of phosphorus. Estimation of {3 in the linear regression model.I 'i.il2Zi) = 0. In the measurement model in which Yi is the detennination of a constant Ih. Zi 1'. 139). ( (J2 2 ) distribution where = ayy Therefore.) = O. p. the relationship between z and y can be approximated well by a linear equation y = 131 + (32Z provided z is restricted to a reasonably small interval. g(z.) = il" a~.1. . g(z. Example 2.2. we will find that the values of y will not be the same. i=l (2. If we run several experiments with the same z using plants and soils that are as nearly identical as possible. LZi(Yi . Y is the amount of phosphorus found in com plants grown for 38 days in the different samples of soil.ecor and Cochran. Zi. the least squares estimate.25. I I. is independent of Z and has a N(O. Following are the results of an experiment to which a regression model can be applied (Sned. .ill .10) Here are some examples. For certain chemicals and plants. il. {3. We want to estimate f31 and {32.) = 1 and the normal equation is L:~ 1(Yi .11) When the Zi'S are not all equal.
. ... Geometrically.. .132 z ~ ~ (2. (zn.2.. then the regression line minimizes the sum of the squared distances to the n points (ZI' Yl). .. ..1. &atter plot {(Zi.42. i = 1. Y6). j=1 Then we are still dealing with a linear regression model because we can define WpXI . . if we measure the distance between a point (Zi. €6 is the residual for (Z6. . . 9p( z).(/31 + (32Zi)] are called the residuals of the fit..2... and 131 = fi . For instance.(a + bzi)l. that is p I'(z) ~ 2:). The regression line for the phosphorus data is given in Figure 2. ..1.1. . .(Z). n.. Yi).9. and fi = (lin) I:~ 1 Yi· The line y = /31 + f32z is known as the sample regression line or line of best fit of Yl. Yn).. The vertical distances €i = fYi . suppose we select p realvalued functions of z. 0 ~ ~ ~ ~ ~ Remark 2. Zn. The line Y = /31 + (32Z is an estimate of the best linear MSPE predictor Ul + b1Z of Theorem 1.Yi) and a line Y ~ a + bz vertically by di = Iy. . P > d and postulate that 1'( z) is a linear combination of 91 (z)..13) where Z ~ (lin) I:~ 1 Zi. i = 1.2. .2 Minimum Contrast Estimates and Estimating Equations 111 y 50 • o 10 20 x 30 Figure 22. n} and sample regression line for the phosphorus data.4. . 91.3. Here {31 = 61.Section 2. . The linear regression model is considerably more general than appears at first sight. This connection to prediction explains the use of vertical distance! in regression..9p.Yn on ZI.58 and 132 = 1.
2. Zi) = (2. Zi) = {3l + fhZi.wi 1 < .2. .2... z...2. 1 <i < n Yi n 1 .3.. Yi = 80 + 8 1Zi + 82 + Cisee Problem 2.. Weighted least squares.z. .2 and many similar situations it may not be reasonable to assume that the variances of the errors Ci are the same for all levels Zi of the covariate variable. That is... We return to this in Volume II. and g({3.'l are sufficient for the Yi and that Var(Ed. . Thus. [Yi ..= g({3. . Example 2. 1<i <n Wij = 9j(Zi). _ Yi .filii minimizes ~  I i i L[1Ii .8.wi) g((3. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic). We need to find the values {3l and /32 of (3.5) fails. which for given Yi = yd. . + /3zzi)J2 (2.= y.. Zi) + Ei . Zil = 1. < n. Zi2 = Zi.)]2 i=l i=l ~ n (2. Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter.wi  _ g((3. gp( z)) T as our covariate and consider the linear model Yi = where L j=l p ()jWij + Ci.. Zi)/. but the Wi are known weights.wi. we arrive at quadratic regression.wi.. fi = <d. 0 < j < 2.2.2. In Example 2.. if d = 1 and we take gj (z) = zj. i = 1. Note that the variables . we can write (2. then = WW2/Wi = .II 112 Methods of Estimation Chapter 2 (91 (z). I . For instance.24 for more on polynomial regression.zd + ti.Zi)]2 = L ~. .5). • I' "0 r .2.14) where (J2 is unknown as before.2. I and the Y i satisfy the assumption (2.17) .. Weighted Linear Regression. .. .... and (32 that minimize ~ ~ "I.16) as a function of {3. Consider the case in which d = 2. The method of least squares may not be appropriate because (2.g«(3. 1 n. iI L Vi[Yi i=l n ({3.2.. we may be able to characterize the dependence of Var( ci) on Zi at least up to a multiplicative constant.wi . The weighted least squares estimate of {3 is now the value f3.2. However.". if we setg«(3...g(.15) .
(Li~] Ui Z i)2 n 2 n (2. Y*) denote a pair of discrete random variables with possible values (z" Yl).1. . Yn) and probability distribution given by PI(Z".2. .4H2. Moreover. .2 Minimum Contrast Estimates and Estimating Equations 113 where Vi = l/Wi_ This problem may be solved by setting up analogues to the normal equations (2.6)..2. .Section 2. When ZD has rank d and Wi > 0. we may allow for correlation between the errors {Ei}. Var(Z") and  _ L:~l UiZiYi . we can write Remark 2. .8).18) ~ ~ ~I" (31 = E(Y') .. we find (Problem 2.2. the .13 minimizing the least squares contrast in this transformed model is given by (2. i i=l = 1. i ~ I.(32E(Z') ~ .B. . that weighted least squares estimates are also plugin estimates.26..2. By following the steps of Exarnple 2. Zi) = z.1. 1 < i < n..(3.27) that f3 satisfy the weighted least squares normal equations ~ ~ where W = diag(wI..n where Ui n = vi/Lvi. (3 and for general d. 0 Next consider finding the. wn ) and ZD = IIZijllnxd is the design matrix..Y.2.~ UiZi· n n i"~l I" n n F"l This computation suggests.3.(L:~1 UlYi)(L:~l uizd Li~] Ui Zi .2. Y*) V. .2.2. Let (Z*. We can also use the results on prediction in Section 1. More generally. when g(. suppose Var(€) = a 2W for some invertible matrix W nxn.n. Then it can be shown (Problem 2..B+€ can be transformed to one satisfying (2..28) that the model Y = ZD.1) and (2. ~ .2. z) = zT.2. . Thus.4 as follows.2.)] ~Ui.. If Itl(Z*) = {31 given by + {32Z* denotes a linear predictor of y* based on Z .16) for g((3.B.1 leading to (2. (Zn.Y") ~(Zi. as we make precise in Problem 2.2.19) and (2.4.B that minimizes (2.. . That is.~ UiYi . . then its MSPE is ElY" ~ 1l1(Z")f ~ :L UdYi i=l n «(3] + (32 Zi)1 2 It follows that the problem of minimizing (2.2.7). a__ Cov(Z"'.2..20). using Theorem 1.1 7) is equivalent to finding the best linear MSPE predictor of y .
.
.
.
.
2. Then the same calculation shows that if 2nl + nz and n2 + 2n3 are both positive. and as we have seen in Example 2. OJ = (1  0)' where 0 < () < 1 (see Example 2. Similarly.lx(0) = 5 1 0' .OJ'"..0)' < ° for all B E (0. ]n practice.ny l.4).27) that are not maxima or only local maxima.28) If 2n. and n3 denote the number of {Xl. then I • Lx(O) ~ p(l. . x n } equal to 1. the dual point of view of (2. . Because .l.2.. the MLE does not exist if 112 + 2n3 = O. which is maximized by 0 = 0.2. If we observe a sample of three individuals and obtain Xl = 1.6. p(2. so the MLE does not exist because = (0. . there may be solutions of (2.(1.2. Evidently. respectively. 0 e Example 2. 0) ~ 20(1 . Consider a popUlation with three kinds of individuals labeled 1. (2. . then X has a Poisson distribution with parameter nA. 2 and 3.22) and (2. If we make the usual simplifying assumption that the arrivals form a Poisson process. n2. + n. p(3. let nJ.2.2.O) = 0'. 1. 1). represents the expected number of arrivals in an hour or. X3 = 1. + n.118 Methods of Estimation Chapter 2 which again enables us to analyze the behavior of B using known properties of sums of independent random variables.2.. Here X takes on values {O.2. 0)p(2. which has the unique solution B = ~. 8' 80. the maximum likelihood estimate exists and is given by 8(x) = 2n. where ). p(X..x=O. 2n (2. situations with f) well defined but (2.0) The likelihood equation is 8 80 Ix (0) ~ = 5 1 0 10 =0.. the likelihood is {1 .1. X2 = 2. In general.27) doesn't make sense. . O)p(l. the rate of arrival. equivalently.2. x. I). . Here are two simple examples with (j real..>. Example 2. A is an unknown positive constant and we wish to estimate A using X.. 2.7. is zero.2. ~ maximizes Lx(B). Let X denote the number of customers arriving at a service counter during n hours.) = ·j1 e'"p. Nevertheless.0).29) '(j .27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section.5. } with probabilities. and 3 and occurring in the HardyWeinberg proportions p(I. 0) = 20'(1.
. let B = P(X. and must satisfy the likelihood equa~ons ~ 8B1x(fJ) = 8B 3 8 8 " n.32) We first consider the case with all the nj positive. is that l be concave in If l is twice differentiable.lx(B) <0.d. 8) = 0 if any of the 8j are zero. this is well known to be equivalent to ~ e. and k j=1 k Ix(fJ) = LnjlogB j . and the equation becomes (BkIB j ) n· OJ = . the MLE must have all OJ > 0. 80.. j=d (2.ekl with kl Ok ~ 1..2 Minimum Contrast Estimates and Estimating Equations 119 The likelihood equation is which has the unique solution). .kl.. 0 To apply the likelihood equation successfully we need to know when a solution is an MLE.logB.2. A sufficient condition.31 ) To obtain the MLE (J we consider l as a function of 8 1 . for an experiment in which we observe nj ~ 2:7 1 I[X. 8Bk/8Bj (2. j = 1. . J = I. k.30)).LBj j=1 • (2. = L.o. . . We assume that n > k ~ 1. Then. and let N j = L:~ 1 l[Xi = j] be the number of observations in the jth category.2. (see (2.2.1. Then p(x.. ..\ I 0. thus.. = x/no If x is positive.8. the maximum is approached as .32).Section 2.. A similar condition applies for vector parameters. 31=1 By (2..32) to find I.2.2.2. . this estimate is the MLE of ).6. fJ) = TI7~1 B7'. ~ ~ . ~ n . p(x.2.7.30) for all This is the condition we applied in Example 2. 8' (2. However. Example 2. trials in which each trial can produce a result in one of k categories. L. As in Example 1. consider an experiment with n Li. Multinomial Trials. = jl.6.2.n. Let Xi = j if the ith trial produces a result in the jth category.8B " 1=13 k k =0. e. = j) be the probability J of the jth category. If x = 0 the MLE does not exist.. fJE6= {fJ:Bj >O'LOj ~ I}. 8B.. familiar from calculus.
((g((3.30.. z.  seen and shall see more in Section 6. Then (} with OJ = njjn.g«(3.2..d.(Xi . See Problem 2. Suppose that Xl. Next suppose that nj = 0 for some j.. o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2. 0. i=l g«(3. version of this example will be considered in the exponential family case in Section 2.. we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n. are known functions and {3 is a parameter to be estimated from the independent observations YI . Thus. The 0 < OJ < 1... Suppose the model Po of Example 2. Zi). X n are Li.' . .9.3. Zi)) 00 ao n log 2 I"" 2 "2 (21Too) .2. ~ g((3.28. Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix. is still the unique MLE of fJ.120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8). we check the concavity of lx(O): let 1 1 <j < k . . . where Var(Yi) does not depend on i.1.2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood. .1.2 ) with J1. .34) g«(3. .IV. least squares estimates are maximum likelihood for the particular model Po. and 0. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3.. g«(3. n. (2.) = gi«(3).X)2 (Problem 2. Summary. Zi)J[l:} . ~ g«(3. i = 1.1).6. zn) 03W. YnJT. Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O)..2. 1 < j < k.2 both unknown.2. Using the concavity argument.1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV.. we can consider minimizing L:i. where E(V. ...). wi(5).11(a)).. 1 Yn .2. then <r <k I. . More generally. these estimates viewed as an algorithm applied to the set of data X make sense much more generally. Example 2.gi«(3) 1 .. 'ffi f.Zi)]2.2. N(lll 0. This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • .33) It follows that in this nj > 0. 1 < i < n. as maximum likelihood estimates for f3 when Y is distributed as lV. OJ > ~ 0 case.lV. gi. j = 1. .. As we have IIV. Then Ix «(3) log IT ~'P (V.1 holds and X = (Y" .1 L:~ . k.. zi)1 . see Problem 2.202 L. In Section 2.2.
though the results of Theorems 1. (m. 2 e Lemma 2.a < b< oo} U {(a. That is. See Problem 2.3 and 1. .00. for a sequence {8 m } of points from open.. (m.5.3.2. Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}.6. if X N(B I . Concavity also plays a crucial role in the analysis of algorithms in the next section. b). In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d.3. Let &e = be the boundary of where denotes the closure of in [00.oo}}. . z=1=1 ~ 2. Suppose 8 c RP is an open set.1 only. = RxR+ and ee e.zid. Suppose we are given a function 1: continuous. (a. b) : aER. or 8 1nk diverges with 181nk I .. For instance. Suppose also that e t R where e c RP is open and 1 is (2. For instance.6.1. including all points with ±oo as a coordinate. oo]P.4 and Corollaries 1. it is shown that the MLEs coincide with the LSEs.1. m. we define 8 1n . are given. In general. We start with a useful general framework and lemma.3. These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence. (a. Properties that derive solely fTOm concavity are given in Propositon 2. Formally.1) lim{l(li) : Ii ~ ~ De} = 00. m).1 and 1. b). e e I"V e De= {(a.00. which are appropriate when Var(Yj) depends on i or the Y's are correlated. 8). Extensions to weighted least squares. m.2 and other exponential family properties also playa role. (m.ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e.b) .I ) all tend to De as m ~ 00.Section 2.3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil. a8 is the set of points outside of 8 that can be obtained as limits of points in e. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2.bE {a.3. where II denotes the Euclidean norm. This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI. In Section 2. Proof. . in the N(O" O ) case.6. (h).2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. In the case of independent response variables Yi that are modeled to have a N(9i({3).1 ).a= ±oo.. (}"2) distribution.3.6.3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly. as k .
We show that if {11 m } has no subsequence converging to a point in E.3.)) > ~ (fx(Od +lx (0.12. a contradiction. If 0.)) 0 Themem 2.3. /fCT is the convex suppon ofthe distribution ofT (X).2) then the MLE Tj exists.3. Applications of this theorem are given in Problems 2. and 0.to. II) is strictly concave and 1. . From (B. Suppose X ~ (PII . . is unique. then Ix lx(O. which implies existence of 17 by Lemma 2. Suppose the cOlulitions o/Theorem 2. with corresponding logp(x. if lx('1) logp(x.1. Furthennore. i . .6. E. " .3. then the MLE doesn't exist and (2. h) and that (i) The natural parameter space. and is a solution to the equation (2. II). I I. = .1. are distinct maximizers.3.1. Let x be the observed data vector and set to = T(x). ~ Proof. We can now prove the following.9) we know that II lx(lI) is continuous on e. Existence and Uniqueness o/the MLE ij. '10) for some reference '10 E [ (see Problem 1. then the MLE8(x) exists and is unique. Corollary 2. we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) .3) I I' (b) Conversely. [ffunher {x (II) e = e e t 88.'1 .'1) with T(x) = 0. " We. ! " . thus. Proofof Theorem 2. U m = Jl~::rr' Am = .. ' L n . (ii) The family is of rank k. By Lemma 2.3. II E j.. (a) If to E R k satisfies!!) Ifc ! 'I" 0 (2.1. We give the proof for the continuous case. no solution.3.3. open C RP.2).).122 Methods of Estimation Chapter 2 Proposition 2.1. Write 11 m = Am U m . then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT. if to doesn't satisfy (2. lI(x) =  exists.1 hold.(11) ~ 00 as densities p(x.8 and 2. Suppose P is the canonical exponential/amity generated by (T.3. Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1. Without loss of generality we can suppose h(x) = pix.3.3.27). Then. is open.  (hO! + 0.1. then lx(11 m ) .3) has i 1 !. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data..00.3.3.
t u.Xn are i. The equivalence of (2. Case 1: Amk t 111]=11. Then. The Gaussian Model.2) and COTollary 2.1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2.9. u mk t u.i.3.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it. Por n = 1. which is impossible. if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows.Section 2. Suppose the conditions ofTheorem 2. existence of MLEs when T has a continuous ca. CT = R )( R+ FOT n > 2. iff. that is. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist.se density is a general phenomenon.k Lx( 11m. So.. Po[uTT(X) > 61 > O. In fact. Write Eo for E110 and Po for P11o . If ij exists then E'I T ~ 0 E'I(cTT) ~ 0. This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0. Theorem 2. I Xl) and 1.3.3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1.4. Suppose X"". this is the exponential family generated by T(X) = 1 Xi.J = 00.* P'IlcTT = 0] = I. Evidently.3. Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists. Then because for some 6 > 0.3. T(X) has a density and.3. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0.6.1 follow. thus. So we have Case 2: Amk t A. By (B.3. 0 ° '* '* PrOD/a/Corollary 2.1.3). .3. 0 (L:7 L:7 CJ. I' E R. 0 Example 2. It is unique and satisfies x (2. (1"2 > O. fOT all 1]. Then AU ¢ £ by assumption.3) by TheoTem 1.5.3. both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets. contradicting the assumption that the family is of rank: k.1.d. As we observed in Example 1. lIu m ll 00. N(p" ( 2 ).2) fails. . Nonexistence: if (2.3.2. So In either case limm. um~.6. CT = C~ and the MLE always exists. for every d i= 0.
Example 2. The TwoParameter Gamma Family. Thasadensity. we see that (2.9). I < j < k.5) have a unique solution with probability 1. the best estimate of 8.d. if T has a continuous case density PT(t).1.3 we know that E. How to find such nonexplicit solutions is discussed in Section 2.3.4) (2.3.. Example 2.4 we will see that 83 is. in a certain sense.n.3.6.I. Suppose Xl. The likelihood equations are equivalent to (problem 2. It is easy to see that ifn > 2. using (2. The statistic of rank k . = = L L i i bJ _ I . >...1. and one of the two sums is nonempty because c oF 0.7.2 that (2.3.3. x > 0. > O.i. We assume n > k .3.)..2) holds. Because the resulting value of t is possible if 0 < tjO < n.Xn are i.3.3. . This is a rank 2 canonical exponential family generated by T   = (L log Xi.4. with density 9p. They are determined by Aj = Tj In.. We follow the notation of Example 1. I <j < k.4) and (2.3.3. where Tj(X) L~ I I(Xi = j).1 that in this caseMLEs of"'j = 10g(AjIAk). if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO .3).1._t)T. thus.. by Problem 2. then and the result follows from Corollary 2. When method of moments and frequency substitution estimates are not unique. h(x) = XI. I < j < k.1.2. Here is an example.I and verify using Theorem 2. the maximum likelihood principle in many cases selects the «best" estimate among them.3.. 0 = If T is discrete MLEs need not exist.)e>"XxPI.1). We conclude from Theorem 2.1 in the second.6.ln. I < j < k. o Remark 2.6.2(b» r' log ..3. I <t< k . lit ~ . the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2.4. A nontrivial application of Theorem 2. In Example 3.T(X) = A(.3.3. To see this note that Tj > 0.3. Thus.4 and 2. but only Os is a MLE. LXi). From Theorem 1.13 and the next example).(X) = r~..I which generates the family is T(k_l) = (Tl>"" T. Thus..>. exist iff all Tj > O. I < j < k iff 0 < Tj < n.1..3. liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2. The boundary of a convex set necessarily has volume 0 (Problem 2. with i'. P > 0.124 Methods of Estimation Chapter 2 Proof.2.= r(jJ) A log X (2.2(a). where 0 < Aj PIX = j] < I. Multinomial Trials. For instance.5) ~=X A where log X ~ L:~ t1ogXi.3. in the HardyWeinberg examples 2.2 follows. .
..2.3 can be applied to determine existence in cases for which (2. then it is the unique MLE ofB. Then Theorem 2. n > kl. when we put the multinomial in canonical exponential family form.6.3.. if 201 + 0. Let Q = {PII : II E e).B) ~ h(x)exp LCj(B)Tj(x) .13).2. 1 < i < k . mxk e. Unfortunately strict concavity of Ix is not inherited by curved exponential families. our parameter set is open. x E X.8 we saw that in the multinomial case with the clQsed parameter set {Aj : Aj > 0. In Example 2.A(c(ii)) ~ = O.3. BEe.1).A(c(II))}h(x) Suppose c : 8 ~ [ C R k has a differential (2. Let CO denote the interior of the range of (c. (II) .3.1 we can obtain a contradiction to (2. . Ck (B)) T and let x be the observed data.1] it does exist ~nd is unique. Alternatively we can appeal to Corollary 2.3. The following result can be useful. lfP above satisfies the condition of Theorem 2.8see Problem 2. for example.B(B) j=l .1.1 and Haberman (1974). (B).6. . The remaining case T k = a gives a contradiction if c = (1.1. note that in the HardyWeinberg Example 2. L. 0 The argument of Example 2.1 directly D (Problem 2. if any TJ = 0 or n. the MLEs ofA"j = 1. the MLE does not exist if 8 ~ (0.3.3. and unicity can be losttake c not onetoone for instance.2. Corollary 2. c(8) is closed in [ and T(x) = to satisfies (2.k. the bivariate normal case (Problem 2.7) Note that c(lI) E c(8) and is in general not ij. = 0.3 Maximum likelihood in Multiparameter Exponential Families 125 On the other hand.2) so that the MLE ij in P exists. (2. the following corollary to Theorem 2. When P is not an exponential family both existence and unicity of MLEs become more problematic.10). Consider the exponential family k p(x. be a cUlVed exponential family p(x. 0 < j < k . whereas if B = [0. . .6) on c( II) = ~~. m < k .1 is useful. .3. then so does the MLE II in Q and it satisfies the likelihood equation ~ cT(ii) (to . In some applications. However.3.3) does not have a closedform solution as in Example 1. Similarly.3.3.~l Aj = I}. .1. .3.. If the equations ~ have a solution B(x) E Co.Section 2. 1)T.3. h).. Remark 2.lI) ~ exp{cT (II)T(x) .3.exist and are unique.2) by taking Cj = 1(i = j).3. e open C R=.3..3.I.1. Here [ is the natural paramerer space of the exponential fantily P generated by (T.
~Yn.3)T.. 9) is a curved exponential family of the form (2..: Note that 11+11_ = '6M2 solution we seek is ii+.9. as in Example 1. m m m f?. tLi = (}1 + 82 z i . < O}. l = 1. . 11. . I .7) if n > 2. We find CI Example 2.n(it' + )"5it'))T which with Ji2 = n.it.10... This is a curved exponential ¥.gx±). m} is a 2nparameter canonical exponential family with 'TJi = /tila.. suppose Xl..ryl > O.0'2) with Jl/ a = AO > a known./2'1' . ry. . j = 1. Jl > 0. with I.. Now p(y. Because It > 0.2.... 1 n..5 and 1. i = 1. 'TJn+i = 1/2a. where 0l N(tLj 1 0'1).'. i Example 2.it. t. which is closed in E = {( ryt> ry.7) becomes = 0. af = f)3(8 1 + 82 z i )2.126 The proof is sketched in Problem 2. we can conclude that an MLE Ii always exists and satisfies (2. generated by h(Y) = 1 and "'> I T(Y) '.3. = ~ryf. < O}.6.i. the o q... l}m.5X' +41i2]' < 0. that . corresponding to 1]1 = //x' 1]2 = .3. .6..".6.~Y'1.n.. L: Xi and I.A6iL2 = 0 Ii.2 .. . .nit.'' . Ii± ~ ~[). .Zn are given constants. . = ( ~Yll'''' .3.6) with " • ..oJ). '() A 1/ Thus. . which implies i"i+ > 0.d. Using Examples 1.10..Xn are i.2 and 2. . are n independent random samples. E R. .2~~2 .1/'1') T . LocationScale Regression.5. = L: xl.3. ). '1. Next suppose. Evidently c(8) = {(lh...6.. Methods of Estimation Chapter 2 family with (It) = C2 (It) = . As a consequence of Theorems 2. .6.3 )(t.) : ry.4.5 = . simplifies to Jl2 + A6XIl. .11. < Zn where Zt.< O. As in Example 1.I LX.ry.ry. we see that the distribution of {01 : j = 1.3.3.) : ry. Gaussian with Fixed Signal to Noise. Zl < .g( it' . .!0b''.?' m)T " ". N'(fl.\()'. C(O) and from Example 1.3. n..3. Suppose that }jI •. Equation (2.3.. . = 1 " = 2 n( ryd'1' .
If m ≥ 2, then the full 2n-parameter model satisfies the conditions of Theorem 2.3.1. Let E be the canonical parameter set for this full model and let Θ = {θ : θ1 ∈ R, θ2 ∈ R, θ3 > 0}. Then c(Θ) is closed in E and we can conclude that for m ≥ 2 an MLE θ̂ of θ exists and satisfies (2.3.7). □

Summary. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with E open (Theorem 2.3.1). These results lead to a necessary condition for existence of the MLE in curved exponential families, but without a guarantee of unicity or sufficiency. Finally, the basic property making Theorem 2.3.1 work, strict concavity, is isolated and shown to apply to a broader class of models.

2.4 ALGORITHMIC ISSUES

As we have seen, even in the context of canonical multiparameter exponential families such as the two-parameter gamma, MLEs may not be given explicitly by formulae but only implicitly as the solutions of systems of nonlinear equations. In fact, even in the classical regression model with design matrix ZD of full rank, the formula (2.1.10) for β̂ is easy to write down symbolically but not easy to evaluate if d is at all large, because inversion of Z'D ZD requires on the order of nd² operations to evaluate each of d(d + 1)/2 terms with n operations to get Z'D ZD and then, if implemented as usual, order d³ operations to invert. The packages that produce least squares estimates do not in fact use formula (2.1.10).

It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. However, in this section, we will discuss three algorithms of a type used in different statistical contexts, both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all, at various times, entrust ourselves. We begin with the bisection and coordinate ascent methods, which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2.3.1.

2.4.1 The Method of Bisection

The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in k-parameter exponential families. Given f continuous on (a, b), f ↑ strictly, with f(a+) < 0 < f(b−), then, by the intermediate value theorem, there exists a unique x* ∈ (a, b) such that f(x*) = 0. Here, in pseudocode, is the bisection algorithm to find x*.

Given tolerance ε > 0 for |x_final − x*|: Find x0 < x1 with f(x0) < 0 < f(x1) by taking |x0|, |x1| large enough. Initialize x⁻_old = x0, x⁺_old = x1.
Then:
(1) If |x⁺_old − x⁻_old| < 2ε, set x_final = ½(x⁺_old + x⁻_old) and return x_final.
(2) Else, x_new = ½(x⁻_old + x⁺_old).
(3) If f(x_new) = 0, x_final = x_new.
(4) If f(x_new) < 0, x⁻_old = x_new.
(5) If f(x_new) > 0, x⁺_old = x_new. Go to (1).
End

Lemma 2.4.1. The bisection algorithm stops at a solution x_final such that |x_final − x*| < ε.

Proof. Let x_m be the mth iterate of x_new. At every stage x⁻_old < x* < x⁺_old, and each pass through steps (2)–(5) halves the length of the bracketing interval. Therefore,
|x_m − x*| ≤ 2^{−m}|x_1 − x_0|,
and x_m → x* as m → ∞. In particular, for m ≥ log₂(|x_1 − x_0|/ε) we have |x_m − x*| < ε, and the same bound holds for x_final. □

If desired, one could evidently also arrange it so that, in addition, |f(x_final)| < ε.

From this lemma we can deduce the following.

Theorem 2.4.1. Let p(x | η) be a one-parameter canonical exponential family generated by (T, h), satisfying the conditions of Theorem 2.3.1, and suppose T(x) = t0 ∈ C⁰_T, the interior (a, b) of the convex support of P_T. Then the MLE η̂, which exists and is unique by Theorem 2.3.1, may be found (to tolerance ε) by the method of bisection applied to
f(η) = E_η T(X) − t0.

Proof. By the differentiation properties of A(η) established in Section 1.6, f'(η) = Var_η T(X) > 0 for all η, so that f is strictly increasing and continuous. Because t0 lies in the interior (a, b) of the convex support of P_T, f takes both negative and positive values on the natural parameter space, and its unique zero is the MLE η̂; the result now follows from Lemma 2.4.1. □

Example 2.4.1. The Shape Parameter Gamma Family. Let X1, . . . , Xn be i.i.d. with the gamma, Γ(θ, 1), density x^{θ−1}e^{−x}/Γ(θ), x > 0, θ > 0.
4 Algorithmic Issues 129 Because T(X) = L:~' equation I log Xi has a density for all n the MLE always exists. It solves the r'(B) r(B) T(X) n which by Theorem 2. but as we shall see. d and finally 1] _ ~Il) _ =1] (~1 1]1"".1]2. we would again set a tolerenceto be. it is in fact available to high precision in standard packages such as NAG or MATLAB. Here is the algorithm. The case k = 1: see Theorem 2.4.4.'TJk =tk' ) . eventually.4. r > 1. The function r(B) = x'Iexdx needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method. for a canonical kparameter exponential family. 0 J:: 2. always converges to Ti.. Notes: (1) in practice.···.'fJk· Repeat. getting 1j 1r).1.···. However.2 Coordinate Ascent The problem we consider is to solve numerically. E1/(T(X» = A(1/) = to when the MLE Tj = Tj(to) exists. In fact.oJ _ = (~1 "" {}) ~02 _ 1]1.'fJk' ~1) an soon. for each of the if'. This example points to another hidden difficulty. in cycle j and stop possibly in midcycle as soon as <I< .Section 2. say c. which is slow. bisection itself is a defined function in some packages. .'TJk. 1 k. The general case: Initialize ~o 1] = ("" 'TJll···' "") • 'TJk Solve ~1 f or1]k: Set 1] 8'T]k [) A(~l ~1 1]ll1'J2.1 can be evaluated by bisection. 1] ~Ok = (~1 ~1 {} "") 1]I.· .'TJ3.1]2.
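To make the shape-parameter gamma example concrete, here is a sketch of the bisection recipe of Theorem 2.4.1 applied to the equation Γ'(θ)/Γ(θ) = n⁻¹Σ log Xi. It assumes SciPy's digamma function is available (the text's point that Γ itself must be evaluated numerically is delegated to that routine), the names are ours, and the bracket-widening loop plays the role of "take |x0|, |x1| large enough."

```python
import numpy as np
from scipy.special import digamma   # Gamma'(theta) / Gamma(theta)

def gamma_shape_mle(x, tol=1e-8):
    """Bisection solution of digamma(theta) = mean(log X) for the shape parameter."""
    target = np.mean(np.log(x))
    f = lambda theta: digamma(theta) - target   # strictly increasing in theta
    lo, hi = 1e-8, 1.0
    while f(hi) < 0:                 # widen the bracket until the sign changes
        lo, hi = hi, 2.0 * hi
    while hi - lo > 2.0 * tol:       # steps (1)-(5) of the pseudocode above
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.gamma(shape=3.0, scale=1.0, size=500)
    print(gamma_shape_mle(sample))   # should land near 3
```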
. We pursue this discussion next.5). .1j{ can be explicit.pO) = We now use bisection ' r' .4..=1 d j = k and the problem of obtaining ijl(tO. 1j(r) ~ 1j. .2. (3) I (71j) = A for all j because the sequence oflikelihoods is monotone. 71 1 is the unique MLE. = 7Jk. This twodimensional problem is essentially no harder than the onedimensional problem of Example 2.. they result in substantial savings of time. 'tn. (Wn') ~ O. fI.2.3. Continuing in this way we can get arbitrarily close to 1j. Whenever we can obtain such steps in algorithms. 0 Example 2. the log likelihood. Then each step of the iteration both within cycles and from cycle to cycle is quick.. (2) The sequence (iji!. ij as r t 00.. Let 1(71) = tif '1. I(W ) = 00 for some j.2: For some coordinates l. (I) l(ij'j) Tin j for i fixed and in. 0 ~ (4) :~i (77') = 0 because :~. . j (5) Because 1]1..4.4. Else lim.. The TwoParameter Gamma Family (continued).2. TIl = . _A (1»). x ink) ( '~ini .2.1J k) . ijik) has a convergent subsequence in t x .'. To complete the proof notice that if 1j(rk ) is any subsequence of 1j(r) that converges to ij' (say) then. 1 < j < k. (3) and (4) => 1]1 .I ~(~ (1) 1 to get V. A(71') ~ to.3. The case we have C¥. Here we use the strict concavity of t. limi. ..\(0) = ~.'11 11 t t (I . Therefore.4. refuses to converge (in fI space!)see Problem 2.I) solving r0'.) where flj has dimension d j and 2::. For n > 2 we know the MLE exists. (fij(e) are as above. . Because l(ijI) ~ Aand the MLE is unique.2.1 hold and to E 1j(r) t g:.. in fact.3. is computationally explicit and simple. Consider a point we noted in Example 2.. 1J ! But 71j E [. ij = (V'I) .2'  It is natural to ask what happens if.···) C!J. Suppose that this is true for each l.1.··.1 because the equa~ tion leading to Anew given bold' (2. as it should. W = iji = ij.. that is.. is the expectation of TI(X) in the oneparameter exponential family model with all parameters save T1I assumed known.130 Methods of Estimation Chapter 2 (2) Notice that (iii. by (I)... We note some important generalizations. . the MLE 1j doesn't exist. (i). I • . ilL" iii.. ff 1 < j < k.. j ¥ I) can be solved in closed form. Bya standard argument it follows that.I)) = log X + log A 0) and then A(1) = pY.A(71) + log hex).l(ij') = A. Hence.? Continuing. Theorem 2. rp differ only in the second coordinate. the algorithm may be viewed as successive fitting of oneparameter families. . . Thus. ij'j and ij'(j+I) differ in only one coordinate for which iji(j+1) maximizes l.. We can initialize with the method 2 of moments estimate from Example 2..4. 'II. . (ii) a/Theorem 2.. Proof. Suppose that we can write fiT = (fir. to ¢ Fortunately in these cases the algorithm. We give a series of steps. 71J.j l(i/i) = A (say) exists and is > 00. (6) By (4) and (5). We use the notation of Example 2..
2 has a generalization with cycles of length r.. . Then it is easy to see that Theorem 2.Section 2. for instance. .4. values of (B 1 .. Feinberg.4. that is.1. Solve g~: (Bt. .. BJ+l' .4.p.. A special case of this is the famous DemingStephan proportional fitting of contingency tables algorithmsee Bishop.4 Algorithmic Issues 131 just discussed has d 1 = . 1' B . If 8(x) exists and Ix is differentiable. The coordinate ascent algorithm.4. each of whose members can be evaluated easily.10. is strictly concave. The graph shows log likelihood contours.B~) = 0 by the method of j bisection in B to get OJ for j = 1. The coordinate ascent algorithm can be slow if the contours in Figure 2.. Next consider the setting of Proposition 2.4. r = k. At each stage with one coordinate fixed. the method extends straightforwardly.B2 )T where the log likelihood is constant. which we now sketch. Figure 2.92.3. See also Problem 2. the log likelihood for 8 E open C RP. Change other coordinates accordingly. .4. e ~ 3 2 I o o 1 2 3 Figure 2. 1 B.. and Holland (1975). .1 illustrates the j process.1 in which Ix(O). iterate and proceed. and Problems 2. = dr = 1.4..7.1 are not close to sphericaL It can be speeded up at the cost of further computation by Newton's method. find that member of the family of contours to which the vertical (or horizontal) line is tangent..
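As a concrete instance of the cyclic scheme, the following sketch (assuming SciPy for the digamma function; the names and starting values are ours) runs coordinate ascent for the two-parameter gamma family of Example 2.4.2: with λ held fixed, p solves the monotone one-dimensional equation Γ'(p)/Γ(p) = log λ + n⁻¹Σ log Xi by bisection, and with p held fixed the λ-step is available in closed form, λ = p/X̄.

```python
import numpy as np
from scipy.special import digamma

def solve_increasing(f, lo=1e-8, hi=1.0, tol=1e-10):
    """Bisection for a strictly increasing function with a sign change."""
    while f(hi) < 0:                       # widen the bracket as needed
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def gamma_mle_coordinate_ascent(x, n_cycles=200, tol=1e-9):
    """Cyclic maximization of the two-parameter gamma likelihood in (p, lambda)."""
    x = np.asarray(x, dtype=float)
    xbar, logbar = x.mean(), np.mean(np.log(x))
    p, lam = 1.0, 1.0 / xbar               # crude starting values
    for _ in range(n_cycles):
        # p-step: solve digamma(p) = log(lambda) + mean(log X), as in (2.3.4)
        p_new = solve_increasing(lambda q: digamma(q) - (np.log(lam) + logbar))
        # lambda-step: closed form from (2.3.5), lambda = p / X-bar
        lam_new = p_new / xbar
        done = abs(p_new - p) < tol and abs(lam_new - lam) < tol
        p, lam = p_new, lam_new
        if done:
            break
    return p, lam

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sample = rng.gamma(shape=2.5, scale=2.0, size=1000)   # true p = 2.5, lambda = 0.5
    print(gamma_mle_coordinate_ascent(sample))
```

Each cycle consists of two exact one-dimensional maximizations, so the likelihood cannot decrease from cycle to cycle; in practice a handful of cycles suffices.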
to). A onedimensional problem 7 . When likelihoods are noncave. this method is known to converge to ij at a faster rate than coordinate ascentsee Dahlquist.4.2) The rationale here is simple. if l(B) denotes the log likelihood. 1 l(x. Newton's method also extends to the framework of Proposition 2. Here is the method: If 110 ld is the current value of the algorithm. In this case.6.2) gives (2. is the NewtonRaphson method. We return to this property in Problem 6. _. B)}F(Xi. Let Xl.k I (iiold)(A(iiold) . iinew after only one step behaves approximately like the MLE. methods such as bisection.4.. then ii new = iiold . which may counterbalance its advantage in speed of convergence when it does converge. though there is a distinct possibility of nonconvergence or convergence to a local rather than global maximum.3 The NewtonRaphson Algorithm An algorithm that. A hybrid of the two methods that always converges and shares the increased speed of the NewtonRaphson method is given in Problem 2.and lefthand sides. If 110 ld is close enough to fj. Bjork. X n be a sample from the logistic distribution with d.3.4.7. (2.3) Example 2.4.B) We find exp{ (x . If 7J o ld is close to the root 'ij of A(ij) expanding A(ij) around 11old' we obtain = to. then by f1new is the solution for 1] to the approximation equation given by the right.B) i=l 1 1 2 L i=l n I(X" B) < O. and NewtonRaphson's are still employed.4.4.3. in general. F(x. B) The density is = [I + exp{ (x = B) WI. the argument that led to (2..1. and Anderson (1974). can be shown to be faster than coordinate ascent. I 1 The NewtonRaphson method can be implemented by taking Bold X.132 Methods of Estimation Chapter 2 2.  I o The NewtonRaphson algorithm has the property that for large n. when it converges.10. coordinate ascent. This method requires computation of the inverse of the Hessian.B)} [i+exp{(xB)}j2' n 1 l(B) l(B) n2Lexp{(X.f.
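A sketch of the Newton–Raphson iteration for the logistic location example: starting from θ_old = X̄, iterate θ ← θ − l'(θ)/l''(θ), where l'(θ) = n − 2Σ exp{−(Xi − θ)}F(Xi, θ), equivalently 2ΣF(Xi, θ) − n, and l''(θ) = −2Σf(Xi, θ). NumPy is assumed; the function name and illustrative data are ours.

```python
import numpy as np

def logistic_location_mle(x, n_iter=50, tol=1e-10):
    """Newton-Raphson iteration of type (2.4.3) for the logistic location MLE."""
    x = np.asarray(x, dtype=float)
    n = x.size
    theta = x.mean()                              # theta_old = X-bar, as suggested
    for _ in range(n_iter):
        F = 1.0 / (1.0 + np.exp(-(x - theta)))    # F(X_i, theta)
        f = F * (1.0 - F)                         # logistic density at X_i - theta
        score = 2.0 * F.sum() - n                 # l'(theta)
        hessian = -2.0 * f.sum()                  # l''(theta) < 0: l is strictly concave
        step = score / hessian
        theta -= step
        if abs(step) < tol:
            break
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    sample = rng.logistic(loc=1.5, scale=1.0, size=400)
    print(logistic_location_mle(sample))          # should land near 1.5
```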
Po[X = (0. Yet the EM algorithm.. Say there is a closedfonn MLE or at least Ip. 1 8 m are not Xi but (€il. e = Example 2. For detailed discussion we refer to Little and Rubin (1987) and MacLachlan and Krishnan (1997). 0 leads us to an MLE if it exists in both cases.€i3).0.4. is not X but S where = 5i 5i Xi. Many examples and important issues and methods are discussed.4. X ~ Po with density p(x. Bjork. This could happen if. There are ideal observations.4.6. and its main properties. (2.0)2. Lumped HardyWeinberg Data. Petrie. the homozygotes of one type (€il = 1) could not be distinguished from the heterozygotes (€i2 = 1).4. 1.B) + 2E'310g(1 .OJ] i=l n (2.2.4) Evidently. I)] (1 . As in Example 2..4 The EM (Expectation/Maximization) Algorithm There are many models that have the following structure.0 E c Rd Their log likelihood Ip.. in Chapter 6 of Dahlquist. Xi = (EiI. where Po[X = (1. for some individuals. though an earlier general form goes back to Baum..0)] ~ 28(1 . i = 1. and Anderson (1974).n. a)] = B2. . The log likelihood of S now is 1". €i2 + €i3).x(B) is "easy" to maximize.4. but the computation is clearly not as simple as in the Original HardyWeinberg canonical exponential family example.13.5) + ~ [(EiI+Ei2)log(1(10)')+2Ei310g(IB)] i=m+l a function that is of curved exponential family fonn. Soules. m+ 1 <i< n. . however. 0). If we suppose (say) that observations 81. We give a few examples of situations of the foregoing type in which it is used. It does tum out that in this simplest case an explicit maximum likelihood solution is still possible. Po[X (0. . . 1 <i< m (€i1 +€i2. Ei3). for instance.4 Algorithmic Issues 133 in which such difficulties arise is given in Problem 2. 0) where I". S = S(X) where S(X) is given by (2. Laird. (0) = log q(s. the rest of X is "missing" and its "reconstruction" is part of the process of estimating (} by maximum likelihood. What is observed. difficult to compute. Here is another important example. be a sample from a population in HardyWeinberg equilibrium for a twoallele locus.4. we observe 5 5(X) ~ Qo with density q(s. E'2. 0.4. A fruitful way of thinking of such problems is in terms of 8 as representing part of X.Section 2.4). 0) is difficultto maximize. with an appropriate starting point. 2.(0) })2EiI 10gB + Ei210g2B(1 . a < 0 < I. and Rubin (977).x(B) is concave in B. the function is not concave. then explicit solution is in general not lXJssible. The algorithm was fonnalized with many examples in Dempster.B). let Xi. and Weiss (1970). and so on. Unfortunately. A prototypical example folIows.
Bo) _ I S(X) = s ) (2.7) ii.B) 0=00 = E.<T.4.11).o log p(X.tz E Rand 'Pu (5) = . + (1  A.12) q(s. I. Mixture of Gaussians. o (:B 10gp(X. the Si are independent with L()(Si I . = 11 = . including the examples.8). . Thus.af) or N(J12.4. differentiating and exchanging E oo and differentiation with respect i. a local maximum close to the true 8 0 turns out to be a good "proxy" for the 0 nonexistent MLE. j.\"'u.. p(s. are independent identically distributed with p() [A. (P(X. I . Let ..9) I for all B (under suitable regularity conditions). A.6).. If this is difficul~ the EM algorithm is probably not suitable. that under fJ.)<T~).. .(f~).i = 1. .4.. Although MLEs do not exist in these models.4.Sn is a sample from a population P whose density is modeled as a mixture of two Gaussian densities. EM is not particularly appropriate. reset Bold = Bnew and repeatthe process.4.B) I S(X) = s) 0=00 (2. The second (M) step is to maximize J(B I Bold) as a function of B. (P(X. which we give for 8 real and which can be justified easily in the case that X is finite (Problem 2. J.4. the M step is easy and the E step doable. Suppose that given ~ = (~11"" ~n).(sI'z)where() = (.<T. :B 10gq(s. B) = (1..134 Methods of Estimation Chapter 2 Example 2.). The log likelihood similarly can have a number of local maxima and can tend to 00 as e tends to the boundary of the parameter space (Problem 2.2)andO <.0'2 > 0.4.4. The EM algorithm can lead to such a local maximum.8) by o taking logs in (2.Bo) and (2. • • i ! I • .4.12). we have given.4.\ = 1. . if this step is difficult. The EM Algorithm. Again.\. This fiveparameter model is very rich pennitting up to two modes and scales. = 01.\)"'u. Here is the algorithm. .4) = L()(Si I Ai) = N(Ail'l + (1. O't.5.\ < 1.ll.Ai)I'Z. Initialize with Bold = Bo· Tbe first (E) step of the algorithm is to compute J(B I Bold) for as many values of Bas needed.B) =E.p() [A. I That is. Then we set Bnew = arg max J(B I Bold).Bo) 0 p(X. Note that (2. The rationale behind the algorithm lies in the following formulas. S has the marginal distribution given previously. As we shall see in important situations.(I'. . we can think of S as S(X) where X is given by (2. i. It is not obvious that this falls under our scheme but let (2.I. Suppose 8 1 . It is easy to see (Problem 2.B) IS(X)=s) q(s.B) J(B I Bo) ~ E.6) where A. .8) .4. . I' Hi where we suppress dependence on s.p (~). I:' . ~i tells us whether to sample from N(Jil.4.9) follows from (2.(sl'd +.
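The two-component Gaussian mixture example lends itself to the E- and M-steps just described. The following sketch (NumPy assumed; the starting values and the variance floor are our choices, the latter guarding against the degenerate boundary solutions noted above, where the likelihood is unbounded) alternates computing the posterior probability that Δi = 1 given Si with weighted mean and variance updates.

```python
import numpy as np

def normal_pdf(s, mu, sigma2):
    return np.exp(-0.5 * (s - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def gaussian_mixture_em(s, n_iter=500, tol=1e-8):
    """EM for the mixture lambda*N(mu1, sig1^2) + (1 - lambda)*N(mu2, sig2^2)."""
    s = np.asarray(s, dtype=float)
    lam, mu1, mu2 = 0.5, np.quantile(s, 0.25), np.quantile(s, 0.75)
    sig1 = sig2 = s.var()
    floor = 1e-6 * s.var()                        # keeps iterates off the boundary
    old_loglik = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities P(Delta_i = 1 | S_i) under the current parameters
        p1 = lam * normal_pdf(s, mu1, sig1)
        p2 = (1.0 - lam) * normal_pdf(s, mu2, sig2)
        gamma = p1 / (p1 + p2)
        # M-step: weighted means and variances maximize J(theta | theta_old)
        lam = gamma.mean()
        mu1 = np.sum(gamma * s) / np.sum(gamma)
        mu2 = np.sum((1.0 - gamma) * s) / np.sum(1.0 - gamma)
        sig1 = max(np.sum(gamma * (s - mu1) ** 2) / np.sum(gamma), floor)
        sig2 = max(np.sum((1.0 - gamma) * (s - mu2) ** 2) / np.sum(1.0 - gamma), floor)
        loglik = np.sum(np.log(p1 + p2))          # log likelihood at the old parameters
        if loglik - old_loglik < tol:
            break
        old_loglik = loglik
    return lam, mu1, sig1, mu2, sig2
```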
h) satisfying the conditions a/Theorem 2. ~ E '0 (~I ogp(X .4 Algorithmic Issues 135 to 0 at 00 .4.1. However.Onew) E'Old { log r(X I 8. 0 ) +E. Id log (X I . Because. r(Xj8.12) = s. J(Onew I 0old) > J(Oold I 0old) = 0 by definition of Onew.0) = O.4. Suppose {Po : () E e} is a canonical exponential family generated by (T. O n e w ) } log ( " ) ~ J(Onew I 0old) .13) iff the conditional distribution of X given S(X) forOnew asfor 00ld and Onew maximizes J(O I 0old)' =8 is the same Proof. 0 0 = 0old' = Onew. 00ld are as defined earlier and S(X) (2.4.10) [)J(O I 00 ) DO it follows that a fixed point 0 of the algorithm satisfies the likelihood equation.14) whete r(· j ·. Onew) > q(s. DJ(O I 00 ) [)O and.0 ) I S(X) ~ 8 . 0old)' Equality holds in (2.1.0) } ~ log q(s.17) o The most important and revealing special case of this lemma follows.4.4.0) (2. In the discrete case we appeal to the product rule.4. (2.3. We give the proof in the discrete case.13) q(8.lfOnew. uold Now.3." ) I S(X) ~ 8 . Lemma 2.4. Let SeX) be any statistic. the result holds whenever the quantities in J(O I (0 ) can be defined in a reasonable fashion. Then (2. then .O) { " ( X I 8.O)r(x I 8.4.1.Section 2.11)  D IOgq(8.2. hence. q S. formally. (2.00Id) by Shannon's ineqnality.4. 0) I S(X) = 8 ) DO (2.15) q(s. DO The main reason the algorithm behaves well follows.4. Fat x EX.(X 18.o log . Lemma 2. On the other hand..4.E. S (x) ~ 8 p(x. I S(X) ~ s > 0 } (2.16) ° q(s. uold 0 r s.Onew) {r(X I 8 .0) is the conditional frequency function of X given S(X) J(O I 00 ) If 00 = s. (2. 0) ~ q(s. Theorem 2.
4. i i. Part (b) is more difficult.4.4.4.24) " ! I .BO)TT(X) . I .4.19) = Bnew · If a solution of(2.136 (a) The EM algorithm consists of the alternation Methods of Estimation Chapter 2 A(Bnew ) = Eoold(T(X) I S(X) Bold ~ = s) (2. Now. . . (2.(A(O) .= +Eo ( t (2<" + <i') I <it + <i'. i=m. that.(x). In this case.A('1)}h(x) (2. h(x) = 2 N i . = 11 <it +<" = IJ.I<j<2 I r r: I I r! Pol<it 11 <it + <i' = 11 0' B' B' + 20(1 .21) Part (a) follows. m + 1 < i < n) .4.18) (2. 1 I • • Po I<ij = 1 I <it + <i' = OJ =  0.+l (2. 0 Example 2.4.Pol<i.18) exists it is necessarily unique. after some simplification. Proof.4 (continued).0)' 1. A proof due to Wu (I 983) is sketched in Problem 2.Bof Eoo(T(X) I S(X) = y) .4. I • Under the assumption that the process that causes lumping is independent of the values of the €ij.4.4. 1 <j < 3. X is distributed according to the exponential family I where p(x. = Ee(T(X) I S(X) = s) ~ (2.16.4..(A(O) . A'(1)) = 2nB (2.B) 1) = log (1 = exp(1)(2Ntn (x) + N'n(x)) .25) I" .A(Bo)) I S(X) = s} (B .20) has a unique solution.A(Bo)) (2. • I E e (2Ntn + N'n I S) = 2Nt= + N. then it converges to a limit 0·. Thus.B) 1 .(1 . we see.23) I .B). which is necessarily a local maximum J(B I Bo) I I = Eoo{(B .22) • ~ 0) . 1'I. A(1)) = 2nlog(1 + e") and N jn = L:~ t <ij(Xi). (b) lfthe sequence of iterates {Bm} so obtained is bounded and the equation A(B) o/q(s.
which is indeed the MLE when S is observed.d. then B' . 1"1)/ad 2 + (1 .4. ii2 . 112. iii.26) It may be shown directly (Problem 2. This completes the Estep.2 I Z.. i=I i=l n n n i=l s = {(Z" Y.1. In this case a set of sufficient statistics is T 1 = Z.T = (1"1. (Zn. For other cases we use the properties of the bivariate nonna] distribution (Appendix BA and Section 1. a~ + ~.T2 ]}' . /L2. b. the EM iteration is L i=tn+l n ('iI + <. I). the conditional expected values equal their observed values.4.(Y. T3 = The observed data are n l L zl.4.). 2 Mn . Y). ~ T l (801 d). . we observe only Yi . a~..(ZiY. new = T 3 (8ol d) .8) of B based on the observed data.) E. .4 Algoritnmic Issues 137 where Mn ~ Thus. ai ~ We take Bo~ = B MOM ' where BMOM is the method of moment estimates (11" 112. as (Z. T4 = n. at Example 2. converges to the unique root of ~ + N.27) hew i'tT2Ji{[T3 (8ol d) . I Zi) 1"2 + P'Y2(Z. + 1 <i< n}.): 1 <i< nt} U {Z.new ii~.l L ZiYi.new T4 (Bo1d ) .. p). where B = (111.: nl + 1 <i< n. T 5 = n. I Z.12) that if2Nl = Bm.Yn) be i.4). T 2 = Y. where (Z. To compute Eo (T I S = s).= > N 3 =)) 0 and M n > 0.1) that the Mstep produces I1l.(2N3= + N 2=)B + 2 (N1= + (I n n =0 0 in (0.} U {Y.i. For the Mstep.1).p2)a~ [1'2 + P'Y2(Z.4. and find (Problem 2. for nl + 1 < i < n2.(y. B). [T5 (8ald) 2 = T2 (8old)' iii. we oberve only Zi.Section 2.. Let (Zl'yl). Suppose that some of the Zi and some of the Yi are missing as follows: For 1 < i < nl we observe both Zi and Yi. Jl2. unew  _ 2N1m + N 2m n + 2 . and for n2 + 1 < i < n. al a2P + 1"1/L2). Y) ~ N (I"" 1"" 1 O'~. r) (Problem 2.6.T{ (2. l"l)/al]Zi with the corresponding Z on Y regression equations when conditioning on Yi (Problem 2. to conclude ar.Bold n ~ (2..4. we note that for the cases with Zi andlor Yi observed.new ~ + I'i.) E.T 2 .1) A( B) = E.l LY/.: n.Tl IlT4 (8ol d) .4.4. compute (Problem 2. I'l)/al [1'2 + P'Y2(Z.
Also note that.3 and the problems. Consider n systems with failure times X I .4. (3( 0'1.0'2) based on the first two moments. (a) Show that T 3 = N1/n + N 2 /2n is a frequency substitution estimate of e.4. based on the first moment. Finally in Section 2.. in the context of Example 2.P3 given by the HardyWeinberg proportions. X I.2.. ~' ". Find the method of moments estimates of a = (0'1. .1 1.2. are discussed and introduced in Section 2.4. Summary. in Section 2. ! I 2.2.B)? .0'2) distribution.23).0. j.1 with respective probabilities PI. Remark 2. B)/p(X.B o)]. .4. use this algorithm as a building block for the general coordinate ascent algorithm. respectively.. 2B(1 . .5 PROBLEMS AND COMPLEMENTS Problems fnr Sectinn 2.4. Now the process is repeated with B MOM replaced by ~ ~ ~ ~ 0new. X n assumed to be independent and identically distributed with exponential.0) and (I .138 Methods of Estimation Chapter 2 where T j (B) denotes T j with missing values replaced by the values computed in the Estep and T j = Tj(B o1d )' j = 1. . Suppose that Li. (b) Find the method of moments estimate of>.i' I' I i.1. . 2. where 0 < B < 1.. which yields with certainty the MLEs in kparameter canonical exponential families with E open when it exists. 0 Because the Estep. . Note that if S(X) = X. including the NewtonRaphson method. based on the second moment. based On the first two moments. 0) is minimized. &(>.. We then. the EM algorithm is often called multiple imputation. distributions. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all oneparameter canonical exponential families with E open (when it exists). (b) Using the estimate of (a).. (c) Suppose X takes the values 1.. then J(B i Bo) is log[p(X. what is a frequency substitution estimate of the odds ratio B/(l.d.).. in general..B)2. E"IJ(O I 00)] is the KullbackLeiblerdivergence (2.6.4 we derive and discuss the important EM algorithm and its basic properties. (a) Find the method of moments estimate of >. X n have a beta. Consider a population made up of three different types of individuals occurring in the HardyWeinberg proportions 02. By considering the first moment of X. (c) Combine your answers to (a) and (b) to get a method of moment estimate of >. j : > I) that one system 3. show that T 3 is a method of moment estimate of (). involves imputing missing values. Important variants of and alternatives to this algorithm. which as a function of () is maximized where the contrast logp(X.P2. (d) Find the method of moments estimate of the probability P(X1 will last at least a month.
6.. < .. ~  7. . . .t ) = Number of vectors (Zi.. 4. Let Xl.. N 2 = n(F(t2J . (d) FortI tk.. . The jth cumulant '0' of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant Cj' Give the first three sample cumulants. Let (ZI. . YI ). . See A. we know the order statistics. t) is the bivariate empirical distribution function FCs. given the order statistics we may construct F and given P.F(t. = Xi with probability lin.12. Nk+I) where N I ~ nF(t l ). 8. . . Yn ) be a set of independent and identically distributed random vectors with common distribution function F. . empirical substitution estimates coincides with frequency substitution estimates.8..5 Problems and Complements 139 Hint: See Problem 8. Give the details of this correspondence. X n .8)ln first using only the first moment and then using only the second moment of the population. Hint: Consi~r (N I . . . .. (c) Argue that in this case all frequency substitution estimates of q(8) must agree with q(X). (See Problem B. . X 1l be the indicators of n Bernoulli trials with probability of success 8.2. < ~ ~ Nk+1 = n(l  F(tk)). 5. Let X I. . (b) Show that in the continuous case X ~ rv F means that X (c) Show that the empirical substitution estimate of the jth moment JLj is the jth sample moment JLj' Hinr: Write mj ~ f== xjdF(x) ormj = Ep(Xj) where XF. .Xn be a sample from a population with distribution function F and frequency function or density p.2.. (Z2.. The empirical distribution function F is defined by F(x) = [No. Y2 ). of Xi n ~ =X.5. If q(8) can be written in the fonn q(8) ~ s(F) for sOme function s of F we define the empirical substitution principle estimate of q( 8) to be s( F). The natural estimate of F(s. (Zn. Let X(l) < .. Yi) such that Zi < sand Yi < t n . ~ Hint: Express F in terms of p and F in terms of P ~() X = No. ~ ~ ~ (a) Show that in the finite discrete case... < X(n) be the order statistics of a sample Xl... (b) Exhibit method of moments estimates for VaroX = 8(1 .) There is a Onetoone correspondence between the empirical distribution function ~ F and the order statistics in the sense that.findthejointfrequencyfunctionofF(tl)... of Xi < xl/n. Show that these estimates coincide.. (a) Show that X is a method of moments estimate of 8.)).Section 2.F(tk). t). which we define by ~ F ~( s..
Hint: See Problem B. In Example 2.17. Y are the sample means of the sample correlation coefficient is given by Z11 •. : J • . and so on. Suppose X has possible values VI. p(X.) is the distribution function of a probability P on R2 assigning mass lin to each point (Zi.". z) is continuous in {3 and that Ig({3. 11.• .140 Methods of Estimation Chapter 2 (a) Show that F(. 9. Suppose X = (X"". Show that the sample product moment of order (i. L I. .ZY. the result follows..IU9) that 1 < T < L .1. the sample covariance. X n be LLd. . z) I tends to 00 as \. k=l n  n .. I . .L.4.!'r«(J)) I . (a) Find an estimate of a 2 based on the second mOment. as the corresponding characteristics of the distribution F. In Example 2.2 with X iiI and /is.Y) = Z)(Yk . are independent N"(O. (b) Construct an estimate of a using the estimate of part (a) and the equation a . (b) Define the sample product moment of order (i. 1".. .. (J E R d.. find the method of moments estimate based on i 12.j2. (See Problem 2. ~ i .."ZkYk . . j). = #. The All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates. 10. with (J identifiable. Show that the least squares estimate exists.) Note that it follows from (A. {3) > c. suppose that g({3... as X ~ P(J.2). Zn and Y11 . the sampIe correlation.. . Vi).2. Since p(X. . Vk and that q(9) can be written as ec q«(J) = h(!'I«(J)"" . " . I) = ". . There exists a compact set K such that for f3 in the complement of K. Let X".' where Z. . >. respectively.)..Xn ) where the X. (c) Use the empirical substitution principle to COnstruct an estimate of cr using the relation E(IX. . f(a. 0).81 tends to 00. Hint: Set c = p(X.. {3) is continuous on K. .j) is given by ~ The sample covariance is given by n l n L (Zk k=l . .1.Yn . .
1'(0.Pr(O)) .. .ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill . When the data are not i. Suppose that X has possible values VI. p fixed (v) Inverse Gaussian.. (a) Use E(X. In the fOllowing cases. Consider the Gaussian AR(I) model of Example 1. 0). with (J E R d and (J identifiable..5 Problems and Com". Show that the method of moments estimate q = h(j1!. .. 1 < j < k. = h(PI(O). to give a method of moments estimate of (72..p:::'. .• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments.(X) ~ Tj(X). (b) Suppose P ~ po and I' = b are fixed.6. where U.. Use E(U[). (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1. Vk and that q(O) for some Rkvalued function h.0) (ii) Beta. > 0.1) (iii) Raleigh..:::nt:::' ~__ 141 for some R k valued function h..l Lgj(Xi ). A). 0 ~ (p.9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X)). X n are i.i.36. . in estimate.d. . ~ (X.d. Let 911 . Iii = n.p(x. can you give a method of moments estimate of {3? . j i=l = 1. . .Section 2..O) ~ (x/O')exp(x'/20').10). See Problem 1. General method of moment estimates(!)..l and (72 are fixed.5.6.x (iv) Gamma. 13. 1'(1... Hint: Use Corollary 1.. . as X rv P(J. Suppose X 1.1. IG(p. .) to give a method of moments estimate of p. Let g. find the method of moments estimates (i) Beta.1. .i.. r. fir) can be written as a frequency plugin estimate. . rep. A).0 > 0 14. (c) If J.  1/' Po) / .:::m:::.6.
respectively."" X iq )...~b.142 Methods of Estimation Chapter 2 15. we define the empirical or sample moment to be j ~ .. ' l X q ). . i = 1. I.. ..ZY ~ . For independent identically distributed Xi = (XiI. ()2.. 11P2 + "2P4 + '2P6 ._ . 5 SF 28..1. 1 I ..z. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S.3. ..  1. SF.. Hint: [y. Establish (2...•• Om) can be expressed as a function of the moments... 1 . bI) are as in Theorem 1. I • . 16. Readings Y1 .. and IF.. k> Q mjkrs L. j > 0. ....g(I3.z. Y) and (ai... )... . Let 8" 8" and 83 denote the probabilities of S.. II...)].. where (Z.4.g(l3o... For a vector X = (Xl.q.. n.. + '2P4 + 2PS . P3 + "2PS + '2P6 PI are frequency plugin estimates of OJ. .! . t n on the position of the object. The HardyWeinberg model specifies that the six genotypes have probabilities L7=1 I Genotype Genotype Probability SS 2 II 8~ 3 FF 8j 4 SI 28.J is' _ n n... SI. I ..2 1. = n 1 L:Z.. and F. I .. Y) and B = (ai. let the moments be mjkrs = E(XtX~). _ (Z)' ' a...z. and F at one locus resulting in six genotypes labeled SS..83 8t Let N j be the number of plants of genotype j in a sample of n independent plants..• l Yn are taken at times t 1.bIZ. An Object of unit mass is placed in a force field of unknown constant intensity 8..S 1 .Y.... = Y . .)] + [g(l3o. of observations. k > 0.g(I3.. b.) ... and (J3. S = 1.. I. .. where OJ = 1.)] = [Y...z...1".. 1 6 and let Pl ~ N j / n... The reading Yi .8. Let X = (Z. the method of moments l by replacing mjkrs by mjkrs.. FF. HardyWeinberg with six genotypes.. I I .. .. . Show that method of moments estimators of the parameters b1 and al in the best linear predictor are estimate ~ 8 of B is obtained L:Z. 83 6 IF 28. 1 q. Q. . Multivariate method a/moments.1 " Xir Xk J. 17..  t=1 liB = (0 1 . 1".. >. n 1  Problems for Sectinn 2.6).. . Show that <j < i ! . .
t of the best (MSPE) predictor of Yn + I ? 4.. .BJ . Show that the least squares estimate is always defined and satisfies the equations (2.I is to be taken at time Zn+l. . .. 9. the range {g(zr. B) = Be'x.13 E R d } is closed... . Find the line that minimizes the sum of the squared perpendicular distance to the same points. . . (zn. Suppose that observations YI . (J2) variables. . Find the least squares estimates of ()l and B .. Let X I. nl and Yi = 82 + ICi. in fact.4)(2... x > c. and 13 ranges over R d 8.3. Show that the fonnulae of Example 2. B > O.We suppose the oand be uncorrelated with constant variance.(zn. (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1. . Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi.. . all lie on aline. 13). . Yn have been taken at times Zl. 13). X n denote a sample from a population with one of the following densities Or frequency functions... g(zn. What is the least squares estimate based on YI . A new observation Ynl.4. " 7 5. 3. 2 10.5) provided that 9 is differentiable with respect to (3" 1 < i < d. 7..Yi).2. (Pareto density) . x > 0.Section 2. Suppose Yi ICl". (a) Let Y I .. . (a) f(x..." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi.2 may be derived from Theorem 1.)' 1+ B5 6. B > O. . to have mean 2..z.1.n. Hint: The quantity to be minimized is I:7J (y. .y. Find the least squares estimates for the model Yi = 8 1 (2.. B) = Bc'x('+JJ. i = 1. Find the LSE of 8. Yl). . (exponential density) (b) f(x. _.. ~p ~(yj}) ~ . .. < O. .2. .5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l .<) ~ .4. . Yn). B. Y. .B. . i + 1.2. where €nlln2 are independent N(O. Find the least squares estimate of 0:. . .Yn).. Zn and that the linear regression model holds. nl + n2. .. The regression line minimizes the sum of the squared vertical distances from the points (Zl. if we consider the distribution assigning mass l/n to each of the points (zJ. i = 1.6) under the restrictions BJ > 0. Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants.).. Hint: Write the lines in the fonn (z. c cOnstant> 0. Find the MLE of 8.
for each wEn there is 0 E El such that w ~ q( 0). (72 Ii (a) Show that if J1 and 0. ~ I in Example 2. > O.a)l rather than 0. . .1.P > I. then i  WMLE = arg sup sup{Lx(O) : 0 E e(w)}. then {e(w) : wEn} is a partition of e. say e(w).2 are unknown. I < k < p. 0 > O. ..5. 0 E e and let 0 denote the MLE of O. Hint: Use the factorization theorem (Theorem 1.r(c+<). Show that any T such that X(n) < T < X(1) + is a maximum likelihood estimate of O. Let X I. x> 0.O) where 0 ~ (1". (beta.0"2 (a) Find maximum likelihood estimates of J.. Show that if 0 is a MLE of 0. be independently and identically distributed with density f(x. the MLE of w is by definition ~ W. x > 0.  under onetoone transformations). Show that depends on X through T(X) only provided that 0 is unique. (a) Let X . is a sample from a N(/l' ( 2 ) distribution.5 show that no maximum likelihood estimate of e = (1". . Pi od maximum likelihood estimates of J1 and a 2 . (We write U[a.16(b). < I" < 00. ncR'. 00 = I 0" exp{ (x . n > 2. Thus. x > 1".. 0) = (x/0 2) exp{ _x 2/20 2 }.: I I \ . = X and 8 2 ~ n.1). they are equivariant 16. ry) denote the density or frequency function of X in terms of T} (Le. I • :: I Suppose that h is a onetoone function from e onto h(e). q(9) is an MLE of w = q(O). Show that the MLE of ry is h(O) (i. 0) = VBXv'O'. (Wei bull density) 11. I I I (b) LetP = {PO: 0 E e}. Define ry = h(8) and let f(x. 0) = cO"..Pe.2. bl to make pia) ~ p(b) = (b .144 Methods of Estimation Chapter 2 (c) f(". . Hint: You may use Problem 2. X n .I")/O") . iJ( VB. . Let X" . then the unique MLEs are (b) Suppose J. Because q is onto n. Hint: Let e(w) = {O E e : q(O) = w}. be a family of models for X E X C Rd. I).t and a 2 .. WEn . 0 <x < I.0"2) 15.e. I i' 13. (Rayleigh density) (f) f(x. X n be a sample from a U[0 0 + I distribution.l exp{ OX C } . reparametrize the model using 1]).t and a 2 are both known to be nonnegative but otherwise unspecified. (b) Find the maximum likelihood estimate of PelX l > tl for t > 1".) ! ! !' ! 14. c constant> 0. 0 > O.:r > 0.  e . density) (e) fix. X n • n > 2. e c Let q be a map from e onto n. 0> O.l L:~ 1 (Xi .0"2).X)" > 0.. Suppose that Xl. J1 E R. . . 0) = Ocx c . If n exists. MLEs are unaffected by reparametrization. (Pareto density) (d) fix.II ". c constant> 0. and o belongs to only one member of this partition. 12. 0 > O. Suppose that T(X) is sufficient for 0 and ~at O(X) is an MLE of O... .
16(i). Bartholomew.1 (I_B).. We want to estimate O. find the MLE of B. X 2 = the number of failures between the first and second successes. and so on. (a) Find maximum likelihood estimates of the fJ i under the assumption that these quantities vary freely. (a) The observations are indicators of Bernoulli trials with probability of success 8. and have commOn frequency function.n . (b) Solve the problem of part (a) for n = 2 when it is known that B1 < B. ...2. k ~ 1. .B) =Bk . We want to estimate B and Var..P.B) = 1. . < 00. (We denote by "r + 1" survival for at least (r + 1) periods. .. f(k. where 0 < 0 < 1. a model that is often used for the time X to failure of an item is P. (b) The observations are Xl = the number of failures before the first success. . Bremner. Thus. 1) distribution. Derive maximum likelihood estimates in the following models. Show that the maximum likelihood estimate of 0 based on Y1 . . X n ) is a sample from a population with density f(x.[X < r] ~ 1 LB k=l k  1 (1_ B) = B'.Xn be independently distributed with Xi having a N( Oi. . Yn which are independent. Censored Geometric Waiting Times. 21. 1 <i < n. if failure occurs on or before time r and otherwise just note that the item has lived at least (r + 1) periods. .1 (1 .O") : 00 < fl.  ...~1 ' Li~1 Y." Y: . Show that maximum likelihood estimates do not exist.. B) = 100' cp 9 (x I') + 0' 10 cp(x 1') 1 where cp is the standard normal density and B = (1'.B). A general solution of this and related problems may be found in the book by Barlow.r f(r + I. Y n IS _ B(Y) = L. (KieferWolfowitz) Suppose (Xl.X I = B(I. M 18.. 19. Suppose that we only record the time of failure. .6. 17. . and Brunk (1972). identically distributed..B).0") E = {(I'. we observe YI . •• .) Let M = number of indices i such that Y i = r + 1.. Let Xl. in a sequence of binomial trials with probability of success fJ.Section 2.. 0 < a 2 < oo}...[X ~ k] ~ Bk . In the "life testing" problem 1. 20.5 Problems and Complements 145 Now show that WMLE ~ W ~ q(O). but e . If time is measured in discrete periods. k=I.
x n .5 66.8 14. Suppose X has a hypergeometric.9 3.6). Suppose Y.6. Assume that Xi =I=. where [t] is the largest integer that is < t.tl. Y.< J.a 2 ) and NU". (1') where e n ii' = L(Xi .2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3.I' fl. 1 <k< p}. (1') populations..8 0.Y)' i=1 j=1 /(rn + n).2 4.2 4.Xj for i =F j and that n > 2.5 I' I I . n). " Yn be two independent samples from N(J.1 2. Weisberg. x) as a function of b. Hint: Coosider the mtio L(b + 1. respectively. 1 1 1 o o . (1') is If = (X. Ii equals one of the numbers 22.2.a 2 ) = Sllp~.. II I I :' 24. and assume that zt'·· .0 2.0"2) if. . N.X)' + L(Y.0 11. where'i satisfy (2.J.' wherej E J andJisasubsetof {UI . Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer. 1985). Show that the MLE of = (fl..0 0.Xm and YI . .5 86. Tool life data Peed 1 Speed 1 1 Life 54.. .. distribution. J2 J2 o o o I 3. .(Zi) + 'i.jp): 0 <j. 7 . and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise.4 o o o o o o o o o o o o o Life 20. and only if. .5 0.ji. 1i(b.8 3. .p(x.z!. Polynomial Regression. the following data were obtained (from S. . x)/ L(b. Let Xl.'.0 I ..8 2. i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution).t.4)(2. Set zl ~ .up(X. = fl. Xl. .0 5.146 Methods of Estimation Chapter 2 that snp...2.2 3. TABLE 2.1. 23.
4)(2. let v(z. (12 unknown.(P) coincide with PI and 13. (2.0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models.19).2.2. this has to be balanced against greater variability in the estimated coefficients. of Example 2. (2. (b) Let P be the empirical probability . 25.4)(2. = (cutting speed . the second model provides a better approximation. ZI ~ (feed rate . Two models are contemplated (a) Y ~ 130 + pIZI + p. This will be discussed in Volume II.6)). ZD = Wz ZD and € = WZ€. (a) The parameterization f3 (b) ZD is of rank d.y)!(z. .1. z) ing are equivalent.  27. 26.y)/c where c = I Iv(z..2. y) > 0 be a weight funciton such that E(v(Z.2. Derive the weighted least squares nonnal equations (2. weighted least squares estimates are plugin estimates..(P)Z of Y is defined as the minimizer of t = zT {3.6) with g({3. ~.(b l + b.(P) = Cov(Z'. Y)Y') are finite. Both of these models are approximations to the true mechanism generating the data.900)/300. z. Show that (3.6) with g({3.Y = Z D{3+€ satisfy the linear regression mooel (2. 28.. The best linear weighted mean squared prediction error predictor PI (P) + p.y)!(z. (c) Zj. . Let (Z. Show that the follow ZDf3 is identifiable.Z)]'). Being larger. Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix.Section 2.(P)E(Z·). .. in Problem 2. y). Use these estimated coefficients to compute the values of the contrast function (2. (a) Show tha!.13)/6. .2.2. + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +.6.2.1.ZD is of rank d.2.1 (see (B.1).5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life.y) defined . l/Var(Y I Z = z). . Consider the model (2. Set Y = Wly.Y') have density v(z. Y)[Y . However.8 and let v(z. z) ~ Z D{3.z.y)dzdy. E{v(Z. Show that p. Let W. Y)Z') and E(v(Z. Consider the model Y = Z Df3 + € where € has covariance matrix (12 W. (a) Let (Z'. Y')/Var Z' and pI(P) = E(Y*) 11.3.(P) and p. That is..1). Y) have joint probability P with joint density ! (z.l be a square root matrix of W.5) for (a) and (b).
j}.453 1. Chon) shonld be read row by row. Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore.ZD... i = 1.156 5. Show that the MLE of . .676 5.. with mean zero and variance a 2 • The ei are called moving average errors. 1'. ..229 4.) ~ ~ (I' + 1'. = L. \ (e) Find the weighted least squares estimate of p.274 5. . (See Problem 2. .046 2.564 1.6)T(y .971 0.669 7.. in this model the optimal " MSPE predictor of the future Yi+ 1 given the past YI .511 3.131 5.) (c) Find a matrix A such that enxl = Anx(n+l)ECn+l)xl' (d) Find the covariance matrix W of e. n. 0) = II j=q+l k 0.ZD. = (Y  29. . = nq = 0.196 2..908 1..397 4. . . which vanishes if 8 j = 0 for any j = q + 1. Consider the model Yi = 11 + ei.jf3j for given covariate values {z. nq+1 > p(x..300 6.1. n.666 3.100 3.039 9. I Yi is (/1 + Vi).148 ~ Methods of Estimation Chapter 2 (b) Show that if Z  D has rank d..301 2.020 8. k.6) T 1 (Y . <J > 0.064 5. J. suppose some of the nj are zero...1.2.182 0.(Y .860 5. where 1'.17.476 3.a.856 2.019 6.480 5.788 2. Suppose YI .€n+l are i.360 1.870 30. Is ji.858 3.~=l Z.689 4. Use a weighted least squares c~mputer routine to compute the weighted least squares estimate Ii of /1.391 0.958 10. ' I (f) The following data give the elapsed times YI . Let ei = (€i + {HI )/2...ZD..455 9. 4 (b) Show that Y is a multivariate method of moments estimate of p.8. ~ ~ ~Show that the MLE of OJ is 0 with OJ = nj In. 2.053 4.d.. . _'.392 4.nk > 0. In the multinomial Example 2.6) is given by (2.6) W.916 6.968 9. The data (courtesy S. .054 1.+l I Y .i.5.097 1..921 2. Hint: Suppose without loss of generality that ni 0. (a) Show that E(1'.038 3. .20)..114 1.393 3. Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay.. i = 1.457 2.379 2. j = 1.091 3.599 0. . k.703 4. different from Y? I I TABLE 2.918 2. Then = n2 = .716 7.071 0.'.058 3.).155 4.834 3.611 4.. /1i + a}. . .249 1. 31.968 2.093 5. where El" . Yn are independent with Yi unifonnly distributed on [/1i .582 2. .093 1.723 1. then the  f3 that minimizes ZD..081 2. .2. That is.075 4.665 2.
.28Id(F.1. . ..l1.../Jr ~ ~ and iii.J. Show that XHL is a plugin estimate of BHL.7 with Y having the empirical distribution F. let Xl and X.. . Let x. The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj). the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l). Yn ordered from smallest to largest..8).4. " . . where Iii = ~ L:j:=1 ZtjPj· 32. . as (Z. Hint: Use Problem 1. Find the MLE of A and give its mean and variance. 8 E R.32(b). x E R.d.i... F)(x) *F denotes convolution. Y( n) denotes YI .d.d. 34.. .) Suppose 1'.(3p that minimizes the least absolute deviation contrast function L~ I IYi . . ~ 33.tLil and then setting a = max t IYi . then the MLE exists and is unique. An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass .5 Problems and Complements 149 (f3I' . i < j. ~ f3I. (a) Show Ihat if Ill[ < 1. with density g(x . be i..Jitl. = L Ix.{3p. . + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x . These least absolute deviation estimates (LADEs).Section 2.. where Jii = L:J=l Zij{3j. = I' for each i. (a) Show that the MLE of ((31.fin are called (b) If n is odd. Let g(x) ~ 1/".x.3...17). Hint: See Example 1. Yn are independent with Vi having the Laplace density 1 2"exp {[Y.6. be the Cauchy density. . (3p. XHL i& at each point x'..I'. Let X.(1 + x'j. . If n is even. .Ili I and then setting (j = n~I L~ I IYi .. Show that the sample median ii is the minimizer of L~ 1 IYi .i. Let B = arg max Lx (8) be "the" MLE. A > 0. and x...>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}. Give the MLE when .2. 1). . . Z and W are independent N(O. . .O) Hint: See Problem 2.  Illi < 1. .. a) is obtained by finding (31. (See (2. a)T is obtained by finding 131. .. be i..{:i. y)T where Y = Z + v"XW.). Suppose YI .& at each Xi. 35. be the observations and set II = ~ (Xl ..I/a}.:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x.. the sample median yis defined as ~ [Y(.) + Y(r+l)] where r = ~n./3p that minimizes the maximum absolute value conrrast function maxi IYi .
Xn are i..B I ) (K'(ryo.. Define ry = h(B) and let p' (x. . . If we write h = logg. Problem 35 can be generalized as follows (Dhannadhikari and JoagDev. B) = g(xB). Let Xi denote the number of hits at a certain Web site on day i. and positive everywhere. On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). 38. . 1 n + m. where Al +. i = 1. Let 9 be a probability density on R satisfying the following three conditions: I. (7') and let p(x. ~ Let B ~ arg max Lx (B) be "the" MLE. . then the MLE is not unique.. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . . (a. ~ (XI + x.. Ii I. X.. Assume :1. a < b. Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x .t> . Let Xl and X2 be the observed values of Xl and X z and write j. 1985). Let Vj and W j denote the number of hits of type 1 and 2 on day j. B) is and that the KullbackLiebler divergence between p(x.+~l Vi and 8 2 = E. Let (XI. ryl Show that o n ».B )'/ (7'.+~l Wj have P(rnAl) and P(mA2) distributions.x2)/2. then B is not unique. 36. B) denote their joint density.ryt» denote the KullbackLeibler divergence between p( x. (b) Either () = x or () is not unique. . Show that (a) The likelihood is symmetric about ~ x..)/2 and t> = (XI . P(nA)..150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. B ) and p( X. 51. h I (ry») denote the density or frequency function of X for the ry parametrization. Let K(Bo.) be a random samplefrom the distribution with density f(x. 37. 'I i 39. Bo) is tn( B. where x E Rand (} E R. > h(y) . Also assume that S... Suppose h is a II function from 8 Onto = h(8).h(y) .h(y . B) and pix.0). .. distribution. and 8 2 are independent. n. 9 is continuous. 2. then h"(y) > 0 for some nonzero y. Let X ~ P" B E 8. N(B.j'.B)g(x. .B). Assume that 5 = L:~ I Xi has a Poisson. ~ (d) Use (c) to show that if t> E (a. o Ii !n 'I . BI ) (p' (x. .b). ryo) and p' (x.. .B) g(x + t> . Suppose X I. b). and 5. ry) = p(x. = I ! I!: Ii. j = n + I.i.B)g(i . . Show that the entropy of p(x.d. 3. symmetric about 0.f)) in the likelihood equation. Find the MLEs of Al and A2 hased on 5. such that for every y E (a. The likelihood function is given by Lx(B) g(xI . Sl. 9 is twice continuously differentiable everywhere except perhaps at O. that 8 1 E.\2 = A. .
. Aj) + Ei. 41. .I[X.. n > 4. An example of a neural net model is Vi = L j=l p h(Zij.0). n where A = (a.. /3. j = 1. i = 1.7) for a..1'. .j ~ x <0 1. 1). 42.) Show that statistics.. o + 0.)/o]}" (b) Suppose we have observations (t I .. h(z. X n be a sample from the generalized Laplace distribution with density 1 O + 0. give the lea. yd 1 ••• .I[X. Give the least squares where El" •. Variation in the population is modeled on the log scale by using the model logY. .~e p = 1. j 1 1 cxp{ x/Oj}. a > 0. and g(t. a.O) ~ a {l+exp[[3(tJ1. n. [3. . Let Xl. .Section 2. Seber and Wild.. Suppose Xl. where 0 (a) Show that a solution to this equation is of the form y (a. (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism. (tn. Yn). and /1. 1 En are uncorrelated with mean 0 and variance estimating equations (2.. . = L X.. on a population of a large number of organisms. 1'.1. where OJ > O. = log a . > OJ and T. x > 0.1. 1959. . exp{x/O. For the ca.7) for estimating a. .1. [3. 1 X n satisfy the autoregressive model of Example 1.. . T. I' E R. (. [3. [3 > 0. 6.. 0). < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case. = LX. a get. and fj. and 1'. . + C. The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards. 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0.c n are uncorrelated with mean zero and variance (72. 1'). A) = g(z.  J1.)/o)} (72. .p.5. . 0 > O.2.}. [3. . i = 1.olog{1 + exp[[3(t.5 Problems and Complements 151 40.~t square estimating equations (2.
(b) Show that the likelihood equations are equivalent to (2. Xn· Ip (x. Hint: I . r(A. < I} and let 03 = 1. < Show that the MLE of a. .i. L x..3.0 . Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p... C2' y' 1 1 for (a) Show that the density of X = (Xl. .3. • II ~. . This set K will have a point where the max is attained. (One way to do this is to find a matrix A such that enxl = Anxn€nx 1. > 0.)I(c. . Yn) is not a sequence of 1's followed by all O's or the reverse. .'" £n)T of autoregression errors.y... In a sample of n independent plants. a. Let = {(01. Under what conditions on (Xl 1 . write Xi = j if the ith plant has genotype j.x.0. Consider the HardyWeinberg model with the six genotypes given in Problem 2.. . _ C2' 2. = If C2 > 0. . fi) = a + fix.fi) = 1log p pry.5). (X.3. Yn are independent PlY. S. 1 Xl <i< n. I . There exists a compact set K c e such that 1(8) < c for all () not in K.. Suppose Y I ..(8 . . + C1 > 0).15. Give details of the proof or Corollary 2.d. Is this also the MLE of /1? Problems for Section 2.. .152 Methods of Estimation Chapter 2 (aJ If /' is known.Xi)Y' < L(C1 + c. show that the MLE of fi is jj =  2.) Then find the weighted least square estimate of f. . = L(CI + C.. t·. .. Hint: Let C = 1(0)..::.4) and (2. fi exists iff (Yl . .3 1.). + 0. n > 2. = OJ. < .02): 01 > 0. gamma... 3. .l. n i=l n i=l n i=1 n i=l Cl LYi + C. Prove Lenama 2. the bound is sharp and is attained only if Yi x.l  L: ft)(Xi /.l. = 1] = p(x"a. '12 = A and where r denotes the gamma function.J..1.. find the covariance matrix W of the vector € = (€1.3.p). 0 for Xi < _£I. " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4.) 1 II(Z'i /1)2 (b) If j3 is known.x..1 > _£l. I < j < 6.. + 8. X n be i. Let Xl.
..z=7 11. ~fo (X u I.lkIl :Ij >0..) ~ Ail = exp{a + ilz.10 with that the MLE exists and is unique. show 7.1). 12. and assume forw wll > 0 so that w is strictly convex.1 to show that in the multinomial Example 2.en.t). < Zn.1} 0 OJ = (b) Give an algorithm such that starting at iP = 0. . Prove Theorem 2.3. Let Xl.3. • It) w' ( Xi It) . .1 <j < ac k1.6. Suppose Y has an exponential.. . 10. Yn denote the duration times of n independent visits to a Web site. MLEs of ryj exist iffallTj > 0. < Show that the MLE of (a. if n > 2.3.' .Xn E RP be i. Show that the boundary of a convex C set in Rk has volume 0.fl. Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 . J..0). < Zn Zn.'~ve a unique solution (. the MLE 8 exists and is unique. w( ±oo) = 00..Section 2. Let Y I .. . " (a) Show that if Cl: > ~ 1. Zj < . the MLE 8 exists but is not unique if n is even..n. n > 2.1 (a) ~ ~ c(a) exp{ Ix ...5 Problems and Complements 153 n 6. with density. (O. a(i) + a..d. u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm. But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations. 1 <j< k1. . > 3... Hint: If BC has positive volume. .n) are the vertices of the :Ij <n}. ji(i) + ji. .. [(Ai). fe(x) wherec.l E R.ill T exists and is unique.A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00.0. the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i . Use Corollary 2. Then {'Ij} has a subsequence that converges to a point '10 E f.O. C! > 0.0). then it must contain a sphere and the center of the sphere is an interior point by (B. .. Let XI. In the heterogenous regression Example 1. 0:0 = 1. .Ld. See also Problem 1. X n be Ll. e E RP. 0 < Zl < . . ...}. . and Zi is the income of the person whose duration time is ti. 9. (0. Hint: The kpoints (0. (b) Show that if a = 1.40. .0 < ZI < . (3) Show that.. _. convex set {(Ij. .9. distribution where Iti = E(Y. 8.6.3. ...3.8) .
2).J1. See. 8aob :1 13. (a) Show that the MLEs of a'f. EM for bivariate data. and p = [~(Xi .. it is unique.) Hint: (a) Thefunction D( a.J1. (See Example 2. b = : and consider varying a. for example. (Check that you are describing the GaussSeidel iterative method for solving a system of linear equations.. E" . Chapter 10. Y. /12. = (lin) L:. Golnb and Van Loan. b) ~ I w( aXi . Apply Corollary 2.4. > 0..2 are assumed to be known are = (lin) L:.J1. and p when J1. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model i: y ~ ZDJ3 + <.1.bo) D(a.8. . W is strictly convex and give the likelihood equations for f.. ~J ' . Let (X10 Yd.. b successively. rank(ZD) = k.3. b) and lim(a. complete the Estep by finding E(Zl I l'i) and E(ZiYi I Yd· (b) In Example 2.i.l and CT.2)/M'''2] iI I'. L:.b) . 7if)F 8 08 D 802 vb2 2 2 > (a'D)2 ' then D'IS strictIy convex.. . O'?. "i . I (Xi . Hint: (b) Because (XI. . . show that the estimates of J11.1)". J12. (Xn . Yn ) be a sample from a N(/11.1 and J1.3. O'~ + J1~. (i) If a strictly convex function has a minimum.' ! (a) In the bivariate nonnal Example 2.. En i. ar 1 a~. . (b) Reparametrize by a = .d. . 2.)(Yi .2. P coincide with the method of moments estimates of Problem 2. O'~. (I't') IfEPD 8a2 a2 D > 0 an d > 0. b) = x if either ao = 0 or 00 or bo = ±oo.. respectively. Show that if T is minimal and t: is open and the MLE doesn't exist.6.:.. provided that n > 3.) I .4 1.J1.'2). N(O.(Yi . il' i. p) population.n log a is strictly convex in (a. {t2.4. . . EeT = (J11. (b) If n > 5 and J11 and /12 are unknown. pal0'2 + J11J. verify the Mstep by showing that E(Zi I Yi).b)_(ao. then the coordinate ascent algorithm doesn't converge to a member of t:..6.9).154 Methods of Estimation Chapter 2 (c) Show that for the logistic distribution Fo(x) [1 + exp{ X}]I. 3. O'~ + J1i.4.2)2.:. 0'1 Problems for Section 2. ag. Note: You may use without proof (see Appendix B.) has a density you may assume that > 0. IPl < 1. 1985.
Because it is known that X > 1.B)n[n . it is desired to estimate the proportion 8 that has the genetic trait.. given X>l. when they exist. 6. 1 < i < n.Ii).n. Y'i). and that the (b) Use (2. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it. thus. j = 0.[lt ~ OJ. B= (A. (a) Justify the following crude estimates of Jt and A. Let (Ii. 1 and (a) Show that X .4. 8new . Suppose that in a family of n members in which one has the disease (and.Section 2.1) x Rwhere P. .{(Ii._x(1. (c) Give explicitly the maximum likelihood estimates of Jt and 5. Hint: Use Bayes rule. Suppose the Ii in Problem 4 are not observed. is distributed according to an exponential ryl < n} = i£(1 I) uzuy. B) variable. as the first approximation to the maximum likelihood .5 Problems and Complements 155 4.2B)x + nB'] _.B)n} _ nB'(I. the model often used for X is that it has the conditional distribution of a B( n..[1 . Y (~ A 2:7 I(Yi  Y)'  O"~)/ (0". Yd : 1 < i family with T afi I a? known. i ~ 1'.. be independent and identically distributed according to P6.? (b) Give as explicitly as possible the E. For families in which one member has the disease.(1 . 2:Ji) .and Msteps of the EM algorithm for this problem. 2 (uJ L }iIi + k 2: Y.B)[I. Do you see any problems with >..x = 1. ~2 = log C\) + (b) Deduce that T is minimal sufficient. 1') E (0. B E [0.(I. X is the number of members who have the trait. Y1 '" N(fLl a}). also the trait). IX > 1) = \ X ) 1 (1 6)" . I n ) 8"(10)"" (a) Show that P(X = x MLE exists and is unique. and given II = j.P. I' >.(1 _ B)n]{x nB ..[lt ~ 1] = A ~ 1 . ..I ~ + (1  B)n] . 1]. where 8 = 80l d and 8 1 estimate of 8.3) to show that the NewtonRaphson algorithm gives ~ _ 81 ~ 8 ~ _ ~ _ B(I.B)"]'[(1 .O"~).
+ Vbc where 00 < /l.b. 8. Let Xl. * maximizes = iJ(A') t.4. c]' Show that this holds iff PIU ~ a. Show that the sequence defined by this algorithm converges to the MLE if it exists.Na+c.2 noting that the sequence of iterates {rymJ is bounded and. .v ~ b I W = c] = P[U = a I W = c]P[V = b I W ~ i. Define TjD as before. the sequence (11ml ijm+l) has a convergent subse quence.riJ(A) . = 1.X 2 . n N++ c ++c  . v vary freely is an exponential family of rank (C . b1 C.9)')1 Suppose X. Let Xl.. X 3 = a. hence. iff U and V are independent given W. X n be i. N +bc given by < N ++c for all a. (a) Deduce that depending on where bisection is started the sequence of iterates may converge to one or the other of the local maxima (b) Make a similar study of the NewtonRaphson method in this case. (h) Show that the family of distributions obtained by letting 1'. N . v < 00. N++ c N a +c N+ bc Pabc = . . Hint: Apply the argument of the proof of Theorem 2. (a) Suppose for aU a. 1 < b < B. W = c] 1 < a < A.1 (1 + (x e. Consider the following algorithm under the conditions of Theorem 2. W). V. ~ 0. .4.i.d. V = b.. PIU = a. X Methods of Estimation Chapter 2 = 2.156 (c) [f n = 5. Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between p and a.2. where X = (U.c)} and "+" indicates summation over the ind~x. (c) Show that the MLEs exist iff 0 < N a+c . X. 9. find (}l of (b) above using (} =   x/n as a preliminary estimate.1 generated by N++c.e.A(iJ(A)). (1) 10gPabc = /lac = Pabe.X3 be independent observations from the Cauchy distribution about f(x. 7. 1 <c < C and La. .c Pabc = l.1) + C(A + B2) = C(A + B1) .9) = ".N+bc where N abc = #{i : Xi = (a. . b1 c and then are . Let and iJnew where ).b.
Show that the algorithm converges to the MLE if it exists and di· verges otheIWise.(x) :1:::: 1(S(x) ~ = s). (b) Give explicitly the E.I) + (B .c' obtained by fixing the "b.8).I)(C .a>!b". 12. N±bc .a) 1 2 . Let f. (b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model.(A + B + C). Justify formula (2.4.4.5 has the specified mixtnre of Gaussian distribution. Nl+co (c) The model implies Pabc = P+bcPa±c/P++c and use the likelihood equations. Hint: Note that because {p~~~} belongs to the model so do all subsequent iterates and that ~~~ is the MLE for the exponential family Pabc = ellauplO) _=.[X = x I SeX) = s] = 13. (a) Show that this is an exponential family of rank A + B + C .=_ L ellalblp~~~'CI a'. c" parameters..1) + (A . (a) Show that S in Example 2.5 Problems and Complements 157 Hint: (b) Consider N a+ c .and Msteps of the EM algorithm in this case. v. = fo(x  9) where fo(x) 3'P(x) + 3'P(x . 10. 11. but now (2) logPabc = /lac + Vbc + 'Yab where J1. Initialize: pd: O) = N a ++ N_jo/>+ N++ c abc nnn Pabc d2) dl) Nab+ Pabc dO) n n Pab+ d1) dl) dO) Pabc d3) N a + c Pabc Pa+c N+bc Pabc d 2) Pabc n P+bc d2)· Reinitialize with ~t~.1)(B .c. Suppose X is as in Problem 9.Section 2.:.I) ~ AB + AC + BC .N++c/A.N++ c / B.I)(C . f vary freely. c" and "a. Hint: P..b'.3 + (A .
Hint: Use the canonical nature of the family and openness of E.6. necessarily 0* is the global maximizer.4. {32).d.5..3 for the actual MLE in that example. a 2 . Hint: Show that {( 8m . Zl. Limitations a/the EM Algorithm.5. Suppose that for 1 < i < m we observe both Zi and Yi and for m + 1 < i < n. For instance. If Vi represents the seriousness of a disease. In Example 2. {31 1 a~.. Bm+d} has a subsequence converging to (0" JJ*) and." . Show for 11 = 1 that bisection may lead to a local maximum of the likelihood. .4. Establish part (b) of Theorem 2. These considerations lead essentially to maximum likelihood estimates. • . suppose all subjects with Yi > 2 drop out of the study.. where E) . but not on Yi.~ go. .. That is. ~ 16. For X . For example. . (J*) and necessarily g. given X .and Msteps of the EM algorithm for estimating (ftl. Then using the Estep to impute values for the missing Y's would greatly unclerpredict the actual V's because all the V's in the imputation would have Y < 2. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component X j of the data vector X is missing given the rest of the data vector is not a function of X j . NOTES .1 (1) "Natural" now was not so natural in the eighteenth century when the least squares principle was introduced by Legendre and Gauss. Bm + I)} has a subsequence converging to (B* .i. if a is sufficiently large. 17."" En. in Example 2. .158 Methods of Estimation Chapter 2 and r.4. Verify the fonnula given in Example 2. ! . this assumption may not be satisfied.4. En a an 2. suppose Yi is missing iff Vi < 2.. Complete the E.p is the N(O} '1) density. 2 ). A. This condition is called missing at random. If /12 = 1. find the probability that E(Y.. Zn are ii. the process determining whether X j is missing is independent of X j .d. Noles for Section 2.6 . 15.) underpredicts Y. N(ftl. . 18. the probability that Vi is missing may depend on Zi. the "missingness" of Vi is independent of Yi. .2.3. I Z. consider the model I I: ! I I = J31 + J32 Z i + Ei I are i. thus. I I' .{Xj }. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986). (2) The frequency plugin estimates are sometimes called Fisher consistent. given Zi. 14. That is. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected. ~ .. Establish the last claim in part (2) of the proof of Theorem 2. . N(O.6. R.n}. al = a2 = 1 and p = 0. = {(Zil Yi) Yi :i = 1.4. EM and Regression. and independent of El. Hint: Show that {(19 m. we observe only}li. Ii .
Notes for Section 2.2
(1) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964).
(2) For further properties of Kullback-Leibler divergence, see Cover and Thomas (1991).
Note for Section 2.3
(1) Recall that in an exponential family, for any A, P[T(X) ∈ A] = 0 for all or for no P ∈ P.
Note for Section 2.5
(1) In the econometrics literature (e.g., Campbell, Lo, and MacKinlay, 1997), a multivariate version of minimum contrast estimates are often called generalized method of moment estimates.

2.7 REFERENCES

BARLOW, R. E., D. J. BARTHOLOMEW, J. M. BREMNER, AND H. D. BRUNK, Statistical Inference Under Order Restrictions New York: Wiley, 1972.
BAUM, L. E., T. PETRIE, G. SOULES, AND N. WEISS, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Ann. Math. Statist., 41, 164-171 (1970).
BISHOP, Y. M. M., S. E. FEINBERG, AND P. W. HOLLAND, Discrete Multivariate Analysis: Theory and Practice Cambridge, MA: MIT Press, 1975.
CAMPBELL, J. Y., A. W. LO, AND A. C. MACKINLAY, The Econometrics of Financial Markets Princeton, NJ: Princeton University Press, 1997.
COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.
DAHLQUIST, G., A. BJORK, AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.
DEMPSTER, A. P., N. M. LAIRD, AND D. B. RUBIN, "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm," J. Roy. Statist. Soc. B, 39, 1-38 (1977).
DHARMADHIKARI, S., AND K. JOAG-DEV, "Examples of Nonunique Maximum Likelihood Estimators," The American Statistician, 39, 199-200 (1985).
EISENHART, C., "The Meaning of 'Least' in Least Squares," Journal Wash. Acad. Sciences, 54, 24-33 (1964).
FAN, J., AND I. GIJBELS, Local Polynomial Modelling and Its Applications London: Chapman and Hall, 1996.
FISHER, R. A., "On the Mathematical Foundations of Theoretical Statistics," reprinted in Contributions to Mathematical Statistics (by R. A. Fisher, 1950) New York: J. Wiley and Sons, 1922.
GOLUB, G. H., AND C. F. VAN LOAN, Matrix Computations Baltimore: Johns Hopkins University Press, 1985.
HABERMAN, S. J., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
KOLMOGOROV, A. N., "On the Shannon Theory of Information Transmission in the Case of Continuous Signals," IRE Trans. Inform. Theory, 102-108 (1956).
LITTLE, R. J. A., AND D. B. RUBIN, Statistical Analysis with Missing Data New York: J. Wiley, 1987.
MACLACHLAN, G. J., AND T. KRISHNAN, The EM Algorithm and Extensions New York: Wiley, 1997.
MOSTELLER, F., "Association and Estimation in Contingency Tables," J. Amer. Statist. Assoc., 63, 1-28 (1968).
RICHARDS, F. J., "A Flexible Growth Function for Empirical Use," J. Exp. Botany, 10, 290-300 (1959).
RUPPERT, D., AND M. P. WAND, "Multivariate Locally Weighted Least Squares Regression," Ann. Statist., 22, 1346-1370 (1994).
SEBER, G. A. F., AND C. J. WILD, Nonlinear Regression New York: Wiley, 1989.
SHANNON, C. E., "A Mathematical Theory of Communication," Bell System Tech. Journal, 27, 379-423, 623-656 (1948).
SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 6th ed. Ames, IA: Iowa State University Press, 1967.
STIGLER, S., The History of Statistics Cambridge, MA: Harvard University Press, 1986.
WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
WU, C. F. J., "On the Convergence Properties of the EM Algorithm," Ann. Statist., 11, 95-103 (1983).
we study. 3. and 62 on the basis of the risks alone is not well defined unless R(O. R(·.a). o r(1T.3 we show how the important Bayes and minimax criteria can in principle be implemented. ) < R(B. then for data X '" Po and any decision procedure J randomized or not we can define its risk function.2 and 3. by introducing a Bayes prior density (say) 1r for comparison becomes unambiguous by considering the scalar Bayes risk. in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case. 6) as measuring a priori the performance of 6 for this model.6) = EI(6. We also discuss other desiderata that strongly compete with decision theoretic optimality. in the context of estimation.6(X)).6 .6) =ER(O.3.4.2.3 that if we specify a parametric model P = {Po: E 8}. OUf examples are primarily estimation of a real parameter. (3.2 BAYES PROCEDURES Recall from Section 1. We think of R(. In Section 3. actual implementation is limited. In Sections 3. in particular computational simplicity and robustness. Strict comparison of 6.6(X). However. However. after similarly discussing testing and confidence bounds.6) = E o{(O. loss function l(O. the relation of the two major decision theoretic principles to the 000decision theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness. 6) : e . We return to these themes in Chapter 6.+ R+ by ° R(O. action space A.1) 161 . which is how to appraise and select among decision procedures. 6 2 ) for all 0 or vice versa.Chapter 3 MEASURES OF PERFORMANCE. AND OPTIMAL PROCEDURES 3.1 INTRODUCTION Here we develop the theme of Section 1. NOTIONS OF OPTIMALITY.
oj = ".2.162 Measures of Performance Chapter 3 where (0. 00 for all 6 or the Using our results on MSPE prediction. After all.3 we showed how in an example.(O)dfI p(x I O).(O)dO (3. 1(0.4.5). This is just the prohlem of finding the best mean squared prediction error (MSPE) predictor of q(O) given X (see Remark 1.(O)dfI (3.4) !. 6) ~ E(q(O) . 0) and "I (0) is equivalent to considering 1 (0. as usual. Here is an example. We first consider the problem of estimating q(O) with quadratic loss.2. (0.2. B denotes mean height of people in meters.. e "(). Clearly.2.5. We don't pursue them further except in Problem 3. x _ !oo= q(O)p(x I 0). such that (3. 6) Bayes rule 6* is given by = a? 6'(X) = E[q(O) I XJ. 0) = (q(O) using a nonrandomized decision rule 6.) ~ inf{r(". if computable will c plays behave reasonably even if our knowledge is only roughly right. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). Our problem is to find the function 6 of X that minimizes r(".2.5) This procedure is called the Bayes estimate for squared error loss. for instance.6): 6 E VJ (3. Suppose () is a random variable (or vector) with (prior) frequency function or density 11"(0). In view of formulae (1.6).3) In this section we shall show systematically how to construct Bayes rules. Thus. j I 1': II . (3.8) for the posterior density and frequency functions.. if 1[' is a density and e c R r(".2.. X) is given the joint distribution specified by (1.3).2. We may have vague prior notions such as "IBI > 5 is physically implausible" if. For testing problems the hypothesis is often treated as more important than the alternative. it is plausible that 6.. In the continuous case with real valued and prior density 1T.2) the Bayes risk of the problem.6(X)j2. Recall also that we can define R(. and that in Section 1.. and 1r may express that we care more about the values of the risk in some rather than other regions of e. (0. 'f " ..2. If 1r is then thought of as a weight function roughly reflecting our knowledge. we could identify the Bayes rules 0. considering I. we can give the Bayes estimate a more explicit form. and instead tum to construction of Bayes procedure. This exercise is interesting and important even if we do not view 1r as reflecting an implicitly believed in prior distribution on ().r= . we just need to replace the integrals by sums.(0)1 . It is in fact clear that prior and loss function cannot be separated out clearly either.(0) a special role ("equal weight") though (Problem 3.6) In the discrete case. 1 1  . 0) 2 and 1T2(0) = 1. .4) the parametrization plays a crucial role here.2. we find that either r(1I".6) = J R(0.
E(e I X))' I X)] E[(J'/(l+~)]=nja 2 + 1/72 ' 1 2 n n/ No finite choice of '1]0 and 7 2 will lead to X as a Bayes estimate. Intuitively.7) reveals the Bayes estimate in the proper case to be a weighted average MID + (1 .12.1.7) Its Bayes risk (the MSPE of the predictor) is just r(7r. 1]0 fixed). we see that the key idea is to consider what we should do given X = x.2. To begin with we consider only nonrandomized rules.1). '1]0. the Bayes estimate corresponding to the prior density N(1]O' 7 2 ) differs little from X for n large. J') ~ 0 as n + 00.1.6. we obtain the posterior distribution The Bayes estimate is just the mean of the posterior distribution J'(X) ~ 1]0 l/T' ] _ [ n/(J2 ] [n/(J' + l/T' + X n/(J' + l/T' (3. This quantity r(a I x) is what we expect to lose.r( 7r. that is. a 2 / n. . Thus. This action need not exist nor be unique if it does exist. Suppose that we want to estimate the mean B of a nonnal distribution with known variance a 2 on the basis of a sample Xl. Applying the same idea in the general Bayes decision problem.2. 0 We now tum to the problem of finding Bayes rules for gelleral action spaces A and loss functions l. X is approximately a Bayes estimate for anyone of these prior distributions in the sense that Ir( 7r.Section 3. If we look at the proof of Theorem 1. But X is the limit of such estimates as prior knowledge becomes "vague" (7 .2. take that action a = J*(x) that makes r(a I x) as small as possible. Such priors with f 7r(e) ~ 00 or 2: 7r(0) ~ 00 are called impropa The resulting Bayes procedures are also called improper. J' )]/r( 7r. . and X with weights inversely proportional to the Bayes risks of these two estimates. r) . we fonn the posterior risk r(a I x) = E(l(e.E(e I X))' = E[E«e . J') E(e .2. Fonnula (3. X is the estimate that (3.00 with.) minimizes the conditional MSPE E«(Y .2. see Section 5. if we substitute the prior "density" 7r( 0) = 1 (Prohlem 3. /2) as in Example 1. for each x. a) I X = x). For more on this.a)' I X ~ x) as a function of the action a.2 Bayes Procedures 163 Example 3.6) yields. .5. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior. " Xn. If we choose the conjugate prior N( '1]0. In fact. E(Y I X) is the best predictor because E(Y I X ~ . we should. Because the Bayes risk of X. However.w)X of the estimate to be used when there are no observations. if X = :x and we use action a.4. In fact. tends to a as n _ 00.
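To make the weighted-average form of the posterior mean in Example 3.2.1 concrete, here is a minimal computational sketch (not part of the original text; the function name and the illustrative values of η₀, τ², σ² and the simulated data are my own) that evaluates the Bayes estimate and its Bayes risk for the normal model with a normal prior.

```python
import numpy as np

def normal_normal_bayes(x, sigma2, eta0, tau2):
    """Posterior mean and variance of theta when X_1,...,X_n ~ N(theta, sigma2)
    and the prior is N(eta0, tau2); the posterior mean is the Bayes estimate
    under squared error loss, as in (3.2.7)."""
    n = len(x)
    w_prior = (1.0 / tau2) / (n / sigma2 + 1.0 / tau2)   # weight on the prior mean
    w_data = (n / sigma2) / (n / sigma2 + 1.0 / tau2)    # weight on the sample mean
    post_mean = w_prior * eta0 + w_data * np.mean(x)
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)           # Bayes risk of the Bayes rule
    return post_mean, post_var

rng = np.random.default_rng(0)
theta_true, sigma2, eta0, tau2, n = 1.5, 4.0, 0.0, 1.0, 25
x = rng.normal(theta_true, np.sqrt(sigma2), size=n)
delta_star, risk = normal_normal_bayes(x, sigma2, eta0, tau2)
print(delta_star, np.mean(x), risk)   # the Bayes estimate shrinks X-bar toward eta0
```

As τ² grows (vague prior knowledge), the weight on the sample mean tends to one and the estimate approaches X̄.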
4.74.5) with priOf ?f(0. the posterior risks of the actions aI. E[I(O.·. Then the posterior distribution of o is by (1. we obtain for any 0 r(?f. and aa are r(at 10) r(a.o(X)) I X] > E[I(O. 6* = 05 as we found previously.. r(a.2. if 0" is the Bayes rule.2.67 5.8. a q }. Suppose that there exists a/unction 8*(x) such that r(o'(x) [x) = inf{r(a I x) : a E A}.8) Thus.! " j.) = 0. Therefore. and let the loss incurred when (}i is true and action aj is taken be given by j e e . [I) = 3. Therefore. r(al 11) = 8.. by (3.1. ?f(O.o'(X)) I X].8). 10) 8 10. o As a first illustration. ~ a. " . az has the smallest posterior risk and.o(X)) I X)]. .1. Let = {Oo.· . A = {ao.8) Then 0* is a Bayes rule. consider the oildrilling example (Example 1.2.2.70 and we conclude that 0'(1) = a. 0'(0) Similarly.164 Measures of Performance Chapter 3 Proposition 3. As in the proof of Theorem 1. E[I(O. r(a. (3..89. The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all corrpeting procedures. and the resnlt follows from (3. :i " " Proof. Bayes Procedures Whfn and A Are Finite.o(X))] = E[E(I(O. I x) > r(o'(x) I x) = E[I(O. let Wij > 0 be given constants. !i n j. More generally consider the following class of situations.9). 11) = 5. .9) " "I " • But.2.. az. .) = 0. I 0)  + 91(0" ad r(a. Example 3.2. (3. Suppose we observe x = O.o'(X)) IX = x].2.o) = E[I(O.o(X)) I X = x] = r(o(x) Therefore.2.2.3.Op}.35.
Let π(θ) be a prior distribution assigning mass π_i to θ_i, so that π_i > 0, i = 0, ..., p, and Σ_{i=0}^p π_i = 1. Suppose, moreover, that X has density or frequency function p(x | θ) for each θ. Then, by (1.2.8), the posterior probabilities are

\pi(\theta_i \mid x) = \frac{\pi_i\, p(x \mid \theta_i)}{\sum_{j=0}^{p} \pi_j\, p(x \mid \theta_j)}

and, thus,

r(a_j \mid x) = \frac{\sum_{i=0}^{p} w_{ij}\,\pi_i\, p(x \mid \theta_i)}{\sum_{i=0}^{p} \pi_i\, p(x \mid \theta_i)}.   (3.2.10)

The optimal action δ*(x) has

r(\delta^*(x) \mid x) = \min_{0 \le j \le q} r(a_j \mid x).
Here are two interesting specializations.

(a) Classification: Suppose that p = q, we identify a_j with θ_j, j = 0, ..., p, and let w_{ij} = 1 for i ≠ j and w_{ii} = 0. This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,

r(\theta_i \mid x) = P[\theta \ne \theta_i \mid X = x]

and minimizing r(θ_i | x) is equivalent to the reasonable procedure of maximizing the posterior probability,

P[\theta = \theta_i \mid X = x] = \frac{\pi_i\, p(x \mid \theta_i)}{\sum_j \pi_j\, p(x \mid \theta_j)}.
(b) Testing: Suppose p = q = 1, π_0 = π, π_1 = 1 − π, 0 < π < 1, a_0 corresponds to deciding θ = θ_0 and a_1 to deciding θ = θ_1. This is a special case of the testing formulation of Section 1.3 with Θ_0 = {θ_0} and Θ_1 = {θ_1}. The Bayes rule is then to

decide θ = θ_1 if (1 − π)p(x | θ_1) > πp(x | θ_0)
decide θ = θ_0 if (1 − π)p(x | θ_1) < πp(x | θ_0)

and decide either a_0 or a_1 if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing between a_0 and a_1 if equality occurs. As we let π vary between zero and one, we obtain what is called the class of Neyman-Pearson tests, which provides the solution to the problem of minimizing P(type II error) given P(type I error) ≤ α. This is treated further in Chapter 4. □
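The posterior-risk computation (3.2.10) is easy to mechanize for finite Θ and A. The following sketch (my own illustration, not part of the text; the two normal populations and the observed x are made-up values) computes r(a_j | x) for a general loss matrix and returns the Bayes action; with 0-1 classification losses it reduces to choosing the θ_i with the largest posterior probability.

```python
import numpy as np
from math import exp, pi, sqrt

def bayes_action(prior, lik, loss):
    """prior: (p+1,) masses pi_i; lik: (p+1,) values p(x | theta_i) at the observed x;
    loss: (p+1, q+1) matrix with loss[i, j] = w_ij.
    Returns the index of the Bayes action and the vector of posterior risks r(a_j | x)."""
    post = prior * lik
    post = post / post.sum()          # posterior probabilities pi(theta_i | x)
    post_risk = post @ loss           # r(a_j | x), as in (3.2.10)
    return int(np.argmin(post_risk)), post_risk

def dnorm(x, m):                      # N(m, 1) density
    return exp(-0.5 * (x - m) ** 2) / sqrt(2 * pi)

# Two populations N(0, 1) and N(2, 1), equal priors, 0-1 loss (classification).
x = 1.3
prior = np.array([0.5, 0.5])
lik = np.array([dnorm(x, 0.0), dnorm(x, 2.0)])
loss = np.array([[0.0, 1.0], [1.0, 0.0]])   # w_ii = 0, w_ij = 1
j, risks = bayes_action(prior, lik, loss)
print(j, risks)                             # picks the population with higher posterior probability
```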
To complete our illustration of the utility of Proposition 3.2.1, we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation of the Probability of Success in n Bernoulli Trials. Suppose that we wish to estimate θ using X_1, ..., X_n, the indicators of n Bernoulli trials with probability of success θ. We shall consider the loss function l given by

l(\theta, a) = \frac{(\theta - a)^2}{\theta(1 - \theta)}, \quad 0 < \theta < 1, \ a \ \text{real}.   (3.2.11)
This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for θ close to zero, this l(θ, a) is close to the relative squared error (θ − a)²/θ. It makes X̄ have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5.

By sufficiency we need only consider the number of successes, S. Suppose now that we have a prior distribution. Then, if all terms on the right-hand side are finite,

r(a \mid k) = E\!\left[\frac{\theta}{1-\theta}\,\Big|\,S=k\right] - 2a\,E\!\left[\frac{1}{1-\theta}\,\Big|\,S=k\right] + a^2\,E\!\left[\frac{1}{\theta(1-\theta)}\,\Big|\,S=k\right].   (3.2.12)

Minimizing this parabola in a, we find our Bayes procedure is given by

\delta^*(k) = \frac{E\left(1/(1-\theta) \mid S = k\right)}{E\left(1/\theta(1-\theta) \mid S = k\right)}   (3.2.13)

provided the denominator is not zero. For convenience let us now take as prior density the density b_{r,s}(θ) of the beta distribution β(r, s). In Example 1.2.1 we showed that this leads to a β(k + r, n − k + s) posterior distribution for θ if S = k. If 1 ≤ k ≤ n − 1 and n ≥ 2, then all quantities in (3.2.12) and (3.2.13) are finite, and

\delta^*(k) = \frac{\int_0^1 (1/(1-\theta))\, b_{k+r,\,n-k+s}(\theta)\, d\theta}{\int_0^1 (1/\theta(1-\theta))\, b_{k+r,\,n-k+s}(\theta)\, d\theta} = \frac{B(k+r,\, n-k+s-1)}{B(k+r-1,\, n-k+s-1)} = \frac{k+r-1}{n+s+r-2},   (3.2.14)

where we are using the notation B.2.11 of Appendix B. If k = 0, it is easy to see that a = 0 is the only a that makes r(a | k) < ∞. Thus, δ*(0) = 0. Similarly, we get δ*(n) = 1. If we assume a uniform prior density (r = s = 1), we see that the Bayes procedure is the usual estimate, X̄. This is not the case for quadratic loss (see Problem 3.2.2). □
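A small numerical sketch of (3.2.14) follows (my own; the values of n, k and the prior parameters are illustrative). It compares the Bayes estimate under the weighted loss (3.2.11) with the posterior mean, which is the Bayes estimate under ordinary quadratic loss.

```python
def bayes_weighted_loss(k, n, r=1.0, s=1.0):
    """Bayes estimate of theta under loss (theta - a)^2 / theta(1 - theta),
    beta(r, s) prior, S = k successes in n trials; see (3.2.14)."""
    if k == 0:
        return 0.0
    if k == n:
        return 1.0
    return (k + r - 1.0) / (n + r + s - 2.0)

def bayes_quadratic_loss(k, n, r=1.0, s=1.0):
    """Posterior mean, i.e. the Bayes estimate under quadratic loss."""
    return (k + r) / (n + r + s)

n, k = 10, 3
print(bayes_weighted_loss(k, n), k / n, bayes_quadratic_loss(k, n))
# With r = s = 1 the weighted-loss Bayes estimate equals the usual X-bar = k/n,
# while the quadratic-loss Bayes estimate is (k + 1)/(n + 2).
```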
"Real" computation of Bayes procedures
The closed forms of (3.2.6) and (3.2.10) make the computation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that θ = (θ_1, ..., θ_p) has a hierarchically defined prior density,

\pi(\theta_1, \theta_2, \ldots, \theta_p) = \pi_1(\theta_1)\,\pi_2(\theta_2 \mid \theta_1)\cdots\pi_p(\theta_p \mid \theta_{p-1}).   (3.2.15)
Here is an example.

Example 3.2.4. The random effects model we shall study in Volume II has

X_{ij} = \mu + \Delta_i + \epsilon_{ij}, \quad 1 \le i \le I,\ 1 \le j \le J,   (3.2.16)

where the ε_ij are i.i.d. N(0, σ²_e), and μ and the vector Δ = (Δ_1, ..., Δ_I) are independent of {ε_ij : 1 ≤ i ≤ I, 1 ≤ j ≤ J}, with Δ_1, ..., Δ_I i.i.d. N(0, σ²_Δ) and μ ~ N(μ_0, σ²_μ). Here the X_ij can be thought of as measurements on individual i and Δ_i is an "individual" effect. If we now put a prior distribution on (μ, σ²_e, σ²_Δ) making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by θ = (μ, σ²_e, σ²_Δ, Δ_1, ..., Δ_I) with the X_ij | θ independent N(μ + Δ_i, σ²_e). Then p(x | θ) = ∏_{i,j} φ_{σ_e}(x_{ij} − μ − Δ_i) and

\pi(\theta) = \pi_1(\mu)\,\pi_2(\sigma^2_e)\,\pi_3(\sigma^2_\Delta)\prod_{i=1}^{I}\varphi_{\sigma_\Delta}(\Delta_i),   (3.2.17)

where φ_σ denotes the N(0, σ²) density. In such a context a loss function frequently will single out some single coordinate θ_s (e.g., Δ_1 in (3.2.17)), and to compute r(a | x) we will need the posterior distribution of θ_s | X. But this is obtainable from the posterior distribution of θ given X = x only by integrating out θ_j, j ≠ s, and if p is large this is intractable. In recent years so-called Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable, and the use of Bayesian methods has spread. We return to the topic in Volume II. □
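As a concrete illustration of the hierarchy (3.2.16)-(3.2.17), the sketch below (my own, not part of the text; the hyperparameter values are arbitrary and the variance components are treated as known) simulates data from the random effects model with a normal prior on μ.

```python
import numpy as np

def simulate_random_effects(I, J, mu0=0.0, sig2_mu=1.0, sig2_delta=0.5, sig2_e=1.0, seed=0):
    """Draw (mu, Delta_1..Delta_I, X) from the hierarchy X_ij = mu + Delta_i + eps_ij."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(mu0, np.sqrt(sig2_mu))                    # prior draw for mu
    delta = rng.normal(0.0, np.sqrt(sig2_delta), size=I)      # individual effects
    eps = rng.normal(0.0, np.sqrt(sig2_e), size=(I, J))       # measurement errors
    X = mu + delta[:, None] + eps
    return mu, delta, X

mu, delta, X = simulate_random_effects(I=5, J=4)
print(X.shape, X.mean())
# Posterior inference on a single coordinate such as Delta_1 would, in general,
# require integrating over the remaining coordinates, e.g. by MCMC.
```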
Linear Bayes estimates
When the problem of computing r(π, δ) and δ_π is daunting, an alternative is to consider a class D̃ of procedures for which r(π, δ) is easy to compute and then to look for δ̃_π ∈ D̃ that minimizes r(π, δ) for δ ∈ D̃. An example is linear Bayes estimates where, in the case of squared error loss [q(θ) − a]², the problem is equivalent to minimizing the mean squared prediction error among functions of the form a + Σ_j b_j X_j. If in (1.4.14) we identify q(θ) with Y and X with Z, the solution is

\hat{\delta}(X) = E\,q(\theta) + [X - E(X)]^T\beta,

where β is as defined in Section 1.4. For example, if in the model (3.2.16), (3.2.17) we set q(θ) = Δ_1, we can find the linear Bayes estimate of Δ_1 by using (1.4.6) and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of Δ_1 is

\hat{\Delta}_1 = E(\Delta_1) + (X - \mu)^T\beta,   (3.2.18)

where E(Δ_1) = 0, X = (X_{11}, ..., X_{1J})^T, μ = E(X), and β = Σ_{XX}^{-1}\,Σ_{X\Delta_1}. For the given model,
\mathrm{Var}(X_{1j}) = E\,\mathrm{Var}(X_{1j} \mid \theta) + \mathrm{Var}\,E(X_{1j} \mid \theta) = E(\sigma^2_e) + \sigma^2_\Delta + \sigma^2_\mu,

\mathrm{Cov}(X_{1j}, X_{1k}) = E\,\mathrm{Cov}(X_{1j}, X_{1k} \mid \theta) + \mathrm{Cov}(E(X_{1j} \mid \theta), E(X_{1k} \mid \theta)) = 0 + \mathrm{Cov}(\mu + \Delta_1, \mu + \Delta_1) = \sigma^2_\mu + \sigma^2_\Delta, \quad j \ne k,

\mathrm{Cov}(\Delta_1, X_{1j}) = E\,\mathrm{Cov}(X_{1j}, \Delta_1 \mid \theta) + \mathrm{Cov}(E(X_{1j} \mid \theta), E(\Delta_1 \mid \theta)) = 0 + \sigma^2_\Delta = \sigma^2_\Delta.

From these calculations we find β and δ_L(X). We leave the details to Problem 3.2.10. Linear Bayes procedures are useful in actuarial science; see, for example, Bühlmann (1970) and Norberg (1986).
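Using the covariance calculations above, the following sketch (my own; the numerical values of the variance components and data are illustrative, and σ²_e is treated as known so that E(σ²_e) = σ²_e) assembles Σ_XX and Σ_XΔ₁ for X = (X_11, ..., X_1J)ᵀ and evaluates the linear Bayes estimator of (3.2.18).

```python
import numpy as np

def linear_bayes_delta1(x1, mu, sig2_e, sig2_delta, sig2_mu):
    """Linear Bayes estimate of Delta_1 from X = (X_11,...,X_1J); see (3.2.18)."""
    J = len(x1)
    # Sigma_XX: diagonal sig2_e + sig2_delta + sig2_mu, off-diagonal sig2_mu + sig2_delta
    Sigma_XX = np.full((J, J), sig2_mu + sig2_delta) + sig2_e * np.eye(J)
    Sigma_XD = np.full(J, sig2_delta)            # Cov(Delta_1, X_1j) = sig2_delta
    beta = np.linalg.solve(Sigma_XX, Sigma_XD)   # beta = Sigma_XX^{-1} Sigma_XD
    return float((x1 - mu) @ beta)               # E(Delta_1) = 0

x1 = np.array([1.2, 0.8, 1.5, 1.1])
print(linear_bayes_delta1(x1, mu=0.7, sig2_e=1.0, sig2_delta=0.5, sig2_mu=1.0))
```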
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier, the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is the (usually improper) prior π(θ) ≡ c. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than modes and that again accounts in part for the popularity of maximum likelihood.

An important property of the MLE is equivariance: An estimating method M producing the estimate θ̂_M is said to be equivariant with respect to reparametrization if for every one-to-one function h from Θ to Ω = h(Θ), the estimate of ω ≡ h(θ) is ω̂_M = h(θ̂_M). In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss, then the Bayes procedure θ̂_B = E(θ | X) is not equivariant for nonlinear transformations because

E(h(\theta) \mid X) \ne h(E(\theta \mid X))

for nonlinear h (e.g., Problem 3.2.3).

The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is

r_\Theta(a \mid x) = \sum_{\theta \in \Theta} [\theta - a]^2\, \pi(\theta \mid x).   (3.2.19)

If we set ω = h(θ) for h one-to-one onto Ω = h(Θ), then ω has prior λ(ω) = π(h^{-1}(ω)) and in the ω parametrization, the posterior Bayes risk is

r_\Omega(a \mid x) = \sum_{\omega \in \Omega} [\omega - a]^2\, \lambda(\omega \mid x) = \sum_{\theta \in \Theta} [h(\theta) - a]^2\, \pi(\theta \mid x).   (3.2.20)

Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, r_Ω(a | x) ≠ r_Θ(h⁻¹(a) | x).
Loss functions of the form l(θ, a) = Q(P_θ, P_a) are necessarily equivariant. The Kullback-Leibler divergence K(θ, a), θ, a ∈ Θ, is an example of such a loss function. It satisfies K_Ω(ω, a) = K_Θ(θ, h⁻¹(a)); thus, with this loss function,

r_\Omega(a \mid x) = r_\Theta(h^{-1}(a) \mid x).

See Problem 2.2.38. In the discrete case using K means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.10)). In the N(θ, σ²_0) case the KL (Kullback-Leibler) loss K(θ, a) is (n/2σ²_0)(a − θ)² (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families

K(\eta, a) = \sum_{j=1}^{k} [\eta_j - a_j]\, E_\eta T_j + A(a) - A(\eta).

Moreover, if we can find the KL loss Bayes estimate η̂_BKL of the canonical parameter η and if η = c(θ) : Θ → E is one-to-one, then the KL loss Bayes estimate of θ in the general exponential family is θ̂_BKL = c⁻¹(η̂_BKL). For instance, in Example 3.2.1 where μ is the mean of a normal distribution and the prior is normal, we found the squared error Bayes estimate μ̂_B = wη_0 + (1 − w)X̄, where η_0 is the prior mean and w is a weight. Because the KL loss is equivalent to squared error for the canonical parameter μ, then if ω = h(μ), ω̂_BKL = h(μ̂_BKL), where μ̂_BKL = wη_0 + (1 − w)X̄.
Bayes procedures based on the Kullback-Leibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume II.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1969; Lindley, 1965; Savage, 1954) who argue on normative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior π. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally, as we consider all possible priors, that the class D_0 of Bayes procedures and their limits is complete in the sense that for any δ ∈ D there is a δ_0 ∈ D_0 such that R(θ, δ_0) ≤ R(θ, δ) for all θ.
170
Measures of Performance
Chapter 3
3.3
MINIMAX PROCEDURES
In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worstcase analysis; the true () is one that is as "hard" as possible. That is, 6 1 is better than 82 from a minimax point of view if sUPe R(0,6 1 ) < sUPe R(B, 82 ) and is said to be minimax if
o·
supR(O, 0')
.
= infsupR(O,o).
,
.
,
I·
Here (J and 8 are taken to range over 8 and V = {all possible decision procedures (possibly randomized)} while P = {p. : 0 E e}. It is fruitful to consider proper subclasses of V and subsets of P. but we postpone this discussion. The nature of this criterion and its relation to Bayesian optimality is clarified by considering a socalled zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set V of all randomized decision procedures whereas Nature has at her disposal all prior distributions 1r on 8. For the basic game, 5 picks 0 without N's knowledge, N picks 1f without 5's knowledge and then all is revealed and S pays N
r(Jr,o)
, ,
,
=
J
R(O,o)dJr(O)
where the notation f R(O, o)dJr(0) stands for f R(0, o)Jr(O)dII in the continuous case and L, R(Oj, o)Jr(O j) in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice 0 of 5 before picking 1r and 5 knows the rules of the game. Then
Ii ,
N naturally picks 1f,; such that
r(Jr"o)
that is, 1f,; is leastfavorable against such that
~supr(Jr,o),
o. Knowing the rules of the game S naturally picks 0*
•
(3.3.1)
r( Jr,., 0·) = sup r(Jr, 0') = inf sup r(Jr, 0).
We claim that 0* is minimax. To see this we note first that,
.
,
.
(3.3.2)
for allJr, o. On the other hand, if R(0" 0) = suP. R(0, 0), then if Jr, is point mass at 0" r( Jr" 0) = R(O" 0) and we conclude that supr(Jr,o) = sup R(O, 0)
,
•
•
(3.3.3)
I
,
•
•
Section 3.3
Minimax Procedures
171
and our claim follows. II: S is told the choke 7r of N before picking 6 and N knows the rules of the game. Then S naturally picks 6 1r such that
That is, r5 rr is a Bayes procedure for
11".
Then N should pick
7r*
such that (3.3.4)
For obvious reasons, 1f* is called a least favorable (to S) prior distribution. As we shall see by example, altbough the rigbthand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be umque. The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both
e and D are finite,
?5
11"
then:
(a)
v=supinfr(1I",6), v=infsupr(1I",6)
rr
?5
are both assumed by (say)
7r*
(least favorable), 6* minimax, respectively. Further,
") =v v=r1f,u (
.
(3.3.5)
v and v are called the lower and upper values of the basic game. When v (saY), v is called the value of the game.
=
v
=
v
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification ~ {eo} and = {eIl (Example 3.2.2) but is too reSlrictive in ilS and testing when assumption for the great majority of inference problems. A generalization due to Wald and Karlinsee Karlin (1 959)states that the conclusions of the theorem remain valid if and D are compact subsets of Euclidean spaces. There are more farreaching generalizations but, as we shall see later, without some form of compactness of and/or D, although equality of v and v holds quite generally, existence of least favorable priors and/or minimax procedures may fail.
eo
e,
e
e
The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on and D and are easy to prove.
e
Proposition 3.3.1. Suppose 6**. 7r** can be found such that
U
£** =
r ()1r.',
1f **
= 1rJ••
(3.3.6)
172
Measures of Performance
Chapter 3
that is, 0** is Bayes against 11"** and 11"** is least favorable against 0**. Then v R( 11"** ,0**). That is, 11"** is least favorable and J*'" is minimax.
To utilize this result we need a characterization of 11"8. This is given by
v
=
Proposition 3.3.2.11"8 is leastfavorable against
°iff
1r,jO: R(O,b) = supR(O',b)) = 1.
.'
(3,3,7)
That is, n a assigns probability only to points () at which the function R(·, 0) is maximal.
Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that 11"8 may not be unique. In particular, if R(O, 0) = constant. the rule has constant risk, then all 11" are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
Proof of Proposition 3.3.1. Note first that we always have
v<v
because, trivially,
i~fr(1r,b)
(3.3.8)
< r(1r,b')
(3.3,9)
for aIln, 5'. Hence,
v = sup in,fr(1r,b) < supr(1r,b')
•
(3,3.10)
•
for all 0' and v
<
infa, sUP1r 1'(11", (/) =
v. On the other hand. by hypothesis,
sup1'(1r,6**) > v.
v> inf1'(11"*'",6)
,
= 1'(1I"*",0*'") =
.
(3.3.11)
Combining (3.3.8) and (3.3.11) we conclude that
:r
v
as advertised.
= i~f 1'(11"**,0) = 1'(11"**,6**) = s~p1'(1I",0*"')
= V
(3.3.12)
" 'I
,.;
~l
o
Proofof Proposition 3.3.2. 1r is least favorable for b iff E.R(8,6) =
,.,
f r(O,b)d1r(O) =s~pr( ..,6).
•
(3.3.13)
But by (3.3.3),
supr(..,b) = supR(O, 6),
(3.3,14)
•
i ,
Because E.R(8, 6)
= sUP. R(O, b), (3.3,13) is possible iff (3.3.7) holds.
o
Putting the two propositions together we have the following.
•
Section 3.3
Minimax Procedures
173
11"*
Theorem 3.3.2. Suppose 0* has sUPo R((}, 0*) = 1" < 00. If there exists a prior that 0* is Bayes for 11"* and tr" {(} : R( (}, 0") = r} = 1, then 0'" is minimax.
such
Example 3.3.1. Minimax estimation in the Binomial Case. Suppose S has a B(n,B) distribution and X = Sjn,as in Example 3.2.3. Let I(B, a) ~ (Ba)'jB(IB),O < B < 1. For this loss function,
R(B X) = E(X _B)' , B(IB)

=
B(IB) = ~ nB(IB) n'

and X does have constant risk. Moreover, we have seen in Example 3.2.3 that X is Bayes, when 8 is U(Ol 1). By Theorem 3.3.2 we conclude that X is minimax and, by Proposition 3.3.2, the uniform distribution least favorable. For the usual quadratic loss neither of these assertions holds. The minimax estimate is
o's=S+hln = .,fii X+ I 1 () n+.,fii .,fii+l .,fii+1 2
This estimate does have constant risk and is Bayes against a (J( y'ri/2, vn/2) prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as n t 00 of the ratio of the risks of 0* and X is > 1 for every () =f ~. At B = the ratio tends to 1. Details are left to Problem 3.3.4. 0
!
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity v > 0 for n seconds or, if it is not working, does not answer. Because of the general "noise" level in space the signals received on Earth vary randomly whether the satellite is sending or not. The mean voltage per second of the signal for each of the n seconds is recorded. Denote the mean voltage of the signal received through the ith second less expected mean voltage due to noise by X_i. We assume that the X_i are independently and identically distributed as N(μ, σ²) where μ = v if the satellite functions, and μ = 0 otherwise. The variance σ² of the "noise" is assumed known. Our problem is to decide whether "μ = 0" or "μ = v." We view this as a decision problem with 0 − 1 loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk?

A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability π to 0 and 1 − π to v, use 0 − 1 loss, and set L(x, 0, v) = p(x | v)/p(x | 0), then the Bayes test decides μ = v if

L(x, 0, v) = \exp\left\{\frac{v}{\sigma^2}\sum X_i - \frac{nv^2}{2\sigma^2}\right\} > \frac{\pi}{1-\pi}

and decides μ = 0 if

L(x, 0, v) < \frac{\pi}{1-\pi}.
This test is equivalent to deciding μ = v (Problem 3.3.1) if, and only if,

T = \frac{1}{\sigma\sqrt{n}}\sum X_i \ge t,

where

t = \frac{\sigma}{v\sqrt{n}}\left[\log\frac{\pi}{1-\pi} + \frac{nv^2}{2\sigma^2}\right].

If we call this test δ_π,

R(0, \delta_\pi) = 1 - \Phi(t), \qquad R(v, \delta_\pi) = \Phi\left(t - \frac{v\sqrt{n}}{\sigma}\right).

To get a minimax test we must have R(0, δ_π) = R(v, δ_π), which is equivalent to

-t = t - \frac{v\sqrt{n}}{\sigma}

or

t = \frac{v\sqrt{n}}{2\sigma}.

Because this value of t corresponds to π = ½, the intuitive test, which decides μ = v if and only if T ≥ ½[E_0(T) + E_v(T)], is indeed minimax. □
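The minimax cutoff t = v√n/2σ and the resulting common error probability are easy to evaluate. A brief sketch follows (my own; the values of v, σ and n are illustrative).

```python
import math

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def minimax_test(v, sigma, n):
    """Cutoff for T = sum(X_i) / (sigma*sqrt(n)) and the common (maximum) error probability
    of the minimax test in Example 3.3.2."""
    t = v * math.sqrt(n) / (2.0 * sigma)
    max_error = 1.0 - norm_cdf(t)     # equals Phi(t - v*sqrt(n)/sigma) = Phi(-t)
    return t, max_error

print(minimax_test(v=0.5, sigma=2.0, n=64))
```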
If Θ is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem 3.3.2.
Theorem 3.3.3. Let δ* be a rule such that sup_θ R(θ, δ*) = r < ∞, let {π_k} denote a sequence of prior distributions such that π_k{θ : R(θ, δ*) = r} = 1, and let r_k = inf_δ r(π_k, δ), where r(π_k, δ) denotes the Bayes risk wrt π_k. If

r_k \to r \ \text{as}\ k \to \infty,   (3.3.15)

then δ* is minimax.

Proof. Because r(π_k, δ*) = r,

\sup_\theta R(\theta, \delta^*) = r_k + o(1),

where o(1) → 0 as k → ∞. But by (3.3.13), for any competitor δ,

\sup_\theta R(\theta, \delta) \ge E_{\pi_k}(R(\theta, \delta)) \ge r_k = \sup_\theta R(\theta, \delta^*) - o(1).   (3.3.16)

If we let k → ∞, the left-hand side of (3.3.16) is unchanged, whereas the right tends to sup_θ R(θ, δ*). □
Example 3.3.3. Normal Mean. We now show that X̄ is minimax in Example 3.2.1. Identify π_k with the N(η_0, τ²) prior where k = τ². Then

r_k(\bar{X}) = E\,E((\bar{X} - \theta)^2 \mid \theta) = \frac{\sigma^2}{n},

whereas the Bayes risk of the Bayes rule of Example 3.2.1 is

\inf_\delta r_k(\delta) = \frac{(\sigma^2/n)\,\tau^2}{(\sigma^2/n) + \tau^2} = \frac{\sigma^2}{n} - \frac{(\sigma^2/n)^2}{(\sigma^2/n) + \tau^2}.

Because (σ²/n)²/((σ²/n) + τ²) → 0 as τ² → ∞, we can conclude that X̄ is minimax. □
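Numerically, the Bayes risks r_k of the N(η₀, τ²) priors approach the constant risk σ²/n of X̄ as τ² grows, which is exactly condition (3.3.15). A brief check (my own; the values of σ² and n are illustrative) is below.

```python
def bayes_risk_normal_prior(sigma2, n, tau2):
    """Bayes risk of the Bayes rule for a N(eta0, tau2) prior in the N(theta, sigma2) model."""
    return (sigma2 / n) * tau2 / (sigma2 / n + tau2)

sigma2, n = 1.0, 10
for tau2 in (0.1, 1.0, 10.0, 1000.0):
    print(tau2, bayes_risk_normal_prior(sigma2, n, tau2), sigma2 / n)
# The Bayes risks increase to sigma^2 / n, the constant risk of X-bar.
```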
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose X_1, ..., X_n are i.i.d. F ∈ ℱ, where

\mathcal{F} = \{F : \mathrm{Var}_F(X_1) \le M\}.

Then X̄ is minimax for estimating θ(F) = E_F(X_1) with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let π_k be a prior distribution on ℱ constructed as follows:(1)

(i) π_k{F : Var_F(X_1) ≠ M} = 0.

(ii) π_k{F : F ≠ N(μ, M) for some μ} = 0.

(iii) F is chosen by first choosing μ = θ(F) from a N(0, k) distribution and then taking F = N(θ(F), M).

Evidently, the Bayes risk is now the same as in Example 3.3.3 with σ² = M. Because, evidently,

\max_{\mathcal{F}} R(F, \bar{X}) = \max_{\mathcal{F}} \frac{\mathrm{Var}_F(X_1)}{n} = \frac{M}{n},

Theorem 3.3.3 applies and the result follows. □

Minimax procedures and symmetry
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" θ. There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and Casella (1998), for instance. We shall discuss this approach somewhat, by example, in Chapter 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading.

Summary. We introduce the minimax principle in the context of the theory of games. Using this framework we connect minimaxity and Bayes methods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
and then see if within the Do we can find 8* E Do that is best according to the "gold standard. The lower (upper) value v(v) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk. II . .4 UNBIASED ESTIMATION AND RISK INEQUALITIES 3.1 Unbiased Estimation. I I .I . in many cases." R(0. such as 5(X) = q(Oo). This result is extended to rules that are limits of Bayes rules with constant risk and we use it to show that x is a minimax rule for squared error loss in the N (0 . on other grounds. symmetry. it is natural to consider the computationally simple class of linear estimates. E Do that minimizes the Bayes risk with respect to a prior 1r among all J E Do. for example. Do C D. for instance. This notion has intuitive appeal. the game is said to have a value v. Von Neumann's Theorem states that if e and D are both finite. are minimax. there is a least favorable prior 1[" and a minimax rule 8* such that J* is the Bayes rule for n* and 1r* maximizes the Bayes risk of J* over all priors. compute procedures (in particular estimates) that are best in the class of all procedures.• •••. Obviously. in Section 1. I X n are d. all 5 E Va. we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule 8 and N selects a prior 1['.6. An estimate such that Biase (8) 0 is called unbiased.L and (72 when XI. The most famous unbiased estimates are the familiar estimates of f. . Moreover... An alternative approach is to specify a proper subclass of procedures. 5') for aile. computational ease.. then the game of S versus N has a value 11. 5) > R(0. and so on. ruling out.1. S(Y) = L:~ 1 diYi . 0'5) model. we can also take this point of view with humbler aims. This approach coupled with the principle of unbiasedness we now introduce leads to the famous GaussMarkov theorem proved in Section 6. N (Il.. the solution is given in Section 3. then in estimating a linear function of the /3J. We show that Bayes rules with constant risk. We introduced.3.d. Bayes and minimaxity. 176 Measures of Performance Chapter 3 . D. 3. looking for the procedure 0. This approach has early on been applied to parametric families Va. (72) = . When v = ii. the notion of bias of an estimate O(X) of a parameter q(O) in a model P {Po: 0 E e} as = Biaso(5) = Eo5(X)  q(O). A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called leas! favorable. according to these criteria. Ii I More specifically.4. estimates that ignore the data.2. . which can't be beat for 8 = 80 but can obviously be arbitrarily terrible. if Y is postulated as following a linear regression model with E(Y) = zT (3 as in Section 2.2. When Do is the class of linear procedures and I is quadratic Joss. for which it is possible to characterize and. Survey Sampling In the previous two sections we have considered two decision theoretic optimality principles. v equals the Bayes risk of the Bayes rule 8* for the prior 1r*. In the nonBayesian framework.1 . or more generally with constant risk over the support of some prior.
UMVU (uniformly minimum variance unbiased).. reflecting the probable correlation between ('ttl.4.4.. ~ N L. (3. for instance..< 2 ~ ~ (3. . . ...4) where 2 CT. .4.5) This method of sampling does not use the information contained in 'ttl. (3. .XN} 1 (3. .. (~) If{aj.il) Clearly for each b...14) and has = iv L:fl (3. UN' One way to do this.. We let Xl.. X N ) as parameter . . a census unit. Unbiased estimates playa particularly important role in survey sampling.. If .3. . .4. (3.3) = 0 otherwise..(Xi 1=1 I" N ~2 x) .4.1.3 and Problem 1. Unbiased Estimates in Survey Sampling.8) Jt =X .. [J = ~ I:~ 1 Ui· XR is also unbiased.4. .···. Ui is the last census income corresponding to = it E~ 1 'Ui.. .4. ..2) 1 Because for unbiased estimates mean square error and variance coincide we call an unbiased estimate O*(X) of q(O) that has minimum MSE among all unbiased estimat~s for all 0.an}C{XI.. We want to estimate the parameter X = Xj.3. UN) and (Xl. X N ). We ignore difficulties such as families moving.. Write Xl. .. to determine the average value of a variable (say) monthly family income during a time between two censuses and suppose that we have available a list of families in the unit with family incomes at the last census. This leads to the model with x = (Xl. these are both UMVU. It is easy to see that the natural estimate X ~ L:~ 1 Xi is unbiased (Problem 3.1 ) ~ 1 nl L (Xi ~ ooc n ~ X) 2 .2. and u =X  b(U . is to estimate by a regression estimate ~ XR Xi. Suppose we wish to sample from a finite population.' . X N for the unknown current family incomes and correspondingly UI.Section 3.Xn denote the incomes of a sample of n families drawn at random without replacement. .4. As we shall see shortly for X and in Volume 2 for . Example 3.6) where b is a prespecified positive constant.4 Unbiased Estimation and Risk Inequalities 177 given by (see Example 1. . UN for the known last census incomes.
4). (iii) Unbiased_estimates do not <:.4..7) II ! where Ji is defined by Xi = xJ. This makes it more likely for big incomes to be induded and is intuitively desirable.. the unbiasedness principle has largely fallen out of favor for a number of reasons. " 1 ".... .4. . However. 111" N < 1 with 1 1fj = n. I II.19). They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later.4. . ..X)!Var(U) (Problem 3. II it >: . I j . (ii) Bayes estimates are necessarily biasedsee Problem 3. M (3.. 178 Measures of Performance Chapter 3 i . 1. q(e) is biased for q(e) unless q is linear.j' I ': Because 1rj = P[Xj E Sj by construction unbiasedness follows. .. 0 Discussion. . The result is a sample S = {Xl. ..15). . ' .2Gand minimax estimates often are.'I . . ". The HorvitzThompson estimate then stays unbiased. :I . Specifically let 0 < 7fl. Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems. Ifthc 'lrj are not all equal. If B is unbiased for e. For each unit 1. Xl!Var(U).~'! I . outside of sampling.3. N toss a coin with probability 7fj of landing heads and select Xj if the coin lands heads.bey the attractive equivariance property. . .Xi XHT=LJN i=l 7fJ.4..18. for instance. ~ . It is possible to avoid the undesirable random sample size of these schemes and yet have specified 1rj. An alternative approach to using the Uj is to not sample all units with the same prob ability. (i) Typically unbiased estimates do not existsee Bickel and Lehmann (1969) and Problem 3. To see this write ~ 1 ' " Xj XHT ~ N LJ 7'" l(Xj E S). X is not unbiased but the following estimate known as the HorvitzThompson estimate is: Ef . estimate than X and the best choice of b is bopt The value of bopt is unknown but can be estimated by = The resulting estimate is no longer unbiased but behaves well for large samplessee Problem 5. A natural choice of 1rj is ~n.I.4. Unbiasedness is also used in stratified sampling theory (see Problem 1. I .i . this will be a beller cov(U.11.j ..3. X M } of random size M such that E(M) = n (Problem 3. j=1 3 N ! I: .1 the correlation of Ui and Xi is positive and b < 2Cov(U. .
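The Horvitz-Thompson estimate discussed above is simple to implement and its unbiasedness is easy to confirm by simulation. The sketch below is my own illustration (not part of the text): the population values, the last-census proxies used to set the inclusion probabilities, and the target expected sample size are all made up, and the coin-toss (Poisson) sampling scheme is the one described in the Discussion.

```python
import numpy as np

def horvitz_thompson(sample_values, sample_probs, N):
    """HT estimate of the population mean: (1/N) * sum over sampled units of x_j / pi_j."""
    return np.sum(np.asarray(sample_values) / np.asarray(sample_probs)) / N

rng = np.random.default_rng(1)
N = 200
x = rng.gamma(shape=2.0, scale=10.0, size=N)       # current "incomes" of the population
u = x * rng.uniform(0.8, 1.2, size=N)              # known last-census proxies
pi = np.minimum(0.9, 50 * u / u.sum())             # pi_j roughly proportional to u_j; E(M) near 50

estimates = []
for _ in range(2000):
    in_sample = rng.uniform(size=N) < pi           # independent coin tosses, unit j kept w.p. pi_j
    estimates.append(horvitz_thompson(x[in_sample], pi[in_sample], N))
print(np.mean(estimates), x.mean())                # close: the HT estimate is unbiased
```

Unbiasedness holds because E[1(j in S) x_j / pi_j] = x_j for every unit j, whatever the pi_j, provided the same pi_j are used in sampling and in the estimate.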
OJ exists and is finite. 167.4. f3 is the least squares estimate.4. (II) 1fT is any statistic such that E 8 (ITf) < 00 for all 0 E then the operations of integration and differentiation by B can be interchanged in J T( x )p(x. B) for II to hold. good estimates in large samples ~ 1 ~ are approximately unbiased. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous~case theorems given later. 3. Then 11 holdS provided that for all T such that E.O) > O} docs not depend on O. in the linear regression model Y = ZDf3 + e of Section 2. Some classical conditiolls may be found in Apostol (1974). p. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. We expect that jBiase(Bn)I/Vart (B n ) .8) is assumed to hold if T(xJ = I for all x. unbiased estimates are still in favor when it comes to estimating residual variances.p) where i = (Y . From this point on we will suppose p(x. %0 [j T(X)P(x.   .O E 81iJO log p( x..ZDf3). That is.Section 3. (I) The set A ~ {x : p(x. The lower bound is interesting in its own right. and p is the number of coefficients in f3. the variance a 2 = Var(ei) is estimated by the unbiased estimate 8 2 = iTe/(n .4. This preference of 8 2 over the MLE 52 = iT e/n is in accord with optimal behavior when both the number of observations and number of parameters are large. We make two regularity assumptions on the family {Pe : BEe}.4. For instance. has some decision theoretic applications.( lTD < 00 J . and appears in the asymptotic optimality theory of Section 5. which can be used to show that an estimate is UMVU. See Problem 3. e. What is needed are simple sufficient conditions on p(x. 0 or equivalently Vare(Bn)/MSEe(Bn) t 1 as n t 00. We suppose throughout that we have a regular parametric model and further that is an open subset of the line. e e.4.4. For all x E A. Assumption II is practically useless as written.9. suppose I holds.2 The Information Inequality The oneparameter case We will develop a lower bound for the variance of a statistic. Simpler assumptions can be fonnulated using Lebesgue integration theory. B)dx. for integration over Rq. B) is a density.8) whenever the righthand side of (3. Note that in particular (3. B)dx. as we shall see in Chapters 5 and 6. The arguments will be based on asymptotic versions of the important inequalities in the next subsection.8) is finite.4 Unbiased Estimation and Risk Inequalities 179 Nevertheless.OJdx] = J T(xJ%oP(x. and we can interchange differentiation and integration in p(x.OJdx (3. For instance. Finally.2.
o Example 3. 1 X n is a sample from a Poisson PCB) population. O)dx :Op(x.4.. which is denoted by I(B) and given by Proposition 3. (3.9) Note that 0 < 1(0) < Lemma 3. (3. O)dx ~ :0 J p(x. then I and II hold. 1 X n is a sample from a N(B. 0))' p(x.O)]j P(x.O)] dx " I'  are continuous functions(3) of O.O)} p(x. .. O)dx. where a 2 is known. Then (3.1.11 ) . (J2) population. t.O)] dxand J T(x) [:oP(X.180 for all Measures of Performance Chapter 3 e. Suppose Xl. I 1(0) 1 = Var (~ !Ogp(X.10) and.4.n and 1(0) = Var (~n_l .1.6. (~ logp(X. Similarly. 00. Then Xi . I J J & logp(x 0 ) = ~n 1 .4. suppose XI.O) & < 00.. thus.I [I ·1 .4. If I holds it is possible to define an important characteristic of the family {Po}. /fp(x. I' ..1) ~(O) = 0la' and 1 and 11 are satisfied.0 Xi) 1 = nO =  0' n O· o .B(O)} is an exponential family and TJ(B) has a nonvanishing continuous derivative on 8.4. 1 and 11 are satisfied for samples from gamma and beta distributions with one parameter fixed. 0))' = J(:Ologp(x. Suppose that 1 and II hold and that E &Ologp(X. . . 0) ~ 1(0) = E. Ii ! I h(x) exp{ ~(O)T(x) .2.&0 '0 {[:oP(X. O)dx = O. the Fisher information number. Then (see Table 1.O)). It is not hard to check (using Laplace transform theory) that a oneparameter exponential family quite generally satisfies Assumptions I and II.i J T(x) [:OP(X.4. For instance.  Proof. the integrals .
(e) and Var. e) and T(X).. Suppose that X = (XI.'(~)I)'. and that the conditions of Theorem 3.1.4. Proposition 3.(0) = E(feiogf(X"e»)'. by Lemma 3.1.1. ej. If we consider the class of unbiased estimates of q(B) = B.(0) is differentiable and Var (T(X)) . (3.13) By (A. We get (3.4. .Xu) is a sample from a population with density f(x.15) The theorem follows because.Section 3. Denote E.4.O)dX= J T(x) UOIOgp(x.O). Using I and II we obtain.O)dX. e)) = I(e). 0 (3.(T(X)) > [1/.16) The number 1/I(B) is often referred to as the information or CramerRoo lower bound for the variance of an unbiased estimate of 'tjJ(B). 1/. (3. we obtain a universal lower bound given by the following. Suppose that I and II hold and 0 < 1(0) < 00.1 hold.4.17) . Suppose the conditions of Theorem 3.14) Now let ns apply the correlation (CauchySchwarz) inequality (A. Var (t. Then for all 0.4 Unbiased Estimation and Risk Inequalities 181 Here is the main result of this section..4.O))P(x.I 1. T(X)) .' (e) = Cov (:e log p(X.4. Let [.2.(T(X) > I(O) 1 (3. 1/.4.1. Theorem 3.16) to the random variables 8/8010gp(X. > [1/. (Information Inequality). 0 The lower bound given in the information inequality depends on T(X) through 1/.4.(T(X)) by 1/.(T(X)) < 00 for all O.12) Proof. Let T(X) be all)' statistic such that Var.4. (3. nI.(0). .1 hold and T is an unbiased estimate ofB.I1.'(0) = J T(X):Op(x. 10gp(X. CoroUary 3.4. Then Var.l4) and Lemma 3. then I(e) = nI.4. 0 E e.4. Here's another important special case.'(0)]'  I(e) . 1/.(0).4.
Proof. This is a consequence of Lemma 3.4.1: because $\log p(X,\theta) = \sum_{i=1}^n \log f(X_i,\theta)$,

$$I(\theta) = \mathrm{Var}\left[\frac{\partial}{\partial\theta}\log p(X,\theta)\right] = \sum_{i=1}^n \mathrm{Var}\left[\frac{\partial}{\partial\theta}\log f(X_i,\theta)\right] = nI_1(\theta). \qquad \Box$$

$I_1(\theta)$ is often referred to as the information contained in one observation. We have just shown that the information $I(\theta)$ in a sample of size $n$ is $nI_1(\theta)$.

Example 3.4.3. Suppose $X_1,\dots,X_n$ is a sample from a normal distribution with unknown mean $\theta$ and known variance $\sigma^2$. As we previously remarked, the MLE is $\hat\theta = \bar X$, and the conditions of the information inequality are satisfied. Because $\bar X$ is unbiased, by Corollary 3.4.1 we see that the conclusion that $\bar X$ is UMVU follows if

$$\mathrm{Var}(\bar X) = \frac{1}{nI_1(\theta)}. \qquad (3.4.18)$$

Now $\mathrm{Var}(\bar X) = \sigma^2/n$, whereas if $\varphi$ denotes the $N(0,1)$ density,

$$I_1(\theta) = \mathrm{Var}\left[\frac{\partial}{\partial\theta}\log\left(\frac{1}{\sigma}\varphi\left(\frac{X_1-\theta}{\sigma}\right)\right)\right] = \mathrm{Var}\left(\frac{X_1-\theta}{\sigma^2}\right) = \frac{1}{\sigma^2},$$

and (3.4.18) follows. Thus $\bar X$ is UMVU. Note that because $\bar X$ is UMVU whatever may be $\sigma^2$, we have in fact proved that $\bar X$ is UMVU even if $\sigma^2$ is unknown. $\Box$

We can similarly show (Problem 3.4.1) that if $X_1,\dots,X_n$ are the indicators of $n$ Bernoulli trials with probability of success $\theta$, then $\bar X$ is a UMVU estimate of $\theta$. For a sample from a $\mathcal{P}(\theta)$ distribution, the MLE is $\hat\theta = \bar X$ and, because $\bar X$ is unbiased with $\mathrm{Var}(\bar X) = \theta/n = 1/(nI_1(\theta))$, $\bar X$ is again UMVU. These are situations in which $\bar X$ follows a one-parameter exponential family. This is no accident.

Next we note how we can apply the information inequality to the problem of unbiased estimation. If the family $\{P_\theta\}$ satisfies I and II and if there exists an unbiased estimate $T^*$ of $\psi(\theta)$ such that $\mathrm{Var}_\theta[T^*(X)] = [\psi'(\theta)]^2/I(\theta)$ for all $\theta \in \Theta$, then $T^*$ is UMVU as an estimate of $\psi(\theta)$.

Theorem 3.4.2. Suppose that the family $\{P_\theta : \theta \in \Theta\}$ satisfies assumptions I and II and there exists an unbiased estimate $T^*$ of $\psi(\theta)$ that achieves the lower bound of Theorem 3.4.1 for every $\theta$. Then $\{P_\theta\}$ is a one-parameter exponential family with density or frequency function of the form

$$p(x,\theta) = h(x)\exp[\eta(\theta)T^*(x) - B(\theta)]. \qquad (3.4.19)$$

Conversely, if $\{P_\theta\}$ is a one-parameter exponential family of the form (1.6.1) with natural sufficient statistic $T(X)$, and $\eta(\theta)$ has a continuous nonvanishing derivative on $\Theta$, then $T(X)$ achieves the information inequality bound and is a UMVU estimate of $E_\theta(T(X))$.
10g(1 . be a denumerable dense subset of 8.A (B) .4. continuous in B.23) for all B1 . p(x.4.B) = a.3) that we have the canonical case with ~(B) = Band B( Bj = A(B) = logf h(x)exp{BT(x))dx. Note that if AU = nmAem .4.6. then (3. From this equality of random variables we shall show that Pe[X E A'I = 1 for all B where A' = {x: :Blogp(x. the information bound is [A"(B))2/A"(B) A"(B) Vare(T(X») so that T(X) achieves the information bound as an estimate of EeT(X).2. B) = a. By solving for a" a2 in (3. we see that all a2 are linear combinations of {} log p( Xj.4.(B)T' (x) +a2(B) for all BEe}. " and both sides are continuous in B.B)) + nz)[log B ..4. B) IdB.4.19) is highly technical.4.20) hold. 0 Example 3.20) guarantees Pe(Ae) ~ 1 and assumption 1 guarantees P.20) with Pg probability 1 for each B. (3.ll.19).4. thus.p(B) = A'(B) and.4. (3. .Section 3.4. Thus. and only if. Suppose without loss of generality that T(xI) of T(X2) for Xl. If A. it is necessary.23) must hold for all B. B) = aj(B)T'(x) + a2(B) (3. denotes the set of x for which (3.. The passage from (3. (3.20) to (3.4.A'(B» = VareT(X) = A"(B).4.1) we assume without loss of generality (Problem 3.14) and the conditions for equality in the correlation inequality (A. (B)T'(X) + a2(B) (3. B .2 and. Here is the argument.(A") = 1 for all B'.4. B) = 2"' exp{(2nj 2"' exp{(2n.4. Then ~ ( 80 logp X. j hence.20) with respect to B we get (3.(Ae) = 1 for all B' (Problem 3. Our argument is essentially that of Wijsman (1973).4. :e Iogp(x.21 ) Upon integrating both sides of (3. J 2 Pe.24) [(B) = Vare(T(X) . 2 A ** = A'" and the result follows.6.4 and 2. Conversely in the exponential family case (1.4.6).25) But .4 Unbiased Estimation and Risk Inequalities 183 Proof We start with the first assertion.B) I + 2nlog(1  BJ} .2.22) for j = 1. B .4.16) we know that T'" achieves the lower bound for all (J if. + n2) 10gB + (2n3 + n3) Iog(1 . But now if x is such that = 1. X2 E A"''''..1. then (3. Let B . By (3. In the HardyWeinberg model of Examples 2. However.4.B) so that = T(X) .. there exist functions al ((J) and a2 (B) such that :B 10gp(X.
DiscRSsion. I I' f . Na).p' (B) I'( I (B). . B) 2 1 a = p(x. =e'E(~Xi) = . in many situations. . B) aB' p(x.4. Suppose p('.2. 0 0 ..(e) exist.184 Measures of Performance Chapter 3 where we have used the identity (2nl + HZ) + (2n3 + nz) = 211. () (ell" .. Theorem 3. It often happens.26) holds. B)  2 (a log p(x.25) we obtain a' 10gp(X. . in the U(O. B) )' aB (3.4. See Volume II.B)] ~ B. A third metho:! would be to use Var(B) = I(I(B) and formula (3. 0 ~ Note that by differentiating (3. B) to canonical form by setting t) = 10g[B((1 .6. B) satisfies in addition to I and II: p(·. as Theorem 3. we will find a lower bound on the variance of an estimator .3. Sharpenings of the information inequality are available but don't help in general. Then (3. '.27) and integrate both sides with respect to p(x.4.4. Because this is an exponential family. • The multiparameter case We will extend the information lower bound to the case of several parameters.4. Even worse. We find (Problem 3. or by transforming p(x. ~ ~ This T coincides with the MLE () of Example 2.4.B) ~ A "( B). Example 3. (Continued).2.4. but the variance of the best estimate is not equal to the bound [.6. for instance.4. B) example. The variance of B can be computed directly using the moments of the multinomial distribution of (NI • N z . I(B) = Eo aB' It turns out that this identity also holds outside exponential families.26) Proposition 3.. For a sample from a P( B) distribution o EB(::. Extensions to models in which B is multidimensional are considered next. By (3.IOgp(X. . .B)) which equals I(B). Proof We need only check that I a aB' log p(x. although UMVU estimates exist. " 'oi .4.B) is twice differentiable and interchange between integration and differentiation is pennitted.4. we have iJ' aB' logp (X.B)(2n.4. .ed)' In particular. assumptions I and II are satisfied and UMVU estimates of 1/.2.24). e).'" . B). (3.. that I and II fail to hold.2 implies that T = (2N 1 + N z )/2n is UMVU for estimating E(T) ~ (2n)1[2nB' + 2nB(1 .B)] and then using Theorem 1.2 suggests.7) Var(B) ~ B(I .25).
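Before turning to the multiparameter case, the claim in the Hardy-Weinberg example that $T = (2N_1 + N_2)/2n$ is unbiased for $\theta$ with variance $\theta(1-\theta)/2n$ is easy to check by simulation. The following minimal Python sketch is illustrative only; the values of $\theta$, $n$, the seed, and the number of replications are arbitrary assumptions.

```python
import numpy as np

# Monte Carlo check of the Hardy-Weinberg example: T = (2*N1 + N2)/(2n) should be
# unbiased for theta with variance theta*(1 - theta)/(2n), the information bound.
rng = np.random.default_rng(6)
theta, n, reps = 0.4, 50, 300_000          # illustrative choices only

probs = [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]
N = rng.multinomial(n, probs, size=reps)   # rows are (N1, N2, N3)
T = (2 * N[:, 0] + N[:, 1]) / (2 * n)

print("mean of T: %.4f  (theta = %.2f)" % (T.mean(), theta))
print("Var(T):    %.6f" % T.var())
print("theta(1-theta)/(2n): %.6f" % (theta * (1 - theta) / (2 * n)))
```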
Let $p(x,\theta)$ denote the density or frequency function of $X$, where $X \in \mathcal{X} \subset R^q$ and $\theta = (\theta_1,\dots,\theta_d)^T$. We assume that $\Theta$ is an open subset of $R^d$ and that $\{p(x,\theta) : \theta \in \Theta\}$ is a regular parametric model with conditions I and II satisfied when differentiation is with respect to $\theta_j$, $1 \le j \le d$. The (Fisher) information matrix is defined as

$$I(\theta) = \|I_{jk}(\theta)\|_{d\times d} \qquad (3.4.28)$$

where

$$I_{jk}(\theta) = \mathrm{Cov}_\theta\left(\frac{\partial}{\partial\theta_j}\log p(X,\theta),\ \frac{\partial}{\partial\theta_k}\log p(X,\theta)\right), \quad 1 \le j \le d,\ 1 \le k \le d. \qquad (3.4.29)$$

Proposition 3.4.4. Under the conditions in the opening paragraph,

(a) $E_\theta\left(\frac{\partial}{\partial\theta_j}\log p(X,\theta)\right) = 0$ and

$$I_{jk}(\theta) = E_\theta\left(\frac{\partial}{\partial\theta_j}\log p(X,\theta)\ \frac{\partial}{\partial\theta_k}\log p(X,\theta)\right). \qquad (3.4.30)$$

(b) If $X_1,\dots,X_n$ are i.i.d. as $X$, then $\mathbf{X} = (X_1,\dots,X_n)^T$ has information matrix

$$I(\theta) = nI_1(\theta) \qquad (3.4.31)$$

where $I_1$ is the information matrix of $X$.

(c) If, in addition, $p(\cdot,\theta)$ is twice differentiable and double integration and differentiation under the integral sign can be interchanged, then

$$I_{jk}(\theta) = -E_\theta\left[\frac{\partial^2}{\partial\theta_j\partial\theta_k}\log p(X,\theta)\right]. \qquad (3.4.32)$$

The arguments follow the $d = 1$ case and are left to the problems.

Example 3.4.5. Suppose $X \sim N(\mu,\sigma^2)$, $\theta = (\mu,\sigma^2)$. Then

$$\log p(x,\theta) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma^2 - \frac{(x-\mu)^2}{2\sigma^2},$$

so that

$$\frac{\partial}{\partial\mu}\log p(x,\theta) = \frac{x-\mu}{\sigma^2}, \qquad \frac{\partial}{\partial\sigma^2}\log p(x,\theta) = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4}.$$

Using $E(X-\mu) = 0$, $E(X-\mu)^2 = \sigma^2$, and $E(X-\mu)^4 = 3\sigma^4$, together with (3.4.32),

$$I_{11}(\theta) = E\left[-\frac{\partial^2}{\partial\mu^2}\log p(X,\theta)\right] = \frac{1}{\sigma^2}, \qquad I_{12}(\theta) = E\left[-\frac{\partial^2}{\partial\sigma^2\,\partial\mu}\log p(X,\theta)\right] = E\left[\frac{X-\mu}{\sigma^4}\right] = 0,$$

$$I_{22}(\theta) = E\left[-\frac{\partial^2}{(\partial\sigma^2)^2}\log p(X,\theta)\right] = -\frac{1}{2\sigma^4} + \frac{E(X-\mu)^2}{\sigma^6} = \frac{1}{2\sigma^4}.$$
6. Z = V 0 logp(x..!pening paragraph hold and suppose that the matrix [(0) is nonsingular Then/or all 0.A..4. II are easily checked and because VOlogp(x. J)d assumed unknown..6. then [(0) = VarOT(X).' = explLTj(x)9j j=1 . 1/. [(0) = VarOT(X) = (3.(0) exists and (3.O») = VOEO(T(X» and the last equality follows 0 from the argument in (3. (3.4. where I'L(Z) denotes the optimal MSPE linear predictor of Y.4. .A(O)}h(x) (3.. Canonical kParameter Exponential Family. .4/2 .38) where Lzy = EO(TVOlogp(X.4.4. in this case [(0) ! . Assume the conditions of the .4. We claim that each of Tj(X) is a UMVU .4. = .(O).6 hold.( 0) be the d x 1 vector of partial derivatives.30) and Corollary 1. By (3.4.37) = T(X). Let 1/.6. Example 3.13).. Suppose k . We will use the prediction inequality Var(Y) > Var(I'L(Z».(0) = EOT(X) and let ".33) o Example 3.35) o ~ Next suppose 8 1 = T is an estimate of (Jl with 82. Here are some consequences of this result. UMVU Estimates in Canonical Exponential Families..186 Measures of Performance Chapter 3 Thus.1. Then Theorem 3.O) = T(X) . I'L(Z) = I'Y + (Z l'z)TLz~ Lzy· Now set Y (3.3.36) Proof. 0) . (continued).34) () E e open.4. Then VarO(T(X» > Lz~ [1(0) Lzy (3. ! p(x. that is. 0).2 ( 0 0 ) .4. Suppose the conditions of Example 3.4. . A(O).'(0) = V1/.4. The conditions I.
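As a numerical sanity check of the information matrix computed for the $N(\mu,\sigma^2)$ model above, one can average outer products of the score over simulated data, using the identity (3.4.30). The Python sketch below is illustrative only; the parameter values, seed, and Monte Carlo size are arbitrary assumptions.

```python
import numpy as np

# Monte Carlo estimate of the 2x2 information matrix for one N(mu, sigma^2)
# observation with theta = (mu, sigma^2).  The score components are
#   d/d mu      log f = (x - mu) / sigma^2
#   d/d sigma^2 log f = -1/(2 sigma^2) + (x - mu)^2 / (2 sigma^4)
# and I_1(theta) should be diag(1/sigma^2, 1/(2 sigma^4)).
rng = np.random.default_rng(2)
mu, sig2, reps = 1.0, 4.0, 1_000_000       # illustrative choices only

x = rng.normal(mu, np.sqrt(sig2), size=reps)
s_mu = (x - mu) / sig2
s_sig2 = -0.5 / sig2 + (x - mu) ** 2 / (2 * sig2 ** 2)
scores = np.column_stack([s_mu, s_sig2])

I_hat = scores.T @ scores / reps           # average of score outer products
print(np.round(I_hat, 4))
print("theory:", 1 / sig2, "and", 1 / (2 * sig2 ** 2))
```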
(X) = I)IXi = j].A. i =j:. we transformed the multinomial model M(n. a'i X and Aj = P(X = j).(1. j. .Section 3. (3..40) J We claim that in this case .4 8'A rl(O)= ( 8080 t )1 kx k .p(0)11(0) ~ (1.3.A(O) is just VarOT! (X). Example 3.)/1£.39) where. kI A(O) ~ 1£ log Note that 1+ Le j=l Oj 80 A (0) J 8 = 1 "k 11 ne~ I 0 +LJl=le l = 1£>'...... by Theorem 3.. .p(OJrI(O). hence. are known.A(0) j ne ej = (1 + E7 eel _ eOj ) . j=1 X = (XI.4. Thus. This is a different claim than TJ(X) is UMVU for EOTj(X) if Oi.(X)) 82 80.d.O) = exp{TT(x)O . .0..41) because . . ..6.. " Ok_il T.. But because Nj/n is unbiased and has 0 variance >'j(1 . ..Xn)T. Ak) to the canonical form p(x.. . = nE(T.j.4. . .i. .7.6 with Xl. . j = 1.Tk_I(X)). X n i.. AI. n T.. we let j = 1. .>'j)/n.4.A(O)} whereTT(x) = (T!(x).) = Var(Tj(X».4. To see our claim note that in our case (3. We have already computed in Proposition 3. is >. . the lower bound on the variance of an unbiased estimator of 1/Jj(O) = E(n1Tj(X) = >. 0 = (0 1 . Multinomial Trials.4.p(0) is the first row of 1(0) and.0). k.4 Unbiased Estimation and Risk Inequalities 187 estimate of EOTj(X). . . .T(O) = ~:~ I (3. without loss of generality.. But f.4. In the multinomial Example 1.. then Nj/n is UMVU for >'j. (1 + E7 / eel) = n>'j(1.>'.
Here is an important extension of Theorem 3.4.3 whose proof is left to the problems.

Theorem 3.4.4. Suppose that the conditions of Theorem 3.4.3 hold and $\mathbf{T}$ is a $d$-dimensional statistic. Let $\psi(\theta) = E_\theta(\mathbf{T}(X))_{d\times 1}$ and

$$\dot\psi(\theta) = \left(\frac{\partial\psi_i}{\partial\theta_j}(\theta)\right)_{d\times d}.$$

Then

$$\mathrm{Var}_\theta(\mathbf{T}(X)) \ge \dot\psi(\theta)\,I^{-1}(\theta)\,\dot\psi^T(\theta) \qquad (3.4.42)$$

where $A \ge B$ means $a^T(A-B)a \ge 0$ for all $a_{d\times 1}$. Note that both sides of (3.4.42) are $d\times d$ matrices. Also note that

$$\hat\theta\ \text{unbiased} \ \Rightarrow\ \mathrm{Var}_\theta\,\hat\theta \ge I^{-1}(\theta).$$

Example 3.4.8. The Normal Case. If $X_1,\dots,X_n$ are i.i.d. $N(\mu,\sigma^2)$, then $\bar X$ is UMVU for $\mu$ and $n^{-1}\sum X_i^2$ is UMVU for $\mu^2 + \sigma^2$. But it does not follow that $n^{-1}\sum(X_i - \bar X)^2$ is UMVU for $\sigma^2$. These and other examples and the implications of Theorem 3.4.4 are explored in the problems. $\Box$

Summary. We study the important application of the unbiasedness principle in survey sampling. We derive the information inequality in one-parameter models and show how it can be used to establish that in a canonical exponential family, $T(X)$ is the UMVU estimate of its expectation. Using inequalities from prediction theory, we show how the information inequality can be extended to the multiparameter case. In Chapters 5 and 6 we show that in smoothly parametrized models, reasonable estimates are asymptotically unbiased. We establish asymptotic analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal. Asymptotic analogues of these inequalities are sharp and lead to the notion and construction of efficient estimates.

3.5 NONDECISION THEORETIC CRITERIA

In practice, features other than the risk function are also of importance in the selection of a procedure, even if the loss function and model are well specified. The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure, interpretability of the procedure, and robustness to model departures.

3.5.1 Computation

Speed of computation and numerical stability issues have been discussed briefly in Section 2.4. They are dealt with extensively in books on numerical analysis such as Dahlquist,
Björk, and Anderson (1974). We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory.

Closed form versus iteratively computed estimates

At one level closed form is clearly preferable. For instance, consider the Gaussian linear model of Example 2.1.1: least squares estimates are given in closed form by the normal equations of Section 2.2.1. On the other hand, the method of moments estimates of $(\lambda, p)$ for the gamma model of Section 2.1, expressed in terms of the sample mean and the empirical variance $\hat\sigma^2$, are given in closed form and are clearly easier to compute than the MLE. The closed form for least squares is deceptive, however, because inversion of a $d\times d$ matrix takes on the order of $d^3$ operations when done in the usual way and can be numerically unstable. It is in fact faster and better to solve the normal equations by, say, Gaussian elimination for the particular $Z_D^T Y$. Of course, with ever faster computers a difference at this level is irrelevant. But it reappears when the data sets are big and the number of parameters large.

Faster versus slower algorithms

Consider estimation of the MLE $\hat\theta$ in a general canonical exponential family as in Section 2.3. It may be shown that, in the coordinate ascent algorithm we discuss in Section 2.4, if we seek to take enough steps $J$ so that $|\hat\theta^{(J)} - \hat\theta| \le \epsilon$, $\epsilon < 1$, then $J$ is of the order of $\log\frac{1}{\epsilon}$. On the other hand, the Newton-Raphson method, in which the $j$th iterate is

$$\hat\theta^{(j)} = \hat\theta^{(j-1)} + \ddot A^{-1}(\hat\theta^{(j-1)})\left(T(X) - \dot A(\hat\theta^{(j-1)})\right),$$

takes on the order of $\log\log\frac{1}{\epsilon}$ steps, at least if started close enough to $\hat\theta$ (see the problems). The improvement in speed may however be spurious, since $\ddot A^{-1}$ is costly to compute if $d$ is large, though the same trick as in computing least squares estimates can be used.

Estimated variance and computation

As we have seen in special cases earlier in this chapter, estimates of parameters based on samples of size $n$ have standard deviations of order $n^{-1/2}$. It follows that striving for numerical accuracy of order smaller than $n^{-1/2}$ is wasteful. Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the constants involved.

3.5.2 Interpretability

Suppose that in the normal $N(\mu,\sigma^2)$ Example 2.1.5 we are interested in the parameter $\mu/\sigma$. This parameter, the signal-to-noise ratio for this population of measurements, has a clear interpretation. Its maximum likelihood estimate $\bar X/\hat\sigma$ continues to have the same intuitive interpretation as an estimate of $\mu/\sigma$ even if the data are a sample from a distribution with mean $\mu$ and variance $\sigma^2$ other than the normal.
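To make the contrast in iteration counts concrete, the following sketch runs the Newton-Raphson iteration for a canonical one-parameter exponential family, taking the Poisson model in canonical form, where $A(\eta) = n e^{\eta}$ and the likelihood equation $\dot A(\eta) = T$ has the closed-form solution $\hat\eta = \log(T/n)$ against which the iterates can be compared. The value of $T$, the sample size, the starting point, and the tolerance are hypothetical choices for illustration.

```python
import math

# Newton-Raphson for the MLE in a canonical one-parameter exponential family,
# illustrated with the Poisson model in canonical form (eta = log theta):
# A(eta) = n*exp(eta), so the likelihood equation A'(eta) = T has the closed-form
# solution eta_hat = log(T/n).  We count iterations needed to reach tolerance eps.
def newton_mle(T, n, eta0=0.0, eps=1e-10, max_iter=50):
    eta = eta0
    for j in range(1, max_iter + 1):
        A1 = n * math.exp(eta)             # A'(eta)
        A2 = n * math.exp(eta)             # A''(eta)
        step = (T - A1) / A2               # Newton step for solving A'(eta) = T
        eta += step
        if abs(step) < eps:
            return eta, j
    return eta, max_iter

T, n = 57.0, 20                            # hypothetical sufficient statistic and n
eta_hat, iters = newton_mle(T, n)
print("Newton estimate:", eta_hat, "reached in", iters, "iterations")
print("closed form log(T/n):", math.log(T / n))
```

The quadratic convergence of Newton-Raphson typically shows up here as machine-level accuracy within a handful of steps, in line with the $\log\log\frac{1}{\epsilon}$ order mentioned above.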
Alternatively. 9(Pl. To be a bit formal. 1976) and Doksum (1975). We can now use the MLE iF2.1. what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). would be an adequate approximation to the distribution of X* (i. they may be interested in total consumption of a commodity such as coffee. but we do not necessarily observe X*. . However. (c) We have a qualitative idea of what the parameter is. that is. if gross errors occur. For instance. This is an issue easy to point to in practice but remarkably difficult to formalize appropriately. But () is still the target in which we are interested. However. However. but there are a few • • = .)(p/>. We consider three situations (a) The problem dictates the parameter.. the form of this estimate is complex and if the model is incorrect it no longer is an appropriate estimate of E( X) / [Var( X)] 1/2..5. we turn to robustness.4) is for n large a more precise estimate than X/a if this model is correct. and both the mean tt and median v qualify. For instance. We will consider situations (b) and (c). the parameter v that has half of the population prices on either side (fonnally. Then E(X)/lVar(X) ~ (p/>.4. Gross error models Most measurement and recording processes are subject to gross errors. economists often work with median housing prices.e. say () = N p" where N is the population size and tt is the expected consumption of a randomly drawn individuaL (b) We imagine that the random variable X* produced by the random experiment we are interested in has a distribution that follows a ''true'' parametric model with an interpretable parameter B. but there are several parameters that satisfy this qualitative notion. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied.I1'. we could suppose X* rv p. P(X > v) > ~). On the other hand. X n ) where most of the Xi = X:..13.')1/2 = l.\)' distribution. E P*). amoog others..5. 1 X~) could be taken without gross errors then p. The actual observation X is X· contaminated with "gross errOfs"see the following discussion. E p. suppose we initially postulate a model in which the data are a sample from a gamma. which as we shall see later (Section 5. See Problem 3. 1975b. . 3. We return to this in Section 5..3 Robustness Finally. suppose that if n measurements X· (Xi. the HardyWeinberg parameter () has a clear biological interpretation and is the parameter for the experiment described in Example 2. we observe not X· but X = (XI •.190 Measures of Performance Chapter 3 mean Jl and variance 0'2 other than the normal. This idea has been developed by Bickel and Lehmann (l975a. anomalous values that arise because of human error (often in recording) or instrument malfunction. v is any value such that P(X < v) > ~.5. Similarly. we may be interested in the center of a population.
Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of X· . We next define and examine the sensitivity curve in the context of the Gaussian location model. two notions. with common density f(x . and definitions of insensitivity to gross errors.d. = {f (. specification of the gross error mechanism.5 Nondecision Theoretic Criteria ~ 191 wild values.j = PC""f. .i. P. Y. (3. ISi'1 or 1'2 our goal? On the other hand..5.h(x).. Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6. with probability . X n ) will continue to be a good or at least reasonable estimate if its value is not greatly affected by the Xi I Xt. Informally B(X1 .. That is..1').1. iff X" . X is the best estimate in a variety of senses.1 ) where the errors are independent. In our new formulation it is the xt that obey (3. Consider the onesample symmetric location model P defined by ~ ~ i=l""l n . for example.. If the error distribution is normal.5.. .l. We return to these issues in Chapter 6.j) E P. where Y. and then more generally. if we drop the symmetry assumption.l.remains identifiable.i.2) for some h such that h(x) = h(x) for all x}. we do not need the symmetry assumption. . . h = ~(7 <p (:(7) where K ::» 1 or more generally that h is an unknown density symmetric about O. A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the Ci still i. Again informally we shall call such procedures robust. where f satisfies (3. and symmetric about 0 with common density f and d. in situation (c).1') : f satisfies (3. The advantage of this formulation is that jJ.. Without h symmetric the quantity jJ.Xn are i. Further assumptions that are commonly made are that h has a particular fonn.2) Here h is the density of the gross errors and . the sensitivity curve and the breakdown point. F. Xi Xt with probability 1 . but with common distribution function F and density f of the form f(x) ~ (1 . •. the assumption that h is itself symmetric about 0 seems patently untenable for gross errors. Example 1. X n ) knowing that B(Xi.Section 3.\ is the probability of making a gross error. we encounter one of the basic difficulties in formulating robustness in situation (b). has density h(y .is not a parameter.)~ 'P C) + . make sense for fixed n.l.1') and (Xt.d. so it is unclear what we are estimating. (3.2. Unfortunately. The breakdown point will be discussed in Volume II. . Then the gross error model issemiparametric. X~) is a good estimate. However. This corresponds to.2). That is.) for 1'1 of 1'2 (Problem 3.A Y.f. it is the center of symmetry of p(Po. Now suppose we want to estimate B(P*) and use B(X 1 . However. Formal definitions require model specification. identically distributed.5.1).5.) are ij. p(". . . . it is possible to have PC""f.d.5.18). the gross errors.5.J) for all such P.
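The effect contemplated in the gross error model (3.5.2) is easy to visualize by simulation. In the sketch below, the contaminating density $h$ is taken to be $N(0, K^2)$ with $K = 10$, purely as an illustrative assumption, and we compare the mean and the median as estimates of the center $\mu = 0$.

```python
import numpy as np

# Sketch of the gross error model (3.5.2): with probability lam an observation is
# drawn from a heavy-tailed contaminating density (here N(0, K^2), an assumption
# for illustration), otherwise from N(0, 1).  Compare mean and median as estimates
# of the center mu = 0.
rng = np.random.default_rng(5)
lam, K, n, reps = 0.1, 10.0, 40, 20_000    # illustrative choices only

is_gross = rng.random((reps, n)) < lam
scale = np.where(is_gross, K, 1.0)
X = rng.normal(0.0, 1.0, size=(reps, n)) * scale

print("MSE of mean:   %.4f" % np.mean(X.mean(axis=1) ** 2))
print("MSE of median: %.4f" % np.mean(np.median(X, axis=1) ** 2))
```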
" . (}(PU. This is equivalent to shifting the Be vertically to make its value at x = 0 equal to zero. Often this is done by fixing Xl. x) ..X.192 The sensitivity curve Measures of Performance Chapter 3 . SC(x. I . not its location... . . • .··.Xn ? An interesting way of studying this due to Tukey (1972) and Hampel (1974) is the sensitivity curve defined as follows for plugin estimates (which are well defined for all sample sizes n). where Xl.l.. X(n) are the order statistics.Xnl so that their mean has the ideal value zero. shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas... .. The sensitivity curoe of () is defined as ~ ~ ~ ~ ~ ~ ~ ~ ~ . ~ ) SC( X. (}(X 1 .. (i) It is the empirical plugin estimate of the population median v (Problem 3.I. . Thus.. 1 Xnl represents an observed sample of size nl from P and X represents an observation that (potentially) comes from a distribution different from P. +X"_I+X) n = x. The sample median can be motivated as an estimate of location on various grounds. is appropriate for the symmetric location model. in particular. We are interested in the shape of the sensitivity curve. we take I' ~ 0 without loss of generality.1. The empirical plugin estimate of 0 is 0 = O(P) where P is the empirical probability distribution.1 for all p(/l.od because E(X. X =n (Xl+ .L = E(X). At this point we ask: Suppose that an estimate T(X 1 ..16). 1') = 0. . How sensitive is it to the presence of gross errors among XI. P. .5.L..1. . X n ) = B(F). therefore. See Problem 3. . X n ordered from smallest to largest. In our examples we shall.14... .1 for which the estimator () gives us the right value of the parameter and then we see what the introduction of a potentially deviant nth observation X does to the value of ~ We return to the location problem with e equal to the mean J.j)) = J. See Section 2.L = (}(X l 1'.2. that is. and it splits the sample into two equal halves.. Now fix Xl!'" .f) E P. See (2. X n ) .2. . ..4).32. . .1. . .J. X" 1'). that is. 0) = n[O(xl. . X"_l )J. (2.. the sample mean is arbitrarily sensitive to gross errOfa large gross error can throw the mean off entirely.od Problem 2. Because the estimators we consider are location invariant.5... where F is the empirical d. .. Then ~ ~ O. Xnl as an "ideal" sample of size n .O(Xl' . We start by defining the sensitivity curve for general plugin estimates. Suppose that X ~ P and that 0 ~ O(P) is a parameter. .17). . . Are there estimates that are less sensitive? A classical estimate of location based on the order statistics is the sample median X defined by ~ ~ X X(k+l) !(X(k) ifn~2k+1 + X(k+l)) ifn = 2k where X (I)' . . has the plugin property. Xl. .
(ii) In the symmetric location model (3.5.1), $\nu$ coincides with $\mu$, and the sample median $\widetilde X$ is an empirical plug-in estimate of $\mu$.

(iii) The sample median is the MLE when we assume the common density $f(x)$ of the errors $\{\epsilon_i\}$ in (3.5.1) is the Laplace (double exponential) density

$$f(x) = \frac{1}{2\tau}\exp\{-|x|/\tau\},$$

a density having substantially heavier tails than the normal (see the problems of Sections 2.2 and 3.5).

The sensitivity curve of the median is as follows. If, say, $n = 2k+1$ is odd and the median of $x_1,\dots,x_{n-1}$ is $(x_{(k)} + x_{(k+1)})/2 = 0$, we obtain

$$SC(x;\widetilde x) = \begin{cases} nx_{(k)} & \text{for } x < x_{(k)} \\ nx & \text{for } x_{(k)} \le x \le x_{(k+1)} \\ nx_{(k+1)} & \text{for } x > x_{(k+1)} \end{cases}$$

where $x_{(1)} < \dots < x_{(n-1)}$ are the ordered $x_1,\dots,x_{n-1}$.

[Figure 3.5.1. The sensitivity curves of the mean and median.]

Although the median behaves well when gross errors are expected, its performance at the normal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of $\bar X$. The sensitivity curve in Figure 3.5.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when $x$ is near $\mu$. A class of estimates providing such intermediate behavior and including both the mean and the median has been known since the eighteenth century.
Figure 3. whereas as Q i ~. i I . the Laplace density. that is.1. See Andrews.1 again.2[nnl where [na] is the largest integer < nO' and X(1) < . .4. by <Q < 4. If [na] = [(n ... . infinitely better in the case of the Cauchysee Problem 5. the mean is better than any trimmed mean with a > 0 including the median. and Hogg (1974).2[nn]/n)I.5. Intuitively we expect that if there are no gross errors. The range 0.5). 4e'x'. Rogers. the sensitivity Jo ~ CUIve of an Q trimmed mean is sketched in Figure 3..20 seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the nonnal distribution. f(x) = 1/11"(1 + x 2 ). Which a should we choose in the trimmed mean? There seems to be no simple answer. Haber. (The middle portion is the line y = x(1 . Xa  X.5. I I ! .8) than the Gaussian density. which corresponds approximately to a = This can be verified in tenns of asymptotic variances (MSEs}see Problem 5.) SC(x) X x(n[naJ) . f(x) = or even more strikingly the Cauchy.2." see Jaeckel (1971).194 Measures of Performance Chapter 3 the median has ~en known since the eighteenth century. f = <p. If f is symmetric about 0 but has "heavier tails" (see Problem 3.l)Q] and the trimmed mean of Xl. < X (n) are the ordered observations. suppose we take as our data the differences in Table 3. That is. For more sophisticated arguments see Huber (1981). Xa. For instance.4.. . However.5. Xu = X. for example. 4.5. Let 0 trimmed mean. . We define the a (3.1.. The sensitivity curve of the trimmed mean. For a discussion of these and other forms of "adaptation. There has also been some research into procedures for which a is chosen using the observations. and Tukey (1972). Huber (1972). Bickel. then the trimmed means for a > 0 and even the median can be much better than the mean. + X(n[nuj) n . l Xnl is zero. .. The estimates can be justified on plugin grounds (see Problem 3.. Note that if Q = 0. . we throw out the "outer" [na] observations on either side and take the average of the rest. Hampel.3) Xu = X([no]+l) + .5.2. the sensitivity curve calculation points to an equally intuitive conclusion.10 < a < 0.5.
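The sensitivity curves discussed above can also be computed directly from their definition $SC(x) = n[\hat\theta(x_1,\dots,x_{n-1},x) - \hat\theta(x_1,\dots,x_{n-1})]$. The following sketch does so for the mean, the median, and a 10% trimmed mean; the "ideal" sample of size $n-1$ is simulated standard normal data, centered to have mean zero, an assumption made only for illustration.

```python
import numpy as np

# Empirical sensitivity curves SC(x) = n*[theta_hat(x_1,...,x_{n-1}, x)
#                                        - theta_hat(x_1,...,x_{n-1})]
# for the mean, the median, and the alpha-trimmed mean.
def trimmed_mean(y, alpha):
    y = np.sort(y)
    k = int(len(y) * alpha)                # number trimmed on each side, [n*alpha]
    return y[k:len(y) - k].mean()

rng = np.random.default_rng(4)
ideal = np.sort(rng.normal(size=19))       # "ideal" sample of size n - 1 = 19
ideal -= ideal.mean()                      # center so its mean is exactly 0
n = len(ideal) + 1

for x in np.linspace(-6, 6, 13):
    full = np.append(ideal, x)
    sc_mean = n * (full.mean() - ideal.mean())
    sc_med = n * (np.median(full) - np.median(ideal))
    sc_trim = n * (trimmed_mean(full, 0.1) - trimmed_mean(ideal, 0.1))
    print(f"x={x:6.2f}  SC(mean)={sc_mean:7.2f}  "
          f"SC(median)={sc_med:6.2f}  SC(trimmed)={sc_trim:6.2f}")
```

The output reproduces the qualitative shapes of Figures 3.5.1 and 3.5.2: the mean's curve is unbounded, the median's is bounded and flat outside the central order statistics, and the trimmed mean's is linear near zero and bounded in the tails.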
75 and X. If we are interested in the spread of the values in a population.2.is typically used.X)2 is the empirical plugin estimate of 0.4) where the approximation is valid for X fixed. Spread. Write xn = 11. Let B(P) = X o deaote a ath quantile of the distribution of X. .5. where Xu has 100Ct percent of the values in the population on its left (fonnally.10). Example 3.X. < x(nI) are the ordered Xl. Then a~ = 11.Section 3. say k.16).X.I L:~ I (Xi .5. We next consider two estimates of the spread in the population as well as estimates of quantiles. Xu is called a Ctth quantile and X. then SC(X.. P( X > xO') > 1 .1x.. &) (3. .25).Ct). and let Xo. Let B(P) = Var(X) = 0"2 denote the variance in a population and let XI.5 Nondecision Theoretic Criteria 195 Gross errors or outlying data points affect estimates in a variety of situations. If no: is an integer.674)0". A fairly common quick and simple alternative is the IQR (interquartile range) defined as T = X.742(x.5.25. SC(X.1. &2) It is clear that a:~ is very sensitive to large outlying Ixi values.1.. where x(t) < . 0"2) model. . . Xo: = x(k). X n is any value such that P( X < x n ) > Ct. . . 0 Example 3. 0 < 0: < 1. the scale measure used is 0. the nth sample quantile is Xo.75 . n + 00 (Problem 3. Similarly... then the variance 0"2 or standard deviation 0.75 . The IQR is often calibrated so that it equals 0" in the N(/l.. denote the o:th sample quantile (see 2. To simplify our expression we shift the horizontal axis so that L~/ Xi = O. X n denote a sample from that population. other examples will be given in the problems. = ~ [X(k) + X(k+l)]' and at sample size n  1.2 .5.XnI.25 are called the upper and lower quartiles. Quantiles and the lQR. Because T = 2 x (.1 L~ 1 Xi = 11.
For $2 \le k \le n-2$ one finds

$$SC(x;\hat x_\alpha) = \begin{cases} \dfrac{n}{2}\left[x^{(k-1)} - x^{(k)}\right] & x < x^{(k-1)} \\[4pt] \dfrac{n}{2}\left[x - x^{(k)}\right] & x^{(k-1)} \le x \le x^{(k+1)} \\[4pt] \dfrac{n}{2}\left[x^{(k+1)} - x^{(k)}\right] & x > x^{(k+1)} \end{cases} \qquad (3.5.5)$$

where $x^{(1)} < \dots < x^{(n-1)}$ are the ordered $x_1,\dots,x_{n-1}$. Clearly, $\hat x_\alpha$ is not sensitive to outlying $x$'s.

Next consider the sample IQR $\hat\tau = \hat x_{.75} - \hat x_{.25}$. Then we can write

$$SC(x;\hat\tau) = SC(x;\hat x_{.75}) - SC(x;\hat x_{.25}),$$

and the sample IQR is robust with respect to outlying gross errors $x$. $\Box$

Remark 3.5.1. The sensitivity of the parameter $\theta(F)$ to $x$ can be measured by the influence function, which is defined by

$$IF(x;\theta,F) = \lim_{\epsilon\downarrow 0}\ \epsilon^{-1}\left[\theta\big((1-\epsilon)F + \epsilon\Delta_x\big) - \theta(F)\right],$$

where $\Delta_x$ is the distribution function of point mass at $x$, $\Delta_x(t) = 1\{t \ge x\}$. It is easy to see (Problem 3.5.15) that $SC(x;\hat\theta)$ is precisely this difference quotient with $F$ replaced by the empirical distribution $\widehat F_{n-1}$ based on $x_1,\dots,x_{n-1}$ and $\epsilon = 1/n$, so that $SC(x;\hat\theta) \approx IF(x;\theta,\widehat F_{n-1})$. We will return to the influence function in Volume II. It plays an important role in functional expansions of estimates.

Discussion. Other aspects of robustness, in particular the breakdown point, have been studied extensively, and a number of procedures have been proposed and implemented. An exposition of this point of view and some of the earlier procedures proposed is in Hampel, Ronchetti, Rousseuw, and Stahel (1986). Unfortunately these procedures tend to be extremely demanding computationally, although this difficulty appears to be being overcome lately. The breakdown point will be discussed in Volume II.

Summary. We discuss briefly nondecision theoretic considerations for selecting procedures, including computability and interpretability. Most of the section focuses on robustness, discussing the difficult issues of identifiability that arise once gross errors are admitted. The rest of our very limited treatment focuses on the sensitivity curve as illustrated by the mean, the median, the trimmed mean, and other procedures.
is preferred to 0. = (0 o)'lw(O) for some weight function w(O) > 0. (72) sample and 71" is the improper prior 71"(0) = 1. a(O) > 0. x) where c(x) =. Show that if Xl.O) = p(x I 0)71'(0) = c(x)7f(0 .0 E e. 2. which is called the odds ratio (for success). 71') as .2. Suppose IJ ~ 71'(0). lf we put a (improper) uniform prior on A. Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin.O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite. (a) Show that the joint density of X and 0 is f(x. ~ ~ 4. 3.0). In some studies (see Section 6..3. E = R.0) and give the Bayes estimate of q(O). Check whether q(OB) = E(q(IJ) I x). Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior.a)' jBQ(1.6 PROBLEMS AND COMPLEMENTS Problems for Section 3. give the MLE of the Bernoulli variance q(0) = 0(1 . s). J). In the Bernoulli Problem 3.2.. J) ofJ(x) = X in Example 3.2 preceeding. then the improper Bayes rule for squared error loss is 6"* (x) = X. . . we found that (S + 1)/(n + 2) is the Bayes rule.0)' and that the prinf7r(O) is the beta. 0) is the quadratic loss (0 . change the loss function to 1(0. if 71' and 1 are changed to 0(0)71'(0) and 1(0.2. the parameter A = 01(1 .2 with unifonn prior on the probabilility of success O. 0) = (0 . where OB is the Bayes estimate of O.6 Problems and Complements 197 3. In Problem 3.Xn be the indicators of n Bernoulli trials with success probability O.. . Suppose 1(0. (c) In Example 3.Section 3.24. Let X I. where R( 71') is the Bayes risk. (3(r.3).O)P. Hint: See Problem 1. 1 X n is a N(B.1.. Compute the limit of e( J. Show that (h) Let 1(0. 6. density.r 7f(0)p(x I O)dO. 71') = R(7f) /r( 71'. That is. Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule. under what condition on S does the Bayes rule exist and what is the Bayes rule? 5.2 o e 1.4. Find the Bayes risk r(7f.4. 0) the Bayes rule is where fo(x. Consider the relative risk e( J. (X I 0 = 0) ~ p(x I 0). the Bayes rule does not change.2. respectively.0)/0(0).
9j ) = aiO:'j/a5(aa + 1).) with loss function I( B. E).. .aj)/a5(ao + 1).• N. i: agency specifies a number f > a such that if f) E (E. Suppose tbat N lo . where E(X) = O.O)  I(B. (Use these results. find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule. . (. I ~' .) (b) When the loss function is I(B. do not derive them.\(B) = I(B.19(c).. c2 > 0 I. where ao = L:J=l Qj.3. . given 0 = B are multinomial M(n. 7.. Set a{::} Bioequivalent 1 {::} Not Bioequivalent .  I . A regulatory " ! I i· . (Bj aj)2. a r )T. B). Measures of Performance Chapter 3 (b) n .. and Cov{8j . ..O'5).)T..3. then E(O)) = aj/ao.. I) = difference in loss of acceptance and rejection of bioequivalence. B. . (e) a 2 + 00.2(d)(i) and (ii). (c) We want to estimate the vector (B" . Bioequivalence trials are used to test whether a generic drug is. . On the basis of X = (X I. • 9.E). a) = (q(B) . Cr are given constants.d.I(d). Let 0 = ftc ... a = (a" .ft B be the difference in mean effect of the generic and namebrand drugs. where is known. I j l .15. There are two possible actions: 0'5 a a with losses I( B. .. Assume that given f). Find the Bayes decision rule. and that 9 is random with aN(rlO.Xn ) we want to decide whether or not 0 E (e. . N{O. Hint: If 0 ~ D(a). defined in Problem 1.\(B) = r . Var(Oj) = aj(ao .. bioequivalent. a) = [q(B)a]'. " X n of differences in the effect of generic and namebrand effects fora certain drug. 0) and I( B. . 8. One such function (Lindley.exp {  2~' B' } . compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O. (E. E). 1).=l CjUj . . 76) distribution.198 (a) T + 00.B. then the generic and brandname drugs are.Xu are Li. B = (B" . equivalent to a namebrand drug. E) and positive when f) 1998) is 1.. (c) Problem 1. For the following problems. 00.(0) should be negative when 0 E (E.2. (a) Problem 1.3. . to a close approximation. a) = Lj~. (b) Problem 1.a)' / nj~. . Xl. Suppose we have a sample Xl. and that 0 has the Dirichlet distribution D( a). where CI. Note that ). Let q( 0) = L:. by definition.. find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x).. B . (a) If I(B.
" (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b).Section 3. 0) and l((). Any two functions with difference "\(8) are possible loss functions at a = 0 and I. v) > ?r/(l ?r) is equivalent to T > t.. . Suppose 9 .6 Problems and Complements 199 where 0 < r < 1. . Yo) is in the interior of S x T.6. and 9 is twice differentiable. In Example 3. Discuss the preceding decision rule for this "prior. S T Suppose S and T are subsets of Rm. Yo) = inf g(xo.) + ~}" .1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3.y = (YI.2.6.1) < (T6(n) + c'){log(rg(~)+. representing x = (Xl. Hint: See Example 3. 0. yo) is a saddle point of 9 if g(xo.2. .) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l((). Yo) to be a saddle point is that.. . (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00"). y). Yo) = sup g(x.. A point (xo.Yo) = {} (Xo.3 1. find (a) the linear Bayes estimate of ~l. + = 0 and it 00").. . RP.\(±€..1. S x T ~ R.. Note that .Xm). {} (Xo. (Xv.Yp)..16) and (3.3.17).2 show that L(x. For the model defined by (3. (a) Show that a necessary condition for (xo. (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3. respectively. (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3. 1) are not constant. Con 10. (b) the linear Bayes estimate of fl.2.. 2.Yo) Xi {}g {}g Yj = 0.
f.l.o•• ) =R(B1 . o')IR(O. fJ( . Bd.i) =0.j)=Wij>O. Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g. and that the mndel is regnlar. I • > 1 for 0 i ~. N(I'...1(0.d< p. B1 ) Ip(X. Let Lx (Bo. Yc Yd foraHI < i.a)'.a. the test rule 1511" given by o. and 1 .0•• ). ." 1 Xi ~ l}. 4. .=1 CijXiYj with XE 8 m . X n be i.n12). Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo. Let S ~ 8(n. 2 J'(X1 .thesimplex.0). B1 ) >  (l.Xn ) . f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta. y ESp. &* is minimax. BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1..n).2..y) ~ L~l 12.PIB = Bo].andg(x. 1 <j. (b) Suppose Sm = (x: Xi > 0. B ) and suppose that Lx (B o. . Suppose e = {Bo. B ) ~ p(X. I j 3.Yo) > 0 8 8 8 X a 8 Xb 0.j=O..<7') and 1(<7'..200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo. • . . Suppose i I(Bi.. n: I>l is minimax. (b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is._ I)'.. d) = (a) Show that if I' is known to be 0 (!. and show that this limit ( .n12.c. I}. Thus. . Let X" ..= 1 . iij. a) ~ (0 ..1 < i < m. 0») equals 1 when B = ~. L:. Hint: See Problem 3.:. and o'(S) = (S + 1 2 . A = {O.2. I(Bi. 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = . Yo . prior.d.b < m.n)/(n + . 5.)w".i. (b) Show that limn_ooIR(O. i.(X) = 1 if Lx (B o.. • " = '!. o(S) = X = Sin. the conclusion of von Neumann's theorem holds.
Permutations of I. hence. " a. . 0 1. See Problem 1. N k ) has a multinomial..\ . .1.» = Show that the minimax rule is to take L l....j. . X k be independent with means f. . < im.k. M (n. . prior. . . Show that if (N).3. Remark: Stein (1956) has shown that if k > 3. . 0') = (1.X)' is best among all rules of the form oc(X) = c L(Xi .. 1 = t j=l (di .. Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l .tn I(i. 1 < j < k.I). 00. Jlk.. .. Show that X has a Poisson (. . 8...k} I«i). . Write X  • "i)' 1(1'... . Let k .. then is minimax for the loss function r: l(p. . Xk)T.1)1 L(Xi . 1 <i< k.2. j 7. (jl.) .tj. jl~ < . For instance.X)' and.. that is.. .. ik) is smaller than that of (i'l"'" i~). / .).P.a? 1. 6... . 9.a < b" = 'lb..X)' are inadmissible. Show that if = (I'I''''.\. . See Volume II..~~n X < Pi < < R(I'.. Hint: Consider the gamma. X k ) = (R l . ... . dk)T.. respectively. a) = (. Rk) where Rj is the rank of Xi. .I'kf. h were t·.. Hint: (a) Consider a gamma prior on 0 = 1/<7'.. o(X) ~ n~l L(Xi . . . ... J. b = ttl. ik is an arbitrary unknown permutation of 1. o(X 1. X is no longer unique minimax.Pi.12.. .29). i=l then o(X) = X is minimax. < /12 is a known set of values. . .Section 3_6 Problems and Complements 201 (b) If I' ~ 0. distribution. PI.j. and iI.d) where qj ~ 1 . . show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible. Let Xi beindependentN(I'i. that both the MLE and the estimate 8' = (n .?.o) for alII'. . d = (d). . .. o'(X) is also minimax and R(I'. .). ik). an d R a < R b. 1)..j. LetA = {(iI. Then X is nurnmax.l . d) = L(d.\) distribution and 1(. Let Xl.. ftk) = (Il?l'"'' J1. (c) Use (B. f(k. b.. = tj. . .\. ttl . . . where (Ill. Xl· (e) Show that if I' is unknown. > jm). . I' (X"".Pi)' Pjqj <j < k.
Let X" . ~ I I .. X n be independent N(I'. of Xi < X • 1 + 1 o. v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) ... q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q. For a given x we want to estimate the proportion F(x) of the population to the left of x.B). 1). See Problem 3. B(n. ! p(x) = J po(x)1r(B)dB. d X+ifX 11. .i f X> . J K(po.. . 13... 14.d with density defined in Problem 1.. BO l) ~ (k ... I: I ir'z . .i..1r) = Show that the marginal density of X.. LB= j 01 1.... .1'»2 of these estimates is bounded for all nand 1'.d. a) is the posterior mean E(6 I X).. B).15.I)!._. M(n.8..4.. 1 . Define OiflXI v'n < d < v'n d v'n d d X . B j > 0. Hint: Consider the risk function of o~ No....... .. Let K(po. Let Xi(i = 1. . + l)j(n + k). I: • .202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI. distribution.Xo)T has a multinomial. Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B. .2. Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss.2. X has a binomial.=1 Show that the Bayes estimate is (Xi . distribution.... (b) How does the risk of these estimates compare to that of X? 12. . with unknown distribution F....a) and let the prior be the uniform prior 1r(B" . Pk.q)1r(B)dB. Suppose that given 6 = B = (B ...3. n) be i. . Suppose that given 6 ~ B. Let the loss " function be the KullbackLeibler divergence lp(B. . 10.BO)T. X = (X"". See also Problem 3. .
1 .f [E. It is often improper. 0.1"0 known.k(p.4.'? /1 as in (3. . 7f) and that the minimum is I(J. . Show that if 1(0.X is called the mutual information between 0 and X. Hint: k(q. (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations.alai) < al(O.a )I( 0. ry) = p(x. and O~ (1 . p( x.d. K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality.O)!. if I(O. Show that (a) (b) cr5 = n. then That is.6 Problems and Complements 203 minimizes k( q. Show that B q(ry) = Bp(h. J') < R( (J.. .12) for the two parametrizations p(x. Let A = 3.. Let X . . We shall say a loss function is convex. a.x =. then R(O. {log PB(X)}] K(O)d(J. for any ao. 0) and B(n.2) with I' .aao + (1 . Fisher infonnation is not equivariant under increasing transformations of the parameter.Section 3.. X n are Ll.4. Equivariance. Jeffrey's priors are proportional to 1..4 1. 4. ." A density proportional to VJp(O) is called Jeffrey's prior. . 0) with 0 E e c R. Suppose X I. . N(I"o. Show that X is an UMVU estimate of O. Prove Proposition 3. p(X) IO. J).. a < a < 1. suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8). 2.. (ry»). Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable. N(I". the Fisher information lower bound is equivariant. Let X I. a) is convex and J'(X) = E( J(X) I I(X». that is. Jeffrey's "Prior. 0) and q(x. ry). 15. (b) Equivariance of the Fisher Information Bound. R. Show that in theN(O. respectively. . K) . then E(g(X) > g(E(X».~). S. a" (J. X n be the indicators of n Bernoulli trials with success probability B. ao) + (1 . .4. Problems for Section 3.). 0) cases. Reparametrize the model by setting ry = h(O) and let q(x. Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient.tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible. . h1(ry)) denote the model in the new parametrization. Give the Bayes rules for squared error in these three cases. Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij. .1 E: 1 (Xi  J..
5(b).3. .. I I . Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji. Hint: Consider T(X) = X in Theorem 3. 2 ~ ~ 10. .p) is an unbiased estimate of (72 in the linear regression model of Section 2. . Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1). find the bias o f ao' 6.. Let a and b be constants.4. (a) Find the MLE of 1/0.. ~ ~ ~ (a) Write the model for Yl give the sufficient statistic..11(9) as n ~ give the limit of n times the lower bound on the variances of and (J.4. compute Var(O) using each ofthe three methods indicated. Find lim n.' . n. For instance..XN }. Show that if (Xl. 9. .2 (0 > 0) that satisfy the conditions of the information inequality. 0) for any set E. and .1. 8. 7. X n be a sample from the beta. .4. X n ) is a sample drawn without replacement from an unknown finite population {Xl.13 E R. . 00.. .. ( 2).ZDf3)T(y . a ~ (c) Suppose that Zi = log[i/(n + 1)1. Show that . P. Y n in twoparameter canonical exponential form and (b) Let 0 = (a. Suppose (J is UMVU for estimating fJ. . 0) = Oc"I(x > 0). distribution. Suppose Yl . .{x : p(x. . 1).. Hint: Use the integral approximation to sums. Show that 8 = (Y . then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· . a ~ i 12.2.ZDf3)/(n . Let F denote the class of densities with mean 0. .1 and variance 0. . Does X achieve the infonnation . 204 Hint: See Problem 3.P.Ii . B(O. Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0. Let X" . Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered.\ = a + bB is UMVU for estimating .. i = 1.8.o. . Show that a density that minimizes the Fisher infonnation over F is f(x. 14.Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate.. Ct. • 13. then = 1 forallO. In Example 3. Show that assumption I implies that if A . . {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J. =f.. Establish the claims of Example 3. .\ = a + bOo 11.4.
3. B(n. . even for sampling without replacement in each stratum. . : 0 E for (J such that E((J2) < 00. 17.. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population.6 Problems and Complements 205 (b) The variance of X is given by (3.".1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o. . k = 1. 'L~ 1 h = N. P[o(X) = 91 = 1. that ~ is a Bayes estimate for O. 18. 20.X)/Var(U). (b) Deduce that if p.} and X =K  1". :/i. Suppose UI. (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl. 1 < i < h. (c) Explain how it is possible if Po is binomial.. Show that if M is the expected sample size. = 1.6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U.~).1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n. . (See also Problem 1. Let X have a binomial.. UN are as in Example 3. Show that is not unbiasedly estimable.. Stratified Sampling.• . Define ~ {Xkl..Xkl.. G 19. k=l K ~1rkXk. then E( M) ~ L 1r j=1 N J ~ n.511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ ..4. X is not II Bayes estimate for any prior 1r. . XK.l. 16.4. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~.~ for all k.. . ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss.) Suppose the Uj can be relabeled into strata {xkd. 15. K..4). distribution. B(n. 7. More generally only polynomials of degree n in p are unbiasedly estimable. 8). .4. = N(O.4. _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk. . Show that X k given by (3.Section 3. Let 7fk = ~ and suppose 7fk = 1 < k < K. Suppose X is distributed accordihg to {p. if and only if.p).
. . J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6.. the upper quartile X.4. is an empirical plugin estimate of ~ Here case.4. (iii) 2X is unbiased for ."" F (ij) Vat (:0 logp(X.25 and (n . If a = 0. Yet show eand has finite variance. Note that logp(x. B) be the uniform distribution on (0. however. B). E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) . Show that the sample median X is an empirical plugin estimate of the population median v. Prove Theorem 3. Hint: It is equivalent to show that. and we can thus define moments of 8/8B log p(x. that is. Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3. B). 1 X n1 C.25. 1 2.l)a is an integer.25 and net =k 3.5. 4. If a = 0.75' and the IQR.9)' 21.. . with probability I for each B. Show that. for all adx 1. use (3. Var(a T 6) > aT (¢(9)II(9)./(9)a [¢ T (9)a]T II (9)[¢ T (9)a]. T .206 Hint: Given E(ii(X) 19) ~ 9. B) is differentiable fat anB > x. Show that the a trimmed mean XCII. B») ~ °and the information bound is infinite. Regularity Conditions are Needed for the Information Inequality. give and plot the sensitivity curves of the lower quartile X.4. If n IQR. for all X!. 22. give and plot the sensitivity curve of the median.5) to plot the sensitivity curve of the 1. 5. • Problems for Section 35 I is an integer. = 2k is even.3. Let X ~ U(O. An estimate J(X) is said to be shift or translation equivariant if.
5. plot the sensitivity curves of the mean. is symmetrically distributed about O.67. The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj). (7= moo 1 IX. k t 0 to the median. trimmed mean with a = 1/4.6.3. i < j. One reasonable choice for k is k = 1. The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00.. then (i.) ~ 8. F(x .1. . and xiflxl < k kifx > k kifx<k. I)order statistics). For x > . median. Deduce that X. .... X a arc translation equivariant and antisymmetric. (See Problem 3. It has the advantage that there is no trimming proportion Q that needs to be subjectively specified.e.. (b) Suppose Xl. ..1.30.).03 (these are expected values of four N(O. '& is an estimate of scale.30. . xH L is translation equivariant and antisymmetric. . .5 and for (j is..03. X n is a sample from a population with dJ. <i< n Show that (a) k = 00 corresponds to X. 7. Its properties are similar to those of the trimmed mean.6 Problems and Complements 207 It is antisymmetric if for all Xl. Xu.JL) where JL is unknown and Xi .(a) Show that X. and the HodgesLehmann estimate. X. Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite. J is an unbiased estimate of 11. X are unbiased estimates of the center of symmetry of a symmetric distribution. (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1.  XI/0. (b) Show that   . In .Section 3. .. .
Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00. For the ideal sample of Problem 3. thus.d.5. the standard deviation does not exist. iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00. . 00. ii. 00.11.O)/ao) where fo(x) for Ixl for Ixl <k > k. (b) Find thetml probabilities P(IXI > 2). The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \. with density fo((. In the case of the Cauchy density.5.. ... then limlxj>00 SC(x.i. we say that g(. If f(·) and g(. Let JJo be a hypothesized mean for a certain population. . (d) Xk is translation equivariant and antisymmetric (see Problem 3.) has heavier tails than f() if g(x) is above f(x) for Ixllarge. 9." .208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0. Laplace. and Cauchy distributions. 11. with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O.Xn arei.25.) are two densities with medians v zero and identical scale parameters 7. Suppose L:~i Xi ~ . Show that SC(x. . plot the sensitivity curve of (a) iTn . . [. we will use the IQR scale parameter T = X. . (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10. •• .. Let X be a random variable with continuous distribution function F. and is fixed. The (student) tratio is defined as 1= v'n(x I'o)/s. Use a fixed known 0"0 in place of Ci. and "" "'" (b) the Iratio of Problem 3. where S2 = (n .5.X.7(a). 1 Xnl to have sample mean zero. P(IXI > 3) and P(IXI > 4) for the nonnal.. n is fixed. Xk) is a finite constant.6). = O.• 13. X 12. This problem may be done on the computer.x}2. (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr.1)1 L:~ 1(Xi . Let 1'0 = 0 and choose the ideal sample Xl. In what follows adjust f and 9 to have v = 0 and 'T = 1. (e) If k < 00.75 .~ !~ t . thenXk is the MLEofB when Xl. Location Parameters.
(F): 0 <" < 1/2}. Let Y denote a random variable with continuous distribution function G.. . b > 0.5.t )) = 0. sample median.8. 0 < " < 1. let H(x) be the distribution function whose inverse is H'(a) = ![xax. X is said to be stochastically smaller than Y if = G(t) for all t E R. the ~ = bdSC(x. Also note that H(x) is symmetric about zero. ~ (b) Write the Be as SC(x. F(t) ~ P(X < t) > P(Y < t) antisymmctric. c + dO.. and sample trimmed mean are shift and scale equivariant. .O.. let v" = v._. In this case we write X " Y. (a) Show that if F is symmetric about c and () is a location parameter. SC(a + bx. (b) Show that the mean Ji. . (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :. 8. . then for a E R.67 and tPk is defined in Problem 3. ([v(F). v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v. (jx. and trimmed population mean JiOl (see Problem 3. if F is also strictly increasing.. i/(F)] is the value of some location parameter.. i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F). In Remark 3. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y.). a. then v(F) and i/( F) are location parameters. . c E R. Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F).F). b > 0. 8) = IF~ (x. 8. .(k) is a (d) For 0 < " < 1.. (e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval. median v. An estimate ()n is said to be shift and scale equivariant if for all xl. F) lim n _ oo SC(x.5) are location parameters. then ()(F) = c. and note thatH(xi/(F» < F(x) < H(xv(F». ~ ~ 8n (a +bxI. 8) and ~ IF(x._.xn ). .6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :. ~ ~ 15.t..:.1: (a) Show that SC(x. Fn _ I ).) That is. ~ se is shift invariant and scale equivariant. Hint: For the second part. · .]..  vl/O. n (b) In the following cases.. any point in [v(F). Show that if () is shift and location equivariant. ~ (a) Show that the sample mean..5. x n _.) 14.. 8..Section 3. Show that J1.. i/(F)] and. compare SC(x. ~ = d> 0.5. If 0 is scale and shift equivariant. xnd.8. and ordcr preserving.xn .(F) = !(xa + XI").a +bxn ) ~ a + b8n (XI.Xn_l) to show its dependence on Xnl (Xl. . a + bx. it is called a location parameter. where T is the median of the distributioo of IX location parameter.
show that (a) If h is a density that is symmetric about zero.OJ then l(I(i) . and we seek the unique solution (j of 'IjJ(8) = O.8) . in order to be certain that the Jth iterate (j J is within e of the desired () such that 'IjJ( fJ) = 0. {BU)} do not converge. In the gross error model (3. we in general must take on the order of log ~ steps.field generated by SA. I : . Show that in the bisection method. • • . also true of the method of coordinate ascent. B E B. C < 00. I I Notes for Section 3. is not identifiable. (b) Show that there exists. ~ I(x  = X n . . (1) The result of Theorem 3.2). Because priority of discovery is now given to the French mathematician M. IlFfdF(x). 3. (ii) O(F) = (iii) e(F) "7. 18. 0. . where B is the class of Borel sets.5. The NewtonRaphson method in this case is .. · .0 > 0 (depending on t/J) such that if lOCO) . We define S as the .B ~ {F E :F : Pp(A) E B}. (ii). ) (a) Show by example that for suitable t/J and 1(1(0) .1 is commonly known as the Cramer~Rao inequality. then fJ.7 NOTES Note for Section 3.rdF(l:). 81" < 0.I F(x. Let d = 1 and suppose that 'I/J is twice continuously differentiable. we shall • I' . (b) If no assumptions are made about h.4.Bilarge enough.210 (i) Measures of Performance Chapter 3 O(F) ~ 1'1' ~ J . • .1/. . (b) ~ ~ . A.t is identifiable. . This is. F)! ~ 0 in the cases (i).4 I . (e) Does n j [SC(x. then J. ~ ~ 17..1) Hint: (a) Try t/J(x) = Alogx with A> 1.' > 0. and (iii) preceding? ~ 16. Frechet. consequently.3 (1) A technical problem is to give the class S of subsets of:F for which we can assign probability (the measurable sets). . Assume that F is strictly increasing.BI < C1(1(i.
follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement.

(2) Note that this inequality is true but uninteresting if I(theta) = infinity (and psi'(theta) is finite) or if Var_theta(T(X)) = infinity.

(3) The continuity of the first integral ensures that
\[
\frac{\partial}{\partial\theta}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} T(x)\,p(x,\theta)\,dx
=\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} T(x)\,\frac{\partial}{\partial\theta}p(x,\theta)\,dx
\]
for all theta, whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in
\[
\int_{\theta_0}^{\theta}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} T(x)\,\frac{\partial}{\partial\lambda}p(x,\lambda)\,dx\,d\lambda .
\]

(4) The finiteness of Var_theta(T(X)) and I(theta) imply that psi'(theta) is finite by the covariance interpretation given in (3.4.8).

3.8 REFERENCES

ANDREWS, D. F., P. J. BICKEL, F. R. HAMPEL, P. J. HUBER, W. H. ROGERS, AND J. W. TUKEY, Robust Estimates of Location: Survey and Advances Princeton, NJ: Princeton University Press, 1972.
APOSTOL, T. M., Mathematical Analysis, 2nd ed. Reading, MA: Addison-Wesley, 1974.
BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.
BERNARDO, J. M., AND A. F. M. SMITH, Bayesian Theory New York: Wiley, 1994.
BICKEL, P. J., AND E. L. LEHMANN, "Unbiased Estimation in Convex Families," Ann. Math. Statist., 40, 1523-1535 (1969).
BICKEL, P. J., AND E. L. LEHMANN, "Descriptive Statistics for Nonparametric Models. I. Introduction," Ann. Statist., 3, 1038-1044 (1975a).
BICKEL, P. J., AND E. L. LEHMANN, "Descriptive Statistics for Nonparametric Models. II. Location," Ann. Statist., 3, 1045-1069 (1975b).
BICKEL, P. J., AND E. L. LEHMANN, "Descriptive Statistics for Nonparametric Models. III. Dispersion," Ann. Statist., 4, 1139-1158 (1976).
BÜHLMANN, H., Mathematical Methods in Risk Theory Heidelberg: Springer-Verlag, 1970.
DAHLQUIST, G., A. BJÖRK, AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.
DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1970.
DOKSUM, K. A., "Measures of Location and Asymmetry," Scand. J. Statist., 2, 11-22 (1975).
DOWE, D. L., R. A. BAXTER, J. J. OLIVER, AND C. S. WALLACE, "Point Estimation Using the Kullback-Leibler Loss Function and MML," in Proceedings of the Second Pacific Asian Conference on Knowledge Discovery and Data Mining Melbourne: Springer-Verlag, 1998.
HAMPEL, F. R., "The Influence Curve and Its Role in Robust Estimation," J. Amer. Statist. Assoc., 69, 383-393 (1974).
HAMPEL, F. R., E. M. RONCHETTI, P. J. ROUSSEEUW, AND W. A. STAHEL, Robust Statistics: The Approach Based on Influence Functions New York: J. Wiley & Sons, 1986.
HANSEN, M. H., AND B. YU, "Model Selection and the Principle of Minimum Description Length," J. Amer. Statist. Assoc. (2000).
HOGG, R. V., "Adaptive Robust Procedures," J. Amer. Statist. Assoc., 69, 909-927 (1974).
HUBER, P. J., "Robust Statistics: A Review," Ann. Math. Statist., 43, 1041-1067 (1972).
HUBER, P. J., Robust Statistics New York: Wiley, 1981.
JAECKEL, L. A., "Robust Estimates of Location," Ann. Math. Statist., 42, 1020-1034 (1971).
JEFFREYS, H., Theory of Probability London: Oxford University Press, 1948.
KARLIN, S., Mathematical Methods and Theory in Games, Programming, and Economics Reading, MA: Addison-Wesley, 1959.
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
LEHMANN, E. L., AND G. CASELLA, Theory of Point Estimation, 2nd ed. New York: Springer, 1998.
LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference Cambridge: Cambridge University Press, 1965.
LINDLEY, D. V., "Decision Analysis and Bioequivalence Trials," Statistical Science, 13, 136-141 (1998).
NORBERG, R., "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification," Scand. Actuarial J., 204-222 (1986).
RISSANEN, J., "Stochastic Complexity (With Discussions)," J. Royal Statist. Soc. B, 49, 223-239 (1987).
SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.
SHIBATA, R., "Bootstrap Estimate of Kullback-Leibler Information for Model Selection," Statistica Sinica, 7, 375-394 (1997).
STEIN, C., "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution," Proc. Third Berkeley Symposium on Math. Statist. and Probability, 1, 197-206, Berkeley: University of California Press, 1956.
TUKEY, J. W., Exploratory Data Analysis Reading, MA: Addison-Wesley, 1972.
WALLACE, C. S., AND P. R. FREEMAN, "Estimation and Inference by Compact Coding (With Discussions)," J. Royal Statist. Soc. B, 49, 240-251 (1987).
WIJSMAN, R. A., "On the Attainment of the Cramer-Rao Lower Bound," Ann. Statist., 1, 538-542 (1973).
Usually.1. peIfonn a survey. Nfo of denied applicants. The design of the experiment may not be under our control. Nfo) by a multinomial. parametrically. where is partitioned into {80 . treating it as a decision theory problem in which we are to decide whether P E Po or P l or.Chapter 4 TESTING AND CONFIDENCE REGIONS: BASIC THEORY 4. conesponding. respectively. Nmo. 8d with 8 0 and 8.Pfl. not a sample. distribution. and 3. what does the 213 .3 we defined the testing problem abstractly. If n is the total number of applicants. the situation is less simple. Accepting this model provisionally. modeled by us as having distribution P(j. Sex Bias in Graduate Admissions at Berkeley.2.3 the questions are sometimes simple and the type of data to be gathered under our control. it might be tempting to model (Nm1 . 3. M(n. and indeed most human activities. and what 8 0 and 8 1 correspond to in tenns of the stochastic model may be unclear. to answering "no" or "yes" to the preceding questions. PI or 8 0 .Nf1 .Nfb the numbers of admitted male and female applicants. e Example 4. This framework is natural if. and the corresponding numbers N mo .PmllPmO. As we have seen. where Po.1. Here are two examples that illustrate these issues. as is often the case. o E 8.PfO). in examples such as 1. the parameter space e. we are trying to get a yes or no answer to important questions in science. But this model is suspect because in fact we are looking at the population of all applicants here. and we have data providing some evidence one way or the other. 8 1 are a partition of the model P or.1 INTRODUCTION In Sections 1. Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial. respectively.1. what is an appropriate stochastic model for the data may be questionable. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. or more generally construct an experiment that yields data X in X C Rq. They initially tabulated Nm1. pUblic policy. whether (j E 8 0 or 8 1 jf P j = {Pe : () E 8 j }. medicine.3.
~).1. NjOd. D). . I . OUf multinomial assumption now becomes N ""' M(pmld' PmOd.n 3 t I .." then the data are naturally decomposed into N = (Nm1d. has a binomial (n. the natural model is to assume. In these tenns the hypothesis of "no bias" can now be translated into: H: Pml Pmld PmOd Pfld Pild + + PfOd . . • In fact. . B (n. Mendel crossed peas heterozygous for a trait with two alleles. . that N AA.=7xlO 5 ' n 3 . In a modem formulation.. If departments "use different coins. .: Example 4.214 Testing and Confidence Regions Chapter 4 hypothesis of no sex bias correspond to? Again it is natural to translate this into P[Admit I Male] = Pm! Pml +PmO = P[Admit I Female] = P fI Pil +PjO But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability Pml Pi! Pml + PmO PIl +PiO [n fact. where N m1d is the number of male admits to department d. It was noted by Fisher corresponds to H : p = ~ with the alternative K : p as reported in Jeffreys (1961) that in this experiment the observed fraction ':: was much closer to 3.! . distribution. and so on. ! . one of which was dominant. Hammel. . . The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis. . The hypothesis of dominant inheritance ~. as is discussed in a paper by Bickel. Mendel's Peas.D.. . . m P [ NAA . admissions are petfonned at the departmental level and rates of admission differ significantly from department to department. either N AA cannot really be thought of as stochastic or any stochastic I . • Pml + P/I + PmO + Pio • I . the same data can lead to opposite conclusions regarding these hypothesesa phenomenon called Simpson's paradox. d = 1. the number of homozygous dominants. I .. This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate... . • . NIld. ~. In one of his famous experiments laying the foundation of the quantitative theory of genetics. and O'Connell (1975)... for d = 1. That is. PfOd. pml +P/I !. • I I . The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive).. D).< .p) distribution. N mOd .than might be expected under the hypothesis that N AA has a binomial. 0 I • . i:. Pfld. if there were n dominant offspring (seeds). d = 1. if the inheritance ratio can be arbitrary. I I 1] Fisher conjectured that rather than believing that such a very extraordinary event occurred it is more likely that the numbers were made to "agree with theory" by an overzealous assistant.2. .
(}o] and eo is composite. then S has a B(n.€)51 + cB( nlP). If we suppose the new drug is at least as effective as the old. say. P = Po.Xn ). and accept H otherwise. suppose we observe S = EXi . It will turn out that in most cases the solution to testing problem~ with 80 simple also solves the composite 8 0 problem. it's not clear what P should be as in the preceding Mendel example. . That is. Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug. we reject H if S exceeds or equals some integer. say 8 0 .3 recover from the disease with the old drug. If the theory is false. I} or critical region C {x: Ii(x) = I}. see. the set of points for which we reject. is better defined than the alternative answer 8 1 . We illustrate these ideas in the following example. where 1 ~ E is the probability that the assistant fudged the data and 6!. = Example 4.11. That a treatment has no effect is easier to specify than what its effect is. and we are then led to the natural 0 ..!. the number of recoveries among the n randomly selected patients who have been administered the new drug. p). These considerations lead to the asymmetric formulation that saying P E Po (e E 8 0 ) corresponds to acceptance of the hypothesis H : P E Po and P E PI corresponds to rejection sometimes written as K : P E PJ . (2) If we let () be the probability that a patient to whom the new drug is administered recovers and the population of (present and future) patients is thought of as infinite. for instance. In this example with 80 = {()o} it is reasonable to reject IJ if S is "much" larger than what would be expected by chance if H is true and the value of B is eo.1. where Xi is 1 if the ith patient recovers and 0 otherwise. and then base our decision on the observed sample X = (X J. B) distribution. 8 0 and H are called compOSite. a) = 0 if BE 8 a and 1 otherwise. acceptance and rejection can be thought of as actions a = 0 or 1. administer the new drug. Thus. It is convenient to distinguish between two structural possibilities for 8 0 and 8 1 : If 8 0 consists of only one point. Now 8 0 = {Oo} and H is simple. In situations such as this one . we call 8 0 and H simple. then = [00 .3. Moreover.1. Suppose that we know from past experience that a fixed proportion Bo = 0. When 8 0 contains more than one point. in the e . Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. for instance. where 00 is the probability of recovery usiog the old drug. (1 .. Thus.Section 4.1 loss l(B. then 8 0 = [0 . n 3' 0 What the second of these examples suggests is often the case. In science generally a theory typically closely specifies the type of distribution P of the data X as. Most simply we would sample n patients.1. . 8 1 is the interval ((}o.1 Introduction 215 model needs to pennit distributions other than B( n. our discussion of constant treatment effect in Example 1. say k.(1) As we have stated earlier. 0 E 8 0 . See Remark 4. What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug. we shall simplify notation and write H : () = eo. The set of distributions corresponding to one answer. The same conventions apply to 8] and K. K : () > Bo Ifwe allow for the possibility that the new drug is less effective than the old. is point mass at . 11 and K is composite. 
To investigate this question we would have to perform a random experiment. Recall that a decision procedure in the case of a test is described by a test function delta : X -> {0, 1} or, equivalently, by its critical region C = {x : delta(x) = 1}, the set of points for which we reject H.
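To make the notion of a test function and its critical region concrete, here is a minimal computational sketch (ours, not the text's) of the test delta_k of Example 4.1.3; the function name and the illustrative outcome vector are our choices for illustration only.

```python
# Sketch of the test delta_k(x) = 1{S >= k} of Example 4.1.3, where S is the sum of the
# 0-1 recovery indicators X_1, ..., X_n.  The critical region is {x : delta_k(x) = 1}.

def delta_k(x, k):
    """Return 1 (reject H) iff S = sum(x) >= k, and 0 (accept H) otherwise."""
    return 1 if sum(x) >= k else 0

x = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]   # hypothetical outcomes for n = 10 patients (S = 7)
print(delta_k(x, k=6))               # prints 1: this x lies in the critical region
```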
By convention this is chosen to be the type I error and that in tum detennines what we call H and what we call K. Given this position. suggesting what test statistics it is best to use. .e.3.2 and 4. For instance. : I I j ! 1 I . As we noted in Examples 4. asymmetry is often also imposed because one of eo.. one can be thought of as more important. We will discuss the fundamental issue of how to choose T in Sections 4. and large.216 Testing and Confidence Regions Chapter 4 tenninology of Section 1.that has in fact occurred.3.2.T would then be a test statistic in our sense. The value c that completes our specification is referred to as the critical value of the test. our critical region C is {X : S rule is Ok(X) = I{S > k} with > k} and the test function or PI ~ probability of type I error = Pe. 4. No one really believes that H is true and possible types of alternatives are vaguely known at best.3. The Neyman Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then. but computation under H is easy. . The Neyman Pearson Framework The Neyman Pearson approach rests On the idea that. 1 i . It has also been argued that.. Thresholds (critical values) are set so that if the matches occur at random (i. We do not find this persuasive. of the two errors. testing techniques are used in searching for regions of the genome that resemble other regions that are known to have significant biological activity. In most problems it turns out that the tests that arise naturally have the kind of structure we have just described.1 and4.. as we shall see in Sections 4. 0 > 00 . . but if this view is accepted. 0 PH = probability of type II error ~ P.2. (Other authors consider test statistics T that tend to be small. (5 > k) < k). One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. then the probability of exceeding the threshold (type I) error is smaller than Q". if H is false.1. ~ : I ! j .1.. (5 The constant k that determines the critical region is called the critical value. . it again reason~bly leads to a Neyman Pearson fonnulation. e 1 . is much better defined than its complement and/or the distribution of statistics T under eo is easy to compute. when H is false.1 . if H is true.) We select a number c and our test is to calculate T(x) and then reject H ifT(x) > c and accept H otherwise. We now tum to the prevalent point of view on how to choose c.. ! : . We call T a test statistic. announcing that a new phenomenon has been observed when in fact nothing has happened (the socalled null hypothesis) is more serious than missing something new. matches at one position are independent of matches at other positions) and the probability of a match is ~.3 this asymmetry appears reasonable. To detennine significant values of these statistics a (more complicated) version of the following is done. In that case rejecting the hypothesis at level a is interpreted as a measure of the weight of evidence we attach to the falsity of H. . • .. how reasonable is this point of view? In the medical setting of Example 4.1. generally in science. Note that a test statistic generates a family of possible tests as c varies. and later chapters. There is a statistic T that "tends" to be small. 1.
05 critical value 6 and the test has size . which is defined/or all 8 E 8 by e {3(8) = {3(8. in the Bayesian framework with a prior distribution on the parameter. 80 = 0.Ok) = P(S > k) A plot of this function for n ~ t( j=k n ) 8i (1 . In that case. Indeed. even though there are just two actions. e t . By convention 1 .0) = Pe[Rejection] = Pe[o(X) = 1] = Pe[T(X) If 8 E 8 0 • {3(8.1. it is too limited in any situation in which.3 (continued).Section 4. Once the level or critical value is fixed. if we have a test statistic T and use critical value c. 8 0 = 0.8)n. Thus. Here > cJ. Definition 4. Specifically.0473. It can be thought of as the probability that the test will "detect" that the alternative 8 holds. If 8 0 is composite as well.P [type 11 error] is usually considered.j J = 10.. This quantity is called the size of the test and is the maximum probability of type I error.1. k = 6 is given in Figure 4. numbers to the two losses that are not equal and/or depend on 0. whereas if 8 E power against (). then the probability of type I error is also a function of 8. Both the power and the probability of type I error are contained in the power function.3. Finally. The power is a function of 8 on I. 0) is the {3(8. Here are the elements of the Neyman Pearson story.2(b) is the one to take in all cases with 8 0 and 8 1 simple.1.01 and 0. and we speak of rejecting H at level a. The values a = 0. {3(8. even nominally. Example 4. there exists a unique smallest c for which a(c) < a. Because a test of level a is also of level a' > a. we can attach. The power of a test against the alternative 0 is the probability of rejecting H when () is true.2. In Example 4. See Problem 3.(S > 6) = 0. our test has size a(c) given by a(e) ~ sup{Pe[T(X) > cJ: 8 E eo}· (4. the power is 1 minus the probability of type II error.1.(6) = Pe. if 0 < a < 1.1. such as the quality control Example 1. 0) is just the probability of type! errot. Begin by specifying a small number a > 0 such that probabilities of type I error greater than a arc undesirable. the approach of Example 3. the probabilities of type II error as 8 ranges over 8 1 are determined. we find from binomial tables the level 0. It is referred to as the level a critical value. if our test statistic is T and we want level a. Then restrict attention to tests that in fact have the probability of rejection less than or equal to a for all () E 8 0 . .1.1 Introduction 217 There is an important class of situations in which the Neyman Pearson framework is inappropriate. Ll) Nowa(e) is nonincreasing in c and typically a(c) r 1 as c 1 00 and a(e) 1 0 as c r 00.1.05 are commonly used in practice. it is convenient to give a name to the smallest level of significance of a test.9.3 with O(X) = I{S > k}.3 and n = 10. This is the critical value we shall use. That is. Such tests are said to have level (o/significance) a.2.1.
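The size and power function of this binomial test are easy to evaluate numerically; the following sketch (ours, using scipy as an implementation choice) reproduces the size 0.0473 of the critical value k = 6 when n = 10 and theta_0 = 0.3, and evaluates the power at a few alternatives.

```python
# Size and power of delta_k(X) = 1{S >= k}, S ~ B(n, theta), for n = 10, theta_0 = 0.3, k = 6.
from scipy.stats import binom

n, theta0, k = 10, 0.3, 6

def beta(theta):
    """Power function beta(theta) = P_theta(S >= k)."""
    return binom.sf(k - 1, n, theta)

print(round(beta(theta0), 4))             # size alpha(6) = 0.0473
for theta in (0.4, 0.5, 0.6):
    print(theta, round(beta(theta), 4))   # power; about 0.377 at theta = 0.5
```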
.1 it appears that the power function is increasing (a proof will be given in Section 4.2) popnlation with . and H is the hypothesis that the drug has no effect or is detrimental. We return to this question in Section 4..1.1 0. 0) family of distributions.5..t > O.7 0. . Xn ) is a sample from N (/'. j 1 . record sleeping time without the drug (or after the administration of a placebo) and then after some time administer the drug and record sleeping time again. ! • i.5.) We want to test H : J.:.3 for the B(lO.9 1.[T(X) > k].3 to (it > 0.:"::':::'. When (}1 is 0. Let Xi be the difference between the time slept after administration of the drug and time slept without administration of the drug by the ith patient.6 0. I I I I I 1 . One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. (The (T2 unknown case is treated in Section 4.1. I i j .8 0..0473.3 is the probability that the level 0. Suppose that X = (X I.3)..05 test will detect an improvement of the recovery fate from 0.218 Testing and Confidence Regions Chapter 4 1.1. .. _l X n are nonnally distributed with mean p.0 0. a 67% improvement. this probability is only . For instance. It follows that the level and size of the test are unchanged if instead of 80 = {Oo} we used eo = [0._~:.2 is known. We might.~==/++_++1'+ 0 o 0.3.1.3 versus K : B> 0. 0 Remark 4. :1 • • Note that in this example the power at () = B1 > 0. From Figure 4.3 04 05 0. for each of a group of n randomly selected patients.3. . Example 4. and variance (T2.05 onesided test c5k of H : () = 0...1. The power is plotted as a function of 0.t < 0 versus K : J. That is. k ~ 6 and the size is 0.8 06 04 0.0 0 ].4. whereas K is the alternative that it has some positive effect.3770. suppose we want to see if a drug induces sleep.2 o~31:. OneSided Tests for the Mean ofa Normal Distribution with Known Variance. then the drug effect is measured by p. I j a(k) = sup{Pe[T(X) > k]: 0 E eo} = Pe.2 0.0 Figure 4. . Power function of the level 0. This problem arises when we want to compare two treatments or a treatment and control (nothing) and both treatments are administered to the same subject. If we assume XI. What is needed to improve on this situation is a larger sample size n.
~(z).. = P.( c) = C\' or e = z(a) where z(a) = z(1 . eo). The smallest c for whieh ~(c) < C\' is obtained by setting q:. S) inf{d(x. 1) distribution. p ~ P[AA]. which generates the same family of critical regions.o versus p.o if we have H(p. In all of these cases.d. T(X(B)) from £0. That is.1 and Example 4.(T(X)) doesn't depend on 0 for 0 E 8 0 .i. T(X(l)).p) because ~(z) (4. and we have available a good estimate (j of (j. it is natural to reject H for large values of X.3.. then the test that rejects iff T(X) > T«B+l)(la)). where T(l) < .1. However. .. sup{(J(p) : p < OJ ~ (J(O) = ~(c). This occurs if 8 0 is simple as in Example 4.1. In Example 4..1. .5. The power function of the test with critical value c is p " [Vii (X !") > e _ Vii!"] (J (J 1.   = . .3..co .2) = 1. This minimum distance principle is essentially what underlies Examples 4. if we generate i.• T(X(B)).a) is the (1. ~ estimates(jandd(~. has a closed form and is tabled.1 Introduction 219 Because X tends to be larger under K than under H. £0..9). But it occurs also in more interesting situations such as testing p.1. < T(B+l) are the ordered T(X). It is convenient to replace X by the test statistic T(X) = . critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods. N~A is the MLE of p and d(N~A. has level a if £0 is continuous and (B + 1)(1 . In Example 4.. where d is the Euclidean (or some equivalent) distance and d(x. Because (J(jL) a(e) ~ is increasing.5). #.1. (12) observations with both parameters unknown (the t tests of Example 4.co =. Rejecting for large values of this statistic is equivalent to rejecting for large values of X.3. Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward.Section 4.V. the common distribution of T(X) under (j E 8 0 .~ (c .a) quantile oftheN(O. . T(X(lI).(jo) = (~ (jo)+ where y+ = Y l(y > 0).a) is an integer (Problem 4.1.~I. The task of finding a critical value is greatly simplified if £.2 and 4.1. o The Heuristics of Test Construction When hypotheses are expressed in terms of an estimable parameter H : (j E eo c RP. in any case. (Tn) for (j E 8 0 is usually invariance The key feature of situations in which under the action of a group of transformations_ See Lehmann (1997) and Volume II for discussions of this property.y) : YES}. it is clear that a reasonable test statistic is d((j.eo) = IN~A . Given a test statistic T(X) we need to determine critical values and eventually the power of the resulting tests.P.nX/ (J.2.I') ~ ~ (c + V.1.
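The Monte Carlo recipe for critical values described above is easy to code; the sketch below is our illustration, with a placeholder statistic T, the sample size, B and alpha all chosen by us, and it uses the order statistics of simulated values of T under H.

```python
# Monte Carlo critical value: simulate T(X^(1)), ..., T(X^(B)) under H and take the
# ((B+1)(1-alpha))-th order statistic, as described in the text.
import numpy as np

rng = np.random.default_rng(0)

def T(x):                                # placeholder statistic; any T(X) can be plugged in
    return np.sqrt(len(x)) * np.mean(x)

B, alpha, n = 999, 0.05, 20              # (B + 1)(1 - alpha) = 950 is an integer, as required
sims = np.sort([T(rng.standard_normal(n)) for _ in range(B)])
m = round((B + 1) * (1 - alpha))
critical_value = sims[m - 1]             # m-th smallest simulated value (a close variant of
                                         # ordering the B simulated values with the observed T(X))
print(critical_value)                    # for this T it is close to z(0.95) = 1.645
```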
that is. F o (Xi)..4. Un' where U denotes the empirical distribution function of Ul u = Fo(x) ranges over (0.U. Goodness of Fit to the Gaussian Family. (T2 = VarF(Xtl. n.n {~ n Fo(x(i))' FO(X(i)) _ . which is evidently composite. In particular. . where F is continuous. This statistic has the following distributionjree property: 1 Proposition 4. the order statistics..1 El{Fo(Xi ) < Fo(x)} n. which is called the Kolmogorov statistic.5. 1). F and the hypothesis is H : F = ~ (' (7'/1:) for some M. as a tcst statistic . as X ..1 J) where x(l) < . . . 1).d.01. .i.. h n (L358).L5 rewriting H : F(!' + (Tx) = <p(x) for all x where !' = EF(X .... The natural estimate of the parameter F(!.u[ O<u<l . .Fo(x)[. and h n (L224) for" = .1 El{U. 1997). Goodness of Fit Tests. thus.1. Let F denote the empirical distribution and consider the sup distance between the hypothesis Fo and the plugin estimate of F. Consider the problem of testing H : F = Fo versus K : F i.(i_l..05.7) that Do. . ... Let X I. What is remarkable is that it is independ~nt of which F o we consider. In particular. Set Ui ~ j ..1.3.Fa. x It can be shown (Problem 4.. can be wriHen as Dn =~ax max tI. This is again a consequence of invariance properties (Lehmann. then by Problem B.10 respectively. P Fo (D n < d) = Pu (D n < d).1. . The distribution of D n under H is the same for all continuous Fo.220 Testing and Confidence Regions Chapter 4 1 . The distribution of D n has been thoroughly smdied for finite and large n. :i' D n = sup [U(u) . the empirical distribution function F. + (Tx) is . • . Proof.   Dn = sup IF(x) . • Example 4. for n > 80. and hn(t) = t/( Vri + 0. 1) distribution. Also F(x) < x} = n. X n are ij. As x ranges over R.6.. ). .1. < x(n) is the ordered observed sample. 0 Example 4. F.)} n (4..12 + OJI/Vri) close approximations to the size" critical values ka are h n (L628). U.Xn be i.1. < Fo(x)} = U(Fo(x)) . o Note that the hypothesis here is simple so that for anyone of these hypotheses F = F o• the distribution can be simulated (or exhibited in closed fonn)... .. and . . Suppose Xl.1 El{Xi ~ U(O.d. We can proceed as in Example 4. and the result follows. where U denotes the U(O.
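As a computational aside (ours), the Kolmogorov statistic and the approximate critical values h_n(t) quoted above are straightforward to evaluate; the simulated data, the sample size and the use of scipy below are illustrative choices, not part of the text.

```python
# The Kolmogorov statistic D_n of Example 4.1.5 and the approximate size-.05 critical
# value h_n(1.358) = 1.358 / (sqrt(n) + 0.12 + 0.11/sqrt(n)) quoted in the text.
import numpy as np
from scipy.stats import norm

def kolmogorov_D(x, F0):
    """D_n = max_i max{ i/n - F0(x_(i)),  F0(x_(i)) - (i-1)/n }."""
    n = len(x)
    u = np.sort(F0(np.asarray(x)))       # F0 evaluated at the order statistics
    i = np.arange(1, n + 1)
    return np.max(np.maximum(i / n - u, u - (i - 1) / n))

def h_n(t, n):
    return t / (np.sqrt(n) + 0.12 + 0.11 / np.sqrt(n))

rng = np.random.default_rng(1)
x = rng.standard_normal(100)             # data; hypothesis H: F = F0 = N(0, 1)
D = kolmogorov_D(x, norm.cdf)
print(D, h_n(1.358, 100), D > h_n(1.358, 100))   # reject H at level .05 iff the last entry is True
```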
1 < i < n. . Therefore.X) . where X and 0'2 are the MLEs of J1 and we obtain the statistic F(X + ax) Applying the sup distance again. .. (4. whereas experimenter II insists on using 0' = 0. ~n) doesn't depend on fl. .1 Introduction 221 (12. the pvalue is <l'( T(x)) = <l' (. Tn has the same distribution £0 under H.1. and only if. T nB . 1).. H is not rejected. we would reject H if. If we observe X = x = (Xl. Tn sup IF'(X x x + iix) . a satisfies T(x) = . I < i < n. H is rejected.' . if the experimenter's critical value corresponds to a test of size less than the p~value.01. · · .(3) 0 The pValue: The Test Statistic as Evidence o Different individuals faced with the same testing problem may have different criteria of size. T nB . If the two experimenters can agree on a common test statistic T..3. .. under H. thereby obtaining Tn1 . Zn arc i. for instance. This quantity is a statistic that is defined as the smallest level of significance 0' at which an experimenter using T would reject on the basis ofthe observed outcome x. the joint distribution of (Dq . whatever be 11 and (12.X)/ii. this difficulty may be overcome by reporting the outcome of the experiment in tenus of the observed size or pvalue or significance probability of the test. Consider.i. observations Zi. That is. N(O. otherwise. 1..a) + IJth order statistic among Tn.<l'(x)1 sup IG(x) ..d. then computing the Tn corresponding to those Zi.Section 4. and the critical value may be obtained by simulating ij.) Thus. It is then possible that experimenter I rejects the hypothesis H.4) Considered as a statistic the pvalue is <l'( y"nX /u).. In). . {12 and is that of ~  (Zi . Now the Monte Carlo critical value is the I(B + 1)(1 . We do this B times independently. and only if.<l'(x)1 where G is the empirical distribution of (L~l"'" Lln ) with Lli (Xi . if X = x. .4. . .(iix u > z(a) or upon applying <l' to both sides if. . Experimenter] may be satisfied to reject the hypothesis H using a test with size a = 0. from N(O.. Tn!. whereas experimenter II accepts H on the basis of the same outcome x of an experiment.d. But. a > <l'( T(x)).. I).2.Z) / (~ 2::7 ! (Zi  if) • . Example 4.05. where Z I. (Sec Section 8..
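Here is a sketch (our code; the sample size, B and seed are arbitrary choices) of the simulation just described: because the null distribution of T_n does not depend on mu or sigma^2, its critical value can be generated from standard normal samples.

```python
# Monte Carlo null distribution of T_n = sup_x |F_hat(X_bar + sigma_hat x) - Phi(x)|
# for the Gaussian goodness-of-fit problem of Example 4.1.6.
import numpy as np
from scipy.stats import norm

def T_n(x):
    x = np.asarray(x)
    n = len(x)
    delta = np.sort((x - x.mean()) / x.std())     # Delta_i = (X_i - X_bar)/sigma_hat, sigma_hat the MLE
    u = norm.cdf(delta)
    i = np.arange(1, n + 1)
    return np.max(np.maximum(i / n - u, u - (i - 1) / n))

rng = np.random.default_rng(2)
n, B, alpha = 50, 999, 0.05
null_sims = np.sort([T_n(rng.standard_normal(n)) for _ in range(B)])
critical_value = null_sims[round((B + 1) * (1 - alpha)) - 1]
print(critical_value)                             # Monte Carlo level-.05 critical value for T_n
```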
these kinds of issues are currently being discussed under the rubric of datafusion and metaanalysis (e. ''The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p. . !: The pvalue is used extensively in situations of the type we described earlier. H is rejected if T > Xla where Xl_a is the 1 . •• 1. the largest critical value c for which we would reject is c = T(x). let X be a q dimensional random vector. the smallest a for which we would reject corresponds to the largest c for which we would reject and is just a(T(x)). ..• see f [' Hedges and Olkin.1. • T = ~2 I: loga(Tj) j=l ~ ~ r (4.Oo)} > 5.. . The pvalue can be thought of as a standardized version of our original statistic.. distribution (Problem 4. then if H is simple Fisher (1958) proposed using l j :i. and the pvalue is a( s) where s is the observed value of X. a(T) is on the unit interval and when H is simple and T has a continuous distribution.4).222 Testing and Confidence Regions Cnapter 4 In general. Various melhods of combining the data from different experiments in this way are discussed by van Zwet and Osterhoff (1967). The statistic T has a chisquare distribution with 2n degrees of freedom (Problem 4.1. but K is not.. T(x) > c.. I. if r experimenters use continuous test statistics T 1 . a(Tr ).1. n(1 .5).1.1.1. The normal approximation is used for the pvalue also.6). But the size of a test with critical value c is just a(c) and a(c) is decreasing in c. Suppose that we observe X = x.g. This is in agreement with (4. 80). and only if. we would reject H if. .5) .6) I • to test H. that is. ~ H fJ. a(8) = p. to quote Fisher (1958). More generally. Proposition 4.(S > s) '" 1.2. For example. aCT) has a uniform. • i~! <:1 im _ . Thus. U(O. Similarly in Example 4. ' I values a(T. Thus.1.3. 1). so that type II error considerations are unclear.).ath quantile of the X~n distribution. In this context.. The pvalue is a(T(X)).<I> ( [nOo(1 ~ 0 )1' 2 s~l~nOo) 0 . (4...1. I T r to produce p . We have proved the following. 1985). Then if we use critical value c.1). for miu{ nOo. It is possible to use pvalues to combine the evidence relating to a given hypothesis H provided by several different independent experiments producing different kinds of data. when H is well defined. Thus. We will show that we can express the pvalue simply in tenns of the function a(·) defined in (4. Thus.
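For concreteness, here is a small sketch (ours) of the p-value combination statistic described above; the p-values are made-up numbers and the use of scipy's chi-square distribution is our implementation choice.

```python
# Fisher's method: T = -2 * sum_j log alpha(T_j) has a chi-square distribution with
# 2r degrees of freedom under H when the r experiments are independent.
import numpy as np
from scipy.stats import chi2

p_values = [0.07, 0.11, 0.04, 0.20]        # hypothetical p-values from r = 4 independent experiments
T = -2.0 * np.sum(np.log(p_values))
combined_p = chi2.sf(T, df=2 * len(p_values))
print(T, combined_p)                       # combined statistic and overall p-value for H
```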
)1 5 [(1. This is an instance of testing goodness offit.3).2 and 3. that is. (null) hypothesis H and alternative (hypothesis) K. test statistics. we try to maximize the probability (power) of rejecting H when K is true. 0) 0 p(x. Summary. The statistic L takes on the value 00 when p(x. 0) is the density or frequency function of the random vector X.01 )/(1.0. that is. For instance. we test whether the distribution of X is different from a specified Fo.1. or equivalently. the highest possible power. power function. Typically a test statistic is not given but must be chosen on the basis of its perfonnance. and pvalue. test functions. (0./00)5[(1. 00 . in the binomial example (4. We start with the problem of testing a simple hypothesis H : () = ()o versus a simple alternative K : 0 = 01. subject to this restriction. OIl > 0. OIl = p (x. where eo and e 1 are disjoint subsets of the parameter space 6. and.2 Choosing a Test Statistic The NeymanPearson lemma 223 The preceding paragraph gives an example in which the hypothesis specifies a distribution completely. a given test statistic T. power. then. equals 0 when both numerator and denominator vanish. 00 ) = 0. In particular. we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter () belongs to 80 or e 1. is measured in the NeymanPearson theory. o:(Td has a U(O. We introduce the basic concepts of simple and composite hypotheses. (4. We introduce the basic concepts and terminology of testing statistical hypotheses and give the NeymanPearson framework. In the NeymanPearson framework.2.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. under H. L(x. p(x.Section 4. type II error. 00. (I . and S tends to be large when K : () = 01 > 00 is . we specify a small number 0: and conStruct tests that have at most probability (significance level) 0: of rejecting H (deciding K) when H is true. In this section we will consider the problem of finding the level a test that ha<. 0. The statistic L is reasonable for testing H versuS K with large values of L favoring K over H. type I error.00)/00 (1 .Oo)]n. 1) distribution.2 CHOOSING A TEST STATISTIC: THE NEYMANPEARSON LEMMA We have seen how a hypothesistesting problem is defined and how perfonnance of a given test b. Such a test and the corresponding test statistic are called most poweiful (MP). In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by L(x. critical regions.Otl/(1 .) which is large when S true. by convention. size. I) = EXi is large.00)ln~5 [0. OIl where p(x. In Sections 3. 4. significance level.
kEo['Pk(X) .7l').. thns.'P(X)] > kEo['PdX) . EO'PdX) We want to show E. Eo'P(X) < a. where 7l' denotes the prior probability of {Bo}.'P(X)] = Eo['Pk{X) .3. 'Pk(X) = 1 if p(x.'P(X)[L(X. If 0 < 'P(x) < 1 for the observation vector x.('Pk(X) .) . To this end consider (4. which are tests that may take values in (0.1) if equality occurs. BIl o . then ~ . i' ~ E.4) 1 j 7 .[CPk(X) .3 with n = 10 and Bo = 0.1]. then it must be a level a likelihood ratio test. Bo. Bo) = O. and suppose r. B . . Finally.p is a level a test. Because we want results valid for all possible test sizes Cl' in [0.'P(X)] 1{P(X.1). then I > O.2. B. Theorem 4. where 1= EO{CPk(X)[L(X. 0 < <p(x) < 1.3). lOY some x.'P(X)] a. using (4.'P(X)] > O. i = 0. and 'P(x) = [0.kJ}.224 Testing and Confidence Regions Chapter 4 We call iflk a likelihood ratio or NeymanPearsoll (NP) test ifunction) if for some 0 k < OJ we can write the test function Yk as 'Pk () = X < 1 o if L( x. Such randomized tests are not used in practice.Bo.k] + E'['Pk(X) .0262 if S = 5. 'Pk is MP for level E'c'PdX).3.. If L(x. we have shown that I j I i E.P(S > 5)I!P(S = 5) = . and because 0 < 'P(x) < 1. Bo) = O} = I + II (say). o Because L(x.2) that 'Pk is a Bayes rule with k ~ 7l' /(1 .) . then <{Jk is MP in the class oflevel a tests.'P(X)] . (b) For each 0 < a < 1 there exists an MP size 0' likelihood ratio test provided that randomization is permitted.2) forB = BoandB = B.2. (NeymanPearson Lemma). Note that a > 0 implies k < 00 and. if want size a ~ .05 . Note (Section 3. (4. It follows that II > O.k is < 0 or > 0 according as 'Pk(X) is 0 or 1.Bo. k] .05 in Example 4. there exists k such that (4. likelihood ratio tests are unbeatable no matter what the size a is.2.2. that is.B. (See also Section 1. 1.1.) For instance. Proof. we choose 'P(x) = 0 if S < 5. (a) Let E i denote E8. They are only used to show that with randomization. (a) If a > 0 and I{Jk is a size a likelihood ratio test. BI )  . 'P(x) = 1 if S > 5.['Pk(X) .Bd >k <k with 'Pdx) any value in (0.1. (c) If <p is an MP level 0: test.2. We show that in addition to being Bayes optimal. we consider randomized tests '1'. B .'P(X)] [:~~:::l..3) 1 : > O. the interpretation is that we toss a coin with probability of heads cp(x) and reject H iff the coin shows heads.
2.) + 1T' (4. = 'Pk· Moreover.) + 1Tp(X.2. where v is a known signal. 0. define a.2.I. It follows that (4.) > k] Po[L(X. OJ! > k] < a and Po[T. that is. 00 .2 where X = (XI.. Then the posterior probability of (). Next consider 0 < Q: < 1. 00 . Now 'Pk is MP size a.1. Consider Example 3. decides 0.2. 0.2.v)=exp 2LX'2 .4) we need tohave'P(x) ~ 'Pk(X) = I when L(x.2. then E9.2.2) holds forO = 0.O.Oo.1.k] on the set {x : L(x. we conc1udefrom (4.1. 0. is 1T(O. .Po[L(X.. Therefore. k = 00 makes 'Pk MP size a..Oo.2. 0. 0. also be easily argued from this Bayes property of 'Pk (Problem 4. OJ Corollary 4.O. { (7 i=l 2(7 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions..2.) = k] = 0. when 1T = k/(k + 1). Because Po[L(X. 0. 0. 2 (7 T(X) ~.) = k j.fit [ logL(X.5) If 0. then 'Pk is MP size a. Remark 4. See Problem 4. k = 0 makes E. or 00 according as 1T(01 Ix) is larger than or smaller than 1/2.. OJ! > k] > If Po[L(X.3.7. tben.) > k and have 'P(x) = 'Pk(X) ~ 0 when L(x.) (1 . 'Pk(X) = 1 and !. 'Pk(X) =a . .Section 4. OJ. 00 . 00 . Let 1T denote the prior probability of 00 so that (I 1T) is the prior probability of (it. 00 .1T)L(x. The same argument works for x E {x : p(x.L. 00 ) (11T)L(x. X n ) is a sample of n N(j. denotes the Bayes procedure of Exarnple 3.2. 00 ) > and 0 = 00 .0. Let Pi denote Po" i = 0. 0. 00 .2(b). Example 4.) (1 .10). If not. 0. for 0 < a < 1.) < k. We found nv2} V n L(X./f 'P is an MP level a test.u 2 ) random variables with (72 known and we test H : 11 = a versus K : 11 = v. Here is an example illustrating calculation of the most powerful level a test <{Jk. 0 It follows from the NeymanPearson lemma that an MP test has power at least as large as its level. 'P(X) > a with equality iffp(" 60 ) = P(·.1T)p(x. OJ! = ooJ ~ 0. If a = 1.2 Choosing a Test Statistic: The NeymanPearson lemma " 225 =:cc (b) If a ~ 0.. Part (a) of the lemma can.) .OI)' Proof.) > then to have equality in (4.pk is MP size a. 00 . (c) Let x E {x: p(x. = v.O.fit. then there exists k < 00 such that Po[L(X. 60 .v) + nv ] 2(72 x . I x) = (1 1T)p(X. 00 . 0.(X.5) thatthis 0.
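A short numerical sketch (ours) of this example: the functions below carry out the level-alpha test that rejects when sqrt(n) Xbar / sigma > z(1 - alpha) and evaluate its power Phi(z(alpha) + sqrt(n) v / sigma); the particular v, sigma, alpha and sample sizes are illustrative values.

```python
# Power of the size-alpha test of Example 4.2.1 against the signal v.
import math
from scipy.stats import norm

def reject(x, sigma, alpha):
    """The MP test: reject H iff sqrt(n) * Xbar / sigma > z(1 - alpha)."""
    n = len(x)
    return math.sqrt(n) * (sum(x) / n) / sigma > norm.ppf(1 - alpha)

def power(n, v, sigma, alpha):
    """Phi(z(alpha) + sqrt(n) * v / sigma)."""
    return norm.cdf(norm.ppf(alpha) + math.sqrt(n) * v / sigma)

for n in (10, 25, 100):
    print(n, round(power(n, v=0.5, sigma=1.0, alpha=0.05), 3))   # power grows with n
```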
It is used in a classification context in which 9 0 .6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > O. that the UMP test phenomenon is largely a feature of onedimensional parameter problems. among other things. The following important example illustrates.0. 0 An interesting feature of the preceding example is that the test defined by (4.95). itl' However if. Example 4.8)..90 or . E j ). this is no longer the case (Problem 4. O)T and Eo = I. then.. l. ).. then this test rule is to reject H if Xl is large. By the NeymanPearsoo lemma this is the largest power available with a level Ct test. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Flmction. The power of this test is. then we solve <1>( z(a) + (v..6) has probability of type I error 0:.2.2. Note that in general the test statistic L depends intrinsically on ito. however. if we want the probability of detecting a signal v to be at least a preassigned value j3 (say.2.1. .226 Testing and Confidence Regions Chapter 4 is also optimal for this problem. if Eo #./iil (J)). 0.1. If ~o = (1. Suppose X ~ N(Pj. by (4. • • . . 0. <I>(z(a) + (v. E j ). j = 0. itl = ito + )"6. if ito./ii/(J)) = 13 for n and find that we need to take n = ((J Iv j2[z(la) + z(I3)]'.. Eo are known. > 0 and E l = Eo. We return to this in Volume II. The likelihood ratio test for H : (J = 6 0 versus K : (J = (h is based on .2). In this example we bave assnmed that (Jo and (J. and only if.4. But T is the test statistic we proposed in Example 4. From our discussion there we know that for any specified Ct. (JI correspond to two known populations and we desire to classify a new observation X as belonging to one or the other.2. We will discuss the phenomenon further in the next section.Jto)E01X large.2. for the two popnlations are known.7) where c ~ z(1 . say. Particularly important is the case Eo = E 1 when "Q large" is equivalent to "F = (ttl . . . Such a test is called uniformly most powerful (UMP). I I . 6.2. . (Jj = (Pj. a UMP (for all ).a)[~6Eo' ~oJ! (Problem 4.9). large. T>z(la) (4. This is the smallest possible n for any size Ct test.2. the test that rejects if.I." The function F is known as the Fisher discriminant function. If this is not the case. Thus.) test exists and is given by: Reject if (4. they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population CQvanances..' Rejecting H for L large is equivalent to rejecting for I 1 .
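The estimated discriminant mentioned above can be sketched as follows; the simulated two-dimensional data, the equal-covariance setting and the library calls are our illustrative choices rather than part of the text.

```python
# Estimated Fisher discriminant F(x) = (mu1_hat - mu0_hat)' Sigma_hat^{-1} x for the
# common-covariance case, with empirical means and a pooled covariance estimate.
import numpy as np

rng = np.random.default_rng(3)
X0 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=100)   # sample from population theta_0
X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=100)   # sample from population theta_1

mu0_hat, mu1_hat = X0.mean(axis=0), X1.mean(axis=0)
Sigma_hat = np.cov(np.vstack([X0 - mu0_hat, X1 - mu1_hat]).T)   # pooled covariance estimate

def F(x):
    return (mu1_hat - mu0_hat) @ np.linalg.solve(Sigma_hat, x)

print(F(np.array([0.8, 0.9])))   # classify a new observation to population 1 when F is large
```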
However.. where nl.3. there is sometimes a simple alternative theory K : (}l = (}ll. .1) test !. 0) P n! nn.···..u k nn.. For instance.. n offspring are observed. Two examples in which the MP test does not depend on 0 1 are given. .. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : (j = ()o versus the simple alternative K : 0 = ()1. . . . . This phenomenon is not restricted to the Gaussian case as the next example illustrates. .3. the likelihood ratio L 's L~ rr(:. 0 < € < 1 and for some fixed (4. . is established. nk are integers summing to n.M( n.nk. We note the connection of the MP test to the Bayes procedure of Section 3.. .. here is the general definition of UMP: Definition 4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 227 Summary.OkO' Usually the alternative to H is composite. . Ok = OkO. Example 4. 'P') > (3(0.. The NeymanPearson lemma.. Such tests are said to be UMP (unifonnly most powerful). A level a test 'P' is wtiformly most powerful (UMP) for H : 0 E versus K : 0 E 8 1 if eo (3(0.. _ (nl.2 for deciding between 00 and Ol. .... which states that the size 0:' SLR test is uniquely most powerful (MP) in the class of level a tests. then (N" .p. N k ) has a multinomial M(n.3. With such data we often want to test a simple hypothesis H : (}I = Ow. 1 (}k) distribution with frequency function. Testing for a Multinomial Vector..'P) for all 0 E 8" for any other level 0:' (4.1.."" (}k = Oklo In this case. 4. .2) where .1. . Suppose that (N1 . r lUI nt···· nk· ". Nk) .2 that UMP tests for onedimensional parameter problems exist.3 UNIFORMLY MOST POWERFUL TESTS ANO MONOTONE LIKELIHOOD RATIO MODELS We saw in the two Gaussian examples of Section 4. 0" . if a match in a genetic breeding experiment can result in k types. .Section 4. Ok)' The simple hypothesis would correspond to the theory that the expected proportion of offspring of types 1.)N' i=l to Here is an interesting special case: Suppose OjO integer I with 1 < I < k > 0 for all}.3. Before we give the example. . k are given by Ow. (}l. and N i is the number of offspring of type i.
OW and the model is by (4. .o)n[O/(1 .00). Bernoulli case.i. 02)/p(X.1) if T(x) = t. 0) ~.(x) any value in (0. 00 .3. we have seen three models where. for". 6.(X) = '" > 0. then this family is MLR.2. 0 . 0 Typically the MP test of H : () = eo versus K : () = (}1 depends on (h and the test is not UMP.. If 1J(O) is strictly increasing in () E e. where u is known. The family of models {P. then 6. . Example 4. e c R.228 Testmg and Confidence Regions Chapter 4 That is. type I is less frequent than under H and the conditional probabilities of the other types given that type I has not occurred are the same under K as they are under H.1. is an MLRfamily in T(x).1.nx/u and ry(!") Define the NeymanPearson (NP) test function 6 ( ) _ 1 ifT(x) > t . = (I . N 1 < c. X ifT(x) < t ° (4. . 0 form with T(x) .... Critical values for level a are easily determined because Nl . Consider the problem of testing H : 0 = 00 versus K: 0 ~ 01 with 00 < 81.3.3. Example 4. .2. under the alternative.. the power function (3(/:/) = E. (1) For each t E (0.1 is of this (. B( n.. This is part of a general phenomena we now describe.3. Consider the oneparameter exponential family mode! o p(x.6. In this i. we get radically different best tests depending on which Oi we assume to be (ho under H.3) with 6.nu)!".(X) is increasing in 0. there is a statistic T such that the test with critical region {x : T(x) > c} is UMP. If {P. equals the likelihood ratio test 'Ph(t) and is MP. = h(x)exp{ry(O)T(x) ~ B(O)}. .: 0 E e}.. Oil = h(T(x)) for some increasing function h.3. for testing H : 8 < /:/0 versus K:O>/:/I' .2.1) MLR in s. = P( N 1 < c). we conclude that the MP lest rejects H. Then L = pnN1£N1 = pTl(E/p)N1. : 0 E e} with e c R is said to be a monotone likelihood ratio (MLR) family if for (it < O the distributions POl and P0 2 are distinct and 2 the ratio p(x. However. .o)n. this test is UMP fortesting H versus K : 0 E 8 1 = {/:/ : /:/ . . Because dt does not depend on ()l. k. Moreover. : 8 E e}.d..3. p(x. . Oil is an increasing function ofT(x).3 continned). is UMP level"..3. Note that because l can be any of the integers 1 . it is UMP at level a = EOodt(X) for testing H : e = eo versus K : B > Bo in fact. Ow) under H.2 (Example 4. :. then L(x. < 1 implies that p > E. set s then = l:~ 1 Xi. in the case of a real parameter. is an MLRfamily in T(x).O) = 0'(1 . Example 4. Because f. is of the form (4. (2) If E'o6. Theorem 4. if and only if. e c R. Thus. .2) with 0 < f < I}. = . J Definition 4. Suppose {P.
Suppose X!.3. . H. where b = N(J.l..1.Xn .l. we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard.(bl1) x.bo > n.l) yields e L( 0 0 )=b.( NOo. If the inspector making the test considers lots with bo = N(Jo defectives or more unsatisfactory. For simplicity suppose that bo > n. Suppose {Po: 0 E Example 4.4.2. 0 Proof (I) follows from b l = iPh(t) The following useful result follows immediately.<>.. (8 < s(a)) = a.5. the alternative K as (J < (Jo. Then. Because the most serious error is to judge the precision adequate when it is not. we now show that the test O· with reject H if.(X). Testing Precision. Corollary 4. where h(a) is the ath quantile of the hypergeometric. X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives. To show (2).3. d t is UMP for H : (J < (Jo versus K : (J > (Jo. 0.<» is lfMP level afor testing H: (J < (Jo versus K : (J > (Jo. X < h(a). N . If the distributionfunction Fo ofT(X) under X"" POo is continuous and ift(1a) isasolution of Fo(t) = 1 . Thus.3 Uniformly Most Po~rful Tests and Monotone likelihood Ratio Models 229 and Corollary 4. e c p(x.l. where U O represents the minimum tolerable precision.U 2 ) population.. then the test/hat rejects H if and only ifT(r) > t(1 . . then e}. where xn(a) is the ath quantile of the X. Let S = l:~ I (X.o.Section 4.o. 20' 2  This is a oneparameter exponential family and is MLR in T = So The UMP level 0' test rejects H if and only if 8 < s(a) where s(a) is such that Pa .. she formulates the hypothesis H as > (Jo. . if N0 1 = b1 < bo and 0 < x < b1 . distribution. and specifies an 0' such that the probability of rejecting H (keeping a bad lot) is at most 0'. For instance. 0) = exp {~8 ~ IOg(27r0'2)} . recall that we have seen that e5 t maximizes the power for testing II : (J = (Jo versus K : (J > (Jo among the class of tests with level <> ~ Eo. N. If a is a value taken on by the distribution of X. Because the class of tests with level 0' for H : (J < (Jo is contained in the class of tests with level 0' for H : (J = (Jo. and we are interested in the precision u. Suppose tha~ as in Example l.Xn is a sample from a N(I1.3.(X) for testing H: 0 = 01 versus J( : 0 = 0. Ifwe write ~= Uo t i''''l (Xi _1')2 000 we see that Sju5 has a X~ distribution. If 0 < 00 . Eoo. where 11 is a known standard. . 1 bo(bol) (blX+l)(Nbl) (box+1)(Nbo) (Nbln+x+l) (Nbon+x+l)' . the critical constant 8(0') is u5xn(a). distribution. and only if. is an MLRfamily in T(r). n). is UMP level a. then by (1).l of the measurements Xl. we test H : u > Uo l versus K : u < uo. and because dt maximizes the power over this larger class.. (X) < <> and b t is of level 0: for H : (J :S (Jo. _1')2. R. 0 Example 4. Quality Control. (l.1 by noting that for any (it < (J2' e5 t is MP at level Eo.
as seen in Figure 4. . we would also like large power (3(0) when () E 8 1 . By CoroUary 4. This continuity of the power shows that not too much significance can be attached to acceptance of H. I i (3(t) ~ 'I>(z(a) + . This equation is equiValent to = {3 z(a) + . For such the probability of falsely accepting H is almost 1 .O'"O.2).:(x:..4 because (3(p... The critical values for the hypergeometric distribution are available on statistical calculators and software. i.1.n + I) .' . H and K are of the fonn H : () < ()o and K : () > ()o. Thus.1.(}o. 0) continuous in 0. in general.. Note that a small signaltonoise ratio ~/ a will require a large sample size n. ~) would be our indifference region. 1 I . this is. the appropriate n is obtained by solving i' . 0 Power and Sample Size In the NeymanPearson framework we choose the test whose size is small. (0.Il = (b l . that is.ntI" = z({3) whose solution is il .1. this is a general phenomenon in MLR family models with p( x. forO <:1' < b1 1. we want the probability of correctly detecting an alternative K to be large.. we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small. In Example 4.B 1 ) =0 forb l Testing and Confidence Regions Chapter 4 < X <n.x) <I L(x. ".OIl box (Nn+I)(blx) .(b o . ~) for some small ~ > 0 because such improvements are negligible. This is possible for arbitrary /3 < 1 only by making the sample size n large enough. we specify {3 close to I and would like to have (3(!') > (3 for aU !' > t...L. This is a subset of the alternative on which we are willing to tolerate low power.) is increasing. Off the indifference region. we want guaranteed power as well as an upper bound on the probability of type I error. in (O .'.230 NotethatL(x. if all points in the alternative are of equal significance: We can find > 00 sufficiently close to 00 so that {3( 0) is arbitrarily close to {3(00) = a.4 we might be uninterested in values of p. In our example this means that in addition to the indifference region and level a.c:O".1.1 and formula (4. It follows that 8* is UMP level Q.. Thus.a.Oo. L is decreasing in x and the hypergeometric model is an MLR family in T( x) = r.x) (N . .1. In both these cases. In our nonnal example 4. On the other hand.3. not possible for all parameters in the alternative 8.ntI") for sample size n. {3(0) ~ a.. This is not serious in practice if we have an indifference region. and the powers are continuous increasing functions with limOlO.+. I' .I. Therefore. However. ° ° .. That is.
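A minimal sketch (ours) of the sample-size calculation for the normal example: it solves z(alpha) + sqrt(n) Delta / sigma = z(beta) for n, as in the display above; the values of Delta, sigma, alpha and beta below are illustrative, not the text's.

```python
# Smallest n giving power at least beta at the edge Delta of the indifference region,
# for the one-sided normal test with known sigma: n = ((z(1-alpha) + z(beta)) * sigma / Delta)^2.
import math
from scipy.stats import norm

def sample_size(Delta, sigma, alpha, beta):
    return math.ceil(((norm.ppf(1 - alpha) + norm.ppf(beta)) * sigma / Delta) ** 2)

print(sample_size(Delta=0.5, sigma=1.0, alpha=0.05, beta=0.90))   # n grows like (sigma / Delta)^2
```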
90. Our discussion uses the classical normal approximation to the binomial distribution. 1.3(0.1.3 to 0. we can have very great power for alternatives very close to O. In Example 4. we fleed n ~ (0.1. Now let .5).3.3. far from the hypothesis. As a further example and precursor to Section 5. Example 4.282 x 0. > O.4. test for = .Oi) [nOo(1 . when we test the hypothesis that a very large sample comes from a particular distribution.2) shows that.4. Suppose 8 is a vector.0. Thus.3 continued).05 binomial test of H : 8 = 0.90 of detecting the 17% increase in 8 from 0.. (3 = . we find (3(0) = PotS > so) = <I> ( [nO(l ~ 0)11/ 2 Now consider the indifference region (80 . Formula (4.) = (3 for n and find the approximate solution no+ 1 . = 0. It is natural to associate statistical significance with practical significance so that a very low pvalue is interpreted as evidence that the alternative that holds is physically significant. We solve For instance. Bd. and only if. 00 = 0.00 )] 1/2 .55)}2 = 162.4 this would mean rejecting H if. First." The reason is that n is so large that unimportant small discrepancies are picked up. Dt. using the SPLUS package) for the level . we next show how to find the sample size that will "approximately" achieve desired power {3 for the size 0: test in the binomial example. the size . (h = 00 + Dt. if Oi = .80 ) .86.Section 4.3 requires approximately 163 observations to have probability . They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is 0:. Again using the nonnal approximation.6 (Example 4.35(0. ifn is very large and/or a is small. There are various ways of dealing with this problem. The power achievable (exactly.05)2{1. Such hypotheses are often rejected even though for practical purposes "the fit is good enough. to achieve approximate size a. we solve (3(00 ) = Po" (S > s) for s using (4. This problem arises particularly in goodnessoffit tests (see Example 4.4) and find the approximate critical value So ~ nOo 1 + 2 + z(l.35.35. where (3(0.645 x 0. that is.7) + 1.1. Often there is a function q(B) such that H and K can be formulated as H : q(O) <: qo and K : q(O) > qo. 0 ° Our discussion can be generalized.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 231 Dual to the problem of not having enough power is that of having too much.05.4. and 0.1.35 and n = 163 is 0.
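The same kind of calculation can be done for the binomial test through the normal approximation; the following sketch (ours, with illustrative values theta_0 = 0.3 and theta_1 = 0.5 rather than the numbers worked in the text) finds an approximate n and then checks the exact size and power of the resulting test.

```python
# Approximate sample size for the one-sided binomial test (normal approximation), followed
# by an exact check of the size and power of the test that rejects when S >= s_0.
import math
from scipy.stats import norm, binom

theta0, theta1, alpha, beta = 0.30, 0.50, 0.05, 0.90
z1a, zb = norm.ppf(1 - alpha), norm.ppf(beta)

n = math.ceil(((z1a * math.sqrt(theta0 * (1 - theta0))
                + zb * math.sqrt(theta1 * (1 - theta1))) / (theta1 - theta0)) ** 2)
s0 = math.ceil(n * theta0 + 0.5 + z1a * math.sqrt(n * theta0 * (1 - theta0)))

print(n, s0)                                                      # approximate n and critical value
print(binom.sf(s0 - 1, n, theta0), binom.sf(s0 - 1, n, theta1))   # exact size and power at this n
```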
The set {O : qo < q(O) < ql} is our indjfference region. .< (10 among all tests depending on <. is the a percentile of X~l' It is evident from the argument of Example 4. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(O. > qo be a value such that we want to have power fJ(O) at least fJ when q(O) > q..5 that a particular test statistic can have a fixed distribution £0 under the hypothesis.232 Testing and Confidence Regions Chapter 4 q. For instance. 0) = (B . To achieve level a and power at least (3. I . that are not 01.7. This procedure can be applied. 0 E 9. is such that q( Oil = q. Thus.2 only.2 is (}'2 = ~ E~ 1 (Xi .+ 00.3.. (4. Implicit in this calculation is the assumption that POl [T > col is an increasing function ofn. Suppose that in the Gaussian model of Example 4. For each n suppose we have a level a test for H versus I< based on a suitable test statistic T..£ is unknown. Although H : 0.l)l(O. J. Then the MLE of 0.4.3 that this test is UMP for H : (1 > (10 versus K : 0..(Tn ) is an MLR family. for 9.= (10 versus K : 0.. and also increases to 1 for fixed () E 6 1 as n .(0) > O} and Co (Tn) = C'(') (Tn) for all O.= 00 is now composite. = (tm. we illustrate what can happen with a simple example. Reducing the problem to choosing among such tests comes from invariance consideration that we do not enter into until Volume II.3.(0) = o} aild 9. then rejecting for large values of Tn is UMP among all tests based on Tn..1 by taking q( 0) equal to the noncentrality parameter governing the distribution of the statistic under the alternative. the distribution of Tn ni:T 2/ (15 is X. The theory we have developed demonstrates that if C. first let Co be the smallest number c such that Then let n be the smallest integer such that P" IT > col > fJ where 00 is such that q( 00 ) = qo and 0.Xf as in Example 2. when testing H : 0 < 00 versus K : B > Bo. We have seen in Example 4. I}.l' independent of IJ. In general.O)<O forB < 00 forB>B o. . = {O : ).3.Bo).9. Suppose that j3(O) depends on () only through q( (J) and is a continuous increasing function of q( 0). Testing Precision Continued.1) 1(0. . a).< (10 and rejecting H if Tn is small. However. for instance. Example 4. 0 E 9.0) > ° I(O. a reasonable class of loss functions are those that satisfy : 1 i j 1 1(0.3. we may consider l(O.(0) so that 9 0 = {O : ). is detennined by a onedimensional parameter ).4) j j 1 .1. the critical value for testing H : 0. a E A = {O.2.. It may also happen that the distribution of Tn as () ranges over 9. 0 = Complete Families of Tests The NeymanPearson framework is based on using the 01 loss function. to the F test of the linear model in Section 6. 00).
{1(8. is complete. 1) 1(O.O)} E. the test that rejects H : 8 < 8 0 for large values of T(x) is UMP for K : 8 > 8 0 .(X) be such that.5).E.12) and. and Regions 233 if for any decision rule 'P The class D of decision procedures is said to be there exists E such that complete(I). If E.3. (4.4 Confidence Bounds. 1) 1(0.o.E. We consider models {Po : () E e} for which there exist tests that are most powerful for every () in a composite alternative t (UMP tests). if the model is correct and loss function is appropriate.o. Intervals. a) satisfies (4.3.0.3.O)]I"(X)} Let o.(I"(X))) < 0 for 8 > 8 0 . a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing ()o versus 8 1 is an increasing function of a statistic T( x) for every ()o < ()1.3. hence. For MLR models. (4. then the class of tests of the form (4.3. is an MLR family in T( x) and suppose the loss function 1(0.(o. e c R. Now 0. it isn't worthwhile to look outside of complete classes.(X) > 1.E. For () real.4 CONFIDENCE BOUNDS.I"(X) 0 for allO then O=(X) clearly satisfies (4. 1) + [1 1"(X)]I(O.R(O. Finally.) .2. the risk of any procedure can be matched or improved by an NP test. Theorem 4. The risk function of any test rule 'P is R(O. hence. AND REGIONS We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter {J i" a . Thus. Proof. locally most powerful (LMP) tests in some cases can be found.(2) a v R(O. the class of NP tests is complete in the sense that for loss functions other than the 01 loss function. e 4.4).(x)) .3) with Eo.(X) = ". 0 < " < 1. O))(E. = R(O. we show that for MLR models.(X) = E"I"(X) > O. INTERVALS.I") ~ = EO{I"(X)I(O. is similarly UMP for H : 0 > 8 0 versus K : 0 < 00 (Problem 4. In the following the decision procedures arc test functions. (4. 1") for all 0 E e.1 and.3. is UMP for H : 0 < 00 versus K: 8 > 8 0 by Theorem 4. when UMP tests do not exist. In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from H.Section 4.5) That is.(Io.(X)) = 1. Thus. 0 Summary.O) + [1(0.6) But 1 .o) < R(O. for some 00 • E"o.5) holds for all 8.3. 1") = (1(0.3. We also show how. then any procedure not in the complete class can be matched or improved at all () by one in the complete class.I"(X) for 0 < 00 .3. Suppose {Po: 0 E e}. E.
member of a specified set Θ0. Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability 1 − α.

As an illustration, consider the Gaussian measurement model in which X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known. Suppose that μ represents the mean increase in sleep among patients administered a drug. Then we can use the experimental outcome X = (X1, ..., Xn) to establish a lower bound μ̲(X) for μ with a prescribed probability (1 − α) of being correct. In the non-Bayesian framework, μ is a constant, and we look for a statistic μ̲(X) that satisfies P(μ̲(X) ≤ μ) = 1 − α with 1 − α equal to .95 or some other desired level of confidence. In our example this is achieved by writing

P(√n(X̄ − μ)/σ ≤ z(1 − α)) = 1 − α.

By solving the inequality inside the probability for μ, we find

P(X̄ − σ z(1 − α)/√n ≤ μ) = 1 − α,

and μ̲ = X̄ − σ z(1 − α)/√n is a lower bound with P(μ̲ ≤ μ) = 1 − α. We say that μ̲(X) is a lower confidence bound with confidence level 1 − α. Similarly, we may be interested in an upper bound on a parameter. In this case, μ̄(X) = X̄ + σ z(1 − α)/√n satisfies P(μ ≤ μ̄(X)) = 1 − α and is called an upper level (1 − α) confidence bound for μ.

Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find a such that the probability that the interval [X̄ − a, X̄ + a] contains μ is 1 − α. We find such an interval by noting

P(−z(1 − α/2) ≤ √n(X̄ − μ)/σ ≤ z(1 − α/2)) = 1 − α

and solving the inequality inside the probability for μ. This gives μ±(X) = X̄ ± σ z(1 − α/2)/√n, and we say that [μ−(X), μ+(X)] is a level (1 − α) confidence interval for μ.

In general, if ν = ν(P), P ∈ P, is a parameter and X ~ P, X ∈ R^q, it may not be possible for a bound or interval to achieve exactly probability (1 − α); we settle for a probability at least (1 − α).
Definition 4.4.1. A statistic ν̲(X) is called a level (1 − α) lower confidence bound for ν if for every P ∈ P,

P[ν̲(X) ≤ ν] ≥ 1 − α.

Similarly, ν̄(X) is called a level (1 − α) upper confidence bound for ν if for every P ∈ P,

P[ν̄(X) ≥ ν] ≥ 1 − α.

The random interval [ν̲(X), ν̄(X)] formed by a pair of statistics ν̲(X), ν̄(X) is a level (1 − α) or 100(1 − α)% confidence interval for ν if, for all P ∈ P,

P[ν̲(X) ≤ ν ≤ ν̄(X)] ≥ 1 − α.

The quantities on the left are called the probabilities of coverage, and (1 − α) is called a confidence level. For a given bound or interval, the confidence level is clearly not unique because any number (1 − α') ≤ (1 − α) will also be a confidence level. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level. Note that in the case of intervals this is just inf{P[ν̲(X) ≤ ν ≤ ν̄(X)] : P ∈ P}, that is, the minimum probability of coverage. For the normal measurement problem we have just discussed, the probability of coverage is independent of P and equals the confidence coefficient.

Example 4.4.1. The (Student) t Interval and Bounds. Let X1, ..., Xn be a sample from a N(μ, σ²) population, and assume initially that σ² is known. In the preceding discussion we used the fact that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution to obtain a confidence interval for μ by solving −z(1 − α/2) ≤ Z(μ) ≤ z(1 − α/2) for μ. In this process Z(μ) is called a pivot. In general, finding confidence intervals (or bounds) often involves finding appropriate pivots.

Now we turn to the σ² unknown case and propose the pivot T(μ) obtained by replacing σ in Z(μ) by its estimate s, where

s² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².

To obtain the distribution of T(μ) = √n(X̄ − μ)/s, note that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution and is, by Theorem B.3.3, independent of V = (n − 1)s²/σ², which has a χ²_{n−1} distribution. We conclude from the definition of the (Student) t distribution in Section B.3 that Z(μ)/√(V/(n − 1)) = T(μ)
has the T distribution T_{n−1}, whatever be μ and σ². Let t_k(p) denote the pth quantile of the T_k distribution. Then

P(−t_{n−1}(1 − α/2) ≤ T(μ) ≤ t_{n−1}(1 − α/2)) = 1 − α.

Solving the inequality inside the probability for μ, we find

P[X̄ − s t_{n−1}(1 − α/2)/√n ≤ μ ≤ X̄ + s t_{n−1}(1 − α/2)/√n] = 1 − α.

The shortest level (1 − α) confidence interval of the type [X̄ − sc/√n, X̄ + sc/√n] is, thus,

X̄ ± s t_{n−1}(1 − α/2)/√n.     (4.4.1)

Similarly, X̄ − s t_{n−1}(1 − α)/√n and X̄ + s t_{n−1}(1 − α)/√n are natural lower and upper confidence bounds with confidence coefficients (1 − α). To calculate the coefficients t_{n−1}(1 − α/2) and t_{n−1}(1 − α), we use a calculator, computer software, or Tables I and II. For instance, if n = 9 and α = 0.01, we enter Table II to find that the probability that a T_{n−1} variable exceeds 3.355 is .005. Hence, t_{n−1}(1 − α/2) = 3.355 and [X̄ − 3.355 s/3, X̄ + 3.355 s/3] is the desired level 0.99 confidence interval. From the results of Section B.7, we see that as n → ∞ the T_{n−1} distribution converges in law to the standard normal distribution; for the usual values of α we can reasonably replace t_{n−1}(p) by the standard normal quantile z(p) for n ≥ 120.

Up to this point, we have assumed that X1, ..., Xn are i.i.d. N(μ, σ²). The properties of confidence intervals such as (4.4.1) in non-Gaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5. It turns out that the distribution of the pivot T(μ) is fairly close to the T_{n−1} distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal; in this case the interval (4.4.1) has confidence coefficient close to 1 − α. On the other hand, for very skew distributions such as the χ² with few degrees of freedom, or very heavy-tailed distributions such as the Cauchy, the coverage probability of (4.4.1) can be far from 1 − α (see Chapter 5).

Example 4.4.2. Confidence Intervals and Bounds for the Variance of a Normal Distribution. Suppose that X1, ..., Xn is a sample from a N(μ, σ²) population. Then V(σ²) = (n − 1)s²/σ² has a χ²_{n−1} distribution and can be used as a pivot. Thus, if we let x_{n−1}(p) denote the pth quantile of the χ²_{n−1} distribution and if α1 + α2 = α, then

P(x_{n−1}(α1) ≤ V(σ²) ≤ x_{n−1}(1 − α2)) = 1 − α.

By solving the inequality inside the probability for σ², we find that

[(n − 1)s²/x_{n−1}(1 − α2), (n − 1)s²/x_{n−1}(α1)]

is a confidence interval with confidence coefficient (1 − α). The length of this interval is random, and there is a unique choice of α1 and α2 that uniformly minimizes expected length among all intervals of this type. It may be shown that for n large, taking α1 = α2 = α/2 is not far from optimal (Tate and Klett, 1959).
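The quantiles entering the intervals of Examples 4.4.1 and 4.4.2 are readily available in statistical software. The following is a minimal sketch (not from the text) of both computations, assuming NumPy and SciPy are available; the function names are ours and the equal split α1 = α2 = α/2 is used.

    import numpy as np
    from scipy import stats

    def t_interval(x, alpha=0.05):
        """Level (1 - alpha) t interval (4.4.1) for the mean mu."""
        x = np.asarray(x, dtype=float)
        n, xbar, s = len(x), x.mean(), x.std(ddof=1)
        c = stats.t.ppf(1 - alpha / 2, df=n - 1)      # t_{n-1}(1 - alpha/2)
        return xbar - c * s / np.sqrt(n), xbar + c * s / np.sqrt(n)

    def var_interval(x, alpha=0.05):
        """Level (1 - alpha) chi-square interval for sigma^2 (Example 4.4.2)."""
        x = np.asarray(x, dtype=float)
        n, s2 = len(x), x.var(ddof=1)
        lo = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
        hi = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)
        return lo, hi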
The pivot V(σ²) similarly yields the respective level (1 − α) lower and upper confidence bounds (n − 1)s²/x_{n−1}(1 − α) and (n − 1)s²/x_{n−1}(α). In contrast to Example 4.4.1, the confidence interval and bounds for σ² do not have confidence coefficient 1 − α even in the limit as n → ∞ if we drop the normality assumption. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution, which typically is unknown. In the problems we give an interval with correct limiting coverage probability.

The method of pivots works primarily in problems related to sampling from normal populations. However, if we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in n Bernoulli Trials. If X1, ..., Xn are the indicators of n Bernoulli trials with probability of success θ, then X̄ is the MLE of θ. There is no natural "exact" pivot based on X̄ and θ. However, by the De Moivre-Laplace theorem, √n(X̄ − θ)/√(θ(1 − θ)) has approximately a N(0, 1) distribution. If we use this function as an "approximate" pivot and let ≈ denote "approximate equality," we can write

P[|√n(X̄ − θ)/√(θ(1 − θ))| ≤ z(1 − α/2)] ≈ 1 − α.

Let k_α = z²(1 − α/2) and observe that this is equivalent to

P[(X̄ − θ)² ≤ (k_α/n) θ(1 − θ)] = P[g(θ, X̄) ≤ 0] ≈ 1 − α,

where

g(θ, X̄) = (1 + k_α/n) θ² − (2X̄ + k_α/n) θ + X̄².

For fixed 0 < X̄ < 1, g(θ, X̄) is a quadratic polynomial with two real roots. In terms of S = nX̄, they are(1)

θ̲(X) = {S + k_α/2 − k_α^{1/2} [S(n − S)/n + k_α/4]^{1/2}} / (n + k_α)
θ̄(X) = {S + k_α/2 + k_α^{1/2} [S(n − S)/n + k_α/4]^{1/2}} / (n + k_α).     (4.4.3)

Because the coefficient of θ² in g(θ, X̄) is greater than zero, {θ : g(θ, X̄) ≤ 0} = {θ : θ̲(X) ≤ θ ≤ θ̄(X)}, so that

P[θ̲(X) ≤ θ ≤ θ̄(X)] ≈ 1 − α     (4.4.4)

and [θ̲(X), θ̄(X)] is an approximate level (1 − α) confidence interval for θ.
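The roots (4.4.3) are easily computed. A minimal sketch (not from the text), assuming NumPy and SciPy; the function name is ours.

    import numpy as np
    from scipy.stats import norm

    def approx_binomial_interval(S, n, alpha=0.05):
        """Roots (4.4.3) of g(theta, Xbar) <= 0; S = number of successes."""
        k = norm.ppf(1 - alpha / 2) ** 2              # k_alpha = z^2(1 - alpha/2)
        center = S + k / 2
        half = np.sqrt(k) * np.sqrt(S * (n - S) / n + k / 4)
        return (center - half) / (n + k), (center + half) / (n + k)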
We can similarly show that the endpoints of the level (1 − 2α) interval are approximate upper and lower level (1 − α) confidence bounds. These intervals and bounds are satisfactory in practice for the usual levels if the smaller of nθ and n(1 − θ) is at least 6. A discussion is given in Brown, Cai, and Das Gupta (2000). For small n, it is better to use the exact level (1 − α) procedure developed in Section 4.5.

Note that in this example we can determine the sample size needed for a desired accuracy. For instance, consider a market researcher whose interest is the proportion θ of a population that will buy a product. He draws a sample of n potential customers, calls willingness to buy success, and uses the preceding model. He can then determine how many customers should be sampled so that (4.4.3) has length 0.02 and is a confidence interval with confidence coefficient approximately 0.95. To see this, note that the length of the interval is

l = 2 k_α^{1/2} [S(n − S)/n + k_α/4]^{1/2} / (n + k_α).     (4.4.5)

Now use the fact that (S − n/2)² ≥ 0, so that S(n − S)/n ≤ n/4, to conclude that

l ≤ k_α^{1/2} (n + k_α)^{−1/2}.     (4.4.6)

Thus, to bound l above by l0 we choose n so that k_α/(n + k_α) ≤ l0², that is, n ≥ k_α/l0² − k_α. With 1 − α/2 = 0.975, z(1 − α/2) = 1.96 and l0 = 0.02, this gives n ≥ (1.96/0.02)² − (1.96)² ≈ 9,600, or n = 9,601, and we can achieve the desired length with confidence coefficient approximately 0.95. This formula for the sample size is very crude because the bound (4.4.6) is only good when θ is near 1/2. Better results can be obtained if one has upper or lower bounds on θ such as θ ≤ θ0 < 1/2 or θ ≥ θ1 > 1/2. See the problems.
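The crude sample-size rule based on (4.4.6) is a one-line computation; a sketch under the same assumptions as above (function name ours).

    import math
    from scipy.stats import norm

    def sample_size_for_length(l0, alpha=0.05):
        """Smallest n with the length bound (4.4.6) at most l0 (crude rule)."""
        k = norm.ppf(1 - alpha / 2) ** 2              # k_alpha
        return math.ceil(k / l0 ** 2 - k)             # n with sqrt(k/(n+k)) <= l0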
.).a). j==1 (4.).a) confidence interval 2nX/x (I .. Let 0 and and upper boundaries of this interval.4...r} is said to be a level (1 ..a). then q(C(X)) ~ {q(O) . 0 E C(X)} is a level (1.) confidence interval for qj (0) and if the ) pairs (T l ' 1'. 2nX/0 has a chi~square. . . (0). Then the rdimensional random rectangle I(X) = {q(O).. q(C(X)) is larger than the confidence set obtained by focusing on B1 alone. we will later give confidence regions C(X) for pairs 8 = (0 1 .Section 44 Confidence Bounds. For instance. qc( 0)) is at least (I . Note that ifC(X) is a level (1 ~ 0') confidence region for 0. then the rectangle I( X) Ic(X) has level = h (X) x . if q( B) = 01 . By using 2nX /0 as a pivot we find the (1 . .1 ).8) .X n denote the number of hours a sample of internet subscribers spend per week on the Internet. .(O) < ii. Confidence Regions of Higher Dimension We can extend the notion of a confidence interval for onedimensional functions q( B) to rdimensional vectors q( 0) = (q..1. If q is not 1 .0'). Let Xl. ...0') confidence region for q(O).(X).. Intervals.. . qc (0)). I is a level (I .a) confidence region. .. (T c' Tc ) are independent.(X) < q. In this case. We write this as P[q(O) E I(X)I > 1. X~n' distribution. ( 2 ) T. this technique is typically wasteful. then exp{ x/O} edenote the lower < q(O) < exp{ x/i)} is a confidence interval for q( 0) with confidence coefficient (1 . we can find confidence regions for q( 0) entirely contained in q( C(X)) with confidence level (1 . Example 4. Here q(O) = 1 ... distribution. if the probability that it covers the unknown but fixed true (ql (0). By Problem B. That is. .F(x) = exp{ x/O}. ii. Suppose X I. ~. Suppose q (X) and ii) (X) are realJ valued. .~a) < 0 <2nX/x (~a) where x(j3) denotes the 13th quantile of the X~n distribution. and suppose we want a confidence interval for the population proportion P(X ? x) of subscribers that spend at least x hours per week on the Internet. X n is modeled as a sample from an exponential. j ~ I. .4.a Note that if I. (X) ~ [q.4. x IT(1a.a). and Regions 239 Confidence Regions for Functions of Parameters We can define confidence regions for a function q(O) as random subsets of the range of q that cover the true value of q(O) with probability at least (1 . £(8. .3. .a.4.
.i. ~ ~'f Dn(F) = sup IF(t) .. an rdimensional confidence rectangle is in this case automaticalll obtained from the onedimensional intervals. we find that a simultaneous in t size 1. (7 2)..1 h(X) = X ± stn_l (1. An approach that works even if the I j are not independent is to use Bonferroni's inequality (A. We assume that F is continuous. Suppose Xl.240 Testing and Confidence Regions Chapter 4 Thus. then I(X) has confidence level (1 . Confidence Rectangle for the Parameters ofa Normal Distribution. if we choose 0' j = 1 ... D n (F) is a pivot. c c P[q(O) E I(X)] > 1.7).2.Laj. ..LP[qj(O) '" Ij(X)1 > 1.a. that is. is a confidence interval for J1 with confidence coefficient (1  I (X) _ [(nl)S2 (nl)S2] 2 Xnl (1 . Then by solving Dn(F) < do for F. X n are i. From Example 4. and we are interested in the distribution function F(t) = P(X < t). . . for each t E R.5. .1. 1 X n is a N (M. consists of the interval i J C(x)(t) = (max{O. It is possible to show that the exact confidence coefficient is (1 .F(t)1 tEn i " does not depend on F and is known (Example 4. Example 4.40) 2.5). Moreover..0 confidence region C(x)(·) is the confidence band which. min{l.15.a).1. From Example 4.a) r. F(t) + do})· We have shown ~ ~ I o P(C(X)(t) :) F(t) for all t E R) for all P E 'P =1.a) confidence rectangle for (/" . Let dO' be chosen such that PF(Dn(F) < do) = 1 .2. as X rv P. That is. Suppose Xl> .0:.6. h (X) X 12 (X) is a level (1 .1) the distribution of . tl continuous in t.do}. ( 2 ) sample and we want a confidence rectangle for (Ii.. j=1 j=1 Thus. if we choose a J = air. See Problem 4.a = set of P with P( 00. in which case (Proposition 4.4.d.4.. . j = 1.4. r.~a). 0 The method of pivots can also be applied to oodimensional parameters such as F.(1 . then leX) has confidence level 1 .4.4. Thus. . Example 4.~a)/rn !a). v(P) = F(·). F(t) .2). According to this inequality.ia)' Xnl (lo) is a reasonable confidence interval for (72 with confidence coefficient (1.
We can apply these notions to give confidence regions for scalar or vector parameters in nonparametric models.

Example 4.4.7. A Lower Confidence Bound for the Mean of a Nonnegative Random Variable. Suppose X1, ..., Xn are i.i.d. as X and that X has a density f(t) = F'(t) that is zero for t < 0 and nonzero for t > 0. By integration by parts, if

μ ≡ μ(F) = ∫_0^∞ t f(t) dt

exists, then μ(F) = ∫_0^∞ [1 − F(t)] dt. Let F−(t) and F+(t) be the lower and upper simultaneous confidence boundaries of Example 4.4.6. Then a (1 − α) lower confidence bound for μ is

μ̲ = μ(F+) = ∫_0^∞ [1 − F+(t)] dt,

because μ̲ = inf{μ(F) : F ∈ C(X)} = μ(F+), while sup{μ(F) : F ∈ C(X)} = μ(F−) = ∞ (see the problems). Intervals for the case of F supported on a finite interval (see the problems) arise in accounting practice (see Bickel, 1992), where such bounds are discussed and shown to be asymptotically strictly conservative.

Summary. We define lower and upper confidence bounds (LCBs and UCBs), confidence intervals, and, more generally, confidence regions. In a parametric model {P_θ : θ ∈ Θ}, a level 1 − α confidence region for a parameter q(θ) is a set C(x) depending only on the data x such that the probability under P_θ that C(X) covers q(θ) is at least 1 − α for all θ ∈ Θ. For a nonparametric class P = {P} and parameter ν = ν(P), we similarly require P(C(X) ∋ ν) ≥ 1 − α for all P ∈ P. We derive the (Student) t interval for μ in the N(μ, σ²) model with σ² unknown, and we derive a confidence interval for the binomial parameter. In a nonparametric setting we derive a simultaneous confidence band for the distribution function F(t) and a confidence bound for the mean of a positive variable X.

4.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS

Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least 1 − α. Acceptance regions of statistical tests are, for a given hypothesis H, subsets of the sample space with probability of accepting H at least 1 − α when H is true. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses. We begin by illustrating the duality in the following example.

Example 4.5.1. Two-Sided Tests for the Mean of a Normal Distribution. Suppose that an established theory postulates the value μ0 for a certain physical constant. A scientist has reasons to believe that the theory is incorrect and measures the constant n times, obtaining
measurements X1, ..., Xn. Knowledge of his instruments leads him to assume that the Xi are independent and identically distributed normal random variables with mean μ and variance σ². If any value of μ other than μ0 is a possible alternative, then it is reasonable to formulate the problem as that of testing H : μ = μ0 versus K : μ ≠ μ0. We can base a size α test on the level (1 − α) confidence interval (4.4.1) we constructed for μ as follows. We accept H if, and only if, the postulated value μ0 is a member of the level (1 − α) confidence interval

[X̄ − s t_{n−1}(1 − α/2)/√n, X̄ + s t_{n−1}(1 − α/2)/√n].

If we let T = √n(X̄ − μ0)/s, then our test accepts H if, and only if, −t_{n−1}(1 − α/2) ≤ T ≤ t_{n−1}(1 − α/2). Because P_{μ0}[|T| = t_{n−1}(1 − α/2)] = 0, the test is equivalently characterized by rejecting H when |T| > t_{n−1}(1 − α/2). This test is called two-sided because it rejects for both large and small values of the statistic T. In contrast to the one-sided tests of the earlier sections, it has power against parameter values on either side of μ0. Because the same interval (4.4.1) is used for every μ0, we see that we have, in fact, generated a family of level α tests {δ(X, μ)} where

δ(x, μ) = 1 if √n|x̄ − μ|/s > t_{n−1}(1 − α/2), and = 0 otherwise.     (4.5.1)

These tests correspond to different hypotheses, δ(X, μ0) being of size α only for the hypothesis H : μ = μ0. Conversely, if we start out with (say) the level (1 − α) LCB X̄ − t_{n−1}(1 − α)s/√n and define δ'(X, μ) to equal 1 if, and only if,

X̄ − t_{n−1}(1 − α)s/√n > μ,     (4.5.2)

we achieve a similar effect, generating a family of level α tests, δ'(X, μ0) being of size α only for the hypothesis H : μ ≤ μ0. These are examples of a general phenomenon.

Consider the general framework where the random vector X takes values in the sample space X ⊂ R^q and X has distribution P ∈ P. Let ν = ν(P) be a parameter that takes values in the set N. For instance, μ = μ(P) takes values in N = (−∞, ∞); the pair (μ, σ²) takes values in N = (−∞, ∞) × (0, ∞); and σ²(P) takes values in N = (0, ∞). For a function space example, consider ν(P) = F, where F is the distribution function of Xi; here an example of N is the class of all continuous distribution functions.

Let S = S(X) be a map from X to subsets of N. S is a (1 − α) confidence region for ν if the probability that S(X) contains ν is at least (1 − α), that is,

P[ν ∈ S(X)] ≥ 1 − α for all P ∈ P.
Next consider the testing framework where we test the hypothesis H = H_{ν0} : ν = ν0 for some specified value ν0. Formally, let P_{ν0} = {P : ν(P) = ν0}, ν0 ∈ N. Suppose we have a test δ(X, ν0) with level α. Then the acceptance region

A(ν0) = {x : δ(x, ν0) = 0}

is a subset of X with probability at least 1 − α under each P ∈ P_{ν0}. For some specified ν0, H may be accepted; for other specified ν0, H may be rejected. Consider the set of ν0 for which H_{ν0} is accepted,

S(X) = {ν0 ∈ N : X ∈ A(ν0)};

this is a random set contained in N with probability at least 1 − α of containing the true value of ν(P), whatever be P. Conversely, if S(X) is a level 1 − α confidence region for ν, then the test that accepts H_{ν0} if and only if ν0 is in S(X) is a level α test for H_{ν0}. We have the following.

Duality Theorem. Let S(X) = {ν0 ∈ N : X ∈ A(ν0)}. Then P[X ∈ A(ν0)] ≥ 1 − α for all P ∈ P_{ν0} and all ν0 if and only if S(X) is a 1 − α confidence region for ν.

We next apply the duality theorem to MLR families.

Theorem 4.5.1. Suppose X ~ P_θ where {P_θ : θ ∈ Θ}, Θ ⊂ R, is MLR in T = T(X), and suppose that the distribution function F_θ(t) of T under P_θ is continuous in each of the variables t and θ when the other is fixed. If the equation F_θ(t) = 1 − α has a solution θ̲_α(t) in Θ, then θ̲_α(T) is a lower confidence bound for θ with confidence coefficient 1 − α. Similarly, any solution θ̄_α(T) of F_θ(T) = α is an upper confidence bound for θ with confidence coefficient 1 − α. Moreover, if α1 + α2 < 1, then [θ̲_{α1}(T), θ̄_{α2}(T)] is a confidence interval for θ with confidence coefficient 1 − (α1 + α2).

Proof. By the results of Section 4.3, the acceptance region of the UMP size α test of H : θ = θ0 versus K : θ > θ0 can be written

A(θ0) = {x : T(x) ≤ t_{θ0}(1 − α)},

where t_{θ0}(1 − α) is the 1 − α quantile of F_{θ0}. Moreover, the power function P_θ(T > t) = 1 − F_θ(t) of a test with critical constant t is increasing in θ; that is, F_θ(t) is decreasing in θ. By the duality theorem, if S(t) = {θ ∈ Θ : t ≤ t_θ(1 − α)}, then S(T) is a 1 − α confidence region for θ. By applying F_θ to both sides of t ≤ t_θ(1 − α) we find

S(t) = {θ ∈ Θ : F_θ(t) ≤ 1 − α}.

Because F_θ(t) is decreasing in θ, F_θ(t) ≤ 1 − α if and only if θ ≥ θ̲_α(t), and S(t) = [θ̲_α(t), ∞). Thus, θ̲_α(T) is a lower confidence bound for θ with confidence coefficient 1 − α. The proofs for the upper confidence bound and the interval follow by the same type of argument. □

We next give connections between confidence bounds, acceptance regions, and p-values for MLR families. Let t denote the observed value t = T(x) of T(X) for the datum x, and
let α(t, θ0) = P_{θ0}(T ≥ t) = 1 − F_{θ0}(t) denote the p-value of the UMP size α test of H : θ = θ0 versus K : θ > θ0, and let A*(θ) = T(A(θ)) = {T(x) : x ∈ A(θ)}.

Corollary 4.5.1. Under the conditions of Theorem 4.5.1,

A*(θ) = {t : α(t, θ) ≥ α},  S(t) = {θ : α(t, θ) ≥ α}.

Proof. We have seen in the proof of Theorem 4.5.1 that α(t, θ) = 1 − F_θ(t) is increasing in θ; because F_θ(t) is a distribution function, α(t, θ) is nonincreasing in t. The result follows because α(t, θ) ≥ α is equivalent to F_θ(t) ≤ 1 − α. □

In general, let α(t, ν0) denote the p-value of a test δ(T, ν0) = 1{T ≥ c} of H : ν = ν0 based on a statistic T = T(X) with observed value t = T(x). Then the set

C = {(t, ν) : δ(t, ν) = 0} = {(t, ν) : α(t, ν) ≥ α}

gives the pairs (t, ν) such that, for the given t, ν will be accepted and, for the given ν, t is in the acceptance region. We call C the set of compatible (t, ν) points. In the (t, ν) plane, vertical sections of C are the confidence regions S(t), whereas horizontal sections are the acceptance regions A*(ν) = {t : δ(t, ν) = 0}. We illustrate these ideas using the example of testing H : μ = μ0 when X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known. Let T = X̄. The corresponding set of compatible points is

C = {(t, μ) : |t − μ| ≤ σ z(1 − α/2)/√n}.

Figure 4.5.1 shows the set C, a confidence region S(t0), and an acceptance set A*(μ0) for this example.

Figure 4.5.1. The shaded region is the set C of compatible points for the two-sided test of H_{μ0} : μ = μ0 in the normal model. S(t0) is a confidence interval for μ for a given value t0 of T, whereas A*(μ0) is the acceptance region for H_{μ0}.

Example 4.5.2. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. Let X1, ..., Xn be the indicators of n binomial trials with probability of success θ, and set S = Σ_{i=1}^n Xi. We shall use some of the results derived in Section 4.1. For α ∈ (0, 1), we seek reasonable exact level (1 − α) upper and lower confidence bounds and confidence intervals for θ. To find a lower confidence bound for θ, our preceding discussion leads us to consider level α tests of H : θ ≤ θ0. Let k(θ0, α) denote the critical constant of the size α test, that is, the smallest integer k such that P_{θ0}[S ≥ k] ≤ α. The corresponding level (1 − α) confidence region is

C(X1, ..., Xn) = {θ : S < k(θ, α)}.

To analyze the structure of this region we need to examine k(θ, α). We claim that

(i) k(θ, α) is nondecreasing in θ;
(ii) k(θ, α) → k(θ0, α) as θ ↑ θ0;
(iii) k(θ, α) increases by exactly 1 at its points of discontinuity;
(iv) k(0, α) = 1 and k(1, α) = n + 1.
To prove (i), note that it was shown in Theorem 4.3.1 that P_θ[S ≥ j] is nondecreasing in θ for fixed j; clearly, it is also nonincreasing in j for fixed θ. Therefore, θ1 < θ2 and k(θ1, α) > k(θ2, α) would imply that

α ≥ P_{θ2}[S ≥ k(θ2, α)] ≥ P_{θ1}[S ≥ k(θ2, α)] ≥ P_{θ1}[S ≥ k(θ1, α) − 1] > α,

a contradiction; the last inequality holds because k(θ1, α) is the smallest k with P_{θ1}[S ≥ k] ≤ α. The assertion (ii) is a consequence of the following remarks. If θ0 is a discontinuity point of k(·, α), let j be the limit of k(θ, α) as θ ↑ θ0. Then P_θ[S ≥ j] ≤ α for all θ < θ0 and, because θ ↦ P_θ[S ≥ j] is continuous, P_{θ0}[S ≥ j] ≤ α, so that k(θ0, α) ≤ j. On the other hand, because k(·, α) is nondecreasing, k(θ0, α) ≥ j; therefore, k(θ0, α) = j. The claims (iii) and (iv) are left as exercises.

From (i), (ii), (iii), and (iv) we see that, if we define

θ̲(S) = inf{θ : k(θ, α) = S + 1},

then

C(X) = (θ̲(S), 1] if S > 0, and C(X) = [0, 1] if S = 0,
and θ̲(S) is the desired level (1 − α) LCB for θ.(2) Figure 4.5.2 portrays the situation.

Figure 4.5.2. Plot of k(θ, α) for n = 2, α = 0.16.

When S > 0, k(θ̲(S), α) = S and, therefore, θ̲(S) is the unique solution of the equation

Σ_{r=S}^n (n choose r) θ^r (1 − θ)^{n−r} = α.

When S = 0, θ̲(S) = 0. Similarly, if j(θ, α) denotes the critical constant of the size α test of H : θ ≥ θ0, which rejects for small values of S, we define

θ̄(S) = sup{θ : j(θ, α) = S − 1}.

Then θ̄(S) is a level (1 − α) UCB for θ, and when S < n we find θ̄(S) as the unique solution of the equation

Σ_{r=0}^S (n choose r) θ^r (1 − θ)^{n−r} = α.

When S = n, θ̄(S) = 1. Putting the bounds θ̲(S), θ̄(S) together, we get the confidence interval [θ̲(S), θ̄(S)] of level (1 − 2α). These intervals can be obtained from computer packages that use algorithms based on the preceding considerations. As might be expected, if n is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3.
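The two defining equations can be solved numerically. A minimal sketch (not from the text), assuming SciPy; the function name is ours and a simple bracketing root finder is used.

    from scipy.stats import binom
    from scipy.optimize import brentq

    def exact_binomial_bounds(S, n, alpha=0.05):
        """Level (1 - alpha) lower and upper bounds of Example 4.5.2:
        lower solves P_theta(S' >= S) = alpha, upper solves P_theta(S' <= S) = alpha."""
        lower = 0.0 if S == 0 else brentq(
            lambda th: binom.sf(S - 1, n, th) - alpha, 1e-12, 1 - 1e-12)
        upper = 1.0 if S == n else brentq(
            lambda th: binom.cdf(S, n, th) - alpha, 1e-12, 1 - 1e-12)
        return lower, upper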
Applications of Confidence Intervals to Comparisons and Selections

We have seen that confidence intervals lead naturally to two-sided tests. However, two-sided tests seem incomplete in the sense that if H : θ = θ0 is rejected in favor of K : θ ≠ θ0, we usually want to know whether θ > θ0 or θ < θ0. Thus, the two-sided test can be regarded as the first step in a decision procedure where, if H is not rejected, we make no claims of significance, but if H is rejected, we decide whether this is because θ is smaller or larger than θ0. The problem of deciding whether θ = θ0, θ < θ0, or θ > θ0 is an example of a three-decision problem and is a special case of the decision problems of Chapter 1. Here we consider the simple solution suggested by the level (1 − α) confidence interval I:

1. Make no judgment as to whether θ < θ0 or θ > θ0 if I contains θ0.
2. Decide θ < θ0 if I is entirely to the left of θ0.
3. Decide θ > θ0 if I is entirely to the right of θ0.     (4.5.3)

For this three-decision rule, the probability of falsely claiming significance of either θ < θ0 or θ > θ0 is bounded above by α/2, as the following example shows.

Example 4.5.3. Suppose X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known. In Section 4.4 we considered the level (1 − α) confidence interval X̄ ± σ z(1 − α/2)/√n for μ. Using this interval and (4.5.3), we obtain the following three-decision rule based on T = √n(X̄ − μ0)/σ:

1. Do not reject H : μ = μ0 if |T| ≤ z(1 − α/2).
2. Decide μ < μ0 if T < −z(1 − α/2).
3. Decide μ > μ0 if T > z(1 − α/2).

Consider first the case μ ≤ μ0. The wrong decision "μ > μ0" is made when T > z(1 − α/2); this event has probability at most P_{μ0}[T > z(1 − α/2)] = α/2. Similarly, when μ ≥ μ0, the probability of the wrong decision "μ < μ0" is at most α/2. Thus, the probability of a wrong claim of significance is at most α/2 in each case.

For instance, suppose θ is the expected difference in blood pressure when two treatments, A and B, are given to high blood pressure patients. Because we do not know whether A or B is to be preferred, we test H : θ = 0 versus K : θ ≠ 0. If H is rejected, it is natural to carry the comparison of A and B further by asking whether θ < 0 or θ > 0; if we decide θ < 0, then we select A as the better treatment, and vice versa. Thus, by using this kind of procedure in a comparison or selection problem, we can control the probabilities of a wrong selection by setting the α of the parent test or confidence interval. We can use the two-sided tests and confidence intervals introduced in later chapters in similar fashions.
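The rule of Example 4.5.3 is simple to implement; a sketch (not from the text), assuming SciPy, with a function name of our choosing.

    from scipy.stats import norm

    def three_decision(xbar, mu0, sigma, n, alpha=0.05):
        """Three-decision rule (4.5.3) based on T = sqrt(n)(Xbar - mu0)/sigma."""
        T = (n ** 0.5) * (xbar - mu0) / sigma
        c = norm.ppf(1 - alpha / 2)
        if T > c:
            return "decide mu > mu0"
        if T < -c:
            return "decide mu < mu0"
        return "no judgment (do not reject H)"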
Summary. We explore the connection between tests of statistical hypotheses and confidence regions. If δ(x, ν0) is a level α test of H : ν = ν0, then the set S(x) of ν0 where δ(x, ν0) = 0 is a level (1 − α) confidence region for ν; conversely, if S(x) is a level (1 − α) confidence region for ν, then the test that accepts H : ν = ν0 when ν0 ∈ S(x) is a level α test. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. We also give a connection between confidence intervals, two-sided tests, and the three-decision problem of deciding whether a parameter θ equals θ0, is less than θ0, or is larger than θ0, where θ0 is a specified value.

4.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS

In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. If θ̲ and θ̲* are two competing level (1 − α) lower confidence bounds for θ, they are both very likely to fall below the true θ. But we also want the bounds to be close to θ. Thus, we say that the bound with the smaller probability of being far below θ is more accurate. We next show that for this notion of accuracy, which is connected to the power of the associated one-sided tests, optimality of the tests translates into accuracy of the bounds. Formally,

Definition 4.6.1. A level (1 − α) LCB θ̲* of θ is said to be more accurate than a competing level (1 − α) LCB θ̲ if, and only if, for any fixed θ and all θ' < θ,

P_θ[θ̲*(X) ≤ θ'] ≤ P_θ[θ̲(X) ≤ θ'].     (4.6.1)

Similarly, a level (1 − α) UCB θ̄* is more accurate than a competitor θ̄ if, and only if, for any fixed θ and all θ' > θ,

P_θ[θ̄*(X) ≥ θ'] ≤ P_θ[θ̄(X) ≥ θ'].     (4.6.2)

Lower confidence bounds satisfying (4.6.1) for all competitors are called uniformly most accurate, as are upper confidence bounds satisfying (4.6.2) for all competitors. Note that θ̲* is a uniformly most accurate level (1 − α) LCB for θ if and only if −θ̲* is a uniformly most accurate level (1 − α) UCB for −θ.

Example 4.6.1. Suppose X = (X1, ..., Xn) is a sample of N(μ, σ²) random variables with σ² known. A level α test of H : μ = μ0 versus K : μ > μ0 rejects H when √n(X̄ − μ0)/σ ≥ z(1 − α). The dual lower confidence bound is μ̲1(X) = X̄ − z(1 − α)σ/√n. Using the problems, we find that a competing lower confidence bound is μ̲2(X) = X_(k), where X_(1) ≤ X_(2) ≤ ... ≤ X_(n) denotes the ordered X1, ..., Xn and k is defined by P(S ≥ k) = 1 − α for a binomial, B(n, 1/2), random variable S. Which lower bound is more accurate? It does turn out that μ̲1(X) is more accurate than μ̲2(X) and is, in fact, uniformly most accurate in the N(μ, σ²) model. This is a consequence of the following theorem, which reveals that (4.6.1) is nothing more than a comparison of power functions.
Theorem 4.6.1. Let θ* be a level (1 − α) LCB for θ, a real parameter, such that for each θ0 the associated test whose critical function δ*(x, θ0) is given by

δ*(x, θ0) = 1 if θ*(x) > θ0, and = 0 otherwise,

is UMP level α for H : θ = θ0 versus K : θ > θ0. Then θ* is uniformly most accurate at level (1 − α).

Proof. Let θ̲ be a competing level (1 − α) lower confidence bound for θ. Define δ(x, θ0) = 1 if, and only if, θ̲(x) > θ0. Then δ(X, θ0) is a level α test for H : θ = θ0 versus K : θ > θ0. Because δ*(X, θ0) is UMP level α, for θ1 > θ0 we must have

E_{θ1}(δ(X, θ0)) ≤ E_{θ1}(δ*(X, θ0)), or P_{θ1}[θ̲(X) > θ0] ≤ P_{θ1}[θ*(X) > θ0],

and, hence, P_{θ1}[θ*(X) ≤ θ0] ≤ P_{θ1}[θ̲(X) ≤ θ0]. Identify θ0 with θ' and θ1 with θ in the statement of Definition 4.6.1 and the result follows. □

Most accurate upper confidence bounds are defined similarly. Uniformly most accurate (UMA) bounds turn out to have related nice properties. For instance (see the problems), they have the smallest expected "distance" to θ:

Corollary 4.6.1. Suppose θ̲*(X) is a UMA level (1 − α) lower confidence bound for θ, and let θ̲(X) be any other level (1 − α) lower confidence bound. Then

E_θ(θ − θ̲*(X))^+ ≤ E_θ(θ − θ̲(X))^+ for all θ,

where a^+ = a if a > 0, and 0 otherwise.

If we apply the theorem to Example 4.6.1, we find that μ̲1(X) = X̄ − z(1 − α)σ/√n is uniformly most accurate. However, X_(k) does have the advantage that we don't have to know σ or even the shape of the density f of Xi to apply it; also, the robustness considerations of Section 3.5 favor X_(k).

We can extend the notion of accuracy to confidence bounds for real-valued functions of an arbitrary parameter. We define q̲* to be a uniformly most accurate level (1 − α) LCB for q(θ) if, and only if, for any other level (1 − α) LCB q̲,

P_θ[q̲* ≤ q(θ')] ≤ P_θ[q̲ ≤ q(θ')] whenever q(θ') < q(θ).

Most accurate upper confidence bounds for q(θ) are defined similarly.

Example 4.6.2. Bounds for the Probability of Early Failure of Equipment. Let X1, ..., Xn be the times to failure of n pieces of equipment, where we assume that the Xi are independent E(λ) variables. We want a uniformly most accurate level (1 − α) upper confidence bound q̄* for q(λ) = 1 − e^{−λ}, the probability of early failure (failure before one unit of time) of a piece of equipment.
We begin by finding a uniformly most accurate level (1 − α) UCB λ* for λ. To find λ* we invert the family of UMP level α tests of H : λ = λ0 versus K : λ < λ0. The UMP test accepts H if

Σ_{i=1}^n Xi ≤ x_{2n}(1 − α)/(2λ0)     (4.6.3)

or, equivalently, if

λ0 ≤ x_{2n}(1 − α) / (2 Σ_{i=1}^n Xi),

where x_{2n}(1 − α) is the (1 − α) quantile of the χ²_{2n} distribution. Therefore, the confidence region corresponding to this test is (0, λ*] with λ* = x_{2n}(1 − α)/(2 Σ Xi), and λ* is, by Theorem 4.6.1, a uniformly most accurate level (1 − α) UCB for λ. Because q is strictly increasing in λ, it follows that q(λ*) is a uniformly most accurate level (1 − α) UCB for the probability of early failure.
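A sketch of the computation in Example 4.6.2 (not from the text; assumes NumPy and SciPy, function name ours).

    import numpy as np
    from scipy.stats import chi2

    def early_failure_ucb(x, alpha=0.05):
        """lambda* = chi2_{2n}(1 - alpha) / (2 * sum X_i); then q(lambda*) = 1 - exp(-lambda*)
        is a UMA level (1 - alpha) upper bound for the early-failure probability."""
        x = np.asarray(x, dtype=float)
        lam_star = chi2.ppf(1 - alpha, df=2 * len(x)) / (2 * x.sum())
        return lam_star, 1 - np.exp(-lam_star)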
Discussion. We have only considered confidence bounds; the situation with confidence intervals is more complicated. Considerations of accuracy lead us to ask that, subject to the requirement that the confidence level is (1 − α), the confidence interval be as short as possible. However, the length θ̄ − θ̲ is random, and it can be shown that in most situations there is no confidence interval of level (1 − α) that has uniformly minimum length among all such intervals. There are, however, some large sample results in this direction (see Wilks, 1962, pp. 374-376). If we turn to the expected length E_θ(θ̄ − θ̲) as a measure of precision, the situation is still unsatisfactory because, in general, there does not exist a member of the class of level (1 − α) intervals that has minimum expected length for all θ. However, as in the estimation problem, we can restrict attention to certain reasonable subclasses of level (1 − α) intervals for which members with uniformly smallest expected length exist. Neyman defines unbiased confidence intervals of level (1 − α) by the property that

P_θ[θ̲ ≤ q(θ) ≤ θ̄] ≥ P_θ[θ̲ ≤ q(θ') ≤ θ̄] for every θ, θ';

that is, the interval must be at least as likely to cover the true value of q(θ) as any other value. Pratt (1961) showed that in many of the classical problems of estimation there exist level (1 − α) confidence intervals that have uniformly minimum expected length among all level (1 − α) unbiased confidence intervals. In particular, the intervals developed in Example 4.4.1 have this property. Confidence intervals obtained from two-sided tests that are uniformly most powerful within a restricted class of procedures can also be shown to have optimality properties within restricted classes. These topics are discussed in Lehmann (1997).

Summary. By using the duality between one-sided tests and confidence bounds, we show that confidence bounds based on UMP level α tests are uniformly most accurate (UMA) level (1 − α) in the sense that, in the case of lower bounds, they are less likely than other level (1 − α) lower confidence bounds to fall below any value θ' below the true θ.

4.7 FREQUENTIST AND BAYESIAN FORMULATIONS

We have so far focused on the frequentist formulation of confidence bounds and intervals, where the data X ∈ X ⊂ R^q are random while the parameters are fixed but unknown. A consequence of this approach is that once a numerical interval has been computed from experimental data, no probability statement can be attached to this interval. Instead, the interpretation of a 100(1 − α)% confidence interval is that if we repeated an experiment indefinitely, each time computing a 100(1 − α)% confidence interval, then 100(1 − α)% of the intervals would contain the true unknown parameter value. In the Bayesian formulation of Section 1.2, what are called level (1 − α) credible bounds and intervals are subsets of the parameter space that are given probability at least (1 − α) by the posterior distribution of the parameter given the data.

Definition 4.7.1. Let Π(· | x) denote the posterior probability distribution of θ given X = x, θ ∈ Θ ⊂ R. Then θ̲ and θ̄ are level (1 − α) lower and upper credible bounds for θ if they respectively satisfy

Π(θ̲ ≤ θ | x) ≥ 1 − α,  Π(θ ≤ θ̄ | x) ≥ 1 − α.

For a credible region, it is natural to consider the collection of θ that is "most likely" under the posterior distribution. Let π(· | x) denote the posterior density of θ given X = x.

Definition 4.7.2. The set C_k = {θ : π(θ | x) ≥ k} is called a level (1 − α) credible region for θ if Π(C_k | x) ≥ 1 − α. If π(θ | x) is unimodal, then C_k will be an interval of the form [θ̲, θ̄].

We next give such an example.

Example 4.7.1. Suppose that, given μ, X1, ..., Xn are i.i.d. N(μ, σ0²) with σ0² known, and that μ ~ N(μ0, τ0²), with μ0 and τ0² known. Then, from Example 1.1.12, the posterior distribution of μ given X1, ..., Xn is N(μ_B, σ_B²), with

μ_B = (n x̄ + (σ0²/τ0²) μ0) / (n + σ0²/τ0²),  σ_B² = σ0² / (n + σ0²/τ0²).

It follows that the level 1 − α lower and upper credible bounds for μ are

μ̲ = μ_B − z(1 − α) (σ0/√n)(1 + σ0²/(n τ0²))^{−1/2},  μ̄ = μ_B + z(1 − α) (σ0/√n)(1 + σ0²/(n τ0²))^{−1/2},

while the level (1 − α) credible interval is [μ−, μ+] with

μ± = μ_B ± z(1 − α/2) (σ0/√n)(1 + σ0²/(n τ0²))^{−1/2}.
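A sketch of the credible interval of Example 4.7.1 (not from the text; assumes NumPy and SciPy, function name ours).

    import numpy as np
    from scipy.stats import norm

    def normal_credible_interval(x, mu0, tau0_sq, sigma0_sq, alpha=0.05):
        """Posterior N(mu_B, sigma_B^2) of Example 4.7.1 and the level 1 - alpha
        credible interval mu_B +/- z(1 - alpha/2) * sigma_B."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        eta = sigma0_sq / tau0_sq
        mu_B = (n * x.mean() + eta * mu0) / (n + eta)
        sigma_B = np.sqrt(sigma0_sq / (n + eta))
        z = norm.ppf(1 - alpha / 2)
        return mu_B - z * sigma_B, mu_B + z * sigma_B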
Compared to the frequentist interval X̄ ± z(1 − α/2)σ0/√n, the center μ_B of the Bayesian interval is pulled in the direction of μ0, where μ0 is a prior guess of the value of μ, and the interval is a little narrower. See Chapter 1 for sources of such prior guesses. Note that as τ0 → ∞, the Bayesian interval tends to the frequentist interval. However, the interpretations of the intervals are different: in the frequentist confidence interval, the probability of coverage is computed with the data X random and θ fixed, whereas in the Bayesian credible interval, the probability of coverage is computed with X = x fixed and θ random with probability distribution Π(θ | X = x).

Example 4.7.2. Suppose that, given θ, X1, ..., Xn are i.i.d. N(μ0, σ²), where μ0 is known, and let λ = σ^{−2}. Suppose λ has the gamma Γ(a/2, b/2) prior density, where a > 0, b > 0 are known parameters. Then, by Problem 1.2.12, given X1, ..., Xn, (t + b)λ has a χ²_{a+n} distribution, where t = Σ(Xi − μ0)². Let x_{a+n}(α) denote the αth quantile of the χ²_{a+n} distribution. Then

λ̲ = x_{a+n}(α)/(t + b)

is a level (1 − α) lower credible bound for λ, and

σ̄² = (t + b)/x_{a+n}(α)

is a level (1 − α) upper credible bound for σ². Compared to the frequentist bound (n − 1)s²/x_{n−1}(α) of Example 4.4.2, the Bayesian bound is shifted in the direction of the reciprocal b/a of the mean of the prior π(λ).

We shall analyze Bayesian credible regions further in Chapter 5.

Summary. In the Bayesian framework we define bounds and intervals, called level (1 − α) credible bounds and intervals, that determine subsets of the parameter space that are assigned probability at least (1 − α) by the posterior distribution of the parameter θ given the data x. In the case of a normal prior π(θ) and normal model p(x | θ), the level (1 − α) credible interval is similar to the frequentist interval except that it is pulled in the direction μ0 of the prior mean and it is a little narrower.

4.8 PREDICTION INTERVALS

In Section 1.4 we discussed situations in which we want to predict the value of a random variable Y. In addition to a point prediction of Y, it is often desirable to give an interval [Y̲, Ȳ] that contains the unknown value Y with prescribed probability (1 − α). For instance, a doctor administering a treatment with delayed effect will give patients a time interval [t̲, t̄] in which the treatment is likely to take effect; similarly, we may want an interval for the
future GPA of a student or a future value of a portfolio.

Example 4.8.1. The (Student) t Prediction Interval. As in Example 4.4.1, let X1, ..., Xn be i.i.d. as X ~ N(μ, σ²). We want a prediction interval for Y = X_{n+1}, which is assumed to be also N(μ, σ²) and independent of X1, ..., Xn. Let Ŷ = Ŷ(X) denote a predictor based on X = (X1, ..., Xn). Then Y and Ŷ are independent and the mean squared prediction error (MSPE) of Ŷ is

E(Ŷ − Y)² = MSE(Ŷ) + σ²,

where MSE denotes the estimation theory mean squared error. Note that Ŷ can be regarded both as a predictor of Y and as an estimate of μ, and, when we do so, the optimal estimator when it exists is also the optimal predictor. In Chapter 3 we found that in the class of unbiased estimators, X̄ is the optimal estimator. We define a predictor Y* to be prediction unbiased for Y if E(Y* − Y) = 0, and we can conclude that, in the class of prediction unbiased predictors, the optimal MSPE predictor is Ŷ = X̄.

We next use the prediction error Ŷ − Y to construct a pivot that can be used to give a prediction interval. Note that Ŷ − Y = X̄ − X_{n+1} ~ N(0, [n^{−1} + 1]σ²). It follows that

Z_p(Y) = (Ŷ − Y) / (σ √(1 + n^{−1}))

has a N(0, 1) distribution and is independent of V = (n − 1)s²/σ², which has a χ²_{n−1} distribution; here s² = (n − 1)^{−1} Σ(Xi − X̄)² is independent of X̄ by Theorem B.3.3 and independent of X_{n+1} by assumption. Thus, by the definition of the (Student) t distribution in Section B.3,

T_p(Y) = (Ŷ − Y) / (s √(1 + n^{−1}))

has the T_{n−1} distribution. Note that T_p(Y) acts as a prediction interval pivot in the same way that T(μ) acts as a confidence interval pivot in Example 4.4.1. By solving −t_{n−1}(1 − α/2) ≤ T_p(Y) ≤ t_{n−1}(1 − α/2) for Y, we find the (1 − α) prediction interval

Y = X̄ ± s √(1 + n^{−1}) t_{n−1}(1 − α/2).     (4.8.1)

Note that the prediction interval is much wider than the confidence interval (4.4.1). In fact, it can be shown using the methods of Chapter 5 that the width of the confidence interval (4.4.1) tends to zero in probability at the rate n^{−1/2}, whereas the width of the prediction interval tends to 2σ z(1 − α/2) as n → ∞. Moreover, the confidence level of (4.4.1) is approximately correct for large n even if the sample comes from a nonnormal distribution, whereas the level of the prediction interval (4.8.1) is not (1 − α) in the limit as n → ∞ for samples from non-Gaussian distributions.
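A sketch of the interval (4.8.1) (not from the text; assumes NumPy and SciPy, function name ours).

    import numpy as np
    from scipy.stats import t

    def t_prediction_interval(x, alpha=0.05):
        """Interval (4.8.1): Xbar +/- s*sqrt(1 + 1/n)*t_{n-1}(1 - alpha/2)."""
        x = np.asarray(x, dtype=float)
        n, xbar, s = len(x), x.mean(), x.std(ddof=1)
        c = t.ppf(1 - alpha / 2, df=n - 1) * s * np.sqrt(1 + 1 / n)
        return xbar - c, xbar + c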
We next give a prediction interval that is valid for samples from any population with a continuous distribution.

Example 4.8.2. Suppose X1, ..., X_{n+1} are i.i.d. as X ~ F, where F is a continuous distribution function with positive density f on (a, b), −∞ ≤ a < b ≤ ∞. Here X1, ..., Xn are observable and X_{n+1} is to be predicted; we want a prediction interval for Y = X_{n+1} ~ F. Let X_(1) < ... < X_(n) denote the order statistics of X1, ..., Xn. Set Ui = F(Xi), i = 1, ..., n + 1; then U1, ..., U_{n+1} are i.i.d. uniform, U(0, 1). Let U_(1) < ... < U_(n) be U1, ..., Un ordered. Then

P(X_(j) ≤ X_{n+1} ≤ X_(k)) = P(U_(j) ≤ U_{n+1} ≤ U_(k))
= ∫∫ P(u ≤ U_{n+1} ≤ v | U_(j) = u, U_(k) = v) dH(u, v)
= ∫∫ (v − u) dH(u, v) = E(U_(k)) − E(U_(j)),

where H is the joint distribution of U_(j) and U_(k). Because E(U_(i)) = i/(n + 1), i = 1, ..., n (see the problems in Appendix B), we obtain

P(X_(j) ≤ X_{n+1} ≤ X_(k)) = (k − j)/(n + 1).     (4.8.2)

It follows that [X_(j), X_(k)] with k = n + 1 − j is a level 1 − α = (n + 1 − 2j)/(n + 1) prediction interval for X_{n+1}. This interval is a distribution-free prediction interval. See the problems for a simpler proof of (4.8.2).
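A sketch of the distribution-free interval of Example 4.8.2 (not from the text; assumes NumPy, function name ours).

    import numpy as np

    def distribution_free_prediction_interval(x, j):
        """[X_(j), X_(n+1-j)]; by (4.8.2) its level is (n + 1 - 2j)/(n + 1).
        Requires 1 <= j <= n/2 (1-based order statistics)."""
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        level = (n + 1 - 2 * j) / (n + 1)
        return x[j - 1], x[n - j], level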
Bayesian Predictive Distributions

Suppose that θ is random with θ ~ π and that, given θ, X1, ..., X_{n+1} are i.i.d. with density or frequency function p(x | θ). Here X1, ..., Xn are observable and X_{n+1} is to be predicted. The posterior predictive distribution Q(· | x) of X_{n+1} is defined as the conditional distribution of X_{n+1} given x = (x1, ..., xn); in the continuous case it has density

q(x_{n+1} | x) = ∫ p(x_{n+1} | θ) π(θ | x) dθ,

with a sum replacing the integral in the discrete case. Now [Y̲_B, Ȳ_B] is said to be a level (1 − α) Bayesian prediction interval for Y = X_{n+1} if

Q(Y̲_B ≤ Y ≤ Ȳ_B | x) ≥ 1 − α.

Example 4.8.3. Consider Example 3.2.1 where (Xi | θ) ~ N(θ, σ0²), σ0² known, and π(θ) is N(η0, τ²), τ² known. A sufficient statistic based on the observables X1, ..., Xn is T = X̄ = n^{−1} Σ_{i=1}^n Xi. Write X_{n+1} = (X_{n+1} − θ) + θ, where, given θ, X_{n+1} − θ and θ are independent. Note that

E[(X_{n+1} − θ)θ] = E{E[(X_{n+1} − θ)θ | θ]} = E{θ E(X_{n+1} − θ | θ)} = 0,

so that X_{n+1} − θ and θ are uncorrelated. To obtain the predictive distribution, note that, given X̄ = t, X_{n+1} − θ and θ are still uncorrelated and, by the joint normality of the distributions involved (Section B.4), independent. Thus,

L{X_{n+1} | X̄ = t} = L{(X_{n+1} − θ) + θ | X̄ = t} = N(μ_B, σ0² + σ_B²),

where, from Example 4.7.1 with μ0 replaced by η0 and τ0² by τ²,

μ_B = (n t + (σ0²/τ²) η0) / (n + σ0²/τ²),  σ_B² = σ0² / (n + σ0²/τ²).

It follows that a level (1 − α) Bayesian prediction interval for Y = X_{n+1} is [Y_B^−, Y_B^+] with

Y_B^± = μ_B ± z(1 − α/2) √(σ0² + σ_B²).     (4.8.3)

To consider the frequentist properties of the Bayesian prediction interval (4.8.3), we compute its probability limit under the assumption that X1, ..., Xn are i.i.d. N(θ, σ0²). Because σ_B² → 0 and μ_B → θ in probability as n → ∞, the interval (4.8.3) converges in probability to θ ± z(1 − α/2)σ0. This is the same as the probability limit of the frequentist interval (4.8.1).

The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box, 1983).

Summary. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1 − α). In the case of a normal sample of size n + 1 with only n variables observable, we construct the Student t prediction interval for the unobservable variable. For a sample of size n + 1 from a continuous distribution we show how the order statistics can be used to give a distribution-free prediction interval. The Bayesian formulation is based on the posterior predictive distribution, which is the conditional distribution of the unobservable variable given the observable variables. The Bayesian prediction interval is derived for the normal model with a normal prior.

4.9 LIKELIHOOD RATIO PROCEDURES

4.9.1 Introduction

Up to this point, the results and examples in this chapter deal mostly with one-parameter problems in which it sometimes is possible to find optimal procedures. However,
even in the case in which θ is one-dimensional, optimal procedures may not exist. For instance, if X1, ..., Xn is a sample from a N(μ, σ²) population with σ² known, there is no UMP test for testing H : μ = μ0 versus K : μ ≠ μ0. To see this, let T = √n(X̄ − μ0)/σ and note that it follows from Section 4.2 that if μ1 > μ0, the MP level α test φ_α(X) rejects H for T ≥ z(1 − α), whereas if μ1 < μ0, the MP level α test δ_α(X) rejects H for T ≤ z(α). Because δ_α(x) ≠ φ_α(x), by the uniqueness of the NP test (Theorem 4.2.1(c)), there can be no UMP test of H : μ = μ0 versus K : μ ≠ μ0.

In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6; for large samples, likelihood ratio tests also have weak optimality properties to be discussed there.

We start with a generalization of the Neyman-Pearson statistic p(x, θ1)/p(x, θ0). Suppose that X = (X1, ..., Xn) has density or frequency function p(x, θ) and that we wish to test H : θ ∈ Θ0 versus K : θ ∈ Θ1. The test statistic we want to consider is the likelihood ratio given by

L(x) = sup{p(x, θ) : θ ∈ Θ1} / sup{p(x, θ) : θ ∈ Θ0}.

Tests that reject H for large values of L(x) are called likelihood ratio tests. To see that this is a plausible statistic, recall from Section 2.2 that we think of the likelihood function L(θ, x) = p(x, θ) as a measure of how well θ "explains" the given sample x = (x1, ..., xn). So, if sup{p(x, θ) : θ ∈ Θ1} is large compared to sup{p(x, θ) : θ ∈ Θ0}, then the observed sample is best explained by some θ ∈ Θ1, and conversely. Note that L(x) coincides with the optimal test statistic p(x, θ1)/p(x, θ0) when Θ0 = {θ0} and Θ1 = {θ1}. Also note that, in general, λ(x) = max(L(x), 1). In the cases we shall consider, p(x, θ) is a continuous function of θ and Θ0 is of smaller dimension than Θ = Θ0 ∪ Θ1, so that the likelihood ratio test is equivalent to the test based on the statistic

λ(x) = sup{p(x, θ) : θ ∈ Θ} / sup{p(x, θ) : θ ∈ Θ0},     (4.9.1)

whose computation is often simple.

We are going to derive likelihood ratio tests in several important testing problems. Although the calculations differ from case to case, the basic steps are always the same.

1. Calculate the MLE θ̂ of θ.
2. Calculate the MLE θ̂0 of θ, where θ may vary only over Θ0.
3. Form λ(x) = p(x, θ̂)/p(x, θ̂0).
4. Find a function h that is strictly increasing on the range of λ such that h(λ(X)) has a simple form and a tabled distribution under H.

Because h(λ(X)) is equivalent to λ(X), in particular cases, and for large samples, we specify the size α likelihood ratio test through the test statistic h(λ(X)) and its (1 − α)th quantile obtained from the table.
We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions. In that case, we can invert the family of size α likelihood ratio tests of the point hypothesis H : θ = θ0 and obtain the level (1 − α) confidence region

C(x) = {θ : p(x, θ) ≥ [c(θ)]^{−1} sup_{θ' ∈ Θ} p(x, θ')},     (4.9.2)

where the critical constant c(θ) satisfies

P_θ[ sup_{θ' ∈ Θ} p(X, θ') / p(X, θ) ≥ c(θ) ] = α.

It is often approximately true (see Chapter 6) that c(θ) is independent of θ. In that case, C(x) is just the set of all θ whose likelihood is on or above some fixed value dependent on the data.

This section also includes situations in which θ = (θ1, θ2), where θ1 is the parameter of interest and θ2 is a nuisance parameter. We shall obtain likelihood ratio tests for hypotheses of the form H : θ1 = θ10, which are composite because θ2 can vary freely. The family of such level α likelihood ratio tests obtained by varying θ10 can also be inverted to yield confidence regions for θ1. An example is discussed later in this section. Here are some examples.

4.9.2 Tests for the Mean of a Normal Distribution - Matched Pair Experiments

Suppose X1, ..., Xn form a sample from a N(μ, σ²) population in which both μ and σ² are unknown. An important class of situations for which this model may be appropriate occurs in matched pair experiments. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age, diet, and other factors. In order to reduce differences due to these extraneous factors, we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors; we can regard twins as being matched pairs. After the matching, the experiment proceeds as follows. In the ith pair one patient is picked at random (i.e., with probability 1/2) and given the treatment, while the second patient serves as control and receives a placebo. Response measurements are taken on the treated and control members of each pair. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo, sales performance before and after a course in salesmanship, mileage of cars with and without a certain ingredient or adjustment, and so on. Studies in which subjects serve as their own control can also be thought of as matched pair experiments; that is, we measure the response of a subject when under treatment and when not under treatment.

We are interested in expected differences in responses due to the treatment effect. Let Xi denote the difference between the treated and control responses for the ith pair. If the treatment and placebo have the same effect, the difference Xi has a distribution that is
symmetric about zero. We think of μ = E(X1), the mean difference between the responses of the treated and control subjects, as representing the treatment effect. Our null hypothesis of no treatment effect is then H : μ = 0. However, for the purpose of referring to the duality between testing and confidence procedures, we test H : μ = μ0, where we think of μ0 as an established standard for an old treatment. To this we have added the normality assumption. The test we derive will still have desirable properties in an approximate sense, to be discussed in Chapter 5, if the normality assumption is not satisfied.

Two-Sided Tests

We begin by considering K : μ ≠ μ0. This corresponds to the alternative "The treatment has some effect, good or bad." As discussed in Section 4.5, the test can be modified into a three-decision rule that decides whether there is a significant positive or negative effect.

Form of the Two-Sided Tests

Let θ = (μ, σ²). Then Θ0 = {(μ, σ²) : μ = μ0, σ² > 0}, and the likelihood ratio statistic is

λ(x) = sup{p(x, θ) : θ ∈ Θ} / sup{p(x, θ) : θ ∈ Θ0}.

The problem of finding sup{p(x, θ) : θ ∈ Θ} has already been solved: the supremum is p(x, θ̂), where

θ̂ = (x̄, σ̂²),  σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)²,

is the maximum likelihood estimate of θ. Finding sup{p(x, θ) : θ ∈ Θ0} boils down to finding the maximum likelihood estimate σ̂0² of σ² when μ = μ0 is known and then evaluating p(x, (μ0, σ̂0²)). The likelihood equation is

(∂/∂σ²) log p(x, θ) = (1/2)[σ^{−4} Σ_{i=1}^n (xi − μ0)² − n σ^{−2}] = 0,

which has the immediate solution

σ̂0² = (1/n) Σ_{i=1}^n (xi − μ0)².
. A proof is sketched in Problem 4. ITnl Z 2. However.Section 4. and only if.  {~[(log271') + (log&5)]~} Our test rule.8) logp(x.9. For instance. Therefore. OneSided Tests The twosided formulation is natural if two treatments. the testing problem H : M ::.4. In Problem 4. suppose n = 25 and we want a = 0. &0)) ~ 2 {~[(log271') + (log&2)]~} ~ log(&5/&2). if we are comparing a treatment and control.064.9. Then we would reject H if. or Table III. are considered to be equal before the experiment is performed.1)1 I:(Xi . &5 gives the maximum of p(x. the likelihood ratio tests reject for large values of ITn I. To simplify the rule further we use the following equation.9 Likelihood Ratio Procedures 259 By Theorem 2. Mo versus K : M > Mo (with Mo = 0) is suggested. is of size a for H : M ::. Mo .MO)2/&2. 8) for 8 E 8 0. A and B. therefore. rejects H for iarge values of (&5/&2). (Mo. which thus equals log .11 we argue that P"[Tn Z t] is increasing in 8. the size a likelihood ratio test for H : M Z Mo versus K : M < Mo rejects H if. the relevant question is whether the treatment creates an improvement.\(x) is equivalent to log . &5/&2 (&5/&2) is monotone increasing Tn = y'n(x . and only if. Therefore.3.05.x)2 = n&2/(n .1). to find the critical value.\(x) logp(x. Thus.1). The test statistic . Because 8 2 function of ITn 1 where = 1 + (x .Mo)/a.2. Similarly. the size a critical value is t n 1 (1 . Because Tn has a T distribution under H (see Example 4. the test that rejects H for Tn Z t nl(1 . The statistic Tn is equivalent to the likelihood ratio statistic A for this problem. where 8 = (M . which can be established by expanding both sides.a).\(x).1. (n .~a) and we can use calculators or software that gives quantiles of the t distribution. 8 Therefore.Mo) .
Likelihood Confidence Regions If we invert the twosided tests. is possible (Lehmann. with 8 = fo(/L . is by definition the distribution of Z / . 1) distribution. / (n1)s2 /u 2 n1 V has a 'Tn1.b. This distribution. respectively. To derive the distribution of Tn. thus.n(X . fo( X . bring the power arbitrarily close to 0:. and the power can be obtained from computer software or tables of the noncentral t distribution. We have met similar difficulties (Problem 4.12.3. We can control both probabilities of error by selecting the sample size n large provided we consider alternatives of the form 181 ~ 81 > 0 in the twosided case and 8 ~ 81 or 8 :S 81 in the onesided cases. note that from Section B.9. by making a sufficiently large we can force the noncentrality parameter 8 = fo(/L .b distribution. we know that fo(X /L)/a and (n ./Lo)/a .11) just as the power functions of the corresponding tests of Example 4. say. Computer software will compute n. however.l distribution. The density of Z/ . we can no longer control both probabilities of error by choosing the sample size. denoted by Tk./Lol ~ ~. ( 2) only through 8./Lo) / a has N (8. Problem 17).7) when discussing confidence intervals. . p. we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter 8./Lo) / a. If we consider alternatives of the form (/L /Lo) ~ ~. the ratio fo(X . A Stein solution. The reason is that.260 Power Functions and Confidence 4 To discuss the power of these tests.4. 1) and X~ distributions./Lo)/a) = 1.4. With this solution.1 are monotone in fo/L/ a. we obtain the confidence region We recognize C(X) as the confidence interval of Example 4.4. whatever be n.1.jV/ k where Z and V are independent and have N(8.1. 1997.jV/k is given in Problem 4. Thus. 260.1)s2/a 2 are independent and that (n 1)s2/a 2 has a X. we may be required to take more observations than we can afford on the second stage. Because E[fo(X /Lo)/a] = fo(/L .. in which we estimate a for a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with I/L./Lo) / a as close to 0 as we please and. Similarly the onesided tests lead to the lower and upper confidence bounds of Example 4. The power functions of the onesided tests are monotone in 8 (Problem 4./Lo)/a and Yare .9.2. Note that the distribution of Tn depends on () (/L.
Data Example

As an illustration of these procedures, consider the following data due to Cushny and Peebles (see Fisher, 1958, p. 121) giving the difference B - A in sleep gained using drugs A and B on 10 patients. This is a matched pair experiment with each subject serving as its own control.

Patient i    1     2     3     4     5     6     7     8     9    10
A           0.7  -1.6  -0.2  -1.2  -0.1   3.4   3.7   0.8   0.0   2.0
B           1.9   0.8   1.1   0.1  -0.1   4.4   5.5   1.6   4.6   3.4
B - A       1.2   2.4   1.3   1.3   0.0   1.0   1.8   0.8   4.6   1.4

If we denote the differences as x's, then x̄ = 1.58, s² = 1.513, and |Tn| = 4.06. Because t9(0.995) = 3.25, we conclude at the 1% level of significance that the two drugs are significantly different. The 0.99 confidence interval for the mean difference μ between treatments is [0.32, 2.84]. It suggests that not only are the drugs different but in fact B is better than A, because no hypothesis μ = μ' ≤ 0 is accepted at this level.

4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations

We often want to compare two populations with distributions F and G on the basis of two independent samples X1, ..., Xn1 and Y1, ..., Yn2, one from each population. For instance, suppose we want to test the effect of a certain drug on some biological variable (e.g., blood pressure). Then X1, ..., Xn1 could be blood pressure measurements on a sample of patients given a placebo, while Y1, ..., Yn2 are the measurements on a sample given the drug. In the control versus treatment example, this is the problem of determining whether the treatment has any effect.

For quantitative measurements such as blood pressure, height, weight, length, volume, temperature, and so forth, it is usually assumed that X1, ..., Xn1 and Y1, ..., Yn2 are independent samples from N(μ1, σ²) and N(μ2, σ²) populations, respectively. The preceding assumptions were discussed in Example 1.1.3. A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.

Tests

We first consider the problem of testing H : μ1 = μ2 versus K : μ1 ≠ μ2. Let θ = (μ1, μ2, σ²). Then Θ0 = {θ : μ1 = μ2} and Θ1 = {θ : μ1 ≠ μ2}. The log of the likelihood of (X, Y) = (X1, ..., Xn1, Y1, ..., Yn2) is

-(n/2) log(2πσ²) - (1/2σ²)[ Σ_{i=1}^{n1} (xi - μ1)² + Σ_{j=1}^{n2} (yj - μ2)² ],

where n = n1 + n2. As in Section 4.9.2, the likelihood function and its log are maximized over Θ by the maximum likelihood estimate θ̂ = (x̄, ȳ, σ̂²), where σ̂² = n⁻¹[ Σ_i (xi - x̄)² + Σ_j (yj - ȳ)² ] (see the problems for this section).
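Anticipating the statistic T derived in the next few paragraphs, which reduces the likelihood ratio test to the pooled two-sample t test, here is a minimal computational sketch. It assumes Python with numpy and scipy, the function name is illustrative only, and it uses the log-permeability data from the data example later in this section.

```python
import numpy as np
from scipy import stats

def two_sample_t(x, y, alpha=0.05):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    # pooled variance estimate with n1 + n2 - 2 degrees of freedom
    s2 = (((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()) / (n1 + n2 - 2)
    T = (y.mean() - x.mean()) / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
    crit = stats.t.ppf(1 - alpha / 2, df=n1 + n2 - 2)   # t_{n-2}(1 - alpha/2)
    return T, crit, abs(T) >= crit

# Log-permeability data from the data example later in this section:
x = [1.845, 1.790, 2.042]    # machine 1
y = [1.583, 1.627, 1.282]    # machine 2
T, crit, reject = two_sample_t(x, y)      # |T| is about 2.98 and t_4(0.975) = 2.776
# scipy's pooled-variance two-sample t test gives the same statistic:
T_check, p_value = stats.ttest_ind(y, x, equal_var=True)
```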
X) + (X  j1)]2 and expanding.2 (y.3 Tn2 1 . By Theorem B.2. we find that the is equivalent to the test statistic ITI where and To complete specification of the size a likelihood ratio test we show that T has distribution when ttl = tt2.9.2 (Xi . the maximum of p over 8 0 is obtained for f) (j1.3. (7 ). Y.Y) '1':" a a a i=l a j=l J I .j112 log likelihood ratio statistic [(Xi .262 (X.1 I:rtl . where and If we use the identities ~ fCYi ?l i=l j1)2 1 n fp'i i=l y)2 + n2 (Y _ j1)2 n obtained by writing [Xi . Y. (75).2 X. our model reduces to the onesample model of Section 4. . Thus. where 2 Testing and Confidence Regions Chapter 4 When ttl tt2 = tt. j1.X) and .2 1 I:rt2 . .1 .
corresponding to the twosided test.9 likelihood Ratio Procedures 263 are independent and distributed asN(ILI/o. and only if.2)8 2 /0. twosample t test rejects if. 1L2. . Confidence Intervals To obtain confidence intervals for 1L2 .ILl = Ll versus K : 1L2 . we find a simple equivalent statistic IT(Ll)1 where If 1L2 ILl = Ll.2 has a X~2 distribution.Ll. this follows from the fact that. by definition. T and the resulting twosided. for H : 1L2 ::. T( Ll) has a Tn2 distribution and inversion ofthe tests leads to the interval (4.ILl.N(1L2/0. there are two onesided tests with critical regions.9.2)8 2/0. It is also true that these procedures are of size Q for their respective hypotheses. X~ll' X~2l' respectively. 1/nI). We conclude from this remark and the additive property of the X2 distribution that (n .2 • Therefore. T has a noncentral t distribution with noncentrality parameter. As in the onesample case.Section 4. As for the special case Ll = 0. Similarly.1L2.3) for 1L2 .2 under H bution and is independent of (n . onesided tests lead to the upper and lower endpoints of the interval as 1 . if ILl #.ILl #. 1/n2). f"V As usual. We can show that these tests are likelihood ratio tests for these hypotheses. 1) distriTn .~ Q likelihood confidence bounds.has a N(O. IL 1 and for H : ILl ::.ILl we naturally look at likelihood ratio tests for the family of testing problems H : 1L2 .X) /0. and that j nl n2/n(Y .
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results in terms of logarithms were (from Hald, 1952, p. 472)

x (machine 1):  1.845   1.790   2.042
y (machine 2):  1.583   1.627   1.282

We test the hypothesis H of no difference in expected log permeability. Here x̄ - ȳ = 0.395, s² = 0.0264, and |T| = 2.98. H is rejected if |T| ≥ t_{n-2}(1 - α/2). Because t4(0.975) = 2.776, we conclude, if normality holds, at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is 0.395 ± 0.368. On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level (1 - α) confidence interval has probability at most α/2 of making the wrong selection.

4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the X's and Y's have different variances. For instance, a treatment that increases mean response may also increase the variance of the responses. As a first step we may still want to compare mean responses for the X and Y populations. Thus, we are led to a model where X1, ..., Xn1 and Y1, ..., Yn2 are two independent N(μ1, σ1²) and N(μ2, σ2²) samples, respectively. This is the Behrens-Fisher problem.

Suppose first that σ1² and σ2² are known. The log likelihood, except for an additive constant, is

-(1/2σ1²) Σ_{i=1}^{n1} (xi - μ1)² - (1/2σ2²) Σ_{j=1}^{n2} (yj - μ2)²

for (μ1, μ2) ∈ R × R. The MLEs of μ1 and μ2 are, respectively, μ̂1 = x̄ and μ̂2 = ȳ. When μ1 = μ2 = μ, setting the derivative of the log likelihood equal to zero yields the MLE

μ̂ = (n1 x̄/σ1² + n2 ȳ/σ2²) / (n1/σ1² + n2/σ2²).
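When σ1² and σ2² must be estimated, the development in the following paragraphs leads to the statistic D/s_D with Welch's approximation for its null distribution. A minimal computational sketch of that procedure follows; it assumes Python with numpy and scipy, and the function name is illustrative only.

```python
import numpy as np
from scipy import stats

def welch_test(x, y, alpha=0.05):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2    # s1^2/n1 and s2^2/n2
    D = y.mean() - x.mean()
    s_D = np.sqrt(v1 + v2)                             # estimated sd of D = Ybar - Xbar
    c = v1 / (v1 + v2)
    k = 1.0 / (c ** 2 / (n1 - 1) + (1 - c) ** 2 / (n2 - 1))   # Welch's approximate df
    crit = stats.t.ppf(1 - alpha / 2, df=k)
    return D / s_D, k, abs(D / s_D) >= crit

# scipy's unequal-variance option implements the same approximation:
# stats.ttest_ind(y, x, equal_var=False)
```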
t2 :::.t2 must be estimated. DiaD has aN(I:::. y) Next we compute = exp {2~~ eM  x)2 + 2:'~ (Ii  y?}. It follows that "\(x. n2 an approximation to the distribution of (D .y)/(nl + .fi)2 we obtain \(x.) I 8 D depends on at I a~ for fixed nl.x)/(nl + . = J.X)2 + nl(j1. it Thus. j1x = = nl .fi = n2(x .n2)' It follows that the likelihood ratio test is equivalent to the statistic IDl/aD. n2. For large nl. 2 8y 8~ .t2.t1 = J.j1)2 j=1 = L(Yi .y) By writing nl L(Xi .3. 8D=+' nl n2 It is natural to try and use D I 8 D as a test statistic for the onesided hypothesis H : J. j1.1:::. 1) distribution.Section 4.. that is Yare X) + Var(Y) = nl + n2 . IDI/8D as a test statistic for H : J.) 18 D due to Welch (1949) works well. J. where I:::. Unfortunately the distribution of (D 1:::. BecauseaJy is unknown..t1. For small and moderate nI. n2... (D 1:::. and more generally (D .)18D has approximately a standard normal distribution (Problem 5..9 Likelihood Ratio Procedures J 265 where.28). = at I a~. by Slutsky's theorem and the central limit theorem.1:::.)18D to generate confidence procedures. where D = Y .t1.n2)' Similarly..fi? j=1 + n2(j1.X and aJy is the variance of D.x)2 i=1 n2 i=1 n2 L(Yi .. An unbiased estimate is J.n2(fi .
Y = blood pressure.28. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the BehrensFisher problem. X percentage of fat in score on mathematics exam. mice.3. .3 and Problem 5.1. The LR procedure derived in Section 4. . Wang (1971) has shown the approximation to be very good for Q: = 0. uI 4. See Figure 5. Y = age at death. X = average cigarette consumption per day in grams. can unfortunately be very misleading if =1= u~ and nl =1= n2. Note that Welch's solution works whether the variances are equal or not.u~. TwoSided Tests . the maximum error in size being bounded by 0.9. Yd. with > 0. Some familiar examples are: X test score on English exam.01.05 and Q: 0. Confidence Intervals for p The question "Are two random variables X and Y independent?" arises in many staweight. Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X.266 Testing and Confidence Regions Chapter 4 Let c = Sf/nlsb' Then Welch's approximation is Tk where k c2 [ nl 1 (1 C)2]1 + n2 . Y cholesterol level in blood. (Xn' Yn). ur Testing Independence.5 likelihood Ratio Procedures for Bivariate Normal Distributions If n subjects (persons.. etc. Y) have a joint bivariate normal distribution. fields.. . u~ > O. N(j1.3. which works well if the variances are equal or nl = n2.9.1 When k is not an integer. If we have a sample as before and assume the bivariate normal model for (X. Y). machines. p). our problem becomes that of testing H : p O.3.003. X test tistical studies.2 . then we end up with a bivariate random sample (XI. Y diet.J1.) are sampled from a population and two numerical characteristics are measured on each case. the critical value is obtained by linear interpolation in the t tables or using computer software.
Section 4.4) = Thus.X . p::. Because (U1. .8). (j~. 0. V n ) is a sample from the N(O.Jll)/O"I. 1 (Problem 2. (4. . 0) and the log of the likelihood ratio statistic becomes log A(X) (4.4. ii. (Un. See Example 5. y.0"2 = .~( Yi . then by Problem B. . the twosided likelihood ratio tests can be based on [Tn I and the critical values obtained from Table II.X)(Yi ~=1 fj)] /n(jl(72. Yn .. Pis called the sample correlation coefficient and satisfies 1 ::. Qualitatively.Y . Because ITn [ is an increasing function of [P1. where Ui = (Xi .9.3. 1~ )2 2 1~ )2 0"1 n i=1 n i=1 p= [t(Xi .1. we can control probabilities of type II error by increasing the sample size. There is no simple form for the distribution of p (or Tn) when p #. VI)"" .Now.9. . A normal approximation is available. (jr . ~ = (Yi . and eo can be obtained by separately maximizing the likelihood of XI. . where (jr e was given in Problem 2. if we specify indifference regions. When p = 0 we have two independent samples.1.~( Xi .2 distribution.5. the power function of the LR test is symmetric about p = 0 and increases continuously from a to 1 as p goes from 0 to 1. p) distribution. the distribution of p depends on p only.(j~. p). Therefore.Jl2)/0"2..9 Likelihood Ratio Procedures 267 The unrestricted maximum likelihood estimate (x.9.Xn and that ofY1 . and the likelihood ratio tests reject H for large values of [P1. for any a. . When p = 0..o. To obtain critical values we need the distribution of p or an equivalent statistic under H. If p = 0.. log A(X) is an increasing function of p2.13 as 2 = .5) has a Tn. We have eo = (x. the distribution of pis available on computer packages.1.3.
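In practice the test of independence is carried out by computing ρ̂ and the equivalent t statistic of (4.9.5). The following is a minimal computational sketch, assuming Python with numpy and scipy; the simulated bivariate data and the function name are illustrative only.

```python
import numpy as np
from scipy import stats

def correlation_t_test(x, y, alpha=0.05):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]                      # sample correlation rho_hat
    T = np.sqrt(n - 2) * r / np.sqrt(1 - r ** 2)     # has a T_{n-2} law when rho = 0
    crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return r, T, abs(T) >= crit

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 0.5 * x + rng.normal(size=40)                    # correlated pair, for illustration
r, T, reject = correlation_t_test(x, y)
# stats.pearsonr(x, y) returns the same r with a two-sided p-value based on this T.
```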
Because c can be shown to be monotone increasing in p. Po but rather of the "equaltailed" test that rejects if. To obtain lower confidence bounds. c(po)" where Ppo [p::. by putting two level 1 . consider the following bivariate sample of weights Xi of young rats at a certain age and the weight increase Yi during the following week. and only if.:::: c] is an increasing function of p for fixed c (Problem 4. It can be shown that pis equivalent to the likelihood ratio statistic for testing H : P ::. We find the likelihood ratio tests and associated confidence procedures for four classical normal models: .9. The power functions of these tests are monotone. we find that there is no evidence of correlation: the pvalue is bigger than 0. These tests can be shown to be of the form "Accept if.15). c(po) where P po [I)' d(po)] P po [I)'::. We can show that Po [p. we obtain size Q tests for each of these hypotheses by setting the critical value so that the probability of type I error is Q when p = O. c(po)] 1 .Q. 0 versus K : P > 0 and similarly that corresponds to the likelihood ratio statistic for H : P 0 versus K : P < O. Thus. For instance. for large n the equal tails and LR confidence intervals approximately coincide with each other. we can start by constructing size Q likelihood ratio tests of H : P Po versus K : p > po.75. only onesided alternatives are of interest.~Q bounds together. is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. by using the twosided test and referring to the tables. c(Po)] 1 ~Q.48. We can similarly obtain 1 Q upper confidence bounds and. ! Data Example A~ an illustration. p::.7 18.1 Here p 0. Confidence Bounds and Intervals Usually testing independence is not enough and we want bounds and intervals for p giving us an indication of what departure from independence is present. We obtain c(p) either from computer software or by the approximation of Chapter 5. Therefore. We want to know whether there is a correlation between the initial weights and the weight increase and formulate the hypothesis H : p O.18 and Tn 0. inversion of this family of tests leads to levell Q lower confidence bounds. we would test H : P = 0 versus K : P > 0 or H : P ::. 0 versus K : P > O. we obtain a commonly used confidence interval for p. if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood.: : d(po) or p::. These intervals do not correspond to the inversion of the size Q LR tests of H : p = Po versus K : p :::f. The likelihood ratio test statistic >. and only if p . However.268 Testing and Confidence Regions Chapter 4 OneSided Tests In many cases. 370. 'I7 Summary.
. We test the hypothesis that X and Yare indepen dent and find that the likelihood ratio test is equivalent to the test based on IPl.05? e :::. respectively. e).10 PROBLEMS AND COMPLEMENTS Problems for Section 4. Approximate critical values are obtained using Welch's t distribution approximation. (a) Compute the power function of 6c and show that it is a monotone increasing function (b) In testing H : exactly 0. Let Mn = max(X 1 .1 1.98 for (e) If in a sample of size n = 20. ( 2) and we test the hypothesis that the mean difference J.L is zero.48.X)I SD. . . ~ versus K : e> ~. . .Section 4. (3) Twosample experiments in which two independent samples are modeled as coming from N(J. respectively...Ll' ( 2) and N(J. . When a? and a~ are known. .L. where p is the sample correlation coefficient. Let Xl.2 degrees of freedom.L2' a~) populations. . . (4) Bivariate sampling experiments in which we have two measurements X and Y on each case in a sample of n cases. Mn e = ~? = 0. (d) How large should n be so that the 6c specified in (b) has power 0..Lo. (2) Twosample experiments in which two independent samples are modeled as coming from N(J. 4... When a? and a~ are unknown. .Ll' an and N(J. X n are independently and identically distributed according to the uniform distribution U(O. X n ) and let 1 if Mn 2:: c = of e. . We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the twosample (Student) t test.. Assume the model where X = (Xl.10 Problems and Complements 269 (1) Matched pair experiments in which differences are modeled as N(J. what choice of c would make 6c have size (c) Draw a rough graph of the power function of 6c specified in (b) when n = 20.Xn denote the times in days to failure of n similar pieces of equipment.Xn ) is an £(A) sample. Consider the hypothesis H that the mean life II A = J.X. 0 otherwise. what is the pvalue? 2. where SD is an estimate of the standard deviation of D = Y . Suppose that Xl. J.L2' ( 2) populations. we use (Y .XI.L :::. We also find that the likelihood ratio statistic is equivalent to a t statistic with n . the likelihood ratio test is equivalent to the test based on IY . The likelihood ratio test is equivalent to the onesample (Student) t test.
(c) Give an approximate expression for the critical value if n is large and B not too close to 0 or 00. Tr are independent test statistics for the same simple H and that each Tj has a continuous distribution..3.3). where x(I . (cj Use the central limit theorem to show that <l?[ (I"oz( <» II") + VTi(1"  1"0) I 1"] is an approximation to the power of the test in part (a). X n be a 1'(0) sample. Show that if H is simple and the test statistic T has a c. Hint: See Problem B. (b) Give an expression of the power in tenns of the X~n distribution.05? 3. 0: (a) Construct a test of H : B = 1 versus K : B > 1 with approximate size complete sufficient statistic for this model. Hint: Approximate the critical region by IX > 1"0(1 + z(1 . 1). I . (b) Show that the power function of your test is increasing in 8. 7. " i I 5. (d) The following are days until failure of air monitors at a nuclear plant..2..~llog <>(Tj ) has a X~r distribution.Xn be a sample from a population with the Rayleigh density ...<»1 VTi)J . j = 1. j=l . X> 0. . (Use the central limit theorem.12. Let Xl.a) is the (1 . Assume that F o and F are continuous.270 Testing and Confidence Regions Chapter 4 (a) Use the result of Problem 8. U(O.r. i I (b) Check that your test statistic has greater expected value under K than under H. 6. T ~ Hint: See Problem B. . .. Let Xl. f(x. (0) Show that. then the pvalue <>(T) has a unifonn.a)th quantile of the X~n distribution. under H. . Draw a graph of the approximate power function... 1 using a I .. .4 to show that the test with critical region IX > I"ox( 1  <» 12n].0 > 0.ontinuous distribution.3. r.O) = (xIO')exp{x'/20'}.. .) 4. . . (a) Use the MLE X of 0 to construct a level <> test for H : 0 < 00 versus K : 0 > 00 . distribution. j .. .1. . Suppose that T 1 .. . Let o:(Tj ) denote the pvalue for 1j. If 110 = 25. 1nz t Z i . give a normal approximation to the significance probability. . is a size Q test. Days until failure: 315040343237342316514150274627103037 Is H rejected at level <> ~ 0.4. . Hint: Use the central limit theorem for the critical value. .. Establish (4.2 L:.
b > 0.p(u) be a function from (0. 1) variable on the computer and set X ~ F O 1 (U) as in Problem B.10 Problems and Complements 271 8. j = 1.) Evaluate the bound Pp(lF(x) . . T(B) ordered. B.o J J x .5 using the nonnal approximation to the binomial distribution of nF(x) and the approximate critical value in Example 4. Vw. .Fo(x)IO x ~ ~ sup. T(B) be B independent Monte Carlo simulated values of T. = (Xi . Use the fact that T(X) is equally likely to be any particular order statistic.1..p(Fo(x))lF(x) . then the power of the Kolmogorov test tends to 1 as n . let .. 80 and x = 0. (b) Use part (a) to conclude that LN(". T(l). (a) Show that the power PF[D n > kal of the Kolmogorov test is bounded below by ~ sup Fpl!F(x).Fo(x)1 > k o ) for a ~ 0. (b) Suppose Fo isN(O. Here. Hint: If H is true T(X). (b) When 'Ij. then T..T. .10. That is.X(B) from F o on the computer and computing TU) = T(XU)). ~ ll. with distribution function F and consider H : F = Fo.1. Let X I.)(Tn ) = LN(O. .Fo(x) 10 U¢.p(Fo(x))IF(x) . and 1. Let T(l). (a) Show that the statistic Tn of Example 4. n (c) Show that if F and Fo are continuous and FiFo.5. . and let a > O.2.. (a) For each of these statistics show that the distribution under H does not depend on F o.u. = ~ I. 10.o T¢. Fo(x)1 > kol x ~ Hint: D n > !F(x) .. ..p(F(x)IF(x) .5.. 1) to (0. In Example 4.T(X(l)).o V¢. . if X.o: is called the Cramervon Mises statistic. . (In practice these can be obtained by drawing B independent samples X(I).. Define the statistics S¢.12(b).d.a)/b.T(B+l) denote T. T(X(B)) is a sample of size B + 1 from La. (This is the logistic distribution with mean zero and variance 1. is chosen 00 so that 00 x 2 dF(x) = 1.Fo(x)IOdF(x).00). .. . X n be d..(X') = Tn(X).Fo• generate a U(O. 1.. 1) and F(x) ~ (1 +exp( _"/7))1 where 7 = "13/.p(F(x»!F(x) .) Next let T(I). to get X with distribution . . Suppose that the distribution £0 of the statistic T = T(X) is continuous under H and that H is rejected for large values of T.l) (Tn). 00. .6 is invariant under location and scale... . .5.Fo(x)IOdFo(x) .. .Section 4.1.J(u) = 1 and 0: = 2. 9. Express the Cramervon Mises statistic as a sum. Show that the test rejects H iff T > T(B+lm) has level a = m/(B + 1).o sup.Fo(x)1 foteach. ..
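The Monte Carlo recipe of the preceding problem is easy to program: reject when the observed statistic exceeds the (B + 1 - m)th order statistic of the simulated values, which gives level m/(B + 1). The sketch below assumes Python with numpy and scipy; the choice of the Kolmogorov statistic with F0 = N(0, 1), the sample sizes, and the function names are only an illustration.

```python
import numpy as np
from scipy.stats import norm

def kolmogorov_stat(x):
    # D_n = sup_x |F_hat(x) - F_0(x)| for F_0 = N(0, 1), computed at the order statistics
    x = np.sort(np.asarray(x, float))
    n = len(x)
    u = norm.cdf(x)
    return max(np.max(np.arange(1, n + 1) / n - u), np.max(u - np.arange(0, n) / n))

def monte_carlo_reject(t_obs, n, B=999, m=50, seed=2):
    # Simulate B copies of the statistic under H and compare T with the
    # (B + 1 - m)th order statistic; the resulting level is m / (B + 1).
    rng = np.random.default_rng(seed)
    sims = np.array([kolmogorov_stat(rng.standard_normal(n)) for _ in range(B)])
    return t_obs > np.sort(sims)[B - m]

x = np.random.default_rng(3).standard_normal(30)        # illustrative data
reject = monte_carlo_reject(kolmogorov_stat(x), n=30)   # level 50/1000 = 0.05
```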
Expected pvalues. . Problems for Section 4./"0' Show that EPV(O) = if! ( . Let T = X/"o and 0 = /" .1) satisfy Pe. ! . f(3. Let 0 < 00 < 0 1 < 1.. ! . Consider a population with three kinds of individuals labeled 1. the UMP test is of the form I {T > c). . A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time. the other $105 • One second of transmission on either system COsts $103 each.) . Without loss of generality. Suppose that T has a continuous distribution Fe. (For a recent review of expected p values see Sackrowitz and SamuelCahn. which is independent ofT.1.0).. (a) Show that L(x. = /10 versus K : J. You want to buy one of two systems.10.2. 2. 3. and only if. and 3.. i i (b) Define the expected pvalue as EPV(O) = EeU. X n . the other has v/aQ = 1.272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale.a)) = inf{t: Fo(t) > u}.2 and 4. f(2. For a sample Xl. take 80 = O.nO /. then the pvalue is U =I . I). whereas the other a I ~ a . (See Problem 4.2.). > cJ = a.. One has signalMtonoise ratio v/ao = 2. 1999.. Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t). 2N1 + N. where if! denotes the standard normal distribution.Fo(T). I X n from this population.Fe(F"l(l. The first system costs $106 .') sample Xl. Let To denote a random variable with distribution F o. and 3 occuring in the HardyWeinberg proportions f(I. let N I • N 2 • and N 3 denote the number of Xj equal to I. respectively. j 1 i :J . > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • .0)'. Consider Examples 3.O) = 0'.0) = 20(1. If each time you test..1.05. 12.) I 1 . you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0. Whichever system you buy during the year.L > J.. i • (a) Show that if the test has level a. . Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1. ! J (c) Suppose that for each a E (0. I 1 (b) Show that if c > 0 and a E (0. you intend to test the satellite 100 times.2 I i i 1. 00 .. [2N1 + N. then the test that rejects H if. Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 . Show that EPV(O) = P(To > T).3.. where" is known. which system is cheaper on the basis of a year's operation? 1 I Ii I il 2..0) = (1.. the power is (3(0) = P(U where FO 1 (u)  < a) = 1. 5 about 14% of the time. 01 ) is an increasing function of 2N1 + N. 2.to on the basis of the N(p" . Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test.
M(n. 7.6) where all parameters are known.2. [L(X.=  Section 4. (b) Find the test that is best in this sense for Example 4.2. = a.0196 test rejects if. Y) known to be distributed either (as in population 0) according to N(O.10 Problems and Complements 273 four numbers are equally likely to occur (i. Upon being asked to play. L . Y) and a critical value c such that if we use the classification rule..~:":2:~~=C:="::===='. Bk).. Find a statistic T( X.2..2. In Example 4. two 5's are obtained. prove Theorem 4.. (a) Show that if in testing H : f} such that = f}o versus K : f) = f)l there exists a critical value c Po.2. (a) What test statistic should he use if the only alternative he considers is that the die is fair? (b) Show that if n = 2 the most powerfullevel.imately a N(np" na 2 ) distribution.. . +akNk has approx. Hint: The MP test has power at least that of the test with test function J(x) 8.1. .0. +.e..7). In Examle 4.imum probability of error (of either type) is as small as possible. A fonnulation of goodness of tests specifies that a test is best if the max. (X. PdT < cJ is as small as possible...1. Hint: Use Problem 4. 6..2. .2. [L(X..17)..o = (1. with probability .6) or (as in population 1) according to N(I. 9. of and ~o '" I..1. Show that if randomization is pennitted. .. Y) belongs to population 1 if T > c. I.0. the gambler asks that he first be allowed to test his hypothesis by tossing the die n times. 1.2.4. and only if.. 0. B" . . H : (J = (Jo versus K : (J = (J. then the maximum of the two probabilities ofmisclassification porT > cJ. . Bd > cJ = I . if ...1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4. A newly discovered skull has cranial measurements (X. I. find an approximation to the critical value of the MP level a test for this problem. MPsized a likelihood ratio tests with 0 < a < 1 have power nondecreasing in the sample size. and to population 0 ifT < c.2) that linear combinations of bivariate nonnal random variables are nonnaUy distributed. 5. Prove Corollary 4. find the MP test for testing 10.<l.1. Bo. Nk) ~ 4.Pe. (c) Using the fact that if(N" . where 11 = I:7 1 ajf}j and a 2 = I:~ 1 f}i(ai 11)2.Bd > cJ then the likelihood ratio test with critical value c is best in this sense.4 and recall (Proposition B. For 0 < a < I. Bo.0.2.N.2. derive the UMP test defined by (4. then a...
You want to ensure that if the arrival rate is < 10. Show that if X I. . • I i • 4. • • " 5. How many days must you observe to ensure that the UMP test of Problem 4. the probability of your deciding to stay open is < 0. ]f the equipment is subject to wear. > 1/>'0.Xn be the times in months until failure of n similar pieces of equipment. Use the normal approximation to the critical value and the probability of rejection.01.1 achieves this? (Use the normal approximation.Xn is a sample from a truncated binomial distribution with • I.• n.&(>').3. but if the arrival rate is > 15.. Hint: Show that Xf .. I i .1. i 1 .. 0) = ( : ) 0'(1. . In Example 4.. i' i . show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function.3 Testing and Confidence Regions Chapter 4 1.. .4. Let Xl.3..i .: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 .Xn is a sample from a Weibull distribution with density f(x. . . . Find the sample size needed for a level 0.. .95 at the alternative value 1/>'1 = 15. 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! . hence. . the probability of your deciding to close is also < 0. 1965) is the one where Xl.<» is the (1 . . that the Xi are a sample from a Poisson distribution with parameter 8. A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and. I i .(IOn x = 1•. x > O.<»/2>'0 where X2n(1. 274 Problems for Section 4.01. Consider the foregoing situation of Problem 4.0)"'/[1. (a) Show that L~ K: 1/>. the expected number of arrivals per day.<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function. r p(x. (c) Suppose 1/>'0 ~ 12.. A) = c AcxC1e>'x .. a model often used (see Barlow and Proschan. Suppose that if 8 < 8 0 it is not worth keeping the counter open.3. (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2.01 lest to have power at least 0. (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00. Here c is a known positive constant and A > 0 is the parameter of interest. Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days.) 3.
6. 7.:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ . y > 0. x> c where 8 > 1 and c > O. 00 < a < b < 00. then P(Y < y) = 1 e  . (Ii) Show that if we model the distribotion of Y as C(min{X I . against F(u)=u B."" X n denote the incomes of n persons chosen at random from a certain population. imagine a sequence X I. In the goodnessoffit Example 4. (a) Express mean income JL in terms of e. X N e>.d.~) = 1. To test whether the new treatment is beneficial we test H : ~ < 1 versus K : . Suppose that each Xi has the Pareto density f(x.. Show how to find critical values.8 > 80 .1. Find the UMP test. . X N }).5.d.. Let N be a zerotruncated Poisson.6) is UMP for testing that the pvalues are uniformly distributed. Assume that Fo has ~ density fo. . A > O. A > O...[1. and let YI .. Y n be the Ll.2 to find the mean and variance of the optimal test statistic. .Fo(y)l". For the purpose of modeling.•• of Li. where Xl_ a isthe (1 . X 2. 1 ' Y > 0. In Problem 1. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo.1..Fo(y)  }).). Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a. and consider the alternative with distribution function F(x. 8. b). .. > 1.XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0. (b) Find the optimal test statistic for testing H : JL = JLo versus K . (i) Show that if we model the distribution of Y as C(max{X I .1.. (See Problem 4..O<B<1..6. then e. we derived the model G(y. Let Xl. . suppose that Fo(x) has a nonzero density on some interval (a. ~ > O..O) = cBBx(l+BI.12. 6. (a) Lehmann Alte~p. P().  .1.Q)th quantile of the X~n distribution. . survival times with distribution Fa. . > po. survival times of a sample of patients receiving an experimental treatment.B) = Fi!(x). .tive. X 2 .Section 4.6. (b) NabeyaMiura Alternative. 0 < B < 1. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b). Hint: Use the results of Theorem 1. which is independent of XI.) It follows that Fisher's method for cQmbining pvalues (see 4.10 Problems and Complements 275 then 2. random variable.
. I Hint: Consider the class of all Bayes tests of H : () = .. n. J J • 9.. = .... 10. .2.3. F(x) = 1 . e e at l · 6. the denominator decreasing.{Od varies between 0 and 1.. . Show that under the assumptions of Theorem 4. 1 X n be a sample from a normal population with unknown mean J. 0 = 0.(O) «) L > 1 i The lefthand side equals f6". L(x.' " 2.. 0.j "'" . (a) Show how to construct level (1 .. Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1. eo versus K : B = fh where I i I : 11. .0".1). Problems for Section 4.4 1. B> O. i = 1. /.. We want to test whether F is exponential.276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y.. 1 • . Oo)d.X)2. . or Weibull..52.exp( x). where the €i are independent normal random variables with mean 0 and known variance . Assume that F o has a density !o(Y). x > 0. Show that the UMP test is based on the statistic L~ I Fo(Y. Oo)d. Let Xl.(O)/ 16' I • 00 p(x. Let Xi (f)/2)t~ + €i. x > 0. > Er • .1 and 01 loss. j 1 To see whether the new treatment is beneficial.d.2 the class of all Bayes tests is complete.3.. Problem 2... I . Show that the test is not UMP.0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi .{Oo} = 1 . 0.(O) 6 .. Show that under the assumptions of Theorem 4. What would you .'? = 2.L and unknown variance (T2. F(x) = 1 .(O) L':x.' L(x.2. The numerator is an increasing function of T(x). every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t. . Using a pivot based on 1 (Xi . Let Xl. we test H : {} < 0 versus K : {} > O. with distribution function F(x).' (cf.3.0 e6 1 0~0. 1 X n be i. O)d.).O)d. 'i . n announce as your level (1 .0) UeB for '. 0) e9Fo (Y) Fo(Y).exp( _x B).01..6t is UMP for testing H : 8 80 versus K : 8 < 80 . Show that under the assumptions of Theorem 4. 1 .X)' = 16.i. .  1 j . 12. 00 p(x. Hint: A Bayes test rejects (accepts) H if J .
a) confidence interval for B.2.(al + (2)) confidence interval for q(8).4.c. has a 1. . .. [0. Compare yonr result to the n needed for (4. < 0. but we may otherwise choose the t! freely.4.1.d.1. Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a.02. n. 5. 6. (Define the interval arbitrarily if q > q. t.1) based on n = N observations has length at most l for some preassigned length l = 2d.. X!)/~i 1tl ofB. .3). ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0. . X+SOt no l (1. What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4.IN.~a) /d]2 .(X) = X .n .al). then the shortest level n (1 . < a. I")/so. Then take N . 0. minI ii.7).. distribution.a. Suppose we want to select a sample size N such that the interval (4. Show that if Xl.X are ij. N(Ji" 17 2) and al + a2 < a.Xn be as in Problem 4.a. what values should we use for the t! so as to make our interval as short as possible for given a? 3.] . find a fixed length level (b) ItO < t i < I.Section 4./N.4.q(X)] is a level (1 .1.1.. .3 we know that 8 < O.l..1 Show that. i = 1.(2) " .) Hint: Use (A.4. Suppose that in Example 4.. Let Xl. Use calculus.~a)/ .. (a) Justify the interval [ii. with X .n is obtained by taking at = Q2 = a/2 (assume 172 known). Show that if q(X) is a level (1 . Stein's (1945) twostage procedure is the following..no further observations. then [q(X). Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji.01 [X . X + z(1 ..3). where c is chosen so that the confidence level under the assumed value of 17 2 is 1 .z(1 .. with N being the smallest integer greater than no and greater than or equal to [Sot". 0. It follows that (1 .IN(X  = I:[' lX.SOt no l (1. .IN] . Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2.1] if 8 > 0.1)] if 8 given by (4.(2) UCB for q(8). where 8. there is a shorter interval 7..~a)!.a) interval of the fonn [ X ..) LCB and q(X) is a level (1. . of the form Ji.10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 . although N is random.4.
Let 8 ~ B(n.001. find ML estimates of 11. (a) Use (A.18) to show that sin 1( v'X)±z (1 . T 2 I I' " I'. .a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a).: " are known. and. " 1 a) confidence .a:) confidence interval of length at most 2d wheri 0'2 is known.95 confidence interval for (J.O).fN(X . Y n2 be two independent samples from N(IJ. (a) Show that X interval for 8.. By Theorem B. (12) and N(1/.4.~a)].. . Suppose that it is known that 0 < ~. (12..jn(X . Hence.3) are indeed approximate level (1 . it is necessary to take at least Z2 (1 (12/ d 2 observations. (The sticky point of this approach is that we have no control over N. Show that ~(). 4(). Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment. (a) Show that in Problem 4.jn is an approximate level (1  • 12.. Show that the endpoints of the approximate level (1 .1.  10.. in order to have a level (1 . (a) If all parameters are unknown. Let XI. i (b) What would be the minimum sample size in part (a) if (). we may very likely be forced to take a prohibitively large number of observations.') populations.. X n1 and Yll .!a)/ 4. v. I .0)/[0(1. and the fundamental monograph of Wald.3. 0) and X = 81n.jn is an approximate level (b) If n = 100 and X = 0. = 0.~ a) upper and lower bounds. (J'2 jk) distribution. but we are sure that (12 < (It. 0. .a) confidence interval for (1/1').4. (1  ~a) / 2. 1"2. . 11.. respectively.. 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 . (12 2 9. Show that these two quadruples are each sufficient. The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook.051 (c) Suppose that n (12 = 5.1') has a N(o. ..a) interval defined by (4.a) confidence interval for sin 1 (v'e). > z2 (1  is not known exactly. is _ independent of X no ' Because N depends only on sno' given N = k.) I ~i . . . 1947.0') for Jt of length at most 2d./3z (1.. d = i II: . use the result in part (a) to compute an approximate level 0. Hint: Set up an inequality for the length and solve for n. (b) Exhibit a level (1 . Indicate what tables you wdtdd need to calculate the interval.6.) (lr!d observations are necessary to achieve the aim of part (a). exhibit a fixed length level (1 . if (J' is large. Let 8 ~ B(n. 1986.3.l4. 0") distribution and is independent of sn" 8. sn. (c) If (12. ±.) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi.O)]l < z (1. X has a N(J1. I 1 j Hint: [O(X) < OJ = [.
I) + V2( n1) t z( "1) and x(1 .J1.4 and the fact that X~ = r (k.30.". and 10 000.D andn(X{L)2/ cr 2 '" = r (4.(n . .4 ) . known) confidence intervals of part (b) when k = I. Slutsky's theorem. find (a) A level 0. Show that the confidence coefficient of the rectangle of Example 4. K. 100.1.3.9 confidence interval for {L.~Q). (c) A level 0. 14.Q confidence interval for cr 2 . cr 2 ) population.5 is (1 _ ~a) 2.J1. (a) Show that x( "d and x{1. In Example 4. Hint: By B. Compare them to the approximate interval given in part (a). 15.)4 < 00 and that'" = Var[(X.3.7). In the case where Xi is normal. (d) A level 0.2. Z.3.2.4/0.<. it equals O. Assuming that the sample is from a /V({L.' Now use the central limit theorem. and X2 = X n _ I (1 .3. Suppose that 25 measurements on the breaking strength of a certain alloy yield 11.9 confidence interval for {L x= + cr. Hint: Let t = tn~l (1 . is replaced by its MOM estimate. and (b) (4. Find the limit of the distribution ofn.n} and use this distribution to find an approximate 1 .2 is known as the kurtosis coefficient. (n _1)8'/0' ~ X~l = r(~(nl)."') .' = 2:. give a 95% confidence interval for the true proportion of cures 8 using (al (4.IUI. 4) are independent.) can be approximated by x(" 1) "" (n .1 is known. 10. 1) random variables.) Hint: (n .4 = E(Xi .)/01' ~ (J1. Hint: Use Problem 8. See A.t ([en . (c) Suppose Xi has a X~ distribution.3). V(o') can be written as a sum of squares of n 1 independentN(O.2.4.n(X . (In practice.J1.027 13.".J1.9 confidence interval for cr. = 3. See Problem 5. ~).8). Now use the law of large numbers. XI 16.4. Xl = Xn_l na). Now the result follows from Theorem 8.)' .3.Section 4. but assume that J1. . K. 0).9 confidence region for (It.I).).1) t z{1 .4.1)8'/0') . :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution.)'. then BytheproofofTheoremB.ia). If S "' 8(64. Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed.10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0.4. (b) A level 0.1 and. and the central limit theorem as given in Appendix A.' 1 (Xi . . Compute the (K.4.1) + V2( n .
It follows that the binomial confidence intervals for B in Example 4.a) confidence interval for F(x). 1'1(0) < X < 1'. b) for some 00 < a < 0 < b < 00. we want a level (1 . 19.4.4. In this case nF(l') = #[X.6. )F(x)[l.~u) by the value U a determined by Pu(An(U) < u) = 1.F(x)]dx. i 1 . Consider Example 4.d.6 with . I (b) Using Example 4.I I 280 Testing and Confidence Regions Chapter 4 17. Show that for F continuous where U denotes the uniform.4.1. find a level (I .3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I . .4.u) confidence interval for 1".l). pl(O) < X <Fl(b)}.F(x)1 Typical choices of a and bare . (c) For testing H o : F = Fo with F o continuous. U(O...1 (b) V1'(x)[l.F(x)1 . • i.I Bn(F) = sup vnl1'(x) .F(x)1 is the approximate pivot given in Example 4. X n are i. (a) Show that I" • j j j = fa F(x)dx + f.3 for deriving a confidence interval for B = F(x). 18. .F(x)1 )F(x)[1 .u.1. indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4.9) and the upper boundary I" = 00. define . (b)ForO<o<b< I.F(x)l.11 .95.05 and .i. define An(F) = sup {vnl1'(X) .4.1'(x)] Show that for F continuous. as X and that X has density f(t) = F' (t). distribution function. !i . I' ..7. . That is. Suppose Xl.r fixed. " • j . verify the lower bounary I" given by (4. In Example 4. (a) For 0 < a < b < I. < xl has a binomial distribotion and ~ vnl1'(x) . Assume that f(t) > 0 ifft E (a. .
= 1 versns K : Il. Hint: Use the results of Problems B.al) and (1 . for H : (12 > (15. 1/ n J is the shortest such confidence interval. (15 = 1.3. (d) Show that [Mn . X2) = 1 if and only if Xl + Xi > c.(2) together.1 has size a for H : 0 (}"2 be < 00 .) denotes the o. M n /0. (a) Deduce from Problem 4. if and only if 8(X) > 00 . = 0.4 and 8. (e) Similarly derive the level (1 . Give a 90% confidence interval for the ratio ~ of mean life times. 3. (a) Find c such that Oc of Problem 4.a) LCB corresponding to Oc of parr (a).Section 4.th quantile of the .. (e) Suppose that n = 16.0.13 show that the power (3(0 1 . Yf (1 . < XI? < f(1. 5. . and consider the prob2 + 8~ > 0 when (}"2 is known. a (b) Show that the test with acceptance region [f(~a) for testing H : Il.90? 4. X 2 be independent N( 01. What value of c givessizea? (b) Using Problems B.~a) I Xl is a confidence interval for Il.1.3..2 = VCBs of Example 4. respectively.a.5. with confidence coefficient 1 . Show that if 8(X) is a level (1 .a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 .12 and B. Hint: If 0> 00 . lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 . . Yn2 be independent exponential E( 8) and £(. [8(X) < 0] :J [8(X) < 00 ].2).1 O? 2.. Or .. .) samples..1"2nl. respectively. is of level a for testing H : 0 > 00 . .3. (b) Derive the level (1. Let Xl.~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant. and let Il.5.· (a) If 1(0.. ~ 01). x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il. show that [Y f( ~a) I X. test given in part (a) has power 0. . How small must an alternative before the size 0. of l.\..Xn1 and YI .05.3.5 1.4. O ) is an increasing 2 function of + B~. then the test that accepts.2 are level 0.a) UCB for 0..10 Problems and Complements 281 Problems for Section 4. "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function.2n2 distribution.2). . N( O ..2 that the tests of H : . Let Xl. Experience has shown that the exponential assumption is warranted. = 1 rejected at level a 0.
6. Let X l1 . k . O?. Thus. Show that the conclusions of (aHf) still hold. 'TJo) of the composite hypothesis H : ry = ryo. N( 020g. we have a location parameter family. lif (tllXi > OJ) ootherwise.0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses. (c) Show that Ok(o) (X. ). respectively. .0:) confidence interval for 11.282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2). T) where 1] is a parameter of interest and T is a nuisance parameter.2). L:J ~ ( . of Example 4.4. .[XU) (f) Suppose that a = 2(n') < OJ and P. I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~..0:) LeB for 8 whatever be f satisfying our conditions.. (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing . Show that PIX(k) < 0 < X(nk+l)] ~ . but I(t) = I( t) for all t. (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F.2). X. l 7. .Xn be a sample from a population with density f (t . (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?.  j 1 j . >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large. og are independentN (0. Let C(X) = (ry : o(X. (e) Show directly that P. We are given for each possible value 1]0 of1] a level 0: test o(X.8) where () and fare unknown.a) confidence region for the parameter ry and con versely that any level (1 .1 when (72 is unknown. . . 1 Ct ~r (b) Find the family aftests corresponding to the level (1. O.a. I . X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 . (a) Shnw that C(X) is a level (l . . 0. Hint: (c) X.00 . (b) The sign test of H versus K is given by. .a)y'n. ry) = O}. Suppose () = ('T}.~n + ~z(l .IX(j) < 0 < X(k)] do not depend on 1 nr I. and 1 is continuous and positive. 0.
p)} as in Problem 4.2. 8( S)] be the exact level (1 . alll/. ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry. 9. 1). Show that the Student t interval (4. 1) and q(O) the ray (X . Yare independent and X ~ N(v.a))2 if X > z(1 . Let a(S. This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line. Po) is a size a test of H : p = Po.a). (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X .5.a).2 a) (a) Show that J(X.10 Problems and Complements 283 8. let T denote a nuisance parameter. Suppose X.p) = = 0 ifiX .a.z(l. Show that as 0 ranges from O(S) to 8(S). Let 1] denote a parameter of interest. 1).5.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4.pYI < (1 1 otherwise.Y.2a) confidence interval for 0 of Example 4. that suP. Then the level (1 .1) is unbiased. Let X ~ N(O.Section 4. .1. and let () = (ry. 00) is = 0'.a) confidence interval [ry(x). Define = vl1J.O ~ J(X.3 and let [O(S).7. Y. That is. hence. Establish (iii) and (iv) of Example 4. Y ~ N(r/. or). 0) ranges from a to a value no smaller than 1 . (b) Describe the confidence region obtained by inverting the family (J(X. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'.7. Y. ' + P2 l' z(1 1 .2. Thus.5. the quantity ~ = 8(8) . 10. Note that the region is not necessarily an interval or ray. 12.z(1 .00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H. p. Let P (p. laifO>O <I>(z(1 .[q(X) < 0 2 ] = 1. if 00 < O(S) (S is inconsistent with H : 0 = 00 ). Hint: You may use the result of Problem 4.5.20) if 0 < 0 and.a) .a) = (b) Show that 0 if X < z(1 .5. a( S. 11. 7)).
Show that the interval in parts (e) and (0 can be derived from the pivot T(x ) p  .05 could be the fifth percentile of the duration time for a certain disease. .F(x p)]. it is distribution free.5.a) confidence interval for x p whatever be F satisfying our conditions. (X(k)l X(n_I») is a level (1 . X n be a sample from a population with continuous distribution F.7. F(x). Confidence Regions for QlIalltiles. (b) The quantile sign test Ok of H versus K has critical region {x : L:~ 1 1 [Xi > 0] > k}. be the pth quantile of F.a. iI . vF(xp) [1 F(xp)] " Hint: Note that F(x p ) = p. Let F. I • (c) Let x· be a specified number with 0 < F(x') < 1. (g) Let F(x) denote the empirical distribution..a = P(k < S < n I + 1) = pi(1. . il.p) variable and choose k and I such that 1 . = ynIP(xp) .1» . i j (d) Deduce that X(n_k(Q)+I) (XU) is the jth order statistic of the sample) is a level (1 .4. and F+(x) be as in Examples 4. ! . k(a) . That is. In Problem 13 preceding we gave a disstributionfree confidence interval for the pth quantile x p for p fixed. Construct the interval using F. lOOx.. We can proceed as follows. P(xp < x p < xp for all p E (0. . . 14. Show that Jk(X I x'.6 and 4. Detennine the smallest value k = k(a) such that Jk(Q) has level a for H and show that for n large.a) LeB for x p whatever be f satisfying our conditions. (e) Let S denote a B(n. .) Suppose that p is specified. (t) Show that k and 1in part (e) can be approximated by h (~a) and h (1  ~a) where ! heal is given in part (b). 1 . " II [I .h(a).p)"j. (See Section 3. 0 < p < 1. .1 and Fu 1. Thus.a. Let Xl. Simultaneous Confidence Regions for Quantiles. Suppose we want a distributionfree confidence region for x p valid for all 0 < p < 1. X n x*) is a level a test for testing H : x p < x* versus K : x p > x*. Then   P(P(x) < F(x) < P+(x)) for all x E (a.. That is.. Show that P(X(k) < x p < X(nl)) = 1. L..4.p) versus K' : P(X > 0) > (1 . where ! 1 1 heal c: n(1  p) + zl_QVnp(1  pl.95 could be the 95th percentile of the salaries in a certain profession l or lOOx .b) (a) Show that this statement is equivalent to = 1.p).. (a) Show that testing H : x p < 0 versus I< : x p > 0 is equivalent to testing H' : P(X > 0) < (I .284 Testing and Confidence Regions Chapter 4 13. Let x p = ~ IFI (p) + Fi! I (P)]. k+ ' i .
where A is a placebo.(x) = F ~(Fx(x» . That is.. Express the band in terms of critical values for An(F) and the order statistics.z. .Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}.r < b.) < p} and.10 Problems and Complements 285 where x" ~ SUp{. We will write Fx for F when we need to distinguish it from the distribution F_ x of X. (c) Show how the statistic An(F) of Problem 4. Suppose X denotes the difference between responses after a subject has been given treatments A and B. where p ~ F(x).xp]: 0 < l' < I}. Give a distributionfree level (1 ..1. .. The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R. Suppose that X has the continuous distribution F.. LetFx and F x be the empirical distributions based on the i.r" ~ inf{x: a < x < b.p. Hint: Let t.u are the empirical distributions of U and 1 .U with ~ U(O. X I. . TbeaitemativeisthatF_x(t) IF(t) forsomet E R.d.X n . L i==I n I IFx (Xi) . .X I. n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X. F(x) > pl.(x)] < Fx(x)] = nF1_u(Fx(x)).. that is.17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p .5.4. X n and . Note the similarity to the interval in Problem 4.. (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X.T: a <.r p in terms of the critical value of the Kolmogorov statistic and the order statistics. then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U ).Section 4.. . VF(p) = ![x p + xlpl. 15. (b) Express x p and .13(g) preceding. F f (.. then L i==l n IIXi < X + t.j < F_x(x)] See also Example 4. I).1. . the desired confidence region is the band consisting of the collection of intervals {[.x. F~x) has the same distribution Show that if F x is continuous and H holds. where F u and F t .i. ~ ~ (a) Consider the test statistic x ..
a) simultaneous confidence band 15.5. then  nFy(x + A(X)) = L: I[Y. (p).:~ 1 1 [Fy(Y. i=I n 1 i t . Fx. we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R.:~ I1IFx(X.(F(x)) . As in Example 1. vt(p)) is the band in part (b).x = 2VF(F(x)). Fx(t)l· .) = D(Fu . and by solving D( Fx. Properties of this and other bands are given by Doksum.2A(x) ~ HI(F(x)) l 1 + vF(F(x))... where x p and YP are the pth quan tiles of F x and Fy. Fenstad and Aaberge (1977).) < do.. for ~..) < Fy(x p )] = nFv (Fx (t)) under H. Show that for given F E F. It follows that if c. Let t (c) A Distdbution and ParameterFree Confidence Interval.a) simultaneous confidence band for A(x) = F~l:(Fx(x)) . Also note that x = H.d.d. let Xl. It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H. .Fy ) = maxIFy(t) 'ER "" . nFx(x) we set ~ 2. "" = Yp . Hint: nFx(t) = 2. the probability is (1 . where do:.F~x.) : :F VF ~ inf vF(p). then D(Fx.Fy ): 0 < p < 1}...) is a location parameter} of all location parameter values at F.3. .1..t::.) ~ < F x (")] ~ ~ = nFu(Fx(x)).U ). Hint: Let A(x) = FyI (Fx(x)) ..i. = 1 i 16. treatment B responses.p)] = ~[Fxl(p) + F~l:(p)]. (b) Consider the parameter tJ p ( F x.x p .(x) ~ R. <aJ Show that if H holds.F1 _u).l (p) ~ ~[Fxl(p) . i . Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ .x. < F y I (Fx(x))] i=I n = L: l[Fy (Y. F y ) .".a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(. he a location parameter as defined in Problem 3.. To test the hypothesis H that the two treatments are equally effective.. . .F x l (l . ~ ~ ~ F~x. Then H is symmetric about zero. we get a distributionfree level (1 . where :F is the class of distribution functions with finite support. where F u and F v are independent U(O. 1 Yn be i..) < Fx(x)) = nFv(Fx(x)). I I I 1 1 D(Fx. F y ) has the same distribntion as D(Fu . F y ).:~ II[Fx(X. The result now follows from the properties of B(·). i I f F. Hint: Define H by H. Give a distributionfree level (1. treatment A (placebo) responses and let Y1 ..x (': + A(x)). then D(Fx .286 Testing and Confidence Regions ~ Chapter 4 '1.Jt: 0 < p < 1) for the curve (5 p (Fx.17... Let 8(.'" 1 X n be Li. vt = O<p<l sup vt(p) O<p<l where [V.) < Fx(t)] = nFu(Fx(t)). We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y . nFy(t) = 2.. Moreover.1) empirical distributions. . is the nth quantile of the distribution of D(Fu l F 1 .
)I L ti . F y ) and b _maxo<p<l 6p(Fx . moreover X + 6 < Y' < X + 6. (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T. F y. Let6. Let do denote a size a critical value for D(Fu .) = F (p) .. ) is a shift parameter} of the values of all shift parameters at (Fx . .) < do: for Ll.F y ' (p) = 6.Section 4. Fy ). then Y' = Y. F y ) < O(Fx . we find a distributionfree level (1 . F y. then B( Fx . nFx ("') = nFu(Fx(x)). Exhibit the UMA level (1 .) 12:7 I if. Q.4.0: UCB for J. F x ) ~ a and YI > Y. Fx +a) = O(Fx a.) I i=l L tt . Suppose X I.).. F y ) E F x F.0:) confidence bound forO. B(.Xn is a sample from a r (p.)...(Fx . thea It a unifonnly most accurate level 1 . F y ). (d) Show that E(Y) . Fy ) .. 0 < p < 1. .2uynz(1 i=! i=l all L i=l n ti is also a level (1.('. O(Fx. F y ) is in [6. and 8 are shift parameters.) ~ + t.E(X). Show that for given (Fx . Let 6 = minO<1'<16p(Fx. is called a shift parameter if O(Fx. 6+ maxO<p<1 6+(p).4..T. Now apply the axioms. ~ D(Fu •F v ). 2.(x)) . where :F is the class of distri butions with finite support. :F x :F O(Fx. tt Hint: Both 0 and 8* are nonnally distributed. (b) Consider the unbiased estimate of O. t Hint: Set Y' = X + t.a) that the interval [8. ~ ~ = (e) A Distribution and ParameterFree Confidence Interval. F y ) > O(Fx " Fy). 3.(x) = Fy(x thea D(Fx.2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8. ~) distribution.6 1. the probability is (1 .(X). where is nnknown. = 22:7 I Xf/X2n( a) is .6]. x' 7 R. if I' = 1I>'.    Problems for Section 4.) . then by solving D(F x. . Show that n p is known and 0 O' ~ (2 2~A X. Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976).6.Fy).(. T n n = (22:7 I X. Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('..= minO<p<l 6.. t.6+] contains the shift parameter set {O(Fx. Show that B = (2 L X.2.L.. Fv ).·) is a shift parameter.3. (a) Consider the model of Problem 4.a) UCB for O. . Show that if 0(·. Show that for the model of Problem 4.10 Problems and Complements 287 Moreover.(P). It follows that if we set F\~•. Fy.0:) simultaneouS coofidence band for t. 6.
\ =.a) confidence intervals such that Po[8' < B' < 0'] < p.B')+..6 to V = (B . ·'"1 1 . P(>.W < BI = 7. s) distribution with T and s integers.6.\. distribution and that B has beta. where Show that if (B. j .B) = J== J<>'= p( s. g. Prove Corollary 4. B). and that B has the = se' (t. n = 1.1.' with "> ~. Problems for Section 4. . Suppose that given B Pareto. where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t.1 are well defined and strictly increasing.. respectively. [B. then E. s). > 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I . t)dsdt = J<>'= P. 1I"(t) Xl.6 for c fixed.2. . 5. f3( T.6.B)+.. V be random variables with d. (a) Show that if 8 has a beta. Hint: Apply Problem 4.X.7 1 .W < B' < OJ for allB' l' B. In Example 4.3). s > 0.( 0' Hint: E. . • . U(O. I'. .. U = (8 . then E(U) > E(V). .(O . uniform.2. Suppose that given. e> 0..\ is distributed as V /050. Construct unifoffilly most accurate level 1 .0 upper and lower confidence bounds for J1 in the model of Problem 4.d. Xl.i.4.3 and 4. and for ().Xn are Ll.B(n.B') < E.l. 1.l2(b)."" X n are i.Bj has the F distribution F 2r . Let U.a.1. G corresponding to densities j.. 0'].B). p. t > e.d.[B < u < O]du. establish that the UMP test has acceptance region (4.aj LCB such that I . Suppose that B' is a uniformly most accurate level (I .C. 2. where s = 80+ 2n and W ~ X. Poisson.2. .O] are two level (1 . 3. '" J. Hint: Use Examples 4. F1(tjdt. Hint: By Problem B. Establish the following result due to Pratt (1961). . ..2" Hint: See Sections 8. J • 6. distribution with r and s positive integers.2 and B. .4.288 Testing and Confidence Regions Chapter 4 4.6. t) is the joint density of (8.12 so that F. Suppose [B'. OJ. X has a binomial.B). Pa( c.3. (B" 0') have joint densities. 8. • • • = t) is distributed as W( s.. (b) Suppose that given B = B. (1: dU) pes.6.3.) and that. s).3.f:s F. /3( T. 0). (0 . .a) upper and lower credible bounds for A. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for. then )" = sB(r(1 . density = B.E(V) are finite. i . Show that if F(x) < G(x) for all x and E(U). ~. satisfying the conditions of Problem 8. E(U) = .
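For the Poisson problem above, the posterior is again a scaled chi-square, so the credible bounds can be read off chi-square quantiles. A short numerical sketch, assuming the conjugate setup stated in the problem (a priori lambda is distributed as V/s0 with V ~ chi-square_k); the values of k, s0, and the data are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def poisson_credible_bounds(x, k, s0, alpha=0.05):
    """Level (1 - alpha) lower and upper credible bounds for lambda when,
    a priori, lambda ~ V/s0 with V ~ chi-square_k and X_1,...,X_n | lambda are
    i.i.d. Poisson(lambda).  Posterior: lambda | T = t ~ W/(s0 + 2n) with
    W ~ chi-square_{k + 2t}, where t = sum of the observations."""
    x = np.asarray(x)
    n, t = len(x), x.sum()
    df, scale = k + 2 * t, 1.0 / (s0 + 2 * n)
    return chi2.ppf(alpha, df) * scale, chi2.ppf(1 - alpha, df) * scale

rng = np.random.default_rng(1)
x = rng.poisson(3.0, size=25)                    # hypothetical Poisson data
lower, upper = poisson_credible_bounds(x, k=2, s0=1.0)
print("lower credible bound:", round(lower, 3), " upper credible bound:", round(upper, 3))
```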
1 r.Xn ). r) where 7f(Ll I x .1. 4. .y) is 1f(r 180)1f(/11 1 r..112. In particular consider the credible bounds as n + 00. (d) Compare the level (1.. .s') withe' = max{c. y) is obtained by integrating out Tin 7f(Ll. Problems for Section 4. r 1x.Xn is observable and X n + 1 is to be predicted.y. ".rlm) density and 7f(/12 1r. Hint: p( 0 I x.Xn + 1 be i. Show that (8 = S IM ~ Tn) ~ Pa(c'.0:) confidence interval for B.. Hint: 7f(Ll 1x. as X. (c) Give a level (1 . Here Xl..y.. ..8 1.2 degrees of freedom. ..x) is aN(x. Yn are two independent N (111 \ 'T) and N(112 1 'T) samples.2. y) is proportional to p(8)p(x 1/11. 'P) is the N(x .a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B. (a) Give a level (1 . where (75 is known.r)p(y 1/12.y.10 Problems and Complements 289 (a)LetM = Sf max{Xj.m} and + n.. y) of is (Student) t with m + n . /11 and /12 are independent in the posterior distribution p(O X.Section 4. (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll. . . Let Xl.1 )) distribution.a) upper and lower credible bounds for O.y) = 7f(r I sO)7f(LlI x . .d. T) = (111.. 'T).r). r > O. y). Xl.0:) prediction interval for Xnt 1.i. r In) density. 1f(/11 1r.x)' + E(Yj y)'. (a) Let So = E(x. r( TnI (c) Set s2 + n. (b) Find level (1 . .r 1x. . = So I (m + n  2).. y) is aN(y.Xm.l. Show thatthe posterior distribution 7f( t 1 x.x)1f(/1..0"5)... and Y1. Show fonnaUy that the posteriof?r(O 1 x.1. Suppose that given 8 = (ttl' J. (b) Show that given r. y) and that the joint density of. . Suppose 8 has the improper priof?r(O) = l/r. = PI ~ /12 and'T is 7f(Ll. N(J... respectively.y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2.
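The credible interval asked for in part (d) can be computed directly, since under the improper prior the posterior of Delta is a location-scale Student t. A small sketch, with simulated data standing in for the two samples:

```python
import numpy as np
from scipy.stats import t as t_dist

def credible_interval_delta(x, y, alpha=0.05):
    """Level (1 - alpha) credible interval for Delta = mu_1 - mu_2 under the
    improper prior pi(theta) = 1/tau: the posterior of Delta is Student t with
    m + n - 2 degrees of freedom, centered at xbar - ybar, scale s*sqrt(1/m + 1/n)."""
    x, y = np.asarray(x), np.asarray(y)
    m, n = len(x), len(y)
    s0 = ((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()
    s = np.sqrt(s0 / (m + n - 2))
    center = x.mean() - y.mean()
    half = t_dist.ppf(1 - alpha / 2, m + n - 2) * s * np.sqrt(1 / m + 1 / n)
    return center - half, center + half

rng = np.random.default_rng(2)
x = rng.normal(10, 2, size=12)   # hypothetical first sample
y = rng.normal(11, 2, size=15)   # hypothetical second sample
print(credible_interval_delta(x, y))
```

Numerically this interval coincides with the classical equal-variance two-sample t confidence interval.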
2. " u(n+l).. 1 X n are i. Suppose that given (J = 0.'" ..8. n = 100. Suppose that Y. Present the results in a table and a graph. and a = . Suppose XI. random variable.x ) where B(·. give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a. give level 3.t for M = 5.d. distribution.• .3) by doing a frequentist computation of the probability of coverage.a) prediction interval for X n + l . 5. ~o = 10.9. . .12.05. Find the probability that the Bayesian interval covers the true mean j. F. . .Q) distribution free lower and upper prediction bounds for X n + 1 • < 00. 0) distribution given (J = 8. q(ylx)= ( .'.s+nx+my)/B(r+x.. . (a) If F is N(Jl. s).nl. (This q(y distribution. (b) If F is N(p" bounds for X n + 1 . 0 > 0. Let X have a binomial. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I . Give a level (1 . I x) is sometimes called the Polya q(y I x) = J p(y I 0). Establish (4.x .15. has a B(m. That is.2) hy using the observation that Un + l is equally likely to be any of the values U(I). as X where X has the exponential distribution F(x I 0) =I . where Xl. ! " • Suppose Xl.2n distribution.. as X . .0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl. :[ = . suppose Xl.i. Then the level of the frequentist interval is 95%. Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 .8).i.8. 4. b).i.' . B( n. distribution.) Hint: First show that .Xn are observable and we want to predict Xn+l. N(Jl.(0 I x)dO..a (P(Y < Y) > I ..9..d.d. Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X .10. In Example 4. x > 0.10. . Let Xl..11...·) denotes the beta function.... )B(r+x+y.Xn are observable and X n + 1 is to be predicted.8. . X n+ l are i. 0'5)' Take 0'5 ~ 7 2 = I. and that (J has a beta. . give level (I ..a).. X n such that P(Y < Y) > 1. A level (1 .0'5) with 0'5 known. .Xn + 1 be i.. X is a binomial. < u(n+l) denote Ul . 0). Un + 1 ordered. B(n. let U(l) < . ..a) lower and upper prediction bounds for X n + 1 . (3(r.9 1. which is not observable.e.s+n.5.. i! Problems for Section 4..8. 1 2. CJ2) with (J2 unknown.290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4. .5. • . 00 < a < b (1 .
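The predictive (Polya) distribution q(y | x) displayed above is a beta-binomial probability and is easiest to evaluate on the log scale. A minimal sketch, assuming the setup of the problem: X ~ B(n, theta) is observed, Y ~ B(m, theta) is to be predicted, and theta has a beta(r, s) prior.

```python
import numpy as np
from scipy.special import betaln, comb

def polya_predictive(y, x, n, m, r, s):
    """q(y | x) = C(m, y) B(r + x + y, s + n - x + m - y) / B(r + x, s + n - x)."""
    log_q = (np.log(comb(m, y))
             + betaln(r + x + y, s + n - x + m - y)
             - betaln(r + x, s + n - x))
    return np.exp(log_q)

# Example: n = 10 trials with x = 7 successes observed; predict the number of
# successes y in m = 5 further trials under a uniform (beta(1, 1)) prior.
probs = np.array([polya_predictive(y, x=7, n=10, m=5, r=1, s=1) for y in range(6)])
print(np.round(probs, 4), "sum =", probs.sum().round(6))
```

The probabilities sum to 1 over y = 0, ..., m, which is a useful check on the formula.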
let X 1l .. of the X~I distribution.f. . In Problems 24. I . 0'2 < O'~ versus K : 0'2 > O'~. = Xnl (1 . (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12. 0' 4. = aD nO' 1".n) and >..V2nz(1 .(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy. Show that (a) Likelihood ratio tests are of the form: Reject if.(x) = . Thus.. if ii' /175 < 1 and = ./(n . 0'2) sample with both JL and 0'2 unknown. XnI (1 . where Tn is the t statistic. if Tn < and = (n/2) log(1 + T.3. We want to test H . (i) F(c. Xn_1 (~a).X) aD· t=l n > C. and only if. Hint: log . 2 2 L.~Q) also approximately satisfy (i) and (ii) of part (a).(Xi . 2. (c) Deduce that the critical values of the commonly used equaltailed test.. JL > Po show that the onesided.~Q) n + V2nz(l.( x) ~ 0.F(CI) ~ 1. where F is the d.C2n !' 1 as n !' 00.Q). >.10 Problems and Complements 291 is an increasing function of (2x . In testing H : It < 110 versus K . · .Section 4... ° 3. n Cj I L. (ii) CI ~ C2 = n logcl/C2. (b) Use the nonnal approximatioh to check that CIn C'n n .(x) Hint: Show that for x < !n. Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise. and only if.3. We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if. log . onesample t test is the likelihood ratio test (fo~ Q <S ~).~Q) approximately satisfy (i) and also (ii) in the sense that the ratio .(nx). OneSided Tests for Scale.. TwoSided Tests for Scale.='7"nlog c l n /C2n CIn .. (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B.) . ~2 .I)) for Tn > 0.log (ii' / (5)1 otherwise..Q.Xn be a N(/L.(x) = 0.
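The pair (c1, c2) in the two-sided scale problem is defined only implicitly by conditions (i) and (ii), but it is easy to find numerically. A sketch, assuming (as the problem states) that T = sum (X_i - Xbar)^2 / sigma_0^2 has the chi-square distribution with n - 1 degrees of freedom under H, with F its c.d.f.:

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

def lr_scale_critical_values(n, alpha=0.05):
    """Critical values (c1, c2) of the size-alpha likelihood ratio test of
    H: sigma = sigma_0, i.e. the pair satisfying
    (i)  P(c1 <= chi2_{n-1} <= c2) = 1 - alpha   and   (ii)  c2 - c1 = n log(c2/c1)."""
    F = chi2(n - 1).cdf

    def c2_of_c1(c1):
        # the second root (> n) of c - n log c = c1 - n log c1, which encodes (ii)
        g = lambda c: (c - n * np.log(c)) - (c1 - n * np.log(c1))
        return brentq(g, n, 50 * n)

    def size_gap(c1):
        return F(c2_of_c1(c1)) - F(c1) - (1 - alpha)

    c1 = brentq(size_gap, 1e-6, n - 1e-6)
    return c1, c2_of_c1(c1)

n = 20
c1, c2 = lr_scale_critical_values(n)
print("likelihood ratio:", round(c1, 3), round(c2, 3))
print("equal-tailed:    ", round(chi2(n - 1).ppf(0.025), 3), round(chi2(n - 1).ppf(0.975), 3))
```

The last line shows the equal-tailed values x_{n-1}(alpha/2), x_{n-1}(1 - alpha/2), which part (c) asks you to compare with the likelihood ratio values.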
(b) Consider the problem aftesting H : Jl. I'· (c) Find a level 0. i = 1.3... whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre.9. 190. The following blood pressures were obtained in a sample of size n = 5 from a certain population: 124.Xn are said to be serially correlated or to follow an autoregressive model if we can write ~ . 0'2). . Forage production is also measured in pounds per acre.95 when nl = nz = ~n and (1'1 1'2)/17 ~ ~.L. e 1 ) depends on X only throngh T. . Let Xl.05 onesample t test. find the sample size n needed for the level 0.05. 6. .tl > {t2. 0'2 ~ ! corresponding to inversion of the (c) Compute a level 0.4. (a) Find a level 0. (c) Using the normal approximation <l>(z(a)+ nl nz/n(1'1 I'z) /(7) to the power. respectively. The control measurements (x's) correspond to 0 pounds of mulch per acre. 9. Assume the onesample normal model..292 Testing and Confidence Regions Chapter 4 5. .90 confidence interval for the mean bloC<! pressure J.. €n are independent N(O. 7. and that T is sufficient for O.O). .. .P.l.. eo... = 0 and €l. .95 confidence interval for p. . 1 I 794 2012 1800 2477 576 3498 411 2092 897 1808 I Assume the twosample normal model with equal variances.Xn1 and YI . Suppose X has density p(x.90 confidence interval for a by using the pivot 8 2 ja z .L2 versus K : j. Assume a Show that the likelihood ratio statistic is equivalent to the twosample t statistic T. where a' is as defined in < ~.01 test to have power 0.1 < J. 100. can we conclude that the mean blood pressure in the population is significantly larger than IDO? (b) Compute a level 0.9. n.. . (X. .. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall. 110..2' (12) samples. (a) Show that the MLE of 0 ~ (1'1. (a) Using the size 0: = 0.lIl (7 2) and N(Jl. 0 E e.1'2. x y vi I I . • Xi = (JXi where X o I 1 + €i. . Show that A(X. The nonnally distributed random variables Xl. (b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use ct = 0. 8. (7 2) is Section 4. I Yn2 be two independent N(J.95 confidence interval for equaltailed tests of Problem 4.z . a 2 ) random variables.. 114. Y.
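For the blood pressure problem above the computations are short; here is a Python sketch of parts (a) and (b) using the five values listed in the problem.

```python
import numpy as np
from scipy.stats import t as t_dist

bp = np.array([124.0, 110.0, 114.0, 100.0, 190.0])   # the five blood pressures
n, mu0, alpha = len(bp), 100.0, 0.05

xbar, s = bp.mean(), bp.std(ddof=1)
T = np.sqrt(n) * (xbar - mu0) / s                     # one-sample t statistic
crit = t_dist.ppf(1 - alpha, n - 1)
print("T =", round(T, 3), " reject H: mu <= 100?", T > crit)

# part (b): level 0.90 two-sided confidence interval for mu
half = t_dist.ppf(0.95, n - 1) * s / np.sqrt(n)
print("90% CI for mu: (", round(xbar - half, 1), ",", round(xbar + half, 1), ")")
```

With these five values T is approximately 1.72, below the t_4(0.95) critical value 2.132; the single large observation 190 inflates s and weakens the evidence against H despite the large sample mean.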
/ L:: i 10. 0) by the following table. .O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if.10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl. Fix 0 < a < and <>/[2(1 . Yl. has level a and is strictly more 11.OXi_d 2} i= 1 < Xi < 00. X powerful whatever be B. Yn2 be two independent N(P. (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl. Y2 )dY2.2) In exp{ (I/Z(72) 2)Xi ... The power functions of one.6. P. . samples from N(J1l. .O) for 00 = (Z. . II. for each v > 0.6. (Yl.2. X n1 .n.. 1 {= xj(kl)ellx+(tVx/k'I'ldx. Stein). 13.[Z > tVV/k1 is increasing in 6.<» (1 . with all parameters assumed unknown.[T > tl is an increasing function of 6.. (b) P.and twosided t tests. Show that.IIZI > tjvlk] is increasing in [61. Then. I).. Consider the following model. respectively. From the joint distribution of Z and V. 7k. P. (Tn. Let Xl.. . = 0. 7k.. i = 1. ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~. Then use py. distribution. Suppose that T has a noncentral t.IITI > tl is an increasing function of 161. Let e consist of the point I and the interval [0. has density !k . Show that the noncentral t distribution.(t) = . and only if.Section 4. Hint: Let Z and V be independent and have N(o. .. (Yl) = f PY"Y.'" p(X. Condition on V and apply the double expectation theorem. Xo = o.<»1 < e < <>. (a) P. J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint. 12.. get the joint distribution of Yl = ZI vVlk and Y2 = V. . (An example due to C. The F Test for Equality of Scale.. (T~). Define the frequency functions p(x. XZ distributions respectively.
(d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4. i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2. • . can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I .0 has the same dis r j . the continuous versiop of (B.. . Un). as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2.7 and B.~ I i " 294 Testing and Confidence Regions Chapter 4 . si = L v. where a. F > /(1 . ! . . Finally. .4. i=2 i=2 n n S12 =L i=2 n U. .X)' (b) Show that (aUa~)F has an F nz obtained from the :F table.~ I .4. . x y 254 2.Y)'/E(X. a?.a/2) or F < f( n/2). and use Probl~1ll4. that T is an increasing function of R. .4.nt 1 distribution. L Yj'.24) implies that this is also the unconditional distribution.0 has the same distribution as R.19 315 2. then P[p > c] is an increasing function of p for fixed c.. . 15. (Un. I ' LX. p) distribution. • I " and (U" V. .7 to conclude that tribution as 8 12 /81 8 2 . T has a noncentral Tnz distribution with noncentrality parameter p.. 0. (1~.' ! . (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where . .9. Because this conditional distribution does not depend on (U21 •.4.nl1 ar is of the fonn: Reject if. 1. . . p) distribution. (a) Show that the LR test of H : af = a~ versus K : ai > and only if. r> (c) Justify the twosided F test: Reject H if. !I " . . Argue as in Proplem 4..5) that 2 log '(X) V has a distribution. (11.. .9. Sbow..'. 1 . .68 250 2. Hint: Use the transfonnations and Problem B. Let R = S12/SIS" T = 2R/V1 R'. .1I(a). 0. F ~ [(nl . (Xn .4) and (4.'. Let (XI. .1)]E(Y. Yn) be a sampl~ from a bivariateN(O. Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model.71 240 2.12 337 1. . show that given Uz = Uz .. and only if. vn 1 l i 16. 14. The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study..V. distribution and that critical values can be .1' Sf = L U.9. . p) distribution.1. where f( t) is the tth quantile of the F nz . Vn ) is a sample from a N(O. .1)/(n. <.8. and using the arguments of Problems B.). 1.J J . Un = Un. . i 0 in xi !:. using (4. YI ).64 298 2.61 310 2.4. · . > C.62 284 2.10. note that . Consider the problem of testing H : p = 0 versus K : p i O. where l _ .1 .96 279 2.8. V.37 384 2.94 Using the likelihood ratio test for the bivariate nonnal model.
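For Problems 15 and 16 the likelihood ratio test of rho = 0 reduces to referring T = sqrt(n - 2) r / sqrt(1 - r^2) to the t distribution with n - 2 degrees of freedom. The sketch below implements this recipe; it uses simulated placeholder data, and the ten (cholesterol, weight/height) pairs from the table can be substituted directly for x and y.

```python
import numpy as np
from scipy.stats import t as t_dist

def correlation_test(x, y, alpha=0.10):
    """Test H: rho = 0 in the bivariate normal model using
    T = sqrt(n - 2) r / sqrt(1 - r^2), which is t_{n-2} under H."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    T = np.sqrt(n - 2) * r / np.sqrt(1 - r ** 2)
    crit = t_dist.ppf(1 - alpha / 2, n - 2)
    return r, T, abs(T) > crit

# Illustration with hypothetical data of the same size as Problem 16.
rng = np.random.default_rng(3)
x = rng.normal(290, 40, size=10)
y = 2.5 + 0.002 * (x - 290) + rng.normal(0, 0.3, size=10)
print(correlation_test(x, y))
```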
00.3). P. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete. . 187.0.11 NOTES Notes for Section 4.. "Is there a sex bias in graduate admissions?" Science. 1974) to the critical value is <.10.3. W. Essentially.9.[ii) where t = 1. . E. Because the region contains C(X). (a) Find the level" LR test for testing H : 8 E given in Problem 3. PROSCHAN. Wu.851. 00. Tl . AND J.. REFERENCES BARLOW. Apology for Ecumenism in Statistics and Scientific lnjerence.12 1965. !. T6 .2. AND F. Consider the cases Ii.o = 0. Editors New York: Academic Press.15 is used here. Nntes for Section 4. (2) We ignore at this time some reallife inadequacies of this experiment such as the placebo effect (see Example 1. Box.(t) = tl(. F. E.~. 1983. S in 8(X) would be replaced by 5 + and 5 in 8(X) is replaced by 5 .). i].01 + 0. G.1.Section 4.819 for" = 0. Rejection is more definitive. Mathematical Theory of Reliability New York: J. (2) The theory of complete and essentially complete families is developed in Wald (1950). and r. (2) In using 8(5) as a confidence hound we are using the region [8(5). T. The term complete is then reserved for the class where strict inequality in (4..05 and 0. Consider the bioequivalence example in Problem 3. P.2. Data Analysis and Robustness.3) holds for some 8 if <p 't V. if the parameter space is compact and loss functions are bounded. Stephens. see also Ferguson (1967). Box.4 (I) If the continuity correction discussed in Section A. and C. (b) Compare your solution to the Bayesian solution based on a continuous loss function rro = 0.[ii . HAMMEL.398404 (975).9. respectively. O'CONNELL.895 and t = 0. (3) A good approximation (Durbin.01. BICKEL. Wiley & Sons.3 (1) Such a class is sometimes called essentially complete.I\ E. R.035. it also has confidence level (1 . Leonard. 1973.. Notes for Section 4.11.. the class of Bayes procedures is complete. 4. G.0. i] versus K : 8 't [i.... Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding.11 Notes 295 17.1 (1) The point of view usually taken in science is that of Karl Popper [1968]. 4.
BROWN, L., T. CAI, AND A. DAS GUPTA, "Interval estimation for a binomial proportion," The American Statistician, 54 (2000).
DOKSUM, K. A., G. FENSTAD, AND R. AABERGE, "Plots and tests for symmetry," Biometrika, 64, 473-487 (1977).
DOKSUM, K. A., AND G. SIEVERS, "Plotting with confidence: Graphical comparisons of two populations," Biometrika, 63, 421-434 (1976).
DURBIN, J., "Distribution theory for tests based on the sample distribution function," Regional Conference Series in Applied Math., 9, SIAM, Philadelphia, Pennsylvania (1973).
FERGUSON, T. S., Mathematical Statistics: A Decision Theoretic Approach New York: Academic Press, 1967.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner Publishing Company, 1958.
HALD, A., Statistical Theory with Engineering Applications New York: J. Wiley & Sons, 1952.
HEDGES, L., AND I. OLKIN, Statistical Methods for Meta-Analysis Orlando, FL: Academic Press, 1985.
JEFFREYS, H., The Theory of Probability, 3rd ed. Oxford: Oxford University Press, 1961.
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
OOSTERHOFF, J., AND W. R. VAN ZWET, "On the combination of independent test statistics," Ann. Math. Statist., 38, 659-680 (1967).
POPPER, K., Conjectures and Refutations; The Growth of Scientific Knowledge New York: Harper and Row, 1968.
PRATT, J. W., "Length of confidence intervals," J. Amer. Statist. Assoc., 56, 549-567 (1961).
SACKROWITZ, H., AND E. SAMUEL-CAHN, "P values as random variables: Expected P values," The American Statistician, 53, 326-331 (1999).
STEIN, C., "A two-sample test for a linear hypothesis whose power is independent of the variance," Ann. Math. Statist., 16, 243-258 (1945).
STEPHENS, M. A., "EDF statistics for goodness of fit," J. Amer. Statist. Assoc., 69, 730-737 (1974).
TATE, R. F., AND G. W. KLETT, "Optimal confidence intervals for the variance of a normal distribution," J. Amer. Statist. Assoc., 54, 674-682 (1959).
WALD, A., Sequential Analysis New York: Wiley, 1947.
WALD, A., Statistical Decision Functions New York: Wiley, 1950.
WANG, Y. Y., "Probabilities of the type I errors of the Welch tests," J. Amer. Statist. Assoc., 66, 605-608 (1971).
WELCH, B. L., "Further notes on Mrs. Aspin's tables," Biometrika, 36, 243-246 (1949).
WETHERILL, G. B., AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
WILKS, S. S., Mathematical Statistics New York: J. Wiley & Sons, 1962.
Chapter 5

ASYMPTOTIC APPROXIMATIONS

5.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS

Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific $P$ by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample $X_1, \ldots, X_n$ from a distribution $F$, our setting for this section and most of this chapter. If we want to estimate $\mu(F) = E_F X_1$ and use $\bar{X}$ we can write

$$E_F(\bar{X} - \mu(F))^2 = \frac{\mathrm{Var}_F(X_1)}{n}. \qquad (5.1.1)$$

This is a highly informative formula, telling us exactly how the MSE behaves as a function of $n$, and calculable for any $F$ and all $n$ by a single one-dimensional integration. However, consider $\mathrm{med}(X_1, \ldots, X_n)$ as an estimate of the population median $\nu(F)$. If $n$ is odd, $\nu(F) = F^{-1}(1/2)$, and $F$ has density $f$, we can write

$$E_F(\mathrm{med}(X_1, \ldots, X_n) - \nu(F))^2 = \int (x - \nu(F))^2 g_n(x)\, dx \qquad (5.1.2)$$

where, if $n = 2k + 1$,

$$g_n(x) = n \binom{2k}{k} F^k(x)(1 - F(x))^k f(x). \qquad (5.1.3)$$

Evaluation here requires only evaluation of $F$ and a one-dimensional integration, but a different one for each $n$. Worse, the qualitative behavior of the risk as a function of $n$ and simple parameters of $F$ is not discernible easily from (5.1.2) and (5.1.3). To go one step further, consider evaluation of the power function of the one-sided $t$ test of Chapter 4. If $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ we have seen in Section 4.9.2 that $\sqrt{n}\bar{X}/S$ has a noncentral $t$ distribution with parameter $\sqrt{n}\mu/\sigma$ and $n - 1$ degrees of freedom. This distribution may be evaluated by a two-dimensional integral using classical functions
. X n + EF(Xd or p £F( . Thus.3. based on observing n i. more specifically. {X 1j l ' •• .Xnj ) . It seems impossible to determine explicitly what happens to the power function because the distribution of fiX / S requires the joint distribution of (X.2) and its qualitative properties are reasonably transparent. . .. The first. S) and in general this is only representable as an ndimensional integral. just as in numerical integration.fii(Xn  EF(Xd) + N(O. or the sequence Asymptotic statements are always statements about the sequence. X n as n + 00.4) i RB =B LI(F. in this context.. The classical examples are. 00. _.. Approximately evaluate Rn(F) by _ 1 B i (5.. VarF(X 1 ». Rn (F). In its simplest fonn. Draw B independent "samples" of size n.)2) } .n (Xl. . . for instance the sequence of means {Xn }n>l. which occupies us for most of this chapter.I I !' 298 Asymptotic Approximations Chapter 5 (Problem 5. .d. Monte Carlo is described as follows. .1.1. . 10=1 f=l There are two complementary approaches to these difficulties. But suppose F is not Gaussian. which we explore further in later chapters. . where X n = ~ :E~ of medians. or it refers to the sequence of their distributions 1 Xi. fiB ~ R n (F). 1 < j < B from F using a random number generator and an explicit fonn for F.. Asymptotics.3. but for the time being let's stick to this case as we have until now.Xn):~Xi<n_l (~2 (EX. .i. The other. • . of distributions of statistics. we can approximate Rn(F) arbitrarily closely. always refers to a sequence of statistics I. I j {Tn (X" . save for the possibility of a very unlikely event. is to approximate the risk function under study by a qualitatively simpler to understand and easier to compute function. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asyrnptotics briefly in Example 5. observations Xl. where A= { ~ . i .fiit !Xi. X nj }. Xn)}n>l.. j=l • By the law of large numbers as B j. is to use the Monte Carlo method. Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or.o(X1). We shall see later that the scope of asymptotics is much greater..
1 Introduction: The Meaning and Uses of Asymptotics 299 In theory these limits say nothing about any particular Tn (Xl.9) As a bound this is typically far too conservative.1.4.' m (5. below .1.1.7) where 41 is the standard nonnal d. (see A.6) for all £ > O. 1') < z] ~ ij>(z) (5. For instance.1. which are available in the classical situations of (5.3).6) to fall... . if more delicate Hoeffding bound (B. £ = .2 is unknown be. (5.1.1.1.2 hand side of (5./' is as above and .' 1'1 > €] < .11) .14.10) < 1 with ...9) is .1. Thus. then P [vn:(X F n . . The trouble is that for any specified degree of approximation.f. and P F • What we in principle prefer are bounds. for n sufficiently large. X n ) we consider are closely related as functions of n so that we expect the limit to approximate Tn(Xl~"· .15.8) are given in Problem 5.. DOD? Similarly..25 whereas (5.1.. . the celebrated BerryEsseen bound (A. For € = . (5. We interpret this as saying that.6) gives IX11 < 1. . Xn is approximately equal to its expectation. (5. say. the much PFlIx n  Because IXd < 1 implies that .Section 5. PF[IX n _  . ' . X n )) (in an appropriate sense).2 . X n ) but in practice we act as if they do because the T n (Xl.1.1. x. if EFXf < 00.l) (5.(Xn .VarF(Xd.Xn ) or £F(Tn (Xl.. Further qualitative features of these bounds and relations to approximation (5.6) and (5. Is n > 100 enough or does it have to be n > 100.omes 11m'. .9) when .1.. 7)..1. the weak law of large numbers tells us that. As an approximation. 1'1 > €] < 2exp {~n€'}. say.1') < x] _ ij>(x) < CEFIXd' v'~ 3 1/' " "n (5.' = 1 possible (Problem 5. For instance.1. the right sup PF x [r..10) is .1. n = 400.l1) states that if EFIXd' < 00. this reads (5. the central limit theorem tells us that if EFIXII < 00.1.6) does not tell us how large n has to be for the chance of the approximation ~ot holding to this degree (the lefthand side of (5.5) That is. if EFIXd < 00. by Chebychev's inequality.1.01.8) Again we are faced with the questions of how good the approximation is for given n.. then (5.01.14.9. Similarly.1.
The qualitative implications of results such as are very impor~ tant when we consider comparisons between competing procedures.12) where (T(O. consistency is proved for the estimates of canonical parameters in exponential families. For instance. • c F (y'n[Bn . (5. which is reasonable. We now turn to specifics. .11) is again much too consctvative generally. i .1. Yet. Asymptotics has another important function beyond suggesting numerical approximations for specific nand F. and asymptotically normal.1.3. good estimates On of parameters O(F) will behave like  Xn does in relation to Ji. Although giving us some idea of how much (5. although the actual djstribution depends on Pp in a complicated way. • model.F) > N(O 1) . Section 5. even here they are not a very reliable guide. As we mentioned. d) = '(II' . as we have seen. . . quite generally. : i .11) suggests. "(0) > 0.' (0)( (1 / y'n)( v'27i') (Problem 5. behaves like .8) differs from the truth. (b) Their validity for the given n and Ttl for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. If they are simple.1. It suggests that qualitatively the risk of X n as an estimate of Ji. I . Bounds for the goodness of approximations have been available for X n and its distribution to a much greater extent than for nonlinear statistics such as the median.2 deals with consistency of various estimates including maximum likelihood.I '1. for any loss function of the form I(F. The arguments apply to vectorvalued estimates of Euclidean parameters. 1 i .1.1.1. Consistency will be pursued in Section 5. In particular.di) where '(0) = 0.7) says that the behavior of the distribution of X n is for large n governed (approximately) only by j.t and 0"2 in a precise way. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo.5) and quite generally that risk increases with (1 and decreases with n. (5.e(F)]) (1(e. The estimates B will be consistent. As we shall see.(l) The approximation (5. B ~ O(F). asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. F) typically is the standard deviation (SD) of J1iOn or an approximation to this SD. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17  j • i.1. (5. for all F in the n n .8) is typically much betler than (5.300 Asymptotic Approximations Chapter 5 where C is a universal constant known to be < 33/4.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for oneparameter exponential families. Section 5. Practically one proceeds as follows: (a) Asymptotic approximations are derived.2 and asymptotic normality via the delta method in Section 5.' I I • If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good.
lS. AlS. 1denotes Euclidean distance. distributions. 5. . but we shall use results we need from A. The least we can ask of our estimate Qn(X I.7. 'in 1] q(8) for all 8. and probability bounds.Section 5.1). we talk of uniform cornistency over K. A stronger requirement is (5. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A. where P is the empirical distribution.2) Bounds b(n. . remains central to all asymptotic theory.) forsuP8 P8 lI'in . If is replaced by a smaller set K.2 5.i.14.2) is called unifonn consistency.) However. by quantities that can be so computed.1 CONSISTENCY PlugIn Estimates and MlEs in Exponential Family Models Suppose that we have a sample Xl. with all the caveats of Section 5. But. That is. A..i. > O..d.4 deals with optimality results for likelihoodbased procedures in onedimensional parameter models.2.2. P where P is unknown but EplX11 < 00 then.Xn ) is that as e n 8 E ~ e. We will recall relevant definitions from that appendix as we need them.7 without further discussion. which is called consistency of qn and can be thought of as O'th order asymptotics. e Example 5. If Xl.1. asymptotics are methods of approximating risks.2.2. Means. The stronger statement (5. . for all (5.1). (5.2 Consistency 301 in exponential families among other results. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to 00.14. in accordance with (A.2.14. for .5 we examine the asymptotic behavior of Bayes procedures. . For P this large it is not unifonnly consistent. Finally in Section 5.. The simplest example of consistency is that of the mean.2. Monte Carlo.1) and (B. and B.q(8)1 > 'I that yield (5. if.7. Summary. We also introduce Monte Carlo methods and discuss the interaction of asymptotics. 1 X n are i. and other statistical quantities that are not realistically computable in closed form. by the WLLN.d. case become increasingly valid as the sample size increases. Most aSymptotic theory we consider leads to approximations that in the i.2. In practice.2) are preferable and we shall indicate some of qualitative interest when we can. X ~ p(P) ~ ~  p = E(XJ) and p(P) = X.2. and 8.Xn from Po where 0 E and want to estimate a real or vector q(O)..1. . Section 5. (See Problem 5. .2. is a consistent estimate of p(P).. 00.1) where I . ' .
then by Chebyshev's inequality. with Xi E X 1 .2.2. q(P) is consistent. by A. But further. 0 ~ To some extent the plugin method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plugin estimates. w.1.2.ql > <] < Pp[IPn .• X n be the indicators of binomial trials with P[X I = I] = p.2. consider the plugin estimate p(Iji)ln of the variance of p. . w( q. p' E S. Binomial Variance. 0 In fact. Asymptotic Approximations Chapter 5 X is uniformly consistent over P because o Example 5. Then N = LXi has a B(n. Because q is continuous and S is compact. it is uniformly continuous on S. .. l iA) E S be the empirical distribution. xd is the range of Xl' Let N i = L~ 11(Xi = Xj) and Pj _ Njln.2. Suppose thnt P = S = {(Pl.b) say. If q is continuous w(q. 0 < p < 1. in this case. P = {P : EpX'f < M < oo}. w(q.. Evidently. for all PEP. Other moments of Xl can be consistently estimated in the same way.3) Evidently.l : [a.14... Letw. Then qn q(Pn) is a unifonnly consistent estimate of q(p).pi > 6«)] But. the kdimensional simplex. 6 >0 O.p'l < 6}. .b] < R+bedefinedastheinver~eofw.·) is increasing in 0 and has the range [a. .LJ~IPJ = I}. w(q. Ip' pi < 6«). = Proof. By the weak law of large numbers for all p. . sup{ Pp[liln pi > 61 : pES} < kl4n6 2 (Problem 5. where Pi = PIXI = xi]' 1 < j < k.1) and the result follows.I l 302 instance. there exists 6 «) > 0 such that p.2. 0) is defined by . which is ~ q(ji).p) distribution. (5.4) I i (5.. Thus. 6) It easily follows (Problem 5. Suppose that q : S + RP is continuous. i . for every < > 0." is the following: I X n are LLd.3) that > <} (5.6) = sup{lq(p)  q(p')I: Ip . · .2.. Pp [IPn  pi > 6] ~ .. implies Iq(p')q(p) I < <Then Pp[li1n . and (Xl.2.l «) = inf{6 : w(q.Pk) : 0 < Pi < 1. and p = X = N In is a uniformly consistent estimate of p.6. we can go further. where q(p) = p(1p).. Let X 1. Suppose the modulus ofcontinuity of q.5) I A simple and important result for the case in which X!. Pn = (iiI.6)! Oas6! O. Theorem 5.1 < j < k.. .
d. 0 Here is a general consequence of Proposition 5. let mj(O) '" E09j(X. X n are a sample from PrJ E P. ' . conclude by Proposition 5.) I < that h(D n) Foreonsisteney of h(g) apply Proposition B. !:.V2.3.. U implies !'. then vIP) h(g). 1 < i < n be i.6. (ii) ij is consistent. and correlation coefficient are all consistent.1.1 and Theorem 2. Var(U.UV) so that E~ I g(Ui . 'Vi) is the statistic generating this 5parameter exponential family.2.U 2. = Proof.1 . We may. Variances and Correlations. 11. Jl2. 1 are discussed in Problem 5.4.v} = (u. . Suppose [ is open.1.. ai. More generally if v(P) = h(E p g(X . 1 < j < d.2. if h is continuous..then which is well defined and continuous at all points of the range of m. h(D) for all continuous h. is consistent for v(P). Then.9d) map X onto Y C R d Eolgj(X')1 < 00.V.i. N2(JLI.) > 0. Questions of uniform consistency and consistency when P = { DistribuCorr(Uj.. If we let 8 = (JLI. 1 < j < d.a~.p).ar. Let g(u. If Xl. Then. Theorem 5. . af > o.) > 0. is a consistent estimate of q(O).p).2. ag. Let g (91. We need only apply the general weak law of large numbers (for vectors) to conclude that (5.)1 < 1 } tions such that EUr < 00. (i) Plj [The MLE Ti exists] ~ 1.2 Consistency 303 Suppose Proposition 5.6) if Ep[g(X. o Example 5.). where h: Y ~ RP. )) and P = {P : Eplg(X .2.JL2.2. .Section 5. Let TI. where P is the empirical distribution.Jor all 0. EVj2 < 00.2.lpl < 1.. Suppose P is a canonical exponentialfamily of rank d generated by T.3. and let q(O) = h(m(O)). [ and A(·) correspond to P as in Section 1. then Ifh=m. Var(V. thus. Let Xi = (Ui . .2. Vi).1 that the empirical means. )1 < oo}. variances.1: D n 00.2.7.
Let 0 be a minimum contrast estimate that minimizes I .d.3.2. The argument of the the previous subsection in which a minimum contrast estimate.. i I .2.lI) .1I 0 ) foreverye I .  inf{D(II.8) I > O.2.LT(Xi ).E1).304 Asymptotic Approximations Chapter 5 .. which solves 1 n A(1)) = .D(1I0 . j 1 Pn(X. and the result follows from Proposition 5. . is solved by 110.. Suppose 1 n PII sup{l. i=l 1 n (5. i i • • ! . the inverse AI: A(e) ~ e is continuous on 8. 1 i . exists iff the event in (5.1 that i)(Xj. {t: It . T(Xl). n LT(Xi ) i=I 21 En. (5.. .lI o): 111110 1 > e} > D(1I0 .7) occurs and (i) follows. BEe c Rd.1 that On CT the map 17 A(1]) is II and continuous on E. the MLE. I . We showed in Theorem 2.p(X" II) is uniquely minimized at 11 i=l for all 110 E 6. D . (T(Xl)) must hy Theorem 2.1 belong to the interior of the convex support hecause the equation A(1)) = to.. II) L p(X n.3.2. II) = where. By a classical result.L[p(Xi ..9) n and Then (} is consistent. 5. By definition of the interior of the convex support there exists a ball 8. '1 P1)'[n LT(Xi ) E C T) ~ 1. II) 0 j =EII.. .2. A more general argument is given in the following simple theorem whose conditions are hard to check.3. as usual. I ! .2. z=l 1 n i.. . I I Proof Recall from Corollary 2. if 1)0 is true. D( 110 .1I)11: II E 6} ~'O (5.d. .. Let Xl.. T(Xdl < 6} C CT' By the law of large numbers. I J Theorem 5. 7 I .7) I . is a continuous function of a mean of Li. . • .j. see. 0 Hence.3. Po.2. n . E1). for example.= 1 . But Tj.3. Rudin (1987). .1 to Theorem 2.2 Consistency of Minimum Contrast Estimates .. where to = A(1)o) = E1)o T(X 1 ). X n be Li. Note that. vectors evidently used exponential family properties. = 1 n p.. .Xn ) exists iff ~ L~ 1 T(X i ) = Tn belongs to the interior CT of the convex support of the distribution of Tn.1. .
IJ) p(Xi.2. .IIIJ  IJol > c] (5.IJ o)  D(IJo.12) which has probability tending to 0 by (5.8). But for ( > 0 let o ~ ~ inf{D(IJ.11) implies that 1 n ~ 0 (5. o Then (5..2. and (5. IJo)) .[IJ ¥ OJ] = PIJ. ifIJ is the MLE. IJj))1 > <I : 1 < j < d} ~ O. (5. IIJ ..8) follows from the WLLN and PIJ. [I ~ L:~ 1 (p(Xi . 2 0 (5.1 we need only check that (5. IJ) n ~ i=l p(X"IJ o)] .2. IJ)II ' IJ n i=l E e} > .2.2 Consistency 305 proof Note that.2.2.inf{D(IJo.11) sup{l. (5.IJO)) ' IIJ IJol > <} n i=l .2. {IJ" . PIJ. IJ).9) hold for p(x. PIJ. then. for all 0 > 0.5.10) By hypothesis. Coronary 5. IJ) .2.2. IJ) . is finite.D(IJo. 0) .2. EIJollogp(XI.2. _ I) 1 n PIJ o [16 . Ie ¥ OJ] ~ Ojoroll j. IJ) = logp(x.2. An alternative condition that is readily seen to work more widely is the replacement of (5. A simple and important special case is given by the following.2. E n ~ [p(Xi .D(IJo. IJ)I : IJ K } PIJ !l o.14) (ii) For some compact K PIJ."'[p(X i . But because e is finite.8) and (5. /fe e =   Proof Note that for some < > 0.IJol > E] < PIJ [inf{ . IJ) .L[p(Xi .IJor > <} < 0] because the event in (5.[max{[ ~ L:~ 1 (p(Xi .D(IJ o.10) tends to O. . IIJ .L(p(Xi.IJd ). IJj ) .8) can often failsee Problem 5.Section 5.2.IJol > <} < 0] (5. 0 Condition (5.lJ o): IIJ IJol > oj.11) implies that the righthand side of (5.2.9) follows from Shannon's lemma.2. IJ j )) : 1 <j < d} > <] < d maxi PIJ.p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1.D( IJ o.13) By Shannon's Lemma 2.8) by (i) For ail compact K sup In { c e. [inf c e. 1 n PIJo[inf{.1.IJ)1 < 00 and the parameterization is identifiable.2.D(IJo.2. IJ j ) .
. sequence of estimates) consistency.2. =suPx 1k<~)(x)1 <M < .B(P)I > Ej : PEP} t 0 for all € > O. Let h : R ~ R. then = h(ll) + L (j)( ) j=l h ..2. We conclude by studying consistency of the MLE and more generally Me estimates in the case e finite and e Euclidean. I ~. D . > 2. that sup{PIiBn .3.1) where • ..1 The Delta Method for Moments • • We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders. If (i) and (ii) hold.8) and (5. let Ilglioo = sup{ Ig( t) I : t E R) denote the sup nonn. If fin is an estimate of B(P).AND HIGHERORDER ASYMPTOTlCS: THE DelTA METHOD WITH APPLICATIONS We have argued.1. consistency of the MLE may fail if the number of parameters tends to infinity. Unfortunately checking conditions such as (5. When the observations are independent but not identically distributed.3. m Mi) and assume (b) IIh(~)lloo j 1 . We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families. + Rm (5.14) is in general difficult.3 FIRST.) = <7 2 Eh(X) We have the following.d.1 that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk..33.Xn be i.306 Asymptotic Approximations Chapter 5 We shall see examples in which this modification works in the problems. 5. ~ 5. and assume (i) (a) h is m times differentiable on R. ! ! (ii) EIXd~ < 00 Let E(X1 ) = Il. X valued and for the moment take X = R. Sufficient conditions are explored in the problems. We then sketch the extension to functions of vector means.3. VariX.Il E(X Il). . J. Summary.i. . We introduce the minimal property we require of any estimate (strictly speak ing. A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems. ~l Theorem 5. As usual let Xl. . see Problem 5. Unifonn consistency for l' requires more. we require that On ~ B(P) as n ~ 00. in Section 5. We denote the jth derivative of h by 00 .3.
X ij ) least twice. (n . +'r=j J . . Xij)1 = Elxdj il. The more difficult argument needed for (5.i j appears at by Problem 5.'1 +. ..'" . then (a) But E(Xi . for j < n/2.. (b) a unless each integer that appears among {i I . then there are constants C j > 0 and D j > 0 such (5. In!.3.4) for all j and (5.4) Note that for j eveo.. ... (5..3) and j odd is given in Problem 5. Moreover.. . bounded by (d) < t. . rl l:: . _ .3.. EIX I'li = E(X _1')' Proof..t" = t1 .and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion. :i1 + .1 < k < r}}.[jj2] + 1) where Cj = 1 <r... . .3 First. .3.. ik > 2.5[jJ2J max {l:: { . 21.3.) ~ 0. .1'1.1. t r 1 and [t] denotes the greatest integer n.2) where IX' that 1'1 < IX . so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r.. tI.3. and the following lemma. If EjX Ilj < 00.. . tr i/o>2 all k where tl . Lemma 5. 'Zr J . We give the proof of (5..3. Let I' = E(X. .3) (5. The expression in (c) is.3) for j even. . C [~]! n(n .ij } • sup IE(Xi . .3.. .1) .2. j > 2.5.Section 5. + i r = j.3..3.
3.l)h(J.. then G(n.l)}E(X . . respectively. I (b) Ifllh(j11l= < 00.+ O(n. 00 and 11hC 41 11= < then G(n.l)h(1)(J. using Corollary 5.3.308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) .llJ' + G(n3f2) (5..3. .. I I I Corollary 5. I .1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J.3f2 ) in (5.ll j < 2j EIXd j and the lemma follows. then I • I.3. (5.1.3.3f2 ). (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00.3.l) + {h(21(J.1 with m = 4.1. In general by considering Xi .3. Then Rm = G(n 2) and also E(X . .1.l) and its variance and MSE.3.3.3.3. EIX1 .4) for j odd and (5.5) can be replaced by Proof For (5. 0 1 I . and (e) applied to (a) imply (5.3.3) for j even.5) (b) if E(Xt) < G(n. i • r' Var h(X) = 02[h(11(J.JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 .3.1')2 = 0 2 In.1')3 = G(n 2) by (5.6) n ..1') + {h(2)(J. Because E(X . • . By Problem 5. (5.6.3f') in (5. if I' = O. if 1 1 <j (a) Ilh(J)II= < 00..5) apply Theorem 5.2 ).4).l) + [h(1)]2(J.6) can be 1 Eh 2 (X) = h'(J.1 with m = 3. Corollary 5.3. 1 < j < 3.3.l) + 00 2n I' + G(n3f2).1'11. then h(')( ) 2 0 = h(J.l)h(J. apply Theorem 5.5) follows.3.J. I . If the conditions of (b) hold.l)E(X . Proof (a) Write 00. 0 The two most important corollaries of Theorem 5.l) + [h(llj'(1')}':. (n li/2] + 1) < nIJf'jj and (c).l) 6 + 2h(J. n (b) Next.2. and EXt < replaced by G(n'). 1 iln .. I 1 . give approximations to the bias of h(X) as an estimate of h(J. (d). < 3 and EIXd 3 < 00.
.3. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX).exp( 2It).3.3.3.8) because h(ZI (t) = 4(r· 3 . (1Z 5. We will compare E(h(X)) and Var h(X) with their approximations.7) If h(t) ~ 1 .3 First. the bias of h(X) defined by Eh(X) .3.2 ) 2e.3. Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5. as the plugin estimate of the parameter h(Jl) then. 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5. the MLE of ' is X . by Corollary 5. X n are i.5). {h(l) (/1)h(ZI U')/13 (5. (5.1 with bounds on the remainders. Thus.= E"X I ~ 11 A.Z.1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours.3.2.and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5. . A qualitatively simple explanation of this important phenonemon will be given in Theorem 5. If heX) is viewed. as we nonnally would. by Corollary 5.11) Example 5.4) E(X .h(/1) is G(n. where /1. and.. Bias and Variance of the MLE of the Binomial VanOance. which is neglible compared to the standard deviation of h(X). for large n. To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5.h(/1)) h(2~(II):: + O(n. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +. Bias./1)3 = ~~.1. which is G(n.i.9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n.3.3. then heX) is the MLE of 1 .exp( 2') c(') = h(/1). T!Jus.(h(X) .Z) (5.4 ) exp( 2/t).3.. Example 5.l ).3. .3.Section 5.t.  .6).d.t) and X.Z ".3.(h(X)) E.3(1_ ')In + G(n.3.3.i.3.2 (Problem (5..10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 .l / Z) unless h(l)(/1) = O. Note an important qualitative feature revealed by these approximations. then we may be interested in the warranty failure probability (5.1.1) = 11. when hit) = t(1 . If X [.
.2t.10) is in a situation in which the approximation can be checked. R'n p(1 ~ p) [(I .2p) n n +21"(1 . . 1 and in this case (5. • ~ O(n. M2) ~ 2.3. ~ Varh(X) = (1.p)] n I " . + id = m.p)'} + R~ p(l .p(1 .p)(I. 1 < j < { Xl . I .j .p) = p ( l . First calculate Eh(X) = E(X) .p(1 . .5) is exact as it should be.' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2.6p(l. II'.310 Asymptotic Approximations Chapter 5 B(l.p) .2p)')} + R~. the error of approximation is 1 1 + . Next compute Varh(X)=p(I.3.{2(1.E(X') ~ p .2p(1 .. asSume that h has continuous partial derivatives of order up to m.e x ) : i l + .9d(Xi )f. n n Because MII(t) = I ..p)J n p(1 ~ p) [I .p) .[Var(X) + (E(X»2] I nI ~ p(1 . .P )} (n_l)2 n nl n Because 1'3 = p(l.3. a Xd d} .p)(I.2p)' . .!c[2p(1 n p) . and will illustrate how accurate (5.2p). 0 < i.2.5) yields E(h(X))  ~ p(1 .2p)p(l.pl.p) n .amh i. (5.p) + . D " ~ I . • . Theorem 5.. . " ! . Let h : R d t R. < m.10) yields . .P) {(1_2 P)2+ 2P(I..' . and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i.p) {(I _ 2p)2 n Thus.2(1 . (5.2p)2p(1 .3. D 1 i I .p). Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi)..3 ).3.
4..and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00. under appropriate conditions (Problem 5.).14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d .14) + (g:.11. Theorem 5.I' < 00.) )' Var(Y. as for the case d = 1.3.3).)} + O(n (5. (7'(h)) where and (7' = VariX. = EY 1 . E xl < 00 and h is differentiable at (5.h(!..3.15) Then .3. and (5.) Var(Y.) + 2 g:'..6).) gx~ (J1. B. (5. The most interesting application.3. then O(n. 1.'.3. if EIY.3.5). is to m = 3.3. Lemma 5. (J1.13).hUll)~.C( v'n(h(X) .))) ~ N(O.3 First.Section 5.3. + ~ ~:l (J1.) + ~ {~~:~ (J1.»)'var(Y'2)] +O(n') Approximations (5.3. (5..) Cov(Y".13) Moreover.3 / 2 ) in (5. by (5.3. The proof is outlined in Problem 5.3.2 ).) Var(Y. and the appropriate generalization of Lemma 5.~ (J1. ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi. h : R ~ R.3. EI Yl13 < 00 Eh(Y) h(J1.3.. Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00. The results in the next subsection go much further and "explain" the fonn of the approximations we already have. 1 < j < d where Y iJ I = 9j(Xi ). Y12 ) (5. (J1. 5. for d = 2.12) Var h(Y) ~ [(:.). and J.2 The Delta Method for In law Approximations = As usual we begin with d !.x.2.1 Then. Y 12 ) 3/2). Similarly.1.8.) Cov(Y".3. then Eh(Y) This is a consequence of Taylor's expansion in d variables. .(J1.3.3.12) can be replaced by O(n.~E(X. We get.. Suppose thot X = R. 0/ The result follows from the more generally usefullernma.) + a.".
Consider 1 Vn = v'n(X !") !:. 'j But (e) implies (f) from (b). j . . . I . ~: .N(O. j " But.32 and B. .ul < Ii '* Ig(v)  g(u) . 0 . from (a). hence.3. by hypothesis.7.a 2 ). (ii) 9 : R Then > R is d(fferentiable at u with derivative 9(1) (u). for every Ii > 0 (d) j. Thus. Therefore. for every (e) > O. • • . I (g) I ~: . ( 2 ). an = n l / 2.3. V .17) .3. . = I 0 . for every € > 0 there exists a 8 > 0 such that (a) Iv . V and the result follows. V for some constant u.g(ll(U)(v  u)1 < <Iv .. (5. _ F 7 . The theorem follows from the central limit theorem letting Un X.N(O.• ' " and. ~ EVj (although this need not be true.15) "explains" Lemma 5.ul Note that (i) (b) '* .i. PIIUn €  ul < iii ~ 1 • "'.3. ·i Note that (5.3. Formally we expect that if Vn !:. By definition of the derivative. an(Un .16) Proof.8). see Problems 5.312 Asymptotic Approximations Chapter 5 (i) an(Un . u = /}" j \ V .1. . V. (e) Using (e). then EV.u) !:. we expect (5.u)!:.
Let Xl. . VarF(Xd = a 2 < 00. "t" Statistics. al = Var(X.+A)al + Aa~) . i=l If:F = {Gaussian distributions}.X) n 2 . Now Slutsky's theorem yields (5.2.2 and Slutsky's theorem.) and a~ = Var(Y. (1  A)a~ .Section 5.3.3. A statistic for testing the hypothesis H : 11. we can obtain the critical value t n.28) that if nI/n _ A. But if j is even.> 0 .18) In particular this implies not only that t n.. For the proof note that by the central limit theorem.l distribution.a) critical value (or Zla) is approximately correct if H is true and F is not Gaussian..3. 1'2 = E(Y. (a) The OneSample Case.3 we saw that the two sample t statistic Sn vn1n2 (Y X) ' n = = n s nl + n2 has a Tn2 distribution under H when the X's and Y's are normal withar = a~. Using the central limit theorem.. where g(u. (b) The TwoSample Case.17) yields O(n J. (1 . In Example 4.X" be i.3. FE :F where EF(X I ) = '". and the foregoing arguments. EZJ > 0.). In general we claim that if F E F and H is true.d.s Tn =. Let Xl. sn!a). and S2 _ n nl ( ~ X) 2) n~(X.3./2 ).3 First.18) because Tn = Un!(sn!a) = g(Un . then Tn . ••• . 1 2 a P by Theorem 5.9. 1).).'" 1 X n1 and Y1 .3. Consider testing H: /11 = j. Example 5.l (1. we find (Problem 5.'" . Then (5. j even = o(nJ/'). 1).1 (Ia) + Zla butthat the t n. v) = u!v.1 (1 . else EZJ ~ = O. N n (0 Aal.N(O.t2 versus K : 112 > 111. Yn2 be two independent samples with 1'1 = E(X I ).a) for Tn from the Tn . Slutsky's theorem.= 0 versuS K : 11.TiS X where S 2 = 1 nl L (X.and HigherOrder Asymptotics: The Delta Method with Applications 313 where Z ~ N(O. then S t:. £ (5.O < A < 1.j odd.i.
2 shows that when t n _2(10:) critical value is a very good approximation even for small n and for X. approximations based on asymptotic results should be checked by Monte Carlo simulations. Here we use the XJ distribution because for small to moderate d it is quite different from the normal distribution.05. or 50.I = u~ and nl = n2. then the critical value t.. .. each lime computing the value of the t statistics and then giving the proportion of times out of M that the t statistics exceed the critical values from the t table.. Figure 5. ur .. Each plotted point represents the results of 10.3.'. 10. One sample: 10000 Simulations.. where d is either 2. 20.5 . X~.3.c:'::c~:____:_J 0.1. . The simulations are repeated for different sample sizes and the observed significance levels are plotted. Chisquare data I 0..95) approximation is only good for n > 10 2 . Y . 0) for Sn is Monte Carlo Simulation As mentioned in Section 5. 316. and in this case the t n .2 or of a~. Figure 5. " .. and the true distribution F is X~ with d > 10.. .5 2 2. 32...5 . oL.1 shows that for the onesample t test..1..1 (0.02 . i I ... We illustrate such simulations for the preceding t tests by generating data from the X~ distribution M times independently.. ..5 1 1. i 2(1 approximately correct if H is true and the X's and Y's are not normal. the For the twosample t tests. as indicated in the plot.314 Asymptotic Approximations Chapter 5 It follows that if 111 = 11. The X~ distribution is extremely skew. when 0: = 0.000 onesample t tests using X~ data. the asymptotic result gives a good approximation when n > 10 1. Other distributions should also be tried. .5 3 Log10 sample size j i Figure 5.3..
Other Monte Carlo runs (not shown) with I=. N(O. the t n 2(1 . To test the hypothesis H : h(JL) = ho versus K : h(JL) > ho the natural test statistic is T.n2 and at I=.3.a~.X = ~. y'ii:[h(X) .. 10. 1 (ri .L) where h is con tinuously differentiable at !'.9.Xd.Xi have a symmetric distribution.12 0.3.5 2 2.hoi n s[h(l)(X)1 . For each simulation the two samples are the same size (the size indicated on the xaxis). 10000 Simulations. Moreover. Each plotted point represents the results of 10. However. let h(X) be an estimate of h(J.o'[h(l)(JLJf).n2 and at = a'). and a~ = 12af.' 0.2 (1 .0:) approximation is good when nl I=.95) approximation is good for nl > 100. 2:7 ar Two sample._' 0. By Theoretn 5.5 1 1.02 d'::'.cCc":'::~___:.a~ show that as long as nl = n2. or 50.(})) do not have approximate level 0.. in the onesample situation. and the data are X~ where d is one of 2.4 based on Welch's approximation works well. when both nl I=. D Next. ChiSquare Dala.3. then the twosample t tests with critical region 1 {S'n > t n . the t n _2(O. Equal Variances 0. In this case Monte Carlo studies have shown that the test in Section 4. }.3. and Yi .h(JL)] ". af = a~. as we see from the limiting law of Sn and Figure 5.Section 5_3 First. _ y'ii:[h(X) . scaled to have the same means.) i O.5 3 log10 sample size Figure 5.ll)(!. in this case.and HigherOrder Asymptotics: The Delta Method with Applications 315 1 This is because. even when the X's and Y's have different X~ distributions.0' H<I o 0.000 twosample t tests. Y .3.2.
. The data in the first sample are N(O. £l __ __ 0.J~ f). 0.{x)() twosample t tests. For each simulation the two samples differ in size: The second sample is two times the size of the first..6. and beta. we see that here. From (5. as indicated in the plot.316 Asymptotic Approximations Chapter 5 Two Sample.6) and .3.5 Log10 (smaller sample size) l Figure 5. Unequal Variances.3. We have seen that smooth transformations heX) are also approximately normally distributed..10000 Simulations: Gaussian Data. " 0. Combining Theorem 5. gamma.' OIlS .4. Poisson.~'ri i _ . The xaxis denotes the size of the smaller of the two samples.3. '! 1 . It turns out to be useful to know transformations h.c:~c:_~___:J 0. and 9. 1) and in the second they are N(Ola 2) where a 2 takes on the values 1. 1) so that ZI_Q is the asymptotic critical value. such as the binomial. I: . Tn .12 .". if H is true .02""~".5 1 1. too... Variance Stabilizing Transfonnations Example 5. N(O.0.3. Each plotted point represents the results of IO.3. In Appendices A and B we encounter several important families of distributions.3. o. then the sample mean X will be approximately normally distributed with variance 0"2 In depending on the parameters indexing the family considered. +2 + 25 J O'. +___ 9.£.. ..oj:::  6K . such that Var heX) is approximately independent of the parameters indexing the family we are considering. 2nd sample 2x bigger 0..I _ __ 0 I +__ I :::. which are indexed by one or more parameters. If we take a sample from a member of one of these families. called variance stabilizing.3 and Slutsky's theorem. .
this leads to h(l)(A) = VC/J>. which has as its solution h(A) = 2. c) (5. h must satisfy the differential equation [h(ll(A)j2A = C > 0 for some arbitrary c > O. In this case a'2 = A and Var(X) = A/n. As an example. Suppose.3. 1976.)] 2/ n .d. by their definition. ..16.Section 5. .. . 0 One application of variance stabilizing transformations. A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models.3. 538) one can improve on .5.. h(t) = Ii is a variance stabilizing transformation of X for the Poisson family of distributions. suppose that Xl. To have Varh(X) approximately constant in A.. If we require that h is increasing. sample.13) we see that a first approximation to the variance of h( X) is a' [h(l) (/. Substituting in (5.6. .19) for all '/. Under general conditions (Bhattacharya and Rao. . 1/4) distribution. Also closely related but different are socaBed normalizing transformations.. is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. where d is arbitrary. Thus. a variance stabilizing transformation h is such that Vi'(h('Y) .3 First.1 0 ) Vi' ' is an approximate 1.' . in the preceding P( A) case. Thus. In this case (5.6) we find Var(X)' ~ 1/4n and Vi'((X) . X n) is an estimate of a real parameter! indexing a family of distributions from which Xl.Ct confidence interval for J>.i.. p. See Example 5. See Problems 5. Suppose further that Then again.3.3.19) is an ordinary differential equation. I X n is a sample from a P(A) family. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II.3..3. Such a function can usually be found if (J depends only on fJ.hb)) + N(o. Thus. A > 0.and HigherOrder Asymptotics: The Delta Method with Applications 317 (5. 1n (X I. The notion of such transformations can be extended to the following situation.)C. yX± r5 2z(1 .3.. finding a variance stabilizing transfonnation is equivalent to finding a function h such that for all Jl and (J appropriate to our family.Xn are an i. Some further examples of variance stabilizing transformations are given in the problems.(A)') has approximately aN(O.. Major examples are the generalized linear models of Section 6.\ + d. which varies freely.15 and 5. Edgeworth Approximations The normal approximation to the distribution of X utilizes only the first two moments of X.
Edgeworth Approximations

The normal approximation to the distribution of X̄ utilizes only the first two moments of X. Under general conditions (Bhattacharya and Rao, 1976, p. 538) one can improve on the normal approximation by utilizing the third and fourth moments. Let F_n denote the distribution of T_n = √n(X̄ − μ)/σ and let γ_{1n} and γ_{2n} denote the coefficient of skewness and kurtosis of T_n. Then under some conditions(1)

F_n(x) = Φ(x) − φ(x)[(γ_{1n}/6)H_2(x) + (γ_{2n}/24)H_3(x) + (γ_{1n}²/72)H_5(x)] + r_n    (5.3.20)

where r_n tends to zero at a rate faster than 1/n and H_2, H_3, and H_5 are Hermite polynomials defined by

H_2(x) = x² − 1,  H_3(x) = x³ − 3x,  H_5(x) = x⁵ − 10x³ + 15x.

The expansion (5.3.20) is called the Edgeworth expansion for F_n.

Example 5.3.6. Edgeworth Approximations to the χ² Distribution. Suppose V ~ χ²_n. According to Theorem B.3.1, V has the same distribution as Σ_{i=1}^n X_i², where the X_i are independent and X_i ~ N(0, 1), i = 1, ..., n. It follows from the central limit theorem that T_n = (Σ_{i=1}^n X_i² − n)/√(2n) = (V − n)/√(2n) has approximately a N(0, 1) distribution. To improve on this approximation, we need only compute γ_{1n} and γ_{2n}. Using the moments of V we find

γ_{1n} = E(V − n)³/(2n)^{3/2} = 2√2/√n,  γ_{2n} = E(V − n)⁴/(2n)² − 3 = 12/n.

Therefore,

F_n(x) ≈ Φ(x) − φ(x)[(√2/3√n)(x² − 1) + (1/2n)(x³ − 3x) + (1/9n)(x⁵ − 10x³ + 15x)].    (5.3.21)

Table 5.3.1 gives this approximation together with the exact distribution and the normal approximation when n = 10.

Table 5.3.1. Edgeworth(2) and normal approximations EA and NA to the χ²_{10} distribution, P(T_n ≤ x), compared with the exact values over a grid of x. (Numerical entries omitted.)
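A short computational sketch of (5.3.21), assuming SciPy is available; for n = 10 it reproduces the kind of comparison summarized in Table 5.3.1, and other n can be substituted.

```python
import numpy as np
from scipy.stats import chi2, norm

def edgeworth_chi2_cdf(x, n):
    """Two-term Edgeworth approximation (5.3.21) to P(T_n <= x), T_n = (V - n)/sqrt(2n), V ~ chi2_n."""
    g1 = 2 * np.sqrt(2.0 / n)          # skewness of T_n
    g2 = 12.0 / n                      # excess kurtosis of T_n
    H2, H3, H5 = x**2 - 1, x**3 - 3*x, x**5 - 10*x**3 + 15*x
    return norm.cdf(x) - norm.pdf(x) * (g1/6*H2 + g2/24*H3 + g1**2/72*H5)

n = 10
for x in [-1.5, -0.5, 0.5, 1.5, 2.5]:
    exact = chi2.cdf(n + x*np.sqrt(2*n), df=n)   # P(T_n <= x) = P(V <= n + x*sqrt(2n))
    print(f"x={x:5.2f}  exact={exact:.4f}  Edgeworth={edgeworth_chi2_cdf(x, n):.4f}  normal={norm.cdf(x):.4f}")
```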
(ii) g.l = Cov(XkyJ. vn(i7f . a~ = Var Y.I.3.0.2 T20 . Using the central limit and Slutsky's theorems.2. _p2.22) 2 g(l)(u) = (2U.d.1) and vn(i7~ .i.9.and HigherOrder Asymptotics: The Delta Method with Applications 319 The Multivariate Case Lemma 5.0.9) that vn(C . (i) un(U n .2rJ >". aiD. where g(UI > U2.3. We can write r 2 = g(C._p2).2.6.3.2 >'1.ij~ where ar Recall from Section 4. The proof follows from the arguments of the proof of Lemma 5. UillL2U~) = (2p. It follows from Lemma 5.3 and (B. 4f. U3) = Ui/U2U3. Let central limit theorem T'f.2.3./U2U3.ar.m.1.6) that vn(r 2 N(O. Yn ) be i. 1..2 >'2.J12) / a2 to conclude that without use the transformations Xi j loss of generality we may assume J11 = 1k2 = 0. then by the 2 vn(U .2.3.0 .5 that in the bivariate normal case the sample correlation coefficient r is the MLE of the population correlation coefficient p and that the likelihood ratio test of H : p = is based on Irl.0.2. . .Section 5. = ai = 1.n1EY/) ° ~ ai ~ and u = (p. and Yj = (Y . Example 5. Let p2 = Cov 2(X.0.n lEX.u) ~~ V dx1 forsome d xl vector ofconstants u.). Y) where 0 < EX 4 < 00. Then Proof. as (X..2. Suppose {Un} are ddimensional random vectors and that for some sequence of constants {an} with an !' (Xl as n Jo (Xl.. = Var(Xkyj) and Ak. >'1. we can show (Problem 5. E Next we compute = T11 .1) jointly have the same asymptotic distribution as vn(U n .= (Xi .0.0 >'1. with  r?) is asymptotically nonnal.2 + p'A2.3 First. . we can .a~) : n 3 !' R. (X n . and let r 2 = C2 /(j3<.1. 0< Ey4 < 00.J 2 2 + 4 2 4 2 Tn P 120 + P T 02 +2{ _2 p3Al. Y)/(T~ai where = Var X. xmyl).1. E). >'2.3.3.0 2 (5.o}. R d !' RP has a differential g~~d(U) at u.0.2 extends to the dvariate case.0 TO .p).3.0 >'1. . 1). Because of the location and scale invariance of p and r.1.u) where Un = (n1EXiYi.1. Let (X" Y.u) ~ N(O. Lemma 5. Ui/U~U3.J11)/a.2.j.5. P = E(XY)..
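The asymptotic normality of the sample correlation coefficient developed in this example underlies Fisher's z confidence interval discussed in the text that follows. The sketch below is not from the text; NumPy/SciPy, the bivariate normal simulation settings, and the classical √(n − 3) scaling are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_interval(r, n, alpha=0.05):
    """Approximate 1 - alpha confidence interval for rho via the z-transformation h(r) = arctanh(r)."""
    z = np.arctanh(r)                               # 0.5 * log((1 + r)/(1 - r))
    half = norm.ppf(1 - alpha/2) / np.sqrt(n - 3)   # classical finite-sample scaling
    return np.tanh(z - half), np.tanh(z + half)

rng = np.random.default_rng(1)
rho, n = 0.6, 50
x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
r = np.corrcoef(x, y)[0, 1]
print("r =", round(r, 3), " 95% CI for rho:", fisher_z_interval(r, n))
```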
.h(p)J !:.h= (h i .3. E) = .rn) N(o. .fii(Y . Suppose Y 1.5 (a) ! .fii(r . and (Prohlem 5. (1 _ p2)2).u 2.. p=tanh{h(r)±z (1. . N(O. c E (1. Theorem 5. .p) !:.UI. Var Y i = E and h : 0 ~ RP where 0 is an open subset ofRd.~a)/vn3} where tanh is the hyperbolic tangent. . I Y n are independent identically distributed d vectors with ElY Ii 2 < 00.8. which is called Fisher's z. o Here is an extension of Theorem 5.9) Refening to (5.3.. 1938) that £(vn . per < c) '" <I>(vn .Y) ~ N(/li. (5.1'2.3[h(c) .1) is achieved hy choosing h(P)=!log(1+ P ).3.P The approximation based on this transfonnation. N(O. Argue as before using B. . and it provides the approximate 100(1 .p).24) I Proof.4. Asymptotic Approximations Chapter 5 = 4p2(1_ p2)'.10) that in the bivariate nonnal case a variance stahilizing transformation h(r) with . .h(p)]). . .l) distribution.3. ! f This expression provides approximations to the critical value of tests of H : p = O.3.h(p)) is closely approximated by the NCO. has been studied extensively and it has been shown (e. .hp )andhhasatotaldifferentialh(1)(rn) f • . 2 1. .: ! + h(i)(rn)(y . that is.23) • .a)% confidence interval of fixed length. Then ~.mil £ ~ hey) = h(rn) and (b) so that . ': I = Ilgki(rn)11 pxd.rn) + o(ly .19). we see (Problem 5.3(h(r) .3..3.320 When (X.3. EY 1 = rn. j f· ~. David. it gives approximations to the power of these tests. .fii[h(r) .g.1). then u5 . (5.
25) has an :Fk. Then.3. gives the quantiles of the distribution of V/k.1 in which the density of Vlk.m distribution can be approximated by the distribution of Vlk.1. if k = 5 and m = 60. . Then according to Corollary 8. 00.0 distribution and 2.3. . i=k+1 Using the (b) part of Slutsky's theorem.Section 5.. L::+.m. But E(Z') = Var(Z) = 1. 1 Xl has a x% distribution.= density.~1. the :Fk.m distribution. I). xi and Normal Approximation to the Distribution of F Statistics.i..3 First.9) to find an approximation to the distribution of Tk.7) implies that as m t 00. we can write.k.:: . Thus. EX.. we conclude that for fixed k. The case m = )'k for some ). 14.m  (11k) (11m) L. the:F statistic T _ k. only that they be i.l) distribution. > 0 and EXt < 00. We write T k for Tk. For instance. > 0 is left to the problems.' = Var(X.d.. D Example 5.. This row.211 = 0.i=k+l Xi ". we can use Slutsky's theorem (A. check the entries of Table IV against the last row.m' To show this. Suppose that n > 60 so that Table IV cannot be used for the distribution ofTk.and HigherOrder Asymptotics: The Delta Method with Applications 321 (c) jn(h(Y) . Next we turn to the normal approximation to the distribution of Tk. Suppose that Xl.26) l. We do not require the Xi to be normal. then P[T5 . When k is fixed and m (or equivalently n = k + m) is large.21 for the distribution of Vlk.x%.m. we first note that (11m) Xl is the average ofm independent xi random variables. as m t 00. (5. By Theorem B.1.h(m)) = y'nh(l)(m)(Y  m) + op(I).15.Xn is a sample from a N(O.k+m L::l Xl 2 (5. with EX l = 0. when the number of degrees of freedom in the denominator is large.3.3. where k + m = n.05 and the respective 0.3. where Z ~ N(O. where V . See also Figure B.3. Now the weak law of large numbers (A. the mean ofaxi variable is E(Z2).0 < 2. which is labeled m = 00.7.).37] = P[(vlk) < 2.. By Theorem B.1.:l ~ m k+m L X. is given as the :FIO .3. when k = 10.37 for the :F5 . To get an idea of the accuracy of this approximation.05 quantiles are 2. Suppose for simplicity that k = m and k . if .
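The approximations to F critical values discussed in this passage, the χ²_k/k limit when the denominator degrees of freedom m are large and, for normal data, a normal approximation when both k and m are large, can be checked directly against exact quantiles. The sketch below is not from the text; SciPy and the illustrative (k, m) pairs are assumptions.

```python
import numpy as np
from scipy.stats import f, chi2, norm

def f_crit_approx(k, m, alpha=0.05):
    """Two approximations to the upper F_{k,m} critical value (normal data)."""
    large_m = chi2.ppf(1 - alpha, df=k) / k                              # F_{k,m} ~ chi2_k / k as m -> infinity
    large_km = 1 + np.sqrt(2 * (k + m) / (k * m)) * norm.ppf(1 - alpha)  # both k and m large
    return large_m, large_km

for k, m in [(10, 1000), (50, 100), (100, 100)]:
    exact = f.ppf(0.95, k, m)
    a1, a2 = f_crit_approx(k, m)
    print(f"k={k:4d} m={m:4d}  exact={exact:.3f}  chi2_k/k={a1:.3f}  normal approx={a2:.3f}")
```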
:. .' !k . when rnin{k.3..7). !1 . v) ~ ~.3._o l or ~.3. Suppose P is a canonical exponential family of rank d generated by T with [open.a) . which by (5. = E(X. if it exists and equal to c (some fixed value) otherwise. l (: An interesting and important point (noted by Box.28) satisfies Zto '" 1 I I:. v'n(Tk . 1)T. ! i l _ .3. h(i)(u.3. ..m} t 00. Then if Xl. We conclude that xl jer}. I)T and h(u. P[Tk.3 Asymptotic Normality of the Maximum likelihood Estimate in Exponential Families Our final application of the 8method follows.3.m • < tl P[) :'. a').'.m(l. When it can be eSlimated by the method of moments (Problem 5.k(T>. 1'. the F test for equality of variances (Problem 5. 2) distribution. .8(c» one has to use the critical value .8(d)).o) k) K (5.m 1) can be ap proximated by a N(O. .3.m = (1+ jK(:: z. the distribution of'.3.5. where J is the 2 x 2 v'n(Tk 1) ""N(0. i = 1. the upper h.27) where 1 = (1. In general (Problem 53. when Xi ~ N(o. and a~ = Var(X. By Theorem 5. E(Y i ) ~ (1.a).3.»)T and ~ = Var(Yll)J.29) is unknown.1)/12 j 2(m + k) mk Z.). 322 Asymptotic Approximations Chapter 5 where Yi1 = and Yi2 = Xf+i/a2. Equivalently Tk = hey) where Y i ~ (li" li. X n are a sample from PTJ E P and ij is defined as the MLE 1 . Thus (Problem 5. (5.. a'). = (~.m critical value fk.a) '" 1 + is asymptotically incorrect. ~ N (0.m(1. ~ Ck.)/ad 2 . j m"': k (/k. 1 5. I' Theorem 5. i.2Var(Yll))' In particular if X.k(tl)] '" 'fI() :'. • • 1)/12) .8(a» does not have robustness of level.k (t 1 (5.).3. _.4. ifVar(Xf) of 2a'. k.I.28) . 0 ! j i · 1 where K = Var[(X. 1'.. v) identity. Specifically. 4). 1953) is that unlike the t test. i.k(Tk.)T. . as k ~ 00. :f. In general.m 1) <) :'.1) "" N(O.m(1 .
32) Hence. = (5..3.In(ij .2 that.3. where. The result is a consequence.) .38) on the variance matrix of y'n(ij . the asymptotic variance matrix /1(11) of . = 1/2. Thus. Example 5. (I" +0')] .3.4 with AI and m with A(T/). and 1).4.3. by Example 2.I(T/) (5.30) But D A . by Corollary 1.11) eqnals the lower bound (3.A 1(71))· Proof.6.. thus. = A by definition and. Then T 1 = X and 1 T2 = n. Ih our case.3.2. . I'.4. = 1/20'. (i) follows from (5. (ii) follows from (5.24). are sufficient statistics in the canonical model.). (5.N(o.3.Nd(O. T/2 By Theorem 5. For (ii) simply note that.O.3. and. (5.l4.2. therefore. Now Vii11.EX.1.and HigherOrder Asymptotics: The Delta Method with Applications 323 (i) ii = 71 + ~ I:71 A •• • I (T/)(T(X.1.3. Let Xl>' .23). We showed in the proof ofTheorem 5.3 First. Recall that A(T/) ~ VarT/(T) = 1(71) is the Fisher information.i.4.3.'l 1 (ii) LT/(Vii(iiT/)).".8.d. 0').3. of Theorems 5.3.5. in our case.71) for any nnbiased estimator ij.33) 71' 711 Here 711 = 1'/0'.2 where 0" = T. PT/[ii = AI(T)I ~ L Identify h in Theorem 5. if T ~ I:7 I T(X.A(T/)) + oPT/ (n .8. iiI = X/O".2. This is an "asymptotic efficiency" property of the MLE we return to in Section 6. o Remark 5.2 and 5.ift = A(T/).31) .  (TIl' .Section 5. PT/[T E A(E)] ~ 1 and. X n be i. Thus. Note that by B. hence. as X witb X ~ Nil'..3.T.1.
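A simulation sketch of the asymptotic normality of the MLE in a canonical exponential family (not from the text; the Poisson example and the NumPy calls are illustrative assumptions): writing the Poisson family in canonical form, η = log λ and A(η) = e^η, the MLE is η̂ = log X̄ and √n(η̂ − η) should have variance close to A''(η)⁻¹ = 1/λ.

```python
import numpy as np

# Illustration: canonical Poisson family, eta = log(lambda), A(eta) = exp(eta), I(eta) = A''(eta) = lambda.
rng = np.random.default_rng(2)
lam, n, reps = 4.0, 200, 10000
eta = np.log(lam)
xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
eta_hat = np.log(xbar)                 # MLE of the canonical parameter
z = np.sqrt(n) * (eta_hat - eta)
print("empirical Var of sqrt(n)(eta_hat - eta):", round(z.var(), 4),
      "  information bound 1/lambda:", 1 / lam)
```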
L. as Y. variables are explained in tenns of similar stochastic approximations to h(Y) . we can use (5. J J . the (k+ I)dimensional simplex (see Example 1.') N(O. and h is smooth.g.s where Eo = diag(a 2 )2a 4 ).: CPo. and so on. < 1.1 Estimation: The Multinomial Case .. and PES.. .i. 0).)'. and confidence bounds.". Following Fisher (1958)'p) we develop the theory first for the case that X""" X n are i.4. when we are dealing with onedimensional smooth parametric models. 8 open C R (e. . Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean.324 Asymptotic Approximations Chapter 5 Because X = T. EY = J. 0 . I . ..1) i . Thus. Nk) where N j . testing. . which lead to a result on the asymptotic nonnality of the MLE in multiparameter exponential families.i.4 and Problem 2. .1 by studying approximations to moments and central moments of estimates. . The moment and in law approximations lead to the definition of variance stabilizing transfonnations for classical onedimensional exponential families.3.d. the difference between a consistent estimate and the parameter it estimates.d.P(Xk' 0)) : 0 E 8}.3.3. and il' = T.L~ I l(Xi = Xj) is sufficient.h(JL) where Y 1. We focus first on estimation of O. taking values {xo.4. Xk} only so that P is defined by p . .4 to find (Problem 5.15). .26) vn(X /". N = (No. These "8 method" approximations based on Taylor's fonnula and elementary results about moments of means of Ll...7). . 0).33) and Theorem 5..Pk) where (5. 0. I I I· < P. is twice differentiable for 0 < j < k.d. .1. Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. Higherorder approximations to distributions (Edgeworth series) are discussed briefly. Consistency is Othorder asymptotics.(T. under Li.. Secondorder asyrnptotics provides approximations to the difference between the error and its firstorder approximation. 5. These stochastic approximations lead to Gaussian approximations to the laws of important statistics. Firstorder asymptotics provides approximations to the difference between a quantity tending to a limit and the limit. 5.d. Summary. Assume A : 0 ~ pix.3. sampling. . i 7 •• ..6.. Finally. I i I . see Example 2. Eo) . for instance. . In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families. stochastic approximations in the case of vector statistics and parameters are developed. il' . .4 ASYMPTOTIC THEORY IN ONE DIMENSION I: I " ! . i In this section we define and study asymptotic optimality for estimation..1. . We begin in Section 5. . .. We consider onedimensional parametric submodels of S defined by P = {(p(xo. . Y n are i.
Jor all 8.l (Xl.0).p(Xk..:) (see (2. > rl(O) M a(p(8)) P. (0) 80(Xj. fJl E.1.4.4.4.4.8)1(X I =Xj) )=00 (5. 88 (Xl> 8) and =0 (5. .8) ~ I)ogp(Xj. bounded random variable (5.1.4.Section 5.2(0.8) logp(X I .7) where . Then we have the following theorem.8) .. 8))T.8).4. Assume H : h is differentiable. . if A also holds.4. Theorem 5.3) Furthermore (See'ion 3. Next suppose we are given a plugin estimator h (r. h) with eqUality if and only if. < k.. Under H. Many such h exist if k 2.4. pC') =I 1 m . h) is given by (5.9) .4. (5. Consider Example where p(8) ~ (p(xo.4. (5.2 (8.4.6) 1.11). Moreover.2). 8) is similarly bounded and well defined with (5.11)) of (J where h: S satisfies ~ R h(p(8» = 8 for all 8 E e > (5.2) is twice differentiable and g~ (X I) 8) is a well~defined.1.4 Asymptotic Theory in One Dimension 325 Note that A implies that k [(X I . for instance.4) g. 0 <J (5.5) As usual we call 1(8) the Fisher infonnation..4.
16) o . &8(X. 8).')  h(p(8)) } asymptotically normal with mean 0.4.3./: .. (8. I (x j. using the correlation inequality (A. ( = 0 '(8.10). h(p(8)) ) ~ vn Note that.12) [t. Apply Theorem 5. h) = I . 0)] p(Xj. I h 2 (5.8)) = a(8) &8 (X" 8) +b(O) &h Pl (5.p(xj. h k &h &pj(p(8)) (N p(xj.2 noting that N) vn (h (.I n.. (5.13).8) ) ~ .9). we obtain &l 1 <0 . ~ir common variance is a'(8)I(8) = 0' (0.II. we s~~ iliat equality in (5.8) ) +op(I).8)). Taking expectations we get b(8) = O. 0) = • &h fJp (5.4. kh &h &Pj (p(8))(I(Xi = Xj) .l6) as in the proof of the information inequality (3.8) j=O 'PJ Note that by differentiating (5. h).4. using the definition of N j .4.15) i • .6).11) • (&h (p(O)) ) ' p(xj. Noting that the covariance of the right' anp lefthand sides is a(8). whil. for some a(8) i' 0 and some b(8) with prob~bility 1.4. by noting ~(Xj. which implies (5.4..4.p(x). by (5. (5.8) = 1 j=o PJ or equivalently.8) gives a'(8)I'(8) = 1.p(xj..4. By (5. but also its asymptotic variance is 0'(8.4.13) I I · .4.326 Asymptotic Approximations Chapter 5 Proof.h)Var. • • "'i. not only is vn{h (.4.14) .10) Thus.8)&Pj 2: • &h a:(p(0))p(x.12). : h • &h &Pj (p(8)) (N .4.h)I8) (5. (5.4.8) with equality iff. 2: • j=O &l a:(p(8))(I(XI = Xj) .. we obtain 2: a:(p(8)) &0(xj.
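A simulation sketch of the information bound of this subsection and its attainment (the Hardy-Weinberg submodel, NumPy, and the chosen θ and n are assumptions used only for illustration): with p(θ) = (θ², 2θ(1 − θ), (1 − θ)²), the plug-in estimate h(N/n) = (2N₀ + N₁)/2n is the MLE, I(θ) = 2/[θ(1 − θ)], and n Var(θ̂) should approach 1/I(θ).

```python
import numpy as np

# Hardy-Weinberg submodel of the trinomial: p(theta) = (theta^2, 2*theta*(1-theta), (1-theta)^2).
rng = np.random.default_rng(3)
theta, n, reps = 0.3, 500, 20000
p = [theta**2, 2*theta*(1 - theta), (1 - theta)**2]
N = rng.multinomial(n, p, size=reps)
theta_hat = (2*N[:, 0] + N[:, 1]) / (2*n)          # MLE h(N/n)
print("n * Var(theta_hat):", round(n * theta_hat.var(), 5),
      "  1/I(theta) = theta(1-theta)/2:", theta*(1 - theta)/2)
```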
.4. Xk}) and rem 5. .4.4.4.I L::i~1 T(Xi) = " k T(xJ )". by ( 2. if it exists and under regularity conditions. p) and HardyWeinberg models can both be put into this framework with canonical parameters such as B = log ( G) in the first case. Example 5..xd).d.4.Xn are tentatively modeled to be distributed according to Po. . 0 Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena.8) is.1.. the MLE of B where It is defined implicitly ~ by: h(p) is the value of O. Xl.Oo) = E."j~O (5. . is a canonical oneparameter exponential family (supported on {xo.(vn(B . 0 E e.).::7 We shall see in Section 5.4.Section 5A Asymptotic Theory in One Dim"e::n::'::io::n ~_ __'3::2:.. Suppose i.3.0 (5.3) L..3 that the information bound (5.3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check. Let p: X x ~ R where e D(O. then. . E open C R and corresponding density/frequency functions p('. 0).2 Asymptotic Normality of Minimum Contrast and MEstimates o e o We begin with an asymptotic normality theorem for minimum contrast estimates. As in Theorem 5.3.. Note that because n N T = n .4. which (i) maximizes L::7~o Nj logp(xj..17) L. achieved by if = It (r...o(p(X"O)  p(X"lJo )) .2. Write P = {P. OneParameter Discrete Exponential Families. Suppose p(x. J(~)) with the asymptotic variance achieving the infonnation bound Jl(B)..i.A(O)}h(x) where h(x) = l(x E {xo.18) and k h(p) = [A]I LT(xj)pj )".. . 5. 0) and (ii) solvesL::7~ONJgi(Xj. : E Ell.0)) ~N (0. In the next two subsections we consider some more general situations.0)=0. Then Theo(5.8) = exp{OT(x) .5 applies to the MLE 0 and ~ e is open.19) The binomial (n.
p = Then Uis well defined. denote the distribution of Xi_ This is because.23) I. . ~i  i . PEP and O(P) is the unique solution of(5.328 Asymptotic Approximations Chapter 5 . Suppose AI: The parameter O(P) given hy the solution of .O(P)) +op(n.~(Xi. rather than Pe. . 8.4.t) . Under AOA5.4.O).=1 n ! (5.2. parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for. We need only that O(P) is a parameter as defined in Section 1. . i' On where = O(P) +. That is. O(Pe ) = O. O(P). O(P» / ( Ep ~~ (XI.. n i=l _ 1 n Suppose AO: .p(x.!:.21) i J A2: Ep. < co for all PEP.. O)dP(x) = 0 (5. O(P))) . That is. (5. o if En ~ O. .p(x.1/ 2) n '.4. as pointed out later in Remark 5.0(P)) l. Theorem 5. On is consistent on P = {Pe : 0 E e}.. ~I AS: On £. .1. f 1 .LP(Xi. 0 E e. A4: sup. j.O(p)))I: It .pix. ~ (X" 0) has a finite expectation and i ! ..4.O(P)I < En} £. { ~ L~ I (~(Xi. O)ldP(x) < co.' '.4.1. 0) is differentiable.4. J is well defined on P.p(x.p(.p(X" On) n = O.p E p 80 (X" O(P») A3: . Let On be the minimum contrast estimate On ~ argmin .L . P) = . under regularity conditions the properties developed in this section are valid for P ~ {Pe : 0 E e}.4.p(Xi.22) J i I .3. hence. 1 n i=l _ . • is uniquely minimized at Bo.20) In what follows we let p. As we saw in Section 2.p2(X. # o. (5.21) and. L.
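A sketch of the "sandwich" asymptotic variance σ²(ψ, P) = E_P ψ²(X₁, θ(P))/[E_P (∂ψ/∂θ)(X₁, θ(P))]² appearing in this theorem, for an M-estimate of location. Huber's ψ, the t(3) data, and the SciPy root-finder are all choices made here for illustration, not part of the text.

```python
import numpy as np
from scipy.optimize import brentq

def huber_psi(x, k=1.345):
    return np.clip(x, -k, k)

def m_estimate(x, k=1.345):
    # Solve the estimating equation sum_i psi(X_i - theta) = 0 for theta.
    return brentq(lambda t: huber_psi(x - t, k).sum(), x.min(), x.max())

rng = np.random.default_rng(4)
n, reps, k = 200, 3000, 1.345
est = np.array([m_estimate(rng.standard_t(df=3, size=n), k) for _ in range(reps)])
print("Monte Carlo n*Var(theta_hat):", round(n * est.var(), 4))

x = rng.standard_t(df=3, size=200000)   # plug-in sandwich: E psi^2 / (E d(psi)/d(theta))^2; here (E d/d(theta))^2 = P(|X|<k)^2
sigma2 = np.mean(huber_psi(x, k)**2) / np.mean(np.abs(x) < k)**2
print("sandwich formula sigma^2(psi, P):", round(sigma2, 4))
```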
21).25) where 18~  8(P)1 < IiJn .O(P)) Ep iJO (Xl.24) proof Claim (5.4. By expanding n.4.20) and (5. Next we show that (5. . applied to (5._ . we obtain. A2.. (iJ1jJ ) In (On .4.1j2 ).26) and A3 and the WLLN to conclude that (5. t=1 1 n (5.O(P)) = n.' " iJ (Xi .27) we get. Let On = O(P) where P denotes the empirical probability.Section 5.4.p) < 00 by AI.24) follows from the central limit theorem and Slutsky's theorem. 1 1jJ(Xi ..' " 1jJ(Xi .4.On)(8n . n.4.L 1jJ(X" 8(P)) = Op(n.O(P))) p (5.22) because .4.4.O(P)) 2' (E '!W(X. O(P)).p) = E p 1jJ2(Xl .lj2 and L 1jJ(X" P) + op(l) i=1 n EI'1jJ(X 1.Ti(en .4.O(P)) ~ . 8(P)) + op(l) = n ~ 1jJ(Xi .O(P)I· Apply AS and A4 to conclude that (5.29) .4 Asymptotic Theory in One Dimension 329 Hence.4. (5.O(P)) / ( El' ~t (X" O(P))) o while E p 1jJ2(X"p) = (T2(1jJ. and A3. where (T2(1jJ. But by the central limit theorem and AI.4.l   2::7 1 iJ1jJ.27) Combining (5..425)(5. en) around 8(P).28) . using (5.22) follows by a Taylor expansion of the equations (5.8(P)) I n n n~ t=1 n~ t=1 e (5.4.4.20).
12). 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5. Solutions to (5. 0) is replaced by Covo(?jJ(X" 0). Our arguments apply even if Xl. We conclude by stating some sufficient conditions.4.O)?jJ(X I . {xo 1 ••• 1 Xk}.2. .21).4. O(P)) + op (I k n n ?jJ(Xi .O(P) Ik = n n ?jJ(Xi .. O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I .! . n I _ .30) suggests that ifP is regular the conclusion of Theorem 5. for A4 and A6.20) are called M~estimates.4. A6: Suppose P = Po so that O(P) = 0.O). A4 is found.4.4.29). 0) l' P = {Po: or more generally.4. that 1/J = ~ for some p).4. This extension will be pursued in Volume2.4.4. 0 • . AI.4.1.Eo (XI.31) for all O.4. X n are i.8).. it is easy to see that A6 is the same as (3. Z~ (Xl.4. 0) = logp(x. 0) Covo (:~(Xt. O)p(x. If further Xl takes on a finite set of values. Remark 5. en j• . essentially due to Cramer (1946). Our arguments apply to Mestimates. and that the model P is regular and letl(x. I o Remark 5.e. written as J i J 2::.2 may hold even if ?jJ is not differentiable provided that .?jJ(XI'O)). =0 (5. we see that A6 corresponds to (5. Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i.1. An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po. I 'iiIf 1 I • I I .4.30) is formally obtained by differentiating the equation (5.4.d..22) follows from the foregoing and (5. and A3 are readily checkable whereas we have given conditions for AS in Section 5. B) where p('. 0) = 6 (x) 0. for Mestimates.30) Note that (5. 1 •• I ! . P but P E a}. B» and a suitable replacement for A3. This is in fact truesee Problem 5. Identity (5.4. O(P)) ) and (5.=0 ?jJ(x. .'. and we define h(p) = 6(xj )Pj.4. • Theorem 5. Suppose lis differen· liable and assume that ! • &1 Eo &O(X I . A2.2.28) we tinally obtain On .(x) (5. Remark 5. .O) = O.: 1. If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x. (2) O(P) solves Ep?jJ(XI. O)dp. e) is as usual a density or frequency function.4.2 is valid with O(P) as in (I) or (2).3.i. Conditions AO.
We can now state the basic result on asymptotic normality and e~ciency of the MLE. where EeM(Xl.01 < J( 0) and J:+: JI'Ii:' (x..4. 0) < A6': ~t (x. 0) g~ (x.p(x.1).4.Section 5.4) but A4' and A6' are not necessary (Problem 5. 5.4." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems.4. That is. then the MLE On (5.O) = l(x. In this case is the MLE and we obtain an identity of Fisher's. < M(Xl.35) with equality iff.4. 0') . 0') is defined for all x. = en en = &21 Ee e02 (Xl.:d in Section 3.0'1 < J(O)} 00. s) diL(X )ds < 00 for some J ~ J(O) > 0. B).20) occurs when p(x.p. AOA6.O) and P ~ ~ Pe. 0) and . then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (. 0) obeys AOA6. satisfies If AOA6 apply to p(x. 0) ..4. 0) (5. We also indicate by example in the problems that some conditions are needed (Problem 5. Theorem 5.O) = l(x.33) so that (5.34) Furthermore. 10' .4.8) is a continuous function of8 for all x.4.p = a(0) g~ for some a 01 o.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5.eo (Xl.4. where iL(X) is the dominating measure for P(x) defined in (A.4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI.O) 10gp(x. w..4.3. "dP(x) = p(x)diL(x).O). .32) where 1(8) is t~e Fisher information introduq. lO. (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl. Pe) > 1(0) 1 (5.
<1>( _n '/4 . Then X is the MLE of B and it is trivial to calculate [( B) _ 1..30) and (5. B) = 0. 442.. If B = 0.4.4. 0 1 . Therefore. }. ..4.1 once we identify 'IjJ(x.• •.4.l'1 Let X" .36) Because Eo 'lj.".4.nB) . . 1.4.3 is not valid without some conditions on the esti· mates being considered... Let Z ~ N(O.4.n(Bn . . We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n. 1998. We discuss this further in Volume II. 1 I I... I.r 332 Asymptotic Approximations Chapter 5 Proof Claims (5. The optimality part of Theorem 5. .d.... PoliXI < n1/'1 . PoliXI < n 1/4 ] . 00.(X B).A'(B).i.4.4. we know that all likelihood ratio tests for simple BI e e eo.2.439) where .. . I I Example 5. By (5.39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo). 1. Then .37) .4 Testing ! I i . 1).4.. (5.4. ... N(B.1/4 X if!XI > n. 0 e en e.35) is equivalent to (5. However..36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!. j 1 " Note that Theorem 5. Consider the following competitor to X: B n I 0 if IXI < n. PolBn = Xl .2. . 5. claim (5.nB). (5.B).nBI < nl/'J <I>(n l/ ' .i . for some Bo E is known as superefficiency..4.Xn be i. The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(.34) follow directly by Theorem 5.3 generalizes Example 5. p.. 0 because nIl' . . .35). Hodges's Example.. and.'(0) = 0 < l(~)' The phenomenon (5. B) is an MLR family in T(X). cross multiplication shows that (5.4. For this estimate superefficiency implies poor behavior of at values close to 0. {X 1. and Polen = OJ . 1 . . . thus.4. 1 . see Lehmann and Casella. .4.'(B) = I = 1(~I' B I' 0.33) and (5. .1/4 I (5.4.38) Therefore.1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise.1). if B I' 0. We next compule the limiting distribution of . PIIZ + . B) with J T(x) . for higherdimensional the phenomenon becomes more disturbing and has important practical consequences.ne .
00) > z] .4. "Reject H for large values of the MLE T(X) of >.41) follows. The proof is straightforward: PoolvnI(Oo)(Bn ..:cO"=~ ~ 3=3:.m='=":c'.. 00 )" where P8 . b).l4. On the (5.h the same behavior.Section 5.. > AO.00) other hand. That is. 0 E e} is such thot the conditions of Theorem 5.4 Asymptotic Theo:c'Y:c. "Reject H if B > c(a. Let en (0:'. Suppose the model P = {Po . ~ 0.40) > Ofor all O. 00)] = PolvnI(O)(Bn  0) > VnI(O)(cn(a. a < eo < b.4. [o(vn(Bn 0)) ~N(O. If pC e) is a oneparameter exponential family in B generated by T(X).4.41) where Zlct is the 1 .4.:. < >"0 versus K : >.4. are of the form "Reject H for T( X) large" with the critical value specified by making the probability of type I error Q at eo.4. X n distributed according to Po..Oo)] PolOn > cn(a.3 versus simple 2 .(a. It seems natural in general to study the behavior of the test.45) which implies that vnI(Oo)(c. 00) .4. 00) .42) is sometimes called consistency of the test against a fixed alternative.0:c":c'=D... PoolBn > 00 + zl_a/VnI(Oo)] = POo IVnI(Oo) (Bn  00) > ZI_a] ~ a. Thus. .::. .4. 00 )]  ~ 1.. and then directly and through problems exhibit other tests wi!. as well as the likelihood ratio test for H versus K. Then c. and (5.00) > z] ~ 11>(z) by (5..(1 1>(z))1 ~ 0. Proof. ljO < 00 . 1 < z .a quantile o/the N(O. • • where A = A(e) because A is strictly increasing.r'(O)) where 1(0) (5.40).4.44) But Polya's theorem (A. Then ljO > 00 .:c".I / 2 ) (5.2 apply to '0 = g~ and On. B E (a. PolO n > c.4.d.4. Theorem 5.[Bn > c(a. e e e e.42)  ~ o.Oo)] = a and B is the MLE n n of We will use asymptotic theory to study the behavior of this test when we observe ij. (5.(a. the MLE. Xl.4.Oo) = 00 + ZIa/VnI(Oo) +0(n. Suppose (A4') holds as well as (A6) and 1(0) < oofor all O. 1) distribution. ..(a.22) guarantees that sup IPoolvn(Bn . Zla (5.0)]. this test can also be interpreted as a test of H : >..4. The test is then precisely. eo) denote the critical value of the test using the MLE en based On n observations.43) Property (5. (5.46) PolBn > cn(a. derive an optimality property. (5.
8) + 0(1) .jnI(8)(80 .k'Pn(X1. these statements can only be interpreted as valid in a small neighborhood of 80 because 'Y fixed means () + B . f. X n ) is any sequence of{possibly randomized) critical (test) functions such that . In fact. 1 ~. X n ) i. j (5.8) < z] .jnI(8)(Bn .50) i.4. Suppose the conditions afTheorem 5.80) tends to infinity.48) and (5.4.jnI(O)(Bn .jnI(8)(80 .4.4.4.80) ~ 8)J Po[.80 1 < «80 )) I J .4. Write Po[Bn > cn(O'. iflpn(X1 .jnI(80) +0(n.80 ).50)) and has asymptotically smallest probability of type I error for B < Bo.. .47) lor.4.jI(80)) ill' < O.jnI(8)(Cn(0'.4. .4. 00 if8 > 80 0 and . .4.jnI(8)(Bn .80 ) tends to zero. j .[..[.4. LetQ) = Po. If "m(8 .···.1/ 2 ))].(1 lI(z))1 : 18 .4.2 and (5. Theorem 5.4. assume sup{IP. the power of tests with asymptotic level 0' tend to 0'.48).49)) the test based on rejecting for large values of 8n is asymptotically uniformly most powerful (obey (5. Claims (5. .017".8) > .48) . 0.8 + zl_a/.jnI(8)(80 . (3) = Theorem 5. o if "m(8 .40) hold uniformly for (J in a neighborhood of (Jo.8) > .  i.49) ! i . the test based on 8n is still asymptotically MP.   j Proof.5.41).4.jnI(80) + 0(n. . 80 )  8) . That is. (5..50) can be interpreted as saying that among all tests that are asymptotically level 0' (obey (5.(80) > O.42) and (5. .1 / 2 )) . 00 if 8 < 80 .4. ~! i • Note that (5.jnI(8)(Cn(0'. . On the other hand.8 + zl_o/.J 334 Asymptotic Approximations Chapter 5 By (5. Furthennore.43) follow. In either case.50). then limnE'o+.4. < 1 lI(Zl_a ")'.51) I 1 I I .4. Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ")' "m( 8 ..jI(80)) ill' > 0 > llI(zl_a ")'. then by (5. ~ " uniformly in 'Y. (5. I. the power of the test based on 8n tends to I by (5.4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to Ct from the left at 00 and continues rising steeply to 1 to the right of 80 .4.")' ~ "m(8  80 ). 80)J 1 P. (5. then (5. I .
0 n p(Xi . € > 0 rejects for large values of z1.4.48) follows. The details are in Problem 5.5 in the future will be referred to as a Wald test.?. . is O. 1(8) = 1(80 + + fi) ~ 1(80) because our uniformity assumption implies that 0 1(0) is continuous (Problem 5. 00)]1(8n > 80 ) > kn(Oo. ~ . . To prove (5.4.o + 7.. 1.4.)] ~ L (5....8) 1. 8 + fi) 0 P'o+:.8 0 ) . 0 (5.OO)J . " Xn. X.4.OO+7n) +EnP. for Q < ~. .50) for all. [5In (X" ..4.4. i=1 P 1.' L10g i=1 p (X 8) 1. 0 The asymptotic!esults we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n.Xn) is the critical function of the LR test then.4. Thus.." + 0(1) and that It may be shown (Problem 5..4 Asymptotic Theory in One Dimension 335 If.4. .53) tends to the righthand side of (5.5.jn(I(8o) + 0(1))(80 . hand side of (5..54) establishes that the test 0Ln yields equality in (5. .Section 5. is asymptotically most powerful as well. X n .jI(Oo)) + 0(1)) and (5. k n (80 ' Q) ~ if OWn (Xl.. These are the likelihood ratio test and the score or Rao test..<P(Zl_a . The test li8n > en (Q.4. Llog·· (X 0) =dn (Q. There are two other types of test that have the same asymptotic behavior.4.52) > 0. . > dn (Q.O+. 80)] of Theorems 5. .Xn) is the critical function of the Wald test and oLn (Xl.7)..53) is Q if. P. 0) denotes the density of Xi and dn .4.8) that.4. Finally.50) and.[logPn( Xl.4.4 and 5. .j1i(8  8 0 ) is fixed..<P(Zla(1 + 0(1)) + .50) note that by the NeymanPearson lemma..54) Assertion (5. 00 + E) logPn(X l E I " . hence. t n are uniquely chosen so that the right.. for all 'Y.Xn ) = 5Wn (X" . Further Taylor expansion and probabilistic arguments of the type we have used show that the righthand side of (5. note that the Neyman Pearson LR test for H : 0 = 00 versus K : 00 + t. .80 ) < n p (Xi. It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form "Reject if L log[p(X" 8n )!p(X i=1 n i . .. Q).4.53) where p(x. if I + 0(1)) (5.4.
336
Asymptotic Approximations
Chapter 5
where PlI(X 1, ... ,Xn ,8) is the joint density of Xl, ... ,Xn . Fort: small. n fixed, this is approximately the same as rejecting for large values of a~o logPn(X 11 • • • 1 X n ) eo).
,
• ,
The preceding argument doesn't depend on the fact that Xl" .. , X n are i.i.d. with common density or frequency function p{x, 8) and the test that rejects H for large values of a~o log Pn (XI, ... ,Xn , eo) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming
"Reject H iff
t
i=I
iJ~
: ,
logp(X i ,90) > Tn (a,9 0)."
0
I I
(5.4.55)
It is easy to see (Problem 5.4.15) that
Tn(a, 90 ) = Z10 VnI(90) + o(n I/2 )
and that again if G (Xl, ... , X n ) is the critical function of the Rao test then
nn
,
•
.,
,•
Po,+' WRn (X t, ... ,Xn ) = Ow n(X t , ... ,Xn)J ~ 1, rn
(5.4.56)
(Problem 5.4.8) and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(90 ), which d' may require numerical integration, can be replaced by _n l d021n(Bn) (Problem 5.4.10).
5.4,5
Confidence Bounds
Q
We define an asymptotic Levell that
lower confidence bound (LCB) On by the requirement
(5.4.57)
r I I i ,
., '
.'
for all () and similarly define asymptotic level!  a DeBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:
(i) By using a natural pivot.
, f.
(
.,
(ii) By inverting the testing regions derived in Section 5.4.4.
,. "',
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (AO)(A6), (A4'), and I(9) finite for all it follows (Problem 5.4.9) that
e.
Co(
V n  e)) ~ N(o, 1) nI(lin)(li
Z1a/VnI(lin ).
(5.4.58)
for all () and, hence. an asymptotic level!  a lower confidence bound is given by
9~ = lin e~,
(5.4.59)
Turning tto method (ii), inversion of 8Wn gives fonnally
= inf{9: en(a, 9) > 9n }
(5.4.60)
,
=
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
337
or if we use the approximation C (0, e) ~ n
e+ zlQ/vnI(iJ), (5,4,41),
e) > en}'
~~, = inf{e , Cn(C>,

(5,4,61)
In fact neither e~I' or e~2 properly inverts the tests unless cn(Q, e) and Cn (Q, e) are increasing in The three bounds are different as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, e~l is preferable because this bound is not only approximately but genuinely level 1 Q. But computationally it is often hard to implement because cn(Q, 0) needs, in general, to be computed by simulation for a grid of values. Typically, (5.4.59) or some equivalent alternatives (Problem 5,4,10) are preferred but Can be quite inadequate (Problem 5,4,1 I), These bounds e~, O~I' e~2' are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5,4,12 and 5,4,13),
e.
e
Summary. We have defined asymptotic optimality for estimates in oneparameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of in a onedimensional subfamily of the multinomial distributions, showed that the MLE fonnally achieves this bound, and made the latter result sharp in the context of oneparameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M estimates, generalizations of the MLE, along the lines of Huber (1967), The asymptotic formulae we derived are applied to the MLE both under the mooel that led to it and tmder an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980),
e
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as n t 00 in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4, i.i.d. observations from a regular madel in which is open C R or = {e 1 , , , , , e,} finite, and e is identifiable, Most of the questions we address and answer are under the assumption that fJ = 0, an arbitrary specified value, or in frequentist tenns, that 8 is true.
e
e
Consistency The first natural question is whether the Bayes posterior distribution as n + 00 concentrates all mass more and more tightly around B. Intuitively this means that the data that are coming from Po eventually wipe out any prior belief that parameter values not close to are likely, Formalizing this statement about the posterior distribution, II(· I X It •.• , X n ), which is a functionvalued statistic, is somewhat subtle in general. But for = {O l ,' .. , Ok} it is
e
e
i
338
straightforward. Let
Asymptotic Approximations Chapter 5
I.
11(8 i XI,···, Xn)
Then we say that II(·
=PIO = 8 I Xl,···, Xn].
(5.5.1)
e, P,li"(8 I Xl, ... , Xn)  11 > ,] ~ 0 for all f. > O. There is a slightly stronger definition: rIC I XI,' .. ,Xn ) is a.S. iff for all 8 E e, 11(8 I Xl, ... , Xn) ~ 1 a.s. P,.
is consistent iff for all 8 E
General a.s. consistency is not hard to formulate:
I Xl, ... , Xn)
(5.5.2)
consistent
(5.5.3)
11(· I X), ... , Xn)
=}
OJ'} a.s. P,
(5.5.4)
where::::} denotes convergence in law and <5{O} is point mass at satisfactory result for finite.
e
e.
There is a completely
, ,
,
Theorem 5.5.1. Let 1rj  p[e = Bj ], j = 1, ... 1 k denote the prior distribution 0/8. Then II(· I Xl, ... ,Xn ) is consistent (a.s. consistent) iff 7fj > afor j = I, ... , k.
Proof. Let p(., B) denote the frequency or limit j function of X. The necessity of the condition is immediate because 1["] = 0 for some j implies that 1f(Bj I Xl, ... ,Xn ) = 0 for all Xl, .. . , X n because, by (1.2.8),
.,
,J,
11(8j
I Xl, ... ,Xn)
PIO = 8j I Xl, ... ,Xn ] 11j Ir~l p(Xi , 8il
L:.~l 11. ni~l p(Xi , 8.)
k
n .
(5.5.5)
,
,
, ,,
Intuitively, no amount of data can convince a Bayesian who has decided a priori that OJ is impossible. On the other hand, suppose all 71" j are positive. If the true is (J j or equivalently 8 = (J j, then
e
log
11(8.IXl, ... ,Xn) =n 11(8 j I X), ... ,Xn)
(11og+ L.. og P(Xi,8.)) . 11. 1{f.,1 n
7fj
n i~l
p(Xi ,8j)
,
By the weak (respectively strong) LLN, under POi'
.,i
1{f.,log p(Xi ,8.)  Lni~l
p(Xi ,8j )
+
E
OJ
(I
I
P(XI ,8.)) og p(X I ,8j )
i
I
in probability (respectively a.s.). But Eo;
(log: ~:::;))
~
< 0, by Shannon's inequality, if
Ba
• .,
=I=
Bj' Therefore,
11(8.IXI, ... ,Xn) 1 og 11(8j I X), ... , X n )
00
,
in the appropriate sense, and the theorem follows.
o
i ~,
h
_
Section 55
Asymptotic Behavior and Optimality of the Posterior Distribution
339
e
Remark 5.5.1. We have proved more than is stated. Namely. that for each I XI, . .. ,Xn ]  a exponentially.
e E e. Po[O =l0
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution
Under conditions AOA6 for p(x, B) that if B is the MLE,

= lex, Bj
=logp(x, B), we showed in Section 5.4
(5.5.6)
Ca(y'n(e  B» ~ N(O,rl(B)).
Consider C( ..;ii((}  B) I Xl, ... , X n ), the posterior probability distribution of y'n((} B( Xl, ... , X n )), where we emphasize that (j depends only on the data and is a constant given XI, ... , X n . For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in Po probability by convergence a.s. p•. We also add,



A7: For all (), and all 0> o there exists t(o,(})
p. [sup
> 0 such that
{~ t[I(Xi,B') /(Xi,B)]: 18'  BI > /j} < '(0, B)] ~ I.
e such that 1r(') is continuous and positive
AS: The prior distribution has a density 1f(') On at all B. Remarkably,
Theorem 55.2 (UBernsteinlvon Mises"). If conditions ADA3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold. then
C(y'n((}(})
I X1, ... ,Xn )
~N(O,l
1
(B»)
(5.5.7)
a.s. under P%ralle.
We can rewrite (5.5.7) more usefully as
sup IP[y'n((}  e) < x I Xl, ... , X n]  of>(xVI(B»)j ~ 0
x
(5.5.8)
for all a.s. Po and, of course, the statement holds for our usual and weaker convergence in Po probability also. From this restatement we obtain the important corollary. Corollary 5.5.1. Under the conditions of Theorem 5.5.2,
e
sup IP[y'n(O  e) < x j Xl, ... , XnJ  of>(xVl(e)1
x
~0
(5.5.9)
a.s. P%r all B.
1 , ,
340
Asymptotic Approximations Chapter 5
Remarks
(I) Statements (5,5.4) and (5,5,7)(5,5,9) are, in fact, frequentist statements about the
asymptotic behavior of certain functionvalued statistics.
(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in P, probability if A4 and
A5 are used rather than their strong formssee Problem 5.5.7.
(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and
identifiability guarantees consistency of Bin a regular model.

Proof We compute the posterior density of .,fii(O  B) as
(5.5.10)
where en = en(X!, . .. ,Xn) is given by

Divide top and bottom of (5.5.10) by
;,'
II7
1 p(Xi ,
B) to obtain
(5.5.11)

where l(x,B)
= 10gp(x,B) and
,
,
We claim that
for all B. To establish this note that (a) sup { 11" + 1I"(B) : ItI < M} tent and 1T' is continuous. (b) Expanding, (5.5.13)
(e In) 
~ 0 a.s.
for all M because
eis a.s. consis
I
I
1
i
1 ! , ,
I
p
J
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
341
where
Ie  Bit)1 < )n.
We use I:~
1
g~ (Xi, e) ~ 0 here. By A4(a.s.), A5(a.s.),
1
n
sup { n~[}B,(Xi,B'(t))n~[}B,(Xi,B):ltl<M
In [}'I
[}'l
}
~O,
for all M, a.s. Po. Using (5.5.13). the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),
Po
[dnqn(t)~1f(B)exp{Eo:;:(Xl,B)~}
forallt] =1.
(5.5.14)
Using A6 we obtain (5.5.12).
Now consider
dn =
I:
r
+y'n
1f(e+
;")exp{~I(Xi,9+;,,) 1(Xi,e)}ds
(5.5.15)
dnqn(s)ds
J1:;I<o,fii
J
1f(t) exp
{~(l(Xi' t) 1(X
i,
9)) } l(lt 
el > o)dt
By AS and A7,
Po [sup { exp
{~(l(Xi,t) 1(Xi , e») } : It  el > 0} < e"'("O)] ~ 1
(5.5.16)
for all 0 so that the second teon in (5.5.14) is bounded by y'ne"'("O) ~ 0 a.s. Po for all 0> O. Finally note that (Problem 5.5.4) by arguing as for (5.5.14), tbere exists o(B) > 0 such that
Po [dnqn(t) < 21f(8) exp {~ Eo (:;: (Xl, B))
By (5.5.15) and (5.5.16), for all 0
~}
for all It 1 < 0(8)y'n]
~ I.
(5.5.17)
> 0,
(5.5.18)
Po [dn 
r dnqn(s)ds ~ 0] = I. J1:;I<o,fii
exp {_ 8'I(B)} ds
2
Finally, apply the dominated convergence theorem, Theorem B.7.5, to dnqn(sl(lsl < 0(8)y'n)), using (5.5.14) and (5.5.17) to conclude that, a.s. Po,
d ~ 1f(B)
n
r= L=
= 1f(8)v'21i'.
JI(B)
(5.5.19)
,
I
342
Hence, a.S. Po,
Asymptot'lc Approximations Chapter 5
qn(t) ~ V1(e)<p(tvI(e))
where r.p is the standard Gaussian density and the theorem follows from Scheffe's Theorem B.7.6 and Proposition B.7.2. 0 Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N{ (), ( 2 ) distribution with a 2 known and we put aN ('TJ, 7 2 ) prior on 8. Then the posterior
distribution of8 isN(Wln7J!W2nX,
(~I r12)1) where
,,2
W2n
.,
I
• •
WIn
= nT 2 +U2'
= !WIn
(5.5.20)
,
'"
,
r
, .,,
I
Evidently, as n + 00, WIn + 0, X + 8, a.s., if () = e, and (~I T\) 1 + O. That is, the posterior distribution has mean approximately (j and variance approximately 0, for n large, or equivalently the posterior is close to point mass at as we vn(O  9) has posterior distribution expect from Theorem 5.5.1. Because 9 =
;
N ( .,!nw1n(ry  X), n (~+
;'»
1).
x,
e
Now, vnW1n
=
O(n 1/ 2) ~ 0(1) and
0
n (;i + ~ ) 1
rem 5.5.2.
+ (12 =
II (8) and we have directly established the conclusion of Theo
I ,
I
I
Example 5.5.2. Posterior Behavior in the Binomial~Beta Model. (Example 3.2.3 continued). If we observe Sn with a binomial, B(n, 8), distribution, or equivalently we observe X" ... , X n Li.d. Bernoulli (I, e) and put a beta, (3(r, s) prior on e, then, as in Example 3.2.3, (J has posterior (3(8n +r, n+s  8 n ). We have shown in Problem 5.3.20 that if Ua,b has a f3(a, b) distribution, then as a + 00, b + 00,
I
If 0
a) £ (a+b)3]1( Ua,b a+b ~N(o,I). [ ab
< B<
(5.5.21)
j I ,
i j
1 is true, Sn/n ~. () so that Sn + r + 00, n + s  Sn + 00 a.s. Po. By identifying a with Sn + r and b with n + s  Sn we conclude after some algebra that because 9 = X,
vn((J  X)!:' N(O,e(l e))
a.s. Po, as claimed by Theorem 5.5.2.
o
Bayesian optimality of optimal frequentist procedures and frequentist optimality of
Bayesian procedures
•
Theorem 5.5.2 has two surprising consequences. (a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.
,
1 ,
I
I
t
hz _
I
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
343
(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions. As an example of this phenomenon consider the following.
~
Theorem 5.~.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let B be the MLE ofB and let B* be the median ofthe posterior distribution ofB. Then
(i)
(5.5.22)
a.s. Pe for all
e. Consequently,
~, _ I ~ 1 az. I (e)ae(X" e) +op,(n 1/2 ) e  e+ n L.
l=l
(5.5.23)
and LO( .,fii(rr  e)) ~ N(o, rl(e)).
(ii)
(5.5.24)
E( .,fii(111 
el11111 Xl,'"
,Xn) = mjn E(.,fii(111  dl 
1(11) 1Xl.··· ,Xn) + op(I).
(5.5.25)
Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions In (e, d) = .,fii(18 dlIell· But the Bayes estimatesforl n and forl(e, d) = 18dl must agree whenever E(11111 Xl, ... , Xn) < 00. (Note that if E(1111 I Xl, ... , X n ) = 00, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.l4.22)
sup IP[.,fii(O  e)
< x I Xl,'" ,Xu) 1>(xy'""'I(""'e))1 ~
Oa.s. Po.
(5.5.26)
But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.1 I). Thus, any median of the posterior distribution of .,fii(11  e) tends to 0, the median of N(O, II (~)), a.s. Po. But the median of the posterior of .,fii(0  (1) is .,fii(e'  e), and (5.5.22) follows. To prove (5.5.24) note that
~ ~ ~
and, hence, that
E(.,fii(IOelll1e'l) IXl, .. ·,Xn) < .,fiile
e'l ~O
(5.5.27)
a.s. Po, for all B. Because a.s. convergence Po for all B implies. a.s. convergence P (B.?). claim (5.5.24) follows and, hence,
E( .,fii(10 
01 101) I h ... , Xn)
= E( .,fii(10 
0'1  101) I X" ... , X n ) + op(I).
(5.5.28)
344
~
Asymptotic Approximations
Chapter 5
Because by Problem 1.4.7 and Proposition 3.2.1, B* is the Bayes estimate for In(e,d), (5.5.25) and the theorem follows. 0 Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
Bayes credible regions
~
There is another result illustrating that the frequentist inferential procedures based on f) agree with Bayesian procedures to first order.
Theorem 5.5.4. Suppose the conditions afTheorem 5.5.2 are satisfied. Let
where en is chosen so that 1l"(Cn I Xl, ... ,Xn) = 1  0', be the Bayes credible region defined in Section 4.7. Let Inh) be the asymptotically level 1  'Y optimal interval based on B, given by
~
where dn(y)
i •
= z (! !)
JI~). ThenJorevery€ >
0, 0,
~ 1.
P.lIn(a + €) C Cn(X1 , .•. ,Xn ) C In(a  €)J
(5.5.29)
I
I
I
The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of Jii( IJ  0) converges to the N(O,Il (0)) density nnifonnly over compact neighborhoods of 0 for each fixed 0, is sketched in Problem 5.5.6. The message of the theorem should be clear. Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis both in this case and in estimation reveals that any approximations to Bayes procedures on a scale finer than n 1j2 do involve the prior. A particular choice, the Jeffrey's prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n 1 order (see Schervisch, 1995).
Thsting
! ,
•
Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of 00 given X I, ... ,Xn if H is false is of a different magnitude than the pvalue for the same data. For more on this socalled Lindley paradox see Berger (1985) and Schervisch (1995). However, if instead of considering hypothesis specifying one points 00 we consider indifference regions where H specifies [00 + D.) or (00  D., 00 + D.), then Bayes and freqnentist testing procedures agree in the limit. See Problem 5.5.2. Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a prior possible. Second. we established
i
! I
I b
II
!
_
j
TI
Section 5.6
Problems and Complements
345
the socalled Bernsteinvon Mises theorem actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the socalled Bernstein~von Mises theorem and frequentist contjdence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose Xl, ... , X n are i.i.d. as X ous case density.
rv
F, where F has median F 1 (4) and a continu
(a) Show that, if n
= 2k + 1,
, Xn)
EFmed(X),
n (
_
~
)
l'
k (1  t)kdt F' (t)t
EF
med (X"
2
,Xn )
n( 2;) [1P'(t)f tk (1t)k dt
= 1, 3,
(b) Suppose F is unifonn, U(O, 1). Find the MSE of the sample median for n and 5.
2. Suppose Z ~ N(I', 1) and V is independent of Z with distribution X;'. Then T
Z/
=
(~)!
is said to have a noncentral t distribution with noncentrality J1 and m degrees
of freedom. See Section 4.9.2. (a) Show that
where fm(w) is the x~ density, and <P is the nonnal distribution function.
(b) If X" ... ,Xn are i.i.d.N(I',<T2 ) show that y'nX /
(,.~, L:(Xi
_X)2)! has a
noncentral t distribution with noncentrality parameter .fiiJ1/IT and n  1 degrees of freedom. (c) Show that T 2 in (a) has a noncentral :FI,m distribution with noncentrality parameter J12. Deduce that the density of T is
p(t)
= 2L
i=O
00
P[R = iJ . hi+,(f)[<p(t 1')1(t > 0)
+ <p(t + 1')1(t < 0)1
where R is given in Problem B.3.12.
, " I I. ,
346
Hint: Condition on
Asymptotic Approximations
Chapter 5
ITI.
1, then Var(X) < 1 with equality iff X
3. Show that if P[lXI < 11
±1 with
probability! .
Hint: Var(X) < EX 2
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and f. through ..jiif..
(a) Show that the ratio of the Hoeffding function h( VilE) to the Chebychev function e( Jii€) tends to as Jii€ ~ 00 so that he) is arbitrarily better than en in the tails.
°
I,.,'
i~
(b) Show that the normal approximation 24> (
V;€)  1 gives lower results than h in
00.
,
the tails if P[lXI < 1] = 1 because, if ,,2 < 1. 1  <p(t)  <p(t)lt as t ~ Note: Hoeffding (1963) exhibits better bounds for known a 2 .
R has .\(0) ~ 0, is bounded, and has a hounded second derivative .\n. Show that if Xl, ... , X n are i.i.d., EX l = f.L and Var Xl = 02 < 00, then
5. Suppose.\ : R
~
E.\(X 
1') = .\'(0)
;'/!; +
0
(~)
as n >
00.
= E>.'(O)JiiIX  1'1 + E (";' (X  I')(X I'?) where IX  1'1 < Ix  1'1· The last term is < suPx I>." (x) 1,,2 In and the first tends to
Hint: JiiE(.\(IX 1,1) .\(0))
I
",
1! "
.\'(0)" f== Izl<p(z)dz by Remark B.7. 1(2).
Problems for Section 5,2
1. Using the notation of Theorern 5.2.1, show that
"
2. Let X" ... ,Xn be ij,d. N(I',,,2), Show that for all n
Sup p(•• u) [IX
u
>
1, all €
>
°
/,1 > ,] ~ 1.
Hint: Let (J
)
00.
•
3, Establish (5.2.5). Hint: Iiln  q(p)1
> € =} IPn  pi > w (€).
l
J
4. Let (Ui , V;), 1 < i < n, be i.i.d.  PEP.
(a) Let y(P)
= PIU, > 0, V, > OJ. Show that if P = N(O, 0,1,1, p), then
p
~ sin21l' (Y(P)  ~).
Section 5. N (/J. and the dominated convergence theorem. ". Hint: From continuity of p.6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution. .Xn are i. eol > A} > 0 for some A < 00.~ . eo)} : e E K n {I : 1 IB .d. Prove that (5.14)(i). inf{p(X. .•. eo) : Ie  : Ie . where A is an arbitrary positive and finite constant. (1 (J.14)(i) add (ii) suffice for consistency.\} c U s(ejJ(e j=l T J) {~ t n .e)1 : e' E 5'(e.\} .p(X. (c) Suppose p{P) is defined generally as Covp (U.o)} =0 where S( 0) is the 0 ball about Therefore. V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00.lJ. 6.(eo)} < 00. and < > 0 there is o{0) > 0 such that e.. 5_0  lim Eo..(ii) holds.2 rII. Show that the maximum contrast estimate 8 is consistent. . 5.eol > . then is a consistent estimate of p.eol > . 7. e) .l)2 _ + tTl = o:J. Eo. (i).n 1 (XiJ. (Wald) Suppose e ~ p( X.2. Or of sphere centers such that K Now inf n {e: Ie . By compactness there is a finite number 8 1 .lO)2)) .p(X. .i. e) is continuous. Hint: sup 1. 05) where ao is known. VarpUVarpV > O}. e)1 (ii) Eo.e')p(X. n Lt= ".2. (a) Show that condition (5.2. Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent.e'l < . t e.sup{lp(X. . (Ii) Show that condition (5. for each e eo.p(X. e E Rand (i) For some «eo) >0 Eo. {p(X" 0) .. Suppose Xl. sup{lp(X. e') . by the basic property of maximum contrast estimates. inf{p(X. i (e))} > <. e') . A].8) fails even in this simplest case in which X ~ /J is clear.p(Xi . Hint: K can be taken as [A. eo) : e' E S(O.
then by Jensen's inequality. . + id = m E II (Y.3.xW) < I'lj· 3. Establish Theorem 5... L:~ 1(Xi . .3.k / 2 . Then EIX .Xn but independent of them.1.3) for j odd as follows: . Problems for Section 5. Establish (5. Compact sets J( can be taken of the form {II'I < A. . = 1. Establish (5. Establish (5..I'lj < EIX . L:!Xi . i . Extend the result of Problem 7 to the case () E RP. . . . 9.. 8. J (iii) Condition on IXi  X. for some constants M j .X. < > OJ. li .O'l n i=l p(X"Oo)}: B' E S(Bj. • • (i) Suppose Xf. The condition of Problem 7(ii) can also fail.i.L. .348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi.) 2] . < Mjn~ E (.1.. II I . Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K. For r fixed apply the law of large numbers. . . (72). ~ .d.. n. with the same distribution as Xl...3 I.11). . (ii) If ti are ij.3.. 1 + . Hint: Taylor expand and note that if i . .X~ are i. Let X~ be i..l m < Cm k=I n. i .X'i j . I 10. 4."d' k=l d < mm+1 L d ElY.3. . in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x. P > 1. and take the values ±1 with probability ~.2.O(BJll}.)I' < MjE [L:~ 1 (Xi .d. < < 17 < 1/<. 1/(J. 2.d.. and let X' = n1EX. and if CI.. .9) in the exponential model of Example 5. Show that the log likelihood tends to 00 as a + 0 and the condition fails.1'.[. ..)2]' < t Mjn~ E [. en are constants.X. Hint: See part (a) of the proof of Lemma 5..i.3.3.
. Show that under the assumptions of part (c). where s1 = (n.Fk.m be Ck. (d) Let Ck. 1 < j < m. under H : Var(Xtl ~ Var(Y. if a < EXr < 00.d.) I : i" . (b) Show that when F and G are nonnal as in part (a).Section 5. L~=1 i j = m then m a~l. FandYi""'Yn2 bei. replaced by its method of moments estimate..1 and 'In = n2 .. ~ iLli I IX. I < liLl}P(IXd < liLl)· 7. X. then EIX.i. ~ iLli < 2 i EIX.d.ad)]m < m L aj j=1 m <mmILaj.pli ~ E{IX..d.m) Then.(X.a~d < [max(al"" .. then (sVaf)/(s~/a~) has an . .. km K. . i.3.m 00. j > 2. P(sll s~ < ~ 1~ a as k ~ 00. respectively. n} EIX..aD.andsupposetheX'sandY's are independent.(l'i . theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~.). s~ ~ (n2 ..i.I.m distribution with k = nl .I) Hint: By the iterated expectation theorem EIX.28. R valued with EX 1 = O. Establish 5.. j = 1.c> . . 0'1 = Var(Xd· Xl /LI Ck.i.6 Problems and Complements 349 Suppose ad > 0..a as k t 00./LI = E(Xt}. . G. .\k for some . )=1 5. 8."" X n be i. . PH(st! s~ < Ck. ~ xl". 6. LetX1"". 1)'2:7' . Show that ~ sup{ IE( X.\ > 0 and = 1 + JI«k+m) ZIa. Show that if m = . Let XI. .r'j".Tn) + 1 . Show that if EIXlii < 00.m with K.1) +E{IX.. (a) Show that if F and G are N(iL"af) and N(iL2.l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck.1)'2:7' . ~ iLl' I IXli > liLl}P(lXd > 11. 2 = Var[ ( ) /0'1 ]2 .Xn1 bei.
1. (b) If (X.3.6.. Wnte (l 1pP . "i. . In particular. . use a normal approximation to tind an approximate critical value qk.lIl) + 1 .". (cj Show that if p = 0.l EXiYi .p') ~ N(O. . t N }. suppose i j = j.. . . (iJi .XN} or {(uI. to estimate Ii 1 < j < n.m (0 Let 9. In Example 5.1JT.. < 00 (in the supennodel). .. 4p' (I .d.. Consider as estimates e = ') (I X = x. . if ~ then .1 EX. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3.i._ J . ' .a: as k + 00. TN i.. then P(sUs~ < qk.A») where 7 2 = Var(XIl. Show that under the assumptions of part (e). Without loss of generality. from {t I. (I _ P')'). . Instead assume that 0 < Var(Y?) < x.1'2)1"2)' such that PH(SI!S~ < Qk.. . .+xn n when T. (UN.l EY? .p) ~ N(O. • . .. (b) Use the delta method for the multivariate case and note (b opt  b)(U . () E such that T i = ti where ti = (ui. n.2< 7'. Po. • 10. n. Y) ~ N(I'I..p')') and. i = 1. = ("t ." •.(I_A)a2).i.+.) =a 2 < 00 and Var(U) Hint: (a) X . 1). (a) jn(X . iik. + I+p" ' ) I I I II." .00.4. = (1.N. N where the t:i are i. Var(€.xd.XC) wHere XC = N~n Et n+l Xi.1 that we use Til. = = 1. suppose in the context of Example 3. JLl = JL2 = 0. jn(r . (iJl. if 0 < EX~ < 00 and 0 < Eyl8 < 00.m with KI and K2 replaced by their method of moment estimates. if p i' 0.Ii > 0.iL) • . there exists T I . ! . Under the assumptions of part (c).. Show that jn(XRx) ~N(0.m) + 1 .p).a as k . be qk. (a) If 1'1 = 1'2 ~ 0. and if EX. .1 · /..\ 00. In survey sampling the modelbased approach postulates that the population {Xl. • op(n').I).1. ~ ] . . 7 (1 . then jn(r .. jn((C . i (b) Suppose Po is such that X I = bUi+t:i. . i = I. p). .x) ~ N(O.4. that is.2" ( 11.~) (X .. Et:i = 0.XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters.I' ill "rI' and "2 = Var[( X.t . . "1 = "2 = I.I. Without loss of generality..p) ~ N(O. Tin' which we have sampled at random ~ E~ 1 Xi.3. .d...Il2. . . show that i' ~ . H En.xd.\ < 1.I))T has the same asymptotic distribution as n~ [n.". then jn(r' . .p. as N 2 t Show that. 0 < .m (depending on ~'l ~ Var[ (X I . err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5.350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g.6. .. . . Hint: Use the centra] limit theorem and Slutsky's theorem.
: . Here x q denotes the qth quantile of the distribution.E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d. .. X n is a sample from a population with mean third central moment j1. Normalizing Transformation for the Poisson Distribution. Let Sn have a X~ distribution. (h) Deduce fonnula (5. X = XO. This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl). !) distribution. (a) Use this fact and Problem 5. X. Suppose X 1l . Show that IE(Y..10. EU 13. (a) Show that if n is large. Hint: If U is independent of (V. E(WV) < 00.25. . 15..3.3.3. 1931) is found to be excellent Use (5. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x ..90.99..i'b)(Yc ... then E(UVW) = O.14 to explain the numerical results of Problem 5.13(c).6) to explain why.12). . W). 16. and Hint: Use (5. .i'a)(Yb .3.. (a) Suppose that Ely. (h) Use (a) to justify the approximation 17. The following approximation to the distribution of Sn (due to Wilson and Hilferty.n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO.i'c) I < Mn'..y'ri has approximately aN (0.s. (a) Show that the only transformations h that make E[h(X) . 14.3.3 < 00.. each with HardyWeinberg frequency function f given by . n = 5. It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes. (b) Let Sn . X~.3' Justify fonnally j1" variance (72. Suppose XI.6 Problems and Complements 351 12. Suppose X I I • • • l X n is a sample from a peA) distrihution. = 0.14). is known as Fisher's approximation.. X n are independent..Section 5.. .
1'2)]2C7n + O(n 2) i . 20.n tends to zero at the rate I/(m + n)2.a) Hmt: Use Bm. I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1. (e) What is the approximate distribution of Vr'( X . 1'2)h2(1'1. Yn are < < . . Y) .• • • ~{[hl (1". and h'(t) > for all t. which are integers.. . Let X I.. Show that ifm and n are both tending to oc in such a way that m/(m + n) > a. 19.1'2)]2 177 n +2h. (1'1. Justify fonnally the following expressions for the moments of h(X.Xm . 1'2) = h.n  m/(m + n») < x] 1 > 'li(x). Var(Y) = C7~. .a) E(Bmn ) = . + (mX/nY)] where Xl. (Xn .. 0< a < I. ..y). Let then have a beta distribution with parameters m and n.I'll + h2(1'1.n = (mX/nY)i1 independent standard eXJX>nentials. (1'1.y) = a axh(x.y) = a ayh(x. Bm. . h(l) = I. Y" .2. X n be the indicators of n binomial trials with probability of success B. ° .5 that under the conditions of the previous problem. (a) (b) Var(h(X.1') + X 2 . Show directly using Problem B.h(l'l.n P . Cov(X. Var Bmnm + n +Rmn = 'm+n ' ' where Rm. . where I' = (a) Find an approximation to P[X < E(X. Yn) is a sample from a bivariate population with E(X) ~ 1'1.. 1'2)pC7..a tends to zero at the rate I/(m + n)2. .. if m/(m + n) . V)) '" .. then I m a(l. Variance Stabilizing Transfo111U1tion for the Binomial Distribution. (b) Find an approximation to P[JX' < t] in terms of 0 and t. 1'2)(X . [vm +  n (Bm.. Show that the only variance stabilizing transformation h such that h(O) = 0. Y) ~ p!7IC72. is given by h(t) = (2/1r) sin' (Yt).(x.y). where I I h. E(Y) = 1'2. 172 + [h 2(1'l. v'a(l. h 2 (x.)? 18. • j b _ . Var(X) = 171. i 21. 1'2) Hint: h(X. Y) where (X" Yi). 1'2)(Y  + O( n '). 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t.
J1.2)/24n2 Hint: Therefore. Let Xl>"" X n be a sample Irom a population with and let T = X2 be an estimate of 112 .2V where V ~ 00.J1.).)1J1.l4 is finite. xi. XI.) 23.Section 5.)] !:. Usc Stirling's approximation and Problem 8. (a) When J1.)] ~ as ~h(2)(J1. < t) = p(Vi < (b) When J1.6 Problems and Complements 353 22. J1.(1.4 to give a directjustification of where R n / yin 0 as in n Recall Stirling's approximation: + + 00. Show that Eh(X) 3 h(J1. Suppose IM41 (x)1 < M for all x and some constant M and suppose that J. Suppose that Xl.X) in tenns of the distribution function when J.2V with V "' Give an approximation to the distribution of X (1 . (e) Fiud the limiting laws of y'n(X .) + ~h(2)(J1. . Let Sn "' X~. (a) Show that y'n[h(X) . 1 X n be a sample from a population with mean J.)2 and n(X . ()'2 = Var(X) < 00. and that h(1)(J..h(J1.2) using the delta method.. (It may be shown but is not required that [y'nR" I is bounded. = E(X) "f 0. _. . = ~.)2.l) = o.3 1/6n2 + M(J1.J1. 24.X2 .4 + 3o.. X n is a sample from a population and that h is a realvalued function 01 X whose derivatives of order k are denoted by h(k). xi 25.. find the asymptotic distrintion of nT nsing P(nT y'nX < Vi). k > I..2. ° while n[h(X h(J1.)] is asymptotically distributed (b) Use part (a) to show that when J1. n[X(I . = 0.. .J1.l and variance (J2 < Suppose h has a second derivative h<2) continuous at IJ. .) + ": + Rn where IRnl < M 1(J1.l = ~. find the asymptotic distribution of y'n(T. Let Xl. Compare your answer to the answer in part (a).J1.
4. Show that if Xl" . Hint: (a). What happens? Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution.) (b) Deduce that if. . cyi . .9. the intervals (4. (a) Show that P[T(t. 28. (d) Make a comparison of the asymptotic length of (4.\ < 1.3. 29. N(Pl (}2).JTi times the length of the onesample t interval based on the differences.3). = = az./".a. that E(Xt) < 00 and E(Y.\)al + .) of Example 4.2 . . " Yn as separate samples and use the twosample t intervals (4. . (c) Show that if IInl is the length of the interval In.9.3) has asymptotic probability of coverage < 1 . .') < 00.I I (a) Show that T has asymptotically a standard Donnal distribution as and I I . .4..1/ S D where D and S D are as in Section 4. 1 Xn. We want to study the behavior of the twosample pivot T(Li.3 independent samples with III = E(XIl. . p) distribution. Let T = (D .9.~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of . if n}.Jli = ~.3. . . Viillnl ~ 2vul + a~z(l . Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X .}.\. I i t I .t.(~':~fJ'). Yn2 are as in Section 4.9. the interval (4. .9. . then limn PIt. (c) Apply Slutsky's theorem. Assume that the observations have a common N(jtl.\.d. Suppose that instead of using the onesample t intervals based on the differences Vi .\)a'4)/(1 . Hint: Use (5. Suppose (Xl. . . 27.3) and the intervals based on the pivot ID .. XII are ij.\a'4)I'). and a~ = Var(Y1 ). /"' = E(YIl. < tJ ~ <P(t[(.\ probability of coverage.t. n2 + 00.9.3. .354 26. Let n _ 00 and (a) Show that P[T(t.9.3) have correct asymptotic (c) Show that if a~ > al and .Xi we treat Xl.•. .2a 4 ). 2pawz)z(1  > 0 and In is given by (4. E In] > 1 . and SD are as defined in Section 4. al = Var(XIl.a. Y1 .. Eo) where Eo = diag(a'. (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment. I .3). b l .) < (b) Deduce that if p tl ~ <P (t [1. .4. so that ntln ~ . Suppose Xl. .33) and Theorem 5. a~. 1 X n1 and Y1 .9. Suppose nl + 00 .... whereas the situation is reversed if the sample size inequalities and variance inequalities agree.9.)/SD where D..o.. n2 + 00. 'I :I ! t.\ > 1 .\ul + (1 ~ or al .. . 0 < .t2.(j ' a ) N(O. We want to obtain confidence intervals on J1. Yd.
.3 by showing that if Y 1. It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. . Carry out a Monte Carlo study such as the one that led to Figure 5.. then P(Vn < va<) t a as 11 t 00. . < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X. . then lim infn Var(Tn ) > Var(T). ...3. I is the Euclidean norm.I) + y'i«n .8Z = (n . Show that if 0 < EX 8 < 00.9 and 4..z = Var(X). as X ~ F and let I' = E(X).6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4. Tn . 33.3.a)) and evaluate the approximations when F is 7..xdf and Ixl is Euclidean distance.3 using the Welch test based on T rather than the twosample t test based on Sn. Plot your results. where tk(1. Let . I< = Var[(X 1')/()'jZ. Let X ij (i = I.9.Section 5. has asymptotic level a. X. k) be independent with Xi} ~ N(l'i.£. but Var(Tn ) + 00. Generalize Lemma 5.z has a X~I distribution when F is theN(fJl (72) distribution.Z).1.3. (d) Find or write a computer program that carries out the Welch test.a). . then for all integers k: where C depends on d.4.4.3. Hint: If Ixll ~ L.I k and k only.z are Iii = k. Show that k ~ ex:: as 111 ~ 00 and 112 t 00.. Show that as n t 00.1)1 L.I)sz/().a) is the critical value using the Welch approximation.£.d. . Vn = (n . Let XI. 32. . x = (XI. (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l . Y nERd are Li..16." . EIY. (a) SupposeE(X 4 ) < 00. ().d.. (c) Let R be the method of moment estimaie of K._I' Find approximations to P(Yn and P{Vn < XnI (I .:~l IXjl.i.~ t (Xi . j = I. vectors and EIY 1 1k < 00. (a) Show that the MLEs of I'i and (). U( 1. where I. then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31. T and where X is unifonn. ()..1). 30. .X)" Then by Theorem B.I)z(a).1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" . Hint: See Problems B. X n be i.p. and let Va ~ (n .
0))  7'(0)) 0.. Assmne lbat '\'(0) < 0 exists and lbat ..d. random variables distributed according to PEP... . .0) = O..nCOn .4 1.)). .. . Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) . Let On = B(P).p(X.0) !. • 1 = sgn(x).\(On)] !:.f. 1948). . . .L~. [N(O))' . 1) for every sequence {On} wilb On 1 n = 0 + t/.n for t E R. . (c) Give a consistent estimate of (]'2. .p(x) (b) Suppose lbat for all PEP. .O(P)) > 0 > Ep!/!(X.. ! Problems for Section 5. .On) < 0).356 Asymptotic Approximations Chapter 5 .p(X. Show that _ £ ( . Let i !' X denote the sample median. I(X.n(X .. N(O) = Cov(. \ . (c) Assume the conditions in (a) and (b). aliI/ > O(P) is finite. O(P) is lbe unique solution of Ep. N(O.. I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 . Show lbat if I(x) ~ (d) Assume part (c) and A6. . where P is the empirical distribution of Xl.. . . (a) Show that (i)... Let Xl.n(On . . I i .1)cr'/k..On) . F(x) of X. 00 (iii) I!/!I(x) < M < for all x.. 1/). 1 X n. 1/4/'(0)).... is continuous and lbat 1(0) = F'(O) exists. Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique. then < t) = P(O < On) = P (.p(X. . n 1 F' (x) exists. . .0) !:. . I . N(O.i.p(X.... (k . .p(X. (e) Suppose lbat the d.n7(0) ~[. Show that.. (Use .) Hint: Show that Ep1/J(X1  8) is nonincreasing in B. I ! 1 i .(0) Varp!/!(X... Ep...0) ~ N Hint: P( .0). _ ..p(X.0) and 7'(0) .p(Xi . under the conditions of (c). Use the bounded convergence lbeorem applied to .. That is the MLE jl' is not consistent (Neyman and Scott.0). then 8' !. (b) Show that if k is fixed and p ~ 00. Show that On is consistent for B(P) over P.p( 00) as 0 ~ 00. ..Xn be i. Set (. and (iii) imply that O(P) defined (not uniquely) by Ep. (ii).
b)p(x. 2.Xn ) is the MLE.7. 1979) which you may assume.exp(x/O). " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 . then 'ce(n(B . denotes the N(O. oj).  = a?/o.(x). 0» density. Show that A6' implies A6.. (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2). 0) = .O)p(x. Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley.0. If (] = I..O))dl"(x) ifforalloo < a < b< 00. lJ :o(1J.(x ."' c .d. Conclude that (b) Show that if 0 = max(XI.i.(x. Show that   ep(X.(x) = (I . Hint: teJ1J.2.(x..B) < xl = 1.b)dl"(x) J 1J. 0 > 0. Thus. J = 1.(x) + <'P. 'T = 4.y. 0 «< 0. then ep(X.c)'P..(1:e )n ~ 1.    .9)dl"(x) = J te(1J. x E R. Let XI. Show that if (g) Suppose Xl has the gross error density f.X) =0. This is compatible with I(B) = 00.'" .0) (see Section 3.I / 2 .6 Problems and Complements 357 (0 For two estImates 0. evaluate the efficiency for € = ..O)dl"(x)) = J 1J.B)) ~ [(I/O).(x. and O WIth .a)p(x. . Show that assumption A4' of this section coupled with AGA3 implies assumption A4. 4. Hint: Apply A4 and the dominated convergence theorem B. X n be i. not only does asymptotic nonnality not hold but 8 converges B faster than at rate n.(x.. ~"'. X) = 1r/2.05..0.' . U(O.4.O). X) as defined in (f).a)dl"(x).(x.15 and 0.0)p(x. Find the efficiency ep(X.5) where f. the asymptotic 2 .O)p(x. 82 ) Pis N(I".0) ~ N(O. .5  and '1'.O) is defined with Pe probability I but < x. .10. . (a) Show that g~ (x.~ for 0 > x and is undefined for 0 g~(X.jii(Oj .Section 5. 3. ( 2 ).20 and note that X is more efficient than X for these gross error cases. not O! Hint: Peln(B ..
I .4.in) . .4) to g.) P"o (X .i (x.80 + p(X .0'1 < En}'!' 0 . A2."log .In) + . 1(80 ). . 0 for any sequence {en} by using (b) and Polyii's theorem (A. Show that 0 ~ 1(0) is continuous.80 p(X" 80) +.O') .(X. j.14.i. I (c) Prove (5. ~ L. i I £"+7.18 .5 continue to hold if .4... I 1 ..4.. Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+". i. &l/&O so that E. 80 +. ~log (b) Show that n p(Xi . ) ! . ) n en] . I I j 6.i'''i~l p(Xi.< ~log n p(X. 0)..358 5. 0 ~N ~ (. Suppose A4'.in) ~ . O.Oo) p(Xi .[ L:. . (8) Show that in Theorem 5.·_t1og vn n j = p( x"'o+. 8 ) i . . and A6 hold for.In ~ "( n & &8 10gp(Xi.in) 1 is replaced by the likelihood ratio statistic 7. Show that the conclusions of PrQple~ 5.50). I .00) . 8)p(x.i(X.in. 8) is continuous and I if En sup {:.. under F oo ' and conclude that .O) 1(0) and Hint: () _ g."('1(8) .~ (X.5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.g.80 + p(Xi .p ~ 1(0) < 00.. 8 ) 0 .22).7. Apply the dominated convergence theorem (B.
Section 5.14.4. Let B~ be as in (5. hence. (a) Establish (5. 11.59).4. Hint: Use Problem 5. = X.59) coincide. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5. at all e and (A4'). Consider Example 4. Hint: Use Problems 5. (a) Show that under assumptions (AO)(A6) for all Band (A4').a.5. which agrees with (4.4.4. B' _n + op (n 1/') 13. 10.58). (Ii) Compare the bounds in Ca) with the bound (5.4.4. Let B be two asymptotic l.59). B).54) and (5.6. A. ~ 1 L i=I n 8'[ ~ 8B' (Xi.4. .2.4. and give the behavior of (5. Let [~11. Hint. Compare Theorem 4.7.:vell . ~.4.3). Then (5. which is just (4. if X = 0 or 1. all the 8~i are at least as good as any competitors.3.6 Problems and Complements 359 8. f is a is an asymptotic lower confidence bound for fJ.57) can be strengthened to: For each B E e.2.56).4.2.4.Q" is asymptotically at least as good as B if.7). 9.61) for X ~ 0 and L 12. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 .B lower confidence bounds.4.4. setting a lower confidence bound for binomial fJ. Establish (5.4. gives B Compare with the exact bound of Example 4. We say that B >0 nI Show that 8~I and.6 and Slutsky's theorem.5 and 5. (Ii) Deduce that g. for all f n2 nI n2 .4.61).4. (a) Show that.4. (Ii) Suppose the conditions of Theorem 5. the bound (5.5 hold. (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ). B' nj for j = 1.
LJ. .I). the pvalue fJ . Let X 1> .•. X n be i. I' has aN(O.i. X > 0. Jl > OJ .I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox"). Establish (5. 0 given Xl. 1 .2. 1).)) and that when I' = LJ. 1.d. Show that (J l'.21J2 2x A} .d.riX) = T(l + nr 2)1/2rp (I+~~. " " I I .t = 0 versus K : J. N(I'.\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO. Consider the Bayes test when J1. Hint: By (3. (a) Show that the test that rejects H for large values of v'n(X . if H is false. ! (a) Show that the posterior probability of (OJ is where m n (. where Jl is known. 1 15.55).i. inverse Gaussian with parameters J. ..2. (b) Suppose that I' ).5. the evidence I .. N(Jt. is a given number. Suppose that Xl. J1. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : .1 and 3. Consider testing H : j.10) and (3.'" XI! are ij. variables with mean zero and variance 1(0). . 00..4.4. (e) Find the Rao score test for testing H : . each Xi has density ( 27rx3 A ) 1/2 {AX + J. the test statistic is a sum of i.\ > o. phas a U(O. . A exp .2.i.. (c) Suppose that I' = 8 > O.) has pvalue p = <1>(v'n(X .5 ! 1.)l/2 Hint: Use Examples 3. 7f(1' 7" 0) = 1 . 1) distribution.1. I .=2[1  <1>( v'nIXI)] has a I U(O.01 . Now use the central limit theorem. > ~.LJ. (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation. Consider the problem of testing H : I' E [0. . X n i.11).] versus K .1. 2...\ l ..A 7" 0. By Problem 4.L and A.360 1 Asymptotic Approximations Chapter 5 14.d.. Problems for Section 5. LJ. Show that (J/(J l'.. = O. (d) Find the Wald test for testing H : A = AO versus K : A < AO. That is.4. =f:. 1. is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0. 1) distribution.d.6.. That is. where .1.. T') distribution. ! I • = '\0 versus K : .\ < Ao.
. all .l has a N(O.029 .( Xt.054 50 .2.2 and the continuity of 7f( 0).5. (e) Verify the following table giving posterior probabilities of 1..:. [0.O') : 100'1 < o} 5.5.) (d) Compute plimn~oo p/pfor fJ.(Xi..5. Establish (5.01 < 0 } continuThen 00.s.17). 10 .Oo»)+log7f(O+ J. Hint: By (5. Hint: 1 n [PI .i Itlqn(t)dt < J:J')..s. 1) prior.058 20 .} dt < .5. for all M < 00. ifltl < ApplytheSLLN and 0 ous at 0 = O.1.for M(.5. J:J').8) ~ 0 a.1 I'> = 1. i ~.034 ..052 100 .645 and p = 0.5.17).sup{ :. 1) and p ~ L U(O.042 .05.Section 5. 0' (i» n L 802 i=l < In n ~sup {82 8021(X i . P.S..' " .(Xi.13) and the SLLN.~~(:. oyn. Extablish (5.14). yn(anX   ~)/ va.4. ~ L N(O.i It Iexp {iI(O)'.0 3.046 .6 Problems and Complements 361 (b) Suppose that J. Fe. ~l when ynX = n 1'>=0. Suppose that in addition to the conditions of Theorem 5.I).2.. Show that the posterior probability of H is where an = n/(n + I). Apply the argnment used for Theorem 5.. By (5.0'): iO' .) sufficiently large.050 logdnqn(t) = ~ {I(O)...O'(t)).I(Xi. Hint: In view of Theorem 5. yn(E(9 [ X) . (e) Show that when Jl ~ ~. L M M tqn(t)dt ~ 0 a. 4. By Theorem 5.1 is not in effect. ~ E.2 it is equivalent to show that J 02 7f (0)dO < a.5.)}. (Lindley's "paradox" of Problem 5.
S.I(Xi}»} 7r(t)dt.1.29).Pn where Pn a or 1. A proof was given by Cramer (1946). (0» (b) Deduce (5.5.4 (1) This result was first stated by R. Notes for Section 5. ~ II• '. The sets en (c) {t . Show tbat (5. (I) If the rightband side is negative for some x. See Bhattacharya and Ranga Rao (1976) for further discussion. .16) noting that vlnen. all d and c(d) / in d. I: . . • . qn (t) > c} are monotone increasing in c. . Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d.".) ~ 0 and I jtl7r(t)dt < co. dn Finally. Fn(x) is taken to be O. J I . I : ItI < M} O. i=l }()+O(fJ) I Apply (5. for all 6.5.7 NOTES Notes for Section 5.1 . . For n large these do not correspond to distributions one typically faces. .3 ). .5. ~ 0 a. . Fisher (1925). Bernstein and R. " . (1) This famous result appears in Laplace's work and was rediscovered by S. . .s.) and A5(a. Finally. 362 f Asymptotic Approximations Chapter 5 > O. 7.2 we replace the assumptions A4(a.9) hold with a.s.) by A4 and A5. ! ' (2) Computed by Winston Cbow.8) and (5.. von Misessee Stigler (1986) and Le Cam and Yang (1990). (a) Show that sup{lqn (t) .5. . jo I I ..1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 .5.5.!. . • Notes for Section 5.I Notes for Section 5. convergence replaced by convergence in Po probability. 5. . I i. A.s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) . . (O)<p(tI. L .f.5 " I i .s. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5.d) for some c(d). 'i roc J. Suppose that in Theorem 5.
R. Prob. Normal Approximation and Asymptotic Expansions New York: Wiley. "Probability inequalities for sums of bounded random variables." J. Cambridge University Press.Section 5. A.. Assoc.. 1976. The Behavior of the Maximum Likelihood Estimator Under NonStandard Conditions. I Berkeley. P. A CoUrse in Large Sample Theory New York: Chapman and Hall. RANGA RAO. Acad. Theory o/Statistics New York: Springer. NJ: Princeton University Press. M.S.. 1985. E. P. 1990. E. 'The distribution of chi square. HAMMERSLEY. Editors Cambridge: Cambridge University Press. C." Proc.. HfLPERTY. H.A. M. p. 1998. B. F. AND R. Phil. H. ." Proc. 1967. FISHER. AND M. R. Sci.. 1979. reprinted in Biometrika Tables/or Statisticians (1966). 58. 1980. STIGLER. M. Vol. 318324 (1953). LEHMANN. 1958. Nat. J. Vol. 3rd ed. L. SCOTT. DAVID. R.. 1946. RAo. P." Econometrica. Monte Carlo Methods London: Methuen & Co.. 684 (1931). Theory ofPoint EstimatiOn New York SpringerVerlag. MA: Harvard University press. 22. HUBER.. S. CASELLA. E. J. Symp. I.. Wiley & Sons. R. YANG. SCHERVISCH. T. Some Basic Concepts New York: Springer. J. C. LE CAM.. Camb. HOEFFDING. AND G.. AND E.. New York: J. CRAMER. CA: University of California Press. AND G. E. FERGUSON. Vth Berkeley Symposium.. R. 1987. H. Math. Statisticallnjerence and Scientific Method. L. J... L. 1986. W. 1973.. 1996. 1380 (1963). New York: McGraw Hill. HANSCOMB. S. 1964. BILLINGSLEY. G. 1999. 17. J. A. Tables of the Correlation Coefficient. 1938. AND D. 3rd ed. L.. LEHMANN. The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge. Probability and Measure New York: Wiley.8 References 363 5. Hartley and E. FISHER. 1995. Proc. RUDIN.. "Consistent estimates based on partially consistent observations. Linear Statistical Inference and Its Applications. SERFLING. Box. Mathematical Methods of Statistics Princeton. O. 700725 (1925). U. Statist.. 40. Amer. Statistical Decision Theory and Bayesian Analysis New York: SpringerVerlag. "Theory of statistical estimation. Mathematical Analysis. W. Asymptotics in Statistics.8 REFERENCES BERGER. WILSON. Pearson. NEYMAN.. L. "Nonnormality and tests on variances:' Biometrika. Approximation Theorems of Mathematical Statistics New York: J.. Vth Berk.. S. 2nd ed. 132 (1948). N. Elements of LargeSample Theory New York: SpringerVerlag. Soc. 16. Statist. Wiley & Sons. BHATTACHARYA.
However. the multinomial (Examples 1. the bootstrap.and semiparametric models. 365 .5.3).3. confidence regions. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for functionvalued statistics. tests. 2.1) and more generally have studied the theory of multiparameter exponential families (Sections 1. 2. We shall show how the exact behavior of likelihood procedures in this model correspond to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular ddimensional parametric models and shall illustrate our results with a number of important examples. the properties of nonparametric MLEs. often many.or nonpararnetric models.1. an important aspect of practical situations that is not touched by the approximation. multiple regression models (Examples 1.8 C R d We have presented several such models already.3). and confidence regions in regular onedimensional parametric models for ddirnensional models {PO: 0 E 8}.4. 2.Chapter 6 INFERENCE IN THE MUlTIPARAMETER CASE 6. the number of observations.2.1 INFERENCE FOR GAUSSIAN LINEAR MODElS • Most modern statistical questions iovol ve large data sets. for instance. we have not considered asymptotic inference. This chapter is a leadin to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non. 1. real parameters and frequently even more semi.1. in which we looked at asymptotic theory for the MLE in multiparameter exponential families. There is.3. and prediction in such situations.4. curve estimates. 2. and efficiency in semiparametric models. the modeling of whose stochastic structure involves complex models governed by several. however. We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible. [n this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates.2 and 5. are often both large and commensurate or nearly so.6.2.7. and n. the fact that d.6. with the exception of Theorems 5. the number of parameters. testing.2. Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II. The inequalities ofVapnikChervonenkis.3.
" n (6.. • . We consider experiments in which n cases are sampled from a population.6 we will investigate the sensitivity of these procedures to the assumptions of the model. are U. 0"2). the Zij are called the design values. • i .d.N(O. In vector and matrix notation..l p.1. . .1. we have a response Yi and a set of p . i = l. I I.1. . e ~N(O... I The regression framewor~ of Examples 1." ..1. These are among the most commonly used statistical techniques. '.zipf.1. when there is no ambiguity. n (6. 1 Yn from a population with mean /31 = E(Y)..4 and 2. . Here is Example 1.Zip' We are interested in relating the mean of the response to the covariate values.1...4) 0 where€I.1.. It turn~ out that these techniques are sensible and useful outside the narrow framework of model (6. .'" . The model is Yi j = /31 + €i. . We have n independent measurements Y1 . .. Notational Convention: In this chapter we will.3) whereZi = (Zil. i .Zip.. In the classical Gaussian (nannal) linear model this dependence takes the fonn p 1 Yi where EI. 6.1 is also of the fonn (6.1) j .1. In Section 6.l)T. let expressions such as (J refer to both column and row vectors.3).2.n (6.. .a2).Herep= 1andZnxl = (l..1 The Classical Gaussian linear Model l' f Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Yi among n independent observations has a distribution that depends on known constants Zil. say the ith case.5) . . Z = (Zij)nxp.d. andJ is then x nidentity matrix.j .€nareij. Regression.1 covariate measurements denoted by Zi2 • .1. i = 1. . 1 en = LZij{3j j=1 +Ei. we write (6. " 'I and Y = Z(3 + e..2) Ii .1.1. .2J) (6.1..2(4) in this framework. N(O. .3). and for each case.1. . Example 6. The normal linear regression model is }i = /31 + L j=2 p Zij/3j + €i.1. i = 1. The OneSample Location Problem.1. and Z is called the design matrix.. . Here Yi is called the response variable. 366 Inference in the Multiparameter Case Chapter 6 ! . .. .3): l I • Example 6. . In this section we will derive exact statistical procedures under the assumptions of the model (6.
and so on. We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional distribution of Y given a set of observed covariate values.1.6) is an example of what is often called analysis ojvariance models.6) is often reparametrized by introducing ( l ~ pl I:~~1 (3k and Ok = 13k . we want to do so for a variety of locations. . Yn1 + 1 .1. To see that this is a linear model we relabel the observations as Y1 ./.. the design matrix has elements: 1 if L jl nk + 1<i < L nk k=1 j k=l ootherwise and z= o 0 ••• o o Ip where I j is a column vector of nj ones and the 0 in the "row" whose jth member is I j is a column vector of nj zeros. Ynt +n2 to that getting the second. · · . To fix ideas suppose we are interested in comparing the performance of p > 2 treatments on a population and that we administer only one treatment to each subject and a sample of nk subjects get treatment k. The model (6. this terminology is commonly used when the design values are qualitative.1.3) applies.Yn . 13k is the mean response to the kth treatment. The model (6. n.9. 0 Example 6.. + n p = n.•. The pSample Problem Or One... if no = 0.1. Frequently. then the notation (6. If we are comparing pollution levels. (6. The random design Gaussian linear regression model is given in Example 1. Generally. .2) and (6. In Example I. we often have more than two competing drugs to compare.. Yn1 correspond to the group receiving the first treatment.~) random variables. and so on. one from each population.1. We treat the covariate values Zij as fixed (nonrandom).0: because then Ok represents the difference between the kth and average treatment . . 1 < k < p. . If the control and treatment responses are independent and nonnally distributed with the same variance a 2 . (6."i = 1.4. In this case. /3p are called the regression coefficients.1fwe set Zil = 1. Then for 1 < j < p. we arrive at the one~way layout or psample model. nl + ..3. where YI. we are interested in qualitative factors taking on several values.3 we considered experiments involving the comparisons of two population means when we had available two independent samples.5) is called the fixed design norrnallinear regression model..1 Inference for Gaussian Linear Models 367 where (31 is called the regression intercept and /32. ..1. . .3 and Section 4..3.Way Layout. . and the €kl are independent N(O.1.6) where Y kl is the response of the lth subject in the group obtaining the kth treatment. TWosample models apply when the design values represent a qualitative factor taking on only two values.Section 6.
i = 1. i5p )T. • ... V n for Rn such that VI. and with the parameters identifiable only once d . n. Because dimw = r. b I _ . we call Vi and Vj orthogonal. . . (3 E RP}. TJi = E(Ui) = V'[ J1.. • I t Ew ¢:> t = L:(v[t)Vi i=l ¢:> vTt = 0.g.po In tenns of the new parameter {3* = (0'. The parameter set for f3 is RP and the parameter set for It is W = {I' = Z(3.. E ~N(O.. the linear Y = Z'(3' + E.1. n... . . the vector of means p of Y always is.Y. there exists (e.. We assume that n > r..6jCj are the columns of the design matrix. r i' . It is given by 0 = (itl. n. . j = I. Ui = v.. V r span w. . This type oflinear model with the number of columns d of the design matrix larger than its rank r. . .• znj)T. i = T + 1. Even if {3 is not a parameter (is unidentifiable). is common in analysis of variance models.. We now introduce the canonical variables and means .! x (p+l) = (1.(T2J) where Z. I5 h . i . {3* is identifiable in the pdimensional linear subspace {(3' E RP+l : L:~~l dk = O} of RP+ 1 obtained by adding the linear restriction E~=l 15k = 0 forced by the definition of the 15k '5. ...P. . 1 JLn)T I' where the Cj = Z(3 = L j=l P . Let T denote the number of linearly independent Cj..a 2 ) is identifiable if and only ifr = p(Problem6. Note that any t E Jl:l can be written . i=I (6. Cj = (Zlj. j = I •... .368 effects. •• Note that w is the linear space spanned by the columns Cj. Z) and Inx I is the vector with n ones. •.. .7) !.. by the GramSchmidt process) (see Section B. . j = 1. " .17).. The Canonical Fonn of the Gaussian Linear Model The linear model can be analyzed easily using some geometry. When VrVj = O. then r is the rank of Z and w has dimension r..3. .2).p. an orthononnal basis VI. It follows that the parametrization ({3. However. .1. Recall that orthononnal means Vj = 0 fori f j andvTvi = 1.. of the design matrix..r additional linear restrictions have been specified. . vT n i I t and that T = L(vTt)v. . k model is Inference in the Multiparameter Case Chapter 6 1. Note that Z· is of rank p and that {3* is not identifiable for {3* E RP+l. ..
(6. . '" ~ A 1'1. . E w. which are sufficient for (IJ..1.1... and then translate them to procedures for .2. (6.. = 1. . U1 .. 7J = AIJ.• .1. •. while (1]1. (3.1.•• (ii) U 1 . IJ.2 ) using the parametrization (11. 20. .. i (iv) = 1.2.Ur istheMLEof1]I.. 1]r)T varies/reely over Rr.10). (T2)T. • ..2 We first consider the known case.1.u) 1 ~ 2 n 2 .)..2 and E(Ui ) = vi J.1.(Ui . i it and U equivalent.2 L i=l + _ '" ryiui _ 02~ i=l 1 r r 2 (6.. 2 0.'J nxn .". Moreover.'Ii) .1]r. r.1. 1 ViUi and Mi is UMVU for Mi. Let A nxn be the orthogonal matrix with rows vi.1 Inference for G::'::"::"::"::"::L::'::"e::'::'::M::o::d::e'::' '" '3::6::::. .. 2 ' " 'Ii i=l ~2(T2 6. also UMVU for CY. ...9 N(rli. .. . Theorem 6. The U1 are independent and Ui 11i = 0. .1.. . v~...1. it = E. .2 known (i) T = (UI •.. . 1 If Cl. . U. .Section 6. We start by considering the log likelihood £('1. Un are independent nonnal with 0 variance 0.. In the canonical/orm ofthe Gaussian linear model with 0.2<7' L.10) It will be convenient to obtain our statistical procedures for the canonical variables U. n. which is the guide to asymptotic inference in general. (iii) U i is the UMVU estimate of1]i.3. n because p.1. .9) Var(Y) ~ Var(U) = . where = r + 1.8) So.L = 0 for i = r + 1. then the MLE of 0: Ci1]i is a = E~ I CiUi. n. and by Theorem B. = L. . . u) based on U £('I. whereas (6.' based on Y using (6... .. Theorem 6.Cr are constants.. i i = 1.. . Then we can write U = AY. . Proof..)T is sufJicientfor'l. n.. 0.8)(6.2"log(21r<7 ) t=1 n 1 n _ _ ' " u. and .11) _ n log(21r"').and 7J are equivalently related.2 Estimation 0. is Ui = making vTit. ais (v) The MLE of. Note that Y~AIU. observing U and Y is the same thing. .
..) = a.• EUl U? Ul Projections We next express j1 in terms of Y.(J2J) by Theorem B.2. (i) By observation.O)T E R n . Ui is UMVU for E(U... i = 1. .1 L~ r+l U. 2(J i=r+l ~ U. I i > . The distribution of W is an exponential family..Ur'L~ 1 Ul)T is sufficient. To show (ii).4.'7. recall that the maximum of (6. .11) by setting T j = Uj . (i) T = (Ul1" .. (6.. this statistic is equivalent to T and (i) follows. j = 1. . and give a geometric interpretation of ii.10. " V n of R n with VI = C = (c" .4. .2. the MLE of q(9) = I:~ I c.2 . .11) is a function of 1}1. . . and . 1 Ur . ( 2 )T.3.1)... .N(~.4. .11) is an exponential family with sufficient statistic T.4. But because L~ 1 Ui2 L:~ I Ui2 + r+l Ul..2 .11). . obtain the MLE fJ of fl. (iv) The conclusions a/Theorem 6. (We could also apply Theorem 2.1]r.. 0.. . where J is the n x n identity matrix.. l Cr . r.3 and Example (iii) is clear because 3.ti· 1 . ~i = vi'TJ. n " . . n.4 and Example 3.6 to the canonical exponential family obtained from (6.. 0  i Next we consider the case in which a 2 is unknown and assume n >r+ 1. By GramSchmidt orthogonalization.1.) (iii) By Theorem 3. I. = 1. 1lr only through L~' 1 CUi ."7i)2 and is minimized by setting '(Ii = Ui. " as a function of 0.6. apply Theorem 3. "7r because. . Proof. there exists an orthonormal basis VI) . .2). 2 2 ()j = 1'0/0..1. {3.1. Q is UMVU. r.11) has fJi 1 2 = Ui . by observation. Let Wi = vTu. wecan assume without loss of generality that 2:::'=1 c.1.. Assume that at least one c is different from zero. . . (v) Follows from (iv). To this end. . (iii) 8 2 =(n . T r + 1 = L~ I and ()r+l = 1/20..1. 1 U1" are the MLEs of 1}1.1.. By Problem 3. (ii) The MLE ofa 2 is n. The maximizer is easily seen to be n 1 L~ r+l (Problem 6. i > r+ 1.3. . To show (iv).Ui · If all thee's are zero. we need to maximize n (log 27f(J2) 2 II ". That is.. (6.1. In the canonical Gaussian linear model with a 2 unknown.. and is UMVU for its expectation E(W. WI = Q is sufficient for 6 = Ct.2 (ii)..2.. (U I .'  .. . I .. = 0.) = '7" i = 1. 2:7 r+l un Tis sufficientfor (1]1. (ii) U1. . Proof. then W .3.4. .r) 2:7 1 L~ r+l ul is an unbiased estimator of (J2.. Theorem 6. By (6.. (iv) By the invariance of the MLE (Section 2. is q(8) = L~ 1 <.1. define the norm It I of a vector tERn by  ! It!' = I:~ .1. I i I . 370 Inference in the Multiparameter Case Chapter 6 I iI .(v) are still valid.
In the Gaussian linear model (i) jl is the unique projection oiY on L<. /3 = (ZTZ)l ZT. . ZTZ is nonsingular..n. Proof. .1.1.1.1. Thus.. IY .. That is.2<T' IY . and /3 = (ZTZll ZT 1". note that the space w. .1. fj = (ZTZ)lZTji.1 Inference for Gaussian linear Models 371 E Rn on w is the point Definition 6.Z/3I' : /3 E W}..4. /3. . It follows that /3T (ZT s) = 0 for all/3 E RP. which implies ZT s = 0 for all s E w.p.1. then /3 is identifiable. the MLE of {3 equal~ the least squares estimate (LSE) of {3 defined in Example 2..ti..' and by Theorems 6. The projection Yo = 7l"(Y I L. the MLE = LSE of /3 is unique and given by (6.3 E RP..Section 6.. ZT (Y .12) implies ZT.jil' /(n r) (6.1. To show f3 = (ZTZ)lZTy. any linear combination of U's is a UMVU estimate of its expectation.. ~ The maximum likelihood estimate . fj = arg min{Iy . 0 ~ . by (6.13) (iv) lfp = r.1.1.1 and Section 2.9). i=l.J) of a point y Yo =argmin{lyW: tEw}. equivalently. f3j and Jii are UMVU because.2.Z/3 I' ..3.3 of. and  Jii is the UMVU estimate of J.L = {s ERn: ST(Z/3) = Oforall/3 E RP}.L of vectors s orthogonal to w can be written as ~ 2:7 w. note that 6. j = 1.2(iv) and 6.. To show (iv). = Z/3 and (6. any linear combination ofY's is also a linear combination of U's.14) follows.14) (v) f3j is the UMVU estimate of (3j.' and is given by ji (ii) jl is orthogonal to Y (iii) 8' ~ = zfj (6.3 maximizes 1 log p(y.3(iv). ) 2 or..3 because Ii = l:~ 1 ViUi and Y . spans w. because Z has full rank. We have Theorem 6.1. _.12) ii.1.1.L..ji) = 0 and the second equality in (6.. (i) is clear because Z. = Z T Z/3 and ZTji = ZTZfj and.1.Ii = . (ii) and (iii) are also clear from Theorem r+l VjUj .1.n log(21f<T . <T) = .
(6. In the Gaussian linear model (i) the fitted values Y = iJ. That is. . Taking z = Zi. (iv) ifp = r. Y = il. 1 < i < n. n lie on the regression line fitted to the data {('" y.3) that if J = J nxn is the identity matrix. .'.14) and the normal equations (Z T Z)f3 = Z~Y. There the points Pi = {31 + fhzi. whenp = r.. n}. L7 1 .H)Y. Note that by (6. i . the residuals €. H T 2 =H.1.ill 2 = 1 q. The residuals are the projection of Y on the orthocomplement of wand ·i .15) Next note that the residuals can be written as ~ I . I. In statistics it is also called the hat matrix because it "puts the hat on Y.. . i = 1.({3t + {3.)I are the vertical distances from the points to the fitted line.' € = Y .372 Inference in the Multiparameter Case Chapter 6 Note that in Example 2.Y = (J .4.ii is called the residual from this fit. = [y. and the residual € are independent.3) is to be taken. i = 1. In this method of "prediction" of Yi.1.12) and (6. By Theorem 1.14). € ~ N(o. . The goodness of the fit is measured by the residual sum of squares (RSS) IY . ." As a projection matrix H is necessarily symmetric and idempotent. Example 2.).1.1.. : H = Z(Z T Z)IZT The matrix H is the projection matrix mapping R n into w. Var(€) = (12(J . Suppose we are given a value of the covariate z at which a value Y following the linear model (6.H)).1. I .1. 1 < i < n.1. H I I It follows from this and (B. .. (j ~ N(f3. (12 (ZTz) 1 ). see also Section RIO. . then (6. 1 ~ ~ ! I ~ ~ ~ i .1. moreover.1 we give an alternative derivation of (6.5. The estimate fi = Z{3 of It is called thefitted value and € = y .2 illustrates this tenninology in the context of Example 6.I" (12H). (ii) (iii) y ~ N(J. it is commOn to write Yi for 'j1i.. • . 1 · . ~ ~ ·. We can now conclude the following.1.1. the ith component of the fitted value iJ.. and I. • ~ ~ Y=HY where i• 1 .1.H). the best MSPE predictor of Y if f3 is known as well as z is E(Y) = ZT{3 and its best (UMVU) estimate not knowing f3 is Y = ZT {j. w~ obtain Pi = Zi(3.2. =H .16) J I I CoroUary 6.2 with P = 2. (12(J . .
Var(. in the Gaussian model..Section 6.j = 1.1.1.ii = Y.. Regression (continued).8 and that {3j and f1i are UMVU for {3j and J1. Y..8. then replacement of a subscript by a dot indicates that we are considering the average over that subscript.2..1.81 and i1 = {31 Y.3. where n = nl + . By Theorem 6. 0 Example 6. i = 1...1.p. Thus.2). . The OneWay Layout (continued).8 and (3. 1=1 At this point we introduce an important notational convention in statistics. the treatments. k=I..1. and€"= Y .p. 0 ~ We now return to our examples..1 and Section 2.a of the kth treatment is ~ Pk = Yk.8. (not Y. in general) p k=l L and the UMVU estimate of the incremental effect Ok = {3k . a = {3. . . If {Cijk _.. is th~ UMVU estimate of the average effect of all a 1 p = Yk .1. . nk(3k = LYkl..3). 0 ~ ~ ~ ~ ~ ~ ~ Example 6.1. Here Ji = . o .p. Moreover.Y are given in Corollary 6. . . k = 1. . hence.ii.1).Ill'. k = 1. One Sample (continued). Example 6. In this example the nonnal equations (Z T Z)(3 = ZY become n.2. 'E) is a linear transformation of U and. .1.i respectively.3. . . joint Gaussian. In the Gaussian case. (n .no The variances of . If the design matrix Z has rank P. The independence follows from the identification of j1 and € in tenns of the Ui in the theorem.3. and € are nonnally distributed with Y and € independent. We now see that the MLE of It is il = Z. then the MLE = LSE estimate is j3 = (ZTZr1 ZTy as seen before in Example 2..8) = (J2(Z T Z)1 follows from (B..p.. .5.. + n p and we can write the least squares estimates as {3k=Yk .1.. The error variance (72 = Var(El) can be unbiasedly estimated by 8 2 = (n _p)lIY . which we have seen before in the unbiased estimator 8 2 of (72 is L:~ I (Yi ~  Yf/ Problem 1.1 ~ Inference for Gaussian Linear Models 373 Proof (Y. } is a multipleindexed sequence of numbers or variables.4.
. . In general. . For instance.\(y) = sup{p(Y. The first inferential question is typically "Are the means equal or notT' Thus we test H : 131 = . .1.1. .2. the LADEs are obtained fairly quickly by modem computing methods.374 Inference in the Multiparameter Case Chapter 6 Remark 6. {JL : Jti = 131 + f33zi3.1.17) I. • = 131 + f32Zi2 + f33Zi3...31.1.JL): JL E wo} ~ ! .al.3 Tests and Confidence Intervals .p}.. 1 < .. j where Zi2 is the dose level of the drug given the ith patient. However. The most important hypothesistesting questions in the context of a linear model correspond to restriction of the vector of means JL to a linear subspace of the space w.zT .JL): JL E w} sup{p(Y. we let w correspond to the full model with dimension r and let Wo be a qdimensional linear subspace over which JL can range under the null hypothesis H. under H. in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider. An alternative approach 10 the MLEs for the nonnal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors El. q < r. The LSEs are preferred because of ease of computation and their geometric properties. . (6.4.L are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance L~ 1 IYi . i = 1. Zi3 is the age of the ith patient. For more on LADEs.i 1 6." Now.Li = 13 E R. a regression equation of the form mean response ~ . and the matrix IIziillnx3 with Zil = 1 has rank 3.1. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEssee Stigler (1986). .1) have the Laplace distribution with density . " ..3 with 13k representing the mean resJXlnse for the kth population. whereas for the full model JL is in a pdimensional subspace of Rn. n} is a twodimensional linear subspace of the full model's threedimensional linear subspace of Rn given by (6. 1 and the estimates of f3 and J.1. see Koenker and D'Orey (1987) and Portnoy and Koenker (1997).. Next consider the psample model of Example 1. • . Thus. Now we would test H : 132 = 0 versus K : 132 =I O. j. see Problems 1. which together with a 2 specifies the model. ! j ! I I .En in (6.. = {3p = 13 for some 13 E R versus K: "the f3's are not all equal.1.17). i = 1. . We first consider the a 2 known case and consider the likelihood ratio statistic .. . which is a onedimensional subspace of Rn. I. .7 and 2. in the context of Example 6. . j 1 • 1 .2.6. the mean vector is an element of the space {JL : J. under H.
.1. v~' such that VI. by Theorem 6. i=q+l r = a 21 fL  Ito [2 (6..\(Y) has a X.1. 2 log ..1. 1) distribution with OJ = 'Ida.\(Y) Proof We only need to establish the second equality in (6.1.l9) then.\(Y) = L i=q+l r (ud a )'. (}r.. = IJL i=q+l r JLol'. Note that (uda) has a N(Oi. In this case the distribution of L:~q+1 (Uda? is called a chisquare distribution with r .1.21 ) where fLo is the projection of fL on woo In particular. We write X. respectively. In the Gaussian linear model with 17 2 known..IY . But if we let A nx n be an orthogonal matrix with rows vi. Write 'Ii is as defined in (6. .20) i=q+I 210g.. . .l2).1. 2 log . by Theorem 6. . .'" (}r)T (see Problem B.\(Y) ~ exp {  2t21Y .. V q (6.q( (}2) distribution with 0 2 =a 2 ' \ '1]i L.21).J X. We have shown the following._q' = AJL where A L 'I.1.2(v).1. when H holds. V r span wand set .1.. span Wo and VI.Section 6_1 Inference for Gaussian Linear Models 375 for testing H : fL E Wo versus j{ : JL E W  woo Because (6.\(Y) It follows that = exp 1 2172 L '\' Ui' (6.iio 2 1 } where i1 and flo are the projections of Yon wand wo.3..I8) then..1. o .q degrees offreedom and noncentrality parameter Ej2 = 181 2 = L:=q+ I where 8 = «(}q+l.19). then r.q {(}2) for this distribution. Proposition 6. .iii' .1. .
In the Gaussian linear model the F statistic defined by (6.q)'Ili .2 is known" is replaced by './L.• j .r IY ..1.I}. .itol 2 have a X.q)'{[A{Y)J'/n .'IY _ iii' = L~ r+l (Ui /0)2.r degrees affreedam (see Prohlem B. Proposition 6.1. has the noncentral F distribution F r _q.5) that if we introduce the variance equal likelihood ratio w'" .m{O') for this distrihution where k = r .~I..1.a:. .1.r){r . we obtain o A(Y) = P Y.q variable)/df (central X~T variahle)/df with the numerator and denominator independent. We write :h.376 Inference in the Multiparameter Case Chapter 6 Next consider the case in which a 2 is unknown.r. and 86 into the likelihood ratio statistic.L n where p{y.. T = n .J.n_r(02) where 0 2 = u. .1.itol 2 = L~ q+ 1 (Ui /0) 2. J • T = (noncentral X:.. we can write . it can be shown (Problem 6.. Remark 6. We have seen in PmIXlsition 6.1 suppose the assumption "0.JLo. In Proposition 6.a = ~(Y'E.3. (6.1./to I' ' n J respectively. ]t consists of rejecting H when the fit. IIY . I . We have shown the following.14).2 is the same under Hand K and estimated by the MLE 0:2 for /.iT'): /L E w} . The distribution of such a variable is called the noncentral F distribution with noncentrality parameter 0 2 and r . T has the (central) Frq.. is poor compared to the fit under the general model. statistic :\(y) _ max{p{y. In particular.. which has a X~r distribution and is independent of u.18).1. By the canonicaL representation (6. E In this case.2 1JL .'}' YJ.(Y).22).23) .1 that 021it .1./L.0.wo. {io.'I/L.I. when H I lwlds.22) Because T = {n . as measured by the residual sum of squares under the model specified by H.l) {Ir .21ii./Lol'.liol' (n _ r) 'IY _ iii' .. T is an increasing function of A{Y) and the two test statistics are equivalent.1. The resulting test is intuitive._q(02) distribution with 0' = .L a =  n I' and . 8 2 .') denotes the righthand side of (6.JtoI 2 . = aD IIY . .1. . We know from Problem 6.iT') : /L E wo} (6. T has the representation .1 that the MLEs of a 2 for It E wand It E wQ are . T is called the F statistic for the general linear hypothesis.1.2.q and n .liol' IY r _q IY _ iii' iii' (r .19).2.E Wo for K : Jt E W . Substituting j).q and m = n . /L. which is equivalent to the likelihood ratio statistic for H : Ii.max{p{y. Thus. For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >.nr distribution.
iLl' + liL .24) Remark 6. (6.1. Y3 y Iy . and the Pythagorean identity. It follows that ~ (J2 known case with rT 2 replaced by r.1. r = 1 and T = 1'0 versus K : (3 'f 1'0.iLol' = IY .1. The projections il and ilo of Yon w and Wo.IO.q noncentral X? q 2Iog.(Y) equals the likelihood ratio statistic for the 0'2.iLl' ~++Yl I' I' 1 .1.r)ln central X~_cln where T is the F statistic (6.9. See Pigure 6.3.1. In this = (Y 1'0)2 (nl) lE(Y.\(Y) = .Section 6.1.1.Y)2' which we recognize as t 2 / n.1 and Section B.iLol'.~T = (n . Yl = Y2 Figure 6. We next return to our examples. The canonical representation (6.25) which we exploited in the preceding derivations.2. (6.1. 0 . This is the Pythagorean identity.1'0 I' y.1. Example 6. where t is the onesample Student t statistic of Section 4.19) made it possible to recognize the identity IY .22).1. We test H : (31 case wo = {I'o}. q = 0. One Sample (continued).1 Inference for Gaussian Linear Models ~ 377 then >.
vy.ito 1 are the residual sums of squares under the full model and H.26) and H.1.g. j ! j • litPol2 = LL(Yk _y.g. 0 I j Example 6..27) 1 .dfp) RSSF/dh I 2 where RSSF = IY and RSSH = IY . L~"(Yk' ..ZfZl(ZiZl)'ziz. respectivelY.n_p(fP) distribution with noncentrality parameter (Problem 6.n 378 Inference in the Multiparameter Case Chapter 6 iI II " • Example 6.. economic status) coefficients. I3I) where {32 is a (p .26) 0 versus K : f3 2 i' O. •.{3p are Y1. Under H all the observations have the same mean so that..2. .1. treatment) effect coefficients and f3 1 is a q x 1 vector of "nuisance" (e.q)1f3nZfZ2 .22) we obtain the F statistic for the hypothesis H in the oneway layout . We consider the possibility that a subset of p .1 L~=. Recall that the least squares estimates of {31.. P .)2= Lnk(Yk. respectively. .q covariates in multiple regression have an effect after fitting the first q.Q)f3f(ZfZ2)f32. . To formulate this question. In this case f3 (ZrZ)lZry and f3 0 = (Z[Ztl1Z[Y are the ML& under the full model (6. However.RSSF)/(dflf . .1. 7) iW l.q). Z2) where Z1 is n x q and Z2 is 11 X (p .np distribution. Under the alternative F has a noncentral Fp_q. in general 02 depends on the sample correlations between the variables in Zl and those in Z2' This issue is discussed further in Example 62.y)2 k .. 02 simplifies to (J2(p .. Now the linear model can be written as i i I I ! We test H : f32 (6. In the special case that ZrZ2 = 0 so the variables in Zl are orthogonal to the variables in Z2..1. The One. Without loss of generality we ask whether the last p ..Yk. which only depends on the second set of variables and coefficients.}f32. 1 02 = (J2(p . • 1 F = (RSSlf .P .1.3.1.=11=1 k=1 •• Substituting in (6...1. _y.22) we can write the F statistic version of the likelihood ratio test in the intuitive fonn =   i . Using (6.. .Way Layout (continued).1. Regression (continued). = {3p.' • ito = Thus. As we indicated earlier. we partition the design matrix Z by writing it as Z = (ZI. '1 Yp.1. P nk (Y.. we want to test H : {31 = . and we partition {3 as f3T = (f3f. The F test rejects H if F is large when compared to the ath quantile of the Fpq. I • ! . I • I b . (6.q covariates does not affect the mean response..  I 1.q) x 1 vector of main (e. .)2' .: I :I T _ n ~ p L~l nk(Y . . anddfF = np and dfH = nq are the corresponding degrees of freedom. age..)2.
Note that this implies the possibly unrealistic . The sum of squares in the numerator. which is their ratio. (6. compute a. .. SST.fJ.. Because 88B/0" and SSw / a' are independent X' variables with (p . the between groups (or treatment) sum of squares and SSw.30) can also be viewed stochastically.. and the F statistic. k=1 1=1 which measures the variability of the pooled samples. (3" . the unbiased estimates of 02 and a 2 .···.p) of the (n 1) degrees offreedom of SST /0" as "cooling" from S8w/a'... we see that the decomposition (6. See Tables 6. then by the Pythagorean identity (6. identifying 0' and (p . Ypl .25) 88T = 888 + 88w .1. . consider the following data(I) giving blood cholesterol levels of men in three different socioeconomic groups labeled I. are not all equal. respectively. Ypnp' The 88w ~ L L(Y Y p "' k'  k )'.1 Inference for Gaussian Linear Models 379 When H holds. . are often summarized in what is known as an analysis a/variance (ANOVA) table. we have a decomposition of the variability of the whole set of data. To derive IF. [f we define the total sum of squares as 88T =L l' L(Y nk k'  Y..fLo 12 for the vector Jt (rh. .)'..Y.np distribution. . k=l 1=1 measures variation within the samples. n..1....)' . As an illustration. (3p)T and its projection 1'0 = (ij.1) and (n .p).29) Thus. 888 =L k=I P nk(Yk. .Section 6. T has a Fpl.1) degrees of freedom and noncentrality parameter 0'. .1. This information as well as S8B/(P .1 and 6. the within groups (or residual) sum of squares. .3. (3p.1. ijjT = There is an interesting way of looking at the pieces of infonnation summarized by the F statistic. . (31. is a measure of variation between the p samples YII . Yi nl . We assume the oneway layout is valid.2 1M ..1) degrees offreedom as "coming" from SS8/a' and the remaining (n .1) and 8Sw /(n ..np distribution with noncentrality parameter (6.28) where j3 = n I :2:=~"" 1 nifJi. . and III with I being the "high" end. .1.p) degrees of freedom. the total sum of squares. . . sum of squares in the denominator. SST/a 2 is a (noncentral) X2 variable with (n . into two constituent components. If the fJ. T has a noncentral Fp1..1. SSB..
j:
TABLE 6.1.1. ANOVA table for the one-way layout

  Between samples:  SS_B = \sum_{k=1}^p n_k (\bar{Y}_{k.} - \bar{Y}_{..})^2,   d.f. = p - 1,   MS_B = SS_B/(p - 1),   F-value = MS_B/MS_W
  Within samples:   SS_W = \sum_{k=1}^p \sum_{l=1}^{n_k} (Y_{kl} - \bar{Y}_{k.})^2,   d.f. = n - p,   MS_W = SS_W/(n - p)
  Total:            SS_T = \sum_{k=1}^p \sum_{l=1}^{n_k} (Y_{kl} - \bar{Y}_{..})^2,   d.f. = n - 1
TABLE 6.1.2. Blood cholesterol levels

     I     II    III
    403    312    403
    311    222    244
    269    302    353
    336    420    235
    259    420    319
           386    260
           353
           210
           286
           290
assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol levels of the three groups. Here p = 3, n_1 = 5, n_2 = 10, n_3 = 6, n = 21, and we compute
TABLE 6.1.3. ANOVA table for the cholesterol data

                        SS      d.f.      MS      F-value
  Between groups     1202.5       2      601.2     0.126
  Within groups    85,750.5      18     4763.9
  Total            86,953.0      20
From F tables, we find that the p-value corresponding to the F-value 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups. □

Remark 6.1.4. Decompositions such as (6.1.29) of the response total sum of squares SS_T into a variety of sums of squares measuring variability in the observations corresponding to variation of covariates are referred to as analysis of variance. They can be formulated in any linear model including regression models. See Scheffé (1959, pp. 42-45) and Weisberg (1985, p. 48). Originally such decompositions were used to motivate F statistics and to establish the distribution theory of the components via a device known as Cochran's theorem (Graybill, 1961, p. 86). Their principal use now is in the motivation of the convenient summaries of information we call ANOVA tables.
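The arithmetic behind Tables 6.1.1 and 6.1.3 is easily automated. The following minimal sketch, in Python with numpy and scipy.stats (assumptions of this illustration, not part of the text), uses the group assignment read from Table 6.1.2 as reconstructed above; it reproduces the sums of squares, the F-value 0.126, and the p-value 0.88 up to rounding. The same F and p-value can also be obtained directly from scipy.stats.f_oneway.

    import numpy as np
    from scipy.stats import f as f_dist

    # Blood cholesterol levels by socioeconomic group (Table 6.1.2).
    groups = [
        np.array([403, 311, 269, 336, 259]),                           # I
        np.array([312, 222, 302, 420, 420, 386, 353, 210, 286, 290]),  # II
        np.array([403, 244, 353, 235, 319, 260]),                      # III
    ]
    p = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()

    # Between- and within-group sums of squares as in Table 6.1.1.
    SSB = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    SSW = sum(((g - g.mean()) ** 2).sum() for g in groups)
    MSB, MSW = SSB / (p - 1), SSW / (n - p)
    F = MSB / MSW
    p_value = f_dist.sf(F, p - 1, n - p)  # upper-tail probability of F(p-1, n-p)

    # Approximately 1202, 85751, 0.126, 0.88 -- Table 6.1.3 up to rounding.
    print(round(SSB, 1), round(SSW, 1), round(F, 3), round(p_value, 2))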
,
,I "
,
,
,
Confidence Intervals and Regions

We next use our distributional results and the method of pivots to find confidence intervals for \mu_i, 1 \le i \le n, \beta_j, 1 \le j \le p, and, in general, any linear combination

    \psi = \psi(\mu) = \sum_{i=1}^n a_i \mu_i = a^T \mu

of the \mu's. If we set \hat\psi = \sum_{i=1}^n a_i \hat\mu_i = a^T \hat\mu and

    \sigma^2(\hat\psi) = Var(\hat\psi) = \sigma^2 a^T H a,

where H is the hat matrix, then (\hat\psi - \psi)/\sigma(\hat\psi) has a N(0, 1) distribution. Moreover,
    (n - r) s^2/\sigma^2 = |Y - \hat\mu|^2/\sigma^2 = \sum_{i=r+1}^n (U_i/\sigma)^2

has a \chi^2_{n-r} distribution and is independent of \hat\psi. Let

    \hat\sigma(\hat\psi) = s (a^T H a)^{1/2}

be an estimate of the standard deviation \sigma(\hat\psi) of \hat\psi. This estimated standard deviation is called the standard error of \hat\psi. By referring to the definition of the t distribution, we find that the pivot

    T(\psi) = (\hat\psi - \psi)/\hat\sigma(\hat\psi)

has a T_{n-r} distribution. Let t_{n-r}(1 - \alpha/2) denote the 1 - \alpha/2 quantile of the T_{n-r} distribution; then by solving |T(\psi)| \le t_{n-r}(1 - \alpha/2) for \psi, we find that

    \psi = \hat\psi \pm t_{n-r}(1 - \alpha/2) \hat\sigma(\hat\psi)

is, in the Gaussian linear model, a 100(1 - \alpha)% confidence interval for \psi.

Example 6.1.1. One Sample (continued). Consider \psi = \mu. We obtain the interval
    \mu = \bar{Y} \pm t_{n-1}(1 - \alpha/2) s/\sqrt{n},

which is the same as the interval of Example 4.4.1 and Section 4.9.2. □

Example 6.1.2. Regression (continued). Assume that p = r. First consider \psi = \beta_j for some specified regression coefficient \beta_j. The 100(1 - \alpha)% confidence interval for \beta_j is

    \beta_j = \hat\beta_j \pm t_{n-p}(1 - \alpha/2) s \{[(Z^T Z)^{-1}]_{jj}\}^{1/2}
where [(Z^T Z)^{-1}]_{jj} is the jth diagonal element of (Z^T Z)^{-1}. Computer software computes (Z^T Z)^{-1} and labels s\{[(Z^T Z)^{-1}]_{jj}\}^{1/2} as the standard error of the (estimated) jth regression coefficient. Next consider \psi = \mu_i = mean response for the ith case, 1 \le i \le n. The level (1 - \alpha) confidence interval is

    \mu_i = \hat\mu_i \pm t_{n-p}(1 - \alpha/2) s \sqrt{h_{ii}}
where hii is the ith diagonal element of the hat matrix H. Here sjh;; is called the standard error of the (estimated) mean of the ith case. Next consider the special case in which p = 2 and
Yi = 131 + 132Zi2 + Eil i = 1 ", n.
"
If we use the identity
n
~)Zi2  Z2)(l'i  Y)
i=1
~)Zi2  Z.2)l'i,
We obtain from Example 2.2.2 that
\hat{\beta}_2 = \frac{\sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2}) Y_i}{\sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2})^2}.   (6.1.30)

Because Var(Y_i) = \sigma^2, we obtain

Var(\hat{\beta}_2) = \sigma^2 / \sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2})^2,
and the 100(1 - \alpha)% confidence interval for \beta_2 has the form

\beta_2 = \hat{\beta}_2 \pm t_{n-p}(1 - \tfrac{1}{2}\alpha)\, s / \left[\sum (z_{i2} - \bar{z}_{\cdot 2})^2\right]^{1/2}.

The confidence interval for \beta_1 is given in Problem 6.1.10.
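A quick Monte Carlo check of the variance formula just derived, Var(\hat{\beta}_2) = \sigma^2/\sum(z_{i2} - \bar{z}_{\cdot 2})^2; the covariate values and parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([0., 1., 2., 3., 4., 5.])        # fixed covariate values (illustrative)
beta1, beta2, sigma = 1.0, 0.5, 2.0
zc = z - z.mean()

b2 = []
for _ in range(20000):
    Y = beta1 + beta2 * z + rng.normal(0, sigma, z.size)
    b2.append(np.sum(zc * Y) / np.sum(zc ** 2))   # beta2_hat from (6.1.30)

print(np.var(b2), sigma ** 2 / np.sum(zc ** 2))   # both close to 4/17.5 = 0.229
```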
Similarly, in the p = 2 case, it is straightforward (Problem 6.1.10) to compute

h_{ii} = \frac{1}{n} + \frac{(z_{i2} - \bar{z}_{\cdot 2})^2}{\sum_{i=1}^n (z_{i2} - \bar{z}_{\cdot 2})^2}

and the confidence interval for the mean response \mu_i of the ith case has a simple explicit form.
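The closed form for h_{ii} can be verified against the diagonal of H = Z(Z^T Z)^{-1}Z^T; the covariate values below are illustrative:

```python
import numpy as np

z2 = np.array([0., 1., 2., 3., 4., 5.])             # illustrative covariate
n = z2.size
Z = np.c_[np.ones(n), z2]                            # p = 2 design: intercept and z2
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T

h_formula = 1.0 / n + (z2 - z2.mean()) ** 2 / np.sum((z2 - z2.mean()) ** 2)
print(np.allclose(np.diag(H), h_formula))            # True
```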
Example 6.1.3. One-Way Layout (continued). We consider \psi = \beta_k, 1 \le k \le p. Because \hat{\beta}_k = \bar{Y}_{k\cdot} \sim N(\beta_k, \sigma^2/n_k), we find the 100(1 - \alpha)% confidence interval

\beta_k = \bar{Y}_{k\cdot} \pm t_{n-p}(1 - \tfrac{1}{2}\alpha)\, s/\sqrt{n_k}

where s^2 = SS_W/(n - p). The intervals for \mu = \bar{\beta}_{\cdot} and the incremental effect \delta_k = \beta_k - \mu are given in Problem 6.1.11. □
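The interval for \beta_k requires only the group mean, the group size, and s^2 = SS_W/(n - p). A sketch using the sums of squares of Table 6.1.3; the group means below are hypothetical placeholders since they are not reported in the table:

```python
import numpy as np
from scipy.stats import t

# From Table 6.1.3: SS_W = 85750.5 with n - p = 18; group sizes from the example
ss_w, n_minus_p = 85750.5, 18
nk = np.array([5, 10, 6])
ybar = np.array([280.0, 310.0, 295.0])     # hypothetical group means (not from the text)

s = np.sqrt(ss_w / n_minus_p)
tq = t.ppf(0.975, n_minus_p)
half = tq * s / np.sqrt(nk)
print(np.c_[ybar - half, ybar + half])     # 95% intervals for beta_k = mu_k
```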
Joint Confidence Regions
We have seen how to find confidence intervals for each individual \beta_j, 1 \le j \le p. We next consider the problem of finding a confidence region C in R^p that covers the vector \beta with prescribed probability (1 - \alpha). This can be done by inverting the likelihood ratio test or equivalently the F test. That is, we let C be the collection of \beta_0 that is accepted when the level (1 - \alpha) F test is used to test H : \beta = \beta_0. Under H, \mu = \mu_0 = Z\beta_0, and the numerator of the F statistic (6.1.22) is based on
|\hat{\mu} - \mu_0|^2 = |Z\hat{\beta} - Z\beta_0|^2 = (\hat{\beta} - \beta_0)^T (Z^T Z)(\hat{\beta} - \beta_0).

Thus, using (6.1.22), the simultaneous confidence region for \beta is the ellipse

C = \left\{\beta_0 : \frac{(\hat{\beta} - \beta_0)^T (Z^T Z)(\hat{\beta} - \beta_0)}{r s^2} \le f_{r,n-r}(1 - \alpha)\right\}   (6.1.31)

where f_{r,n-r}(1 - \alpha) is the 1 - \alpha quantile of the \mathcal{F}_{r,n-r} distribution.
Example 6.1.2. Regression (continued). We consider the case p = r and as in (6.1.26) write Z = (Z_1, Z_2) and \beta^T = (\beta_1^T, \beta_2^T), where \beta_2 is a vector of main effect coefficients and \beta_1 is a vector of "nuisance" coefficients. Similarly, we partition \hat{\beta} as \hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2) where \hat{\beta}_1 is q \times 1 and \hat{\beta}_2 is (p - q) \times 1. By Corollary 6.1.1, \sigma^2(Z^T Z)^{-1} is the variance-covariance matrix of \hat{\beta}. It follows that if we let S denote the lower right (p - q) \times (p - q) corner of (Z^T Z)^{-1}, then \sigma^2 S is the variance-covariance matrix of \hat{\beta}_2. Thus, a joint 100(1 - \alpha)% confidence region for \beta_2 is the (p - q)-dimensional ellipse
C = \left\{\beta_{02} : \frac{(\hat{\beta}_2 - \beta_{02})^T S^{-1} (\hat{\beta}_2 - \beta_{02})}{(p - q) s^2} \le f_{p-q,n-p}(1 - \alpha)\right\}. □
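Checking whether a candidate \beta_0 lies in the ellipse (6.1.31) is a single quadratic-form comparison with an \mathcal{F} quantile. A sketch for the full coefficient vector, on simulated (illustrative) data:

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
n, r = 40, 3
Z = np.c_[np.ones(n), rng.normal(size=(n, 2))]     # rank-r design (illustrative)
beta_true = np.array([1.0, -0.5, 2.0])
Y = Z @ beta_true + rng.normal(0, 1.5, n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
s2 = np.sum((Y - Z @ beta_hat) ** 2) / (n - r)

def in_ellipse(beta0, alpha=0.05):
    # (beta_hat - beta0)' Z'Z (beta_hat - beta0) / (r s^2) <= f_{r, n-r}(1 - alpha)
    d = beta_hat - beta0
    stat = d @ (Z.T @ Z) @ d / (r * s2)
    return stat <= f.ppf(1 - alpha, r, n - r)

print(in_ellipse(beta_true))   # covers the true beta with probability about 0.95
```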
Summary. We consider the classical Gaussian linear model in which the response Y_i of the ith case in an experiment is expressed as a linear combination \mu_i = \sum_{j=1}^p \beta_j z_{ij} of covariates plus an error \epsilon_i, where \epsilon_1, \ldots, \epsilon_n are i.i.d. N(0, \sigma^2). By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients \{\beta_j\}, the response means \{\mu_i\}, and linear combinations of these.
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for onedimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
6.2.1
Estimating Equations
Our assumptions are as before save that everything is made a vector: X_1, \ldots, X_n are i.i.d. P where P \in \mathcal{Q}, a model containing \mathcal{P} = \{P_\theta : \theta \in \Theta\} such that

(i) \Theta open \subset R^p.

(ii) Densities of P_\theta are p(\cdot, \theta), \theta \in \Theta.
The following result gives the general asymptotic behavior of the solution of estimating equations.
A0. \Psi = (\psi_1, \ldots, \psi_p)^T, where \psi_j = \partial\rho/\partial\theta_j, is well defined and

\frac{1}{n}\sum_{i=1}^n \Psi(X_i, \hat{\theta}_n) = 0.   (6.2.1)
A solution to (6.2.1) is called an estimating equation estimate or an M estimate.
A1. The parameter \theta(P) given by the solution of (the nonlinear system of p equations in p unknowns)

\int \Psi(x, \theta)\, dP(x) = 0   (6.2.2)

is well defined on \mathcal{Q} so that \theta(P) is the unique solution of (6.2.2). Necessarily \theta(P_\theta) = \theta because \mathcal{Q} \supset \mathcal{P}.

A2. E_P|\Psi(X_1, \theta(P))|^2 < \infty where |\cdot| is the Euclidean norm.
A3. \psi_i(\cdot, \theta), 1 \le i \le p, have first-order partials with respect to all coordinates and, using the notation of Section B.8,

E_P D\Psi(X_1, \theta(P)), where D\Psi(x, \theta) \equiv \left\|\frac{\partial\psi_i}{\partial\theta_j}(x, \theta)\right\|_{p\times p},

is nonsingular.
A4. \sup\left\{\left|\frac{1}{n}\sum_{i=1}^n \left(D\Psi(X_i, t) - D\Psi(X_i, \theta(P))\right)\right| : |t - \theta(P)| \le \epsilon_n\right\} \xrightarrow{P} 0 if \epsilon_n \to 0.
A5. \hat{\theta}_n \xrightarrow{P} \theta(P) for all P \in \mathcal{Q}.
Theorem 6.2.1. Under A0-A5 of this section,

\hat{\theta}_n = \theta(P) + \frac{1}{n}\sum_{i=1}^n \bar{\Psi}(X_i, \theta(P)) + o_P(n^{-1/2})   (6.2.3)

where

\bar{\Psi}(x, \theta(P)) = -[E_P D\Psi(X_1, \theta(P))]^{-1} \Psi(x, \theta(P)).   (6.2.4)
Hence,
\mathcal{L}_P(\sqrt{n}(\hat{\theta}_n - \theta(P))) \to N(0, \Sigma(\Psi, P))   (6.2.5)

where

\Sigma(\Psi, P) = J(\theta, P)\, E_P \Psi\Psi^T(X_1, \theta(P))\, J^T(\theta, P)   (6.2.6)

and

J^{-1}(\theta, P) = E_P D\Psi(X_1, \theta(P)) = E_P \left\|\frac{\partial\psi_i}{\partial\theta_j}(X_1, \theta(P))\right\|.
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
-\frac{1}{n}\sum_{i=1}^n \Psi(X_i, \theta(P)) = \left[\frac{1}{n}\sum_{i=1}^n D\Psi(X_i, \theta_n^*)\right](\hat{\theta}_n - \theta(P)).   (6.2.7)
Note that the left-hand side of (6.2.7) is a p \times 1 vector, the right is the product of a p \times p matrix and a p \times 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular p \times p matrices, when viewed as vectors, is an open subset of R^{p^2}, representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that with probability tending to 1, \frac{1}{n}\sum_{i=1}^n D\Psi(X_i, \theta_n^*) is nonsingular.

Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of \hat{\theta}_n is motivated by \mathcal{P}, the behavior in (6.2.3) is guaranteed for P \in \mathcal{Q}, which can include P \notin \mathcal{P}. In fact, typically \mathcal{Q} is essentially the set of P's for which \theta(P) can be defined uniquely by (6.2.2). We can again extend the assumptions of Section 5.4.2 to:

A6. If l(\cdot, \theta) is differentiable,
E_\theta D\Psi(X_1, \theta) = -E_\theta \Psi(X_1, \theta)\, D^T l(X_1, \theta) = -\mathrm{Cov}_\theta(\Psi(X_1, \theta), Dl(X_1, \theta))   (6.2.8)
defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A6' extend to the multivariate case readily. Note that consistency of \hat{\theta}_n is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate \theta_n^0 will find a solution \hat{\theta}_n of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
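The sandwich formula (6.2.5)-(6.2.6) can be checked by simulation for a simple smooth estimating function. The sketch below uses \Psi(x, \theta) = \tanh(x - \theta) as a location estimating function for Gaussian data; the choice of \Psi, the distribution, and the sample sizes are illustrative assumptions, not part of the text:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.integrate import quad
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, n, B = 0.7, 200, 2000            # true location, sample size, replications
psi = lambda x: np.tanh(x)           # smooth estimating function (illustrative)

est = []
for _ in range(B):
    x = rng.normal(mu, 1.0, n)
    est.append(brentq(lambda th: np.mean(psi(x - th)), mu - 5, mu + 5))
emp_var = n * np.var(est)

# Sandwich variance: E[psi^2] / (E[psi'])^2 evaluated at theta(P) = mu
Epsi2 = quad(lambda x: np.tanh(x - mu) ** 2 * norm.pdf(x, mu, 1), -10, 10)[0]
Edpsi = quad(lambda x: -1.0 / np.cosh(x - mu) ** 2 * norm.pdf(x, mu, 1), -10, 10)[0]
print(emp_var, Epsi2 / Edpsi ** 2)   # close for moderate n
```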
6.2.2 Asymptotic Normality and Efficiency of the MLE

If we take \rho(x, \theta) = -l(x, \theta) = -\log p(x, \theta), and \Psi(x, \theta) = Dl(x, \theta) obeys A0-A6, then (6.2.8) becomes

-E_\theta D^2 l(X_1, \theta) = E_\theta Dl(X_1, \theta)\, D^T l(X_1, \theta) = \mathrm{Var}_\theta\, Dl(X_1, \theta)   (6.2.9)

where the common value of the two sides is the Fisher information matrix I(\theta) introduced in Section 3.4. If \rho: \Theta \to R, \Theta \subset R^d, is a scalar function, the matrix \left\|\frac{\partial^2 \rho}{\partial\theta_i \partial\theta_j}(\theta)\right\| is known as the Hessian or curvature matrix of the surface \rho. Thus, (6.2.9) states that the expected value of the Hessian of l is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
Theorem 6.2.2. If A0-A6 hold for \rho(x, \theta) = -\log p(x, \theta), then the MLE \hat{\theta}_n satisfies

\hat{\theta}_n = \theta + \frac{1}{n}\sum_{i=1}^n I^{-1}(\theta)\, Dl(X_i, \theta) + o_P(n^{-1/2})   (6.2.10)

so that

\mathcal{L}(\sqrt{n}(\hat{\theta}_n - \theta)) \to N(0, I^{-1}(\theta)).   (6.2.11)

If \tilde{\theta}_n is a minimum contrast estimate with \rho and \psi satisfying A0-A6 and corresponding asymptotic variance matrix \Sigma(\Psi, P_\theta), then

\Sigma(\Psi, P_\theta) \ge I^{-1}(\theta)   (6.2.12)

in the sense of Theorem 3.4.4 with equality in (6.2.12) for \theta = \theta_0 iff, under \theta_0,

\tilde{\theta}_n = \hat{\theta}_n + o_P(n^{-1/2}).   (6.2.13)
Proof. The proofs of (6.2.10) and (6.2.11) parallel those of (5.4.33) and (5.4.34) exactly. The proof of (6.2.12) parallels that of Theorem 3.4.4. For completeness we give it. Note that by (6.2.6) and (6.2.8)
\Sigma(\Psi, P_\theta) = \mathrm{Cov}_\theta^{-1}(U, V)\, \mathrm{Var}_\theta(U)\, \mathrm{Cov}_\theta^{-1}(V, U)   (6.2.14)

where U \equiv \Psi(X_1, \theta), V \equiv Dl(X_1, \theta). But by (B.10.8), for any U, V with \mathrm{Var}(U^T, V^T)^T nonsingular,

\mathrm{Var}(V) \ge \mathrm{Cov}(V, U)\, \mathrm{Var}^{-1}(U)\, \mathrm{Cov}(U, V).   (6.2.15)

Taking inverses of both sides yields

I^{-1}(\theta) = \mathrm{Var}_\theta^{-1}(V) \le \Sigma(\Psi, \theta).   (6.2.16)
Equality holds in (6.2.15) by (B.10.2.3) iff for some b = b(\theta),

U = b + \mathrm{Cov}(U, V)\, \mathrm{Var}^{-1}(V)\, V   (6.2.17)

with probability 1. This means in view of E_\theta\Psi = E_\theta Dl = 0 that

\Psi(X_1, \theta) = b(\theta)\, Dl(X_1, \theta).

In the case of identity in (6.2.16) we must have

[E_\theta D\Psi(X_1, \theta)]^{-1}\Psi(X_1, \theta) = I^{-1}(\theta)\, Dl(X_1, \theta).   (6.2.18)
Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds. □
We see that, by the theorem, the MLE is efficient in the sense that for any a_{p\times 1}, a^T\hat{\theta}_n has asymptotic bias o(n^{-1/2}) and asymptotic variance n^{-1} a^T I^{-1}(\theta) a, which is no larger than that of any competing minimum contrast estimate. Further, any competitor \tilde{\theta}_n such that a^T\tilde{\theta}_n has the same asymptotic behavior as a^T\hat{\theta}_n for all a in fact agrees with \hat{\theta}_n to order n^{-1/2}.
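The information inequality (6.2.12) can be seen numerically in the Gaussian location model: the MLE \bar{X} attains the bound I^{-1}(\theta) = \sigma^2, while the sample median, an M-estimate based on a different \rho, has asymptotic variance \pi\sigma^2/2. A simulation sketch (sample sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, B, sigma = 400, 5000, 1.0
means, medians = [], []
for _ in range(B):
    x = rng.normal(0.0, sigma, n)
    means.append(x.mean())
    medians.append(np.median(x))

print(n * np.var(means))     # ~ sigma^2 = 1.0, the bound I^{-1}(theta)
print(n * np.var(medians))   # ~ pi * sigma^2 / 2 ~ 1.57, strictly larger
```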
A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6
on the asymptotic normality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example.

Example 6.2.1. The Linear Model with Stochastic Covariates. Let X_i = (Z_i^T, Y_i)^T, 1 \le i \le n, be i.i.d. as X = (Z^T, Y)^T where Z is a p \times 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:
(i)

Y = \alpha + Z^T\beta + \epsilon   (6.2.19)

where \epsilon is distributed as N(0, \sigma^2) independent of Z and E(Z) = 0. That is, given Z, Y has a N(\alpha + Z^T\beta, \sigma^2) distribution.

(ii) The distribution H_0 of Z is known with density h_0 and E(ZZ^T) is nonsingular.
The second assumption is unreasonable but easily dispensed with. It readily follows (Problem 6.2.6) that the MLE of \beta is given by (with probability 1)

\hat{\beta} = [Z_{(n)}^T Z_{(n)}]^{-1} Z_{(n)}^T Y.   (6.2.20)
Here Z_{(n)} is the n \times p matrix \|Z_{ij} - \bar{Z}_{\cdot j}\| where \bar{Z}_{\cdot j} = n^{-1}\sum_{i=1}^n Z_{ij}. We used subscripts (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Z_{(n)} = (Z_1, \ldots, Z_n)^T is referred to as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of \alpha and \sigma^2 are

\hat{\alpha} = \bar{Y} - \sum_{j=1}^p \bar{Z}_{\cdot j}\hat{\beta}_j, \qquad \hat{\sigma}^2 = \frac{1}{n}\,|Y - (\hat{\alpha} + Z_{(n)}\hat{\beta})|^2.   (6.2.21)
Note that although given Z_1, \ldots, Z_n, \hat{\beta} is Gaussian, this is not true of the marginal distribution of \hat{\beta}. It is not hard to show that A0-A6 hold in this case because if H_0 has density h_0 and if \theta denotes (\alpha, \beta^T, \sigma^2)^T, then

l(X, \theta) = -\frac{1}{2\sigma^2}[Y - (\alpha + Z^T\beta)]^2 - \frac{1}{2}(\log\sigma^2 + \log 2\pi) + \log h_0(Z),   (6.2.22)

Dl(X, \theta) = \left(\frac{\epsilon}{\sigma^2},\ \frac{\epsilon Z^T}{\sigma^2},\ \frac{1}{2\sigma^4}(\epsilon^2 - \sigma^2)\right)^T, \quad \epsilon = Y - (\alpha + Z^T\beta),

and

I(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 & 0 \\ 0 & \frac{1}{\sigma^2}E(ZZ^T) & 0 \\ 0 & 0 & \frac{1}{2\sigma^4} \end{pmatrix}   (6.2.23)
so that by Theorem 6.2.2
\mathcal{L}(\sqrt{n}(\hat{\alpha} - \alpha, \hat{\beta} - \beta, \hat{\sigma}^2 - \sigma^2)) \to N(0, \mathrm{diag}(\sigma^2, \sigma^2[E(ZZ^T)]^{-1}, 2\sigma^4)).   (6.2.24)
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of H_0 known plays no role in the limiting result for \hat{\alpha}, \hat{\beta}, \hat{\sigma}^2. Of course, these will only be the MLEs if H_0 depends only on parameters other than (\alpha, \beta, \sigma^2). In this case we can estimate E(ZZ^T) by \frac{1}{n}\sum_{i=1}^n Z_i Z_i^T and give approximate confidence intervals for \beta_j, j = 1, \ldots, p. An interesting feature of (6.2.23) is that because I(\theta) is a block diagonal matrix so is I^{-1}(\theta) and, consequently, \hat{\beta} and \hat{\sigma}^2 are asymptotically independent. In the classical linear model of Section 6.1 where we perform inference conditionally given Z_i = z_i, 1 \le i \le n, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew \sigma^2, the MLE of \beta would still be \hat{\beta} and its asymptotic variance optimal for this model. If we knew \alpha and \beta, \hat{\sigma}^2 would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, \hat{\sigma}^2 would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known.
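The asymptotic independence of \hat{\beta} and \hat{\sigma}^2 implied by the block diagonal form of (6.2.23) can be checked by simulation in the p = 1 random design case; distributions and sample sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, B = 300, 3000
alpha, beta, sigma = 1.0, 2.0, 1.5
b_hat, s2_hat = [], []
for _ in range(B):
    Z = rng.normal(0.0, 1.0, n)               # E(Z) = 0, E(Z^2) = 1 (illustrative H_0)
    Y = alpha + beta * Z + rng.normal(0, sigma, n)
    Zc = Z - Z.mean()
    b = np.sum(Zc * Y) / np.sum(Zc ** 2)
    a = Y.mean() - Z.mean() * b
    b_hat.append(b)
    s2_hat.append(np.mean((Y - a - Z * b) ** 2))

print(np.corrcoef(b_hat, s2_hat)[0, 1])       # near 0: asymptotic independence
print(n * np.var(b_hat), sigma ** 2)          # ~ sigma^2 [E(Z^2)]^{-1} = sigma^2
```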
Formally, in a model \mathcal{P} = \{P_{(\theta,\eta)} : \theta \in \Theta, \eta \in \mathcal{E}\} we say we can estimate \theta adaptively at \eta_0 if the asymptotic variance of the MLE \hat{\theta} (or more generally, an efficient estimate of \theta) in the pair (\hat{\theta}, \hat{\eta}) is the same as that of \hat{\theta}(\eta_0), the efficient estimate for \mathcal{P}_{\eta_0} = \{P_{(\theta,\eta_0)} : \theta \in \Theta\}. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular consider estimating \beta_1 in the presence of \alpha, (\beta_2, \ldots, \beta_p) with
(i) \alpha, \beta_2, \ldots, \beta_p known.

(ii) \beta arbitrary.

In case (i), we take, without loss of generality, \alpha = \beta_2 = \cdots = \beta_p = 0. Let Z_i = (Z_{i1}, \ldots, Z_{ip})^T; then the efficient estimate in case (i) is

\hat{\beta}_1 = \frac{\sum_{i=1}^n Z_{i1} Y_i}{\sum_{i=1}^n Z_{i1}^2}   (6.2.25)
with asymptotic variance \sigma^2[E Z_{11}^2]^{-1}. On the other hand, \hat{\beta}_1 is the first coordinate of \hat{\beta} given by (6.2.20). Its asymptotic variance is the (1,1) element of \sigma^2[E(ZZ^T)]^{-1}, which is strictly bigger than \sigma^2[E Z_{11}^2]^{-1} unless [E(ZZ^T)]^{-1} is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate \beta_1 adaptively if \beta_2, \ldots, \beta_p are regarded as nuisance parameters. What is happening can be seen by a representation of [Z_{(n)}^T Z_{(n)}]^{-1} Z_{(n)}^T Y and I^{11}(\theta), where I^{-1}(\theta) = \|I^{ij}(\theta)\|. We claim that

\hat{\beta}_1 = \frac{\sum_{i=1}^n (Z_{i1} - \hat{Z}_{i1}^{(1)})\, Y_i}{\sum_{i=1}^n (Z_{i1} - \hat{Z}_{i1}^{(1)})^2}   (6.2.26)
where \hat{Z}^{(1)} is the regression of (Z_{11}, \ldots, Z_{1n})^T on the linear space spanned by (Z_{j1}, \ldots, Z_{jn})^T, 2 \le j \le p. Similarly,
I^{11}(\theta) = \sigma^2 / E(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2   (6.2.27)
where \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}) is the projection of Z_{11} on the linear span of Z_{21}, \ldots, Z_{p1} (Problem 6.2.11). Thus, \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}) = \sum_{j=2}^p a_j^* Z_{j1} where (a_2^*, \ldots, a_p^*) minimizes E(Z_{11} - \sum_{j=2}^p a_j Z_{j1})^2 over (a_2, \ldots, a_p) \in R^{p-1} (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing \beta_2, \ldots, \beta_p when the variables Z_2, \ldots, Z_p are in any way correlated with Z_1 and the price is measured by
\frac{1}{E(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2} = \frac{1}{E(Z_{11}^2)}\left(1 - \frac{E(\Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2}{E(Z_{11}^2)}\right)^{-1}.   (6.2.28)

In the extreme case of perfect collinearity the price is \infty as it should be because \beta_1 then becomes unidentifiable. Thus, adaptation corresponds to the case where (Z_2, \ldots, Z_p) have no value in predicting Z_1 linearly (see Section 1.4). Correspondingly in the Gaussian linear model (6.1.3) conditional on the Z_i, i = 1, \ldots, n, \hat{\beta}_1 is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if E(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2 = 0. □
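The price (6.2.28) can be illustrated by simulation: when Z_1 and Z_2 are correlated, the variance of \hat{\beta}_1 in the full model exceeds that of the estimate computed with the nuisance coefficient known by the factor (1 - \rho^2)^{-1}. A sketch (no intercept, unit-variance covariates; all values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, B, sigma, rho = 200, 4000, 1.0, 0.8        # rho = correlation of Z1 and Z2
b1_known, b1_full = [], []
for _ in range(B):
    Z1 = rng.normal(size=n)
    Z2 = rho * Z1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    Y = 1.0 * Z1 + 0.5 * Z2 + rng.normal(0, sigma, n)
    # (i) nuisance coefficient known: regress Y - 0.5*Z2 on Z1 alone
    b1_known.append(np.sum(Z1 * (Y - 0.5 * Z2)) / np.sum(Z1 ** 2))
    # (ii) both coefficients unknown: full least squares
    Z = np.c_[Z1, Z2]
    b1_full.append(np.linalg.solve(Z.T @ Z, Z.T @ Y)[0])

print(n * np.var(b1_known))   # ~ sigma^2 / E(Z1^2) = 1.0
print(n * np.var(b1_full))    # ~ 1 / (1 - rho^2) ~ 2.78, the price in (6.2.28)
```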
Example 6.2.2. M-Estimates Generated by Linear Models with General Error Structure. Suppose that the \epsilon_i in (6.2.19) are i.i.d. but not necessarily Gaussian with density \frac{1}{\sigma}f_0(\frac{\cdot}{\sigma}), for instance,

f_0(x) = \frac{e^{-x}}{(1 + e^{-x})^2},

the logistic density. Such error densities have the often more realistic, heavier tails^{(1)} than the Gaussian density. The estimates \hat{\beta}_0, \hat{\sigma}_0 now solve
\sum_{i=1}^n \psi\left(\frac{Y_i - \sum_{k=1}^p Z_{ik}\beta_k}{\sigma}\right) Z_{ij} = 0, \quad 1 \le j \le p,

and

\sum_{i=1}^n \chi\left(\frac{Y_i - \sum_{k=1}^p Z_{ik}\beta_k}{\sigma}\right) = 0,
where \psi = -\frac{f_0'}{f_0}, \chi(y) = -\left(y\frac{f_0'(y)}{f_0(y)} + 1\right), and \beta_0 = (\beta_{10}, \ldots, \beta_{p0})^T. The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if
(i) \log f_0 is strictly concave, i.e., f_0'/f_0 is strictly decreasing.

(ii) (\log f_0)'' exists and is bounded.

Then, if further f_0 is symmetric about 0,
I(\theta) = \sigma^{-2} I((\beta^T, 1)^T) = \sigma^{-2}\begin{pmatrix} c_1 E(ZZ^T) & 0 \\ 0 & c_2 \end{pmatrix}   (6.2.29)

where

c_1 = \int\left(\frac{f_0'(x)}{f_0(x)}\right)^2 f_0(x)\,dx, \qquad c_2 = \int\left(x\frac{f_0'(x)}{f_0(x)} + 1\right)^2 f_0(x)\,dx.
Thus, \hat{\beta}_0, \hat{\sigma}_0 are optimal estimates of \beta and \sigma in the sense of Theorem 6.2.2 if f_0 is true. Now suppose f_0 generating the estimates \hat{\beta}_0 and \hat{\sigma}_0 is symmetric and satisfies (i) and (ii) but the true error distribution has density f possibly different from f_0. Under suitable conditions we can apply Theorem 6.2.1 with
\Psi = (\psi_1, \ldots, \psi_p, \psi_{p+1})^T where

\psi_j(z, y, \beta, \sigma) = \psi\left(\frac{y - \sum_{k=1}^p z_k\beta_k}{\sigma}\right) z_j, \quad 1 \le j \le p,

\psi_{p+1}(z, y, \beta, \sigma) = \chi\left(\frac{y - \sum_{k=1}^p z_k\beta_k}{\sigma}\right)   (6.2.30)

to conclude that

\mathcal{L}(\sqrt{n}(\hat{\beta}_0 - \beta_0)) \to N(0, \Sigma(\Psi, P)), \qquad \mathcal{L}(\sqrt{n}(\hat{\sigma}_0 - \sigma_0)) \to N(0, \sigma^2(P))   (6.2.31)

where \beta_0, \sigma_0 solve

\int \Psi(z, y, \beta_0, \sigma_0)\, dP(z, y) = 0
and \Sigma(\Psi, P) is as in (6.2.6). What is the relation between \beta_0, \sigma_0 and \beta, \sigma given in the Gaussian model (6.2.19)? If f_0 is symmetric about 0 and the solution of these equations is unique, then \beta_0 = \beta. But \sigma_0 = c(f_0)\sigma for some c(f_0) typically different from one. Thus, \hat{\beta}_0 can be used for estimating \beta although if the true distribution of the \epsilon_i is N(0, \sigma^2) it should perform less well than \hat{\beta}. On the other hand, \hat{\sigma}_0 is an estimate of \sigma only if normalized by a constant depending on f_0. (See Problem 6.2.5.) These are issues of robustness; that is, to have a bounded sensitivity curve (Section 3.5, Problem 3.5.8), we may well wish to use a nonlinear bounded \Psi = (\psi_1, \ldots, \psi_p)^T to estimate \beta even though it is suboptimal when \epsilon \sim N(0, \sigma^2), and to use a suitably normalized version of \hat{\sigma}_0 for the same purpose. One effective choice of \psi_j is the Huber function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. □
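A sketch of the kind of bounded-\psi regression estimate just described, using the Huber function and a simple iterative reweighting scheme; the data, tuning constant, and scale estimate below are illustrative choices, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(6)
n, c = 100, 1.345                               # c: Huber tuning constant (illustrative)
Z = np.c_[np.ones(n), rng.normal(size=n)]
Y = Z @ np.array([1.0, 2.0]) + rng.standard_t(df=3, size=n)   # heavy-tailed errors

beta = np.linalg.solve(Z.T @ Z, Z.T @ Y)        # least squares start
sigma = 1.4826 * np.median(np.abs(Y - Z @ beta))   # MAD scale, held fixed here
for _ in range(100):                            # iteratively reweighted least squares
    r = (Y - Z @ beta) / sigma
    w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))   # psi(r)/r for Huber psi
    beta = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * Y))

print(beta)   # approximately solves sum_i psi((Y_i - z_i' beta)/sigma) z_i = 0
```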
2. 1T(8) rr~ 1 P(Xi1 8). These methods are beyond the scope of this volume but will be discussed briefly in Volume II.19) (Laplace's method). The problem arises also when. under PeforallB. for fixed n.19). We have implicitly done this in the calculations leading up to (5. Kadane.(t) n~ 1 p(Xi .8 we obtain Theorem 6. and interpret I ..) and A6A8 hold then. fJ p vary freely. Summary. we are interested in the posterior distribution of some of the parameters.I as the Euclidean nonn in conditions A7 and A8. A4(a. 6. . A5(a. Op). We simply make 8 a vector. Although it is easy to write down the posterior density of 8. . 2 The asymptotic theory we have developed pennits approximation to these constants by the procedure used in deriving (5. .3. say (fJ 1. However.).s. O ). We defined minimum contrast (Me) and M estimates in the case of pdimensional parameters and established their convergence in law to a nonnal distribution.2.32) a. the latter can pose a fonnidable problem if p > 2. When the estimating equations defining the M estimates coincide with the likelihood . All of these will be developed in Section 6. and Tierney (1989).Section 6. as is usually the case.2. Confidence regions that parallel the tests will also be developed in Section 6. Ifthe multivariate versions of AOA3. up to the proportionality constant f e . and Rao's tests. Using multivariate expansions as in B. ~ (6.2. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as H : fJ 1 < 0 versus K : fh > 0 where fJ 2 . This approach is refined in Kass. The three approaches coincide asymptotically but differ substantially. typically there is an attempt at "exact" calculation. The consequences of Theorem 6.s..5. A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems. Wald tests (a generalization of pivots). Again the two approaches differ at the second order when the prior begins to make a difference.3 are the same as those of Theorem 5.s. the likelihood ratio principle.2 Asymptotic Estimation Theory in p Dimensions 391 Testing and Confidence Bounds There are three principal approaches to testing hypotheses in multiparameter models.5. say.5.3 The Posterior Distribution in the Multiparameter Case The asymptotic theory of the posterior distribution parallels that in the onedimensional case exactly. See Schervish (1995) for some of the relevant calculations.3. because we then need to integrate out (03 . A new major issue that arises is computation. ife denotes the MLE.. the equivalence of Bayesian and frequentist optimality asymptotically. ..2. .3. in perfonnance and computationally. t)dt.
LARGE SAMPLE TESTS AND CONFIDENCE REGIONS ~ 0) converges a. . ! I In Sections 4. .2 to extend some of the results of Section 5. I H I! ~.. I . In Section 6. Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic i. the exact critical value is not available analytically. 6. and confidence procedures. 1 Zp.4 to vectorvalued parameters. We shall show (see Section 6. '(x) = sup{p(x. covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to he appropriate approximations.. In such . this result gives the asymptotic distribution of the MLE. Wald and Rao large sample tests. Another example deals with M estimates based on estimating equations generated by linear models with nonGaussian error distribution.1 we developed exact tests and confidence regions that are appropriate in re~ gression and anaysis of variance (ANOVA) situations when the responses are normally distributed.3 • . to the N(O.9). X II . However. . X n are i. 9 E 8} sup{p(x.B} has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of () in the family of models {PO. A(X) simplified and produced intuitive tests whose critical values can be obtained from the Student t and :F distributions. .I . as in the linear model. Finally we show that in the Bayesian framework where given 0. 1 .4. adaptive estimation of 131 is possible iff Zl is uncorre1ated with every linear function of Z2) . We use an example to introduce the concept of adaptation in which an estimate f) is called adaptive for a model {PO.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. ry E £} if the asymptotic distribution of In(() . These were treated for f) real in Section 5. In these cases exact methods are typically not available. I .denotes the MLE for PO' then the posterior distribution of yn(O if 0 r' (9)) distribution.9 and 6.i.s. 170 specified. However.1 we considered the likelihood ratio test statistic. In linear regression.•• . . in many experimental situations in which the likelihood ratio test can be used to address important questions.". equations. and showed that in several statistical models involving normal distributions. "' ~ 6. PO'  .1 . We present three procedures that are used frequently: likelihood ratio.4. In this section we will use the results of Section 6.4.f/ : () E 8.9) . and we tum to asymptotic approximations to construct tests. 9 E 8 0 } for testing H : 9 E 8 0 versus K . confidence regions.d. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MD or AIestimate if we know the correct model P = {Po : BEe} and use the MLE for this model.3.8 0 . we need methods for situations in which.I .110 : () E 8}. 1 '~ J . I j I I . and other methods of inference. under P.392 Inference in the Multiparameter Case Chapter 6 . 9 E 8" 8 1 = 8 .
. exists and in Example 2. Let YI .i = 1. 2 log .. . i = 1.\(Y) = L X.£ X~" for degrees of freedom d to be specified later..1. 0 We iHustrate the remarkable fact that X~q holds as an approximation to the null distribution of 2 log A quite generally when the hypothesis is a nice qdimensional submanifold of an rdimensional parameter space with the following.1.Section 6. As in Section 6.3.v'[... . . . the numerator of .3.l. .2 we showed how to find B as a nonexplicit solution of likelihood equations. iJ).2. Then Xi rvN(B i . The Gaussian Linear Model with Known Variance.i = 1. under regularity conditions. the X~_q distribution is an approximation to £(21og . r and Xi rv N(O. q < r.3.3 we test whether J.2 we showed that the MLE. . x ~ > 0. . where A nxn is an orthogonal matrix with rows vI. Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations. Y n be nn. The MLE of f3 under H is readily seen from (2. the hypothesis H is equivalent to H: 6 q + I = .\(X) based on asymptotic theory.1 exp{ iJx )/qQ). E W .\(x) is available as p(x. We next give an example that can be viewed as the limiting situation for which the approximation is exact: independent with Yi rv N(Pi. In this u 2 known example. •.3.5) to be iJo = l/x and p(x.Wo where w is an rdimensional linear subspace of Rn and W ::> Wo.. SetBi = TU/uo.4. 0) = IT~ 1 p(Xi. Suppose we want to test H : Q = 1 (exponential distribution) versus K : a i. . . and we transform to canonical form by setting Example 6. V q span Wo and VI. Wilks's approximation is exact...\(Y)). .d. i = r+ 1. Moreover.n. iJ). Using Section 6. distribution with density p(X. =6r = O. Q > 0.q' i=q+1 r Wilks's theorem states that. 1)..1.'··' J. iJ > O.."" V r spanw. which is usually referred to as Wilks's theorem or approximation. ~ X. such that VI... .i.3 Large Sample Tests and Confidence Regions 393 cases we can turn to an approximation to the distribution of . .l.randXi = Uduo. iJo) is the denominator of the likelihood ratio statistic. Thus. 0).3. ~ In Example 2. This is not available analytically. ~ ~ ~ ~ ~ The approximation we shall give is based on the result "2 log '\(X) .Xn are i. Suppose XL X 2 .tl..n. when testing whether a parameter vector is restricted to an open subset of Rq or R r .. = (J. 0 It remains to find the critical value. as X where X has the gamma.tn)T is a member of a qdimensional linear subspace of woo versus the alternative that J. 0) ~ iJ"x·. 0 = (iX. .. <1~) where {To is known. we conclude that under H. qQ. . .1. .1). Here is an example in which Wilks's approximation to £('\(X)) is useful: Example 6... 1.
2) V ~ N(o. ".3.6. case with Xl. Suppose the assumptions of Theorem 6. we derived e e 2log >'(Y) ~ n log ( 1 + 2::" L. we can conclude arguing from A.2 but (T2 is unknown then manifold whereas under H. . an expansion of lnUI) about en evaluated at 8 = 0 0 gives 2[ln«(Jn) In((Jo)] = n«(Jn .2. Apply Example 5. Because ..2 unknown case.(In[ < l(Jn .d. V T I«(Jo)V ~ X. . : .3.(Jo) In«(Jn)«(Jn . Note that for J. B). 21og>.. If Yi are as in = (IL. hy Corollary B. . 1 • ! We tirst consider the simple hypothesis H : 6 = 8 0 _ Theorem 6._q as well.3. I.3.3 and AA that that In((J~) "'" 2[ln((Jn) In((Jo)] £r ~ V I((Jo)V.. N(o. where DO is the derivative with respect to e. !.. (72) ranges over an r + Idimensional Example 6.. .i. where x E Xc R'. Hence.. 2 log >'(Y) = Vn ". I1«(Jo)).3.. In Section 6.. "'" (6. .. .(Y) defined in Remark ° 6. The result follows because. an = n.\(Y) £ X. = 2[ln«(Jn) In((Jo)] ~ ~ ~ £ X.1 ((Jo)). By Theorem 6. ·'I1 .logp(X.(Jo) ". Then.q' Finally apply Lemma 5.(Jol. I I i Example 6.· . "'" T.1.1. and conclude that Vn S X.i=q+l 'C'" I=r+l Xi' ) X.~ L H . under H .2.2..(Jo) for some (J~ with [(J~ .2. (J = (Jo.(Jol.. Write the log likelihood as e In«(J) ~ L i=I n logp(X i .3.q also in the 0. o . where Irxr«(J) is the Fisher information matrix.. . . Here ~ ~ j. y'ri(On .3. The Gaussian Linear Model with Unknown Variance. X. (6." . 0). I(J~ . ranges over a q + Idimensional manifold.. j ..1) " I. Proof Because On solves the likelihood equation DOln(O) = 0..2 with 9(1) = log(l + I). ! .(J) Uk uJ rxr c'· • 1 .7 to Vn = I:~_q+lXl/nlI:~ r+IX.2 are satisfied. i\ I . I . I( I In((J) ~ n 1 n L i=l & & &" &".Onl + IOn  (Jol < 210n .2.3. .3.Xn a sample from p(x.(X) ~ 1 .(Jol < I(J~ .1.394 Inference in the Multiparameter Case Chapter 6 1 . c = and conclude that 210g . 0 Consider the general i. Eln«(Jo) ~ I«(Jo). . and (J E c W.
35) where 8(00) = n 1/2 L Dl(X" 0) i=1 n and 8 = (8 1 .j.a.T.4).)T. 10 (0 0 ) = VarO. 0(2) = (8.n and (6.3.81(00).(] .2[1. Make a change of parameter. Suppose that the assumptions of Theorem 6.q. .)1'.1) applied to On and the corresponding argument applied to 8 0 .1 and 6... the test that rejects H : () 2Iog>.0(1) = (8 1" ".n = (80 .. SupposetharOo istheMLEofO underH and that 0 0 satisfiesA6for Po..(Ia). Let Po be the model {Po: 0 E eo} with corresponding parametn"zation 0(1) = ~(1) (1) ~(1) (8 1. and {8 0 .. T (1) (2) Proof.3.j} are specified values. j ~ q + 1.2.3.)T.8.0:'.Section 6.".8q ). Theorem 6.3.2. Next we tum to the more general hypothesis H : 8 E 8 0 .. (JO. has approximately level 1.3) is a confidence region for 0 with approximat~ coverage probability 1 .0) is the 1 and C!' quantile of the X. Let 00.3. 0 E e. 0). where x r (1. Then under H : 0 E 8 0.10) and (6.+l. where e is open and 8 0 is the set of 0 E e with 8j = 80 .a)) (6. 8 2)T where 8 1 is the first q coordinates of8. OT ~ (0(1).. distribution.8. Furthennore. for given true 0 0 in 8 0 • '7=M(OOo) where.0 0 ). (6.(9 0 . Examples 6.. ~(11 ~ (6.3. ~ {Oo : 2[/ n (On) /n(OO)] < x.n) /n(Oo)l.3. dropping the dependence on 6 0 • M ~ PI 1 / 2 (6.80.6) .2 hold for p(x. By (6.3.4) It is easy to see that AOA6 for 10 imply AOA5 for Po.2 illustrate such 8 0 .' ".3 Large Sample Tests and Confidence Regions 395 = As a consequence of the theorem.2. Let 0 0 E 8 0 and write 210g >'(X) ~ = 2[1n(9 n )  In(Oo)] . We set d ~ r . .(X) 8 0 when > x. 0b2) = (80"+1".0(2».
This asymptotic level 1 ..00 : e E 8 0} MAD = {'1: '/. with ell _ .5) to "(X) we obtain. I I eo~ ~ ~ {(O. if A o _ {(} . under the conditions of Theorem 6. '1 E Me}. : • Var T(O) = pT r ' /2 II.3.+ 1 .0.. = 0.2.Tf(O)T .1'1 '1 E eo} and from (B. because in tenus of 7].all (6. J) distribution by (6. Or)J < xr_.LTl(o) + op(l) i=l i=l L i=q+l r Ti2 (0) + op(l).9) . . then by 2 log "(X) TT(O)T(O) .0: confidence region is i . ). .3.l.11) .00 .(X) = y{X) where ! 1 1 . Moreover. ••.8) that if T('1) then = 1 2 n.6) and (6.. by definition. 1e ). Note that. Such P exists by the argument given in Example 6.1 '1).a) is an asymptotically level a test of H : 8 E Of equal importance is that we obtain an asymptotic confidence region for (e q+ 1.(X) > x r .8.1ITD III(x.+I> .1/ 2 p = J.9).1 '1ll/sup{p(x.I '1): 11 0 + M. We deduce from (6. 18q acting as r nuisance parameters. IIr ) : 2[ln(lI n ) ..In( 00 . 0 Note that this argument is simply an asymptotic version of the one given in Example 6./ L i=I n D'1I{X" liD + M.3. His {fJ E applying (6.3. Me : 1]q+l = TJr = a}.3.(1 . ~ 'I.2. (O) r • + op(l) (6..... .10) LTl(o) . = . Now write De for differentiation with respect to (J and Dry for differentiation with respect to TJ.. Tbe result follows from (6. which has a limiting X~q distribution by Slutsky's theorem because T(O) has a limiting Nr(O.2. Thus.(l . is invariant under reparametrization .+1 ~ .3.3.II).1> . a piece of 8. • 1 y{X) = sup{P{x.1I0 + M.1I0 + M...8 0 : () E 8}.3..7).1 because /1/2 Ao is the intersection of a q dirnensionallinear subspace of R r with JI/2{8 .1I 0 + M.. . (6.l.3.396 Inference in the Multiparameter Case Chapter 6 and P is an orthogonal matrix such that.3.. .1 '1) = [M. • . rejecting if ..• I .13) D'1I{x.
o''C6::.2 and the previously conditions on g.( 0 ) ~ o} (6..3.14) a hypothesis of the fonn (6.3. 8q o assuming that 8q + 1 ...8r are known. More sophisticated examples are given in Problems 6.3..3.13) Evideutly.. (6.2)(6.3.n under H is consistent for all () E 8 0. As an example ofwhatcau go wrong..:::m. themselves depending on 8 q + 1 . It can be extended as follows.q at all 0 E 8. Suppose the assumptions of Theorem 6..:::f.p:::'e:.3. (6.3. such that Dg(O) exists and is of rauk r .. Suppose H is specified by: There exist d functions.. Here the dimension of 8 0 and 8 is the same but the boundary of 8 0 has lower dimension. q + 1 <j < r}. 8. Vl' are orthogonal towo and v" . More complicated linear hypotheses such as H : 6 . A(X) behaves asymptotically like a test for H : 0 E 8 00 where 800 = {O E 8: Dg(Oo)(O . Theorem 6.cTe:::. will appear in the next section. t. Defiue H : 0 E 8 0 with e 80 = {O E 8 : g(O) = o}.3..3.13). The esseutial idea is that. .::... Suppose the MLE hold (JO. v r span " R T then. 8q)T : 0 E 8} is opeu iu Rq.12) The extension of Theorem 6...2 falls under this schema with 9. We only need note that if WQ is a linear space spanned by an orthogonal basis v\. . The proof is sketched iu Problems (6.. q + 1 < j < r. R.g. if A(X) is ~ the likelihood ratio statistic for H : 0 E 8 0 given in (6. which require the following general theorem.2 is still inadequate for most applications. .3. 9j . X~q under H. 2' 2 + X 2 )) 2 and 210g A(X) ~ xf but if 81 + 82 < 1 clearly 2 log A(X) ~ op(I).3. B2 . of f)l" .3.6.3.o:::"'' 397 CC where 00.d. The formulation of Theorem 6.5 and 6.. if 0 0 is true.2 to this situation is easy and given in Problem 6.3. . Theorem 6..13) then the set {(8" .l:::.t.i.) be i.::Se::.3).3.80. Then.2. let (XiI.e:::S:::.. " ' J v q and Vq+l " .3..8 0 E Wo where Wo is a linear space of dimension q are also covered.. N(B 1 ..3. q + 1 < j < r written as a vector g.... 2 e eo Ii o ~ (Xl + X 2) 2 + ~ 1 _ (X. "'0 ~ {O : OT Vj = 0. . where J is the 2 x 2 identity matrix aud 8 0 = {O : B1 + O < I}. Examples such as testing for independence in contingency tables. 210g A(X) ".. .... . l B. If 81 + B2 = 1.3.13). We need both properties because we need to analyze both the numerator and denominator of A(X).. .. X.'':::':::"::d_C:::o:::. J)."de"'::'"e:::R::e"g. Wilks's theorem depends critically On the fact that nOt only is open but that if giveu in (6.'! are the MLEs..(0) ~ 8j .
3. hence. it follows from Proposition B.~ D 2l n (1J 0 ) or I (IJ n ) or .. .fii(1J IJ) ~N(O.q' ~(2) (6.7.3. I22(lJo)) if 1J 0 E eo holds. More generally. for instance.4.. ~ ~ 00. i favored because it is usually computed automatically with the MLE.Or) and define the Wald (6. theorem completes the proof.1. .2. r 1 (1J)) where. respectively. j Wn(IJ~2)) !:.X. Slutsky's it I' I' .31). has asymptotic level Q.2.18) i 1 Proof. . Then . [22 is continuous.16) n(9 n _IJ)TI(9 n )(9 n IJ) !:.15) and (6.Nr(O.fii(lJn (2) L 1J 0 ) ~ Nd(O.a)} is an ellipsoid in W easily interpretable and computablesee (6. y T I(IJ)Y. . I . i . .2. (6.17) ~ ~ e ~ en 1 • I I Wn(IJ~2)) ~ n(9~2) 1J~2)l[I"(9nJrl(9~) _1J~2») where I 22 (IJ) is the lower diagonal block of II (IJ) written as I2 I 1 _ (Ill(lJ) (IJ) I21(IJ) I (1J)) I22(IJ) with diagonal blocks of dimension q x q and d x d.l(a) that I(lJ n ) ~ I(IJ) asn By Slutsky's theorem B.398 Inference in the Multiparameter Case Chapter 6 6.2.3.7.2.. (6..10).: b.3.15) Because I( IJ) is continuous in IJ (Problem 6. (6.3.6. y T I(IJ)Y . Y .16).~ D 2ln (IJ n ). It follows that the Wald test that rejects H : (J = 6 0 in favor of K : () • i= 00 when .lJ n ) where IJ n statistic as ~(1) ~(2) ~(1) = (0" . The last Hessian choice is ~ ~ . But by Theorem 6.3.) and IJ n ~ ~ ~(2) ~ (Oq+" .~ D 2 1n ( IJ n). the lower diagonal block of the inverse of ..rl(IJ» asn ~ p ~ L 00. I. More generally /(9 0 ) can be replaced by any consistent estimate of I( 1J0 ). It and I(8 n ) also have the advantage that the confidence region One generates {6 : Wn (6) < xp(l . X. according to Corollary B.2. For the more general hypothesis H : (} E 8 0 we write the MLE for 8 E as = ~ ! I: (IJ n .3. the Hessian (problem 6. ~ Theorem 6.3. Under the conditions afTheorem 6. 0 ""11 F . in particular . .2 hold.2 Wald's and Rao's Large Sample Tests The Wald Test Suppose that the assumptions of Theorem 6.0.3.3. If H is true.9). I 22 (9 n ) is replaceable by any consistent estimate of J22( 8)..2. 1(8) continuous implies that [1(8) is continuous and..
This test has the advantage that it can be carried out without computing the MLE. The argument is sketched in Problem 6. and so on. and the convergence Rn((}o) ~ X. a consistent estimate of .nl]  2  1 . ~1 '1'n(OO.\(X) is the LR statistic for H Thus. (6.2 that under H.3. 2 Wn(Oo ) ~(2) where .n W (80 . which rejects iff HIn (9b ») > x 1_ q (1 .3. in practice they can be very different.nlE ~ (2) _ T  .. requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests. is. Rao's SCOre test is based on the observation (6. What is not as evident is that..n] D"ln(Oo.20) vnt/Jn(OO) !.3. The Rao Score Test For the simple hypothesis H that.n) 2  + D21 ln (00. X.Section 6.ln(Oo.' (0 0 )112 (00) (6.(00 )  121 (0 0 )/.3.Q). 1(0 0 )) where 1/J n = n . therefore. It can be shown that (Problem 6.n) ~  where :E is a consistent estimate of E( (}o). The Rao test is based on the statistic Rn(Oo ) ~n'1'n(Oo. Furthermore. .9. It follows from this and Corollary B. Rn(Oo) = nt/J?:(Oo)r' (OO)t/Jn(OO) !.. the two tests are equivalent asymptotically. the asymptotic variance of ... 112 is the upper right q x d block.22) .n)[D. N(O.21) where III is the upper left q x q block of the r x r infonnation matrix I (80 ).3...3.... under H.3 Large Sample Tests and Confidence Regions 399 The Wold test.0:) is called the Rao score test.6. ~ B ) T given by {8(2) : r W n (0(2)) < x r _ q (1 These regions are ellipsoids in R d Although.. as n _ CXJ. asymptotically level Q. = 2 log A(X) + op(l) (6.n under H.19) indicates. un.3.ln(8 0 . Let eo '1'n(O) = n.n) under n H.3. as (6..:El ((}o) is ~ n 1 [D. the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n.1/ 2 D 2 In (0) where D 1 l n represents the q x 1 gradient with respect to the first q coordinates and D 2 l n the d x 1 gradient with respect to the last d.8) E(Oo) = 1. by the central limit theorem. The test that rejects H when R n ( (}o) > x r (1.19) : (J c: 80.I Dln ((Jo) is the likelihood score vector. The Wald test leads to the Wald confidence regions for (B q +] . (J = 8 0 . (Problem 6.9) under AOA6 and consistency of (JO. The extension of the Rao test to H : (J E runs as follows.
)I(On)( vn(On  On) + A) !:. Summary. We also considered a quadratic fonn. Finally._q(A T I(Oo)t. which stales that if A(X) is the LR statistic. where X~ (')'2) is the noncentral chi square distribution with m degrees of freedom and noncentrality parameter "'? It may he shown that the equivalence (6.400 Inference in the Multiparameter Case Chapter 6 where D~ is the d x d matrix of second partials of ill with respect to matrix of mixed second partials with respect to e(l). .On) + t. X.0 0 ) = T'" "" (vn(On .19) holds under On and that the power behavior is unaffected and applies to all three tests.3. called the Wald statistic. b .3. In particular we shall discuss problems of .0 0 ) I(On)(On . .. 0(2). We considered the problem of testing H : 8 E 80 versus K : 8 E 8 .4 LARGE SAMPLE METHODS FOR DISCRETE DATA In this section we give a number of important applications of the general methods we have developed to inference for discrete data.2.) . X~' 2 j > xd(1 . Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score testssee Rao (1973) for more on this. The analysis for 8 0 = {Oo} is relatively easy. Power Behavior of the LR. which measures the distance between the hypothesized value of 8(2) and its MLE. and Wald Tests It is possible as in the onedimensional case to derive the asymptotic power for these tests for alternatives of the form On = ~ eo + ~ where 8 0 E 8 0 . Rao. then. and showed that this quadratic fonn has limiting distribution q .5.. Under H : (J E A6 required only for Po eo and the conditions ADAS a/Theorem 6. .8 0 where is an open subset of R: and 8 0 is the collection of 8 E 8 with the last r ~ q coordinates 0(2) specified. .2 but with Rn(lJb 2 1 1 » !:. and so on.a)} and • The Rao large sample critical and confidence regions are {R n (Ob » {0(2) : Rn (0(2) < xd(l. . under regularity conditions. We established Wilks's theorem. . . I . D 21 the d x d Theorem 6. which is based on a quadratic fonn in the gradient of the log likelihood. I I .. . we introduced the Rao score test. 2 log A(X) has an asymptotic X. For instance. The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under H. The asymptotic distribution of this quadratic fonn is also q• I e X: X: 6. On the other hand. for the Wald test neOn . it shares the disadvantage of the Wald test that matrices need to be computed and inverted. e(l).q distribution under H.a)}.
00k)2lOOk.6.8 we found the MLE 0. we need the information matrix I = we find using (2. L(9. Because ()k = 1 . . 2. = L:~ 1 1{Xi = j}.()k_l)T and test the hypothesis H : ()j = ()OJ for specified (JOj. j=l . Let ()j = P(Xi = j) be the probability of the jth category. k1. .Section 6. For i.i. log(N. we consider the parameter 9 = (()l. trials in which Xi = j if the ith trial prcx:luces a result in the jth category.OOi)(B.. j=1 Do. .1...3. treated in more detail in Section 6. In.nOo. In Example 2. j = 1.. where N. k. .4.OO.8.. k .00. .) > Xkl(1. It follows that the large sample LR rejection region is ~ k 2IogA(X) = 2 LN. .+ ] 1 1 0.0). or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory.. + n LL(Bi . the Wald statistic is klkl Wn(Oo) = n LfB. Pearson's X2 Test As in Examples 1.33) and (3. ~ N.]2100.32) thaI IIIij II.7. 6. Thus.)2InOo.2.d. .1 GoodnessofFit in a Multinomial Model.E~.4 Large Sample Methods for Discret_e_D_a_ta ~_ _ 401 goodnessoffit and special cases oflog linear and generalized linear models (GLM).3.. .4. Ok Thus.2. with ()Ok =  1 1 E7 : kl j=l ()OJ. k Wn(Oo) = L(N..lnOo.5. j = 1. j = 1.)/OOk. ()j. we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution. consider i. j=l i=1 The second term on the right is kl 2 n Thus. j=1 To find the Wald test.) /OOk = n(Bk . . and 2. # j.2. Ok if i if i = j. I ij [ ." .
 8k 80 k = {[8 0k (8 j  80j )]  [8 0j (8 k ~  80k ) ] } . we could invert] or note that by (6. To derive the Rao test.L ..Expected)2 Expected (6..~) 8 80j 80 k ~ ~ ~ ] 2 0j = n [~ 80 k ~ 1] 2 To simplify the first term on the right of (6.4. . 80i 80j 80k (6.1) of Pearson's X2 will reappear in other multinomial applications in this section. with To find ]1.80j ) = (Ok .~ 8 8 0j 0k  2 80j ) ( n LL _ _ L kl kl kl ( J=1 z=1 ~ 8_Z 80i ~) 8k 80k (~ 8 _J 80J ~) ) 8k . The general form (6. . The second term on the right is n [ j=1 (!.2. where N = (N1 . the Rao statistic is .Nk_d T and. n ( L kl ( j=1 ~ ~) !.. j=1 . ]1(9) = ~ = Var(N).4.2. Then...8..2)..L .1) X where the sum is over categories and "expected" refers to the expected frequency E H (Nj ).11).2) . It is easily remembered as 2 = SUM (Observed . .80k). '" . ~ = Ilaijll(kl)X(kl) with Thus.1S. we write ~ ~  8j 80j ...402 Inference in the Multiparameter Case Chapter 6 The term on the right is called Pearson's chisquare (X 2 ) statistic and is the statistic that is typically used for this multinomial testing problem. ~ .13.4. . because kl L(Oj .4. 80j 80k 1 and expand the square keeping the square brackets intact. note that from Example 2. by A.
fh = 03 = 3/16.Section 6.i::.: . • 6.25)2 104. k = 4 X 2 = (2. n020 = n030 = 104. Example 6.4).4. . 8 0 . Then.. nOlO = 312. this value may be too small! See Note 1. Contingency Tables Suppose N = (NI . distribution. which has a pvalue of 0.k.25 + (3.75.4 Large Sample Methods for Discrete Data 403 the first term on the right of (6. 2. LOi=l}.9 when referred to a X~ table.75. There is insufficient evidence to reject Mendel's hypothesis.2 GoodnessofFit to Composite Multinomial Models. 02. i=1 For example. (2) wrinkled yellow.25 + (2.2) becomes It follows that the Rao statistic equals Pearson 's X2. n040 = 34.75 .1. we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1. 3. If we assume the seeds are produced independently. Mendel's theory predicted that 01 = 9/16..25)2 312. 4 as above and associated probabilities of occurrence fh. .fh.75)2 = 04 34. 8). However. and we want to test whether the distribution of types in the n = 556 trials he performed (seeds he observed) is consistent with his theory.4. M(n. For comparison 210g A = 0. We will investigate how to test H : 0 E 8 0 versus K : 0 ¢:. where 8 0 is a composite "smooth" subset of the (k . Nk)T has a multinomial. n3 = 108. and (4) wrinkled green. in the HardyWeinberg model (Example 2. Possible types of progeny were: (1) round yellow. Here testing the adequacy of the HardyWeinberg model means testing H : 8 E 8 0 versus K : 8 E .75)2 104.1. 04 = 1/16. n4 = 32.48 in this case. Mendel observed nl = 315. l::. In experiments on pea breeding.25. 7. j.4. 04. (3) round green. which is a onedimensional curve in the twodimensional parameter space 8.1) dimensional parameter space k 8={8:0i~0. n2 = 101. Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. Testing a Genetic Theory.75 + (3.
l now leads to the Rao statistic R (8("") n 11 =~ ~ j=l [Ni . approximately X. If £ is not open sometimes the closure of £ will contain a solution of (6. The Rao statistic is also invariant under reparametrization and... we define ej = 9j (8). To apply the results of Section 6.4. 8(11)) for 11 E £.4. . . which will be pursued further later in this section.q' Moreover..nk. . . To avoid trivialities we assume q < k .3 that 2 log . to test the HardyWeinberg model we set e~ = e1 . under H. . .. where 9j is chosen so that H becomes equivalent to "( e~ .1). l~j~q.8 0 . .4.. ij satisfies l a a'r/j logp(nl..2) by ej(ij)." For instance. The Wald statistic is only asymptotically invariant under reparametrization. {p(.ne j (ij)]2 = 2 ne. r. . also equal to Pearson's X2 • .X approximately has a xi distribution under H. e~ = e2 .. . .. i=l k If 11 ~ e( 11) is differentiable in each coordinate. 8(11)) : 11 E £}.. .3. Consider the likelihood ratio test for H : e E 8 0 versus K : e 1:.r for specified eOj ... we obtain the Rao statistic for the composite multinomial hypothesis by replacing eOj in (6.~) and test H : e~ = O. nk) = L ndlog(ni/n) log ei(ij)]. . the log likelihood ratio is given by log 'x(nb . nk.. .q distribution for large n. 8 0 • Let p( n1.("") X J 11 I where the righthand side is Pearson's X2 as defined in general by (6. Maximizing p(n1. e~) T ranges over an open subset of Rq and ej = eOj . That is..4.X approximately has a X. If a maximizing value. 1 ~ j ~ q or k (6. £ is open."" 'r/q) T. 8) denote the frequency function of N.. . a 11 'r/J . . the Wald statistic based on the parametrization 8 (11) obtained by replacing e by e (ij). . by the algebra of Section 6. ek(11))T takes £ into 8 0 . ij = (iiI. and ij exists.3) Le. j = q + 1. However. Other examples.(t)~ei(11)=O.. nk.r. We suppose that we can describe 8 0 parametrically as where 11 = ('r/1.. is a subset of qdimensional space. nk.3).. 8) for 8 E 8 0 is the same as maximizing p( n1.... 8(11)) = 0.q) exists. thus. j Oj is. 2 log . j = 1. then it must solve the likelihood equation for the model. .2~(1 .4.3 and conclude that. i=l t n.1. and the map 11 ~ (e 1(11)..404 Inference in the Multiparameter Case Chapter 6 8 1 where 8 1 = 8 .4.1. The algebra showing Rn(8 0 ) = X2 in Section 6. involve restrictions on the e obtained by specifying independence assumptions on i classifications of cases into different categories. . Then we can conclude from Theorem 6.. .
N 4 ) has a M(n. . (h.4.a) with if (2nl + n2)/2n. 0 Testing Independence of Classifications in Contingency Tables Many important characteristics have only two categories.4. A linkage model (Fisher.1). An individual either is or is not inoculated against a disease.• . The only root of this equation in [0. . Denote the probabilities of these types by {}n. (6. We often want to know whether such characteristics are linked or are independent. Independent classification then means that the events [being an A] and [being a B] are independent or in terms of the B .. We found in Example 2. Then a randomly selected individual from the population can be one of four types AB. specifies that where TJ is an unknown number between 0 and 1. iTJ) : TJ :.4. The results are assembled in what is called a 2 x 2 contingency table such as the one shown. green base leaf versus white base leaf) leads to four possible offspring types: (1) sugarywhite. 1} a "onedimensional curve" of the threedimensional parameter space 8. AB. (3) starchywhite. AB.3) becomes i(1 °:. The Fisher Linkage Model. ( 2n3 + n2) 2) T 2n 2n o Example 6.. If Ni is the number of offspring of type i among a total of n offspring. HardyWeinberg. 1958. T() test the validity of the linkage model we would take 8 0 {G(2 + TJ). and so on. (4) starchygreen.4) which reduces to a quadratic equation in if. nl (2+TJ) (n2 + n3) n4 (1TJ) +:ry 0.. 9(ij) ((2nl + n2) 2n 2n 2.4 Methods for Discrete Data 405 Example 6.TJ). 301).4. (2) sugarygreen. Because q = 1. p.4. H is rejected if X2 2:: Xl (1 . AB. (2nl + n2)(2 3 + n2) . . To study the relation between the two characteristics we take a random sample of size n from the population. i(1 .Section 6. {}12.4. 1] is the desired estimate (see Problem 6. do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic A and A and of the second Band B. respectively.2.6 that Thus.5. For instance. {}22. A selfcrossing of maize heterozygous on two characteristics (starchy versus sugary. we obtain critical values from the X~ tables. ij {}ij = (Bil + Bi2 ) (Blj + B2j ). . is or is not a smoker. then (Nb . TJ). The likelihood equation (6. {}21. (}4) distribution. k 4. is male or female.
respectively. We test the hypothesis H : () E 8 0 versus K : 0 f{. 0 11 .'r/1) (1 . By our theory if H is true. X2 has approximately a X~ distribution. if N = (Nll' N 12 . 'r/2 ::. where z tt [R~jl1 2=1 J=l .~ + n12) iiI (nll + n2d (nll i72 (n21 + n22) (1 . Pearson's statistic is then easily seen to be (6. N 22 )T.4. N 21 . because k = 4. where 8 0 is a twodimensional subset of 8 given by 8 0 = {( 'r/1 'r/2.4. Thus. q = 2. ( 22 ). Then. the likelihood equations (6. 0 ::.6) the proportions of individuals of type A and type B.'r/1).4. 'r/2 (1 .5) whose solutions are Til 'r/2 = (n11+ n 12)/n (nll + n21)/n.RiCj/n) are all the same in absolute value and.4.011 + 021 as 'r/1. 1. 'r/1 ::. (1 .4. 0 12 .406 Inference in the Multiparameter Case Chapter 6 l A 11 The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column.'r/2)) : 0 ::. we have N rv M(n.7) where Ri = Nil + Ni2 is the ith row sum. These solutions are the maximum likelihood estimates. (6. the (Nij .Ti2) (6. This suggests that X2 may be written as the square of a single (approximately) standard normal variable. 'r/1 (1 .3) become . which vary freely. For () E 8 0 . C j = N 1j + N 2j is the jth column sum. 8 0 . 'r/2 to indicate that these are parameters. for example N12 is the number of sampled individuals who fall in category A of the first characteristic and category B of the second characteristic. 021 . Here we have relabeled 0 11 + 0 12 . In fact (Problem 6. I}.2).'r/2).Tid (n12 + n22) (1 .
g. Therefore. i = 1. Z ~ z(l. b} "J M(n. if X2 measures deviations from independence.. B.. It may be shown (Problem 6. 1 :::. If (Jij = P[A randomly selected individual is of type i for 1 and j for 2].. a.. (Jij : 1 :::. Nab Cb Ra n .4. .4. j :::. j :::. . and only if. .. B. Thus. b ~ 2 (e. B. .8) Thus. hair color). eye color.. i :::.. that A is more likely to occur in the presence of B than it would in the presence of B).a) as a level a onesided test of H : peA I B) = peA I B) (or peA I B) :::. 1). then {Nij : 1 :::. a. j = 1. The N ij can be arranged in a a x b contingency table. peA I B)) versus K : peA I B) > peA I B).4 Large Sample Methods for Discrete Data 407 An important altemative form for Z is given by (6. i :::. B to denote the event that a randomly selected individual has characteristic A. Z indicates what directions these deviations take. Next we consider contingency tables for two nonnumerical characteristics having a and b states. it is reasonable to use the test that rejects. respectively. b where N ij is the number of individuals of type i for characteristic 1 and j for characteristic 2. 1 :::. If we take a sample of size n from a population and classify them according to each characteristic we obtain a vector N ij . b).. . Positive values of Z indicate that A and B are positively associated (i. j :::.. a. b NIb Rl a Nal C1 C2 .. . . then Z is approximately distributed as N(O.peA I B)] [~(B) ~(~)ll/2 peA) peA) where P is the empirical distribution and where we use A. peA I B) = peA I B).4.4. that is.3) that if A and B are independent. 1 :::..e. b where the TJil..Section 6. The X2 test is equivalent to rejecting (twosidedly) if. if and only if.... . . . 1 Nu 2 N12 . TJj2 are nonnegative and 2:~=1 = 2:~=1 TJj2 = 1.. Z = v'n[P(A I B) . (Jij TJil The hypothesis that the characteristics are assigned independently becomes H : TJil TJj2 for 1 :::. a.. a. i :::.
In this section we will consider Bernoulli responses Y that can only take on the values 0 and 1. such as the probit gl(7r) = <I>1(7r) where <I> is the N(O..~ =1 Zij {3j = unknown parameters {31. (6.7r)]. Other transforms.Li = L.1 we considered linear models that are appropriate for analyzing continuous responses {Yi} that are.."F~1 Yij where Yij is the response on the jth of the mi trials in block i. whicll we introduced in Example 1."" 7rk)T based on X = (Xl"'" Xk)T is t.{3p.4.3 Logistic Regression for Binary Responses In Section 6. Thus. The argument is left to the problems as are some numerical applications." We assume that the distribution of the response Y depends on the known covariate vector ZT.4. 1) d. . and the loglog transform g2(7r) = 10g[log(1 ..408 Inference in the Multiparameter Case Chapter 6 with row and column sums as indicated.. ' XI. 1 :::. i :::.f.8 as the canonical parameter zT TJ = g ( 7r) = log [7r / (1 .7r)] are also used in practice. k. log ( ~: ) . B(mi' 7ri)' where 7ri = 7r(Zi) is the probability of success for a case with covariate vector Zi. Examples are (1) medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0). Because 7r(z) varies between 0 and 1. Instead we turn to the logistic transform g( 7r). approximately normally distributed ~ for known constants {Zij} and and whose means are modeled as J.11) When we use the logit transfonn g(7r). Maximum likelihood and dimensionality calculations similar to those for the 2 x 2 table show that Pearson's X2 for the hypothesis of independence is given by (6.) over the whole range of z is impossible. log(l "il] + t. we call Y = 1 a "success" and Y = 0 a "failure. 6. with Xi binomial. In this section we assume that the data are grouped or replicated so that for each fixed i.9) which has approximately a X(al)(bl) distribution under H. we observe the number of successes Xi = L. . a simple linear representation zT ~ for 7r(. [Xi C :i"J log + m. perhaps after a transfonnation. we obtain what is called the logistic linear regression model where . The log likelihood of 7r = (7rl. (2) election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0). usually called the log it.4. Next we choose a parametric model for 7r(z) that will generate useful procedures for analyzing experiments with binary responses.10. As is typical.4. (6.6. or (3) market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0). we observe independent Xl' .
. Zi)T is the logistic regression model of Problem 2.. mi 2mi Xi 1 Xi 1 = .3.Li = E(Xi ) = mi1fi. Similarly.. .4.L) = O. As the initial estimate use (6.p. ..4.4.! ) ( mi . Although unlike coordinate ascent NewtonRaphson need not converge.1fi)*}' + 00.1. (6. The condition is sufficient but not necessary for existencesee Problem 2.15) Vi = log X+. Note that IN(f3) is the log likelihood of a pparameter canonical exponential model with parameter vector f3 and sufficient statistic T = (T1' . Theorem 2.1fi )* = 1 . i ::. j = 1.Li.(1..+ . we can guarantee convergence with probability tending to 1 as N + 00 as follows.1 applies and we can conclude that if 0 < Xi < mi and Z has rank p. the likelihood equations are just ZT(X .J) SN(O.3.[ i(l7r iW')· . k ( ) (6. Then E(Tj ) = 2:7=1 Zij J.4. .4..12) where T j = 2:7=1 ZijXi and we make the dependence on N explicit. ( 1 . It follows that the NILE of f3 solves E f3 (Tj ) = T . Tp) T. . p (Jj Tj  ~ mi loge1+ exp{Zif3} ) + ~ log '. W is estimated using Wo = diag{mi1f.3.13) By Theorem 2.3. IN(f3) of f3 = (f31.J.14).. the empirical logistic transform. if N = 2:7=1 mi.1 and by Proposition 3.16) 1fi) 1]. k m} (V. or Ef3(Z T X) = ZTX. j Thus.3.4.f3p )T is.+ . in (6.14) where W = diag{ mi1fi(l1fi)}kxk.l.4.4 Large Sample Methods for Discrete Data 409 The special case p = 2..4 the Fisher information matrix is I(f3) = ZTWZ (6.. Zi The log likelihood l( 7r(f3)) = == k (1. 7r Using the 8method..4.~ Xi 2 1 +"2 ' (6. where Z = IIZij Ilrnxp is the design matrix. mi 2mi Here the adjustment 1/2mi is used to avoid log 0 and log 00. We let J. The coordinate ascent iterative procedure of Section 2.2 can be used to compute the MLE of f3. that if m 1fi > 0 for 1 ::. I N(f3) = f. Because f3 = (ZTz) 1 ZT 'T] and TJi = 10g[1fi (1 1fi) in TJi has been replaced by 730 is a plugin estimate of f3 where 1fi and (1 1f.Section 6.log(l:'.3. it follows by Theorem 5. Alternatively with a good initial value the NewtonRaphson algorithm can be employed. the solution to this equation exists and gives the unique MLE 73 of f3..
Because Z has rank p, it follows from (6.4.15) and (6.4.16) that β̂_0 is consistent.

Testing

In analogy with Section 6.4.1, linear subhypotheses are important. As in that section, we let ω = {η : η_i = Σ_{j=1}^p z_{ij} β_j, β ∈ R^p} and let r be the dimension of ω. We want to contrast ω to the case where there are no restrictions on η; that is, we set Ω = R^k and consider η ∈ Ω. In this case the likelihood is a product of independent binomial densities, and the MLEs of π_i and μ_i are X_i/m_i and X_i. To get expressions for the MLEs of π and μ under ω, recall from Example 1.6.8 that the inverse of the logit transform g is the logistic distribution function; thus, the MLE of π_i is π̂_i = g^{-1}(Σ_{j=1}^p z_{ij} β̂_j).

The LR statistic 2 log λ for testing H : η ∈ ω versus K : η ∈ Ω - ω is denoted by D(X, μ̂), where μ̂ is the MLE of μ for η ∈ ω:

    D(X, μ̂) = 2 Σ_{i=1}^k [X_i log(X_i/μ̂_i) + X_i' log(X_i'/μ̂_i')]                    (6.4.17)

where X_i' = m_i - X_i and μ̂_i' = m_i - μ̂_i. D(X, μ̂) measures the distance between the fit μ̂ based on the model ω and the data X. It can be shown (Problem 6.4.13) that D(X, μ̂) has asymptotically a χ²_{k-r} distribution for η ∈ ω as m_i → ∞, i = 1, ..., k, k < ∞.

If ω_0 is a q-dimensional linear subspace of ω with q < r, then we can form the LR statistic for H : η ∈ ω_0 versus K : η ∈ ω - ω_0,

    2 log λ = 2 Σ_{i=1}^k [X_i log(μ̂_i/μ̂_{0i}) + X_i' log(μ̂_i'/μ̂_{0i}')]                    (6.4.18)

where μ̂_0 is the MLE of μ under H and μ̂_{0i}' = m_i - μ̂_{0i}. By Problem 6.4.13, this statistic has asymptotically a χ²_{r-q} distribution as m_i → ∞.
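The following sketch, assuming numpy and with counts and fitted values that are invented for illustration, computes the deviance (6.4.17) of grouped binomial data to a fitted vector μ̂; under the model it would be referred to the χ²_{k-r} distribution, and differences of such deviances give statistics of the form (6.4.18).

import numpy as np

def deviance(x, m, mu):
    # D(X, mu-hat) of (6.4.17); groups with X_i = 0 or X_i = m_i contribute only the other summand.
    x, m, mu = map(np.asarray, (x, m, mu))
    xp, mup = m - x, m - mu
    t1 = np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0) / mu), 0.0)
    t2 = np.where(xp > 0, xp * np.log(np.where(xp > 0, xp, 1.0) / mup), 0.0)
    return 2.0 * np.sum(t1 + t2)

x = np.array([2., 6., 15., 18., 17.])                  # successes (illustrative)
m = np.array([20., 25., 30., 25., 20.])                # trials (illustrative)
mu_fit = np.array([2.7, 6.4, 13.9, 17.2, 17.8])        # hypothetical fitted m_i * pi-hat_i from a 2-parameter fit
print("D(X, mu-hat) =", round(deviance(x, m, mu_fit), 3), "  df = k - r =", 5 - 2)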
Here is a special case.

Example 6.4.1. The Binomial One-Way Layout. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of m_i patients and recording the number X_i of patients that recover. For a second example, suppose we want to compare k different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. We obtain k independent samples, one from each location, and for the ith location count the number X_i among m_i that has the given attribute. The samples are collected independently and we observe X_1, ..., X_k independent with X_i ~ B(π_i, m_i), i = 1, ..., k.

This model corresponds to the one-way layout of Section 6.1, and as in that section, an important hypothesis is that the populations are homogeneous. Thus, we test H : π_1 = π_2 = ... = π_k = π, π ∈ (0, 1), versus the alternative that the π's are not all equal. Under H the log likelihood in canonical exponential form is

    β T - N log(1 + exp{β}) + Σ_{i=1}^k log(m_i choose X_i)

where T = Σ_{i=1}^k X_i, N = Σ_{i=1}^k m_i, and β = log[π/(1 - π)]. It follows from Theorem 2.3.1 that if 0 < T < N the MLE exists and is the solution of (6.4.13). Using Z as given in the one-way layout of Section 6.1, we find that the MLE of π under H is π̂ = T/N. The LR statistic is given by (6.4.18) with μ̂_{0i} = m_i π̂. The Pearson statistic

    χ² = Σ_{i=1}^k (X_i - m_i π̂)² / [m_i π̂ (1 - π̂)]

is a Wald statistic, and the χ² test is equivalent asymptotically to the LR test (Problem 6.4.15).

Summary. We used the large sample testing results of Section 6.3 to find tests for important statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's χ²," which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H. When the hypothesis is that the multinomial parameter is in a q-dimensional subset of the (k - 1)-dimensional parameter space Θ, the Rao statistic is again of the Pearson χ² form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, we give explicitly the MLEs and χ² test. Finally, we considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates. We derive the likelihood equations, discuss algorithms for computing MLEs, and give the LR test. In the special case of testing equality of k binomial parameters, the Pearson statistic is shown to have a simple intuitive form.

6.5 GENERALIZED LINEAR MODELS

In Sections 6.1 and 6.4.3 we considered experiments in which the mean μ_i of a response Y_i is expressed as a function of a linear combination

    ξ_i = z_i^T β = Σ_{j=1}^p z_{ij} β_j

of covariate values. In particular, in the case of a Gaussian response, μ_i = ξ_i, whereas in the binary response case of Section 6.4.3, μ_i = π_i = g^{-1}(ξ_i), where g^{-1}(y) is the logistic distribution function.
McCullagh and Nelder (1983, 1989) synthesized a number of previous generalizations of the linear model, most importantly the log linear model developed by Goodman and Haberman. See Haberman (1974).

The generalized linear model with dispersion depending only on the mean

The data consist of an observation (Z, Y), where Y = (Y_1, ..., Y_n)^T is n × 1 and Z^T_{p×n} = (z_1, ..., z_n) with z_i = (z_{i1}, ..., z_{ip})^T nonrandom, and Y has density

    p(y, η) = exp{η^T y - A(η)} h(y)                    (6.5.1)

where η is not in E, the natural parameter space of the n-parameter canonical exponential family (6.5.1), but in a subset of E obtained by restricting η_i to be of the form

    η_i = h(Σ_{j=1}^p z_{ij} β_j)

where h is a known function. As we know from Corollary 1.6.1, the mean μ of Y is related to η via μ = Ȧ(η). Typically, A(η) = Σ_i A_0(η_i) for some A_0, in which case μ_i = A_0'(η_i). Note that if Ȧ is one-one, μ determines η and thereby Var(Y) = Ä(η). We assume that there is a one-to-one transform g(μ) of μ, called the link function, such that

    g(μ) = Σ_{j=1}^p β_j z^{(j)} = Zβ

where z^{(j)} = (z_{1j}, ..., z_{nj})^T is the jth column vector of Z. Typically, g(μ) is of the form (g(μ_1), ..., g(μ_n))^T, in which case g is also called the link function.

Canonical links

The most important case corresponds to the link being canonical, that is, g = Ȧ^{-1}, or

    η = Σ_{j=1}^p β_j z^{(j)} = Zβ.

In this case, the GLM is the canonical subfamily of the original exponential family generated by Z^T Y, which is p × 1.

Special cases are:

(i) The linear model with known variance. The Y_i are independent Gaussian with known variance 1 and means μ_i = Σ_{j=1}^p z_{ij} β_j. The model is GLM with canonical link g(μ) = μ, the identity.

(ii) Log linear models.
Suppose, for example, that (Y_1, ..., Y_p)^T is M(n, θ_1, ..., θ_p), Σ_{j=1}^p θ_j = 1, θ_j > 0, 1 ≤ j ≤ p, and that η_j = log θ_j, 1 ≤ j ≤ p, are canonical parameters. The models we obtain are called log linear; see Haberman (1974) for an extensive treatment. The log linear label is also attached to models obtained by taking the Y_i independent Bernoulli(θ_i), 1 ≤ i ≤ p, 0 < θ_i < 1, with canonical link g(θ) = log[θ(1 - θ)^{-1}]. This is just the logistic linear model of Section 6.4.3.

Suppose, for example, that Y = ||Y_ij||, 1 ≤ i ≤ a, 1 ≤ j ≤ b, so that Y_ij is the indicator of, say, classification i on characteristic 1 and j on characteristic 2. Then θ = ||θ_ij||, 1 ≤ i ≤ a, 1 ≤ j ≤ b, and the log linear model corresponding to log θ_ij = β_i + β_j, where the β_i, β_j are free (unidentifiable) parameters, is that of independence θ_ij = θ_{i+} θ_{+j}, where θ_{i+} = Σ_{j=1}^b θ_ij and θ_{+j} = Σ_{i=1}^a θ_ij. See Haberman (1974) for a further discussion.

Algorithms

If the link is canonical, then by Theorem 2.3.1, if maximum likelihood estimates exist they necessarily uniquely satisfy the equation

    Z^T Y = Z^T E_β Y = Z^T Ȧ(Zβ)                    (6.5.2)

or

    Z^T (Y - Ȧ(Zβ)) = 0.

It is interesting to note that (6.5.2) can be interpreted geometrically in somewhat the same way as in the Gaussian linear model: the "residual" vector Y - μ(β̂) is orthogonal to the column space of Z. But, in general, μ(β̂) is not a member of that space.

The coordinate ascent algorithm can be used to solve (6.5.2) (or ascertain that no solution exists). With a good starting point β_0 one can achieve faster convergence with the Newton-Raphson algorithm of Section 2.4.3. In this situation, and more generally even for noncanonical links, Newton-Raphson coincides with Fisher's method of scoring described in Problem 6.5.1. When the link is canonical, that procedure is just

    β̂_{m+1} = β̂_m + [Z^T W_m Z]^{-1} Z^T (Y - Ȧ(Zβ̂_m))                    (6.5.3)

where W_m = Ä(Zβ̂_m). If the initial value β̂_0 is close enough to the
true value of β, then with probability tending to 1 the algorithm converges to the MLE if it exists. Let Δ_{m+1} ≡ β̂_{m+1} - β̂_m, which satisfies the equation

    [Z^T W_m Z] Δ_{m+1} = Z^T (Y - Ȧ(Zβ̂_m)).                    (6.5.4)

That is, the correction Δ_{m+1} is given by the weighted least squares formula (2.2.20) when the data are the residuals from the fit at stage m, the variance-covariance matrix is W_m, and the regression is on the columns of W_m Z (see the problems). In this context, the algorithm is also called iterated weighted least squares.

Testing in GLM

Testing hypotheses in GLM is done via the LR statistic. As in the linear model, we can define the biggest possible GLM M of the form (6.5.1), for which p = n. In that case the MLE of μ is μ̂_M = (Y_1, ..., Y_n)^T (assume that Y is in the interior of the convex support of {y : p(y, η) > 0}). Write η(·) for Ȧ^{-1}. We can think of the test statistic

    2 log λ = 2[l(Y, η(Y)) - l(Y, η(μ_0))]                    (6.5.5)

for the hypothesis that μ = μ_0 within M as a "measure" of (squared) distance between Y and μ_0. This quantity, which is always ≥ 0, is called the deviance between Y and μ_0, written D(Y, μ_0); this interpretation is the source of the name. For the Gaussian linear model with known variance σ_0², D(Y, μ_0) = |Y - μ_0|²/σ_0² (see the problems).

The LR statistic for H : μ ∈ ω_0 is just

    D(Y, μ̂_0) ≡ inf{D(Y, μ) : μ ∈ ω_0}

where μ̂_0 is the MLE of μ in ω_0. The LR statistic for H : μ ∈ ω_0 versus K : μ ∈ ω_1 - ω_0 with ω_1 ⊃ ω_0 is

    D(Y, μ̂_0) - D(Y, μ̂_1)

where μ̂_1 is the MLE under ω_1. If ω_0 ⊂ ω_1 we can write

    D(Y, μ̂_0) = D(Y, μ̂_1) + Δ(μ̂_0, μ̂_1),                    (6.5.6)

a decomposition of the deviance between Y and μ̂_0 as the sum of two nonnegative components, the deviance of Y to μ̂_1 and Δ(μ̂_0, μ̂_1), each of which can be thought of as a squared distance between its arguments. Unfortunately Δ ≠ D generally, except in the Gaussian case. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6.1.
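A compact sketch, assuming numpy and using invented counts, of the iteration described above for a Poisson GLM with its canonical log link, where A_0(η) = e^η, so that μ = Ȧ(Zβ) = exp(Zβ) and the weights are W = diag(exp(Zβ)). The final lines check the likelihood equations (6.5.2) and evaluate the Poisson deviance of the data to the fit.

import numpy as np

y = np.array([2., 3., 6., 7., 12., 15.])               # illustrative counts
Z = np.column_stack([np.ones(6), np.arange(6.)])       # n x p design matrix

# A crude but serviceable starting point, in the spirit of the empirical transform of Section 6.4.3.
beta = np.linalg.solve(Z.T @ Z, Z.T @ np.log(y + 0.5))

for _ in range(50):
    mu = np.exp(Z @ beta)                              # Adot(Z beta)
    W = np.diag(mu)                                    # Addot(Z beta)
    delta = np.linalg.solve(Z.T @ W @ Z, Z.T @ (y - mu))   # the correction of (6.5.4)
    beta = beta + delta
    if np.max(np.abs(delta)) < 1e-10:
        break

mu = np.exp(Z @ beta)
print("MLE beta-hat       :", beta)
print("Z^T Y  vs  Z^T mu  :", Z.T @ y, Z.T @ mu)       # equal at the MLE, per (6.5.2)
print("deviance D(Y, mu)  :", 2.0 * np.sum(y * np.log(y / mu) - (y - mu)))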
..2.q. f3p ). which we temporarily assume known.3.2. 1.5. 6. (Zn.5. and (6.5 Generalized Linear Models 415 Formally if Wo is a GLM of dimension p and WI of dimension q with canonical links. .3).2. .0.3 applies straightforwardly in view of the general smoothness properties of canonical exponential families. y . . and 6. asymptotically exists.2 and 6. which.. ... f3d+l... For instance.5.7) More details are discussed in what follows. (3))Z. can be estimated by f = :EA(Z{3) where :E is the sample variance matrix of the covariates. This can be made precise for stochastic GLMs obtained by conditioning on ZI.3.5. . if we take Zi as having marginal density qo. Zn in the sample (ZI' Y 1 ).1 . (Zn' Y n ) has density (6.. . we obtain If we wish to test hypotheses such as H : 131 = . .. then ll(fio. (Zn' Y n ) can be viewed as a sample from a population and the link is canonical.10) is asymptotically X~ under H.3 hold (Problem 6.8) This is not unconditionally an exponential family in view of the Ao (z. Y n ) from the family with density (6.9) What is ]1 ((3)? The efficient score function a~i logp(z . Ao(Z. there are easy conditions under which conditions of Theorems 6. is consistent with probability 1. . Similar conclusions r '''''''. fil) is thought of as being asymptotically X. ••• . Asymptotic theory for estimates and tests If (ZI' Y 1 ). However. so that the MLE {3 is unique. d < p.. we can calculate i=l where {3 H is the (p xl) MLE for the GLM with {3~xl = (0.5. then (ZI' Y1 ). if we assume the covariates in logistic regression with canonical link to be stochastic. . Thus. and can conclude that the statistic of (6. = f3d = 0. the theory of Sections 6. (3) term. (3) is (Yi and so.. in order to obtain approximate confidence procedures. .Section 6 . .
follow for the Wald and Rao statistics. Note that these tests can be carried out without knowing the density q_0 of Z_1. These conclusions remain valid for the usual situation in which the Z_i are not random, but their proof depends on asymptotic theory for independent nonidentically distributed variables, which we postpone to Volume II.

General link functions

Links other than the canonical one can be of interest. For instance, if in the binary data regression model of Section 6.4.3 we take g(μ) = Φ^{-1}(μ), so that π_i = Φ(z_i^T β), we obtain the so-called probit model. Cox (1970) considers the variance stabilizing transformation, which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. As he points out, the results of analyses with these various transformations over the range .1 ≤ μ ≤ .9 are rather similar. From the point of view of analysis for fixed n, noncanonical links can cause numerical problems because the models are now curved rather than canonical exponential families. Existence of MLEs and convergence of algorithm questions all become more difficult, and so canonical links tend to be preferred.

The generalized linear model

The GLMs considered so far force the variance of the response to be a function of its mean. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6.5.1) depend on an additional scalar parameter τ. It is customary to write the model as, for c(τ) > 0,

    p(y, η, τ) = exp{c^{-1}(τ)(η^T y - A(η))} h(y, τ).                    (6.5.11)

Because ∫ p(y, η, τ) dy = 1, then

    A(η)/c(τ) = log ∫ exp{c^{-1}(τ) η^T y} h(y, τ) dy.                    (6.5.12)

The left-hand side of (6.5.12) is of product form A(η)[1/c(τ)], whereas the right-hand side cannot always be put in this form. However, when it can, then it is easy to see that

    E(Y) = Ȧ(η)                    (6.5.13)
    Var(Y) = c(τ) Ä(η)                    (6.5.14)

so that the variance can be written as the product of a function of the mean and a general dispersion parameter. Important special cases are the N(μ, σ²) and gamma (p, λ) families. For further discussion of this generalization see McCullagh and Nelder (1983, 1989).
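The following small sketch, purely an illustration and not a claim from the text, makes concrete Cox's remark above about link functions: over middling success probabilities the logit and probit links give very similar fits once their scales are matched. The factor 1.7 below is an assumed rule-of-thumb rescaling (the logistic distribution's standard deviation is roughly 1.8), not a quantity taken from the text; only the Python standard library is used.

import math
from statistics import NormalDist

Phi = NormalDist().cdf

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

c = 1.7   # assumed scale matching the probit to the logit; values near 1.6-1.8 are commonly used
for t in (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0):
    print(f"t={t:5.1f}   logistic(t)={logistic(t):.3f}   Phi(t/{c})={Phi(t / c):.3f}")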
Summary. We considered generalized linear models defined as canonical exponential models in which the mean vector of the vector Y of responses can be written as a function, called the link function, of a linear predictor of the form Σ_j β_j z^{(j)}, where the z^{(j)} are observable covariate vectors and β is a vector of regression coefficients. We considered the canonical link function, which corresponds to the model in which the canonical exponential model parameter equals the linear predictor. We discussed algorithms for computing MLEs of β. In the random design case, we use the asymptotic results of the previous sections to develop large sample estimation results, confidence procedures, and tests.

6.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS

As we most recently indicated in Section 6.2, the distributional and implicit structural assumptions of parametric models are often suspect. In Example 6.2.2 we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian. We found that if we assume an error density f_0 that is symmetric about 0, the resulting MLEs for β, optimal under f_0, continue to estimate β as defined by (6.2.19) even if the true errors are N(0, σ²) or, in fact, have any distribution symmetric around 0 (Problem 6.6.5). That is, such estimates remain consistent and asymptotically normal for β over the semiparametric model

    P_1 = {P : (Z^T, Y)^T ~ P given by (6.2.19) with ε_i i.i.d. with density f for some f symmetric about 0},

whose further discussion we postpone to Volume II.

There is another semiparametric model,

    P_2 = {P : (Z^T, Y)^T ~ P satisfying E_P(Y | Z) = Z^T β, E_P(Z^T Z) nonsingular, E_P Y² < ∞}.

For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II. Furthermore, if P_3 is the nonparametric model in which we assume only that (Z, Y) has a joint distribution, and if we are interested in estimating the best linear predictor μ_L(Z) of Y given Z, then the LSE of β is, under further mild conditions on P, still a consistent asymptotically normal estimate of β, and, in fact, so is any estimate solving the equations based on (6.2.31) with f_0 symmetric about 0. Roughly speaking, if one assumes the (Z_i, Y_i) are i.i.d. and one is interested in μ_L(Z), the right thing to do is "act as if the model were the one given in Example 6.2.1." Of course, the LSE is not the best estimate of β for estimating μ_L(Z) in a submodel of P_3 with

    Y = Z^T β + a_0(Z) ε                    (6.6.1)

where ε is independent of Z but a_0(Z) is not constant and a_0 is assumed known. These issues, and the fixed covariate exact counterpart of these results, the Gauss-Markov theorem, are discussed below. Another, even more important, set of questions, having to do with selection between nested models of different dimension, is touched on in the problems but otherwise also postponed to Volume II.
Robustness in Estimation

We drop the assumption that the errors ε_1, ..., ε_n in the linear model are i.i.d. N(0, σ²). Instead we assume the Gauss-Markov linear model

    Y = Zβ + ε,   E(ε_i) = 0,   Var(ε_i) = σ²,   Cov(ε_i, ε_j) = 0 for i ≠ j,                    (6.6.2)

where Z is an n × p matrix of constants, β is p × 1, and Y, ε are n × 1. Many of the properties stated for the Gaussian case carry over to the Gauss-Markov case:

Proposition 6.6.1. Suppose the Gauss-Markov linear model (6.6.2) holds. Then, as in Theorems 6.1.2(iv) and 6.1.3(iv), β̂ and μ̂ are still unbiased,

    Var(μ̂) = σ² H,   Var(β̂) = σ² (Z^T Z)^{-1},   Var(ε̂) = σ² (I - H),

and, if p = r, the conclusions (1) and (2) of Theorem 6.1.4 are still valid.

The Gauss-Markov theorem. Suppose the Gauss-Markov linear model (6.6.2) holds. Then, for any parameter of the form α = Σ_{i=1}^n a_i μ_i for some constants a_1, ..., a_n, the estimate α̂ = Σ_{i=1}^n a_i μ̂_i has uniformly minimum variance among all unbiased estimates linear in Y_1, ..., Y_n.

Proof. Let α̃ = Σ_{i=1}^n d_i Y_i stand for any unbiased estimate linear in Y_1, ..., Y_n, so that E(α̃) = Σ_{i=1}^n a_i μ_i. Because Cov(Y_i, Y_j) = Cov(ε_i, ε_j),

    Var_GM(α̃) = Σ_{i=1}^n d_i² Var(ε_i) + 2 Σ_{i<j} d_i d_j Cov(ε_i, ε_j) = σ² Σ_{i=1}^n d_i²,

where Var_GM refers to the variance computed under the Gauss-Markov assumptions. This computation shows that Var_GM(α̃) = Var_G(α̃), where Var_G(α̃) stands for the variance computed under the Gaussian assumption that ε_1, ..., ε_n are i.i.d. N(0, σ²). Under the Gaussian assumption α̂ is UMVU, so Var_G(α̂) ≤ Var_G(α̃) for all unbiased α̃. Because Var_G = Var_GM for all linear estimators, the result follows.

Note that the preceding result and proof are similar to those of Section 1.4, where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case. In the regression example, our current μ̂ coincides with the empirical plug-in estimate of the optimal linear predictor (1.4.14) (see the problems). The optimality of the estimates μ̂_i, β̂_j of Section 6.1, and of the LSE in general, thus still holds when they are compared to other linear estimates.

Example 6.6.1. One Sample (continued). Here μ = β_1 and μ̂ = β̂_1 = Ȳ. The Gauss-Markov theorem shows that Ȳ, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with E Y_i² < ∞.
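A small simulation, assuming numpy and purely illustrative, of what the theorem asserts in the one-sample case: the variance of Ȳ is no larger than that of any other linear unbiased estimate Σ a_i Y_i, and this comparison does not depend on the shape of the error distribution, only on its first two moments. The weights and sample sizes below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, reps, mu = 10, 20000, 1.0
a = np.linspace(0.1, 1.9, n)
a = a / a.sum()                                   # unequal weights summing to 1: another linear unbiased estimate

errors = {"Gaussian": lambda: rng.normal(0.0, 1.0, (reps, n)),
          "centered exponential": lambda: rng.exponential(1.0, (reps, n)) - 1.0}
for name, draw in errors.items():
    Y = mu + draw()
    print(f"{name:22s}  Var(Ybar) = {Y.mean(axis=1).var():.4f}   Var(sum a_i Y_i) = {(Y @ a).var():.4f}")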
. Remark 6.3) .Ill}.2. From the theory developed in Section 6.. 1 (Zn.ennip. Suppose we have a heteroscedastic 0.and twosample problems using asymptotic and Monte Carlo methods. P)) with :E given by (6. a major weakness of the GaussMarkov theorem is that it only applies to linear estimates.:. There is an important special case where all is well: the linear model we have discussed in Example 6.ti". y E R. Il E R.5) .P) i= II (8 0 ) in general. which is the unique solution of ! and w(x.2.:lrarnetlric Models 419 sample n large.2 are UMVU.2 we know that if '11(" 8) = Die.6. Thus. ~('11. 8)dP(x) ° = 80 (6..1.2. for instance. the asymptotic behavior of the LR. Example 6. Y is unbiased (Problem 3.3 we investigated the robustness of the significance levels of t tests for the one.6 Robustness "" ."n". Because ~ ('lI .d. the asymptotic distribution of Tw is not X. Now we will use asymptotic methods to investigate the robustness of levels more generally.6.. The 6method implies that if the observations Xi come from a distribution P.6..8(P)) ~ N(O. Y n ) are d. as (Z.4.6. then Tw ~ VT I(8 0 )V where V '" N(O. ~(w. 8) and Pis true we expect that 8n 8(P).4) and the Wald test statis8 0 evidently we have i= Tw But if 8(P) = 8 0 ..6.4. P)). our estimates are estimates of Section 2.ii(8 .Section 6.. A > O. . but Var( Ei) depends on i. Suppose (ZI' Yd. consider H : 8 tics Tw = n(e ( 0 )T I(8 0 )(e ( 0 ). This observation holds for the LR and Rao tests as wellbut see Problem 6.6. If the are version of the linear model where E( €) known. 0. and . when the not optimal even in the class of linear estimates. Y has a larger variance (Problem 6. The Linear Model with Stochastic Covariates. For this density and all symmetric densities.6).. which does not belong to the model.2.. 00. we can use the GaussMarkov theorem to conclude that the weighted least squares are unknown.1. aT Robustness of Tests In Section 5. Wald and Rao tests depends critically on the asymptotic behavior of the underlying MLEs 8 and 80 .6. Another weakness is that it only applies to the homoscedastic case where Var(1'i) is the same for all i.6. Y) where Z is a (p x 1) vector of random covariates and we model the relationship between Z and Y as (6. As seen in Example 6. 0. .. However.12). If 8(P) (6.5) than the nonlinear estimate Y median when the density of Y is the Laplace density 1 2A exp{ Aly .
. n lZT Z (n) (n) so that the confidence ~ ~ ZiZ'f n ~ 1 (6. E p E2 < 00. and Rao tests all are equivalent to the F test: Reject if Tn = (3 Z(n)Z(n)(3/ S 2 !p.2.np(l a) .. hence.q+ 1. (12).420 Inference in the M . Y) such that Eand Z are independent. by the law of large numbers.a) where _ ""T T  2 (6. it is still true that if qi is given by (6.6) (3 is the LSE. Zen) S (Zb .np(1 . It is intimately linked to the fact that even though the parametric model on which the test was based is false.p or more generally (3 E £0 + (30' a qdimensional affine subspace of RP.. the test (6.. say. and we consider. .7) and (6.t. X.6.r::nn".3) equals (3 in (6. !ltin:::.6.P i=l n p  Z(n)(31 .2. when E is N(O.2.8) Moreover. Then. (i) the set of P satisfying the hypothesis remains the same and (ii) the (asymptotic) variance of (3 is the same as under the Gaussian model. procedures based on approximating Var(.6.2) the LR.5) and (12(p) = Varp(E). (Jp (Jo. Wald. This kind of robustness holds for H : (Jq+ 1 (Jo.t xp(l a) by Example 5.r Case 6 with the distribution P of (Z.. EpE 0.7.9) i=l n1Zfn)Z(n)S2 / Vii are still asymptotically of correct level.6. .1 and 6. the limiting distribution of is Because !p. under H : (3 = 0. For instance. (see Examples 6. It follows by Slutsky's theorem that....6.  2 Now..6. it is still true by Theorem 6. H : (3 = O.Zn)T and  2 1 ~ " 2 1 ~(Yi . Thus.Yi) = IY n .30) then (3(P) specified by (6. even if E is not Gaussian.B) by where W pxp is Gaussian with mean 0 and.6.3.5) has asymptotic level a even if the errors are not Gaussian.1 that (6.. .2.
. we assume variances are heteroscedastic.. V2 VarpX{I) (~2) Xz in general. (6.6...11) i=1 &= v 2n ! t(xP) + X?»). It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general.13) but & V2Ep(X(I») ::f.Section 6.14) o Example 6.. E (i. X(2») where X(I)..5) are still meaningful but (and Z are dependent. Suppose without loss of generality that Var( (') = L Then (6. X(2) are independent E of paired pieces of equipment..4. let the distribution of ( given Z = z be that of a(z)(' where (' is independent of Z.6. If we take H : Al . Simply replace (12 by (J ~2 = ~ ~ { ( ~1) n~ 2=1 Xz _ (X(I) + X(2»))2 2 + _ (X(1) + X(2»))2} 2 .10) does not have asymptotic level a in general. the theory goes wrong. then it's not clear what H : (J = (Jo means anymore. However suppose now that X(I) 101 and X(2) 102 are identically distributed but not exponential.6. 2  (th). Suppose E(E I Z) = 0 so that the parameters f3 of (6.6.1 Vn(~ where (3) + N(O.3.6 Robustness Properties and Semiparametric Models 421 If the first of these conditions fails. Then H is still meaningful. XU) and X(2) are identically distributed. That is..10) nL and ~ ~ X(j) 2 (6.6.3.17) is Suppose our model is that X "Reject iff n (0 1 where &2 2 O) > x 1 (1  .6. But the test (6.6.12) The conditions of Theorem 6.) respectively.3. If it holds but the second fails.6.6. the lifetimes A2 the standard Wald test (6. For simplicity.2.2 clearly hold. We illustrate with two final examples. The Linear Model with Stochastic Covariates with E and Z Dependent. To see this.6. .15) . (6.a)" (6. Q) (6. note that.7) fails and in fact by Theorem 6. Example 6.6. i=1 (6. (X(I). under H. The TwoSample Scale Problem.
6. Compare this bound to the variance of 8 2 • . provided we restrict the class of estimates to linear functions of Y1 . 1967). and are uncorrelated.Ad)T where (I1. then the MLE (iil.d. . where Qis a consistent estimate of Q..6.. Wald.6. use Theorem 3. TJr..6) by Q1.11 . and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.9. . 6.. . n 1 L..1 are still approximately valid as are the LR. j ::. = {3d = 0 fail to have correct asymptotic levels in general unless (52 (Z) is constant or A1 = .6) and. (52) assumption on the errors is replaced by the assumption that the errors have mean zero. identical variances.4.11) with both 11 and (52 unknown. The GaussMarkov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i. . For the canonical linear Gaussian model with (52 unknown. . ••• . .1 1. Yn .7 PROBLEMS AND COMPLEMENTS Problems for Section 6. In the linear model with a random design matrix. LR.6. (i) the MLE does not exist if n = r.. . In partlCU Iar. (52) are still reasonable when the true error distribution is not Gaussian. 1 ::. the methods need adjustment. the confidence procedures derived from the normal model of Section 6.3. ..4. in general.. and Rao tests.. . It is easy to see that our tests of H : {32 = . A solution is to replace Z(n)Z(n)/ 8 2 in (6.. 0 < Aj < 1.i.JL • ""n 2. Specialize to the case Z = (1. ""2 _ ""12 (TJ1. We considered the behavior of estimates and tests when the model that generated them does not hold.A1. Show that in the canonical exponential model (6.1.16) Summary. In particular. Ur. (5 . .3 to compute the information lower bound on the variance of an unbiased estimator of (52..422 Inference in the Multiparameter Case Chapter 6 (Problem 6.1. .Id . The simplest solution at least for Wald tests is to use as an estimate of [Var y'n8]1 not 1(8) or ~D2ln(8) but the socalled "sandwich estimate" (Huber.6.6) does not have the correct level. In this case.fiT) o:2)T of U 2)T . (6. we showed that the MLEs and tests generated by the model where the errors are U. N(O. then the MLE and LR procedures for a specific model will fail asymptotically in the wider model."i=r+ I i ..n l1Y . in general.Id) has a multinomial (A1.Ad) distribution.. d. We also demonstrated that when either the hypothesis H or the variance of the MLE is not preserved when going to the wider model. . This is the stochastic version of the dsample model of Example 6. = Ad = 1/ d (Problem 6. 0 To summarize: If hypotheses remain meaningful when the model is false then. and Rao tests need to be modified to continue to be valid asymptotically.6). N(O... the test (6. (5 2)T·IS (U1. Wald. . . For d = 2 above this is just the asymptotic solution of the BehrensFisher problem discussed in Section 4. the twosample problem with unequal sample sizes and variances. (ii) if n 2: r + 1.d.
1] is a known constant and are i. = (Zi2.1.. Let 1.'''' Zip)T.13. .l+c (_c)j+l)/~ (1. {Ly = /31.(_C)i)2 l+c L i=l (a) Show that Bis the weighted least squares estimate of (). 1 n f1.1. i = 1.d... . . .29) for the noncentrality parameter 82 in the oneway layout. of 7.i.2. i = 1..2 known case with 0.2 with p = r coincides with the empirical plugin estimate of JLL = ({LL1. 1 {LLn)T. and (3 and {Lz are as defined in (1... c E [0..2 replaced by 0: 2 • 6. t'V (b) Show that if ei N(O. where IJ. N(O. 0. zi = (Zi21"" Zip)T. see Problem 2.1..a)% confidence interval for /31 in the Gaussian linear model is Z2)2 ] . Yl). . fO 0.n. i eo = 0 (the €i are called moving average errors. then B the MLE of (). 4. 9. Show that Ii of Example 6.29).1. and the 1.4. .Li = {Ly+(zi {Lz)f3. .4 • 3. Derive the formula (6. Show that in the regression example with p = r = 2.. Find the MLE B (). (e) Show that Var(B) < Var(Y) unless c = O. . Yn ) where Z. .Section 6. n..28) for the noncentrality parameter ()2 in the regression example. Derive the formula (6. 5. is (c) Show that Y and Bare unbiased."" (Z~.1. Show that >:(Y) defined in Remark 6. Here the empirical plugin estimate is based on U. . (Zi.d.5) Yi () + ei.14). n. 0. J =~ i=O L)) c i (1.2 . . .2 ).. (d) Show that Var(O) ~ Var(Y). . Let Yi denote the response of a subject at time i. where a..n ~ where fi can be written as fi = ceil + ei for given constant c satisfying 0 ~ c are independent identically distributed with mean zero and variance 0. Var(Uf) = 20. i = 1. Consider the model (see Example 1. i = 1. the 100(1 .22..7 Problems and Complements 423 Hint: By A. fn where ei = ceil + fi. Suppose that Yi satisfies the following model Yi ei = () + fi. 8.2 ).2 coincides with the likelihood ratio statistic A(Y) for the 0..
Yn )..p (1 ..Z.a)% confidence intervals for a and 6k • 12.e .• 424 10. Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6.2.1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::.• statistics HYl..a) confidence intervals for linear functions of the form {3j .. = np n/2(p 1).a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z. . n2 = .{3i are given by = ~(f. ..I). l(Yh . Assume the linear regression model with p future observation Y to be taken at the pont z. We want to predict the value of a 14. where n is even. then Var(8k ) is minimized by choosing ni = n/2.1 and variance a 2 . which is the amount . ~ 1 . ••..2)(Zj2 . Consider the three estimates T1 = and T3 Y. Yn be a sample from a population with mean f. .p (~a) (n . (n . Note that Y is independent of Yl.:1 Yi + (3/ 2n) L~= ~ n+ 1 }i.1. 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13. 15.p)S2 /x n . (Zi2 Z.p)S2 /x n . .. Show that for the estimates (a) Var(a) = +==_=:. 'lj. Yn . Let Y 1 . Consider a covariate x. (d) Give the 100(1 . a 2 ::. Y ::.2) L(Zi2 . (b) Find confidence intervals for 'lj.2)2 a and 8k in the oneway layout Var(8k ) .. Often a treatment that is beneficial in small doses is harmful in large doses. The following model is useful in such situations.. ~(~ + t33) t31' r = 2. (a) Show that level (1 . ::. ..k)' C (b) If n is fixed and divisible by p.2).. then Var( a) is minimized by choosing ni = n / p.. In the oneway layout. (a) Find a level (1 .":=::.a).Z. (c) If n is fixed and divisible by 2(p . (b) Find a level (1 a) prediction interval for Y (i. = 2(Y .~ L~=1 nk = (p~:)2 + Lk#i . Yn ) such that P[t. T2 1 (1/ 2n) L. then the hat matrix H = (h ij ) is given by 1 n 11. .
2 1.77 167.emEmts 425 . P.r2 )l 1 2 7 > O.0'2)(x ) + E<t'(JL.I::llogxi' You may use X = 0.A6 when 8 = ({L. •. hence. p. (32.1) + 2<t'CJL. : {L E R. a 2 <7 2 . compute confidence intervals for (31. Assume the model = (31 + (32 X i + (33 log Xi + Ei. ( 2 ). Check AO. ..7 Probtems and Lornol.• . Yi) and (Xi. xC' = 0 => x = 0 for any rvector x. But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O. 8) = 2. ( 2 )T is identifiable if and only if r = p. Show that if C is an n x r matrix of rank r. r :::. and a response variable Y.. n.or dose of a treatment.. . n are independent N(O. 1 2 where <t'CJL. (a) For the following data (from Hald. (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL. In the Gaussian linear model show that the parametrization ({3. Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133. then the r x r matrix C'C is of rank r and. .2. .1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL.Section 6. En Xi. fli) where fi1.r2) (x). ( 2 ) density. ( 2 ). 7 2 ) does not exist so that A6 doesn't hold. (b) Show that the MLE of ({L. a 2 > O}. i = 1.95 Yi = e131 e132xi Xf3. and p be as in Problem 6. Y (yield) 3. fi2.0'2) is the N(Il'.. (b) Plot (Xi. 8). Let 8. p(x. . Problems for Section 6. Hint: Because C' is of rank r. 17. (33. Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1. 16. ( 2 ) logp(x. which is yield or production. X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi . nonsingular. 1952. 653).0289. P = {N({L. fi3 and level 0.5 Zi2 * + (33zi2 where Zil = Xi  X. E :::. and Q = P.
(b) The MLE minimizes . that (d) (fj  (3)TZ'f. (6. (e) Construct a method of moment estimate moments which are ~ consistent.2.10 that the estimate Raphson estimates from On is efficient.Pe"H). Y).1) EZ .2. and ..1 show thatMLEs of (3. (P("H) : e E e. Show that if we identify X = (z(n). show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic. In Example 6. 8.  (3)T)T has a (".2. t) is then called a conditional MLE. then ({3. given Z(n).1. hence. . Z i ~O.2. (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4. (a) In Example 6. T 2 ) based on the first two I (d) Deduce from Problem 6. (e) Show that 0:2 is unconditionally independent of (ji. show that ([EZZ T jl )(1. 5. . Establish (6. Hint: (a).2.1.'. n[z(~)~en)jl)' X...2.  I . H abstract.IY .24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then. ._p distribution.fii(ji .2 are as given in (6..2.1/ 2 ). and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ). In Example 6. In Example 6.2. b ..Zen) (31 2 e ~ 7. '".. . In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H. (I) Combine (aHe) to establish (6.20). /1. The MLE of based on (X. I .2. e Euclidean.21). (c) Apply Slutsky's theorem to conclude that and. I I . 6.. 3.2.2. show that the assumptions of Theorem 6.2.2. iT 2 ) are conditional MLEs.24).". T(X) = Zen). . X . (fj multivariate nannal distribution with mean 0 and variance. H E H}.(3) = op(n.i > 1. ell of (} ~ = (Ji.I3). Hint: !x(x) = !YIZ(Y)!z(z). Fill in the details of the proof of Theorem 6.426 Inference in the Multiparameter Case Chapter 6 I .)Z(n)(fj . i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric.2 hold if (i) and (ii) hold.
the <ball about 0 0 . (iv) Dg(O) is continuous at 0 0 . a' . (al Let ii n be the first iterate of the NewtonRapbson algorithm for solving (6.•...p(Xi'O~) n. ao (}2 = (b) Write 8 1 uniquely solves 1 8.p(Xi'O~) + op(1) n i=l n ) (O~ . R d are such that: (i) sup{jDgn(O) . .0 0 1 < <} ~ 0. (iii) Dg(0 0 ) is nonsingular.l exists and uniquely ~ .O) has auniqueOin S(Oo. p is strictly convex. Hint: n . Him: You may use a uniform version of the inverse function theorem: If gn : Rd j. p i=l J.LD. 8~ = 1 80 + Op(n.1' _ On = O~  [ . that is. hence. 10 .2.x (1 + C X ) .Oo) i=cl (1 .LD.1) starting at 0.(XiI') _O.l + aCi where fl.L'l'(Xi'O~) n. Examples are f Gaussian. Let Y1 •. .' . ] t=l 1 n .1/ 2 ). t=l 1 tt Show that On satisfies (6. E R. Suppose AGA4 hold and 8~ is vn consistent. = J. (a) Show that if solves (Y ao is assumed known a unique MLE for L. (logistic).. Show that if BB a unique MLE for B1 exists and 10.7 Problems and Complements 427 9. (b) Show that under AGA4 there exists E > 0 such that with probability tending to 1.3).0 0 ).l real be independent identically distributed Y. (ii) gn(Oo) ~ g(Oo). a > 0 are unknown and ( has known density f > 0 such that if p(x) log f(x) then p" > 0 and.. <).2. and f(x) = e. .Dg(O)j .'2:7 1 'l'(Xi.L 'l'(Xi'O~) n i=l 1 1 =n n L 'l'(Xi. Y.Section 6.
•. < Zn 1 for given covariate values ZI. > 0 such that gil are 1 .12) and the assumptions of Theorem (6. II.(j) l. Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj.(Z" . that Theorem 6.. log Ai = ElI + (hZi. Problems for Section 6.3. i2. ~. 0 < ZI < .3). Zi. N(O. ....26) and (6. . = versus K : O oF 0. 2 ° 2.d. CjIJtlZ..1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1.. minimizing I:~ 1 (Ii 2 i. and ~ . .ip range freely and Ei are i.2. . {Vj} are orthonormal. Show that if 3 0 = {'I E 3 : '1j = 0. )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3.3 i 1. Suppose responses YI . P(Ai). 'I) = p(.. a 2 ).II(Zil I Zi2.6). . Wald. .' " 13..2. •. q(. I '" LJ i=l ~(1) Y. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X./3)' = L i=1 1 CjZiU) n p .2 holds for eo as given in (6. i . hence... = 131 (Zil . • (6. then it converges to that solution.i. Establish (6.. 'I) : 'I E 3 0 } and. t• where {31. 1 Zw Find the asymptotic likelihood ratio... iteration of the NewtonRaphson algorithm starting at 0.. 1 Yn are independent Poisson variables with Yi . and Rao tests for testing H : Ih.2. LJ j=2 Differentiate with respect to {31. Similarly compute the infonnation matrix when the model is written as Y. P I . 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}.(Z"  ~(') z. Suppose that Wo is given by (6. .3.~_. . f.l hold for p(x.Zi )13. Y. Thus.O). . Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution. i=l • Z. Hint: Write n J • • • i I • DY. Zip)) + L j=2 p "YjZij + €i J . .3.12). for "II sufficiently large. . Inference in the Multiparameter Case Chapter 6 > O. there exists a j and their image contains a ball S(g(Oo).O E e.3. Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n .428 then. converges to the unique root On described in (b) and that On satisfies 1 1 • j.27).
. aTB. independent.) O.7 Problems and C:com"p"":c'm"'. Suppose 8)" > 0. IJ. Oil + Jria(00 . . 'I) S(lJo)} then.3.(I(X i . Let e = {B o . = ~ 4.Xn ) be the likelihood ratio statistic. 1 5. ...(X" .l.. (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0). 1 < i < n.. > 0 or IJz > O.) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('. Show that under H.n) satisfy the conditions of Problem 6.. OJ}.) where k(Bo..d.gr.. IJ. .'1) and Tin '1(lJ n) and. {IJ E S(lJo): ~j(lJ) = 0. . Consider testing H : 8 1 = 82 = 0 versus K : 0. Assume that Pel of Pea' and that for some b > 0.. B). N(IJ" 1).'" . even if ~ b = 0.'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio.n 7J(Bo. Testing Simple versus Simple. Suppose that Bo E (:)0 and the conditions of Theorem 6.aq are orthogonal to the linear span of ~(1J0) .3.n:c"' 4c:2=9 3. hence. S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo).2. Consider testing H : (j = 00 versus K : B = B1 . X n Li. . respectively. .) 1(Xi . (ii) E" (a) Let ).3 hold. (b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ . \arO whereal..3 is valid.. Oil and p(X"Oo) = E. with density p(. Vi). =  p(·.Section 6. p . Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo.X.(X1 .. . N(Oz.3. Let (Xi. Xi.. . 1). . (Adjoin to 9q+l.d. q + 1 <j < r S(lJo) . < 00. q('. be ii.2.o log (X 0) PI. 210g.'1 8 0 = and. There exists an open ball about 1J0. q + 1 <j< rl. j = 1. 0 ») is nK(Oo.IJ) where q(. with Xi and Y.. Deduce that Theorem 6. pXd .
0.6. under H. Then xi t.\(Xi.3. Z 2 = poX1Y v' .0.~.. = v'1~c2' 0 <. and Dykstra. = (T20 = 1 andZ 1= X 1. j ~ ~ 6. ±' < i < n) is distributed as a respectively.6.2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6. 1) with probability ~ and U with the same distribution.1. Yi : 1 and X~ with probabilities ~.. 1 < i < n. (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular. and ~ where sin.. 1988). In the model of Problem 5(a) compute the MLE (0..0. ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0.li : 1 <i< n) has a null distribution.0. (b) Suppose Xil Yi e 4 (e) Let (X" Y. ' · H In!: Consl'defa1O .3. Show that 2log. (0.\( Xi. Exhibit the null distribution of 2 log . (iii) Reparametrize as in Theorem 6. 2 e( y'n(ii. (h) : 0 < 0. 7. . < .=0 where U ~ N(O..)) ~ N(O. /:1. Let Bi l 82 > 0 and H be as above.0).3.) have an N. > OJ. Po) distribution and (Xi.. = a + BB where . XI and X~ but with probabilities ~ . Hint: ~ (i) Show that liOn) can be replaced by 1(0). Sucb restrictions are natural if.0"10' O"~o. = 0. we test the efficacy of a treatment on the basis of two correlated responses per individual. I .430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n. Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson. ii.d. < cIJ"O.1. Hint: By sufficiency reduce to n = 1.  OJ. (d) Relate the result of (b) to the result of Problem 4(a). mixture of point mass at 0. Wright.0.2 for 210g. be i. =0. > 0.) if 0.19) holds. are as above with the same hypothesis but = {(£It. Po . > 0.  0. 0. Show that (6.\(X). . Yi : l<i<n). li)..) under the model and show that (a) If 0. • i . 2 log >'( Xi. O > 0. which is a mixture of point mass at O.i. (b) If 0. for instance.
is of the fonn (b) Deduce that X' ~ Z' where Z is given by (6.3. A611 Problems for Section 6.0 < PCB) < 1.21).2.3. A3.4 ~ 1(11) is continuous. 2. hence. 9. Show that under A2.6 is related to Z of (6.10.3.4.3. 1. then Z has a limitingN(011) distribution.1(0 0 ).7 Problems and Complements ~(1 ) 431 8. (a) Show that the correlation of Xl and YI is p = peA n B) .. (b) (6.8) (e) Derive the alternative form (6.4.P(B))· (b) Show that the sample correlation coefficient r studied in Example 5. Hint: Write and apply Theorem 6. (a) Show that for any 2 x 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero. 3. Under conditions AOA6 for (a) and AOA6 with A6 for (a) ~(1 ) i!~1) for (b) establish that [~D2ln(en)]1 is a consistent estimate of 1.8) by Z ~ .nf.4. Show that under AOA5 and A6 for 8 11 where ~(lIo) is given by (6.22) is a consistent estimate of ~l(lIo). 0 < P( A) < 1.P(A)P(B) JP(A)(1 .8) for Z.P(A))P(B)(1 . In the 2 x 2 contingency table model let Xi = 1 or 0 according as the ith individual sampled is an A or A and Yi = 1 or 0 according as the ith individual sampled is a Born. (e) Conclude that if A and B are independent. Exhibit the two solutions of (6. 10.2 to lin .Section 6. Hint: Argue as in Problem 5.4. .4) explicitly and find the one that corresponds to the maximizer of the likelihood.
1 : X 2 (b) How would you.1 II . j = 1.)).TI. C i = Ci. .. .. i = 1.~..D. Hint: (a) Consider the likelihood as a function of TJil. nal ) n12..9) and has approximately a X1al)(bl) distribution under H. j = 1. Suppose in Problem 6. TJj2 ~ = Cj n where R. TJj2 = 2::: a x b contingency table with associated probabilities Bij and 1 Oij. ( where ( . a .. . . 8l! / (8 1l + 812 ))..1. N 22 ) rv M (u.1 .ra) nab .6 that H is true. Fisher's Exact Test From the result of Problem 6. . 811 \ 12 .. n a2 ) .~~. Let N ij be the entries of an let 1Jil = E~=l (}ij. . .4. . CI.4 deduce that jf j(o:) (depending on chosen so that Tl. 432 Inference in the Multiparameter Case Chapter 6 i 4.. n) can be then the test that rejects (conditionally on R I = TI' C 1 = GI) if N ll > j(a) is exact level o. n. N 12 • N 21 . B!c1b!. n R. 7. ( rl. Ti) (the hypergeometric distribution).. TJj2. . Let R i = Nil + N i2 • Ci = Nii + N 2i · Show that given R 1 = TI.4.u. (22 ) as in the contingency table. are the multinomial coefficients.2 is 1t(Ci... .. =T 1.. i = 1. in principle.. C j = L' N'j. b I Ri ( = Ti. I (c) Show that under independence the conditional distribution of N ii given R.54)..~l. 8(r2' 82 I/ (8" + 8. S.4.b1only. Cj = Cj] : ( nll. = Lj N'j.. TJj2 are given by TJil ~ = . 6.. . This is known as Fisher's exact test. (a) Let (NIl. N ll and N 21 are independent 8( r.. use this result to construct a test of H similar to the test with probability of type I error independent of TJil' TJj2? 1 .. 8 21 . (b) Sbow that 812 /(8l! ° + 812 ) ~ 821 /(8 21 + 822 ) iff R 1 and C 1 are independent. (b) Deduce that Pearson's X2 is given by (6. .2. . It may be shown (see Volume IT) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5. nab) A ) = B. Consider the hypothesis H : Oij = TJil TJj2 for all i.~~. (a) Show that the maximum likelihood estimates of TJil.C. R 2 = T2 = n . j. (a) Show that then P[N'j niji i = 1.
if A and C are independent or B and C are independent. ~i = {3. Zi not all equal.BINDEPENDENTGIVENC) n B) = P(A)P(B) (A. (a) If A. and petfonn the same test on the resulting table.(rl + cd or N 22 < ql + n .1""".12 I . Suppose that we know that {3. if and only if. (e) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. 5 Hint: (b) It is easier to work with N 22 • Argue that the Fisher test is equivalent to rejecting H if N 22 > q2 + n .. The following table gives the number of applicants to the graduate program of a small department of the University of California.Section 6.ZiNi > k] = a. 2:f . where Pp~ [2:f . for suitable a. 10. C2). C are three events. 11. but (iii) does not.4.ziNi > k. (i) p(AnB (ii) I C) = PeA I C)P(B I C) (A. consider the assertions. N 22 is conditionally distributed 1t(r2. Give pvalues for the three cases.) Show that (i) and (ii) imply (iii). Then combine the two tables into one. Would you accept Or reject the hypothesis of independence at the 0. Show that.0. n. (b). classified by sex and admission status..4. which rejects. Admit Men Women 1 235 1~35' 38 7 273 42 n = 315 Deny Admit 270 45 Men Women I 122 1'.BINDEPENDENTGNENC) p(AnB I C) ~ peA I C)P(B I C) (A.93'1 215 103 69 172 Deny 225 162 n=387 (d) Relate your results to the phenomenon discussed in (a). 9. = 0 in the logistic model. Establish (6.14).05 level (a) using the X2 test with approximate critical value? (b) using Fisher's exact test of Problem 6. + IhZi. B INDEPENDENT) (iii) PeA (C is the complement of C. and that under H.7 Problems and Complements 433 8. Test in both cases whether the events [being a man] and [being admitted] are independent.5? Admit Deny Men Women 1 19 . and that we wish to test H : Ih < f3E versus K : Ih > f3E.. . B. there is a UMP level a test.(rl + cI). (h) Construct an experiment and three events for which (i) and (ii) hold.
I < i < k are independent.4.4)..OiO)2 > k 2 or < k}.. .(Ni ) < nOiO(1 . Y n ) have density as in (6. I.'Ir(. 1 i Show that for GLM this method coincides with the NewtonRaphson method of Section 2.4) is as claimed formula (2. Suppose the ::i in Problem 6. 15. X k be independent Xi '" N (Oi. 2. E {z(lJ. . I). .LJ2 Z i )). but under K may be either multinomial with 0 #. 13. < f3g in this (c) By conditioning on L~ 1 Xi and using the approach of Problem 6. .. Let Xl. . . . or Oi = Bio (known) i = 1..d. ~ 8 m + [1(8 m )Dl(8 m ).f..5.i. .00 or have Eo(Nd : .8) and.. . Verify that (6. . but Var. • I 1 • • (a)P[Z. mk ~ 00 and H : fJ E Wo is true then the law of the statistic of (6. . .15) is consistent. Nk) ~ M(n. Zi and so that (Zi.4... i Problems for Section 6. 1 .4. for example. n) and simplification of a model under which (N1. . a 2 ) where either a 2 = (known) and 01 . if the design matrix has rank p. . .Ld.4.. Tn. Compute the Rao test statistic for H : (32 case.d.. Use this to imitate the argument of Theorem 6.4.) ~ B(m.. then 130 as defined by (6. Show that the likelihO<Xt ratio test of H : O} = 010 .2.l 1 . Given an initial value ()o define iterates  Om+l .: nOiO... Suppose that (Z" Yj).3) l:::~ I (Xi . · .4. a 2 = 0"5 is of the form: Reject if (1/0".". (a) Compute the Rao test for H : (32 Problem 6.i. (Zn. .Oio)("Cooked data").18) tends 2 to Xrq' Hint: (Xi . This is an approximation (for large k.. which is valid for the i.5 construct an exact test (level independent of (31). with (Xi I Z. Fisher's Method ofScoring The following algorithm for solving likelihood equations was proosed by Fishersee Rao (1973)...)llmi~i(1 'lri). = rn I < /3g and show that it agrees with the test of (b) Suppose that 131 is unknown.. . In the binomial oneway layout show that the LR test is asymptotically equivalent to Pearson's  X2 test in the sense that 2log'\  X2 . .5. " lOk vary freely. Show that if Wo C WI are nested logistic regression models of dimension q < r < k and mI. 010. " Ok = 8kO.. . 434 • Inference in the Multiparameter Case Chapter 6 f • 12... Xi) are i.. in the logistic regression model. Show that.3. 3.11. a5 j J j 1 J I .20) for the regression described after (6.3.z(kJl) ~I 1 . asymptotically N(O.. case. i 16.11 are obtained as realization of i.. . a under H.OkO) under H.5.1 14.5 1. k and a 2 is unknown..
1 = 0 V fJ~ .) and C(T) such that the model for Yi can be written as O.T).) .. ..ztkl} is RP (c) P[ZI ~ z(jl] > 0 for all j. (a) Show that the likelihood equations are ~ i=I L.. distribution.). Give 0. . C(T). J . b(9). j = 1. T. as (Z. . d".(. . Yn be independent responses and suppose the distribution of Yi depends on a covariate vector Zi.5). Show that when 9 is the canonical link. ~ .) and v(. <..5. Suppose Y.Oi)~h(y. Show that the conditions AQA6 hold for P = P{3o E P (where qo is assumed known).. g(p. Hint: By the chain rule a l(y a(3j' 0) = i3l dO d" a~ . . .b(Oi)} p(y. give the asymptotic distribution of y'n({3 . J JrJ 4. and v(. ~ N("" <7. (d) Gaussian GLM..O(z)) where O(z) solves 1/(0) = gI(zT {3). h(y..) = ~ Var(Y)/c(T) b"(O). Give 0..F. P(I")... Assume that there exist functions h(y.. . Show that. . Hint: Show that if the convex support of the conditional distribution of YI given ZI = zU) contains an open interval about p'j for j = 1.T)exp { C(T) where T is known. 00 d" ~ a(3j (b) Show that the Fisher information is Z].i. then the convex support of the conditional distribution of = 1 Aj Yj zU) given Z j = Z (j) .)z'j dC. C(T).5.p. Y follow the model p(y.(3). . (c) Suppose (Z" Y I ). g("i) = zT {3.. 9 = (b')l. ..9). ball about "k=l A" zlil in RP . 05.WZ v where Zv = Ilz'jll is the design matrix and W = diag( WI. has the Poisson. Let YI. h(y. Set ~ = g(. and b' and 9 are monotone.9).= 1 . T).) ~ 1/11("i)( d~i/ d". .. Find the canonical link function and show that when g is the canonical link. Yn ) are i. under appropriate conditions. b(B). (e) Suppose that Y. k.. T.Section 6. the resuit of (c) coincides with (6. and v(.y ..5. . (y..7 Problems and Complements 435 (b) The linear span of {ZII). L.d. Wn). k. T). Y) and that given Z = z. Show that for the Gaussian linear model with known variance D(y.. b(9).. the deviance is 5. your result coincides with (6. contains an open L. In the random design case.). Wi = w(". 1'0) = jy 1'01 2 /<7.. .. (Zn.
Show that the standard Wald test forthe problem of Example 6.10). and Rao tests are still asymptotically equivalent in the sense that if 2 log An. . then the sample median X satisfies ~ f at I I Vri(X where O' (p) 2 yep») ~ N(0. 2. I: .s(t). 7.2. j. W n and R n are computed under the assumption of the Gaussian linear model with a 2 known.6. = 1" (b) Show that if f is N(I'. (a) Show that if f is symmetric about 1'. t E R.1.15) by verifying the condition of Theorem 6. hence..7. (6. In the hinary data regression model of Section 6.6 1. the infonnation bound and asymptotic variance of Vri(X 1'). Apply Theorem 6. then v( P) . if P has a positive density v(P). and R n are the corresponding test statistics. • I 3. Consider the linear model of Example 6. l 1 . . I 4. O' 2(p)/0'2 = 2/1r. Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation.4. P. I "j 5.14) is a consistent estimate of2 VarpX(l) in Example 6. Suppose ADA6 are valid. . if VarpDl(X. i 6.d. then 0'2(P) < 0'2. . W n . 0'2). the unique median of p. Wn Wn + op(l) + op(I).p under the sole assumption that E€ = 0. set) = 1.Xn are i. Note: 2 log An. then it is. Suppose Xl. .. By Problem 5.O' 2 (p») = 1/4f(v(p».1. Show that the LR.). . Suppose that the ttue P does not belong to"P but if f}(P) is defined by (6. then under H. ! I I ! I .6. Show that. let 1r = s(z. 0 < Vae f < 00. Wald.3.6.3 is as given in (6. then the Rao test does not in general have the correct asymptotic level. but if f~(x) = ~ exp Ix 1'1.10) creates a valid level u test. .Q+1>'" .6. then O' 2(p) > 0'2 = Varp(X. (3) where s(t) is the continuous distribution function of a random variable symmetric about 0.6.j 436 Inference in the Multiparameter Case Chapter 6 i Problems for Section 6. n . " ..6. replacing &2 by 172 in (6. in fact.3 and. . but that if the estimate ~ L:~ dDIllDljT (Xi.3) then f}(P) = f}o.2.2 and the hypothesis (3q+l = (3o.1) ~ . Consider the Rao test for H : f} = f}o for the model "P = {P/I : /I E e} and ADA6 hold.6. /10) is estimated by 1(80 ).6.1 under this model and verifying the fonnula given.3.i.(3p ~ (3o. I . Establish (6. that is. Show that 0: 2 given in (6. f}o) is used.
for instance).2 is an unbiased estimate of EPE(P). that Zj is bounded with probability 1 and let ih(X ln )). .. L:~ 1 (P. /3P+I = '" = /3d ~ O.. . O)T and deduce that (c) EPE(p) = RSS(p) + ~.. .jP)2 where ILl ) = z.(b) continue to hold if we assume the GaussMarkov model. Zi) : 1 <i< n}. . hence. are indepeqdent of}] 1 ' · ' 1 Yn and ~* is distributed as }'i.Zn and evaluate the performance of Yjep) . 1 V. A natural goal to entertain is to obtain new values Yi". Let RSS(p) = 2JY.. Zi are ddimensional vectors for covariate (factor) values.9...lpI)2 be the residual sum of squares. = (3p = 0 by the (average) expected prediction error ~ ~ n EPE(p) = n. _. .(p) the corresponding fitted value. Suppose that . Model selection consists in selecting p to minimize EPE(p) and then using Y(P) as a predictor (Mallows.2 is known. where ti are i. the model with 13d+l = .d... (a) Show that EPE(p) ~ .p. Show that if the correct model has Jri given by s as above and {3 = {3o. (b) Suppose that Zi are realizations of U. Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d . 0.' i=l .Section 6. 1 n. . . Zi.9. i ..{3lp) p and {3(P) ~ (/31. 8.lp)2 Here Yi" l ' •• 1 Y. y~p) and.1) Yi = J1i + ti. then 13L is not a consistent estimate of f3 0 unless s(t) is the logistic distribution.I EL(Y. i = 1. But if 13L is defined as the solution of EZ1s(Zr {30) = Q(/3) where Q({3) = E(Zr A(Zr/3) is p x 1..p don't matter... then ~ Vri(rJL QI ({3) Var(ZI (Y1   {3 Ll has a limiting normal distribution with mean 0 and variance A(Zr {3)) )[QI ({3)J where Q({3) = E(Zr A(Zr{30)ZI) is p x p and necessarily nonsingular... Let f3(p) be the LSE under this assumption and Y.i .2 (b) Show that (1 + ~D + . f3p. .7 Problems and Complements 437 (a) Show that ~ Jr can be written in this form for both the probit and logit models. 1 n. 1973. Gaussian with mean zero J1i = Z T {3 and variance 0'2. ~ ~ (d) Show that (a).. (Model Selection) Consider the classical Gaussian linear model (6. Hint: Apply Theorem 6.2.. be the MLE for the logit model. . .1. i = 1.d.. where Xln) ~ {(Y..1. at Zll'" .
Use 0'2 = 1 and n = 10. .5. 6.1 (1) From the L A. Introduction to Statistical Analysis. we might envision the possibility that an overzealous assistant of Mendel "cooked" the data..  J . 1 I . but it is reasonable.16).". ! . . DIXON. . . Y. Y/" j1~. 3rd 00. 1969. R. t12 and {Z/I. New York: McGrawHill. 1 "'( i=1 . which are not multinomial. .9 REFERENCES Cox. and (ii) 'T}i = /31 Zil + f32zi2. 1 n n L (I'. MASSEY. Note for Section 6. i=1 Derive the result for the canonical model. AND F. Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good. Hint: (a) Note that ".9 for a discussion of densities with heavy tails. D. this makes no sense for the model we discussed in this section. (b) The result depends only on the mean and covariance structure of the i=l"". Note for Section 6.i2} such that the EPE in case (i) is smaller than in case (d) and vice versa.8 NOTES Note for Section 6. Heart Study after Dixon and Massey (1969). ~(p) n . The Analysis of Binary Data London: Methuen. )I'i). A.' .~I'. .z. For instance. + Evaluate EPE for (i) ~i ~ "IZ.) .L . if we consider alternatives to H.n.. W..4 (1) R. To guard against such situations he argued that the test should be used in a twotailed fashion and that we should reject H both for large and for small values of X2 • Of course.2 (1) See Problem 3.438 (e) Suppose p ~ Inference in the Multiparameter Case Chapter 6 2 and 11(Z) ~ Ii. Give values of /31. The moral of the story is that the practicing statisticians should be on their guard! For more on this theme see Section 6. 1970. EPE(p) n R SS( p) = .6. ti. LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of X2 as well as large ones (Problem 6. i 6.4.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.

GRAYBILL, F. A., An Introduction to Linear Statistical Models, Vol. I. New York: McGraw-Hill, 1961.

HABERMAN, S. J., The Analysis of Frequency Data. Chicago: University of Chicago Press, 1974.

HALD, A., Statistical Theory with Engineering Applications. New York: Wiley, 1952.

HUBER, P. J., "The behavior of maximum likelihood estimates under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob., I, 221-233. Berkeley: Univ. of California Press, 1967.

KOENKER, R., AND V. D'OREY, "Computing regression quantiles," Applied Statistics, 36, 383-393 (1987).

LAPLACE, P. S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (Reprinted in Oeuvres Complètes, 11, 475-558. Paris: Gauthier-Villars) (1789).

MALLOWS, C., "Some comments on Cp," Technometrics, 15, 661-675 (1973).

McCULLAGH, P., AND J. NELDER, Generalized Linear Models, 2nd ed. London: Chapman and Hall, 1989.

PORTNOY, S., AND R. KOENKER, "The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROBERTSON, T., F. T. WRIGHT, AND R. L. DYKSTRA, Order Restricted Statistical Inference. New York: Wiley, 1988.

SCHEFFÉ, H., The Analysis of Variance. New York: Wiley, 1959.

SCHERVISH, M., Theory of Statistics. New York: Springer, 1995.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Harvard University Press, 1986.

TIERNEY, L., R. KASS, AND J. KADANE, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need. The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics. Therefore, we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply. The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that to be called an experiment such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not, in addition, require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize.
This long-term relative frequency n_A/n, where n_A is the number of times the possible outcome A occurs in n repetitions, is to many statisticians, including the authors, the operational interpretation of the mathematical concept of probability. Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty; almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment," whether it is conceptually repeatable or not. For a discussion of this approach and further references the reader may wish to consult Savage (1954), Raiffa and Schlaiffer (1961), Savage (1962), Lindley (1965), de Groot (1970), and Berger (1985).

We now turn to the mathematical abstraction of a random experiment, the probability model. A random experiment is described mathematically in terms of the following quantities.

A.1.1 The sample space is the set of all possible outcomes of a random experiment. We denote it by Ω.

A.1.2 A sample point is any member of Ω and is typically denoted by ω.

A.1.3 Subsets of Ω are called events. We denote events by A, B, and so on, or by a description of their members. If ω ∈ Ω, {ω} is called an elementary event. If A contains more than one point, it is called a composite event. The relation between the experiment and the model is given by the correspondence "A occurs if and only if the actual outcome of the experiment is a member of A." In this section and throughout the book, we presume the reader to be familiar with elementary set theory and its notation at the level of Chapter 1 of Feller (1968) or Chapter 1 of Parzen (1960). We shall use the symbols ∪, ∩, ᶜ, −, ⊂ for union, intersection, complementation, set theoretic difference, and inclusion, as is usual in elementary set theory. The set operations we have mentioned have interpretations also. For example, the relation A ⊂ B between sets considered as events means that the occurrence of A implies the occurrence of B.

A.1.4 We will let A denote the class of subsets of Ω to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability P to every subset of Ω. A is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complementation.

A probability distribution or measure is a nonnegative function P on A having the following properties:

(i) P(Ω) = 1;

(ii) if A_1, A_2, ... are pairwise disjoint sets in A, then

P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

Recall that ⋃_{i=1}^∞ A_i is just the collection of points that are in any one of the sets A_i, and that two sets are disjoint if they have no points in common.
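As a small illustration of this definition (the sample space, events, and probability assignment below are hypothetical and chosen only for concreteness), a discrete probability measure can be tabulated and properties (i) and (ii) checked directly.

```python
# Minimal sketch (hypothetical sample space and assignment): a discrete
# probability measure on a finite sample space, with (i) and (ii) verified
# for a pair of disjoint events.
from fractions import Fraction as F

omega = {"HH", "HT", "TH", "TT"}            # sample space of two coin tosses
prob = {w: F(1, 4) for w in omega}          # P assigns 1/4 to each sample point

def P(event):
    """Probability of an event (a subset of omega)."""
    return sum(prob[w] for w in event)

A = {"HH", "HT"}                            # "first toss is heads"
B = {"TT"}                                  # disjoint from A
assert P(omega) == 1                        # property (i)
assert P(A | B) == P(A) + P(B)              # additivity for disjoint events, property (ii)
print(P(A), P(B), P(A | B))
```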
A.1.5 The three objects Ω, A, and P together describe a random experiment mathematically. We shall refer to the triple (Ω, A, P) either as a probability model or identify the model with what it represents as a (random) experiment.

References
Gnedenko (1967) Chapter 1, Sections 1-3
Grimmett and Stirzaker (1992) Sections 1.1-1.3
Hoel, Port, and Stone (1971) Sections 1.1-1.2
Parzen (1960) Chapter 1, Sections 1-5
Pitman (1993) Section 1.3

A.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS

The following are consequences of the definition of P.

A.2.1 P(∅) = 0.

A.2.2 P(Aᶜ) = 1 − P(A).

A.2.3 If A ⊂ B, then P(B − A) = P(B) − P(A).

A.2.4 0 ≤ P(A) ≤ 1.

A.2.5 If A ⊂ B, then P(B) ≥ P(A).

A.2.6 If A_1 ⊂ A_2 ⊂ ... ⊂ A_n ⊂ ..., then P(⋃_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).

A.2.7 P(⋃_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i), and P(⋂_{i=1}^n A_i) ≥ 1 − Σ_{i=1}^n P(A_iᶜ) (Bonferroni's inequality).

References
Gnedenko (1967) Chapter 1, Section 8
Grimmett and Stirzaker (1992) Section 1.3
Hoel, Port, and Stone (1971) Sections 1.2 and 1.3
Parzen (1960) Chapter 1, Sections 4-5
Pitman (1993) Section 1.3

A.3 DISCRETE PROBABILITY MODELS

A.3.1 A probability model is called discrete if Ω is finite or countably infinite and every subset of Ω is assigned a probability. In this case we can write Ω = {ω_1, ω_2, ...} and A is the collection of subsets of Ω, and we have for any event A,

P(A) = Σ_{ω_i ∈ A} P({ω_i}).   (A.3.2)

A.3.3 An important special case arises when Ω has a finite number of elements, say N, all of which are equally likely.
Then P({ω}) = 1/N for every ω ∈ Ω and

P(A) = Number of elements in A / N.

A.3.4 Suppose that ω_1, ..., ω_N are the members of some population (humans, guinea pigs, flowers, machines, etc.). Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another, selecting at random, is an experiment leading to the model of A.3.3. Such selection can be carried out if N is small by putting the "names" of the ω_i in a hopper, shaking well, and drawing. For large N, a random number table or computer can be used.

References
Gnedenko (1967) Chapter 1, Sections 4-5
Parzen (1960) Chapter 1, Sections 6-7
Pitman (1993) Section 1.4

A.4 CONDITIONAL PROBABILITY AND INDEPENDENCE

Given an event B such that P(B) > 0 and any other event A, we define the conditional probability of A given B, which we write P(A | B), by

P(A | B) = P(A ∩ B) / P(B).   (A.4.1)

If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment, then P(A | B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur. From a heuristic point of view, P(A | B) is the chance we would assign to the event A if we were told that B has occurred. In fact, for fixed B as before, the function P(· | B) is a probability measure on (Ω, A), which is referred to as the conditional probability measure given B: if A_1, A_2, ... are (pairwise) disjoint events and P(B) > 0, then

P(⋃_{i=1}^∞ A_i | B) = Σ_{i=1}^∞ P(A_i | B).   (A.4.2)

If B_1, ..., B_n are (pairwise) disjoint events of positive probability whose union is Ω, then

P(A) = Σ_{j=1}^n P(A | B_j) P(B_j).   (A.4.3)

Transposition of the denominator in (A.4.1) gives the multiplication rule,

P(A ∩ B) = P(B) P(A | B).   (A.4.4)
If P(A) is positive, we can combine (A.4.1), (A.4.3), and (A.4.4) and obtain Bayes rule,

P(B_j | A) = P(A | B_j) P(B_j) / Σ_{i=1}^n P(A | B_i) P(B_i).   (A.4.5)

The conditional probability of A given B_1, ..., B_n is written P(A | B_1, ..., B_n) and is defined by

P(A | B_1, ..., B_n) = P(A | B_1 ∩ ... ∩ B_n)   (A.4.6)

whenever P(B_1 ∩ ... ∩ B_n) > 0. Simple algebra leads to the multiplication rule,

P(B_1 ∩ ... ∩ B_n) = P(B_1) P(B_2 | B_1) P(B_3 | B_1, B_2) ⋯ P(B_n | B_1, ..., B_{n−1})   (A.4.7)

whenever P(B_1 ∩ ... ∩ B_{n−1}) > 0.

Two events A and B are said to be independent if

P(A ∩ B) = P(A) P(B).   (A.4.8)

If P(B) > 0, the relation (A.4.8) may be written

P(A | B) = P(A).   (A.4.9)

In other words, A and B are independent if knowledge of B does not affect the probability of A. The events A_1, ..., A_n are said to be independent if

P(A_{i_1} ∩ ... ∩ A_{i_k}) = Π_{j=1}^k P(A_{i_j})   (A.4.10)

for any subset {i_1, ..., i_k} of the integers {1, ..., n}. If all the P(A_i) are positive, (A.4.10) is equivalent to requiring that

P(A_j | A_{i_1}, ..., A_{i_k}) = P(A_j)   (A.4.11)

for any j and {i_1, ..., i_k} such that j ∉ {i_1, ..., i_k}.

References
Gnedenko (1967) Chapter 1, Section 9
Grimmett and Stirzaker (1992) Section 1.4
Hoel, Port, and Stone (1971) Sections 1.4-1.5
Parzen (1960) Chapter 2, Section 4; Chapter 3, Sections 1-4
Pitman (1993) Section 1.4
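The total probability formula and Bayes rule displayed above are easy to apply numerically. The following is a minimal illustrative sketch; the two-state setting, event names, and all probability values are hypothetical.

```python
# Minimal sketch (hypothetical numbers): posterior probabilities from priors
# P(B_j) and likelihoods P(A | B_j), using total probability and Bayes rule.
def bayes(prior, likelihood):
    """Posterior P(B_j | A) from priors P(B_j) and likelihoods P(A | B_j)."""
    joint = {j: prior[j] * likelihood[j] for j in prior}   # P(A and B_j)
    p_a = sum(joint.values())                              # total probability of A
    return {j: joint[j] / p_a for j in joint}

prior = {"disease": 0.01, "healthy": 0.99}                 # P(B_j)
likelihood = {"disease": 0.95, "healthy": 0.05}            # P(positive test | B_j)
posterior = bayes(prior, likelihood)
print(posterior["disease"])                                 # about 0.161
```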
x An by 1 j P(A. Loeve. .0 1 x··· XOi_I x {wn XOi+1 x··· x.3) for events Al x . it can be uniquely extended to the sigma field A specified in note (l) at the end of this appendix.o i . . then Ai corresponds to . P([A 1 x fl2 X .5. More generally.. .0) given by . If we are given n experiments (probability models) t\..' . 1 1 . . . x flnl n ' . W n ) : W~ E Ai. On the other hand. x An. If P is the probability measure defined on the sigma field A of the compound experiment. .1 .wn )}) I: " = PI ({Wi}) . Pn({W n }) foral! Wi E fli. . A.. I n  I . The reader not interested in the formalities may skip to Section A. . There are certain natural ways of defining sigma fields and probabilities for these experiments. that is. the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. (A... the Cartesian product Al x ... 1 < i < n}. .53) holds provided that P({(w"".' .6 where examples of compound experiments are given. r.5 COMPOUND EXPERIMENTS There is an intuitive notion of independent experiments. .on. x An of AI. 1977) that if P is defined by (A. then (A52) defines P for A) x ". x'" . x . An is by definition {(WI.) . A2). 1974. P. . X fl n 1 X An). . then the probability of a given chip in the second draw will depend on the outcome of the first dmw. . . ." X An) = P. the sigma field corresponding to £i..0 ofthe n stage compound experiment is by definition 0 1 x·· . I P n on (fln> An).. and if we do not replace the first chip drawn before the second draw. These will be discussed in this section. x " . X . InformaUy. x flnl n Ifl) x A 2 x ' . we should have . £n with respective sample spaces OJ. if we toss a coin twice. independent.. The interpretation of the sample space 0 is that (WI. I <i< n. x fl 2 x '" x fln)P(fl..0 if and only if WI is the outcome of £1..o n = {(WI. we introduce the notion of a compound experiment.5. X A 2 X '" x fl n ).on. if Ai E Ai...An with Ai E Ai. then the sample space . Chung. \ I I . (A53) It may be shown (Billingsley.'" £n and recording all n outcomes. In the discrete case (A. If we want to make the £i independent. . x On in the compound experiment. the subsets A ofn to which we can assign probability(1). on (fl2. then intuitively we should have alI classes of events AI. We shall speak of independent experiments £1. 1995....0 1 X '" x . Pn(A n ). a compound experiment is one made up of two or more component experiments..1 X Ai x 0i+ I x .2) 1 i " 1 If we are given probabilities Pi on (fl" AI). wn ) is a sample point in .. . I ! i w? 1 . 1 £n if the n stage compound experiment has its probability structure specified by (A5.3)... (A5. This makes sense in the compound experiment... P(fl.. it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip. . . The (n stage) compound experiment consists in performing component experiments £1. To be able to talk about independence and dependence of experiments. 446 A Review of Basic Probability Theory Appendix A A."" . ... x An) ~ P(A. (A5A) i. To say that £t has had outcome pound event (in . . .. . An are events. . For example. W2 is the outcome of £2 and E Oi corresponds to the Occurrence of the comso on.. ) = P(A.w n ) E o : Wi = wf}.1 Recall that if AI.
A.5.5 Specifying P when the E_i are dependent is more complicated. In the discrete case we know P once we have specified P({(ω_1, ..., ω_n)}) for each (ω_1, ..., ω_n) with ω_i ∈ Ω_i. By the multiplication rule (A.4.7) we have, in the discrete case, the following:

P({(ω_1, ..., ω_n)}) = P(E_1 has outcome ω_1) P(E_2 has outcome ω_2 | E_1 has outcome ω_1) ⋯ P(E_n has outcome ω_n | E_1 has outcome ω_1, ..., E_{n−1} has outcome ω_{n−1}).

The probability structure is determined by these conditional probabilities and conversely.

References
Grimmett and Stirzaker (1992) Section 1.5
Hoel, Port, and Stone (1971) Section 1.5
Parzen (1960) Chapter 3

A.6 BERNOULLI AND MULTINOMIAL TRIALS, SAMPLING WITH AND WITHOUT REPLACEMENT

A.6.1 Suppose that we have an experiment with only two possible outcomes, which we shall denote by S (success) and F (failure). If we assign P({S}) = p, we shall refer to such an experiment as a Bernoulli trial with probability of success p. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). Other examples will appear naturally in what follows. If we repeat such an experiment n times independently, we say we have performed n Bernoulli trials with success probability p. If Ω is the sample space of the compound experiment, any point ω ∈ Ω is an n-dimensional vector of S's and F's and, by (A.5.4),

P({ω}) = p^{k(ω)} (1 − p)^{n−k(ω)}   (A.6.2)

where k(ω) is the number of S's appearing in ω. If A_k is the event [exactly k S's occur], then

P(A_k) = C(n, k) p^k (1 − p)^{n−k},  k = 0, 1, ..., n,   (A.6.3)

where C(n, k) = n!/(k!(n − k)!). The formula (A.6.3) is known as the binomial probability.

A.6.4 More generally, if an experiment has q possible outcomes ω_1, ..., ω_q and P({ω_i}) = p_i, i = 1, ..., q, we refer to such an experiment as a multinomial trial with probabilities p_1, ..., p_q. If the experiment is performed n times independently, the compound experiment is called n multinomial trials with probabilities p_1, ..., p_q. If Ω is the sample space of the compound experiment, then

P({ω}) = p_1^{k_1(ω)} ⋯ p_q^{k_q(ω)}   (A.6.5)
where k_i(ω) = number of times ω_i appears in the sequence ω. If A_{k_1, ..., k_q} is the event (exactly k_1 ω_1's are observed, exactly k_2 ω_2's are observed, ..., exactly k_q ω_q's are observed), then

P(A_{k_1, ..., k_q}) = (n! / (k_1! ⋯ k_q!)) p_1^{k_1} ⋯ p_q^{k_q}   (A.6.6)

where the k_i are natural numbers adding up to n.

A.6.7 If we perform an experiment given by (Ω, A, P) independently n times, we shall sometimes refer to the outcome of the compound experiment as a sample of size n from the population given by (Ω, A, P). When n is finite the term with replacement is added to distinguish this situation from that described in (A.6.8).

A.6.8 If we have a finite population of cases Ω = {ω_1, ..., ω_N} and we select cases ω_i successively at random n times without replacement, the component experiments are not independent and, for any outcome a = (ω_{i_1}, ..., ω_{i_n}) of the compound experiment,

P({a}) = 1 / (N)_n   (A.6.9)

where (N)_n = N!/(N − n)!. If the case drawn is replaced before the next drawing, we are sampling with replacement; then the component experiments are independent and P({a}) = 1/N^n.

A.6.10 If Np of the members of Ω have a "special" characteristic S and N(1 − p) have the opposite characteristic F, and A_k = (exactly k "special" individuals are obtained in the sample), then

P(A_k) = C(n, k) (Np)_k (N(1 − p))_{n−k} / (N)_n   (A.6.10)

for max(0, n − N(1 − p)) ≤ k ≤ min(n, Np), and P(A_k) = 0 otherwise. The formula (A.6.10) is known as the hypergeometric probability.

References
Gnedenko (1967) Chapter 2, Section 11
Hoel, Port, and Stone (1971) Section 2.4
Parzen (1960) Chapter 3, Sections 1-4
Pitman (1993) Section 2.1

A.7 PROBABILITIES ON EUCLIDEAN SPACE

Random experiments whose outcomes are real numbers play a central role in theory and practice. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space.
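The binomial probability (A.6.3) and the hypergeometric probability (A.6.10) of the preceding section are straightforward to compute directly. The following is a minimal illustrative sketch; the population size, sample size, and number of "special" members are hypothetical, and the function names are ours.

```python
# Minimal sketch (hypothetical numbers): binomial probabilities for sampling
# with replacement and hypergeometric probabilities for sampling without
# replacement from a small population.
from math import comb

def binomial(k, n, p):
    """P(exactly k successes in n Bernoulli trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def hypergeometric(k, n, N, Np):
    """P(exactly k 'special' cases in a sample of size n drawn without
    replacement from a population of N cases of which Np are 'special')."""
    return comb(Np, k) * comb(N - Np, n - k) / comb(N, n)

# Drawing n = 5 from a population of N = 20 with 8 'special' members:
p = 8 / 20
print([round(binomial(k, 5, p), 4) for k in range(6)])
print([round(hypergeometric(k, 5, 20, 8), 4) for k in range(6)])
```

The two rows differ because the draws are dependent in the without-replacement case; as N grows with Np/N fixed, the hypergeometric probabilities approach the binomial ones.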
We shall use the notation R^k for k-dimensional Euclidean space and denote members of R^k by symbols such as x or (x_1, ..., x_k)^T, where ( )^T denotes transpose.

A.7.1 If (a_1, b_1), ..., (a_k, b_k) are k open intervals, we shall call the set (a_1, b_1) × ⋯ × (a_k, b_k) = {(x_1, ..., x_k) : a_i < x_i < b_i, 1 ≤ i ≤ k} an open k rectangle.

A.7.2 The Borel field in R^k, which we denote by B^k, is defined to be the smallest sigma field having all open k rectangles as members. Any subset of R^k we might conceivably be interested in turns out to be a member of B^k. We will write R for R^1 and B for B^1.

A.7.3 A discrete (probability) distribution on R^k is a probability measure P such that Σ_{i=1}^∞ P({x_i}) = 1 for some sequence of points {x_i} in R^k. The frequency function p of a discrete distribution is defined on R^k by

p(x) = P({x}).   (A.7.4)

Conversely, any nonnegative function p on R^k vanishing except on a sequence {x_1, ..., x_n, ...} of vectors and that satisfies Σ_{i=1}^∞ p(x_i) = 1 defines a unique discrete probability distribution by the relation

P(A) = Σ_{x_i ∈ A} p(x_i).   (A.7.5)

This definition is consistent with (A.3.2) because the study of this model and that of the model that has Ω = {x_1, ..., x_n, ...} are equivalent.

A.7.6 A nonnegative function p on R^k, which is integrable and which has

∫_{R^k} p(x) dx = 1,

is called a density function.

A.7.7 A continuous probability distribution on R^k is a probability P that is defined by the relation

P(A) = ∫_A p(x) dx   (A.7.8)

for some density function p and all events A, where dx denotes dx_1 ⋯ dx_k. It may be shown that a function P so defined satisfies (A.1.4). Integrals should be interpreted in the sense of Lebesgue; however, for practical purposes, Riemann integrals are adequate. Recall that the integral on the right of (A.7.8) is by definition ∫_{R^k} 1_A(x) p(x) dx, where 1_A(x) = 1 if x ∈ A, and 0 otherwise. Distributions P defined by (A.7.8) are usually called absolutely continuous. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely. An important special case of (A.7.8) is given by

P((a, b]) = ∫_a^b p(x) dx   (A.7.9)

for k = 1 and all a < b.
Geometrically, P(A) is the volume of the "cylinder" with base A and height p(x) at x. It turns out that a continuous probability distribution determines the density that generates it "uniquely."(1) Although in a continuous model P({x}) = 0 for every x, the density function has an operational interpretation close to that of the frequency function. For instance, if p is a continuous density on R, x_0 and x_1 are in R, and h is close to 0, then by the mean value theorem

P([x_0 − h, x_0 + h]) ≈ 2h p(x_0)

and

P([x_0 − h, x_0 + h]) / P([x_1 − h, x_1 + h]) ≈ p(x_0)/p(x_1).   (A.7.10)

The ratio p(x_0)/p(x_1) can, thus, be thought of as measuring approximately how much more or less likely we are to obtain an outcome in a neighborhood of x_0 than one in a neighborhood of x_1.

A.7.11 The distribution function (d.f.) F is defined by

F(x_1, ..., x_k) = P((−∞, x_1] × ⋯ × (−∞, x_k]).   (A.7.12)

The d.f. defines P in the sense that if P and Q are two probabilities with the same d.f., then P = Q. When k = 1, F is a function of a real variable characterized by the following properties:

x ≤ y ⇒ F(x) ≤ F(y)  (monotone)   (A.7.13)

x_n ↓ x ⇒ F(x_n) → F(x)  (continuous from the right)   (A.7.14)

lim_{x→∞} F(x) = 1,  lim_{x→−∞} F(x) = 0.   (A.7.15)

It may be shown that any function F satisfying (A.7.13)-(A.7.15) defines a unique P on the real line. We always have

F(x) − F(x−)(2) = P({x});   (A.7.16)

thus, F is continuous at x if and only if P({x}) = 0.   (A.7.17)

References
Gnedenko (1967) Chapter 4, Sections 21-22
Hoel, Port, and Stone (1971) Sections 3.1-3.2, 5.1-5.2
Parzen (1960) Chapter 4, Sections 1-4
Pitman (1993) Sections 3.4, 4.1, 4.5
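The relation between a density and its d.f. is easy to check numerically. The following minimal sketch uses the standard exponential distribution purely for illustration (the choice of density, interval, and grid size are hypothetical) and computes P((a, b]) once from F and once by integrating p.

```python
# Minimal sketch (illustrative density): P((a, b]) = F(b) - F(a) agrees with
# the integral of the density p over (a, b), here for the standard exponential.
import math

def p(x):
    """Density of the standard exponential distribution."""
    return math.exp(-x) if x >= 0 else 0.0

def F(x):
    """Distribution function of the standard exponential distribution."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def integrate(f, a, b, m=100000):
    """Crude midpoint-rule approximation of the integral of f over (a, b)."""
    h = (b - a) / m
    return sum(f(a + (i + 0.5) * h) for i in range(m)) * h

a, b = 0.5, 2.0
print(F(b) - F(a))            # P((a, b]) via the d.f.
print(integrate(p, a, b))     # the same probability via the density
```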
A.8 RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS

Although sample spaces can be very diverse, the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. For example, we measure the weight of pigs drawn at random from a population, the concentration of a certain pollutant in the atmosphere, the time to breakdown and length of repair time for a randomly chosen machine, the yield per acre of a field of wheat in a given year, and so on. In the probability model, these quantities will correspond to random variables and vectors.

A.8.1 A random variable X is a function from Ω to R such that the set {ω : X(ω) ∈ B} = X^{-1}(B) is in A for every B ∈ B.(1)

A.8.2 A random vector X = (X_1, ..., X_k)^T is a k-tuple of random variables, or equivalently a function from Ω to R^k such that the set {ω : X(ω) ∈ B} = X^{-1}(B) is in A for every B ∈ B^k.(1) For k = 1 random vectors are just random variables.

A.8.3 The probability distribution of a random vector X is, by definition, the probability measure P_X in the model (R^k, B^k, P_X) given by

P_X(B) = P[X ∈ B].   (A.8.3)

The event X^{-1}(B) will usually be written [X ∈ B] and P([X ∈ B]) will be written P[X ∈ B].

A.8.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete. The probability of any event that is expressible purely in terms of X can be calculated if we know only the probability distribution of X. In the discrete case this means we need only know the frequency function and in the continuous case the density. When we are interested in particular random variables or vectors, we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined. The subscript X or X will be used for densities, d.f.'s, and so on to indicate which vector or variable they correspond to, unless the reference is clear from the context, in which case it will be omitted. Thus, we will refer to the frequency function, d.f., density, and so on of a random vector when we are, in fact, referring to those features of its probability distribution.

The study of real- or vector-valued functions of a random vector X is central in the theory of probability and of statistics. Here is the formal definition of such transformations. Let g be any function from R^k to R^m, m ≥ 1, such that(2) g^{-1}(B) = {y ∈ R^k : g(y) ∈ B}
is in B^k for every B ∈ B^m. Then the random transformation g(X) is defined by

g(X)(ω) = g(X(ω)).   (A.8.6)

The probability distribution of g(X) is completely determined by that of X through

P[g(X) ∈ B] = P[X ∈ g^{-1}(B)].   (A.8.7)

An example of a transformation often used in statistics is g = (g_1, g_2)^T with g_1(x) = k^{-1} Σ_{i=1}^k x_i = x̄ and g_2(x) = k^{-1} Σ_{i=1}^k (x_i − x̄)². Another common example is g(x) = (min{x_1, ..., x_k}, max{x_1, ..., x_k})^T.

If X is discrete with frequency function p_X, then g(X) is discrete and has frequency function

p_{g(X)}(t) = Σ_{x : g(x) = t} p_X(x).   (A.8.8)

Suppose that X is continuous with density p_X and g is real-valued and one-to-one(3) on an open set S such that P[X ∈ S] = 1. Furthermore, assume that the derivative g' of g exists and does not vanish on S. Then g(X) is continuous with density given by

p_{g(X)}(t) = p_X(g^{-1}(t)) / |g'(g^{-1}(t))|   (A.8.9)

for t ∈ g(S), and 0 otherwise. This is called the change of variable formula. If g(X) = aX + b, a ≠ 0, then(5)

p_{g(X)}(t) = (1/|a|) p_X((t − b)/a).   (A.8.10)

Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa.(4)

These notions generalize to the case Z = (X, Y), a random vector obtained by putting two random vectors together. The (marginal) frequency or density of X is found from that of (X, Y) by summing or integrating out over y in p_{(X,Y)}(x, y): if (X, Y)^T is a discrete random vector with frequency function p_{(X,Y)}(x, y), then the frequency function of X, known as the marginal frequency function, is given by

p_X(x) = Σ_y p_{(X,Y)}(x, y);   (A.8.11)

similarly, if (X, Y)^T is continuous with density p_{(X,Y)}(x, y), it may be shown (as a consequence of (A.8.8)) that X is continuous with marginal density function given by

p_X(x) = ∫_{−∞}^{∞} p_{(X,Y)}(x, y) dy.   (A.8.12)
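The change of variable formula (A.8.9) can be illustrated by simulation. In the following minimal sketch the choice g(X) = exp(X) with X standard normal is made only for illustration (so g(X) is lognormal), and the grid, sample size, and interval are hypothetical; the probability of an interval computed from the reconstructed density is compared with the empirical frequency in a simulated sample.

```python
# Minimal sketch (illustrative transformation): density of g(X) = exp(X) for
# X standard normal via p_X(g^{-1}(t)) / |g'(g^{-1}(t))|, checked by simulation.
import math
import numpy as np

def density_gX(t):
    """Density of exp(X), X ~ N(0,1), from the change of variable formula."""
    x = math.log(t)                                    # g^{-1}(t)
    p_x = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return p_x / t                                     # |g'(x)| = exp(x) = t

# P(1 < g(X) <= 2) from the reconstructed density (midpoint rule) ...
m, a, b = 4000, 1.0, 2.0
h = (b - a) / m
prob_formula = sum(density_gX(a + (i + 0.5) * h) for i in range(m)) * h

# ... and from a simulated sample of g(X).
rng = np.random.default_rng(1)
sample = np.exp(rng.normal(size=200_000))
prob_empirical = np.mean((sample > a) & (sample <= b))
print(round(prob_formula, 4), round(float(prob_empirical), 4))
```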
In practice, all random variables are discrete because there is no instrument that can measure with perfect accuracy. Nevertheless, it is common in statistics to work with continuous distributions, which may be easier to deal with. The justification for this may be theoretical or pragmatic. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical model. Or else, the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A.15 and B.7.

A.8.13 A convention: We shall write X = Y if the probability of the event [X ≠ Y] is 0.

References
Gnedenko (1967) Chapter 4, Sections 21-24
Grimmett and Stirzaker (1992) Section 4.7
Hoel, Port, and Stone (1971) Sections 3.3, 5.2, 6.5
Parzen (1960) Chapter 7, Sections 1-5, 9
Pitman (1993) Section 4.4

A.9 INDEPENDENCE OF RANDOM VARIABLES AND VECTORS

A.9.1 Two random variables X_1 and X_2 are said to be independent if and only if, for sets A and B in B, the events [X_1 ∈ A] and [X_2 ∈ B] are independent.

A.9.2 The random variables X_1, ..., X_n are said to be (mutually) independent if and only if, for any sets A_1, ..., A_n in B, the events [X_1 ∈ A_1], ..., [X_n ∈ A_n] are independent.

A.9.3 By (A.8.7), if X and Y are independent, so are g(X) and h(Y), whatever be g and h. For example, if (X_1, X_2) and (Y_1, Y_2) are independent, so are X_1 + X_2 and Y_1 Y_2, (X_1, X_1 X_2) and Y_1, and so on. To generalize these definitions to random vectors X_1, ..., X_n (not necessarily of the same dimensionality) we need only use the events [X_i ∈ A_i], where A_i is a set in the range of X_i.

Theorem A.9.1. Suppose X = (X_1, ..., X_n) is either a discrete or continuous random vector. Then the random variables X_1, ..., X_n are independent if, and only if, either of the following two conditions hold:

p_X(x_1, ..., x_n) = p_{X_1}(x_1) ⋯ p_{X_n}(x_n) for all (x_1, ..., x_n),   (A.9.4)

F_X(x_1, ..., x_n) = F_{X_1}(x_1) ⋯ F_{X_n}(x_n) for all (x_1, ..., x_n).   (A.9.5)

A.9.6 If the X_i are all continuous and independent, then X = (X_1, ..., X_n) is continuous.

A.9.7 The preceding equivalences are valid for random vectors X_1, ..., X_n.
A.9.8 If X_1, ..., X_n are independent identically distributed k-dimensional random vectors with d.f. F_X or density (frequency function) p_X, then X_1, ..., X_n is called a random sample of size n from a population with d.f. F_X or density (frequency function) p_X. In statistics, such a random sample is often obtained by selecting n members at random in the sense of (A.3.4) from a population and measuring k characteristics on each member.

If we perform n Bernoulli trials with probability of success p and we let X_i be the indicator of the event (success on the ith trial), then the X_i form a sample from a distribution that assigns probability p to 1 and (1 − p) to 0. Such samples will be referred to as the indicators of n Bernoulli trials with probability of success p.

References
Gnedenko (1967) Chapter 4, Sections 23-24
Grimmett and Stirzaker (1992) Section 3.2
Hoel, Port, and Stone (1971) Sections 3.4, 5.3
Parzen (1960) Chapter 7, Sections 6-7
Pitman (1993) Sections 2.4, 3.3

A.10 THE EXPECTATION OF A RANDOM VARIABLE

Let X be the height of an individual sampled at random from a finite population. If x_1, ..., x_q are the only heights present in the population, it follows that the average height of the individuals in the population is given by Σ_{i=1}^q x_i P[X = x_i], where P[X = x_i] is just the proportion of individuals of height x_i in the population. The same quantity arises (approximately) if we use the long-run frequency interpretation of probability and calculate the average height of the individuals in a large sample from the population in question. In line with these ideas we develop the general concept of expectation.

If A is any event, we define the random variable 1_A, the indicator of the event A, by

1_A(ω) = 1 if ω ∈ A,
       = 0 otherwise.

A.10.1 If X is a nonnegative, discrete random variable with possible values {x_1, x_2, ...}, we define the expectation or mean of X, written E(X), by

E(X) = Σ_{i=1}^∞ x_i p_X(x_i).

(Infinity is a possible value of E(X). Take, for example, p_X(i) = 1/(i(i+1)), i = 1, 2, ....)

A.10.2 More generally, if X is discrete, decompose {x_1, x_2, ...} into two sets A and B, where A consists of all nonnegative x_i and B of all negative x_i. If either Σ_{x_i ∈ A} x_i p_X(x_i) < ∞ or