Second Edition
Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data
Bickel, Peter J.
Mathematical statistics: basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-850363-X (v. 1)
1. Mathematical statistics. I. Doksum, Kjell A. II. Title.
QA276.B47 2001
519.5-dc21 00-031377
Acquisition Editor: Kathleen Boothby Sestak
Editor in Chief: Sally Yagan
Assistant Vice President of Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Production Editor: Bob Walters
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Marketing Assistant: Vince Jansen
Director of Marketing: John Tweeddale
Editorial Assistant: Joanne Wendelken
Art Director: Jayne Conte
Cover Design: Jayne Conte
©2001, 1977 by Prentice-Hall, Inc.
Upper Saddle River, New Jersey 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
ISBN 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
CONTENTS

PREFACE TO THE SECOND EDITION: VOLUME I
PREFACE TO THE FIRST EDITION

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
  1.1 Data, Models, Parameters, and Statistics
    1.1.1 Data and Models
    1.1.2 Parametrizations and Parameters
    1.1.3 Statistics as Functions on the Sample Space
    1.1.4 Examples, Regression Models
  1.2 Bayesian Models
  1.3 The Decision Theoretic Framework
    1.3.1 Components of the Decision Theory Framework
    1.3.2 Comparison of Decision Procedures
    1.3.3 Bayes and Minimax Criteria
  1.4 Prediction
  1.5 Sufficiency
  1.6 Exponential Families
    1.6.1 The One-Parameter Case
    1.6.2 The Multiparameter Case
    1.6.3 Building Exponential Families
    1.6.4 Properties of Exponential Families
    1.6.5 Conjugate Families of Prior Distributions
  1.7 Problems and Complements
  1.8 Notes
  1.9 References

2 METHODS OF ESTIMATION
  2.1 Basic Heuristics of Estimation
    2.1.1 Minimum Contrast Estimates; Estimating Equations
    2.1.2 The Plug-In and Extension Principles
  2.2 Minimum Contrast Estimates and Estimating Equations
    2.2.1 Least Squares and Weighted Least Squares
    2.2.2 Maximum Likelihood
  2.3 Maximum Likelihood in Multiparameter Exponential Families
  *2.4 Algorithmic Issues
    2.4.1 The Method of Bisection
    2.4.2 Coordinate Ascent
    2.4.3 The Newton-Raphson Algorithm
    2.4.4 The EM (Expectation/Maximization) Algorithm
  2.5 Problems and Complements
  2.6 Notes
  2.7 References

3 MEASURES OF PERFORMANCE
  3.1 Introduction
  3.2 Bayes Procedures
  3.3 Minimax Procedures
  *3.4 Unbiased Estimation and Risk Inequalities
    3.4.1 Unbiased Estimation, Survey Sampling
    3.4.2 The Information Inequality
  *3.5 Nondecision Theoretic Criteria
    3.5.1 Computation
    3.5.2 Interpretability
    3.5.3 Robustness
  3.6 Problems and Complements
  3.7 Notes
  3.8 References

4 TESTING AND CONFIDENCE REGIONS
  4.1 Introduction
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models
  4.4 Confidence Bounds, Intervals, and Regions
  4.5 The Duality Between Confidence Regions and Tests
  *4.6 Uniformly Most Accurate Confidence Bounds
  *4.7 Frequentist and Bayesian Formulations
  4.8 Prediction Intervals
  4.9 Likelihood Ratio Procedures
    4.9.1 Introduction
    4.9.2 Tests for the Mean of a Normal Distribution: Matched Pair Experiments
    4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations
    4.9.4 The Two-Sample Problem with Unequal Variances
    4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions
  4.10 Problems and Complements
  4.11 Notes
  4.12 References

5 ASYMPTOTIC APPROXIMATIONS
  5.1 Introduction: The Meaning and Uses of Asymptotics
  5.2 Consistency
    5.2.1 Plug-In Estimates and MLEs in Exponential Family Models
    5.2.2 Consistency of Minimum Contrast Estimates
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications
    5.3.1 The Delta Method for Moments
    5.3.2 The Delta Method for In Law Approximations
    5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
  5.4 Asymptotic Theory in One Dimension
    5.4.1 Estimation: The Multinomial Case
    *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates
    *5.4.3 Asymptotic Normality and Efficiency of the MLE
    *5.4.4 Testing
    *5.4.5 Confidence Bounds
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution
  5.6 Problems and Complements
  5.7 Notes
  5.8 References

6 INFERENCE IN THE MULTIPARAMETER CASE
  6.1 Inference for Gaussian Linear Models
    6.1.1 The Classical Gaussian Linear Model
    6.1.2 Estimation
    6.1.3 Tests and Confidence Intervals
  *6.2 Asymptotic Estimation Theory in p Dimensions
    6.2.1 Estimating Equations
    6.2.2 Asymptotic Normality and Efficiency of the MLE
    6.2.3 The Posterior Distribution in the Multiparameter Case
  *6.3 Large Sample Tests and Confidence Regions
    6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
    6.3.2 Wald's and Rao's Large Sample Tests
  *6.4 Large Sample Methods for Discrete Data
    6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's χ² Test
    6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables
    6.4.3 Logistic Regression for Binary Responses
  *6.5 Generalized Linear Models
  *6.6 Robustness Properties and Semiparametric Models
  6.7 Problems and Complements
  6.8 Notes
  6.9 References

A A REVIEW OF BASIC PROBABILITY THEORY
  A.1 The Basic Model
  A.2 Elementary Properties of Probability Models
  A.3 Discrete Probability Models
  A.4 Conditional Probability and Independence
  A.5 Compound Experiments
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement
  A.7 Probabilities on Euclidean Space
  A.8 Random Variables and Vectors: Transformations
  A.9 Independence of Random Variables and Vectors
  A.10 The Expectation of a Random Variable
  A.11 Moments
  A.12 Moment and Cumulant Generating Functions
  A.13 Some Classical Discrete and Continuous Distributions
  A.14 Modes of Convergence of Random Variables and Limit Theorems
  A.15 Further Limit Theorems and Inequalities
  A.16 Poisson Process
  A.17 Notes
  A.18 References

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS
  B.1 Conditioning by a Random Variable or Vector
    B.1.1 The Discrete Case
    B.1.2 Conditional Expectation for Discrete Variables
    B.1.3 Properties of Conditional Expected Values
    B.1.4 Continuous Variables
    B.1.5 Comments on the General Case
  B.2 Distribution Theory for Transformations of Random Vectors
    B.2.1 The Basic Framework
    B.2.2 The Gamma and Beta Distributions
  B.3 Distribution Theory for Samples from a Normal Population
    B.3.1 The χ², F, and t Distributions
    B.3.2 Orthogonal Transformations
  B.4 The Bivariate Normal Distribution
  B.5 Moments of Random Vectors and Matrices
    B.5.1 Basic Properties of Expectations
    B.5.2 Properties of Variance
  B.6 The Multivariate Normal Distribution
    B.6.1 Definition and Density
    B.6.2 Basic Properties. Conditional Distributions
  B.7 Convergence for Random Vectors: O_P and o_P Notation
  B.8 Multivariate Calculus
  B.9 Convexity and Inequalities
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory
    B.10.1 Symmetric Matrices
    B.10.2 Order on Symmetric Matrices
    B.10.3 Elementary Hilbert Space Theory
  B.11 Problems and Complements
  B.12 Notes
  B.13 References

C TABLES
  Table I The Standard Normal Distribution
  Table I' Auxiliary Table of the Standard Normal Distribution
  Table II t Distribution Critical Values
  Table III χ² Distribution Critical Values
  Table IV F Distribution Critical Values

INDEX
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared, statistics has changed enormously under the impact of several forces:

(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.

(2) The generation of enormous amounts of data: terabytes (the equivalent of $10^{12}$ characters) for an astronomical survey over three years.

(3) The possibility of implementing computations of a magnitude that would have once been unthinkable.

The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data.

As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:

(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with an "infinite" number of parameters.

(2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.

(3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo.
(4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious.

(5) The study of the interplay between numerical and statistical considerations. Despite advances in computing speed, some methods run quickly in real time. Others do not and some though theoretically attractive cannot be implemented in a human lifetime.

(6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories.

There have, of course, been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. These will not be dealt with in our work.

As a consequence our second edition, reflecting what we now teach our graduate students, is much changed from the first. Our one long book has grown to two volumes, each to be only a little shorter than the first edition.

Volume I, which we present in 2000, covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing conclusions.

In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. However, our focus and order of presentation have changed.

Volume I covers the material of Chapters 1-6 and Chapter 10 of the first edition with pieces of Chapters 7-10 and includes Appendix A on basic probability theory. However, Chapter 1 now has become part of a larger Appendix B, which includes more advanced topics from probability theory such as the multivariate Gaussian distribution, weak convergence in Euclidean spaces, and probability inequalities as well as more advanced topics in matrix theory and analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on $R^d$ as well as an elementary introduction to Hilbert space theory. As in the first edition, we do not require measure theory but assume from the start that our models are what we call "regular." That is, we assume either a discrete probability whose support does not depend on the parameter set, or the absolutely continuous case with a density. Hilbert space theory is not needed, but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis.

Appendix B is as self-contained as possible with proofs of most statements, problems, and references to the literature for proofs of the deepest results such as the spectral theorem. The reasons for these additions are the changes in subject matter necessitated by the current areas of importance in the field.

Specifically, instead of beginning with parametrized models we include from the start non- and semiparametric models, then go to parameters and parametric models stressing the role of identifiability. From the beginning we stress function-valued parameters, such as the density, and function-valued statistics, such as the empirical distribution function. We also, from the start, include examples that are important in applications, such as regression experiments. There is more material on Bayesian models and analysis. Save for these changes of emphasis the other major new elements of Chapter 1, which parallels Chapter 2 of the first edition, are an extended discussion of prediction and an expanded introduction to k-parameter exponential families. These objects that are the building blocks of most modern models require concepts involving moments of random vectors and convexity that are given in Appendix B.

Chapter 2 of this edition parallels Chapter 3 of the first and deals with estimation. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter exponential families. Other novel features of this chapter include a detailed analysis including proofs of convergence of a standard but slow algorithm for computing MLEs in multiparameter exponential families and an introduction to the EM algorithm, one of the main ingredients of most modern algorithms for inference. Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions, including some optimality theory for estimation as well and elementary robustness considerations. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage.

Chapter 5 of the new edition is devoted to asymptotic approximations. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Also new is a section relating Bayesian and frequentist inference via the Bernstein-von Mises theorem.

Finally, Chapter 6 is devoted to inference in multivariate (multiparameter) models. Included are asymptotic normality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. Generalized linear models are introduced as examples. Robustness from an asymptotic theory point of view appears also. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the more advanced topics of Volume II.

As in the first edition problems play a critical role by elucidating and often substantially expanding the text. Almost all the previous ones have been kept with an approximately equal number of new ones added to correspond to our new topics and point of view. The conventions established on footnotes and notation in the first edition remain, if somewhat augmented.

Chapters 1-4 develop the basic principles and examples of statistics. Nevertheless, we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. Although we believe the material of Chapters 5 and 6 has now become fundamental, there is clearly much that could be omitted at a first reading that we also star. There are clear dependencies between starred sections that follow.

[Diagram: dependencies among the starred sections of Chapters 5 and 6]

Volume II is expected to be forthcoming in 2003. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance. Examples of application such as the Cox model in survival analysis, other transformation models, and the classical nonparametric k sample and independence problems will be included. Semiparametric estimation and testing will be considered more generally, greatly extending the material in Chapter 8 of the first edition. The topic presently in Chapter 8, density estimation, will be studied in the context of nonparametric function estimation. We also expect to discuss classification and model selection using the elementary theory of empirical processes. The basic asymptotic tools that will be developed or presented, in part in the text and in part in appendices, are weak convergence for random processes, elementary empirical process theory, and the functional delta method.

A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo.

With the tools and concepts developed in this second volume students will be ready for advanced research in modern statistics.

For the first volume of the second edition we would like to add thanks to new colleagues, particularly Jianqing Fan, Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill and the many students who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for proofreading that found not only typos but serious errors, and Prentice Hall for generous production support.

Last and most important we would like to thank our wives, Nancy Kramer Bickel and Joan H. Fujimura, and our families for support, encouragement, and active participation in an enterprise that at times seemed endless, appeared gratifyingly ended in 1976 but has, with the field, taken on a new life.

Peter J. Bickel
bickel@stat.berkeley.edu
Kjell Doksum
doksum@stat.berkeley.edu
PREFACE TO THE FIRST EDITION

This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). Because the book is an introduction to statistics, we need probability theory and expect readers to have had a course at the level of, for instance, Hoel, Port, and Stone's Introduction to Probability Theory. Our appendix does give all the probability that is needed. However, the treatment is abridged with few proofs and no examples or problems.

We feel such an introduction should at least do the following:

(1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice.

(2) Give careful proofs of the major "elementary" results such as the Neyman-Pearson lemma, the Lehmann-Scheffé theorem, the information inequality, and the Gauss-Markoff theorem.

(3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates, and the structure of both Bayes and admissible solutions in decision theory. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated.

(4) Show how the ideas and results apply in a variety of important subfields such as Gaussian linear models, multinomial models, and nonparametric models.

Although there are several good books available for this purpose, we feel that none has quite the mix of coverage and depth desirable at this level. The work of Rao, Linear Statistical Inference and Its Applications, 2nd ed., covers most of the material we do and much more but at a more abstract level employing measure theory. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig, Introduction to Mathematical Statistics, 3rd ed. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior.

Our book contains more material than can be covered in two quarters. In the two-quarter courses for graduate students in mathematics, statistics, the physical sciences, and engineering that we have taught we cover the core Chapters 2 to 7, which go from modeling through estimation and testing to linear models. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Finally, we select topics from Chapter 8 on discrete data and Chapter 9 on nonparametric models.

Chapter 1 covers probability theory rather than statistics. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. It may be integrated with the material of Chapters 2-7 as the course proceeds rather than being given at the start; or it may be included at the end of an introductory probability course that precedes the statistics course.

A special feature of the book is its many problems. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text.

Conventions:

(i) In order to minimize the number of footnotes we have added a section of comments at the end of each chapter preceding the problem section. These comments are ordered by the section to which they pertain. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers, 1 for the first, 2 for the second, and so on. The comments contain digressions, reservations, and additional references. They need to be read only as the reader's curiosity is piqued.

(ii) Various notational conventions and abbreviations are used in the text. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text.

(iii) Basic notation for probabilistic objects such as random variables and vectors, densities, distribution functions, and moments is established in the appendix.

We would like to acknowledge our indebtedness to colleagues, students, and friends who helped us during the various stages (notes, preliminary edition, final draft) through which this book passed. E. L. Lehmann's wise advice has played a decisive role at many points. R. Pyke's careful reading of a next-to-final version caught a number of infelicities of style and content. Many careless mistakes and typographical errors in an earlier version were caught by D. Minassian who sent us an exhaustive and helpful listing. W. Carmichael, in proofreading the final version, caught more mistakes than both authors together. A serious error in Problem 2.2.5 was discovered by F. Scholz. Among many others who helped in the same way we would like to mention C. Chen, S. J. Chou, G. Drew, C. Gray, U. Gupta, P. X. Quang, and A. Samulon. Without Winston Chow's lovely plots Section 9.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day.

We would also like to thank the colleagues and friends who inspired and helped us to enter the field of statistics. The foundation of our statistical knowledge was obtained in the lucid, enthusiastic, and stimulating lectures of Joe Hodges and Chuck Bell, respectively. Later we were both very much influenced by Erich Lehmann whose ideas are strongly reflected in this book.

Peter J. Bickel
Kjell Doksum
Berkeley
1976
in particUlar. A 1 .1 DATA. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically. AND PERFORMANCE CRITERIA 1. functions as in signal processing. which statisticians share. for example.2. large scale or small. The goals of science and society.1 and 6. and so on. as usual.1. (2) Matrices of scalars and/or characters. (3) Arrays of scalars and/or characters as in contingency tablessee Chapter 6or more generally multifactor IDultiresponse data on a number of individuals. PARAMETERS AND STATISTICS Data and Models Most studies and experiments. GOALS. digitized pictures or more routinely measurements of covariates and response on a set of n individualssee Example 1. for example. Data can consist of: (1) Vectors of scalars. produce data whose analysis is the ultimate object of the endeavor.5. trees as in evolutionary phylogenies. and/or characters. A generic source of trouble often called grf?ss errors is discussed in greater detail in the section on robustness (Section 3. are to draw useful information from data using everything that we know. but we will introduce general model diagnostic tools in Volume 2. (4) All of the above and more.1 1. Subject matter specialists usually have to be principal guides in model formulation.1. In any case all our models are generic and. a single time series of measurements.Chapter 1 STATISTICAL MODELS. MODELS. ''The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. measurements.4 and Sections 2. scientific or industrial.3).1. we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. A detailed discussion of the appropriateness of the models we shall discuss in particUlar situations is beyond the scope of this book. Moreover. Chapter 1.
(2) We can derive methods of extracting useful information from data and. a shipment of manufactured items. A to m.5 and throughout the book. are never true but fortunately it is only necessary that they be usefuL" In this book we will study how. have the patients rated qualitatively for improvement by physicians. for example. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. in particular. (d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee. and more general procedures will be discussed in Chapters 24. and B to n. Goodness of fit tests. (4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. confidence regions. The population is so large that. is distributed in a large population. robustness. and diagnostics are discussed in Volume 2. Here are some examples: (a) We are faced with a population of N elements. So to get infonnation about a sample of n is drawn without replacement and inspected. In this manner. An unknown number Ne of these elements are defective. e.3 and continue with optimality principles in Chapters 3 and 4. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population.! I 2 Statistical Models. randomly selected patients and then measure temperature and blood pressure. . and so on. For instance. producing energy. treating a disease. for instance. reducing pollution. and so on.. we approximate the actual process of sampling without replacement by sampling with replacement. It is too expensive to examine all of the items. learning a maze. we can assign two drugs. in the words of George Box (1979). (b) We want to study how a physical or economic feature. "Models of course. 
give methods that assess the generalizability of experimental results.. His or her measurements are subject to random fluctuations (error) and the data can be thought of as p. height or income. Hierarchies of models are discussed throughout. testing. Random variability " : . we obtain one or more quantitative or qualitative measures of efficacy from each experiment. For instance. to what extent can we expect the same effect more generally? Estimation. An exhaustive census is impossible so the study is based on measurements and a sample of n individuals drawn at random from the population. Goals. We begin this in the simple examples that follow and continue in Sections 1. if we observe an effect in our data. We begin this discussion with decision theory in Section 1. (3) We can assess the effectiveness of the methods we propose.21. The data gathered are the number of defectives found in the sample. for modeling purposes. (c) An experimenter makes n independent detenninations of the value of a physical constant p. plus some random errors. (5) We can be guided to alternative or more general descriptions that might fit better. starting with tentative models: (I) We can conceptualize the data structure and our goals more precisely. and Performance Criteria Chapter 1 priori. Chapter I.
here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.

We shall use these examples to arrive at our formulation of statistical models and to indicate some of the difficulties of constructing such models. First consider situation (a), which we refer to as:

Example 1.1.1. Sampling Inspection. The mathematical model suggested by the description is well defined. A random experiment has been performed. The sample space consists of the numbers $0, 1, \ldots, n$ corresponding to the number of defective items found. On this space we can define a random variable $X$ given by $X(k) = k$, $k = 0, 1, \ldots, n$. If $N\theta$ is the number of defective items in the population sampled, then by (A.13.6)

$$P[X = k] = \frac{\binom{N\theta}{k}\binom{N - N\theta}{n - k}}{\binom{N}{n}} \tag{1.1.1}$$

if $\max(n - N(1 - \theta), 0) \le k \le \min(N\theta, n)$. Thus, $X$ has an hypergeometric, $\mathcal{H}(N\theta, N, n)$ distribution.

The main difference that our model exhibits from the usual probability model is that $N\theta$ is unknown and, in principle, can take on any value between $0$ and $N$. So, although the sample space is well defined, we cannot specify the probability structure completely but rather only give a family $\{\mathcal{H}(N\theta, N, n)\}$ of probability distributions for $X$, any one of which could have generated the data actually observed. □

Example 1.1.2. Sample from a Population. One-Sample Models. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not. It can also be thought of as a limiting case in which $N = \infty$, so that sampling with replacement replaces sampling without. Formally, if the measurements are scalar, we observe $x_1, \ldots, x_n$, which are modeled as realizations of $X_1, \ldots, X_n$ independent, identically distributed (i.i.d.) random variables with common unknown distribution function $F$. We often refer to such $X_1, \ldots, X_n$ as a random sample from $F$, and also write that $X_1, \ldots, X_n$ are i.i.d. as $X$ with $X \sim F$, where "$\sim$" stands for "is distributed as." The model is fully described by the set $\mathcal{F}$ of distributions that we specify. The same model also arises naturally in situation (c). Here we can write the $n$ determinations of $\mu$ as

$$X_i = \mu + \epsilon_i, \quad 1 \le i \le n \tag{1.1.2}$$

where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is the vector of random errors. What should we assume about the distribution of $\epsilon$, which together with $\mu$ completely specifies the joint distribution of $X_1, \ldots, X_n$? Of course, that depends on how the experiment is carried out. Given the description in (c), we postulate

(1) The value of the error committed on one determination does not affect the value of the error at other times. That is, $\epsilon_1, \ldots, \epsilon_n$ are independent.
(2) The distribution of the error at one determination is the same as that at another. Thus, $\epsilon_1, \ldots, \epsilon_n$ are identically distributed.

(3) The distribution of $\epsilon$ is independent of $\mu$.

Equivalently $X_1, \ldots, X_n$ are a random sample and, if we let $G$ be the distribution function of $\epsilon_1$ and $F$ that of $X_1$, then

$$F(x) = G(x - \mu) \tag{1.1.3}$$

and the model is alternatively specified by $\mathcal{F}$, the set of $F$'s we postulate, or by $\{(\mu, G) : \mu \in R,\ G \in \mathcal{G}\}$ where $\mathcal{G}$ is the set of all allowable error distributions that we postulate. Commonly considered $\mathcal{G}$'s are all distributions with center of symmetry 0, or alternatively all distributions with expectation 0. The classical default model is:

(4) The common distribution of the errors is $N(0, \sigma^2)$, where $\sigma^2$ is unknown. That is, the $X_i$ are a sample from a $N(\mu, \sigma^2)$ population or equivalently $\mathcal{F} = \{\Phi\left(\frac{\cdot - \mu}{\sigma}\right) : \mu \in R,\ \sigma > 0\}$ where $\Phi$ is the standard normal distribution. □

This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations, for instance, heights of individuals or log incomes. It is important to remember that these are assumptions at best only approximately valid. All actual measurements are discrete rather than continuous. There are absolute bounds on most quantities: 100 ft high men are impossible. Heights are always nonnegative. The Gaussian distribution, whatever be $\mu$ and $\sigma$, will have none of this.

Now consider situation (d).

Example 1.1.3. Two-Sample Models. Let $x_1, \ldots, x_m$; $y_1, \ldots, y_n$, respectively, be the responses of $m$ subjects having a given disease given drug A and $n$ other similarly diseased subjects given drug B. By convention, if drug A is a standard or placebo, we refer to the $x$'s as control observations. A placebo is a substance such as water that is expected to have no effect on the disease and is used to correct for the well-documented placebo effect, that is, patients improve even if they only think they are being treated. We let the $y$'s denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo. We call the $y$'s treatment observations.

Natural initial assumptions here are:

(1) The $x$'s and $y$'s are realizations of $X_1, \ldots, X_m$ a sample from $F$, and $Y_1, \ldots, Y_n$ a sample from $G$, so that the model is specified by the set of possible $(F, G)$ pairs.

To specify this set more closely the critical constant treatment effect assumption is often made.

(2) Suppose that if treatment A had been administered to a subject response $x$ would have been obtained. Then if treatment B had been administered to the same subject instead of treatment A, response $y = x + \Delta$ would be obtained where $\Delta$ does not depend on $x$. This implies that if $F$ is the distribution of a control, then $G(\cdot) = F(\cdot - \Delta)$. We call this the shift model with parameter $\Delta$.

Often the final simplification is made.

(3) The control responses are normally distributed. Then if $F$ is the $N(\mu, \sigma^2)$ distribution and $G$ is the $N(\mu + \Delta, \sigma^2)$ distribution, we have specified the Gaussian two sample model with equal variances. □
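The model families of Examples 1.1.1 and 1.1.3 can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `hypergeom_pmf` and the numerical values of N, n, the number of defectives, mu, sigma, and delta are illustrative assumptions. It evaluates one member of the hypergeometric family in (1.1.1) and verifies the shift relation G(y) = F(y - delta) for the Gaussian two-sample model on a grid.

```python
from math import comb
from statistics import NormalDist


def hypergeom_pmf(k, n_defective, N, n):
    """P[X = k] for X ~ H(n_defective, N, n), following (1.1.1)."""
    lo = max(n - (N - n_defective), 0)
    hi = min(n_defective, n)
    if k < lo or k > hi:
        return 0.0  # outside the support stated after (1.1.1)
    return comb(n_defective, k) * comb(N - n_defective, n - k) / comb(N, n)


# One member of the family {H(N*theta, N, n)}; the values are hypothetical.
N, n, D = 100, 19, 10
pmf = [hypergeom_pmf(k, D, N, n) for k in range(n + 1)]
total = sum(pmf)  # should be 1 up to rounding

# Gaussian two-sample shift model: F = N(mu, sigma^2), G = N(mu + delta, sigma^2).
mu, sigma, delta = 2.0, 1.5, 0.7  # hypothetical values
F = NormalDist(mu, sigma)
G = NormalDist(mu + delta, sigma)
grid = [-3 + 0.5 * j for j in range(21)]
max_gap = max(abs(G.cdf(y) - F.cdf(y - delta)) for y in grid)
```

Changing `D` while holding `N` and `n` fixed moves through the family of candidate distributions, which is the sense in which the number of defectives indexes the model.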
1. In Example 1. if they are true. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise. This distribution is assumed to be a member of a family P of probability distributions on Rn. For instance. we can be reasonably secure about some aspects.1 Data. In some applications we often have a tested theoretical model and the danger is small. That is. The advantage of piling on assumptions such as (I )(4) of Example 1. Statistical methods for models of this kind are given in Volume 2. For instance. in Example 1. Models. Experiments in medicine and the social sciences often pose particular difficulties. For instance. This will be done in Sections 3. may be quite irrelevant to the experiment that was actually performed. our analyses.1).2.1. we observe X and the family P is that of all bypergeometric distributions with sample size n and population size N. When w is the outcome of the experiment. we now define the elements of a statistical model. In others. As our examples suggest. though correct for the model written down.1. if (1)(4) hold.5.4. On this sample space we have defined a random vector X = (Xl. for instance. in comparative experiments such as those of Example 1. Since it is only X that we observe.Section 1. It is often convenient to identify the random vector X with its realization. We are given a random experiment with sample space O.1. and Statistics 5 How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations.Xn ). In this situation (and generally) it is important to randomize. there is tremendous variation in the degree of knowledge and control we have concerning experiments.1. Using our first three examples for illustrative purposes. The number of defectives in the first example clearly has a hypergeometric distribution.3 when F. However.3 and 6. 
Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. The danger is that. the methods needed for its analysis are much the same as those appropriate for the situation of Example 1.1. in Example 1. if they are false.2 is that. we need only consider its probability distribution. we know how to combine our measurements to estimate JL in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4.1. Parameters. Fortunately. we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients. we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated. the data X(w). the number of a particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed.'" . equally trained observers with no knowledge of each other's findings.2. but not others. G are assumed arbitrary. ~(w) is referred to as the observations or data.6. All the severely ill patients might.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease. we can ensure independence and identical distribution of the observations by using different. P is referred to as the model. A review of necessary concepts and notation from probability theory are given in the appendices. P is the . have been assigned to B.
family of all distributions according to which $X_1, \ldots, X_n$ are independent and identically distributed with a common $N(\mu, \sigma^2)$ distribution.

1.1.2 Parametrizations and Parameters

To describe $\mathcal{P}$ we use a parametrization, that is, a map, $\theta \to P_\theta$ from a space of labels, the parameter space $\Theta$, to $\mathcal{P}$; or equivalently write $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. Thus, in Example 1.1.1 we take $\theta$ to be the fraction of defectives in the shipment, $\Theta = \{0, \frac{1}{N}, \ldots, 1\}$ and $P_\theta$ the $\mathcal{H}(N\theta, N, n)$ distribution. In Example 1.1.2 with assumptions (1)-(4) we have implicitly taken $\Theta = R \times R^+$ and, if $\theta = (\mu, \sigma^2)$, $P_\theta$ the distribution on $R^n$ with density $\prod_{i=1}^n \frac{1}{\sigma}\varphi\left(\frac{x_i - \mu}{\sigma}\right)$ where $\varphi$ is the standard normal density. If, still in this example, we know we are measuring a positive quantity in this model, we have $\Theta = R^+ \times R^+$. If, on the other hand, we only wish to make assumptions (1)-(3) with $\epsilon$ having expectation 0, we can take $\Theta = \{(\mu, G) : \mu \in R,\ G \text{ with density } g \text{ such that } \int x g(x)\,dx = 0\}$ and $P_{(\mu, G)}$ has density $\prod_{i=1}^n g(x_i - \mu)$.

When we can take $\Theta$ to be a nice subset of Euclidean space and the maps $\theta \to P_\theta$ are smooth, in senses to be made precise later, models $\mathcal{P}$ are called parametric. Models such as that of Example 1.1.2 with assumptions (1)-(3) are called semiparametric. Finally, models such as that of Example 1.1.3 with only (1) holding and $F$, $G$ taken to be arbitrary are called nonparametric. It's important to note that even nonparametric models make substantial assumptions: in Example 1.1.3 that $X_1, \ldots, X_m$ are independent of each other and $Y_1, \ldots, Y_n$; moreover, $X_1, \ldots, X_m$ are identically distributed as are $Y_1, \ldots, Y_n$. The only truly nonparametric but useless model for $X \in R^n$ is to assume that its (joint) distribution can be anything.

Note that there are many ways of choosing a parametrization in these and all other problems. We may take any one-to-one function of $\theta$ as a new parameter. For instance, in Example 1.1.1 we can use the number of defectives in the population, $N\theta$, as a parameter and in Example 1.1.2, under assumptions (1)-(4), we may parametrize the model by the first and second moments of the normal distribution of the observations (i.e., by $(\mu, \mu^2 + \sigma^2)$).

What parametrization we choose is usually suggested by the phenomenon we are modeling; $\theta$ is the fraction of defectives, $\mu$ is the unknown constant being measured. However, as we shall see later, the first parametrization we arrive at is not necessarily the one leading to the simplest analysis. Of even greater concern is the possibility that the parametrization is not one-to-one, that is, such that we can have $\theta_1 \ne \theta_2$ and yet $P_{\theta_1} = P_{\theta_2}$. Such parametrizations are called unidentifiable. For instance, in (1.1.2) suppose that we permit $G$ to be arbitrary. Then the map sending $\theta = (\mu, G)$ into the distribution of $(X_1, \ldots, X_n)$ remains the same but $\Theta = \{(\mu, G) : \mu \in R,\ G \text{ has (arbitrary) density } g\}$. Now the parametrization is unidentifiable because, for example, $\mu = 0$ and $N(0, 1)$ errors lead to the same distribution of the observations as $\mu = 1$ and $N(-1, 1)$ errors. The critical problem with such parametrizations is that even with "infinite amounts of data," that is, knowledge of the true $P_\theta$, parts of $\theta$ remain unknowable. Thus, we will need to ensure that our parametrizations are identifiable, that is, $\theta_1 \ne \theta_2 \Rightarrow P_{\theta_1} \ne P_{\theta_2}$.
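The unidentifiability example just given can be confirmed directly. This is a small sketch under the assumption that the error distribution is Gaussian, so that the distribution of a single observation X = mu + eps is again Gaussian; the helper name `observation_dist` is hypothetical and not from the text.

```python
from statistics import NormalDist


def observation_dist(mu, error_dist):
    """Distribution of X = mu + eps when eps follows the Gaussian error_dist."""
    return NormalDist(mu + error_dist.mean, error_dist.stdev)


# Two different labels theta = (mu, G) from the parametrization in the text ...
P1 = observation_dist(0.0, NormalDist(0.0, 1.0))   # mu = 0, N(0, 1) errors
P2 = observation_dist(1.0, NormalDist(-1.0, 1.0))  # mu = 1, N(-1, 1) errors

# ... that lead to the same distribution of the observations.
same_distribution = (P1 == P2)
max_density_gap = max(abs(P1.pdf(x / 10) - P2.pdf(x / 10)) for x in range(-40, 41))
```

Because the two labels cannot be told apart from the distribution of the data, no amount of sampling separates mu = 0 from mu = 1 in this parametrization.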
Dual to the notion of a parametrization, a map from some $\Theta$ to $\mathcal{P}$, is that of a parameter, formally a map, $\nu$, from $\mathcal{P}$ to another space $\mathcal{N}$. A parameter is a feature $\nu(P)$ of the distribution of $X$. For instance, in Example 1.1.1, the fraction of defectives $\theta$ can be thought of as the mean of $X/n$. In Example 1.1.3 with assumptions (1)-(2) we are interested in $\Delta$, which can be thought of as the difference in the means of the two populations of responses. In addition to the parameters of interest, there are also usually nuisance parameters, which correspond to other unknown features of the distribution of $X$. For instance, in Example 1.1.2, if the errors are normally distributed with unknown variance $\sigma^2$, then $\sigma^2$ is a nuisance parameter. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter $\theta$, which indexes the family $\mathcal{P}$, that is, make $\theta \to P_\theta$ into a parametrization of $\mathcal{P}$. Implicit in this description is the assumption that $\theta$ is a parameter in the sense we have just defined. But given a parametrization $\theta \to P_\theta$, $\theta$ is a parameter if and only if the parametrization is identifiable. Formally, we can define $\theta : \mathcal{P} \to \Theta$ as the inverse of the map $\theta \to P_\theta$, from $\Theta$ to its range $\mathcal{P}$ iff the latter map is 1-1, that is, if $P_{\theta_1} = P_{\theta_2}$ implies $\theta_1 = \theta_2$.

More generally, a function $q : \Theta \to \mathcal{N}$ can be identified with a parameter $\nu(P)$ iff $P_{\theta_1} = P_{\theta_2}$ implies $q(\theta_1) = q(\theta_2)$ and then $\nu(P_\theta) \equiv q(\theta)$.

Here are two points to note:

(1) A parameter can have many representations. For instance, in Example 1.1.2 with assumptions (1)-(4) the parameter of interest $\mu \equiv \mu(P)$ can be characterized as the mean of $P$, or the median of $P$, or the midpoint of the interquantile range of $P$, or more generally as the center of symmetry of $P$, as long as $\mathcal{P}$ is the set of all Gaussian distributions.

(2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable). For instance, consider Example 1.1.2 again in which we assume the error $\epsilon$ to be Gaussian but with arbitrary mean $\Delta$. Then $\mathcal{P}$ is parametrized by $\theta = (\mu, \Delta, \sigma^2)$, where $\sigma^2$ is the variance of $\epsilon$. As we have seen this parametrization is unidentifiable and neither $\mu$ nor $\Delta$ are parameters in the sense we've defined. But $\sigma^2 = \mathrm{Var}(X_1)$ evidently is and so is $\mu + \Delta$.

Sometimes the choice of $\mathcal{P}$ starts by the consideration of a particular parameter. For instance, our interest in studying a population of incomes may precisely be in the mean income. When we sample, say with replacement, and observe $X_1, \ldots, X_n$ independent with common distribution, it is natural to write

$$X_i = \mu + \epsilon_i, \quad 1 \le i \le n$$

where $\mu$ denotes the mean income and, thus, $E(\epsilon_i) = 0$. The $(\mu, G)$ parametrization of Example 1.1.2 is now well defined and identifiable by (1.1.3) and $\mathcal{G} = \{G : \int x\,dG(x) = 0\}$.

Similarly, in Example 1.1.3, instead of postulating a constant treatment effect $\Delta$, we can start by making the difference of the means, $\delta = \mu_Y - \mu_X$, the focus of the study. Then $\delta$ is identifiable whenever $\mu_X$ and $\mu_Y$ exist.
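Point (1) above, that one parameter can have several representations within the Gaussian family, can be illustrated numerically. The sketch below is illustrative only: the values mu = 3 and sigma = 2 are arbitrary assumptions, and the exponential comparison at the end uses the standard facts that an Exp(1) distribution has mean 1 and median log 2, to show why the equivalence depends on the choice of model.

```python
from math import log
from statistics import NormalDist

P = NormalDist(mu=3.0, sigma=2.0)  # one Gaussian member of the model; values are arbitrary

mean_of_P = P.mean
median_of_P = P.median
# Midpoint of the interquantile range, here taken between the 0.25 and 0.75 quantiles.
midpoint_iqr = (P.inv_cdf(0.25) + P.inv_cdf(0.75)) / 2

# Outside the Gaussian family the representations can disagree:
# for the standard exponential distribution, mean = 1 but median = log 2.
exp_mean, exp_median = 1.0, log(2.0)
representations_differ = abs(exp_mean - exp_median) > 0.3
```

All three Gaussian quantities return the same number, so any of them may serve as the definition of the parameter when the model is restricted to Gaussian distributions.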
1.1.3 Statistics as Functions on the Sample Space

Models and parametrizations are creations of the statistician, but the true values of parameters are secrets of nature. Our aim is to use the data inductively, to narrow down in useful ways our ideas of what the "true" $P$ is. The link for us are things we can compute, statistics. Formally, a statistic $T$ is a map from the sample space $\mathcal{X}$ to some space of values $\mathcal{T}$, usually a Euclidean space. Informally, $T(x)$ is what we can compute if we observe $X = x$. Thus, in Example 1.1.1, the fraction defective in the sample, $T(x) = x/n$. In Example 1.1.2 a common estimate of $\mu$ is the statistic $T(X_1, \ldots, X_n) = \bar{X} \equiv \frac{1}{n}\sum_{i=1}^n X_i$, a common estimate of $\sigma^2$ is the statistic

$$s^2 \equiv \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar{X})^2.$$

$\bar{X}$ and $s^2$ are called the sample mean and sample variance. How we use statistics in estimation and other decision procedures is the subject of the next section.

For future reference we note that a statistic just as a parameter need not be real or Euclidean valued. For instance, a statistic we shall study extensively in Chapter 2 is the function valued statistic $\hat{F}$, called the empirical distribution function, which evaluated at $x \in R$ is

$$\hat{F}(X_1, \ldots, X_n)(x) = \frac{1}{n}\sum_{i=1}^n 1(X_i \le x)$$

where $(X_1, \ldots, X_n)$ are a sample from a probability $P$ on $R$ and $1(A)$ is the indicator of the event $A$. This statistic takes values in the set of all distribution functions on $R$. It estimates the function valued parameter $F$ defined by its evaluation at $x \in R$,

$$F(P)(x) = P[X_1 \le x].$$

Deciding which statistics are important is closely connected to deciding which parameters are important and, hence, can be related to model formulation as we saw earlier. For instance, consider situation (d) listed at the beginning of this section. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient, then our attention naturally focuses on estimating this constant. If, however, this difference depends on the patient in a complex manner (the effect of each drug is complex), we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure.

Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. Next this model, which now depends on the data, is used to decide what estimate of the measure of difference should be employed (cf., for example, Mandel, 1964). Data-based model selection can make it difficult to ascertain or even assign a meaning to the accuracy of estimates or the probability of reaching correct conclusions. Nevertheless, we can draw guidelines from our numbers and cautiously proceed. These issues will be discussed further in Volume 2. In this volume we assume that the model has
been selected prior to the current experiment. This selection is based on experience with previous similar experiments (cf. Lehmann, 1990).

There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. For instance, in situation (d) again, patients may be considered one at a time, sequentially, and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients. The experimenter may, for example, assign the drugs alternatively to every other patient in the beginning and then, after a while, assign the drug that seems to be working better to a higher proportion of patients. Moreover, the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other. Thus, the number of patients in the study (the sample size) is random. Problems such as these lie in the fields of sequential analysis and experimental design. They are not covered under our general model and will not be treated in this book. We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more information.

Notation. Regular models. When dependence on $\theta$ has to be observed, we shall denote the distribution corresponding to any particular parameter value $\theta$ by $P_\theta$. Expectations calculated under the assumption that $X \sim P_\theta$ will be written $E_\theta$. Distribution functions will be denoted by $F(\cdot, \theta)$, density and frequency functions by $p(\cdot, \theta)$. However, these and other subscripts and arguments will be omitted where no confusion can arise.

It will be convenient to assume(1) from now on that in any parametric model we consider either:

(1) All of the $P_\theta$ are continuous with densities $p(x, \theta)$;

(2) All of the $P_\theta$ are discrete with frequency functions $p(x, \theta)$, and there exists a set $\{x_1, x_2, \ldots\}$ that is independent of $\theta$ such that $\sum_{i=1}^\infty p(x_i, \theta) = 1$ for all $\theta$.

Such models will be called regular parametric models. In the discrete case we will use both the terms frequency function and density for $p(x, \theta)$. See A.10.

1.1.4 Examples, Regression Models

We end this section with two further important examples indicating the wide scope of the notions we have introduced.

In most studies we are interested in studying relations between responses and several other variables not just treatment or control as in Example 1.1.3. This is the stage for the following.

Example 1.1.4. Regression Models. We observe $(z_1, Y_1), \ldots, (z_n, Y_n)$ where $Y_1, \ldots, Y_n$ are independent. The distribution of the response $Y_i$ for the $i$th subject or case in the study is postulated to depend on certain characteristics $z_i$ of the $i$th subject. Thus, $z_i$ is a $d$ dimensional vector that gives characteristics such as sex, age, height, weight, and so on of the $i$th subject in a study. For instance, in Example 1.1.3 we could take $z$ to be the treatment label and write our observations as $(A, X_1), \ldots, (A, X_m), (B, Y_1), \ldots, (B, Y_n)$. This is obviously overkill but suppose that, in the study, drugs A and B are given at several
dose levels. Then, $d = 2$ and $z_i^T$ can denote the pair (Treatment Label, Treatment Dose Level) for patient $i$.

In general, $z_i$ is a nonrandom vector of values called a covariate vector or a vector of explanatory variables whereas $Y_i$ is random and referred to as the response variable or dependent variable in the sense that its distribution depends on $z_i$. If we let $f(y_i \mid z_i)$ denote the density of $Y_i$ for a subject with covariate vector $z_i$, then the model is

(a) $p(y_1, \ldots, y_n) = \prod_{i=1}^n f(y_i \mid z_i).$

If we let $\mu(z)$ denote the expected value of a response with given covariate vector $z$, then we can write,

(b) $Y_i = \mu(z_i) + \epsilon_i, \quad i = 1, \ldots, n$

where $\epsilon_i = Y_i - E(Y_i)$, $i = 1, \ldots, n$. Here $\mu(z)$ is an unknown function from $R^d$ to $R$ that we are interested in. For instance, in Example 1.1.3 with the Gaussian two-sample model $\mu(A) = \mu$, $\mu(B) = \mu + \Delta$. We usually need to postulate more. A common (but often violated) assumption is

(1) The $\epsilon_i$ are identically distributed with distribution $F$. That is, the effect of $z$ on $Y$ is through $\mu(z)$ only. In the two sample models this is implied by the constant treatment effect assumption. See Problem 1.1.8.

On the basis of subject matter knowledge and/or convenience it is usually postulated that

(2) $\mu(z) = g(\beta, z)$ where $g$ is known except for a vector $\beta = (\beta_1, \ldots, \beta_d)^T$ of unknowns. The most common choice of $g$ is the linear form,

(3) $g(\beta, z) = \sum_{j=1}^d \beta_j z_j = z^T\beta$ so that (b) becomes

(b') $Y_i = z_i^T\beta + \epsilon_i, \quad 1 \le i \le n.$

This is the linear model. Often the following final assumption is made:

(4) The distribution $F$ of (1) is $N(0, \sigma^2)$ with $\sigma^2$ unknown. Then we have the classical Gaussian linear model, which we can write in vector matrix form,

(c) $Y \sim N_n(Z\beta, \sigma^2 J)$

where $Z_{n \times d} = (z_1, \ldots, z_n)^T$ and $J$ is the $n \times n$ identity.

Clearly, Example 1.1.3(3) is a special case of this model. So is Example 1.1.2 with assumptions (1)-(4). In fact by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. By varying the assumptions we obtain parametric models as with (1), (3) and (4) above, semiparametric as with (1) and (2) with $F$ arbitrary, and nonparametric if we drop (1) and simply treat the $z_i$ as a label of the completely unknown distributions of $Y_i$. Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems. □
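To see concretely why the Gaussian two-sample model of Example 1.1.3(3) is a special case of the linear model, one can code the treatment label as a 0/1 covariate. The sketch below is an illustration under that assumed coding: the design rows (1, 0) for drug A and (1, 1) for drug B, the helper `mean_vector`, and the numerical values of m, n, mu, and delta are not taken from the text.

```python
def mean_vector(Z, beta):
    """Return the vector of means z_i^T beta for each row z_i of the design matrix Z."""
    return [sum(z_j * b_j for z_j, b_j in zip(z, beta)) for z in Z]


m, n = 3, 2            # hypothetical numbers of control and treatment subjects
mu, delta = 5.0, 1.5   # hypothetical control mean and treatment effect

# Assumed coding: z = (1, 0) for a subject given drug A, z = (1, 1) for drug B.
Z = [[1.0, 0.0] for _ in range(m)] + [[1.0, 1.0] for _ in range(n)]
beta = [mu, delta]

means = mean_vector(Z, beta)
# The first m entries equal mu(A) = mu and the last n equal mu(B) = mu + delta.
```

With this coding, beta = (mu, delta) plays the role of the unknown vector in (b'), and adding independent N(0, sigma^2) errors to `means` gives the equal-variance Gaussian two-sample model.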
Finally, we give an example in which the responses are dependent.

Example 1.1.5. Measurement Model with Autoregressive Errors. Let $X_1, \ldots, X_n$ be the $n$ determinations of a physical constant $\mu$. Consider the model where

$$X_i = \mu + e_i, \quad i = 1, \ldots, n$$

and assume

$$e_i = \beta e_{i-1} + \epsilon_i, \quad i = 1, \ldots, n, \quad e_0 = 0$$

where $\epsilon_i$ are independent identically distributed with density $f$. Here the errors $e_1, \ldots, e_n$ are dependent as are the $X$'s. In fact we can write

$$X_i = \mu(1 - \beta) + \beta X_{i-1} + \epsilon_i, \quad i = 2, \ldots, n, \quad X_1 = \mu + \epsilon_1.$$

An example would be, say, the elapsed times $X_1, \ldots, X_n$ spent above a fixed high level for a series of $n$ consecutive wave records at a point on the seashore. Let $\mu = E(X_i)$ be the average time for an infinite series of records. It is plausible that $e_i$ depends on $e_{i-1}$ because long waves tend to be followed by long waves. A second example is consecutive measurements $X_i$ of a constant $\mu$ made by the same observer who seeks to compensate for apparent errors. Of course, model (a) assumes much more but it may be a reasonable first approximation in these situations.

To find the density $p(x_1, \ldots, x_n)$, we start by finding the density of $e_1, \ldots, e_n$. Using conditional probability theory and $e_i = \beta e_{i-1} + \epsilon_i$, we have

$$\begin{aligned} p(e_1, \ldots, e_n) &= p(e_1)\,p(e_2 \mid e_1)\,p(e_3 \mid e_1, e_2) \cdots p(e_n \mid e_1, \ldots, e_{n-1}) \\ &= p(e_1)\,p(e_2 \mid e_1)\,p(e_3 \mid e_2) \cdots p(e_n \mid e_{n-1}) \\ &= f(e_1)\,f(e_2 - \beta e_1) \cdots f(e_n - \beta e_{n-1}). \end{aligned}$$

Because $e_i = X_i - \mu$, the model for $X_1, \ldots, X_n$ is

$$p(x_1, \ldots, x_n) = f(x_1 - \mu)\prod_{j=2}^n f\bigl(x_j - \beta x_{j-1} - (1 - \beta)\mu\bigr).$$

The default assumption, at best an approximation for the wave example, is that $f$ is the $N(0, \sigma^2)$ density. Then we have what is called the AR(1) Gaussian model

$$p(x_1, \ldots, x_n) = (2\pi)^{-n/2}\sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\left[(x_1 - \mu)^2 + \sum_{i=2}^n \bigl(x_i - \beta x_{i-1} - (1 - \beta)\mu\bigr)^2\right]\right\}.$$

We include this example to illustrate that we need not be limited by independence. However, save for a brief discussion in Volume 2, the conceptual issues of stationarity, ergodicity, and the associated probability theory models and inference for dependent data are beyond the scope of this book. □
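The factorization of the joint density in Example 1.1.5 can be checked numerically. The following sketch assumes Gaussian innovations; the function names, the seed, and the parameter values are illustrative choices rather than anything specified in the text. It simulates the model and evaluates the log density two ways: once in terms of the x's, and once by recovering the errors e_i = x_i - mu and the innovations e_i - beta * e_{i-1}.

```python
import math
import random


def normal_logpdf(u, sigma):
    """Log of the N(0, sigma^2) density at u."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - u**2 / (2 * sigma**2)


def simulate_ar1(n, mu, beta, sigma, rng):
    """Draw X_i = mu + e_i with e_i = beta * e_{i-1} + eps_i and e_0 = 0."""
    e_prev, xs = 0.0, []
    for _ in range(n):
        e = beta * e_prev + rng.gauss(0.0, sigma)
        xs.append(mu + e)
        e_prev = e
    return xs


def loglik_from_x(xs, mu, beta, sigma):
    """log p(x_1..x_n) = log f(x_1 - mu) + sum_j log f(x_j - beta x_{j-1} - (1 - beta) mu)."""
    ll = normal_logpdf(xs[0] - mu, sigma)
    for j in range(1, len(xs)):
        ll += normal_logpdf(xs[j] - beta * xs[j - 1] - (1 - beta) * mu, sigma)
    return ll


def loglik_from_errors(xs, mu, beta, sigma):
    """Same log density, computed as sum_i log f(e_i - beta e_{i-1}) with e_0 = 0."""
    e_prev, ll = 0.0, 0.0
    for x in xs:
        e = x - mu
        ll += normal_logpdf(e - beta * e_prev, sigma)
        e_prev = e
    return ll


rng = random.Random(0)  # fixed seed so the run is reproducible
xs = simulate_ar1(50, mu=10.0, beta=0.6, sigma=2.0, rng=rng)
ll_x = loglik_from_x(xs, 10.0, 0.6, 2.0)
ll_e = loglik_from_errors(xs, 10.0, 0.6, 2.0)
```

The two evaluations agree up to rounding, which mirrors the substitution of e_i = x_i - mu into the product formula for the errors.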
Summary. In this section we introduced the first basic notions and formalism of mathematical statistics, vector observations $X$ with unknown probability distributions $P$ ranging over models $\mathcal{P}$. The notions of parametrization and identifiability are introduced. The general definition of parameters and statistics is given and the connection between parameters and parametrizations elucidated. This is done in the context of a number of classical examples, the most important of which is the workhorse of statistics, the regression model. We view statistical models as useful tools for learning from the outcomes of experiments and studies. They are useful in understanding how the outcomes can be used to draw inferences that go beyond the particular experiment. Models are approximations to the mechanisms generating the observations. How useful a particular model is is a complex mix of how good the approximation is and how much insight it gives into drawing inferences.

1.2 BAYESIAN MODELS

Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data. There are situations in which most statisticians would agree that more can be said. For instance, in the inspection Example 1.1.1, it is possible that, in the past, we have had many shipments of size $N$ that have subsequently been distributed. If the customers have provided accurate records of the number of defective items that they have found, we can construct a frequency distribution $\{\pi_0, \ldots, \pi_N\}$ for the proportion $\theta$ of defectives in past shipments. That is, $\pi_i$ is the frequency of shipments with $i$ defective items, $i = 0, \ldots, N$. Now it is reasonable to suppose that the value of $\theta$ in the present shipment is the realization of a random variable $\boldsymbol{\theta}$ with distribution given by

$$P\left[\boldsymbol{\theta} = \frac{i}{N}\right] = \pi_i, \quad i = 0, \ldots, N. \tag{1.2.1}$$

Our model is then specified by the joint distribution of the observed number $X$ of defectives in the sample and the random variable $\boldsymbol{\theta}$. We know that, given $\boldsymbol{\theta} = i/N$, $X$ has the hypergeometric distribution $\mathcal{H}(i, N, n)$. Thus,

$$P\left[X = k,\ \boldsymbol{\theta} = \frac{i}{N}\right] = \pi_i\,\frac{\binom{i}{k}\binom{N - i}{n - k}}{\binom{N}{n}}. \tag{1.2.2}$$

This is an example of a Bayesian model.

There is a substantial number of statisticians who feel that it is always reasonable, and indeed necessary, to think of the true value of the parameter $\theta$ as being the realization of a random variable $\boldsymbol{\theta}$ with a known distribution. This distribution does not always correspond to an experiment that is physically realizable but rather is thought of as a measure of the beliefs of the experimenter concerning the true value of $\theta$ before he or she takes any data.
Thus, the resulting statistical inference becomes subjective. The theory of this school is expounded by L. J. Savage (1954), Raiffa and Schlaiffer (1961), Lindley (1965), De Groot (1969), and Berger (1985). An interesting discussion of a variety of points of view on these questions may be found in Savage et al. (1962). There is an even greater range of viewpoints in the statistical community from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of $\boldsymbol{\theta}$ has an objective interpretation in terms of frequencies.(1) Our own point of view is that subjective elements including the views of subject matter experts are an essential element in all model building. However, insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). However, by giving $\theta$ a distribution purely as a theoretical tool to which no subjective significance is attached, we can obtain important and useful results and insights. We shall return to the Bayesian framework repeatedly in our discussion.

In this section we shall define and discuss the basic elements of Bayesian models. Suppose that we have a regular parametric model $\{P_\theta : \theta \in \Theta\}$. To get a Bayesian model we introduce a random vector $\boldsymbol{\theta}$, whose range is contained in $\Theta$, with density or frequency function $\pi$. The function $\pi$ represents our belief or information about the parameter $\theta$ before the experiment and is called the prior density or frequency function. We now think of $P_\theta$ as the conditional distribution of $X$ given $\boldsymbol{\theta} = \theta$. The joint distribution of $(\boldsymbol{\theta}, X)$ is that of the outcome of a random experiment in which we first select $\boldsymbol{\theta} = \theta$ according to $\pi$ and then, given $\boldsymbol{\theta} = \theta$, select $X$ according to $P_\theta$. If both $X$ and $\boldsymbol{\theta}$ are continuous or both are discrete, then by (B.1.3), $(\boldsymbol{\theta}, X)$ is appropriately continuous or discrete with density or frequency function,

$$f(\theta, x) = \pi(\theta)\,p(x, \theta). \tag{1.2.3}$$

Because we now think of $p(x, \theta)$ as a conditional density or frequency function given $\boldsymbol{\theta} = \theta$ we will denote it by $p(x \mid \theta)$ for the remainder of this section.

Equation (1.2.2) is an example of (1.2.3). In the "mixed" cases such as $\boldsymbol{\theta}$ continuous $X$ discrete, the joint distribution is neither continuous nor discrete.

The most important feature of a Bayesian model is the conditional distribution of $\boldsymbol{\theta}$ given $X = x$, which is called the posterior distribution of $\boldsymbol{\theta}$. Before the experiment is performed, the information or belief about the true value of the parameter is described by the prior distribution. After the value $x$ has been obtained for $X$, the information about $\theta$ is described by the posterior distribution.

For a concrete illustration, let us turn again to Example 1.1.1. For instance, suppose that $N = 100$ and that from past experience we believe that each item has probability .1 of being defective independently of the other members of the shipment. This would lead to the prior distribution

$$\pi_i = \binom{100}{i}(0.1)^i(0.9)^{100 - i}, \tag{1.2.4}$$

for $i = 0, 1, \ldots, 100$. Before sampling any items the chance that a given shipment contains
20 or more bad items is by the normal approximation with continuity correction, (A.15.10),

$$P[100\boldsymbol{\theta} \ge 20] = P\left[\frac{100\boldsymbol{\theta} - 10}{\sqrt{100(0.1)(0.9)}} \ge \frac{10}{\sqrt{100(0.1)(0.9)}}\right] \approx 1 - \Phi\left(\frac{9.5}{3}\right) = 0.001. \tag{1.2.5}$$

Now suppose that a sample of 19 has been drawn in which 10 defective items are found. This leads to

$$P[100\boldsymbol{\theta} \ge 20 \mid X = 10] \approx 0.30. \tag{1.2.6}$$

To calculate the posterior probability given in (1.2.6) we argue loosely as follows: If before the drawing each item was defective with probability .1 and good with probability .9 independently of the other items, this will continue to be the case for the items left in the lot after the 19 sample items have been drawn. Therefore, $100\boldsymbol{\theta} - X$, the number of defectives left after the drawing, is independent of $X$ and has a $\mathcal{B}(81, 0.1)$ distribution. Thus,

$$\begin{aligned} P[100\boldsymbol{\theta} \ge 20 \mid X = 10] &= P[100\boldsymbol{\theta} - X \ge 10 \mid X = 10] \\ &= P\left[\frac{(100\boldsymbol{\theta} - X) - 8.1}{\sqrt{81(0.9)(0.1)}} \ge \frac{1.9}{\sqrt{81(0.9)(0.1)}}\right] \\ &\approx 1 - \Phi(0.52) = 0.30. \end{aligned} \tag{1.2.7}$$

In general, to calculate the posterior, some variant of Bayes' rule (B.1.4) can be used. Specifically,

(i) The posterior distribution is discrete or continuous according as the prior distribution is discrete or continuous.

(ii) If we denote the corresponding (posterior) frequency function or density by $\pi(\theta \mid x)$, then

$$\pi(\theta \mid x) = \begin{cases} \dfrac{\pi(\theta)\,p(x \mid \theta)}{\sum_t \pi(t)\,p(x \mid t)} & \text{if } \boldsymbol{\theta} \text{ is discrete,} \\[2ex] \dfrac{\pi(\theta)\,p(x \mid \theta)}{\int_{-\infty}^{\infty} \pi(t)\,p(x \mid t)\,dt} & \text{if } \boldsymbol{\theta} \text{ is continuous.} \end{cases} \tag{1.2.8}$$

In the cases where $\boldsymbol{\theta}$ and $X$ are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of $(\boldsymbol{\theta}, X)$ given by (1.2.3). Here is an example.

Example 1.2.1. Bernoulli Trials. Suppose that $X_1, \ldots, X_n$ are indicators of $n$ Bernoulli trials with probability of success $\theta$ where $0 < \theta < 1$. If we assume that $\boldsymbol{\theta}$ has a priori distribution with density $\pi$, we obtain by (1.2.8) as posterior density of $\boldsymbol{\theta}$,

$$\pi(\theta \mid x_1, \ldots, x_n) = \frac{\pi(\theta)\,\theta^k(1 - \theta)^{n - k}}{\int_0^1 \pi(t)\,t^k(1 - t)^{n - k}\,dt} \tag{1.2.9}$$
for $0 < \theta < 1$, $x_i = 0$ or 1, $i = 1, \ldots, n$, $k = \sum_{i=1}^n x_i$.

Note that the posterior density depends on the data only through the total number of successes, $\sum_{i=1}^n X_i$. We also obtain the same posterior density if $\boldsymbol{\theta}$ has prior density $\pi$ and we only observe $\sum_{i=1}^n X_i$, which has a $\mathcal{B}(n, \theta)$ distribution given $\boldsymbol{\theta} = \theta$ (Problem 1.2.9). We can thus write $\pi(\theta \mid k)$ for $\pi(\theta \mid x_1, \ldots, x_n)$, where $k = \sum_{i=1}^n x_i$.

To choose a prior $\pi$, we need a class of distributions that concentrate on the interval (0, 1). One such class is the two-parameter beta family. This class of distributions has the remarkable property that the resulting posterior distributions are again beta distributions. Specifically, upon substituting the $\beta(r, s)$ density (B.2.11) in (1.2.9) we obtain

$$\pi(\theta \mid k) = \frac{\theta^{k+r-1} (1 - \theta)^{n-k+s-1}}{c}. \tag{1.2.10}$$

The proportionality constant $c$, which depends on $k$, $r$, and $s$ only, must (see (B.2.11)) be $B(k + r, n - k + s)$ where $B(\cdot, \cdot)$ is the beta function, and the posterior distribution of $\boldsymbol{\theta}$ given $\sum X_i = k$ is $\beta(k + r, n - k + s)$.

As Figure B.2.2 indicates, the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions though by no means all. For instance, non-U-shaped bimodal distributions are not permitted.

Suppose, for instance, we are interested in the proportion $\theta$ of "geniuses" (IQ $\ge$ 160) in a particular city. To get information we take a sample of $n$ individuals from the city. If $n$ is small compared to the size of the city, (A.15.13) leads us to assume that the number $X$ of geniuses observed has approximately a $\mathcal{B}(n, \theta)$ distribution. Now we may either have some information about the proportion of geniuses in similar cities of the country or we may merely have prejudices that we are willing to express in the form of a prior distribution on $\theta$. We may want to assume that $\boldsymbol{\theta}$ has a density with maximum value at 0 such as that drawn with a dotted line in Figure B.2.2. Or else we may think that $\pi(\theta)$ concentrates its mass near a small number, say 0.05. Then we can choose $r$ and $s$ in the $\beta(r, s)$ distribution, so that the mean is $r/(r + s) = 0.05$ and its variance is very small. The result might be a density such as the one marked with a solid line in Figure B.2.2.

If we were interested in some proportion about which we have no information or belief, we might take $\boldsymbol{\theta}$ to be uniformly distributed on (0, 1), which corresponds to using the beta distribution with $r = s = 1$. □

A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family. Such families are called conjugate. Evidently the beta family is conjugate to the binomial. Another bigger conjugate family is that of finite mixtures of beta distributions; see Problem 1.2.16. We return to conjugate families in Section 1.6.

Summary. We present an elementary discussion of Bayesian models, introduce the notions of prior and posterior distributions and give Bayes rule. We also by example introduce the notion of a conjugate family of distributions.
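The two calculations of this section can be checked numerically. The following is a minimal sketch, not part of the original text: the function names are illustrative assumptions, the beta update simply applies the rule that a $\beta(r, s)$ prior and $k$ successes in $n$ trials give a $\beta(k + r, n - k + s)$ posterior, and the inspection example is evaluated both by the normal approximation with continuity correction and by an exact binomial sum.

```python
import math


def beta_posterior(r, s, k, n):
    """Posterior beta parameters after k successes in n Bernoulli trials,
    starting from a beta(r, s) prior (conjugate update)."""
    return k + r, n - k + s


def beta_mean(r, s):
    """Mean r / (r + s) of a beta(r, s) distribution."""
    return r / (r + s)


def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binomial_upper_tail(n, p, k):
    """Exact P[B(n, p) >= k], summed term by term."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


# Uniform prior (r = s = 1) and, hypothetically, 3 successes in 10 trials.
posterior = beta_posterior(1, 1, 3, 10)

# Inspection example: before sampling, 100*theta ~ B(100, 0.1), mean 10, sd 3.
prior_tail_approx = 1 - normal_cdf(9.5 / 3.0)

# After 10 defectives among 19 sampled, 100*theta - X ~ B(81, 0.1),
# so we need P[B(81, 0.1) >= 10].
sd_remaining = math.sqrt(81 * 0.9 * 0.1)
posterior_tail_approx = 1 - normal_cdf((9.5 - 8.1) / sd_remaining)
posterior_tail_exact = binomial_upper_tail(81, 0.1, 10)

print(posterior, beta_mean(*posterior))
print(prior_tail_approx, posterior_tail_approx, posterior_tail_exact)
```

The exact binomial tail is included only as a cross-check on the normal approximation; the data values (3 successes in 10 trials) are made up for illustration.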
1.3 THE DECISION THEORETIC FRAMEWORK

Given a statistical model, the information we want to draw from data can be put in various forms depending on the purposes of our analysis. We may wish to produce "best guesses" of the values of important parameters, for instance, the fraction defective $\theta$ in Example 1.1.1 or the physical constant $\mu$ in Example 1.1.2. These are estimation problems. In other situations certain $P$ are "special" and we may primarily wish to know whether the data support "specialness" or not. For instance, in Example 1.1.3, $P$'s that correspond to no treatment effect (i.e., placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to permit the marketing of drugs that do no good. If $\mu_0$ is the critical matter density in the universe so that $\mu < \mu_0$ means the universe is expanding forever and $\mu \ge \mu_0$ corresponds to an eternal alternation of Big Bangs and expansions, then depending on one's philosophy one could take either $P$'s corresponding to $\mu < \mu_0$ or those corresponding to $\mu \ge \mu_0$ as special. Making determinations of "specialness" corresponds to testing significance. As the second example suggests, there are many problems of this type in which it's unclear which of two disjoint sets of $P$'s, $\mathcal{P}_0$ or $\mathcal{P}_0^c$, is special and the general testing problem is really one of discriminating between $\mathcal{P}_0$ and $\mathcal{P}_0^c$. For instance, in Example 1.1.1 contractual agreement between shipper and receiver may penalize the return of "good" shipments, say, with $\theta < \theta_0$, whereas the receiver does not wish to keep "bad," $\theta \ge \theta_0$, shipments. Thus, the receiver wants to discriminate and may be able to attach monetary costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment." In testing problems we, at a first cut, state which is supported by the data: "specialness" or, as it's usually called, "hypothesis" or "nonspecialness" (or alternative).

We may have other goals as illustrated by the next two examples.

Example 1.3.1. Ranking. A consumer organization preparing (say) a report on air conditioners tests samples of several brands. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not permitted). Thus, if there are $k$ different brands, there are $k!$ possible rankings or actions, one of which will be announced as more consistent with the data than others. □

Example 1.3.2. Prediction. A very important class of situations arises when, as in Example 1.1.4, we have a vector $z$, such as, say, (age, sex, drug dose)$^T$ that can be used for prediction of a variable of interest $Y$, say a 50-year-old male patient's response to the level of a drug. Intuitively, and as we shall see formally later, a reasonable prediction rule for an unseen $Y$ (response of a new patient) is the function $\mu(z)$, the expected value of $Y$ given $z$. Unfortunately $\mu(z)$ is unknown. However, if we have observations $(z_i, Y_i)$, $1 \le i \le n$, we can try to estimate the function $\mu(\cdot)$. For instance, if we believe $\mu(z) = g(\beta, z)$ we can estimate $\beta$ from our observations $Y_i$ of $g(\beta, z_i)$ and then plug our estimate of $\beta$ into $g$. Note that we really want to estimate the function $\mu(\cdot)$; our results will guide the selection of doses of drug for future patients. □

In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function. There are many possible choices of estimates. In Example 1.1.1 do we use the observed fraction of defectives
$X/n$ as our estimate or ignore the data and use historical information on past shipments, or combine them in some way? In Example 1.1.2 to estimate $\mu$ do we use the mean of the measurements, $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$, or the median, defined as any value such that half the $X_i$ are at least as large and half no bigger? The same type of question arises in all examples. The answer will depend on the model and, most significantly, on what criteria of performance we use. Intuitively, in estimation we care how far off we are, in testing whether we are right or wrong, in ranking what mistakes we've made, and so on. In any case, whatever our choice of procedure we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we're doing. In designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter. That is, we need a priori estimates of how well even the best procedure can do. For instance, in Example 1.1.3 even with the simplest Gaussian model it is intuitively clear and will be made precise later that, even if $\Delta$ is large, a large $\sigma^2$ will force a large $m$, $n$ to give us a good chance of correctly deciding that the treatment effect is there. On the other hand, once a study is carried out we would probably want not only to estimate $\Delta$ but also know how reliable our estimate is. Thus, we would want a posteriori estimates of performance. These examples motivate the decision theoretic framework: We need to

(1) clarify the objectives of a study,
(2) point to what the different possible actions are,
(3) provide assessments of risk, accuracy, and reliability of statistical procedures,
(4) provide guidance in the choice of procedures for analyzing outcomes of experiments.

1.3.1 Components of the Decision Theory Framework

As in Section 1.1, we begin with a statistical model with an observation vector $X$ whose distribution $P$ ranges over a set $\mathcal{P}$. We usually take $\mathcal{P}$ to be parametrized, $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$.

Action space. A new component is an action space $\mathcal{A}$ of actions or decisions or claims that we can contemplate making. Here are action spaces for our examples.

Estimation. If we are estimating a real parameter such as the fraction $\theta$ of defectives, in Example 1.1.1, or $\mu$ in Example 1.1.2, it is natural to take $\mathcal{A} = R$ though smaller spaces may serve equally well, for instance, $\mathcal{A} = \{0, \frac{1}{N}, \ldots, 1\}$ in Example 1.1.1.

Testing. Here only two actions are contemplated: accepting or rejecting the "specialness" of $P$ (or in more usual language the hypothesis $H : P \in \mathcal{P}_0$ in which we identify $\mathcal{P}_0$ with the set of "special" $P$'s). By convention, $\mathcal{A} = \{0, 1\}$ with 1 corresponding to rejection of $H$. Thus, in Example 1.1.3, taking action 1 would mean deciding that $\Delta \ne 0$.

Ranking. Here quite naturally $\mathcal{A} = \{\text{Permutations } (i_1, \ldots, i_k) \text{ of } \{1, \ldots, k\}\}$. Thus, if we have three air conditioners, there are $3! = 6$ possible rankings,

$$\mathcal{A} = \{(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)\}.$$
Prediction. Here $\mathcal{A}$ is much larger. If $Y$ is real, and $z \in Z$, $\mathcal{A} = \{a : a \text{ is a function from } Z \text{ to } R\}$ with $a(z)$ representing the prediction we would make if the new unobserved $Y$ had covariate value $z$. Evidently $Y$ could itself range over an arbitrary space $\mathcal{Y}$ and then $R$ would be replaced by $\mathcal{Y}$ in the definition of $a(\cdot)$. For instance, if $Y = 0$ or 1 corresponds to, say, "does not respond" and "responds," respectively, and $z = (\text{Treatment}, \text{Sex})^T$, then $a(B, M)$ would be our prediction of response or no response for a male given treatment B.

Loss function. Far more important than the choice of action space is the choice of loss function defined as a function $l : \mathcal{P} \times \mathcal{A} \to R^+$. The interpretation of $l(P, a)$, or $l(\theta, a)$ if $\mathcal{P}$ is parametrized, is the nonnegative loss incurred by the statistician if he or she takes action $a$ and the true "state of Nature," that is, the probability distribution producing the data, is $P$. As we shall see, although loss functions, as the name suggests, sometimes can genuinely be quantified in economic terms, they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient.

Estimation. In estimating a real valued parameter $\nu(P)$ or $q(\theta)$ if $\mathcal{P}$ is parametrized the most commonly used loss function is,

Quadratic Loss: $l(P, a) = (\nu(P) - a)^2$ (or $l(\theta, a) = (q(\theta) - a)^2$).

Other choices that are, as we shall see (Section 5.1), less computationally convenient but perhaps more realistically penalize large errors less are

Absolute Value Loss: $l(P, a) = |\nu(P) - a|$,

and

truncated quadratic loss: $l(P, a) = \min\{(\nu(P) - a)^2, d^2\}$.

Closely related to the latter is what we shall call confidence interval loss, $l(P, a) = 0$ if $|\nu(P) - a| < d$, $l(P, a) = 1$ otherwise. This loss expresses the notion that all errors within the limits $\pm d$ are tolerable and outside these limits equally intolerable. Although estimation loss functions are typically symmetric in $\nu$ and $a$, asymmetric loss functions can also be of importance. For instance, $l(P, a) = 1(\nu < a)$, which penalizes only overestimation and by the same amount arises naturally with lower confidence bounds as discussed in Example 1.3.3.

If $\nu = (\nu_1, \ldots, \nu_d) = (q_1(\theta), \ldots, q_d(\theta))$ and $a = (a_1, \ldots, a_d)$ are vectors, examples of loss functions are

$$l(\theta, a) = \frac{1}{d} \sum_{j=1}^d (a_j - \nu_j)^2 = \text{squared Euclidean distance}/d$$
$$l(\theta, a) = \frac{1}{d} \sum_{j=1}^d |a_j - \nu_j| = \text{absolute distance}/d$$
$$l(\theta, a) = \max\{|a_j - \nu_j|, \ j = 1, \ldots, d\} = \text{supremum distance}.$$

We can also consider function valued parameters. For instance, in the prediction example 1.3.2, $\mu(\cdot)$ is the parameter of interest. If we use $a(\cdot)$ as a predictor and the new $z$ has marginal distribution $Q$ then it is natural to consider,

$$l(P, a) = \int (\mu(z) - a(z))^2 \, dQ(z),$$

the expected squared error if $a$ is used. If, say, $Q$ is the empirical distribution of the $z_j$ in
the training set $(z_1, Y_1), \ldots, (z_n, Y_n)$, this leads to the commonly considered

$$l(P, a) = \frac{1}{n} \sum_{j=1}^n (\mu(z_j) - a(z_j))^2,$$

which is just $n^{-1}$ times the squared Euclidean distance between the prediction vector $(a(z_1), \ldots, a(z_n))^T$ and the vector parameter $(\mu(z_1), \ldots, \mu(z_n))^T$.

Testing. We ask whether the parameter $\theta$ is in the subset $\Theta_0$ or subset $\Theta_1$ of $\Theta$, where $\{\Theta_0, \Theta_1\}$ is a partition of $\Theta$ (or equivalently if $P \in \mathcal{P}_0$ or $P \in \mathcal{P}_1$). If we take action $a$ when the parameter is in $\Theta_a$, we have made the correct decision and the loss is zero. Otherwise, the decision is wrong and the loss is taken to equal one. This 0-1 loss function can be written as

0-1 loss: $l(\theta, a) = 0$ if $\theta \in \Theta_a$ (The decision is correct)
$l(\theta, a) = 1$ otherwise (The decision is wrong).

Of course, other economic loss functions may be appropriate. For instance, in Example 1.1.1 suppose returning a shipment with $\theta < \theta_0$ defectives results in a penalty of $s$ dollars whereas every defective item sold results in an $r$ dollar replacement cost. Then the appropriate loss function is

$$l(\theta, 1) = \begin{cases} s & \text{if } \theta < \theta_0 \\ 0 & \text{if } \theta \ge \theta_0 \end{cases} \qquad l(\theta, 0) = rN\theta. \tag{1.3.1}$$

Decision procedures. We next give a representation of the process whereby the statistician uses the data to arrive at a decision. The data is a point $X = x$ in the outcome or sample space $\mathcal{X}$. We define a decision rule or procedure $\delta$ to be any function from the sample space taking its values in $\mathcal{A}$. Using $\delta$ means that if $X = x$ is observed, the statistician takes action $\delta(x)$.

Estimation. For the problem of estimating the constant $\mu$ in the measurement model, we implicitly discussed two estimates or decision rules: $\delta_1(x) = $ sample mean $\bar{x}$ and $\delta_2(x) = \hat{x} = $ sample median.

Testing. In Example 1.1.3 with $X$ and $Y$ distributed as $N(\mu + \Delta, \sigma^2)$ and $N(\mu, \sigma^2)$, respectively, if we are asking whether the treatment effect parameter $\Delta$ is 0 or not, then a reasonable rule is to decide $\Delta = 0$ if our estimate $\bar{x} - \bar{y}$ is close to zero, and to decide $\Delta \ne 0$ if our estimate is not close to zero. Here we mean close to zero relative to the variability in the experiment, that is, relative to the standard deviation $\sigma$. In Section 4.9.3 we will show how to obtain an estimate $\hat{\sigma}$ of $\sigma$ from the data. The decision rule can now be written

$$\delta(x, y) = \begin{cases} 0 & \text{if } \dfrac{|\bar{x} - \bar{y}|}{\hat{\sigma}} < c \\[2ex] 1 & \text{if } \dfrac{|\bar{x} - \bar{y}|}{\hat{\sigma}} \ge c \end{cases} \tag{1.3.2}$$
where $c$ is a positive constant called the critical value. How do we choose $c$? We need the next concept of the decision theoretic framework, the risk or risk function:

The risk function. If $\delta$ is the procedure used, $l$ is the loss function, $\theta$ is the true value of the parameter, and $X = x$ is the outcome of the experiment, then the loss is $l(P, \delta(x))$. We do not know the value of the loss because $P$ is unknown. Moreover, we typically want procedures to have good properties not at just one particular $x$, but for a range of plausible $x$'s. Thus, we turn to the average or mean loss over the sample space. That is, we regard $l(P, \delta(X))$ as a random variable and introduce the risk function

$$R(P, \delta) = E_P[l(P, \delta(X))]$$

as the measure of the performance of the decision rule $\delta(x)$. Thus, for each $\delta$, $R$ maps $\mathcal{P}$ or $\Theta$ to $R^+$. $R(\cdot, \delta)$ is our a priori measure of the performance of $\delta$. We illustrate computation of $R$ and its a priori use in some examples.

Estimation. Suppose $\nu \equiv \nu(P)$ is the real parameter we wish to estimate and $\hat{\nu} \equiv \hat{\nu}(X)$ is our estimator (our decision rule). If we use quadratic loss, our risk function is called the mean squared error (MSE) of $\hat{\nu}$ and is given by

$$MSE(\hat{\nu}) = R(P, \hat{\nu}) = E_P(\hat{\nu}(X) - \nu(P))^2 \tag{1.3.3}$$

where for simplicity dependence on $P$ is suppressed in MSE. The MSE depends on the variance of $\hat{\nu}$ and on what is called the bias of $\hat{\nu}$ where

$$\text{Bias}(\hat{\nu}) = E(\hat{\nu}) - \nu$$

can be thought of as the "long-run average error" of $\hat{\nu}$. A useful result is

Proposition 1.3.1.
$$MSE(\hat{\nu}) = (\text{Bias}\ \hat{\nu})^2 + \text{Var}(\hat{\nu}).$$

Proof. Write the error as

$$(\hat{\nu} - \nu) = [\hat{\nu} - E(\hat{\nu})] + [E(\hat{\nu}) - \nu].$$

If we expand the square of the right-hand side keeping the brackets intact and take the expected value, the cross term will be zero because $E[\hat{\nu} - E(\hat{\nu})] = 0$. The other two terms are $(\text{Bias}\ \hat{\nu})^2$ and $\text{Var}(\hat{\nu})$. (If one side is infinite, so is the other and the result is trivially true.) □

We next illustrate the computation and the a priori and a posteriori use of the risk function.

Example 1.3.3. Estimation of $\mu$ (Continued). Suppose $X_1, \ldots, X_n$ are i.i.d. measurements of $\mu$ with $N(0, \sigma^2)$ errors. If we use the mean $\bar{X}$ as our estimate of $\mu$ and assume quadratic loss, then

$$\text{Bias}(\bar{X}) = 0$$
$$\text{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^n \text{Var}(X_i) = \frac{\sigma^2}{n}$$
and, by Proposition 1.3.1,

$$MSE(\bar{X}) = R(\mu, \bar{X}) = \frac{\sigma^2}{n}, \tag{1.3.4}$$

which doesn't depend on $\mu$.

Suppose that the precision of the measuring instrument $\sigma^2$ is known and equal to $\sigma_0^2$ or where realistically it is known to be $\le \sigma_0^2$. Then (1.3.4) can be used for an a priori estimate of the risk of $\bar{X}$. If we want to be guaranteed $MSE(\bar{X}) \le \varepsilon^2$ we can do it by taking at least $n_0 = (\sigma_0/\varepsilon)^2$ measurements.

If we have no idea of the value of $\sigma^2$, planning is not possible but having taken $n$ measurements we can then estimate $\sigma^2$ by, for instance, $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2$, or $n\hat{\sigma}^2/(n - 1)$, an estimate we can justify later. The a posteriori estimate of risk $\hat{\sigma}^2/n$ is, of course, itself subject to random error.

Suppose that instead of quadratic loss we used the more natural(1) absolute value loss. Then

$$R(\mu, \bar{X}) = E|\bar{X} - \mu| = E|\bar{\varepsilon}|$$

where $\bar{\varepsilon} = \bar{X} - \mu$. If, as we assumed, the $\varepsilon_i$ are $N(0, \sigma^2)$ then by (A.13.23), $(\sqrt{n}/\sigma)\bar{\varepsilon} \sim N(0, 1)$ and

$$R(\mu, \bar{X}) = \frac{\sigma}{\sqrt{n}} \int_{-\infty}^{\infty} |t| \varphi(t)\, dt = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{2}{\pi}}. \tag{1.3.5}$$

This harder calculation already suggests why quadratic loss is really favored. If we only assume, as we discussed in Example 1.1.2, that the $\varepsilon_i$ are i.i.d. with mean 0 and variance $\sigma^2(P)$, then for quadratic loss, $R(P, \bar{X}) = \sigma^2(P)/n$ still, but for absolute value loss only approximate, analytic, or numerical and/or Monte Carlo computation, is possible. In fact, computational difficulties arise even with quadratic loss as soon as we think of estimates other than $\bar{X}$. For instance, if $\hat{X} = \text{median}(X_1, \ldots, X_n)$ (and we, in general, write $\hat{a}$ for a median of $\{a_1, \ldots, a_n\}$), $E(\hat{X} - \mu)^2 = E(\hat{\varepsilon}^2)$ can only be evaluated numerically (see Problem 1.3.6), or approximated asymptotically. □

We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1.3.1 are useful for evaluating the performance of competing estimators.

Example 1.3.4. Let $\mu_0$ denote the mean of a certain measurement included in the U.S. census, say, age or income. Next suppose we are interested in the mean $\mu$ of the same measurement for a certain area of the United States. If we have no data for area A, a natural guess for $\mu$ would be $\mu_0$, whereas if we have a random sample of measurements $X_1, X_2, \ldots, X_n$ from area A, we may want to combine $\mu_0$ and $\bar{X} = n^{-1} \sum_{i=1}^n X_i$ into an estimator, for instance,

$$\hat{\mu} = (0.2)\mu_0 + (0.8)\bar{X}.$$

The choice of the weights 0.2 and 0.8 can only be made on the basis of additional knowledge about demography or the economy. We shall derive them in Section 1.6 through a
formal Bayesian analysis using a normal prior to illustrate a way of bringing in additional knowledge. Here we compare the performances of $\hat{\mu}$ and $\bar{X}$ as estimators of $\mu$ using MSE. We easily find

$$\text{Bias}(\hat{\mu}) = 0.2\mu_0 + 0.8\mu - \mu = 0.2(\mu_0 - \mu)$$
$$\text{Var}(\hat{\mu}) = (0.8)^2 \text{Var}(\bar{X}) = (.64)\sigma^2/n$$
$$R(\mu, \hat{\mu}) = MSE(\hat{\mu}) = .04(\mu_0 - \mu)^2 + (.64)\sigma^2/n.$$

If $\mu$ is close to $\mu_0$, the risk $R(\mu, \hat{\mu})$ of $\hat{\mu}$ is smaller than the risk $R(\mu, \bar{X}) = \sigma^2/n$ of $\bar{X}$ with the minimum relative risk $\inf\{MSE(\hat{\mu})/MSE(\bar{X});\ \mu \in R\}$ being 0.64 when $\mu = \mu_0$. Figure 1.3.1 gives the graphs of $MSE(\hat{\mu})$ and $MSE(\bar{X})$ as functions of $\mu$. Because we do not know the value of $\mu$, using MSE, neither estimator can be proclaimed as being better than the other. However, if we use as our criteria the maximum (over $\mu$) of the MSE (called the minimax criteria), then $\bar{X}$ is optimal (Example 3.3.4).

Figure 1.3.1. The mean squared errors of $\bar{X}$ and $\hat{\mu}$. The two MSE curves cross at $\mu = \mu_0 \pm 3\sigma/\sqrt{n}$.

Testing. The test rule (1.3.2) for deciding between $\Delta = 0$ and $\Delta \ne 0$ can only take on the two values 0 and 1; thus, the risk is

$$R(\Delta, \delta) = l(\Delta, 0) P[\delta(X, Y) = 0] + l(\Delta, 1) P[\delta(X, Y) = 1],$$

which in the case of 0-1 loss is

$$R(\Delta, \delta) = \begin{cases} P[\delta(X, Y) = 1] & \text{if } \Delta = 0 \\ P[\delta(X, Y) = 0] & \text{if } \Delta \ne 0. \end{cases}$$

In the general case $\mathcal{X}$ and $\Theta$ denote the outcome and parameter space, respectively, and we are to decide whether $\theta \in \Theta_0$ or $\theta \in \Theta_1$, where $\Theta = \Theta_0 \cup \Theta_1$, $\Theta_0 \cap \Theta_1 = \emptyset$. A test
function is a decision rule $\delta(X)$ that equals 1 on a set $C \subset \mathcal{X}$ called the critical region and equals 0 on the complement of $C$; that is, $\delta(X) = 1[X \in C]$, where 1 denotes the indicator function. If $\delta(X) = 1$ and we decide $\theta \in \Theta_1$ when in fact $\theta \in \Theta_0$, we call the error committed a Type I error, whereas if $\delta(X) = 0$ and we decide $\theta \in \Theta_0$ when in fact $\theta \in \Theta_1$, we call the error a Type II error. Thus, the risk of $\delta(X)$ is

$$R(\theta, \delta) = E(\delta(X)) = P(\delta(X) = 1) \ \text{ if } \theta \in \Theta_0 \quad (\text{Probability of Type I error})$$
$$R(\theta, \delta) = P(\delta(X) = 0) \ \text{ if } \theta \in \Theta_1 \quad (\text{Probability of Type II error}). \tag{1.3.6}$$

Finding good test functions corresponds to finding critical regions with small probabilities of error. In the Neyman-Pearson framework of statistical hypothesis testing, the focus is on first providing a small bound, say .05, on the probability of Type I error, and then trying to minimize the probability of a Type II error. For instance, in the treatments A and B example, we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding $\Delta \ne 0$ when $\Delta = 0$), and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding $\Delta = 0$ when $\Delta \ne 0$).

This is not the only approach to testing. For instance, the loss function (1.3.1) and tests $\delta_k$ of the form, "Reject the shipment if and only if $X \ge k$," in Example 1.1.1 lead to (Problem 1.3.18)

$$R(\theta, \delta_k) = \begin{cases} s P_\theta[X \ge k] + rN\theta P_\theta[X < k], & \theta < \theta_0 \\ rN\theta P_\theta[X < k], & \theta \ge \theta_0. \end{cases} \tag{1.3.7}$$

Confidence Bounds and Intervals

Decision theory enables us to think clearly about an important hybrid of testing and estimation, confidence bounds and intervals (and more generally regions). Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter $\nu$. For instance, an accounting firm examining accounts receivable for a firm on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed. If (say) $X$ represents the amount owed in the sample and $\nu$ is the unknown total amount owed, it is natural to seek $\bar{\nu}(X)$ such that

$$P[\bar{\nu}(X) \ge \nu] \ge 1 - \alpha \tag{1.3.8}$$

for all possible distributions $P$ of $X$. Such a $\bar{\nu}$ is called a $(1 - \alpha)$ upper confidence bound on $\nu$. Here $\alpha$ is small, usually .05 or .01 or less. This corresponds to an a priori bound on the risk of $\alpha$ on $\bar{\nu}(X)$ viewed as a decision procedure with action space $R$ and loss function,

$$l(P, a) = \begin{cases} 0, & a \ge \nu(P) \\ 1, & a < \nu(P), \end{cases}$$
an asymmetric estimation type loss function. The 0-1 nature makes it resemble a testing loss function and, as we shall see in Chapter 4, the connection is close. It is clear, though, that this formulation is inadequate because by taking $\bar{\nu} \equiv \infty$ we can achieve risk $\equiv 0$. What is missing is the fact that, though upper bounding is the primary goal, in fact it is important to get close to the truth: knowing that at most $\infty$ dollars are owed is of no use. The decision theoretic framework accommodates by adding a component reflecting this. For instance

$$l(P, a) = \begin{cases} a - \nu(P), & a \ge \nu(P) \\ c, & a < \nu(P), \end{cases}$$

for some constant $c > 0$. Typically, rather than this Lagrangian form, it is customary to first fix $\alpha$ in (1.3.8) and then see what one can do to control (say) $R(P, \bar{\nu}) = E(\bar{\nu}(X) - \nu(P))^+$, where $x^+ = x 1(x \ge 0)$.

The same issue arises when we are interested in a confidence interval $[\underline{\nu}(X), \bar{\nu}(X)]$ for $\nu$ defined by the requirement that

$$P[\underline{\nu}(X) \le \nu(P) \le \bar{\nu}(X)] \ge 1 - \alpha$$

for all $P \in \mathcal{P}$. We shall go into this further in Chapter 4.

We next turn to the final topic of this section, general criteria for selecting "optimal" procedures.

1.3.2 Comparison of Decision Procedures

In this section we introduce a variety of concepts used in the comparison of decision procedures. We shall illustrate some of the relationships between these ideas using the following simple example in which $\Theta$ has two members, $\mathcal{A}$ has three points, and the risk of all possible decision procedures can be computed and plotted. We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model.

Example 1.3.5. Suppose we have two possible states of nature, which we represent by $\theta_1$ and $\theta_2$. For instance, a certain location either contains oil or does not; a patient either has a certain disease or does not, and so on; a component in a piece of equipment either works or does not work. Suppose that three possible actions, $a_1$, $a_2$, and $a_3$, are available. In the context of the foregoing examples, we could drill for oil, sell the location, or sell partial rights; we could operate, administer drugs, or wait and see; we could leave the component in, replace it, or repair it. Suppose the following loss function is decided on

TABLE 1.3.1. The loss function $l(\theta, a)$

                        (Drill)   (Sell)   (Partial rights)
                          a_1      a_2          a_3
(Oil)      theta_1         0       10            5
(No oil)   theta_2        12        1            6
Thus, if there is oil and we drill, the loss is zero, whereas if there is no oil and we drill, the loss is 12, and so on. Next, an experiment is conducted to obtain information about $\theta$ resulting in the random variable $X$ with possible values coded as 0, 1, and frequency function $p(x, \theta)$ given by the following table

TABLE 1.3.2. The frequency function $p(x, \theta_i)$; $i = 1, 2$

                        Rock formation x
                          0        1
(Oil)      theta_1       0.3      0.7
(No oil)   theta_2       0.6      0.4

Thus, $X$ may represent a certain geological formation, and when there is oil, it is known that formation 0 occurs with frequency 0.3 and formation 1 with frequency 0.7, whereas if there is no oil, formations 0 and 1 occur with frequencies 0.6 and 0.4. We list all possible decision rules in the following table.

TABLE 1.3.3. Possible decision rules $\delta_i(x)$

i        1     2     3     4     5     6     7     8     9
x = 0   a_1   a_1   a_1   a_2   a_2   a_2   a_3   a_3   a_3
x = 1   a_1   a_2   a_3   a_1   a_2   a_3   a_1   a_2   a_3

Here, $\delta_1$ represents "Take action $a_1$ regardless of the value of $X$," $\delta_2$ corresponds to "Take action $a_1$, if $X = 0$; take action $a_2$, if $X = 1$," and so on.

The risk of $\delta$ at $\theta$ is

$$R(\theta, \delta) = E[l(\theta, \delta(X))] = l(\theta, a_1) P[\delta(X) = a_1] + l(\theta, a_2) P[\delta(X) = a_2] + l(\theta, a_3) P[\delta(X) = a_3].$$

For instance,

$$R(\theta_1, \delta_2) = 0(0.3) + 10(0.7) = 7$$
$$R(\theta_2, \delta_2) = 12(0.6) + 1(0.4) = 7.6.$$

If $\Theta$ is finite and has $k$ members, we can represent the whole risk function of a procedure $\delta$ by a point in $k$-dimensional Euclidean space, $(R(\theta_1, \delta), \ldots, R(\theta_k, \delta))$ and if $k = 2$ we can plot the set of all such points obtained by varying $\delta$. The risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$ are given in Table 1.3.4 and graphed in Figure 1.3.2 for $i = 1, \ldots, 9$.

TABLE 1.3.4. Risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$

i                       1     2     3     4     5     6     7     8     9
R(theta_1, delta_i)     0     7    3.5    3    10    6.5   1.5   8.5    5
R(theta_2, delta_i)    12    7.6   9.6   5.4    1     3    8.4   4.0    6

It remains to pick out the rules that are "good" or "best." Criteria for doing this will be introduced in the next subsection. □
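The risk points of Example 1.3.5 can be recomputed directly from the loss table and the frequency table. The sketch below is an illustration rather than part of the original text; the dictionary layout and the names `LOSS`, `FREQ`, `RULES`, and `risk` are assumptions chosen for readability. Rules are enumerated in the order of Table 1.3.3 (action at x = 0 first, then action at x = 1).

```python
from itertools import product

# Loss l(theta, a) from Table 1.3.1.
LOSS = {
    "theta1": {"a1": 0, "a2": 10, "a3": 5},   # oil
    "theta2": {"a1": 12, "a2": 1, "a3": 6},   # no oil
}

# Frequency function p(x, theta) from Table 1.3.2.
FREQ = {
    "theta1": {0: 0.3, 1: 0.7},
    "theta2": {0: 0.6, 1: 0.4},
}

ACTIONS = ("a1", "a2", "a3")

# The nine nonrandomized rules: each maps x in {0, 1} to an action.
# product(...) yields (a1, a1), (a1, a2), ..., matching delta_1, ..., delta_9.
RULES = [dict(zip((0, 1), acts)) for acts in product(ACTIONS, repeat=2)]


def risk(theta, rule):
    """R(theta, delta) = sum over x of l(theta, delta(x)) * p(x, theta)."""
    return sum(LOSS[theta][rule[x]] * FREQ[theta][x] for x in (0, 1))


RISK_POINTS = [(risk("theta1", d), risk("theta2", d)) for d in RULES]

for i, (r1, r2) in enumerate(RISK_POINTS, start=1):
    print(f"delta_{i}: R(theta1) = {r1:.1f}, R(theta2) = {r2:.1f}")
```

Because the rule space is just the set of functions from {0, 1} to the three actions, the same enumeration would work for any finite sample space and action space, though the number of rules grows quickly.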
Figure 1.3.2. The risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$, $i = 1, \ldots, 9$. (Horizontal axis: $R(\theta_1, \delta_i)$; vertical axis: $R(\theta_2, \delta_i)$.)

1.3.3 Bayes and Minimax Criteria

The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing. We say that a procedure $\delta$ improves a procedure $\delta'$ if, and only if,

$$R(\theta, \delta) \le R(\theta, \delta')$$

for all $\theta$ with strict inequality for some $\theta$. It is easy to see that there is typically no rule $\delta$ that improves all others. For instance, in estimating $\theta \in R$ when $X \sim N(\theta, \sigma_0^2)$, if we ignore the data and use the estimate $\hat{\theta} = 0$, we obtain $MSE(\hat{\theta}) = \theta^2$. The absurd rule "$\delta^*(X) = 0$" cannot be improved on at the value $\theta = 0$ because $E_0(\delta^2(X)) = 0$ if and only if $\delta(X) = 0$. Usually, if $\delta$ and $\delta'$ are two rules, neither improves the other. Consider, for instance, $\delta_4$ and $\delta_6$ in our example. Here $R(\theta_1, \delta_4) < R(\theta_1, \delta_6)$ but $R(\theta_2, \delta_4) > R(\theta_2, \delta_6)$. The problem of selecting good decision procedures has been attacked in a variety of ways.

(1) Narrow classes of procedures have been proposed using criteria such as considerations of symmetry, unbiasedness (for estimates and tests), or level of significance (for tests). Researchers have then sought procedures that improve all others within the class. We shall pursue this approach further in Chapter 3. Extensions of unbiasedness ideas may be found in Lehmann (1997, Section 1.5). Symmetry (or invariance) restrictions are discussed in Ferguson (1967).

(2) A second major approach has been to compare risk functions by global criteria rather than on a pointwise basis. We shall discuss the Bayes and minimax criteria.

Bayes: The Bayesian point of view leads to a natural global criterion. Recall that in the Bayesian model $\theta$ is the realization of a random variable or vector $\boldsymbol{\theta}$ and that $P_\theta$ is the conditional distribution of $X$ given $\boldsymbol{\theta} = \theta$. In this framework $R(\theta, \delta)$ is just $E[l(\boldsymbol{\theta}, \delta(X)) \mid \boldsymbol{\theta} = \theta]$, the expected loss, if we use $\delta$ and $\boldsymbol{\theta} = \theta$. If we adopt the Bayesian point of view, we need not stop at this point, but can proceed to calculate what we expect to lose on the average as $\boldsymbol{\theta}$ varies. This quantity which we shall call the Bayes risk of $\delta$ and denote $r(\delta)$ is then, given by

$$r(\delta) = E[R(\boldsymbol{\theta}, \delta)] = E[l(\boldsymbol{\theta}, \delta(X))]. \tag{1.3.9}$$

The second preceding identity is a consequence of the double expectation theorem (B.1.20) in Appendix B.

To illustrate, suppose that in the oil drilling example an expert thinks the chance of finding oil is .2. Then we treat the parameter as a random variable $\boldsymbol{\theta}$ with possible values $\theta_1$, $\theta_2$ and frequency function

$$\pi(\theta_1) = 0.2, \quad \pi(\theta_2) = 0.8.$$

The Bayes risk of $\delta$ is, therefore,

$$r(\delta) = 0.2 R(\theta_1, \delta) + 0.8 R(\theta_2, \delta). \tag{1.3.10}$$

Table 1.3.5 gives $r(\delta_1), \ldots, r(\delta_9)$ specified by (1.3.10).

TABLE 1.3.5. Bayes and maximum risks of the procedures of Table 1.3.3

i                                                1     2      3      4     5     6     7     8     9
r(delta_i)                                      9.6   7.48   8.38   4.92   2.8   3.7   7.02   4.9   5.8
max{R(theta_1, delta_i), R(theta_2, delta_i)}   12    7.6    9.6    5.4    10    6.5   8.4    8.5    6

In the Bayesian framework $\delta$ is preferable to $\delta'$ if, and only if, it has smaller Bayes risk. If there is a rule $\delta^*$, which attains the minimum Bayes risk, that is, such that

$$r(\delta^*) = \min_\delta r(\delta)$$

then it is called a Bayes rule. From Table 1.3.5 we see that rule $\delta_5$ is the unique Bayes rule for our prior.

The method of computing Bayes procedures by listing all available $\delta$ and their Bayes risk is impracticable in general. We postpone the consideration of posterior analysis, the only reasonable computational method, to Section 3.2.

Note that the Bayes approach leads us to compare procedures on the basis of,

$$r(\delta) = \sum_\theta R(\theta, \delta) \pi(\theta),$$

if $\boldsymbol{\theta}$ is discrete with frequency function $\pi(\theta)$, and

$$r(\delta) = \int R(\theta, \delta) \pi(\theta)\, d\theta,$$

if $\boldsymbol{\theta}$ is continuous with density $\pi(\theta)$.
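Both rows of Table 1.3.5 follow mechanically from the risk points of Table 1.3.4. As a small sketch (the variable names are illustrative assumptions, and the risk points are copied from Table 1.3.4 rather than recomputed), the Bayes risk under the prior (0.2, 0.8) is a weighted sum of the two coordinates and the maximum risk is their maximum:

```python
# Risk points (R(theta1, delta_i), R(theta2, delta_i)), i = 1..9, from Table 1.3.4.
RISK_POINTS = [(0, 12), (7, 7.6), (3.5, 9.6), (3, 5.4), (10, 1),
               (6.5, 3), (1.5, 8.4), (8.5, 4.0), (5, 6)]

# Prior frequency function: pi(theta1) = 0.2, pi(theta2) = 0.8.
PRIOR = (0.2, 0.8)

# Bayes risk r(delta_i) = 0.2 * R(theta1, delta_i) + 0.8 * R(theta2, delta_i).
bayes_risk = [PRIOR[0] * r1 + PRIOR[1] * r2 for r1, r2 in RISK_POINTS]

# Maximum risk of each rule over the two states of nature.
max_risk = [max(point) for point in RISK_POINTS]

# Index (1-based) of the nonrandomized rule with the smallest Bayes risk.
bayes_rule = 1 + min(range(len(bayes_risk)), key=bayes_risk.__getitem__)

for i, (r, m) in enumerate(zip(bayes_risk, max_risk), start=1):
    print(f"delta_{i}: Bayes risk = {r:.2f}, max risk = {m}")
print("Bayes rule among the nine: delta_%d" % bayes_rule)
```

Changing `PRIOR` shows how the preferred rule depends on the weights attached to the two states of nature.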
Such comparisons make sense even if we do not interpret $\pi$ as a prior density or frequency, but only as a weight function for averaging the values of the function $R(\theta, \delta)$. For instance, in Example 1.3.5 we might feel that both values of the risk were equally important. It is then natural to compare procedures using the simple average $\frac{1}{2}[R(\theta_1, \delta) + R(\theta_2, \delta)]$. But this is just Bayes comparison where $\pi$ places equal probability on $\theta_1$ and $\theta_2$.

Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk. That is, we prefer $\delta$ to $\delta'$, if and only if,

$$\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta').$$

A procedure $\delta^*$, which minimizes the maximum risk,

$$\sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta),$$

is called minimax (minimizes the maximum risk).

The criterion comes from the general theory of two-person zero sum games of von Neumann.(2) We briefly indicate "the game of decision theory." Nature (Player I) picks a point $\theta \in \Theta$ independently of the statistician (Player II), who picks a decision procedure $\delta$ from $\mathcal{D}$, the set of all decision procedures. Player II then pays Player I, $R(\theta, \delta)$. The maximum risk of $\delta^*$ is the upper pure value of the game.

This criterion of optimality is very conservative. It aims to give maximum protection against the worst that can happen, Nature's choosing a $\theta$, which makes the risk as large as possible. The principle would be compelling, if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used. Of course, Nature's intentions and degree of foreknowledge are not that clear and most statisticians find the minimax principle too conservative to employ as a general rule. Nevertheless, in many cases the principle can lead to very reasonable procedures.

To illustrate computation of the minimax rule we turn to Table 1.3.4. From the listing of $\max(R(\theta_1, \delta), R(\theta_2, \delta))$ we see that $\delta_4$ is minimax with a maximum risk of 5.4.

Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to employ. For instance, suppose that, in Example 1.3.5, we toss a fair coin and use $\delta_4$ if the coin lands heads and $\delta_6$ otherwise. Our expected risk would be,

$$4.75 \ \text{ if } \theta = \theta_1, \qquad 4.20 \ \text{ if } \theta = \theta_2.$$

The maximum risk 4.75 is strictly less than that of $\delta_4$.

Randomized decision rules: In general, if $\mathcal{D}$ is the class of all decision procedures (nonrandomized), a randomized decision procedure can be thought of as a random experiment whose outcomes are members of $\mathcal{D}$.
procedures that select among a finite set $\delta_1, \ldots, \delta_q$ of nonrandomized procedures. If the randomized procedure $\delta$ selects $\delta_i$ with probability $\lambda_i$, $i = 1, \ldots, q$, $\sum_{i=1}^q \lambda_i = 1$, we then define

$$R(\theta, \delta) = \sum_{i=1}^q \lambda_i R(\theta, \delta_i). \tag{1.3.11}$$

Similarly we can define, given a prior $\pi$ on $\Theta$, the Bayes risk of $\delta$

$$r(\delta) = \sum_{i=1}^q \lambda_i E[R(\theta, \delta_i)]. \tag{1.3.12}$$

A randomized Bayes procedure $\delta^*$ minimizes $r(\delta)$ among all randomized procedures. A randomized minimax procedure minimizes $\max_\theta R(\theta, \delta)$ among all randomized procedures.

We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1.3.5. We will then indicate how much of what we learn carries over to the general case. As in Example 1.3.5, we represent the risk of any procedure $\delta$ by the vector $(R(\theta_1, \delta), R(\theta_2, \delta))$ and consider the risk set

$$S = \{(R(\theta_1, \delta), R(\theta_2, \delta)) : \delta \in \mathcal{D}^*\}$$

where $\mathcal{D}^*$ is the set of all procedures, including randomized ones. By (1.3.11),

$$S = \Big\{(r_1, r_2) : r_1 = \sum_{i=1}^9 \lambda_i R(\theta_1, \delta_i),\ r_2 = \sum_{i=1}^9 \lambda_i R(\theta_2, \delta_i),\ \lambda_i \ge 0,\ \sum_{i=1}^9 \lambda_i = 1\Big\}.$$

That is, $S$ is the convex hull of the risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$, $i = 1, \ldots, 9$ (Figure 1.3.3).

If $\pi(\theta_1) = \gamma = 1 - \pi(\theta_2)$, $0 \le \gamma \le 1$, then all rules having Bayes risk $c$ correspond to points in $S$ that lie on the line

$$\gamma r_1 + (1 - \gamma) r_2 = c. \tag{1.3.13}$$

As $c$ varies, (1.3.13) defines a family of parallel lines with slope $-\gamma/(1 - \gamma)$. Finding the Bayes rule corresponds to finding the smallest $c$ for which the line (1.3.13) intersects $S$. This is that line with slope $-\gamma/(1 - \gamma)$ that is tangent to $S$ at the lower boundary of $S$. All points of $S$ that are on the tangent are Bayes. Two cases arise:

(1) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule. For instance, when $\gamma = 0.2$ in Example 1.3.5, this point is $(10, 1)$, which is the risk point of the Bayes rule $\delta_5$ (see Figure 1.3.3).

(2) The tangent is the line connecting two "nonrandomized" risk points $\delta_i$, $\delta_j$. A point $(r_1, r_2)$ on this line can be written

$$r_1 = \lambda R(\theta_1, \delta_i) + (1 - \lambda) R(\theta_1, \delta_j), \qquad r_2 = \lambda R(\theta_2, \delta_i) + (1 - \lambda) R(\theta_2, \delta_j), \tag{1.3.14}$$
where $0 \le \lambda \le 1$, and, by (1.3.11), corresponds to the values

$$\delta = \begin{cases} \delta_i & \text{with probability } \lambda \\ \delta_j & \text{with probability } (1 - \lambda), \end{cases} \qquad 0 \le \lambda \le 1. \tag{1.3.15}$$

Each one of these rules, as $\lambda$ ranges from 0 to 1, is Bayes against $\pi$. We can choose two nonrandomized Bayes rules from this class, namely $\delta_i$ (take $\lambda = 1$) and $\delta_j$ (take $\lambda = 0$).

[Figure 1.3.3. The convex hull $S$ of the risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$, $i = 1, \ldots, 9$, plotted with $r_1$ on the horizontal axis and $r_2$ on the vertical axis. The point where the square $Q(c^*)$ defined by (1.3.16) touches $S$ is the risk point of the minimax rule.]

Because changing the prior $\pi$ corresponds to changing the slope $-\gamma/(1 - \gamma)$ of the line given by (1.3.13), the set $B$ of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of $S$ (i.e., all points on the lower boundary of $S$ that have as tangents the $y$ axis or lines with nonpositive slopes).

To locate the risk point of the minimax rule consider the family of squares,

$$Q(c) = \{(r_1, r_2) : 0 \le r_1 \le c,\ 0 \le r_2 \le c\}, \tag{1.3.16}$$

whose diagonal is the line $r_1 = r_2$. Let $c^*$ be the smallest $c$ for which $Q(c) \cap S \ne \emptyset$ (i.e., the first square that touches $S$). Then $Q(c^*) \cap S$ is either a point or a horizontal or vertical line segment. See Figure 1.3.3. It is the set of risk points of minimax rules because any point with smaller maximum risk would belong to $Q(c) \cap S$ with $c < c^*$, contradicting the choice of $c^*$. In our example, the first point of contact between the squares and $S$ is the
intersection between $r_1 = r_2$ and the line connecting the two points corresponding to $\delta_4$ and $\delta_6$. Thus, the minimax rule is given by (1.3.14) with $i = 4$, $j = 6$ and $\lambda$ the solution of

$$r_1 = \lambda R(\theta_1, \delta_4) + (1 - \lambda) R(\theta_1, \delta_6) = r_2 = \lambda R(\theta_2, \delta_4) + (1 - \lambda) R(\theta_2, \delta_6).$$

From Table 1.3.4, this equation becomes

$$3\lambda + 6.5(1 - \lambda) = 5.4\lambda + 3(1 - \lambda),$$

which yields $\lambda \cong 0.59$.

There is another important concept that we want to discuss in the context of the risk set. A decision rule $\delta$ is said to be inadmissible if there exists another rule $\delta'$ such that $\delta'$ improves $\delta$. Naturally, all rules that are not inadmissible are called admissible. Using Table 1.3.4 we can see, for instance, that $\delta_2$ is inadmissible because $\delta_4$ improves it (i.e., $R(\theta_1, \delta_4) = 3 < 7 = R(\theta_1, \delta_2)$ and $R(\theta_2, \delta_4) = 5.4 < 7.6 = R(\theta_2, \delta_2)$).

To gain some insight into the class of all admissible procedures (randomized and nonrandomized) we again use the risk set. A rule $\delta$ with risk point $(r_1, r_2)$ is admissible, if and only if, there is no other point $(x, y)$ in $S$ such that $x \le r_1$ and $y \le r_2$, or equivalently, if and only if, $\{(x, y) : x \le r_1,\ y \le r_2\}$ has only $(r_1, r_2)$ in common with $S$. From the figure it is clear that such points must be on the lower left boundary. In fact, the set of all lower left boundary points of $S$ corresponds to the class of admissible rules and, thus, agrees with the set of risk points of Bayes procedures.

If $\Theta$ is finite, $\Theta = \{\theta_1, \ldots, \theta_k\}$, we can define the risk set in general as

$$S = \{(R(\theta_1, \delta), \ldots, R(\theta_k, \delta)) : \delta \in \mathcal{D}^*\}$$

where $\mathcal{D}^*$ is the set of all randomized decision procedures. The following features exhibited by the risk set by Example 1.3.5 can be shown to hold generally (see Ferguson, 1967, for instance).

(a) For any prior there is always a nonrandomized Bayes procedure, if there is a randomized one.(3) Randomized Bayes procedures are mixtures of nonrandomized ones in the sense of (1.3.14).

(b) The set $B$ of risk points of Bayes procedures consists of risk points on the lower boundary of $S$ whose tangent hyperplanes have normals pointing into the positive quadrant.

(c) If $\Theta$ is finite(4) and minimax procedures exist, they are Bayes procedures.

(d) All admissible procedures are Bayes procedures.

(e) If a Bayes prior has $\pi(\theta_i) > 0$ for all $i$, then any Bayes procedure corresponding to $\pi$ is admissible.

If $\Theta$ is not finite there are typically admissible procedures that are not Bayes. However, under some conditions, all admissible procedures are either Bayes procedures or limits of
Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility.

Other theorems are available characterizing larger but more manageable classes of procedures, which include the admissible rules, at least when procedures with the same risk function are identified. An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson, 1967, Section 3.4). We stress that looking at randomized procedures is essential for these conclusions, although it usually turns out that all admissible procedures of interest are indeed nonrandomized. For more information on these topics, we refer to Blackwell and Girshick (1954) and Ferguson (1967).

Summary. We introduce the decision theoretic foundation of statistics including the notions of action space, decision rule, loss function, and risk through various examples including estimation, testing, confidence bounds, ranking, and prediction. The basic bias-variance decomposition of mean square error is presented. The basic global comparison criteria Bayes and minimax are presented as well as a discussion of optimality by restriction and notions of admissibility.

1.4 PREDICTION

The prediction Example 1.3.2 presented important situations in which a vector $z$ of covariates can be used to predict an unseen response $Y$. Here are some further examples of the kind of situation that prompts our study in this section. A college admissions officer has available the College Board scores at entrance and first-year grade point averages of freshman classes for a period of several years. Using this information, he wants to predict the first-year grade point averages of entering freshmen on the basis of their College Board scores. A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio. A meteorologist wants to estimate the amount of rainfall in the coming spring. A government expert wants to predict the amount of heating oil needed next winter. Similar problems abound in every field. The frame we shall fit them into is the following.

We assume that we know the joint probability distribution of a random vector (or variable) $Z$ and a random variable $Y$. We want to find a function $g$ defined on the range of $Z$ such that $g(Z)$ (the predictor) is "close" to $Y$. In terms of our preceding discussion, $Z$ is the information that we have and $Y$ the quantity to be predicted. For example, in the college admissions situation, $Z$ would be the College Board score of an entering freshman and $Y$ his or her first-year grade point average. The joint distribution of $Z$ and $Y$ can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal. Next we must specify what close means. One reasonable measure of "distance" is $(g(Z) - Y)^2$, which is the squared prediction error when $g(Z)$ is used to predict $Y$. Since $Y$ is not known, we turn to the mean squared prediction error (MSPE)

$$\Delta^2(Y, g(Z)) = E[g(Z) - Y]^2$$

or its square root $\sqrt{E(g(Z) - Y)^2}$. The MSPE is the measure traditionally used in the
mathematical theory of prediction whose deeper results (see, for example, Grenander and Rosenblatt, 1957) presuppose it. The method that we employ to prove our elementary theorems does generalize to other measures of distance than $\Delta(Y, g(Z))$ such as the mean absolute error $E(|g(Z) - Y|)$ (Problems 1.4.7-11). Just how widely applicable the notions of this section are will become apparent in Remark 1.4.5 and Section 3.2 where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss.

The class $\mathcal{G}$ of possible predictors $g$ may be the nonparametric class $\mathcal{G}_{NP}$ of all $g : R^d \to R$ or it may be some subset of this class. See Remark 1.4.6. In this section we consider $\mathcal{G}_{NP}$ and the class $\mathcal{G}_L$ of linear predictors of the form $a + \sum_{j=1}^d b_j Z_j$.

We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information, or equivalently, in which $Z$ is a constant; see Example 1.3.4. In this situation all predictors are constant and the best one is that number $c_0$ that minimizes $E(Y - c)^2$ as a function of $c$.

Lemma 1.4.1. $E(Y - c)^2$ is either $\infty$ for all $c$ or is minimized uniquely by $c = \mu = E(Y)$. In fact, when $EY^2 < \infty$,

$$E(Y - c)^2 = \operatorname{Var} Y + (c - \mu)^2. \tag{1.4.1}$$

Proof. $EY^2 < \infty$ if and only if $E(Y - c)^2 < \infty$ for all $c$; see Problem 1.4.25. $EY^2 < \infty$ implies that $\mu$ exists, and by expanding

$$Y - c = (Y - \mu) + (\mu - c),$$

(1.4.1) follows because $E(Y - \mu) = 0$ makes the cross product term vanish. We see that $E(Y - c)^2$ has a unique minimum at $c = \mu$ and the lemma follows. □

Now we can solve the problem of finding the best MSPE predictor of $Y$, given a vector $Z$; that is, we can find the $g$ that minimizes $E(Y - g(Z))^2$. By the substitution theorem for conditional expectations (B.1.16), we have

$$E[(Y - g(Z))^2 \mid Z = z] = E[(Y - g(z))^2 \mid Z = z]. \tag{1.4.2}$$

Let

$$\mu(z) = E(Y \mid Z = z).$$

Because $g(z)$ is a constant, Lemma 1.4.1 assures us that

$$E[(Y - g(z))^2 \mid Z = z] = E[(Y - \mu(z))^2 \mid Z = z] + [g(z) - \mu(z)]^2. \tag{1.4.3}$$

If we now take expectations of both sides and employ the double expectation theorem (B.1.20), we can conclude that

Theorem 1.4.1. If $Z$ is any random vector and $Y$ any random variable, then either $E(Y - g(Z))^2 = \infty$ for every function $g$ or

$$E(Y - \mu(Z))^2 \le E(Y - g(Z))^2 \tag{1.4.4}$$
for every $g$ with strict inequality holding unless $g(Z) = \mu(Z)$. That is, $\mu(Z)$ is the unique best MSPE predictor. In fact, when $E(Y^2) < \infty$,

$$E(Y - g(Z))^2 = E(Y - \mu(Z))^2 + E(g(Z) - \mu(Z))^2. \tag{1.4.5}$$

An important special case of (1.4.5) is obtained by taking $g(z) = E(Y)$ for all $z$. Write $\operatorname{Var}(Y \mid z)$ for the variance of the conditional distribution of $Y$ given $Z = z$, that is, $\operatorname{Var}(Y \mid z) = E([Y - E(Y \mid z)]^2 \mid z)$, and recall (B.1.20); then (1.4.5) becomes

$$\operatorname{Var} Y = E(\operatorname{Var}(Y \mid Z)) + \operatorname{Var}(E(Y \mid Z)), \tag{1.4.6}$$

which is generally valid because if one side is infinite, so is the other.

Property (1.4.6) is linked to a notion that we now define: Two random variables $U$ and $V$ with $E|UV| < \infty$ are said to be uncorrelated if

$$E[V - E(V)][U - E(U)] = 0.$$

Equivalently $U$ and $V$ are uncorrelated if either $EV[U - E(U)] = 0$ or $EU[V - E(V)] = 0$.

Let $\epsilon = Y - \mu(Z)$ denote the random prediction error; then we can write

$$Y = \mu(Z) + \epsilon.$$

Proposition 1.4.1. Suppose that $\operatorname{Var} Y < \infty$, then

(a) $\epsilon$ is uncorrelated with every function of $Z$

(b) $\mu(Z)$ and $\epsilon$ are uncorrelated

(c) $\operatorname{Var}(Y) = \operatorname{Var} \mu(Z) + \operatorname{Var} \epsilon$.

Proof. To show (a), let $h(Z)$ be any function of $Z$, then by the iterated expectation theorem,

$$E\{h(Z)\epsilon\} = E\{E[h(Z)\epsilon \mid Z]\} = E\{h(Z) E[Y - \mu(Z) \mid Z]\} = 0$$

because $E[Y - \mu(Z) \mid Z] = \mu(Z) - \mu(Z) = 0$. Properties (b) and (c) follow from (a). □

Note that Proposition 1.4.1(c) is equivalent to (1.4.6) and that (1.4.5) follows from (a) because (a) implies that the cross product term in the expansion of $E\{[Y - \mu(Z)] + [\mu(Z) - g(Z)]\}^2$ vanishes.

As a consequence of (1.4.6), we can derive the following theorem, which will prove of importance in estimation theory.

Theorem 1.4.2. If $E(|Y|) < \infty$ but $Z$ and $Y$ are otherwise arbitrary, then

$$\operatorname{Var}(E(Y \mid Z)) \le \operatorname{Var} Y. \tag{1.4.7}$$

If $\operatorname{Var} Y < \infty$, strict inequality holds unless

$$Y = E(Y \mid Z) \tag{1.4.8}$$

or equivalently unless $Y$ is a function of $Z$.
Proof. The assertion (1.4.7) follows immediately from (1.4.6). Equality in (1.4.7) can hold if, and only if,

$$E(\operatorname{Var}(Y \mid Z)) = E(Y - E(Y \mid Z))^2 = 0.$$

By (A.11.9) this can hold if, and only if, (1.4.8) is true. □

Example 1.4.1. An assembly line operates either at full, half, or quarter capacity. Within any given month the capacity status does not change. Each day there can be 0, 1, 2, or 3 shutdowns due to mechanical failure. The following table gives the frequency function $p(z, y) = P(Z = z, Y = y)$ of the number of shutdowns $Y$ and the capacity state $Z$ of the line for a randomly chosen day. The row sums of the entries $p_Z(z)$ (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state, whereas the column sums $p_Y(y)$ yield the frequency of 0, 1, 2, or 3 failures among all days. We want to predict the number of failures for a given day knowing the state of the assembly line for the month.

    z \ y     0       1       2       3      p_Z(z)
    1/4      0.10    0.05    0.05    0.05    0.25
    1/2      0.025   0.025   0.10    0.10    0.25
    1        0.025   0.025   0.15    0.30    0.50
    p_Y(y)   0.15    0.10    0.30    0.45    1

We find

$$E(Y \mid Z = 1) = \sum_{i=1}^{3} i P[Y = i \mid Z = 1] = 2.45,$$

$$E(Y \mid Z = \tfrac{1}{2}) = 2.10, \qquad E(Y \mid Z = \tfrac{1}{4}) = 1.20.$$

These fractional figures are not too meaningful as predictors of the natural number values of $Y$. But this predictor is also the right one, if we are trying to guess, as we reasonably might, the average number of failures per day in a given month. In this case if $Y_i$ represents the number of failures on day $i$ and $Z$ the state of the assembly line, the best predictor is $E\big(30^{-1} \sum_{i=1}^{30} Y_i \mid Z\big) = E(Y \mid Z)$, also.

The MSPE of the best predictor can be calculated in two ways. The first is direct.

$$E(Y - E(Y \mid Z))^2 = \sum_{z} \sum_{y=0}^{3} (y - E(Y \mid Z = z))^2 p(z, y) = 0.885.$$
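The direct calculation in Example 1.4.1 can be checked numerically. The following is a minimal sketch, not part of the original text: the joint frequencies are the table entries as read here (an assumption, since the table is only as reliable as its transcription), and the helper names `conditional_mean`, `mspe_best`, and so on are hypothetical. It computes $E(Y \mid Z = z)$, the MSPE of the best predictor, and confirms that the decomposition (1.4.6) gives the same value.

```python
# Joint frequency function p(z, y) of Example 1.4.1.
# Keys are the capacity states z; each tuple lists p(z, y) for y = 0, 1, 2, 3.
# These numbers are assumed to match the table in the text.
P = {
    0.25: (0.10, 0.05, 0.05, 0.05),
    0.50: (0.025, 0.025, 0.10, 0.10),
    1.00: (0.025, 0.025, 0.15, 0.30),
}

def marginal_z(p):
    """p_Z(z): row sums of the joint table."""
    return {z: sum(row) for z, row in p.items()}

def conditional_mean(p):
    """mu(z) = E(Y | Z = z) for each capacity state z."""
    pz = marginal_z(p)
    return {z: sum(y * q for y, q in enumerate(row)) / pz[z]
            for z, row in p.items()}

def mspe_best(p):
    """E(Y - E(Y | Z))^2, summed directly over the joint table."""
    mu = conditional_mean(p)
    return sum((y - mu[z]) ** 2 * q
               for z, row in p.items() for y, q in enumerate(row))

def variance_y(p):
    """Var Y from the joint table."""
    ey = sum(y * q for row in p.values() for y, q in enumerate(row))
    ey2 = sum(y * y * q for row in p.values() for y, q in enumerate(row))
    return ey2 - ey ** 2

def variance_of_conditional_mean(p):
    """Var(E(Y | Z))."""
    pz, mu = marginal_z(p), conditional_mean(p)
    mean_mu = sum(mu[z] * pz[z] for z in p)
    return sum((mu[z] - mean_mu) ** 2 * pz[z] for z in p)

mu = conditional_mean(P)
direct = mspe_best(P)
# By (1.4.6), E Var(Y | Z) = Var Y - Var E(Y | Z).
via_decomposition = variance_y(P) - variance_of_conditional_mean(P)
print(mu, direct, via_decomposition)
```

Both routes agree to floating-point accuracy, which is a useful sanity check on the table entries themselves.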
The second way is to use (1.4.6) writing,

$$E(Y - E(Y \mid Z))^2 = \operatorname{Var} Y - \operatorname{Var}(E(Y \mid Z)) = E(Y^2) - E[(E(Y \mid Z))^2] = \sum_y y^2 p_Y(y) - \sum_z [E(Y \mid Z = z)]^2 p_Z(z) = 0.885$$

as before. □

Example 1.4.2. The Bivariate Normal Distribution. Regression toward the mean. If $(Z, Y)$ has a $N(\mu_Z, \mu_Y, \sigma_Z^2, \sigma_Y^2, \rho)$ distribution, Theorem B.4.2 tells us that the conditional distribution of $Y$ given $Z = z$ is $N\big(\mu_Y + \rho(\sigma_Y/\sigma_Z)(z - \mu_Z),\ \sigma_Y^2(1 - \rho^2)\big)$. Therefore, the best predictor of $Y$ using $Z$ is the linear function

$$\mu_0(Z) = \mu_Y + \rho(\sigma_Y/\sigma_Z)(Z - \mu_Z).$$

Because

$$E((Y - E(Y \mid Z = z))^2 \mid Z = z) = \sigma_Y^2(1 - \rho^2) \tag{1.4.9}$$

is independent of $z$, the MSPE of our predictor is given by

$$E(Y - E(Y \mid Z))^2 = \sigma_Y^2(1 - \rho^2). \tag{1.4.10}$$

The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution. If $\rho > 0$, the predictor is a monotone increasing function of $Z$ indicating that large (small) values of $Y$ tend to be associated with large (small) values of $Z$. Similarly, $\rho < 0$ indicates that large values of $Z$ tend to go with small values of $Y$ and we have negative dependence. If $\rho = 0$, the best predictor is just the constant $\mu_Y$ as we would expect in the case of independence. One minus the ratio of the MSPE of the best predictor of $Y$ given $Z$ to $\operatorname{Var} Y$, which is the MSPE of the best constant predictor, can reasonably be thought of as a measure of dependence. The larger this quantity the more dependent $Z$ and $Y$ are.

In the bivariate normal case, this quantity is just $\rho^2$. Thus, for this family of distributions the sign of the correlation coefficient gives the type of dependence between $Z$ and $Y$, whereas its magnitude measures the degree of such dependence.

Because of (1.4.6) we can also write

$$\rho^2 = \frac{\operatorname{Var} \mu_0(Z)}{\operatorname{Var} Y}. \tag{1.4.11}$$

The line $y = \mu_Y + \rho(\sigma_Y/\sigma_Z)(z - \mu_Z)$, which corresponds to the best predictor of $Y$ given $Z$ in the bivariate normal model, is usually called the regression (line) of $Y$ on $Z$.

The term regression was coined by Francis Galton and is based on the following observation. Suppose $Y$ and $Z$ are bivariate normal random variables with the same mean
$\mu$, variance $\sigma^2$, and positive correlation $\rho$. In Galton's case, these were the heights of a randomly selected father ($Z$) and his son ($Y$) from a large human population. Then the predicted height of the son, or the average height of sons whose fathers are the height $Z$, $(1 - \rho)\mu + \rho Z$, is closer to the population mean of heights $\mu$ than is the height of the father. Thus, tall fathers tend to have shorter sons; there is "regression toward the mean." This is compensated for by "progression" toward the mean among the sons of shorter fathers and there is no paradox. The variability of the predicted value about $\mu$ should, consequently, be less than that of the actual heights and indeed $\operatorname{Var}((1 - \rho)\mu + \rho Z) = \rho^2 \sigma^2$. Note that in practice, in particular in Galton's studies, the distribution of $(Z, Y)$ is unavailable and the regression line is estimated on the basis of a sample $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ from the population. We shall see how to do this in Chapter 2. □

Example 1.4.3. The Multivariate Normal Distribution. Let $Z = (Z_1, \ldots, Z_d)^T$ be a $d \times 1$ covariate vector with mean $\mu_Z = (\mu_1, \ldots, \mu_d)^T$ and suppose that $(Z^T, Y)^T$ has a $(d + 1)$ multivariate normal, $N_{d+1}(\mu, \Sigma)$, distribution (Section B.6) in which $\mu = (\mu_Z^T, \mu_Y)^T$, $\mu_Y = E(Y)$,

$$\Sigma = \begin{pmatrix} \Sigma_{ZZ} & \Sigma_{ZY} \\ \Sigma_{YZ} & \sigma_{YY} \end{pmatrix},$$

$\Sigma_{ZZ}$ is the $d \times d$ variance-covariance matrix $\operatorname{Var}(Z)$,

$$\Sigma_{ZY} = (\operatorname{Cov}(Z_1, Y), \ldots, \operatorname{Cov}(Z_d, Y))^T = \Sigma_{YZ}^T$$

and $\sigma_{YY} = \operatorname{Var}(Y)$. Theorem B.6.5 states that the conditional distribution of $Y$ given $Z = z$ is $N(\mu_Y + (z - \mu_Z)^T \beta,\ \sigma_{YY|Z})$ where $\beta = \Sigma_{ZZ}^{-1} \Sigma_{ZY}$ and $\sigma_{YY|Z} = \sigma_{YY} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZY}$. Thus, the best predictor $E(Y \mid Z)$ of $Y$ is the linear function

$$\mu_0(Z) = \mu_Y + (Z - \mu_Z)^T \beta \tag{1.4.12}$$

with MSPE

$$E[Y - \mu_0(Z)]^2 = E\{E[(Y - \mu_0(Z))^2 \mid Z]\} = E(\sigma_{YY|Z}) = \sigma_{YY} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZY}.$$

The quadratic form $\Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZY}$ is positive except when the joint normal distribution is degenerate, so the MSPE of $\mu_0(Z)$ is smaller than the MSPE of the constant predictor $\mu_Y$. One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with $Y$. This quantity is called the multiple correlation coefficient (MCC), coefficient of determination or population R-squared. We write

$$\text{MCC} = \rho_{ZY}^2 = 1 - \frac{E[Y - \mu_0(Z)]^2}{\operatorname{Var} Y} = \frac{\operatorname{Var} \mu_0(Z)}{\operatorname{Var} Y}$$

where the last identity follows from (1.4.6). By (1.4.11), the MCC equals the square of the usual correlation coefficient $\rho = \sigma_{ZY} / \sigma_{YY}^{1/2} \sigma_{ZZ}^{1/2}$ when $d = 1$.
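For $d = 2$ the MCC can be evaluated by hand from the entries of $\Sigma$, since $\Sigma_{ZZ}^{-1}$ has a closed form. The sketch below is illustrative only: the function names are hypothetical, and the covariance entries are assumed values for the girl's-height illustration in this section (chosen because they reproduce the reported squared correlations to about three decimals); they should not be taken as authoritative data.

```python
def squared_correlation(cov_zy, var_z, var_y):
    """rho^2 for a single covariate: Cov(Z, Y)^2 / (Var Z * Var Y)."""
    return cov_zy ** 2 / (var_z * var_y)

def mcc_two_covariates(s11, s12, s22, c1, c2, var_y):
    """MCC = Sigma_YZ Sigma_ZZ^{-1} Sigma_ZY / Var(Y) when d = 2.

    s11, s12, s22 are the entries of Sigma_ZZ; c1, c2 those of Sigma_ZY.
    The 2 x 2 inverse is written out explicitly.
    """
    det = s11 * s22 - s12 ** 2
    quad_form = (s22 * c1 ** 2 - 2 * s12 * c1 * c2 + s11 * c2 ** 2) / det
    return quad_form / var_y

# Assumed inputs (heights in inches; Z1 = mother, Z2 = father, Y = daughter).
VAR_Y = 6.39
S11, S12, S22 = 7.74, 2.92, 6.67   # assumed entries of Sigma_ZZ
C1, C2 = 4.07, 2.98                # assumed entries of Sigma_ZY

rho2_mother = squared_correlation(C1, S11, VAR_Y)
rho2_father = squared_correlation(C2, S22, VAR_Y)
rho2_both = mcc_two_covariates(S11, S12, S22, C1, C2, VAR_Y)
print(rho2_mother, rho2_father, rho2_both)
```

Because the MCC is one minus a ratio of MSPEs, each value reads directly as the fractional reduction in mean squared prediction error relative to the constant predictor $\mu_Y$.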
For example, let $Y$ and $Z = (Z_1, Z_2)^T$ be the heights in inches of a 10-year-old girl and her parents ($Z_1$ = mother's height, $Z_2$ = father's height). Suppose(1) that $(Z^T, Y)^T$ is trivariate normal with $\operatorname{Var}(Y) = 6.39$,

$$\Sigma_{ZZ} = \begin{pmatrix} 7.74 & 2.92 \\ 2.92 & 6.67 \end{pmatrix}, \qquad \Sigma_{ZY} = (4.07, 2.98)^T.$$

Then the strength of association between a girl's height and those of her mother and father, respectively, and both parents, are

$$\rho_{Z_1, Y}^2 = .335, \qquad \rho_{Z_2, Y}^2 = .209, \qquad \rho_{ZY}^2 = .393.$$

In words, knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33.5%. The percentage reductions knowing the father's and both parents' heights are 20.9% and 39.3%, respectively. In practice, when the distribution of $(Z^T, Y)^T$ is unknown, the linear predictor $\mu_0(Z)$ and its MSPE will be estimated using a sample $(Z_1^T, Y_1)^T, \ldots, (Z_n^T, Y_n)^T$. See Sections 2.1 and 2.2. □

The best linear predictor. The problem of finding the best MSPE predictor is solved by Theorem 1.4.1. Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of $Z$ and $Y$ in order to calculate $E(Y \mid Z)$ and that the best predictor may be a complicated function of $Z$. If we are willing to sacrifice absolute excellence, we can avoid both objections by looking for a predictor that is best within a class of simple predictors. The natural class to begin with is that of linear combinations of components of $Z$. We first do the one-dimensional case.

Let us call any random variable of the form $a + bZ$ a linear predictor and any such variable with $a = 0$ a zero intercept linear predictor. What is the best (zero intercept) linear predictor of $Y$ in the sense of minimizing MSPE? The answer is given by:

Theorem 1.4.3. Suppose that $E(Z^2)$ and $E(Y^2)$ are finite and $Z$ and $Y$ are not constant. Then the unique best zero intercept linear predictor is obtained by taking

$$b = b_0 = \frac{E(ZY)}{E(Z^2)},$$

whereas the unique best linear predictor is $\mu_L(Z) = a_1 + b_1 Z$ where

$$b_1 = \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Var} Z}, \qquad a_1 = E(Y) - b_1 E(Z).$$

Proof. We expand $\{Y - bZ\}^2 = \{Y - [Z(b - b_0) + Z b_0]\}^2$ to get

$$E(Y - bZ)^2 = E(Y^2) + E(Z^2)(b - b_0)^2 - E(Z^2) b_0^2.$$

Therefore, $E(Y - bZ)^2$ is uniquely minimized by $b = b_0$, and

$$E(Y - b_0 Z)^2 = E(Y^2) - \frac{[E(ZY)]^2}{E(Z^2)}. \tag{1.4.13}$$
To prove the second assertion of the theorem note that by (1.4.1),

$$E(Y - a - bZ)^2 = \operatorname{Var}(Y - bZ) + (E(Y) - bE(Z) - a)^2.$$

Therefore, whatever be $b$, $E(Y - a - bZ)^2$ is uniquely minimized by taking

$$a = E(Y) - bE(Z).$$

Substituting this value of $a$ in $E(Y - a - bZ)^2$ we see that the $b$ we seek minimizes $E[(Y - E(Y)) - b(Z - E(Z))]^2$. We can now apply the result on zero intercept linear predictors to the variables $Z - E(Z)$ and $Y - E(Y)$ to conclude that $b_1$ is the unique minimizing value. □

Remark 1.4.1. From (1.4.13) we obtain the proof of the Cauchy-Schwarz inequality (A.11.17) in the appendix. This is because $E(Y - b_0 Z)^2 \ge 0$ is equivalent to the Cauchy-Schwarz inequality with equality holding if, and only if, $E(Y - b_0 Z)^2 = 0$, which corresponds to $Y = b_0 Z$. We could similarly obtain (A.11.16) directly by calculating $E(Y - a_1 - b_1 Z)^2$. □

Note that if $E(Y \mid Z)$ is of the form $a + bZ$, then $a = a_1$ and $b = b_1$, because, by (1.4.5), if the best predictor is linear, it must coincide with the best linear predictor. This is in accordance with our evaluation of $E(Y \mid Z)$ in Example 1.4.2. In that example nothing is lost by using linear prediction. On the other hand, in Example 1.4.1 the best linear predictor and best predictor differ (see Figure 1.4.1). A loss of about 5% is incurred by using the best linear predictor. That is,

$$\frac{E[Y - \mu_L(Z)]^2}{E[Y - \mu(Z)]^2} = 1.05.$$

Best Multivariate Linear Predictor. Our linear predictor is of the form

$$\mu_l(Z) = a + \sum_{j=1}^d b_j Z_j = a + Z^T b$$

where $Z = (Z_1, \ldots, Z_d)^T$ and $b = (b_1, \ldots, b_d)^T$. Let

$$\beta = \big(E([Z - E(Z)][Z - E(Z)]^T)\big)^{-1} E([Z - E(Z)][Y - E(Y)]) = \Sigma_{ZZ}^{-1} \Sigma_{ZY}.$$

Theorem 1.4.4. If $EY^2$ and $\big(E([Z - E(Z)][Z - E(Z)]^T)\big)^{-1}$ exist, then the unique best linear MSPE predictor is

$$\mu_L(Z) = \mu_Y + (Z - \mu_Z)^T \beta. \tag{1.4.14}$$

Proof. Note that $R(a, b) \equiv E_P[Y - \mu_l(Z)]^2$ depends on the joint distribution $P$ of $X = (Z^T, Y)^T$ only through the expectation $\mu$ and covariance $\Sigma$ of $X$. Let $P_0$ denote
the multivariate normal, $N(\mu, \Sigma)$, distribution and let $R_0(a, b) = E_{P_0}[Y - \mu_l(Z)]^2$. By Example 1.4.3, $R_0(a, b)$ is minimized by (1.4.14). Because $P$ and $P_0$ have the same $\mu$ and $\Sigma$, $R(a, b) = R_0(a, b)$, and $R(a, b)$ is minimized by (1.4.14). □

[Figure 1.4.1. The three dots give the best predictor, plotted against $z$ (horizontal axis) with $y$ on the vertical axis. The line represents the best linear predictor $y = 1.05 + 1.45z$.]

Remark 1.4.2. We could also have established Theorem 1.4.4 by extending the proof of Theorem 1.4.3 to $d > 1$. However, our new proof shows how second-moment results sometimes can be established by "connecting" them to the normal distribution. A third approach using calculus is given in Problem 1.4.19. □

Remark 1.4.3. In the general, not necessarily normal, case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the correlation between $Y$ and the best linear predictor of $Y$; that is,

$$\rho_{ZY}^2 = \operatorname{Corr}^2(Y, \mu_L(Z)).$$

Thus, the MCC gives the strength of the linear relationship between $Z$ and $Y$. See Problem 1.4.17 for an overall measure of the strength of this relationship. □

Remark 1.4.4. Suppose the model for $\mu(Z)$ is linear; that is,

$$\mu(Z) = E(Y \mid Z) = \alpha + Z^T \beta$$

for unknown $\alpha \in R$ and $\beta \in R^d$. We want to express $\alpha$ and $\beta$ in terms of moments of $(Z, Y)$. Set $Z_0 = 1$. By Proposition 1.4.1(a), $\epsilon = Y - \mu(Z)$ and each of $Z_0, \ldots, Z_d$ are uncorrelated; thus,

$$E(Z_j[Y - (\alpha + Z^T \beta)]) = 0, \qquad j = 0, \ldots, d. \tag{1.4.15}$$
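The moment conditions (1.4.15) are a small linear system and can be solved numerically. The sketch below is an illustration under stated assumptions rather than anything from the original text: it uses $d = 1$, takes the joint frequencies of Example 1.4.1 as read from the table there (an assumption), and introduces a hypothetical helper `moment`. In that example $E(Y \mid Z)$ is not linear, so solving the same two equations yields the coefficients of the best linear predictor instead, which can be compared with the line quoted in the caption of Figure 1.4.1.

```python
# Joint frequency function of Example 1.4.1 (assumed table entries):
# keys are z, tuples give p(z, y) for y = 0, 1, 2, 3.
P = {
    0.25: (0.10, 0.05, 0.05, 0.05),
    0.50: (0.025, 0.025, 0.10, 0.10),
    1.00: (0.025, 0.025, 0.15, 0.30),
}

def moment(p, f):
    """E f(Z, Y) under the joint frequency function p."""
    return sum(f(z, y) * q for z, row in p.items() for y, q in enumerate(row))

ez = moment(P, lambda z, y: z)
ey = moment(P, lambda z, y: y)
ezz = moment(P, lambda z, y: z * z)
ezy = moment(P, lambda z, y: z * y)

# With d = 1 and Z_0 = 1 the two moment equations read
#   alpha      + beta * E(Z)   = E(Y)
#   alpha*E(Z) + beta * E(Z^2) = E(ZY)
# whose solution is the familiar slope Cov(Z, Y) / Var(Z).
var_z = ezz - ez ** 2
beta = (ezy - ez * ey) / var_z
alpha = ey - beta * ez

# MSPE of the resulting linear predictor, and of the best predictor E(Y | Z).
mspe_linear = moment(P, lambda z, y: (y - alpha - beta * z) ** 2)
mu = {z: sum(y * q for y, q in enumerate(row)) / sum(row) for z, row in P.items()}
mspe_best = moment(P, lambda z, y: (y - mu[z]) ** 2)
ratio = mspe_linear / mspe_best

# Residual of the second moment equation; should vanish up to rounding.
orthogonality = moment(P, lambda z, y: z * (y - alpha - beta * z))
print(alpha, beta, ratio)
```

The ratio of the two MSPEs comes out a few percent above one, in line with the remark that only a small loss is incurred by restricting to linear predictors in this example.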
Solving (1.4.15) for $\alpha$ and $\beta$ gives (1.4.14) (Problem 1.4.23). Because the multivariate normal model is a linear model, this gives a new derivation of (1.4.12). □

Remark 1.4.5. Consider the Bayesian model of Section 1.2 and the Bayes risk (1.3.8) defined by $r(\delta) = E[l(\theta, \delta(X))]$. If we identify $\theta$ with $Y$ and $X$ with $Z$, we see that $r(\delta) = \text{MSPE}$ for squared error loss $l(\theta, \delta) = (\theta - \delta)^2$. Thus, the optimal MSPE predictor $E(\theta \mid X)$ is the Bayes procedure for squared error loss. We return to this in Section 3.2. □

Remark 1.4.6. When the class $\mathcal{G}$ of possible predictors $g$ with $E|g(Z)|^2 < \infty$ form a Hilbert space as defined in Section B.10 and there is a $g_0 \in \mathcal{G}$ such that

$$g_0 = \arg\inf\{\Delta(Y, g(Z)) : g \in \mathcal{G}\},$$

then $g_0(Z)$ is called the projection of $Y$ on the space $\mathcal{G}$ of functions of $Z$ and we write $g_0(Z) = \pi(Y \mid \mathcal{G})$. Moreover, $g(Z)$ and $h(Z)$ are said to be orthogonal if at least one has expected value zero and $E[g(Z)h(Z)] = 0$. With these concepts the results of this section are linked to the general Hilbert space results of Section B.10. Using the distance $\Delta$ and projection $\pi$ notation, we can conclude that

$$\mu(Z) = \pi(Y \mid \mathcal{G}_{NP}), \qquad \mu_L(Z) = \pi(Y \mid \mathcal{G}_L) = \pi(\mu(Z) \mid \mathcal{G}_L) \tag{1.4.16}$$

$$\Delta^2(Y, \mu_L(Z)) = \Delta^2(\mu_L(Z), \mu(Z)) + \Delta^2(Y, \mu(Z)) \tag{1.4.17}$$

$Y - \mu(Z)$ is orthogonal to $\mu(Z)$ and to $\mu_L(Z)$.

Note that (1.4.17) is the Pythagorean identity. □

Summary. We consider situations in which the goal is to predict the (perhaps in the future) value of a random variable $Y$. The notion of mean squared prediction error (MSPE) is introduced, and it is shown that if we want to predict $Y$ on the basis of information contained in a random vector $Z$, the optimal MSPE predictor is the conditional expected value of $Y$ given $Z$. The optimal MSPE predictor in the multivariate normal distribution is presented. It is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear.

1.5 SUFFICIENCY

Once we have postulated a statistical model, we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation.

We begin by formalizing what we mean by "a reduction of the data" $X \in \mathcal{X}$. Recall that a statistic is any function of the observations generically denoted by $T(X)$ or $T$. The range of $T$ is any space of objects $\mathcal{T}$, usually $R$ or $R^k$, but as we have seen in Section 1.1.3, can also be a set of functions. If $T$ assigns the same value to different sample points, then by recording or taking into account only the value of $T(X)$ we have a reduction of the data. Thus, $T(X) = \bar{X}$ loses information about the $X_i$ as soon as $n > 1$.
Even $T(X_1, \ldots, X_n) = (X_{(1)}, \ldots, X_{(n)})$ loses information about the labels of the $X_i$. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information, in the context of a model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$.

For instance, suppose that in Example 1.1.1 we had sampled the manufactured items in order, recording at each stage whether the examined item was defective or not. We could then represent the data by a vector $X = (X_1, \ldots, X_n)$ where $X_i = 1$ if the $i$th item sampled is defective and $X_i = 0$ otherwise. The total number of defective items observed, $T = \sum_{i=1}^n X_i$, is a statistic that maps many different values of $(X_1, \ldots, X_n)$ into the same number. However, it is intuitively clear that if we are interested in the proportion $\theta$ of defective items nothing is lost in this situation by recording and using only $T$.

One way of making the notion "a statistic whose use involves no loss of information" precise is the following. A statistic $T(X)$ is called sufficient for $P \in \mathcal{P}$ or the parameter $\theta$ if the conditional distribution of $X$ given $T(X) = t$ does not involve $\theta$. Thus, once the value of a sufficient statistic $T$ is known, the sample $X = (X_1, \ldots, X_n)$ does not contain any further information about $\theta$ or equivalently $P$, given that $\mathcal{P}$ is valid. We give a decision theory interpretation that follows. The most trivial example of a sufficient statistic is $T(X) = X$ because by any interpretation the conditional distribution of $X$ given $T(X) = X$ is point mass at $X$.

Example 1.5.1. A machine produces $n$ items in succession. Each item produced is good with probability $\theta$ and defective with probability $1 - \theta$, where $\theta$ is unknown. Suppose there is no dependence between the quality of the items produced and let $X_i = 1$ if the $i$th item is good and 0 otherwise. Then $X = (X_1, \ldots, X_n)$ is the record of $n$ Bernoulli trials with probability $\theta$. By (A.9.5),

$$P[X_1 = x_1, \ldots, X_n = x_n] = \theta^t (1 - \theta)^{n - t} \tag{1.5.1}$$

where $x_i$ is 0 or 1 and $t = \sum_{i=1}^n x_i$. By Example B.1.1, the conditional distribution of $X$ given $T = \sum_{i=1}^n X_i = t$ does not involve $\theta$. Thus, $T$ is a sufficient statistic for $\theta$. □

Example 1.5.2. Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) $\theta$. Let $X_1$ be the time of arrival of the first customer, $X_2$ the time between the arrival of the first and second customers. By (A.16.4), $X_1$ and $X_2$ are independent and identically distributed exponential random variables with parameter $\theta$. We prove that $T = X_1 + X_2$ is sufficient for $\theta$. Begin by noting that according to Theorem B.2.3, whatever be $\theta$, $X_1/(X_1 + X_2)$ and $X_1 + X_2$ are independent and the first of these statistics has a uniform distribution on $(0, 1)$. Therefore, the conditional distribution of $X_1/(X_1 + X_2)$ given $X_1 + X_2 = t$ is $U(0, 1)$ whatever be $t$. Using our discussion in Section B.1.1 we see that given $X_1 + X_2 = t$, the conditional distribution of $X_1 = [X_1/(X_1 + X_2)](X_1 + X_2)$ and that of $X_1 t/(X_1 + X_2)$ are the same and we can conclude that given $X_1 + X_2 = t$, $X_1$ has a $U(0, t)$ distribution. It follows that, when $X_1 + X_2 = t$, whatever be $\theta$, $(X_1, X_2)$ is conditionally distributed as $(X, Y)$ where $X$ is uniform on $(0, t)$ and $Y = t - X$. Thus, $X_1 + X_2$ is sufficient. □

In both of the foregoing examples considerable reduction has been achieved. Instead of keeping track of several numbers, we need only record one. Although the sufficient statistics we have obtained are "natural," it is important to notice that there are many others
Although the sufficient statistics we have obtained are "natural," it is important to notice that there are many others that will do the same job. Being told that the number of successes in five trials is three is the same as knowing that the difference between the number of successes and the number of failures is one. More generally, if $T_1$ and $T_2$ are any two statistics such that $T_1(x) = T_1(y)$ if and only if $T_2(x) = T_2(y)$, then $T_1$ and $T_2$ provide the same information and achieve the same reduction of the data. Such statistics are called equivalent.

In general, checking sufficiency directly is difficult because we need to compute the conditional distribution. Fortunately, a simple necessary and sufficient criterion for a statistic to be sufficient is available. This result was proved in various forms by Fisher, Neyman, and Halmos and Savage. It is often referred to as the factorization theorem for sufficient statistics.

Theorem 1.5.1. In a regular model, a statistic $T(X)$ with range $\mathcal{T}$ is sufficient for $\theta$ if, and only if, there exists a function $g(t, \theta)$ defined for $t$ in $\mathcal{T}$ and $\theta$ in $\Theta$ and a function $h$ defined on $\mathcal{X}$ such that

$$p(x, \theta) = g(T(x), \theta)\,h(x) \tag{1.5.2}$$

for all $x \in \mathcal{X}$, $\theta \in \Theta$.

We shall give the proof in the discrete case. The complete result is established for instance by Lehmann (1997, Section 2.6).

Proof. Let $(x_1, x_2, \ldots)$ be the set of possible realizations of $X$ and let $t_i = T(x_i)$. Then $T$ is discrete and $\sum_{i=1}^{\infty} P_\theta[T = t_i] = 1$ for every $\theta$. To prove the sufficiency of (1.5.2), we need only show that $P_\theta[X = x_j \mid T = t_i]$ is independent of $\theta$ for every $i$ and $j$. By our definition of conditional probability in the discrete case, it is enough to show that $P_\theta[X = x_j \mid T = t_i]$ is independent of $\theta$ on each of the sets $S_i = \{\theta : P_\theta[T = t_i] > 0\}$, $i = 1, 2, \ldots$. Now, if (1.5.2) holds,

$$P_\theta[T = t_i] = \sum_{\{x : T(x) = t_i\}} p(x, \theta) = g(t_i, \theta) \sum_{\{x : T(x) = t_i\}} h(x). \tag{1.5.3}$$

By (B.1.1) and (1.5.2), for $\theta \in S_i$,

$$P_\theta[X = x_j \mid T = t_i] = \frac{P_\theta[X = x_j,\, T = t_i]}{P_\theta[T = t_i]} = \begin{cases} \dfrac{p(x_j, \theta)}{P_\theta[T = t_i]} = \dfrac{g(t_i, \theta)\,h(x_j)}{P_\theta[T = t_i]} & \text{if } T(x_j) = t_i \\[2ex] 0 & \text{if } T(x_j) \neq t_i. \end{cases} \tag{1.5.4}$$

Applying (1.5.3) we arrive at,

$$P_\theta[X = x_j \mid T = t_i] = \begin{cases} 0 & \text{if } T(x_j) \neq t_i \\[1ex] \dfrac{h(x_j)}{\sum_{\{x_k : T(x_k) = t_i\}} h(x_k)} & \text{if } T(x_j) = t_i. \end{cases} \tag{1.5.5}$$
Therefore, $T$ is sufficient. Conversely, if $T$ is sufficient, let

$$g(t_i, \theta) = P_\theta[T = t_i], \quad h(x) = P[X = x \mid T(X) = t_i]. \tag{1.5.6}$$

Then

$$p(x, \theta) = P_\theta[X = x,\, T = T(x)] = g(T(x), \theta)\,h(x) \tag{1.5.7}$$

by (B.1.3). □

Example 1.5.2 (continued). If $X_1, \ldots, X_n$ are the interarrival times for $n$ customers, then the joint density of $(X_1, \ldots, X_n)$ is given by (see (A.16.4)),

$$p(x_1, \ldots, x_n, \theta) = \theta^n \exp\Big[-\theta \sum_{i=1}^n x_i\Big] \tag{1.5.8}$$

if all the $x_i$ are $> 0$, and $p(x_1, \ldots, x_n, \theta) = 0$ otherwise. We may apply Theorem 1.5.1 to conclude that $T(X_1, \ldots, X_n) = \sum_{i=1}^n X_i$ is sufficient. Take $g(t, \theta) = \theta^n e^{-\theta t}$ if $t > 0$, $\theta > 0$, and $h(x_1, \ldots, x_n) = 1$ if all the $x_i$ are $> 0$, and both functions $= 0$ otherwise. A whole class of distributions, which admits simple sufficient statistics and to which this example belongs, is introduced in the next section. □

Example 1.5.3. Estimating the Size of a Population. Consider a population with $\theta$ members labeled consecutively from 1 to $\theta$. The population is sampled with replacement and $n$ members of the population are observed and their labels $X_1, \ldots, X_n$ are recorded. Common sense indicates that to get information about $\theta$, we need only keep track of $X_{(n)} = \max(X_1, \ldots, X_n)$. In fact, we can show that $X_{(n)}$ is sufficient. The probability distribution of $X$ is given by

$$p(x_1, \ldots, x_n, \theta) = \theta^{-n} \tag{1.5.9}$$

if every $x_i$ is an integer between 1 and $\theta$ and $p(x_1, \ldots, x_n, \theta) = 0$ otherwise. Expression (1.5.9) can be rewritten as

$$p(x_1, \ldots, x_n, \theta) = \theta^{-n}\, 1\{x_{(n)} \le \theta\}, \tag{1.5.10}$$

where $x_{(n)} = \max(x_1, \ldots, x_n)$. By Theorem 1.5.1, $X_{(n)}$ is a sufficient statistic for $\theta$. □

Example 1.5.4. Let $X_1, \ldots, X_n$ be independent and identically distributed random variables each having a normal distribution with mean $\mu$ and variance $\sigma^2$, both of which are unknown. Let $\theta = (\mu, \sigma^2)$. Then the density of $(X_1, \ldots, X_n)$ is given by

$$p(x_1, \ldots, x_n, \theta) = [2\pi\sigma^2]^{-n/2} \exp\Big\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\Big\} = [2\pi\sigma^2]^{-n/2} \Big[\exp\Big\{-\frac{n\mu^2}{2\sigma^2}\Big\}\Big] \Big[\exp\Big\{-\frac{1}{2\sigma^2}\Big(\sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i\Big)\Big\}\Big]. \tag{1.5.11}$$
Evidently $p(x_1, \ldots, x_n, \theta)$ is itself a function of $(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2)$ and $\theta$ only and upon applying Theorem 1.5.1 we can conclude that

$$T(X_1, \ldots, X_n) = \Big(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\Big)$$

is sufficient for $\theta$. An equivalent sufficient statistic in this situation that is frequently used is

$$S(X_1, \ldots, X_n) = \Big[(1/n) \sum_{i=1}^n X_i,\; [1/(n-1)] \sum_{i=1}^n (X_i - \bar{X})^2\Big],$$

where $\bar{X} = (1/n) \sum_{i=1}^n X_i$. The first and second components of this vector are called the sample mean and the sample variance, respectively. □

Example 1.5.5. Suppose, as in Example 1.1.4 with $d = 2$, that $Y_1, \ldots, Y_n$ are independent, $Y_i \sim N(\mu_i, \sigma^2)$, with $\mu_i$ following the linear regression model

$$\mu_i = \beta_1 + \beta_2 z_i, \quad i = 1, \ldots, n,$$

where we assume that the given constants $\{z_i\}$ are not all identical. Then $\theta = (\beta_1, \beta_2, \sigma^2)^T$ is identifiable (Problem 1.1.9) and

$$p(y, \theta) = (2\pi\sigma^2)^{-n/2} \exp\Big\{\frac{-\sum (\beta_1 + \beta_2 z_i)^2}{2\sigma^2}\Big\} \exp\Big\{\frac{-\sum Y_i^2 + 2\beta_1 \sum Y_i + 2\beta_2 \sum z_i Y_i}{2\sigma^2}\Big\}.$$

Thus, $T = (\sum Y_i, \sum Y_i^2, \sum z_i Y_i)$ is sufficient for $\theta$. □

Sufficiency and decision theory

Sufficiency can be given a clear operational interpretation in the decision theoretic setting. Specifically, if $T(X)$ is sufficient, we can, for any decision procedure $\delta(x)$, find a randomized decision rule $\delta^*(T(X))$ depending only on $T(X)$ that does as well as $\delta(X)$ in the sense of having the same risk function; that is,

$$R(\theta, \delta) = R(\theta, \delta^*) \text{ for all } \theta. \tag{1.5.12}$$

By randomized we mean that $\delta^*(T(X))$ can be generated from the value $t$ of $T(X)$ and a random mechanism not depending on $\theta$. Here is an example.

Example 1.5.6. Suppose $X_1, \ldots, X_n$ are independent identically $N(\theta, 1)$ distributed. Then

$$p(x, \theta) = \exp\{n\theta(\bar{x} - \tfrac{1}{2}\theta)\}\,(2\pi)^{-n/2} \exp\Big\{-\tfrac{1}{2} \sum x_i^2\Big\}.$$

By the factorization theorem, $\bar{X}$ is sufficient. Let $\delta(X) = X_1$. Using only $\bar{X}$, we construct a rule $\delta^*(X)$ with the same risk = mean squared error as $\delta(X)$ as follows: Conditionally,
given $\bar{X} = t$, choose $T^* = \delta^*(X)$ from the normal $N(t, \frac{n-1}{n})$ distribution. Using Section B.1 and (1.4.6), we find

$$E(T^*) = E[E(T^* \mid \bar{X})] = E(\bar{X}) = \theta = E(X_1)$$

$$\mathrm{Var}(T^*) = E[\mathrm{Var}(T^* \mid \bar{X})] + \mathrm{Var}[E(T^* \mid \bar{X})] = \frac{n-1}{n} + \frac{1}{n} = 1 = \mathrm{Var}(X_1).$$

Thus, $\delta^*(X)$ and $\delta(X)$ have the same mean squared error. □

The proof of (1.5.12) follows along the lines of the preceding example: Given $T(X) = t$, the distribution of $\delta(X)$ does not depend on $\theta$. Now draw $\delta^*$ randomly from this conditional distribution. This $\delta^*(T(X))$ will have the same risk as $\delta(X)$ because, by the double expectation theorem, with $l$ denoting the loss function,

$$R(\theta, \delta^*) = E\{E[l(\theta, \delta^*(T)) \mid T]\} = E\{E[l(\theta, \delta(X)) \mid T]\} = R(\theta, \delta).$$

Sufficiency and Bayes models

There is a natural notion of sufficiency of a statistic $T$ in the Bayesian context where in addition to the model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ we postulate a prior distribution $\Pi$ for $\Theta$. In Example 1.2.1 (Bernoulli trials) we saw that the posterior distribution given $X = x$ is the same as the posterior distribution given $T(X) = \sum_{i=1}^n X_i = k$, where $k = \sum_{i=1}^n x_i$. In this situation we call $T$ Bayes sufficient.

Definition. $T(X)$ is Bayes sufficient for $\Pi$ if the posterior distribution of $\theta$ given $X = x$ is the same as the posterior (conditional) distribution of $\theta$ given $T(X) = T(x)$ for all $x$.

Equivalently, $\theta$ and $X$ are independent given $T(X)$.

Theorem 1.5.2. (Kolmogorov). If $T(X)$ is sufficient for $\theta$, it is Bayes sufficient for every $\Pi$.

This result and a partial converse is the subject of Problem 1.5.14.

Minimal sufficiency

For any model there are many sufficient statistics: Thus, if $X_1, \ldots, X_n$ is a $N(\mu, \sigma^2)$ sample, $n \ge 2$, then $T(X) = (\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2)$ and $S(X) = (X_1, \ldots, X_n)$ are both sufficient. But $T(X)$ provides a greater reduction of the data. We define the statistic $T(X)$ to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic $S(X)$, in that, we can find a transformation $r$ such that $T(X) = r(S(X))$.

Example 1.5.1 (continued). In Example 1.5.1, $T = \sum_{i=1}^n X_i$ was shown to be sufficient. Let $S(X)$ be any other sufficient statistic. Then by the factorization theorem we can write $p(x, \theta)$ as

$$p(x, \theta) = g(S(x), \theta)\,h(x).$$
Combining this with (1.5.1), we find

$$\theta^T (1 - \theta)^{n-T} = g(S(x), \theta)\,h(x) \text{ for all } \theta.$$

For any two fixed $\theta_1$ and $\theta_2$, the ratio of both sides of the foregoing gives

$$(\theta_1/\theta_2)^T [(1 - \theta_1)/(1 - \theta_2)]^{n-T} = g(S(x), \theta_1)/g(S(x), \theta_2).$$

In particular, if we set $\theta_1 = 2/3$ and $\theta_2 = 1/3$, take the log of both sides of this equation and solve for $T$, we find

$$T = r(S(x)) = \{\log[2^n g(S(x), 2/3)/g(S(x), 1/3)]\}/2\log 2.$$

Thus, $T$ is minimally sufficient. □

The likelihood function

The preceding example shows how we can use $p(x, \theta)$ for different values of $\theta$ and the factorization theorem to establish that a sufficient statistic is minimally sufficient. We define the likelihood function $L$ for a given observed data vector $x$ as

$$L_x(\theta) = p(x, \theta), \quad \theta \in \Theta.$$

Thus, $L_x$ is a map from the sample space $\mathcal{X}$ to the class $\mathcal{T}$ of functions $\{\theta \to p(x, \theta) : x \in \mathcal{X}\}$. It is a statistic whose values are functions; if $X = x$, the statistic $L$ takes on the value $L_x$. In the discrete case, for a given $\theta$, $L_x(\theta)$ gives the probability of observing the point $x$. In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around $x$. However, when we think of $L_x(\theta)$ as a function of $\theta$, it gives, for a given observed $x$, the "likelihood" or "plausibility" of various $\theta$. The formula (1.2.8) for the posterior distribution can then be remembered as

Posterior $\propto$ (Prior) $\times$ (Likelihood)

where the sign $\propto$ denotes proportionality as functions of $\theta$.

Example 1.5.4 (continued). In this $N(\mu, \sigma^2)$ example, the likelihood function (1.5.11) is determined by the two-dimensional sufficient statistic

$$T = (T_1, T_2) = \Big(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\Big).$$

Set $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$, then

$$L_x(\theta) = (2\pi\theta_2)^{-n/2} \exp\Big\{-\frac{n\theta_1^2}{2\theta_2}\Big\} \exp\Big\{-\frac{1}{2\theta_2}(t_2 - 2\theta_1 t_1)\Big\}.$$

Now, as a function of $\theta$, $L_x(\cdot)$ determines $(t_1, t_2)$ because, for example,

$$t_2 = -2\log L_x(0, 1) - n\log 2\pi$$
with a similar expression for $t_1$ in terms of $L_x(0, 1)$ and $L_x(1, 1)$ (Problem 1.5.17). Thus, $L$ is a statistic that is equivalent to $(t_1, t_2)$ and, hence, itself sufficient. By arguing as in Example 1.5.1 (continued) we can show that $T$ and, hence, $L$ is minimal sufficient. □

In fact, a statistic closely related to $L$ solves the minimal sufficiency problem in general. Suppose there exists $\theta_0$ such that

$$\{x : p(x, \theta) > 0\} \subset \{x : p(x, \theta_0) > 0\}$$

for all $\theta$. Let $\Lambda_x = L_x / L_x(\theta_0)$. Thus, $\Lambda_x$ is the function valued statistic that at $\theta$ takes on the value $p(x, \theta)/p(x, \theta_0)$, the likelihood ratio of $\theta$ to $\theta_0$. Then $\Lambda_x$ is minimal sufficient. See Problem 1.5.12 for a proof of this theorem of Dynkin, Lehmann, and Scheffé.

The "irrelevant" part of the data

We can always rewrite the original $X$ as $(T(X), S(X))$ where $S(X)$ is a statistic needed to uniquely determine $x$ once we know the sufficient statistic $T(x)$. For instance, if $T(X) = \bar{X}$ we can take $S(X) = (X_1 - \bar{X}, \ldots, X_n - \bar{X})$, the residuals; or if $T(X) = (X_{(1)}, \ldots, X_{(n)})$, the order statistics, $S(X) = (R_1, \ldots, R_n)$, the ranks, where $R_i = \sum_{j=1}^n 1(X_j \le X_i)$. $S(X)$ becomes irrelevant (ancillary) for inference if $T(X)$ is known but only if $\mathcal{P}$ is valid. Thus, in Example 1.5.6, if $\sigma^2 = 1$ is postulated, $\bar{X}$ is sufficient, but if in fact $\sigma^2 \neq 1$ all information about $\sigma^2$ is contained in the residuals. If, as in Example 1.5.4, $\sigma^2$ is assumed unknown, $(\bar{X}, \sum_{i=1}^n (X_i - \bar{X})^2)$ is sufficient, but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding $S(X)$; see Problem 1.5.13. If $\mathcal{P}$ specifies that $X_1, \ldots, X_n$ are a random sample, $(X_{(1)}, \ldots, X_{(n)})$ is sufficient. But the ranks are needed if we want to look for possible dependencies in the observations as in Example 1.1.5.

Summary. Consider an experiment with observation vector $X = (X_1, \ldots, X_n)$. Suppose that $X$ has distribution in the class $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. We say that a statistic $T(X)$ is sufficient for $P \in \mathcal{P}$, or for the parameter $\theta$, if the conditional distribution of $X$ given $T(X) = t$ does not involve $\theta$. Let $p(X, \theta)$ denote the frequency function or density of $X$. The factorization theorem states that $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(X)$ such that

$$p(X, \theta) = g(T(X), \theta)\,h(X).$$

We show the following result: If $T(X)$ is sufficient for $\theta$, then for any decision procedure $\delta(X)$, we can find a randomized decision rule $\delta^*(T(X))$ depending only on the value of $t = T(X)$ and not on $\theta$ such that $\delta$ and $\delta^*$ have identical risk functions. We define a statistic $T(X)$ to be Bayes sufficient for a prior $\pi$ if the posterior distribution of $\theta$ given $X = x$ is the same as the posterior distribution of $\theta$ given $T(X) = T(x)$ for all $x$. If $T(X)$ is sufficient for $\theta$, it is Bayes sufficient for $\theta$. A sufficient statistic $T(X)$ is minimally sufficient for $\theta$ if for any other sufficient statistic $S(X)$ we can find a transformation $r$ such that $T(X) = r(S(X))$. The likelihood function is defined for a given data vector of
observations $X$ to be the function of $\theta$ defined by $L_X(\theta) = p(X, \theta)$, $\theta \in \Theta$. If $T(X)$ is sufficient for $\theta$, and if there is a value $\theta_0 \in \Theta$ such that

$$\{x : p(x, \theta) > 0\} \subset \{x : p(x, \theta_0) > 0\}, \quad \theta \in \Theta,$$

then, by the factorization theorem, the likelihood ratio

$$\Lambda_X(\theta) = \frac{L_X(\theta)}{L_X(\theta_0)}$$

depends on $X$ through $T(X)$ only. $\Lambda_X(\theta)$ is a minimally sufficient statistic.

1.6 EXPONENTIAL FAMILIES

The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman, Pitman, and Darmois through investigations of this property(1). Subsequently, many other common features of these families were discovered and they have become important in much of the modern theory of statistics.

Probability models with these common features include normal, binomial, Poisson, gamma, beta, and multinomial regression models used to relate a response variable $Y$ to a set of predictor variables. More generally, these families form the basis for an important class of models called generalized linear models. We return to these models in Chapter 2. They will reappear in several connections in this book.

1.6.1 The One-Parameter Case

The family of distributions of a model $\{P_\theta : \theta \in \Theta\}$, is said to be a one-parameter exponential family, if there exist real-valued functions $\eta(\theta)$, $B(\theta)$ on $\Theta$, real-valued functions $T$ and $h$ on $R^q$, such that the density (frequency) functions $p(x, \theta)$ of the $P_\theta$ may be written

$$p(x, \theta) = h(x) \exp\{\eta(\theta) T(x) - B(\theta)\} \tag{1.6.1}$$

where $x \in \mathcal{X} \subset R^q$. Note that the functions $\eta$, $B$, and $T$ are not unique.

In a one-parameter exponential family the random variable $T(X)$ is sufficient for $\theta$. This is clear because we need only identify $\exp\{\eta(\theta) T(x) - B(\theta)\}$ with $g(T(x), \theta)$ and $h(x)$ with itself in the factorization theorem. We shall refer to $T$ as a natural sufficient statistic of the family.

Here are some examples.

Example 1.6.1. The Poisson Distribution. Let $P_\theta$ be the Poisson distribution with unknown mean $\theta$. Then, for $x \in \{0, 1, 2, \ldots\}$,

$$p(x, \theta) = \frac{\theta^x e^{-\theta}}{x!} = \frac{1}{x!} \exp\{x \log\theta - \theta\}, \quad \theta > 0. \tag{1.6.2}$$
Therefore, the $P_\theta$ form a one-parameter exponential family with

$$q = 1, \quad \eta(\theta) = \log\theta, \quad B(\theta) = \theta, \quad T(x) = x, \quad h(x) = \frac{1}{x!}. \tag{1.6.3}$$

□

Example 1.6.2. The Binomial Family. Suppose $X$ has a $B(n, \theta)$ distribution, $0 < \theta < 1$. Then, for $x \in \{0, 1, \ldots, n\}$,

$$p(x, \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n-x} = \binom{n}{x} \exp\Big[x \log\Big(\frac{\theta}{1 - \theta}\Big) + n\log(1 - \theta)\Big]. \tag{1.6.4}$$

Therefore, the family of distributions of $X$ is a one-parameter exponential family with

$$q = 1, \quad \eta(\theta) = \log\Big(\frac{\theta}{1 - \theta}\Big), \quad B(\theta) = -n\log(1 - \theta), \quad T(x) = x, \quad h(x) = \binom{n}{x}. \tag{1.6.5}$$

□

Here is an example where $q = 2$.

Example 1.6.3. Suppose $X = (Z, Y)^T$ where $Y = Z + \theta W$, $\theta > 0$, $Z$ and $W$ are independent $N(0, 1)$. Then

$$f(x, \theta) = f(z, y, \theta) = f(z) f_\theta(y \mid z) = \varphi(z)\,\theta^{-1} \varphi((y - z)\theta^{-1})$$
$$= (2\pi\theta)^{-1} \exp\Big\{-\tfrac{1}{2}[z^2 + (y - z)^2 \theta^{-2}]\Big\}$$
$$= (2\pi)^{-1} \exp\Big\{-\tfrac{1}{2} z^2\Big\} \exp\Big\{-\tfrac{1}{2}\theta^{-2}(y - z)^2 - \log\theta\Big\}.$$

This is a one-parameter exponential family distribution with

$$q = 2, \quad \eta(\theta) = -\tfrac{1}{2}\theta^{-2}, \quad B(\theta) = \log\theta, \quad T(x) = (y - z)^2, \quad h(x) = (2\pi)^{-1} \exp\Big\{-\tfrac{1}{2} z^2\Big\}.$$

□

The families of distributions obtained by sampling from one-parameter exponential families are themselves one-parameter exponential families. Specifically, suppose $X_1, \ldots, X_m$ are independent and identically distributed with common distribution $P_\theta$, where the $P_\theta$ form a one-parameter exponential family as in (1.6.1). If $\{P_\theta^{(m)}\}$, $\theta \in \Theta$, is the family of distributions of $X = (X_1, \ldots, X_m)$ considered as a random vector in $R^{mq}$ and $p(x, \theta)$ are the corresponding density (frequency) functions, we have

$$p(x, \theta) = \prod_{i=1}^m h(x_i) \exp[\eta(\theta) T(x_i) - B(\theta)] = \Big[\prod_{i=1}^m h(x_i)\Big] \exp\Big[\eta(\theta) \sum_{i=1}^m T(x_i) - mB(\theta)\Big] \tag{1.6.6}$$
where $x = (x_1, \ldots, x_m)$. Therefore, the $P_\theta^{(m)}$ form a one-parameter exponential family. If we use the superscript $m$ to denote the corresponding $T$, $\eta$, $B$, and $h$, then $q^{(m)} = mq$, and

$$\eta^{(m)}(\theta) = \eta(\theta), \quad T^{(m)}(x) = \sum_{i=1}^m T(x_i), \quad B^{(m)}(\theta) = mB(\theta), \quad h^{(m)}(x) = \prod_{i=1}^m h(x_i). \tag{1.6.7}$$

Note that the natural sufficient statistic $T^{(m)}$ is one-dimensional whatever be $m$. For example, if $X = (X_1, \ldots, X_m)$ is a vector of independent and identically distributed $P(\theta)$ random variables and $P_\theta^{(m)}$ is the family of distributions of $X$, then the $P_\theta^{(m)}$ form a one-parameter exponential family with natural sufficient statistic $T^{(m)}(x) = \sum_{i=1}^m x_i$.

Some other important examples are summarized in the following table. We leave the proof of these assertions to the reader.

TABLE 1.6.1

Family of distributions                  η(θ)         T(x)
N(μ, σ²)        σ² fixed                 μ/σ²         x
                μ fixed                  -1/(2σ²)     (x - μ)²
Γ(p, λ)         p fixed                  -λ           x
                λ fixed                  (p - 1)      log x
β(r, s)         r fixed                  (s - 1)      log(1 - x)
                s fixed                  (r - 1)      log x

The statistic $T^{(m)}(X_1, \ldots, X_m)$ corresponding to the one-parameter exponential family of distributions of a sample from any of the foregoing is just $\sum_{i=1}^m T(X_i)$.

In our first Example 1.6.1 the sufficient statistic $T^{(m)}(X_1, \ldots, X_m) = \sum_{i=1}^m X_i$ is distributed as $P(m\theta)$. This family of Poisson distributions is one-parameter exponential whatever be $m$. In the discrete case we can establish the following general result.

Theorem 1.6.1. Let $\{P_\theta\}$ be a one-parameter exponential family of discrete distributions with corresponding functions $T$, $\eta$, $B$, and $h$, then the family of distributions of the statistic $T(X)$ is a one-parameter exponential family of discrete distributions whose frequency functions may be written

$$h^*(t) \exp\{\eta(\theta) t - B(\theta)\} \tag{1.6.8}$$

for suitable $h^*$.
Proof. By definition,

$$P_\theta[T(x) = t] = \sum_{\{x : T(x) = t\}} p(x, \theta) = \sum_{\{x : T(x) = t\}} h(x) \exp[\eta(\theta) T(x) - B(\theta)] = \exp[\eta(\theta) t - B(\theta)] \Big\{\sum_{\{x : T(x) = t\}} h(x)\Big\}.$$

If we let $h^*(t) = \sum_{\{x : T(x) = t\}} h(x)$, the result follows. □

A similar theorem holds in the continuous case if the distributions of $T(X)$ are themselves continuous.

Canonical exponential families. We obtain an important and useful reparametrization of the exponential family (1.6.1) by letting the model be indexed by $\eta$ rather than $\theta$. The exponential family then has the form

$$q(x, \eta) = h(x) \exp[\eta T(x) - A(\eta)], \quad x \in \mathcal{X} \subset R^q \tag{1.6.9}$$

where $A(\eta) = \log \int \cdots \int h(x) \exp[\eta T(x)]\,dx$ in the continuous case and the integral is replaced by a sum in the discrete case. If $\theta \in \Theta$, then $A(\eta)$ must be finite, if $q$ is definable. Let $\mathcal{E}$ be the collection of all $\eta$ such that $A(\eta)$ is finite. Then as we show in Section 1.6.4, $\mathcal{E}$ is either an interval or all of $R$ and the class of models (1.6.9) with $\eta \in \mathcal{E}$ contains the class of models with $\theta \in \Theta$. The model given by (1.6.9) with $\eta$ ranging over $\mathcal{E}$ is called the canonical one-parameter exponential family generated by $T$ and $h$. $\mathcal{E}$ is called the natural parameter space and $T$ is called the natural sufficient statistic.

Example 1.6.1. (continued). The Poisson family in canonical form is

$$q(x, \eta) = (1/x!) \exp\{\eta x - \exp[\eta]\}, \quad x \in \{0, 1, 2, \ldots\},$$

where $\eta = \log\theta$,

$$\exp\{A(\eta)\} = \sum_{x=0}^{\infty} (e^{\eta x}/x!) = \sum_{x=0}^{\infty} (e^{\eta})^x/x! = \exp(e^{\eta}),$$

and $\mathcal{E} = R$. □

Here is a useful result.

Theorem 1.6.2. If $X$ is distributed according to (1.6.9) and $\eta$ is an interior point of $\mathcal{E}$, the moment-generating function of $T(X)$ exists and is given by

$$M(s) = \exp[A(s + \eta) - A(\eta)]$$

for $s$ in some neighborhood of 0.
Moreover,

$$E(T(X)) = A'(\eta), \quad \mathrm{Var}(T(X)) = A''(\eta).$$

Proof. We give the proof in the continuous case. We compute

$$M(s) = E(\exp(sT(X))) = \int \cdots \int h(x) \exp[(s + \eta) T(x) - A(\eta)]\,dx$$
$$= \{\exp[A(s + \eta) - A(\eta)]\} \int \cdots \int h(x) \exp[(s + \eta) T(x) - A(s + \eta)]\,dx$$
$$= \exp[A(s + \eta) - A(\eta)]$$

because the last factor, being the integral of a density, is one. The rest of the theorem follows from the moment-generating property of $M(s)$ (see Section A.12). □

Here is a typical application of this result.

Example 1.6.4. Suppose $X_1, \ldots, X_n$ is a sample from a population with density

$$p(x, \theta) = (x/\theta^2) \exp(-x^2/2\theta^2), \quad x > 0,\ \theta > 0.$$

This is known as the Rayleigh distribution. It is used to model the density of "time until failure" for certain types of equipment. Now

$$p(x, \theta) = \Big(\prod_{i=1}^n (x_i/\theta^2)\Big) \exp\Big(-\sum_{i=1}^n x_i^2/2\theta^2\Big) = \Big(\prod_{i=1}^n x_i\Big) \exp\Big[-\frac{1}{2\theta^2} \sum_{i=1}^n x_i^2 - n\log\theta^2\Big].$$

Here $\eta = -1/2\theta^2$, $\theta^2 = -1/2\eta$, $B(\theta) = n\log\theta^2$ and $A(\eta) = -n\log(-2\eta)$. Therefore, the natural sufficient statistic $\sum_{i=1}^n X_i^2$ has mean $-n/\eta = 2n\theta^2$ and variance $n/\eta^2 = 4n\theta^4$. Direct computation of these moments is more complicated. □

1.6.2 The Multiparameter Case

Our discussion of the "natural form" suggests that one-parameter exponential families are naturally indexed by a one-dimensional real parameter $\eta$ and admit a one-dimensional sufficient statistic $T(x)$. More generally, Koopman, Pitman, and Darmois were led in their investigations to the following family of distributions, which is naturally indexed by a $k$-dimensional parameter and admits a $k$-dimensional sufficient statistic.

A family of distributions $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R^k$, is said to be a $k$-parameter exponential family, if there exist real-valued functions $\eta_1, \ldots, \eta_k$ and $B$ of $\theta$, and real-valued functions $T_1, \ldots, T_k$, $h$ on $R^q$ such that the density (frequency) functions of the $P_\theta$ may be written as,

$$p(x, \theta) = h(x) \exp\Big[\sum_{j=1}^k \eta_j(\theta) T_j(x) - B(\theta)\Big], \quad x \in \mathcal{X} \subset R^q. \tag{1.6.10}$$
By Theorem 1.5.1, the vector $T(X) = (T_1(X), \ldots, T_k(X))^T$ is sufficient. It will be referred to as a natural sufficient statistic of the family.

Again, suppose $X = (X_1, \ldots, X_m)$ where the $X_i$ are independent and identically distributed and their common distribution ranges over a $k$-parameter exponential family given by (1.6.10). Then the distributions of $X$ form a $k$-parameter exponential family with natural sufficient statistic

$$T^{(m)}(x) = \Big(\sum_{i=1}^m T_1(X_i), \ldots, \sum_{i=1}^m T_k(X_i)\Big).$$

Example 1.6.5. The Normal Family. Suppose that $P_\theta = N(\mu, \sigma^2)$, $\Theta = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\ \sigma^2 > 0\}$. The density of $P_\theta$ may be written as

$$p(x, \theta) = \exp\Big[\frac{\mu}{\sigma^2} x - \frac{x^2}{2\sigma^2} - \frac{1}{2}\Big(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\Big)\Big], \tag{1.6.11}$$

which corresponds to a two-parameter exponential family with $q = 1$, $\theta_1 = \mu$, $\theta_2 = \sigma^2$, and

$$\eta_1(\theta) = \frac{\mu}{\sigma^2}, \quad T_1(x) = x, \quad \eta_2(\theta) = -\frac{1}{2\sigma^2}, \quad T_2(x) = x^2,$$
$$B(\theta) = \frac{1}{2}\Big(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\Big), \quad h(x) = 1.$$

If we observe a sample $X = (X_1, \ldots, X_m)$ from a $N(\mu, \sigma^2)$ population, then the preceding discussion leads us to the natural sufficient statistic

$$\Big(\sum_{i=1}^m X_i, \sum_{i=1}^m X_i^2\Big),$$

which we obtained in the previous section (Example 1.5.4). □

Again it will be convenient to consider the "biggest" families, letting the model be indexed by $\eta = (\eta_1, \ldots, \eta_k)^T$ rather than $\theta$. Thus, the canonical $k$-parameter exponential family generated by $T$ and $h$ is

$$q(x, \eta) = h(x) \exp\{T^T(x)\eta - A(\eta)\}, \quad x \in \mathcal{X} \subset R^q$$

where $T(x) = (T_1(x), \ldots, T_k(x))^T$ and, in the continuous case,

$$A(\eta) = \log \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(x) \exp\{T^T(x)\eta\}\,dx.$$

In the discrete case, $A(\eta)$ is defined in the same way except integrals over $R^q$ are replaced by sums. In either case, we define the natural parameter space as

$$\mathcal{E} = \{\eta \in R^k : -\infty < A(\eta) < \infty\}.$$
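As a quick numerical check of the normal family in Example 1.6.5, the sketch below compares the usual $N(\mu, \sigma^2)$ density with the exponential family form (1.6.11), using $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/(2\sigma^2)$, $B(\theta) = \frac{1}{2}(\mu^2/\sigma^2 + \log(2\pi\sigma^2))$ and $h(x) = 1$. The function names and the test values are illustrative assumptions, not notation from the text.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Standard form of the N(mu, sigma2) density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def expfam_pdf(x, mu, sigma2):
    """The same density written as in (1.6.11), with h(x) = 1."""
    eta1 = mu / sigma2            # coefficient of T1(x) = x
    eta2 = -1.0 / (2 * sigma2)    # coefficient of T2(x) = x**2
    B = 0.5 * (mu ** 2 / sigma2 + math.log(2 * math.pi * sigma2))
    return math.exp(eta1 * x + eta2 * x ** 2 - B)


def natural_sufficient_statistic(sample):
    """(sum of x_i, sum of x_i**2) for an i.i.d. normal sample."""
    return (sum(sample), sum(v * v for v in sample))


# Hypothetical values chosen only for illustration.
for x in (-2.0, 0.0, 1.5):
    print(x, normal_pdf(x, 0.5, 2.0), expfam_pdf(x, 0.5, 2.0))
print(natural_sufficient_statistic([1.0, 2.0, 3.0]))
```

The two density columns agree up to floating-point rounding, and the last line shows the two-dimensional statistic to which a sample of any size reduces.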
Example 1.6.5. ($N(\mu, \sigma^2)$ continued). In this example, $k = 2$, $T^T(x) = (x, x^2) = (T_1(x), T_2(x))$, $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/2\sigma^2$, $A(\eta) = \frac{1}{2}[(-\eta_1^2/2\eta_2) + \log(\pi/(-\eta_2))]$, $h(x) = 1$ and $\mathcal{E} = R \times R^- = \{(\eta_1, \eta_2) : \eta_1 \in R,\ \eta_2 < 0\}$. □

Example 1.6.6. Linear Regression. Suppose as in Examples 1.1.4 and 1.5.5 that $Y_1, \ldots, Y_n$ are independent, $Y_i \sim N(\mu_i, \sigma^2)$, with $\mu_i = \beta_1 + \beta_2 z_i$, $i = 1, \ldots, n$. From Example 1.5.5, the density of $Y = (Y_1, \ldots, Y_n)^T$ can be put in canonical form with $k = 3$, $T(Y) = (\sum Y_i, \sum z_i Y_i, \sum Y_i^2)^T$, $\eta_1 = \beta_1/\sigma^2$, $\eta_2 = \beta_2/\sigma^2$, $\eta_3 = -1/2\sigma^2$,

$$A(\eta) = \frac{-n}{4\eta_3}\big[\eta_1^2 + \hat{m}_2 \eta_2^2 + 2\bar{z}\eta_1\eta_2\big] + \frac{n}{2}\log(\pi/(-\eta_3)),$$

and $\mathcal{E} = \{(\eta_1, \eta_2, \eta_3) : \eta_1 \in R,\ \eta_2 \in R,\ \eta_3 < 0\}$, where $\hat{m}_2 = n^{-1} \sum z_i^2$. □

Example 1.6.7. Multinomial Trials. We observe the outcomes of $n$ independent trials where each trial can end up in one of $k$ possible categories. We write the outcome vector as $X = (X_1, \ldots, X_n)^T$ where the $X_i$ are i.i.d. as $X$ and the sample space of each $X_i$ is the $k$ categories $\{1, 2, \ldots, k\}$. Let $T_j(x) = \sum_{i=1}^n 1[X_i = j]$, and $\lambda_j = P(X_i = j)$. Then $p(x, \lambda) = \prod_{j=1}^k \lambda_j^{T_j(x)}$, $\lambda \in \Lambda$, where $\Lambda$ is the simplex $\{\lambda \in R^k : 0 < \lambda_j < 1,\ j = 1, \ldots, k,\ \sum_{j=1}^k \lambda_j = 1\}$. It will often be more convenient to work with unrestricted parameters. In this example, we can achieve this by the reparametrization

$$\lambda_j = e^{\alpha_j} \Big/ \sum_{j=1}^k e^{\alpha_j}, \quad j = 1, \ldots, k,\ \alpha \in R^k.$$

Now we can write the likelihood as

$$q_0(x, \alpha) = \exp\Big\{\sum_{j=1}^k \alpha_j T_j(x) - n\log \sum_{j=1}^k \exp(\alpha_j)\Big\}.$$

This is a $k$-parameter canonical exponential family generated by $T_1, \ldots, T_k$ and $h(x) = \prod_{i=1}^n 1[x_i \in \{1, \ldots, k\}]$ with canonical parameter $\alpha$ and $\mathcal{E} = R^k$. However $\alpha$ is not identifiable because $q_0(x, \alpha + c\mathbf{1}) = q_0(x, \alpha)$ for $\mathbf{1} = (1, \ldots, 1)^T$ and all $c$. This can be remedied by considering

$$\eta_j = \log(\lambda_j/\lambda_k) = \alpha_j - \alpha_k, \quad 1 \le j \le k - 1,$$

and rewriting

$$q(x, \eta) = \exp\Big\{T_{(k-1)}^T(x)\eta - n\log\Big(1 + \sum_{j=1}^{k-1} e^{\eta_j}\Big)\Big\}$$

where
$$T_{(k-1)}(x) = (T_1(x), \ldots, T_{k-1}(x))^T.$$

Note that $q(x, \eta)$ is a $k - 1$ parameter canonical exponential family generated by $T_{(k-1)}$ and $h(x) = \prod_{i=1}^n 1[x_i \in \{1, \ldots, k\}]$ with canonical parameter $\eta$ and $\mathcal{E} = R^{k-1}$. Moreover, the parameters $\eta_j = \log(P_\eta[X = j]/P_\eta[X = k])$, $1 \le j \le k - 1$, are identifiable. Note that the model for $X$ is unchanged. □

1.6.3 Building Exponential Families

Submodels

A submodel of a $k$-parameter canonical exponential family $\{q(x, \eta);\ \eta \in \mathcal{E} \subset R^k\}$ is an exponential family defined by

$$p(x, \theta) = q(x, \eta(\theta)) \tag{1.6.12}$$

where $\theta \in \Theta \subset R^l$, $l \le k$, and $\eta$ is a map from $\Theta$ to a subset of $R^k$. Thus, if $X$ is discrete taking on $k$ values as in Example 1.6.7 and $X = (X_1, \ldots, X_n)^T$ where the $X_i$ are i.i.d. as $X$, then all models for $X$ are exponential families because they are submodels of the multinomial trials model.

Affine transformations

If $\mathcal{P}$ is the canonical family generated by $T_{k \times 1}$ and $h$ and $M$ is the affine transformation from $R^k$ to $R^l$ defined by

$$M(T) = M_{l \times k} T + b_{l \times 1},$$

it is easy to see that the family generated by $M(T(X))$ and $h$ is the subfamily of $\mathcal{P}$ corresponding to

$$\Theta = \{\theta \in R^l : M^T\theta \in \mathcal{E}\}$$

and

$$\eta(\theta) = M^T\theta.$$

Similarly, if $\Theta \subset R^l$ and $\eta(\theta) = B_{k \times l}\theta \subset R^k$, then the resulting submodel of $\mathcal{P}$ above is a submodel of the exponential family generated by $B^T T(X)$ and $h$. See Problem 1.6.17 for details. Here is an example of affine transformations of $\theta$ and $T$.

Example 1.6.8. Logistic Regression. Let $Y_i$ be independent binomial, $B(n_i, \lambda_i)$, $1 \le i \le n$. If the $\lambda_i$ are unrestricted, $0 < \lambda_i < 1$, $1 \le i \le n$, this, from Example 1.6.2, is an $n$-parameter canonical exponential family with $\mathcal{Y}_i \equiv$ integers from 0 to $n_i$ generated by $T(Y_1, \ldots, Y_n) = Y$, $h(y) = \prod_{i=1}^n \binom{n_i}{y_i} 1(0 \le y_i \le n_i)$. Here $\eta_i = \log\frac{\lambda_i}{1 - \lambda_i}$, $A(\eta) = \sum_{i=1}^n n_i \log(1 + e^{\eta_i})$. However, let $x_1 < \cdots < x_n$ be specified levels and

$$\eta_i(\theta) = \theta_1 + \theta_2 x_i, \quad 1 \le i \le n. \tag{1.6.13}$$
This is a linear transformation $\eta(\theta) = B_{n \times 2}\theta$ corresponding to $B_{n \times 2} = (\mathbf{1}, x)$, where $\mathbf{1}$ is $(1, \ldots, 1)^T$, $x = (x_1, \ldots, x_n)^T$. Set $M = B^T$, then this is the two-parameter canonical exponential family generated by $MY = (\sum_{i=1}^n Y_i, \sum_{i=1}^n x_i Y_i)^T$ and $h$ with

$$A(\theta_1, \theta_2) = \sum_{i=1}^n n_i \log(1 + \exp(\theta_1 + \theta_2 x_i)).$$

This model is sometimes applied in experiments to determine the toxicity of a substance. The $Y_i$ represent the number of animals dying out of $n_i$ when exposed to level $x_i$ of the substance. It is assumed that each animal has a random toxicity threshold $X$ such that death results if and only if a substance level on or above $X$ is applied. Assume also:

(a) No interaction between animals (independence) in relation to drug effects

(b) The distribution of $X$ in the animal population is logistic; that is,

$$P[X \le x] = [1 + \exp\{-(\theta_1 + \theta_2 x)\}]^{-1}, \tag{1.6.14}$$

$\theta_1 \in R$, $\theta_2 > 0$. Then (and only then),

$$\log(P[X \le x]/(1 - P[X \le x])) = \theta_1 + \theta_2 x$$

and (1.6.13) holds. □

Curved exponential families

Exponential families (1.6.12) with the range of $\eta(\theta)$ restricted to a subset of dimension $l$ with $l \le k - 1$, are called curved exponential families provided they do not form a canonical exponential family in the $\theta$ parametrization.

Example 1.6.9. Gaussian with Fixed Signal-to-Noise Ratio. In the normal case with $X_1, \ldots, X_n$ i.i.d. $N(\mu, \sigma^2)$, suppose the ratio $|\mu|/\sigma$, which is called the coefficient of variation or signal-to-noise ratio, is a known constant $\lambda_0 > 0$. Then, with $\theta = \mu$, we can write

$$p(x, \theta) = \exp\Big\{\lambda_0^2\theta^{-1} T_1 - \tfrac{1}{2}\lambda_0^2\theta^{-2} T_2 - \tfrac{1}{2} n[\lambda_0^2 + \log(2\pi\lambda_0^{-2}\theta^2)]\Big\}$$

where $T_1 = \sum_{i=1}^n X_i$, $T_2 = \sum_{i=1}^n X_i^2$, $\eta_1(\theta) = \lambda_0^2\theta^{-1}$ and $\eta_2(\theta) = -\frac{1}{2}\lambda_0^2\theta^{-2}$. This is a curved exponential family with $l = 1$. □

In Example 1.6.8, the $\theta$ parametrization has dimension 2, which is less than $k = n$ when $n \ge 3$. However, $p(x, \theta)$ in the $\theta$ parametrization is a canonical exponential family, so it is not a curved family.

Example 1.6.10. Location-Scale Regression. Suppose that $Y_1, \ldots, Y_n$ are independent, $Y_i \sim N(\mu_i, \sigma_i^2)$. If each $\mu_i$ ranges over $R$ and each $\sigma_i^2$ ranges over $(0, \infty)$, this is by Example 1.6.5 a $2n$-parameter canonical exponential family model with $\eta_i = \mu_i/\sigma_i^2$, and $\eta_{n+i} = -1/2\sigma_i^2$, $i = 1, \ldots, n$, generated by
$$T(Y) = (Y_1, \ldots, Y_n, Y_1^2, \ldots, Y_n^2)^T$$

and $h(Y) = 1$. Next suppose that $(\mu_i, \sigma_i^2)$ depend on the value $z_i$ of some covariate, say,

$$\mu_i = \theta_1 + \theta_2 z_i, \quad \sigma_i^2 = \theta_3(\theta_1 + \theta_2 z_i)^2, \quad z_1 < \cdots < z_n$$

for unknown parameters $\theta_1 \in R$, $\theta_2 \in R$, $\theta_3 > 0$ (e.g., Bickel, 1978; Carroll and Ruppert, 1988, Sections 2.1-2.5; and Snedecor and Cochran, 1989, Section 15.10). For $\theta = (\theta_1, \theta_2, \theta_3)$, the map $\eta(\theta)$ is

$$\eta_i(\theta) = \theta_3^{-1}(\theta_1 + \theta_2 z_i)^{-1}, \quad \eta_{n+i}(\theta) = -\tfrac{1}{2}\theta_3^{-1}(\theta_1 + \theta_2 z_i)^{-2}, \quad i = 1, \ldots, n.$$

Because $\sum_{i=1}^n \eta_i(\theta) Y_i + \sum_{i=1}^n \eta_{n+i}(\theta) Y_i^2$ cannot be written in the form $\sum_{j=1}^3 \eta_j^*(\theta) T_j^*(Y)$ for some $\eta_j^*(\theta)$, $T_j^*(Y)$, then $p(y, \theta) = q(y, \eta(\theta))$ as defined in (1.6.12) is not an exponential family model, but a curved exponential family model with $l = 3$. □

Models in which the variance $\mathrm{Var}(Y_i)$ depends on $i$ are called heteroscedastic whereas models in which $\mathrm{Var}(Y_i)$ does not depend on $i$ are called homoscedastic. Thus, Examples 1.6.10 and 1.6.6 are heteroscedastic and homoscedastic models, respectively.

We return to curved exponential family models in Section 2.3.

Supermodels

We have already noted that the exponential family structure is preserved under i.i.d. sampling. Even more is true. Let $Y_j$, $1 \le j \le n$, be independent, $Y_j \in \mathcal{Y}_j \subset R^q$, with an exponential family density

$$q_j(y_j, \theta) = \exp\{T_j^T(y_j)\eta(\theta) - B_j(\theta)\}\,h_j(y_j), \quad \theta \in \Theta \subset R^k.$$

Then $Y = (Y_1, \ldots, Y_n)^T$ is modeled by the exponential family generated by $T(Y) = \sum_{j=1}^n T_j(Y_j)$ and $\prod_{j=1}^n h_j(y_j)$, with parameter $\eta(\theta)$, and $B(\theta) = \sum_{j=1}^n B_j(\theta)$.

In Example 1.6.8 note that (1.6.13) exhibits $Y_j$ as being distributed according to a two-parameter family generated by $T_j(Y_j) = (Y_j, x_j Y_j)$ and we can apply the supermodel approach to reach the same conclusion as before.

1.6.4 Properties of Exponential Families

Theorem 1.6.1 generalizes directly to $k$-parameter families as does its continuous analogue. We extend the statement of Theorem 1.6.2.

Recall from Section B.5 that for any random vector $T_{k \times 1}$, we define

$$M(s) = E e^{s^T T}$$

as the moment-generating function, and
$$E(T) = (E(T_1), \ldots, E(T_k))^T, \quad \mathrm{Var}(T) = \|\mathrm{Cov}(T_a, T_b)\|_{k \times k}.$$

Theorem 1.6.3. Let $\mathcal{P}$ be a canonical $k$-parameter exponential family generated by $(T, h)$ with corresponding natural parameter space $\mathcal{E}$ and function $A(\eta)$. Then

(a) $\mathcal{E}$ is convex

(b) $A : \mathcal{E} \to R$ is convex

(c) If $\mathcal{E}$ has nonempty interior $\mathcal{E}^0$ in $R^k$ and $\eta_0 \in \mathcal{E}^0$, then $T(X)$ has under $\eta_0$ a moment-generating function $M$ given by

$$M(s) = \exp\{A(\eta_0 + s) - A(\eta_0)\}$$

valid for all $s$ such that $\eta_0 + s \in \mathcal{E}$. Since $\eta_0$ is an interior point this set of $s$ includes a ball about 0.

Corollary 1.6.1. Under the conditions of Theorem 1.6.3(c),

$$E_{\eta_0} T(X) = \dot{A}(\eta_0), \quad \mathrm{Var}_{\eta_0} T(X) = \ddot{A}(\eta_0)$$

where $\dot{A}(\eta_0) = \big(\frac{\partial A}{\partial \eta_1}(\eta_0), \ldots, \frac{\partial A}{\partial \eta_k}(\eta_0)\big)^T$, $\ddot{A}(\eta_0) = \big\|\frac{\partial^2 A}{\partial \eta_a \partial \eta_b}(\eta_0)\big\|$.

The corollary follows immediately from Theorem B.5.1 and Theorem 1.6.3(c).

Proof of Theorem 1.6.3. We prove (b) first. Suppose $\eta_1, \eta_2 \in \mathcal{E}$ and $0 < \alpha < 1$. By the Hölder inequality (B.9.4), for any $u(x)$, $v(x)$, $h(x) \ge 0$, $r, s > 0$ with $\frac{1}{r} + \frac{1}{s} = 1$,

$$\int u(x) v(x) h(x)\,dx \le \Big(\int u^r(x) h(x)\,dx\Big)^{1/r} \Big(\int v^s(x) h(x)\,dx\Big)^{1/s}.$$

Substitute $\frac{1}{r} = \alpha$, $\frac{1}{s} = 1 - \alpha$, $u(x) = \exp(\alpha\eta_1^T T(x))$, $v(x) = \exp((1 - \alpha)\eta_2^T T(x))$ and take logs of both sides to obtain, (with $\infty$ permitted on either side),

$$A(\alpha\eta_1 + (1 - \alpha)\eta_2) \le \alpha A(\eta_1) + (1 - \alpha) A(\eta_2) \tag{1.6.15}$$

which is (b). If $\eta_1, \eta_2 \in \mathcal{E}$ the right-hand side of (1.6.15) is finite. Because

$$\int \exp(\eta^T T(x)) h(x)\,dx > 0$$

for all $\eta$ we conclude from (1.6.15) that $\alpha\eta_1 + (1 - \alpha)\eta_2 \in \mathcal{E}$ and (a) follows. Finally (c) is proved in exactly the same way as Theorem 1.6.2. □

The formulae of Corollary 1.6.1 give a classical result in Example 1.6.7.
Example 1.6.7 (continued). Here, using the $\alpha$ parametrization,

$$A(\alpha) = n\log\left(\sum_{j=1}^k e^{\alpha_j}\right)$$

and

$$E_\lambda(T_j(X)) = nP_\lambda[X = j] = n\lambda_j = \frac{ne^{\alpha_j}}{\sum_{l=1}^k e^{\alpha_l}},$$

$$\mathrm{Cov}_\lambda(T_i(X), T_j(X)) = \frac{\partial^2 A}{\partial\alpha_i\partial\alpha_j}(\alpha) = -n\lambda_i\lambda_j, \quad i \ne j, \qquad \mathrm{Var}_\lambda(T_i(X)) = \frac{\partial^2 A}{\partial\alpha_i^2}(\alpha) = n\lambda_i(1 - \lambda_i). \qquad \Box$$

The rank of an exponential family

Evidently every $k$-parameter exponential family is also $k'$-dimensional with $k' > k$. However, there is a minimal dimension. An exponential family is of rank $k$ iff the generating statistic $T$ is $k$-dimensional and $1, T_1(X), \ldots, T_k(X)$ are linearly independent with positive probability. Formally, $P_\eta\left[\sum_{j=1}^k a_jT_j(X) = a_{k+1}\right] < 1$ unless all $a_j$ are $0$.

Note that $P_\theta(A) = 0$ or $P_\theta(A) < 1$ for some $\theta$ iff the corresponding statement holds for all $\theta$ because $0 < \frac{p(x, \theta_1)}{p(x, \theta_2)} < \infty$ for all $x$, $\theta_1$, $\theta_2$ such that $h(x) > 0$.

Going back to Example 1.6.7 we can see that the multinomial family is of rank at most $k - 1$. It is intuitively clear that $k - 1$ is in fact its rank and this is seen in Theorem 1.6.4 that follows. Similarly, in Example 1.6.8, if $n = 1$, and $\eta_1(\theta) = \theta_1 + \theta_2x_1$, we are writing the one-parameter binomial family corresponding to $Y_1$ as a two-parameter family with generating statistic $(Y_1, x_1Y_1)$. But the rank of the family is 1 and $\theta_1$ and $\theta_2$ are not identifiable. However, if we consider $\mathbf{Y}$ with $n \ge 2$ and $x_1 < x_n$ the family as we have seen remains of rank $\le 2$ and is in fact of rank 2. Our discussion suggests a link between rank and identifiability of the $\eta$ parameterization. We establish the connection and other fundamental relationships in Theorem 1.6.4.

Theorem 1.6.4. Suppose $\mathcal{P} = \{q(x, \eta);\ \eta \in \mathcal{E}\}$ is a canonical exponential family generated by $(T_{k \times 1}, h)$ with natural parameter space $\mathcal{E}$ such that $\mathcal{E}$ is open. Then the following are equivalent.

(i) $\mathcal{P}$ is of rank $k$.

(ii) $\eta$ is a parameter (identifiable).

(iii) $\mathrm{Var}_\eta(T)$ is positive definite.
(iv) $\eta \to \dot{A}(\eta)$ is 1-1 on $\mathcal{E}$.

(v) $A$ is strictly convex on $\mathcal{E}$.

Note that, by Theorem 1.6.3, because $\mathcal{E}$ is open, $\dot{A}$ is defined on all of $\mathcal{E}$.

Proof. We give a detailed proof for $k = 1$. The proof for $k > 1$ is then sketched with details left to a problem. Let $\sim(\cdot)$ denote "$(\cdot)$ is false." Then

$\sim$(i) $\Leftrightarrow$ $P_\eta[a_1T = a_2] = 1$ for $a_1 \ne 0$. This is equivalent to $\mathrm{Var}_\eta(T) = 0$ $\Leftrightarrow$ $\sim$(iii).

$\sim$(ii) $\Leftrightarrow$ There exist $\eta_1 \ne \eta_2$ such that $P_{\eta_1} = P_{\eta_2}$. Equivalently

$$\exp\{\eta_1T(x) - A(\eta_1)\}h(x) = \exp\{\eta_2T(x) - A(\eta_2)\}h(x).$$

Taking logs we obtain $(\eta_1 - \eta_2)T(X) = A(\eta_1) - A(\eta_2)$ with probability 1 $\equiv$ $\sim$(i). We, thus, have (i) $\equiv$ (ii) $\equiv$ (iii). Now (iii) $\Rightarrow$ $A''(\eta) > 0$ by Theorem 1.6.2 and, hence, $A'(\eta)$ is strictly monotone increasing and 1-1. Conversely, $A''(\eta_0) = 0$ for some $\eta_0$ implies that $T \equiv c$, with probability 1, for all $\eta$, by our remarks in the discussion of rank, which implies that $A''(\eta) = 0$ for all $\eta$ and, hence, $A'$ is constant. Thus, (iii) $\equiv$ (iv) and the same discussion shows that (iii) $\equiv$ (v).

Proof of the general case sketched

I. $\sim$(i) $\equiv$ $\sim$(iii)

$\sim$(i) $\equiv$ $P_\eta[a^TT = c] = 1$ for some $a \ne 0$, all $\eta$.

$\sim$(iii) $\equiv$ $a^T\mathrm{Var}_\eta(T)a = \mathrm{Var}_\eta(a^TT) = 0$ for some $a \ne 0$, all $\eta$ $\equiv$ $\sim$(i).

II. $\sim$(ii) $\equiv$ $\sim$(i)

$\sim$(ii) $\equiv$ $P_{\eta_1} = P_{\eta_0}$ for some $\eta_1 \ne \eta_0$. Let

$$\mathcal{Q} = \{P_{\eta_0 + c(\eta_1 - \eta_0)} : \eta_0 + c(\eta_1 - \eta_0) \in \mathcal{E}\}.$$

$\mathcal{Q}$ is the exponential family (one-parameter) generated by $(\eta_1 - \eta_0)^TT$. Apply the case $k = 1$ to $\mathcal{Q}$ to get $\sim$(ii) $\equiv$ $\sim$(i).

III. (iv) $\equiv$ (v) $\equiv$ (iii)

Properties (iv) and (v) are equivalent to the statements holding for every $\mathcal{Q}$ defined as previously for arbitrary $\eta_0$, $\eta_1$. $\Box$

Corollary 1.6.2. Suppose that the conditions of Theorem 1.6.4 hold and $\mathcal{P}$ is of rank $k$. Then

(a) $\mathcal{P}$ may be uniquely parametrized by $\mu(\eta) \equiv E_\eta T(X)$ where $\mu$ ranges over $\dot{A}(\mathcal{E})$,

(b) $\log q(x, \eta)$ is a strictly concave function of $\eta$ on $\mathcal{E}$.

Proof. This is just a restatement of (iv) and (v) of the theorem. $\Box$
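The link between rank, identifiability, and positive definiteness in Theorem 1.6.4 can be seen concretely in the multinomial family written with all $k$ counts as generating statistic. The sketch below is only an illustration under that assumption (the function names are hypothetical): with $A(\alpha) = n\log\sum_j e^{\alpha_j}$, the Hessian of $A$ is the multinomial covariance matrix, it annihilates the direction $(1, \ldots, 1)$, and shifting every $\alpha_j$ by the same constant leaves the cell probabilities unchanged.

```python
import math


def probabilities(alpha):
    """Cell probabilities lambda_j = exp(alpha_j) / sum_l exp(alpha_l)."""
    m = max(alpha)                      # subtract the max for numerical stability
    weights = [math.exp(a - m) for a in alpha]
    total = sum(weights)
    return [w / total for w in weights]


def mean_T(alpha, n):
    """Gradient of A(alpha) = n log sum_j exp(alpha_j): E T_j = n lambda_j."""
    return [n * p for p in probabilities(alpha)]


def cov_T(alpha, n):
    """Hessian of A: n * (diag(lambda) - lambda lambda^T)."""
    lam = probabilities(alpha)
    k = len(lam)
    return [[n * ((lam[i] if i == j else 0.0) - lam[i] * lam[j])
             for j in range(k)] for i in range(k)]


def quadratic_form(matrix, a):
    """a^T M a, which equals Var(a^T T) when M is the covariance of T."""
    k = len(a)
    return sum(a[i] * matrix[i][j] * a[j] for i in range(k) for j in range(k))


alpha = [0.2, -1.0, 0.5]
shifted = [a + 2.0 for a in alpha]      # a different alpha, same distribution
cov = cov_T(alpha, 10)
degenerate = quadratic_form(cov, [1.0, 1.0, 1.0])   # Var(T_1 + T_2 + T_3) = Var(n)
```

Here `degenerate` is zero up to rounding because $T_1 + \cdots + T_k = n$ with probability 1, so (iii) fails, the rank is less than $k$, and $\alpha$ is not identifiable, in line with the equivalences of the theorem.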
The relation in (a) is sometimes evident and the $\mu$ parametrization is close to the initial parametrization of classical $\mathcal{P}$. Thus, the $\mathcal{B}(n, \theta)$ family is parametrized by $E(X)$, where $X$ is the Bernoulli trial, the $N(\mu, \sigma_0^2)$ family by $E(X)$. For $\{N(\mu, \sigma^2)\}$, $E(X, X^2) = (\mu, \sigma^2 + \mu^2)$, which is obviously a 1-1 function of $(\mu, \sigma^2)$. However, the relation in (a) may be far from obvious (see Problem 1.6.21). The corollary will prove very important in estimation theory. See Section 2.3. We close the present discussion of exponential families with the following example.

Example 1.6.11. The $p$-Variate Gaussian Family. An important exponential family is based on the multivariate Gaussian distributions of Section B.6. Recall that $\mathbf{Y}_{p \times 1}$ has a $p$-variate Gaussian distribution, $N_p(\mu, \Sigma)$, with mean $\mu_{p \times 1}$ and positive definite variance covariance matrix $\Sigma_{p \times p}$, iff its density is

$$f(\mathbf{Y}, \mu, \Sigma) = |\det(\Sigma)|^{-1/2}(2\pi)^{-p/2}\exp\left\{-\tfrac{1}{2}(\mathbf{Y} - \mu)^T\Sigma^{-1}(\mathbf{Y} - \mu)\right\}. \tag{1.6.16}$$

Rewriting the exponent we obtain

$$\log f(\mathbf{Y}, \mu, \Sigma) = -\tfrac{1}{2}\mathbf{Y}^T\Sigma^{-1}\mathbf{Y} + (\Sigma^{-1}\mu)^T\mathbf{Y} - \tfrac{1}{2}\left(\log|\det(\Sigma)| + \mu^T\Sigma^{-1}\mu\right) - \tfrac{p}{2}\log 2\pi. \tag{1.6.17}$$

The first two terms on the right in (1.6.17) can be rewritten

$$-\left(\sum_{1 \le i < j \le p}\sigma^{ij}Y_iY_j + \tfrac{1}{2}\sum_{i=1}^p\sigma^{ii}Y_i^2\right) + \sum_{i=1}^p\left(\sum_{j=1}^p\sigma^{ij}\mu_j\right)Y_i$$

where $\Sigma^{-1} \equiv \|\sigma^{ij}\|$, revealing that this is a $k = p(p + 3)/2$ parameter exponential family with statistics $(Y_1, \ldots, Y_p, \{Y_iY_j\}_{1 \le i \le j \le p})$, $h(\mathbf{Y}) \equiv 1$, $\theta = (\mu, \Sigma)$, $B(\theta) = \tfrac{1}{2}(\log|\det(\Sigma)| + \mu^T\Sigma^{-1}\mu)$ (up to the constant factor $(2\pi)^{-p/2}$, which may be absorbed into $h$). By our supermodel discussion, if $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ are iid $N_p(\mu, \Sigma)$, then $\mathbf{X} \equiv (\mathbf{Y}_1, \ldots, \mathbf{Y}_n)^T$ follows the $k = p(p + 3)/2$ parameter exponential family with $T = (\sum_i\mathbf{Y}_i, \sum_i\mathbf{Y}_i\mathbf{Y}_i^T)$, where we identify the second element of $T$, which is a $p \times p$ symmetric matrix, with its distinct $p(p + 1)/2$ entries. It may be shown (Problem 1.6.29) that $T$ (and $h \equiv 1$) generate this family and that the rank of the family is indeed $p(p + 3)/2$, generalizing Example 1.6.5, and that $\mathcal{E}$ is open, so that Theorem 1.6.4 applies. $\Box$

1.6.5 Conjugate Families of Prior Distributions

In Section 1.2 we considered beta prior distributions for the probability of success in $n$ Bernoulli trials. This is a special case of conjugate families of priors, families to which the posterior after sampling also belongs.

Suppose $X_1, \ldots, X_n$ is a sample from the $k$-parameter exponential family (1.6.10), and, as we always do in the Bayesian context, write $p(\mathbf{x} \mid \theta)$ for $p(\mathbf{x}, \theta)$. Then

$$p(\mathbf{x} \mid \theta) = \left[\prod_{i=1}^n h(x_i)\right]\exp\left\{\sum_{j=1}^k\eta_j(\theta)\sum_{i=1}^nT_j(x_i) - nB(\theta)\right\}. \tag{1.6.18}$$
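Both the statistic $T = (\sum_i\mathbf{Y}_i, \sum_i\mathbf{Y}_i\mathbf{Y}_i^T)$ of Example 1.6.11 and the sums $\sum_i T_j(x_i)$ appearing in (1.6.18) are accumulated in a single pass over the data. The following is a minimal sketch of that computation for the $p$-variate Gaussian case; the function names and the flattening order of the symmetric matrix are assumptions made for illustration only.

```python
def sufficient_statistic(sample):
    """Return (sum_i Y_i, sum_i Y_i Y_i^T) for a list of p-dimensional points."""
    p = len(sample[0])
    first = [0.0] * p
    second = [[0.0] * p for _ in range(p)]
    for y in sample:
        for i in range(p):
            first[i] += y[i]
            for j in range(p):
                second[i][j] += y[i] * y[j]
    return first, second


def flatten(first, second):
    """Keep the p linear sums and the p(p+1)/2 distinct entries (i <= j)."""
    p = len(first)
    upper = [second[i][j] for i in range(p) for j in range(i, p)]
    return first + upper


sample = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]   # n = 3 points in R^2
first, second = sufficient_statistic(sample)
t = flatten(first, second)                       # length p(p+3)/2 = 5 for p = 2
```

The flattened vector has $p + p(p + 1)/2 = p(p + 3)/2$ coordinates whatever the sample size $n$, which is the dimension count used in the example.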
In (1.6.18), $\theta \in \Theta$, which is $k$-dimensional. A conjugate exponential family is obtained from (1.6.18) by letting $n$ and $t_j = \sum_{i=1}^nT_j(x_i)$, $j = 1, \ldots, k$, be "parameters" and treating $\theta$ as the variable of interest. That is, let $\mathbf{t} = (t_1, \ldots, t_{k+1})^T$ and

$$\omega(\mathbf{t}) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\exp\left\{\sum_{j=1}^kt_j\eta_j(\theta) - t_{k+1}B(\theta)\right\}d\theta_1\cdots d\theta_k \tag{1.6.19}$$

$$\Omega = \{(t_1, \ldots, t_{k+1}) : 0 < \omega(t_1, \ldots, t_{k+1}) < \infty\}$$

with integrals replaced by sums in the discrete case. We assume that $\Omega$ is nonempty (see Problem 1.6.36), then

Proposition 1.6.1. The $(k + 1)$-parameter exponential family given by

$$\pi_{\mathbf{t}}(\theta) = \exp\left\{\sum_{j=1}^k\eta_j(\theta)t_j - t_{k+1}B(\theta) - \log\omega(\mathbf{t})\right\} \tag{1.6.20}$$

where $\mathbf{t} = (t_1, \ldots, t_{k+1}) \in \Omega$, is a conjugate prior to $p(\mathbf{x} \mid \theta)$ given by (1.6.18).

Proof. If $p(\mathbf{x} \mid \theta)$ is given by (1.6.18) and $\pi$ by (1.6.20), then

$$\pi(\theta \mid \mathbf{x}) \propto p(\mathbf{x} \mid \theta)\pi_{\mathbf{t}}(\theta) \propto \exp\left\{\sum_{j=1}^k\eta_j(\theta)\left(\sum_{i=1}^nT_j(x_i) + t_j\right) - (t_{k+1} + n)B(\theta)\right\} \propto \pi_{\mathbf{s}}(\theta), \tag{1.6.21}$$

where

$$\mathbf{s} = (s_1, \ldots, s_{k+1})^T = \left(t_1 + \sum_{i=1}^nT_1(x_i), \ldots, t_k + \sum_{i=1}^nT_k(x_i), t_{k+1} + n\right)^T$$

and $\propto$ indicates that the two sides are proportional functions of $\theta$. Because two probability densities that are proportional must be equal, $\pi(\theta \mid \mathbf{x})$ is the member of the exponential family (1.6.20) given by the last expression in (1.6.21) and our assertion follows. $\Box$

Remark 1.6.1. Note that (1.6.21) is an updating formula in the sense that as data $x_1, \ldots, x_n$ become available, the parameter $\mathbf{t}$ of the prior distribution is updated to $\mathbf{s} = (\mathbf{t} + \mathbf{a})$, where $\mathbf{a} = \left(\sum_{i=1}^nT_1(x_i), \ldots, \sum_{i=1}^nT_k(x_i), n\right)^T$. $\Box$

It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way.

Example 1.6.12. Suppose $X_1, \ldots, X_n$ is a $N(\theta, \sigma_0^2)$ sample, where $\sigma_0^2$ is known and $\theta$ is unknown. To choose a prior distribution for $\theta$, we consider the conjugate family of the model defined by (1.6.20). For $n = 1$

$$p(x \mid \theta) \propto \exp\left\{\frac{\theta x}{\sigma_0^2} - \frac{\theta^2}{2\sigma_0^2}\right\}. \tag{1.6.22}$$
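This is a one-parameter exponential family with

$$T_1(x) = x, \qquad \eta_1(\theta) = \frac{\theta}{\sigma_0^2}, \qquad B(\theta) = \frac{\theta^2}{2\sigma_0^2}.$$

The conjugate two-parameter exponential family given by (1.6.20) has density

$$\pi_{\mathbf{t}}(\theta) = \exp\left\{\frac{\theta}{\sigma_0^2}t_1 - \frac{\theta^2}{2\sigma_0^2}t_2 - \log\omega(t_1, t_2)\right\}. \tag{1.6.23}$$

Upon completing the square, we obtain

$$\pi_{\mathbf{t}}(\theta) \propto \exp\left\{-\frac{t_2}{2\sigma_0^2}\left(\theta - \frac{t_1}{t_2}\right)^2\right\}. \tag{1.6.24}$$

Thus, $\pi_{\mathbf{t}}(\theta)$ is defined only for $t_2 > 0$ and all $t_1$ and is the $N(t_1/t_2, \sigma_0^2/t_2)$ density. Our conjugate family, therefore, consists of all $N(\eta_0, \tau_0^2)$ distributions where $\eta_0$ varies freely and $\tau_0^2$ is positive.

If we start with a $N(\eta_0, \tau_0^2)$ prior density, we must have in the $(t_1, t_2)$ parametrization

$$t_2 = \frac{\sigma_0^2}{\tau_0^2}, \qquad t_1 = \frac{\eta_0\sigma_0^2}{\tau_0^2}. \tag{1.6.25}$$

By (1.6.21), if we observe $\sum X_i = s$, the posterior has a density (1.6.23) with

$$t_2(n) = \frac{\sigma_0^2}{\tau_0^2} + n, \qquad t_1(s) = \frac{\eta_0\sigma_0^2}{\tau_0^2} + s.$$

Using (1.6.24), we find that $\pi(\theta \mid \mathbf{x})$ is a normal density with mean

$$\mu(s, n) = \frac{t_1(s)}{t_2(n)} = \left(\frac{\sigma_0^2}{\tau_0^2} + n\right)^{-1}\left[s + \frac{\eta_0\sigma_0^2}{\tau_0^2}\right] \tag{1.6.26}$$

and variance

$$\tau_0^2(n) = \frac{\sigma_0^2}{t_2(n)} = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}. \tag{1.6.27}$$

Note that we can rewrite (1.6.26) intuitively as

$$\mu(s, n) = w_1\eta_0 + w_2\bar{x} \tag{1.6.28}$$

where $w_1 = \tau_0^2(n)/\tau_0^2$, $w_2 = n\tau_0^2(n)/\sigma_0^2$ so that $w_2 = 1 - w_1$. $\Box$

These formulae can be generalized to the case $\mathbf{X}_i$ i.i.d. $N_p(\theta, \Sigma_0)$, $1 \le i \le n$, $\Sigma_0$ known, $\theta \sim N_p(\eta_0, \tau_0^2I)$ where $\eta_0$ varies over $R^p$, $\tau_0^2$ is scalar with $\tau_0 > 0$ and $I$ is the $p \times p$ identity matrix (Problem 1.6.37). Moreover, it can be shown (Problem 1.6.30) that the $N_p(\lambda, \Gamma)$ family with $\lambda \in R^p$ and $\Gamma$ symmetric positive definite is a conjugate family to $N_p(\theta, \Sigma_0)$.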
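The updating formulae (1.6.25) to (1.6.28) of Example 1.6.12 translate directly into a few lines of arithmetic. The sketch below is an illustration only; the function name, argument names, and the numbers in the usage lines are hypothetical. It maps the $N(\eta_0, \tau_0^2)$ prior to $(t_1, t_2)$, applies the update of (1.6.21), and returns the posterior mean, the posterior variance, and the weights $w_1$, $w_2$.

```python
def normal_posterior(xs, sigma0_sq, eta0, tau0_sq):
    """Posterior of theta for a N(theta, sigma0_sq) sample and a N(eta0, tau0_sq) prior."""
    n = len(xs)
    s = sum(xs)
    # Prior in the (t1, t2) parametrization, (1.6.25), updated by (1.6.21).
    t1 = eta0 * sigma0_sq / tau0_sq + s
    t2 = sigma0_sq / tau0_sq + n
    mean = t1 / t2                      # (1.6.26)
    var = sigma0_sq / t2                # (1.6.27)
    w1 = var / tau0_sq                  # weight on the prior mean in (1.6.28)
    w2 = n * var / sigma0_sq            # weight on the sample mean in (1.6.28)
    return mean, var, w1, w2


# All three observations at once.
mean, var, w1, w2 = normal_posterior([1.0, 2.0, 3.0], sigma0_sq=4.0, eta0=0.0, tau0_sq=1.0)

# Sequential use: the posterior after two observations serves as the prior for the third.
m2, v2, _, _ = normal_posterior([1.0, 2.0], sigma0_sq=4.0, eta0=0.0, tau0_sq=1.0)
m3, v3, _, _ = normal_posterior([3.0], sigma0_sq=4.0, eta0=m2, tau0_sq=v2)
```

Processing the data in one batch or one observation at a time gives the same posterior, which is the updating interpretation of Remark 1.6.1.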
The $N_p(\lambda, \Gamma)$ family is, however, richer than the one defined in (1.6.20) except for $p = 1$, because $N_p(\lambda, \Gamma)$ is a $p(p + 3)/2$ rather than a $p + 1$ parameter family. In fact, the conditions of Proposition 1.6.1 are often too restrictive. In the one-dimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. It is easy to see that one can construct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1.6.20). See Problems 1.6.31 and 1.6.32.

Discussion

Note that the uniform $U(\{1, 2, \ldots, \theta\})$ model of Example 1.5.3 is not covered by this theory. The natural sufficient statistic $\max(X_1, \ldots, X_n)$, which is one-dimensional whatever be the sample size, is not of the form $\sum_{i=1}^nT(X_i)$. In fact, the family of distributions in this example and the family $U(0, \theta)$ are not exponential. Despite the existence of classes of examples such as these, starting with Koopman, Pitman, and Darmois, a theory has been built up that indicates that under suitable regularity conditions families of distributions, which admit $k$-dimensional sufficient statistics for all sample sizes, must be $k$-parameter exponential families. Some interesting results and a survey of the literature may be found in Brown (1986). Problem 1.6.10 is a special result of this type.

Summary. $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R^k$, is a $k$-parameter exponential family of distributions if there are real-valued functions $\eta_1, \ldots, \eta_k$ and $B$ on $\Theta$, and real-valued functions $T_1, \ldots, T_k$, $h$ on $R^q$ such that the density (frequency) function of $P_\theta$ can be written as

$$p(x, \theta) = h(x)\exp\left[\sum_{j=1}^k\eta_j(\theta)T_j(x) - B(\theta)\right], \qquad x \in \mathcal{X} \subset R^q. \tag{1.6.29}$$

$T(X) = (T_1(X), \ldots, T_k(X))$ is called the natural sufficient statistic of the family. The canonical $k$-parameter exponential family generated by $T$ and $h$ is

$$q(x, \eta) = h(x)\exp\{T^T(x)\eta - A(\eta)\}$$

where

$$A(\eta) = \log\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}h(x)\exp\{T^T(x)\eta\}\,dx$$

in the continuous case, with integrals replaced by sums in the discrete case. The set $\mathcal{E} = \{\eta \in R^k : -\infty < A(\eta) < \infty\}$ is called the natural parameter space. The set $\mathcal{E}$ is convex, the map $A: \mathcal{E} \to R$ is convex. If $\mathcal{E}$ has a nonempty interior in $R^k$ and $\eta_0 \in \mathcal{E}$, then $T(X)$ has for $X \sim P_{\eta_0}$ the moment-generating function

$$\psi(s) = \exp\{A(\eta_0 + s) - A(\eta_0)\}$$

for all $s$ such that $\eta_0 + s$ is in $\mathcal{E}$. Moreover $E_{\eta_0}[T(X)] = \dot{A}(\eta_0)$ and $\mathrm{Var}_{\eta_0}[T(X)] = \ddot{A}(\eta_0)$ where $\dot{A}$ and $\ddot{A}$ denote the gradient and Hessian of $A$.
An exponential family is said to be of rank $k$ if $T$ is $k$-dimensional and $1, T_1, \ldots, T_k$ are linearly independent with positive $P_\theta$ probability for some $\theta \in \Theta$. If $\mathcal{P}$ is a canonical exponential family with $\mathcal{E}$ open, then the following are equivalent:

(i) $\mathcal{P}$ is of rank $k$,

(ii) $\eta$ is identifiable,

(iii) $\mathrm{Var}_\eta(T)$ is positive definite,

(iv) the map $\eta \to \dot{A}(\eta)$ is 1-1 on $\mathcal{E}$,

(v) $A$ is strictly convex on $\mathcal{E}$.

A family $\mathcal{F}$ of prior distributions for a parameter vector $\theta$ is called a conjugate family of priors to $p(\mathbf{x} \mid \theta)$ if the posterior distribution of $\theta$ given $\mathbf{x}$ is a member of $\mathcal{F}$. The $(k + 1)$-parameter exponential family

$$\pi_{\mathbf{t}}(\theta) = \exp\left\{\sum_{j=1}^k\eta_j(\theta)t_j - B(\theta)t_{k+1} - \log\omega\right\}$$

where

$$\omega = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\exp\left\{\sum_{j=1}^k\eta_j(\theta)t_j - B(\theta)t_{k+1}\right\}d\theta,$$

and

$$\mathbf{t} = (t_1, \ldots, t_{k+1}) \in \Omega = \{(t_1, \ldots, t_{k+1}) \in R^{k+1} : 0 < \omega < \infty\},$$

is conjugate to the exponential family $p(\mathbf{x} \mid \theta)$ defined in (1.6.29).

1.7 PROBLEMS AND COMPLEMENTS

Problems for Section 1.1

1. Give a formal statement of the following models identifying the probability laws of the data and the parameter space. State whether the model in question is parametric or nonparametric.

(a) A geologist measures the diameters of a large number $n$ of pebbles in an old stream bed. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean $\mu$ and variance $\sigma^2$. He wishes to use his observations to obtain some information about $\mu$ and $\sigma^2$ but has in advance no knowledge of the magnitudes of the two parameters.

(b) A measuring instrument is being used to obtain $n$ independent determinations of a physical constant $\mu$. Suppose that the measuring instrument is known to be biased to the positive side by 0.1 units. Assume that the errors are otherwise identically distributed normal random variables with known variance.
(c) In part (b) suppose that the amount of bias is positive but unknown. Can you perceive any difficulties in making statements about $\mu$ for this model?

(d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean $\lambda$. Once laid, each egg has an unknown chance $p$ of hatching and the hatching of one egg is independent of the hatching of the others. An entomologist studies a set of $n$ such insects observing both the number of eggs laid and the number of eggs hatching for each nest.

2. Are the following parametrizations identifiable? (Prove or disprove.)

(a) The parametrization of Problem 1.1.1(c).

(b) The parametrization of Problem 1.1.1(d).

(c) The parametrization of Problem 1.1.1(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case.

3. Which of the following parametrizations are identifiable? (Prove or disprove.)

(a) $X_1, \ldots, X_p$ are independent with $X_i \sim N(\alpha_i + \nu, \sigma^2)$. $\theta = (\alpha_1, \ldots, \alpha_p, \nu, \sigma^2)$ and $P_\theta$ is the distribution of $\mathbf{X} = (X_1, \ldots, X_p)$.

(b) Same as (a) with $\alpha = (\alpha_1, \ldots, \alpha_p)$ restricted to $\{(a_1, \ldots, a_p) : \sum_{i=1}^p a_i = 0\}$.

(c) $X$ and $Y$ are independent $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, $\theta = (\mu_1, \mu_2)$ and we observe $Y - X$.

(d) $X_{ij}$, $i = 1, \ldots, p$; $j = 1, \ldots, b$ are independent with $X_{ij} \sim N(\mu_{ij}, \sigma^2)$ where $\mu_{ij} = \nu + \alpha_i + \lambda_j$. $\theta = (\alpha_1, \ldots, \alpha_p, \lambda_1, \ldots, \lambda_b, \nu, \sigma^2)$ and $P_\theta$ is the distribution of $X_{11}, \ldots, X_{pb}$.

(e) Same as (d) with $(\alpha_1, \ldots, \alpha_p)$ and $(\lambda_1, \ldots, \lambda_b)$ restricted to the sets where $\sum_{i=1}^p\alpha_i = 0$ and $\sum_{j=1}^b\lambda_j = 0$.

4. (a) Let $U$ be any random variable and $V$ be any other nonnegative random variable. Show that

$$F_{U+V}(t) \le F_U(t) \text{ for every } t.$$

(If $F_X$ and $F_Y$ are distribution functions such that $F_X(t) \le F_Y(t)$ for every $t$, then $X$ is said to be stochastically larger than $Y$.)

(b) As in Problem 1.1.1 describe formally the following model. Two groups of $n_1$ and $n_2$ individuals, respectively, are sampled at random from a very large population. Each
member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour. Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour. It is known that the drug either has no effect or lowers blood pressure, but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown.

5. The number $n$ of graduate students entering a certain department is recorded. In each of $k$ subsequent years the number of students graduating and of students dropping out is recorded. Let $N_i$ be the number dropping out and $M_i$ the number graduating during year $i$, $i = 1, \ldots, k$. The following model is proposed.

$$P_\theta[N_1 = n_1, M_1 = m_1, \ldots, N_k = n_k, M_k = m_k] = \frac{n!}{n_1!\cdots n_k!\,m_1!\cdots m_k!\,r!}\mu_1^{n_1}\cdots\mu_k^{n_k}\nu_1^{m_1}\cdots\nu_k^{m_k}\rho^r$$

where

$$\mu_1 + \cdots + \mu_k + \nu_1 + \cdots + \nu_k + \rho = 1, \quad 0 < \mu_i < 1, \quad 0 < \nu_i < 1, \quad 1 \le i \le k,$$

$$n_1 + \cdots + n_k + m_1 + \cdots + m_k + r = n$$

and $\theta = (\mu_1, \ldots, \mu_k, \nu_1, \ldots, \nu_k)$ is unknown.

(a) What are the assumptions underlying this model?

(b) $\theta$ is very difficult to estimate here if $k$ is large. The simplification $\mu_i = \pi(1 - \mu)^{i-1}\mu$, $\nu_i = (1 - \pi)(1 - \nu)^{i-1}\nu$ for $i = 1, \ldots, k$ is proposed where $0 < \pi < 1$, $0 < \mu < 1$, $0 < \nu < 1$ are unknown. What assumptions underlie the simplification?

6. Which of the following models are regular? (Prove or disprove.)

(a) $P_\theta$ is the distribution of $X$ when $X$ is uniform on $(0, \theta)$, $\Theta = (0, \infty)$.

(b) $P_\theta$ is the distribution of $X$ when $X$ is uniform on $\{0, 1, 2, \ldots, \theta\}$, $\Theta = \{1, 2, \ldots\}$.

(c) Suppose $X \sim N(\mu, \sigma^2)$. Let $Y = 1$ if $X \le 1$ and $Y = X$ if $X > 1$. $\theta = (\mu, \sigma^2)$ and $P_\theta$ is the distribution of $Y$.

(d) Suppose the possible control responses in an experiment are $0, 0.1, \ldots, 0.9$ and they occur with frequencies $p(0), p(0.1), \ldots, p(0.9)$. Suppose the effect of a treatment is to increase the control response by a fixed amount $\theta$. Let $P_\theta$ be the distribution of a treatment response.

7. Show that $Y - c$ has the same distribution as $-Y + c$, if and only if, the density or frequency function $p$ of $Y$ satisfies $p(c + t) = p(c - t)$ for all $t$. Both $Y$ and $p$ are said to be symmetric about $c$.

Hint: If $Y - c$ has the same distribution as $-Y + c$, then $P(Y \le t + c) = P(-Y \le t - c) = P(Y \ge c - t) = 1 - P(Y < c - t)$.

8. Consider the two sample models of Examples 1.1.3(2) and 1.1.4(1).
j = 0. and (I'. o (a) Show that in this case. then C(·) ~ F(.tl and o(x) ~ 21' + tl. = 9. eX o. C). the two cases o(x) . I} and N is known.tl). > 0. .Xn are observed i. Let c > 0 be a constant. Yn denote the survival times of two groups of patients receiving treatments A and B. .3 let Xl...Section 1. e(j). ci '" N(O.. (]"2) independent. .6.2x yield the same distribution for the data (Xl... e) : p(j) > 0.. fJp) are not identifiable if n pardffieters is larger than the number of observations.N = 0.. . . N} are identifiable. . are not collinear (linearly independent).1. . . Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci. if the number of = (min(T.. C(·) = F(· ."'). then satisfy a scale model with parameter e. i .tl) does not imply the constant treatment effect assumption. p(j). X m and Yi.Xn ).Znj?' (a) Show that ({31. Sbow that {p(j) : j = 0. In Example 1.tl) implies that o(x) tl. suppose X has a distribution F that is not necessarily normal.. t > O.'" .. N}. .zp. e(j) > 0. 10. Sx(t) = . 0 < j < N. . 0 'Lf 'Lf op(j) = I. . .. r(j) = = PlY = j. eY and (c) Suppose a scale model holds for X. log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl. that is. . . 11. Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0. (b) Deduce that (th. according to the distribution of X.{3p) are identifiable iff Zl. I(T < C)) = j] PIC = j] PIT where T. . That is. Therefore. Suppose X I. . Let X < 1'. Hint: Consider "hazard rates" for Y min(T.. 1 < i < n. The Scale Model.. Does X' Y' = yc satisfy a scale model? Does log X'. {r(j) : j ~ 0.. .. C)... . N. Show that if we assume that o(x) + x is strictly increasing. Y = T I Y > j].d.i. . . = (Zlj. Y. 12. C(t} = F(tjo). The Lelunann 1\voSample Model."') and o(x) is continuous..7 Problems and Complements 69 (a) Show that if Y ~ X + o(X). e) vary freely over F ~ {(I'. log y' satisfy a shift model? 
= Xc.2x and X ~ N(J1. o(x) ~ 21' + tl . or equivalently."" Yn )... . (b) In part (a). (Yl. then C(·) = F(. For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1. C are independent.
I (b) Assume (1.7. Sy(t) = Sg. . So (T) has a U(O. hy(t) = c. A proportional hazard model. For treatments A and E. .) Show that (1. Goals.7.l3. Moreover.i. Then h. .).1) Equivalently. respectively..7.(t). where TI"...{a(t). are called the survival functions.. 7..2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 . I) distribution. The most common choice of 9 is the linear form g({3. t > 0.. " T k are unobservable and i. t > 0. h(t I Zi) = f(t I Zi)jSy(t I Zi). Specify the distribution of €i.  . . Show that hy(t) = c. I' 70 StCitistical Models.(t) if and only if Sy(t) = Sg. . I (3p)T of unknowns.G(t). (b) By extending (bjn) from the rationals to 5 E (0. Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t). Hint: See Problem 1.h. = = (. .(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y. F(t) and Sy(t) ~ P(Y > t) = 1 .(t). and Performance CriteriCl Chapter 1 t Ii .. 12.z)}.ho(t) is called the Cox proportional hazard model.ll) with scale parameter 5. (1. k = a and b. The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(. Also note that P(X > t) ~ Sort).l3. Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t). then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll... (c) Suppose that T and Y have densities fort) andg(t). Set C.2. thus. 13. z) zT.00). z)) (1. = exp{g(.1.12. ~ a5.l3..2) is equivalent to Sy (t I z) = sf" (t). 1 P(X > t) =1 (.2) and that Fo(t) = P(T < t) is known and strictly increasing. log So(T) has an exponential distribution.I ~I .(t) with C. then Yi* = g({3. Hint: By Problem 8. Show that if So is continuous. 
Zj) + cj for some appropriate €j.d.) Show that Sy(t) = S'. Tk > t all occur. (c) Under the assumptions of (b) above. show that there is an increasing function Q'(t) such that ifYt = Q'(Y. we have the Lehmann model Sy(t) = si.. t > 0. Survival beyond time t is modeled to occur if the events T} > t. as T with survival function So.
the median is preferred in this case. the parameter of interest can be characterl ized as the median v = F. X n are independent with frequency fnnction p(x posterior1r(() I Xl. '1/1' > 0. (a) Suppose Zl' Z. Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x. x> c x<c 0.. Examples are the distribution of income and the distribution of wealth.Section 1.2 unknown.2 1. .. J1 and v are regarded as centers of the distribution F. F.. .L . In Example 1. ) = ~. have a N(O. J1 may be very much pulled in the direction of the longer tail of the density. (b) Suppose X" .d. and Zl and Z{ are independent random variables.6 Let 1f be the prior frequency function of 0 defined by 1f(I1. Yl .X n ). Let Xl. 1f(O2 ) = ~. Ca) Find the posterior frequency function 1f(0 I x). 'I/1(±oo) = ±oo.1..Xm be i. what parameters are? Hint: Ca) PIXI < tl = if>(..7 Problems and Complements 71 Hint: See Problems 1. G. Find the . (b) Suppose Zl and Z. When F is not symmetric. .2 with assumptions (t){4).p(t)).11 and 1.000 is the minimum monthly salary for state workers.d. Problems for Section 1.1.. and suppose that for given ().2 0.l (0. 15.i. and for this reason.. Show that both . Show how to choose () to make J.v arbitrarily large. wbere the model {(F.5) or mean I' = xdF(x) = Fl(u)du. still identifiable? If not. Yn be i.4 1 0.. an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0.12. Find the median v and the mean J1 for the values of () where the mean exists.. Are.8 0." . . . Observe that it depends only on l:~ I Xi· I 0).2) distribution with . Here is an example in which the mean is extreme and the median is not.G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R. have a N(O.1. 1) distribution.O) 1 . Merging Opinions. 14.i. Generally. Consider a parameter space consisting of two points fh and ()2. 
where () > 0 and c = 2.p and C. .p and ~ are identifi· able.(x/c)e.
O)kO. Unllormon {'13} . what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta. 1. Suppose that for given 8 = 0.0 < 0 < 1.25. Compare these B's for n = 2 and 100. .5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. 71"1 (0 2 ) = . and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 ... }. X n be distributed as where XI. < 0 < 1. Assume X I'V p(x) = l:. . (3) Find the posterior density of 8 when 71"( 0) ~ 1. 1r(J)= 'u . . .. . 0 (c) Find E(81 x) for the two priors in (a) and (b). . ':. .. 7l" or 11"1. Find the posteriordensityof8 given Xl = Xl. (b) Find the posterior density of 8 when 71"( OJ = 302 . where a> 1 and c(a) .'" 1rajI Show that J + a. Consider an experiment in which. .2. 7r (d) Give the values of P(B n = 2 and 100. is used in the fonnula for p(x)? 2. IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree. This is called the geometric distribution (9(0)). c(a).0 < B < 1. 4.Xn are natural numbers between 1 and Band e= {I. . X n are independent with the same distribution as X. 2. . "X n ) = c(n 'n+a m) ' J=m. ) ~ .2.8). Let 71" denote a prior density for 8.75. (3) Suppose 8 has prior frequency. .'" . Then P. l3(r. does it matter which prior. 'IT . Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O.... 3. For this convergence. 0 < x < O..)=1.. (J. . 3. (d) Suppose Xl.. k = 0.[X ~ k] ~ (1. Let X I. I XI . ~ = (h I L~l Xi ~ = . " J = [2:.. for given (J = B. Goals. ... Show that the probability of this set tends to zero as n t 00. " ••.=l1r(B i )p(x I 8i ).. X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is .72 Statistical Models. the outcome X has density p(x OJ ~ (2x/O').m+I. 
4'2'4 (b) Relative to (a).Xn = X n when 1r(B) = 1.
I Xn+k) given .1= n+r+s _ . then the posterior distribution of D given X = k is that of k + Z where Z has a B(N .b> 1.. Interpret this result. then !3( a. } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is .. Show in Example 1. If a and b are integers.··.c(b."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl.2. . ZI.7 Problems . Show that the conditional distribution of (6. . .1. . Xn) + 5. W.. Justify the following approximation to the posterior distribution where q. .'" . . ... (a) Show that the family of priors E. X n+ k ) be a sample from a population with density f(x I 0).• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e.."'tF·t'.and Complements 73 wherern = max(xl. D = NO has a B(N. S..8) that if in Example 1. In Example 1..0 E Let 9 have prior density 1r.(1. 11'0) distribution. X n+ Il . a regular model and integrable as a function of O. 11'0) distribution. . Suppose Xl. n." . Show rigorously using (1. 6. Let (XI.1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta. Va' W t . .• X n = Xn. Next use the central limit theorem and Slutsky's theorem. X n = Xn is that of (Y.1.. Assume that A = {x : p( x I 8) > O} does not involve O. XI = XI... 9. are independent standard exponential. f(x I f). Show that a conjugate family of distributions for the Poisson family is the gamma family. b) is the distribution of (aV loW)[1 + (aV IbW)]t.. where VI. ..n.) = n+r+s Hint: Let!3( a. Xn) = Xl = m for all 1 as n + 00 whatever be a. . where E~ 1 Xi = k.J:n).2..2. .. f3(r..2..f) = [~.. 10. b) denote the posterior distribution.Xn is a sample with Xi '"'' p(x I (}). . Show that ?T( m I Xl. s). 7. is the standard normal distribution function and n _ T 2 x + n+r+s' a J.. 
where {i E A and N E {l..1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •..Section 1. (b) Suppose that max(xl.
Q'j > 0. 1 X n .. then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I. .. has density r("" a) IT • fa(u) ~ n' r(. N. (. D(a). given (x. (N. vB has a XX distribution. •• • .) u/' 0 < cx) L". N(I". a = (a1.) pix I 0) ~ ({" . x n ).2) and we formally put 7t(I". Q'r)T."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x. I(x I (J).. 'V . Letp(x I OJ =exp{(xO)}.. Note that. . . . . X n . X n are i. The posterior predictive distribution is the conditional distribution of X n +l given Xl. .~l .J u.\ > 0. Let N = (N l . Xl . Suppose p(x I 0) is the density of i. ... 01 )=1 j=l Uj < 1... 82 I fL. . L. . .. T6) densities. 0 > O. .9). Goals.J t • I X. The Dirichlet distribution is a conjugate prior for the multinomial. (a) If f and 7t are the N(O. 14. posterior predictive distribution.2 is (called) the precision of the distribution of Xi.. (12)..d.. where Xi known. < 1. . O(t + v) has a X~+n distribution. Show that if Xl.. i)... 52) of J1. Here HinJ: Given I" and ".O. In a Bayesian model whereX l . LO.Xn. Find I 12. .6 ""' 1r..Xn + l arei. 11. given x. .i. compute the predictive and n + 00. unconditionally.. 9= (OJ... and Performance Criteria Chapter 1 + nand ({. 0> 0 a otherwise..)T. . v > 0. . the predictive distribution is the marginal distribution of X n + l .. .. (b) Discuss the behavior of the two predictive distributions as 15. (c) Find the posterior distribution of 0". po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { . Next use Bayes rule. X and 5' are independent with X ~ N(I". (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8.74 where N' ~ N Statistical Models.~vO}.d.i. 0<0.) be multinomial M(n. 0 > O. j=1 '\' • = 1.. .") ~ . "5) and N(Oo. x> 0.d. I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}. 8 2 ) is such that vn(p. The Dirichlet distribution. <0<x and let7t(O) = 2exp{20}.~X) '" tnI.O the posterior density 7t( 0 Ix). j=1 • .1 < j < T./If (Ito. . 13.i. . X 1. and () = a. = 1.. . 
Find the posterior distribution 7f(() I x) and show that if>' is an integer.
1.. Suppose that in Example 1. 1C(O. 0..5.3.5 and (ii) l' = 0.. 1.. The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O. (b) Find the minimax rule among {b 1 . .1.5. and the loss function l((J. 1C(02) l' = 0. Let the actions corresponding to deciding whether the loss function is given by (from Lehmann. or 0 > a be penoted by I.p) (I .3.3 IN ~ n) is V( a + n). (c) Find the minimax rule among bt.. Find the Bayes rule when (i) 3.7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a). then the posteriof7r(0 where n = (nt l · · · . (J2.q) and let when at.3. the possible actions are al. ..1. n r ).159 be the decision rules of Table 1. a2. a) is given by 0. .. Problems for Section 1. respectively and suppose . . (b)p=lq=. I. = 11'. (d) Suppose that 0 has prior 1C(OIl (a).. 159 for the preceding case (a). a3.5.Section 1. (d) Suppose 0 has prior 1C(01) ~ 1'. = 0. Find the Bayes rule for case 2.3. (c) Find the minimax rule among the randomized rules.3. See Example 1.) = 0. Suppose the possible states of nature are (Jl. 159 of Table 1. . 8 = 0 or 8 > 0 for some parameter 8. I b"9 }. (J) given by O\x a (I . a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St. O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x. . .3.1. 1957) °< °= a 0. Compute and plot the risk points (a)p=q= .
andastratumsamplemeanXj.. random variables Xlj"". We want to estimate the mean J1. Show that the strata sample sizes that minimize MSE(!i2) are given by (1.j = 1. 1 < j < S. where <f> = 1 <I>.i.s.0)). I I I.0)) +b<f>( y'n(s ... Stratified sampling. i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4. I . < 00. Within the jth stratum we have a sample of i. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e. . Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0. How should nj. iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj.) c<f>( y'n(r . (. geographic locations or age groups).. 1) distribution function. Goals. variances. and <I> is the N(O. J". are known. b<f>(y'ns) + b<I>(y'nr).I L LXij. E. 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s. 0 be chosen to make iiI unbiased? I (b) Neyman allocation. (b) Plot the risk function when b = c = 1. 1 (l)r=s=l. < J' < s.7.g. 1 < j < S.3) .Xnjj..8. 0<0 0=0 0 >0 R(O.d.j = I. c<I>(y'n(sO))+b<I>(y'n(rO)). (a) Compute the biases. and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J". Suppose X is aN(B. .=l iii = N. . 11 .(X) 0 b b+c 0 c b+c b 0 where b and c are positive.. .) r=2s=1. n = 1 and . are known (estimates will be used in a latet chapter). Assume that 0 < a. Weassurne that the s samples from different strata are independent.76 StatistiCCII Models.. and MSEs of iiI and 'j1z.
15. > 0.) with nk given by (1. Suppose that n is odd. (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i. (d) Find EIX . b = '.p).2. respectively.bl.2p. Use a numerical integration package.2. where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~.bl for the situation in (i). that X is the median of the sample. Let X and X denote the sample mean and median.i. Each value has probability . .0 . p = . . .13..2.bl when n = compare it to EIX . ~ ~ ~ . We want to estimate "the" median lJ of F.5. Hint: Use Problem 1.'. plot RR for p = . N(o.3.. ~ 7. . and that n is odd. X n be a sample from a population with values 0.3.7 Problems and Complements 77 Hint: You may use a Lagrange multiplier..0 + '.5. Next note that the distribution of X involves Bernoulli and multinomial trials. Hint: By Problem 1.. X n are i. The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = . ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c. 5.. P(X ~ b) = 1 .. .b.. ~ ~  ~ 6.20.. .4.25. (b) Evaluate RR when n = 1. U(O.5(0 + I) and S ~ B(n.. and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii)..3. (iii) F is normal. ~ a) = P(X = c) = P. Hint: See Problem B. I). show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift).=1 pp7J" aV.b.7. .1.0 ~ + 2'. .5. Let XI. . < ° P < 1. .9...= 2:. l X n ..2'.45.Section 1. (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X).3) is N.. (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0. Suppose that X I.2. ali X '"'' F. 1).40. ° = 15. a = . Also find EIX . n ~ 1.I 2:.0.5. 75.3.=1 Pj(O'j where a. . and = 1.d. Let Xb and X b denote the sample mean and the sample median of the sample XI b. > k) (ii) F is uniform.5. (c) Same as (b) except when n ~ 1.. Hint: See Problem B. set () = 0 without loss of generality. [f the parameters of interest are the population mean and median of Xi .
A person in charge of ordering equipment needs to estimate and uses I . . I . 8. (b) Suppose Xi _ N(I'. then expand (Xi .a)2.(1(6'.2 ~~ I (Xi . (a)r = 8 =1 (b)r= ~s =1. I I 78 Statistical Models..3.1)1 E~ 1 (Xi . (i) Show that MSE(S2) (ii) Let 0'6 c~ .3) that . I .J..4. Find MSE(6) and MSE(P). i1 . 6' E e. .[X . the powerfunction.. Show that the value of c that minimizes M S E(c/. Let () denote the proportion of people working in a company who have a certain characteristic (e. . for all 6' E 8 1. suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i.X)2 has a X~I distribution. It is known that in the state where the company is located.0): 6 E eo}.X)2 keeping the square brackets intact.0(X))) for all 6. (a) Show that s:? = (n . II. Goals. ~ 2(n _ 1)1.X)2. A decision rule 15 is said to be unbiased if E. You may use the fact that E(X i . = c L~ 1 (Xi . 0) (J(6'. ' . .. 10..  . In Problem 1. e 8 ~ (.for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100.0) = E. 0 < a 2 < 00. .. I (b) Show that if we use the 0 1 loss function in testing. then a test function is unbiased in this sense if.3.L)4 = 3(12.1'])2. and Performance Criteria Chapter 1 :! . (a) Show that if 6 is real and 1(6.) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B.X)2 is an unbiased estimator of u 2 • Hint: Write (Xi . If the true 6 is 60 . being lefthanded). .(o(X)). defined by (J(6.(1(6.Xn be a sample from a population with variance a 2 . and only if. Let Xl.2)(.10) + (. .1'] .').3(a) with b = c = 1 and n = I. 10% have the characteristic.0(X))) < E. satisfies > sup{{J(6. 9.8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company. Which one of the rules is the better one from the Bayes point of view? 11. a) = (6 . then this definition coincides with the definition of an unbiased estimate of e.g.X)' = ([Xi ..
) for all B.1. there is a randomized procedure 03 such that R(B.1. . 1) and is independent of X. In Problem 1. o Suppose that U . then. (c) Same as (b) except k = 2.. b.. . Show that J agrees with rp in the sense that. For Example 1.wo to X.3. In Example 1.. Define the nonrandomized test Ju . B = .3. a) is .' and 0 = II' ." (a) Show that the risk is given by (1. by 1 if<p(X) if <p(X) >u < u.3.~ is unbiased 13. (B) = 0 for some event B implies that Po (B) = 0 for all B E El.UfO. Your answer should Mw If n. Furthersuppose that l(Bo. (b) find the minimum relative risk of P.1) and let Ok be the decision rule "reject the shipment iff X > k. Suppose the loss function feB.1'0 I· 17. Ok) as a function of B. and k o ~ 3. Compare 02 and 03' 19.1. 18. 0. 0.Section 1. and then JT. Suppose that the set of decision procedures is finite.7). Consider a decision problem with the possible states of nature 81 and 82 . Po[o(X) ~ 11 ~ 1 . If U = u. find the set of I' where MSE(ji) depend on n.7 Problems and Complements 79 12. 03) = aR(B.) + (1 . wc dccide El. show that if c ::. Consider the following randomized test J: Observe U.4. In Example 1. The interpretation of <p is the following.3. 14. If X ~ x and <p(x) = 0 we decide 90. s = r = I. plot R(O. but if 0 < <p(x) < 1.3.3.a )R(B.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known. procedures.ao) ~ O. A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1. z > 0. (a) find the value of Wo that minimizes M SE(Mw). use the test J u . 0 < u < 1. (h) If N ~ 10. Show that the procedure O(X) au is admissible. we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise. Show that if J 1 and J 2 are two randomized.. Suppose that Po.Polo(X) ~ 01 = Eo(<p(X)). consider the estimator < MSE(X). given 0 < a < 1. and possible actions at and a2. consider the loss function (1. 16.4. 15. if <p(x) = 1. Convexity ofthe risk set.
al a2 0 3 2 I 0.l. Let X be a random variable with probability function p(x ! B) O\x 0.4.4 = 0. 0 0. 7. 6.) = 0. and Performance Criteria Chapter 1 B\a 0. its MSPE. An urn contains four red and four black balls.Is Z of any value in predicting Y? = Ur + U?. Goals. Give an example in which Z can be used to predict Y perfectly.80 Statistical Models. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn.. What is 1.1. (c) Suppose 0 has the prior distribution defined by .2 0. the best linear predictor. and the best zero intercept linear predictor.8 0. U2 be independent standard normal random variables and set Z Y = U I. 3.1 calculate explicitly the best zero intercept linear predictor.9. Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error. 2.6 (a) Compute and plot the risk points of the nonrandomized decision rules. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z).(0.. !.) the Bayes decision rule? Problems for Section 1. 4. (a) Find the best predictor of Y given Z.. In Problem B. The midpoint of the interval of such c is called the conventionally defined median or simply just the median. . Four balls are drawn at random without replacement. Give the minimax rule among the randomized decision rules. (b) Compute the MSPEs of the predictors in (a). Give the minimax rule among the nonrandomized decision rules. 0.7 find the best predictors ofY given X and of X given Y and calculate their MSPEs. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly. 5. In Example 1. !. and the ratio of its MSPE to that of the best and best linear predictors. 
Let U I .4 I 0. (b) Give and plot the risk set S.(0.
and otherwise. Suppose that Z. Let Y have a N(I". z > 0. (a) Show that the relation between Z and Y is weaker than that between Z' and y J . p( z) is nonincreasing for z > c. Show that c is a median of Z. Let Zl and Zz be independent and have exponential distributions with density Ae\z. 9. Let ZI. exhibit a best predictor of Y given Z for mean absolute prediction error. p) distribution and Z".t.Section 1. Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» . (a) Show that E(IY . a 2. Yare measurements on such a variable for a randomly selected father and son. which is symmetric about c and which is unimodal. a 2.  = ElY cl + (c  co){P[Y > co] . If Y and Z are any two random variables. p( c all z. 0.cl) = oQ[lc . 10.2 ) distrihution. 7 2 ) variables independent of each other and of (ZI. Define Z = Z2 and Y = Zl + Z l Z2. Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) . Y'.L. Y) has a bivariate normal distribution. Y') have a N(p" J.7 Problems and Complements 81 Hint: If c ElY . Y = Y' + Y". ICor(Z. Y)I < Ipl. where (ZI .col < eo. Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables.z) for 11. 12. (b) Suppose (Z.el) as a function of c. which is symmetric about c. Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor. Suppose that Z has a density p.PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8. Y" are N(v.1"1/0] where Q(t) = 2['I'(t) + t<I>(t)]. (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13. Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error. Show that if (Z. 
(a) Show that $P[|Z - t| \le s]$ is maximized as a function of $t$ for each $s > 0$ by $t = c$.
. Consider a subject who walks into a clinic today. It will be convenient to rescale the problem by introducing Z = (Zo . Show that. .1] . Let S be the unknown date in the past when the sUbject was infected. €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18. in the linear model of Remark 1. (c) Let '£ = Y . that is. y> O. where piy is the population multiple correlation coefficient of Remark 1./L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y). hat1s./L£(Z)) ~ maxgEL Corr'(Y. and is diagnosed with a certain disease. at time t. ~~y = 1 . IS. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z. I. 2 'T ' .) = Var( Z.. Yo > O. 82 Statisflcal Models.S from infection until detection. . Let /L(z) = E(Y I Z ~ z).exp{ >. Predicting the past from the present. > 0.4. (See Pearson. Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo. Hint: Recall that E( Z. Show that Var(/L(Z))/Var(Y) = Corr'(Y.g(Z)) where £ is the set of 17. and Performance Criteria Chapter 1 . a blood cell or viral load measurement) is obtained.4. = Corr'(Y. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y. j3 > 0. on estimation of 1]~y.y}. then'fJh(Z)Y =1JZY./LdZ) be the linear prediction error. . 1995.) (a) Show that 1]~y > piy. 1905. . Goals.I • ! .4.4. . .: (0 The best linear MSPE predictor of Y based on Z = z./L)/<J and Y ~ /3Yo/<J. Hint: See Problem 1. 16. At the same time t a diagnostic indicator Zo of the severity of the disease (e. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1.(y) = >.) = E( Z. mvanant un d ersuchh. 1). g(Z)) 9 where g(Z) stands for any predictor. We are interested in the time Yo = t . and Doksurn and Samarov.. >. and Var( Z.) ~ 1/>''. and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease.3.g.) ~ 1/>.15. 
Show that p~y linear predictors./L(Z)) = max Corr'(Y. (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y.ElY . IS.
Let Y be the number of heads showing when X fair coins are tossed.>. (b) Show that (a) is equivalent to (1. density. solving for (a. Hint: Use Bayes rule. 2000). c.. (b) The MSPE of the optimal predictor of Y based on X. 19.). (c) Show that if Z is real. including the "prior" 71".g(Zo)l· Hint: See Problems 1.(>' . 1990. (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>. wbere Y j . b) equal to zero. Find (a) The mean and variance of Y.14 by setting the derivatives of R(a. (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo . b).~ [y  (z .9. . Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2.. Cov[r(Y). 6. see Berman.4. where X is the number of spots showing when a fair die is rolled.>. + b2 Z 2 + W.z). (a) Show that ifCov[r(Y). x = 1. (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo. Establish 1.>')1 2 } Y> ° where c = 4>(z .Section 1. (c) The optimal predictor of Y given X = x.7 and 1.I exp { .. and checking convexity. and Nonnand and Doksum.6) when r = s. s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}. Let Y be a vector and let r(Y) and s(Y) be real valued. 1).z) . (In practice. need to be estimated from cohort studies. Z2 and W are independent with finite variances. s(Y)] < 00. Write Cov[r(Y). s(Y» given Z = z. N(z . I Z]' Z}. Find Corr(Y}.4. 20. Y2) using (a)..4. This density is called the truncated (at zero) normal.. .7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density . 21. all the unknowns. where Z}. then = E{Cov[r(Y). .4. 7r(Y I z) ~ (27r) .s(Y)] Covlr(Y). s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y).
Xn is a sample from a population with one of the following densities.z) = 6w (y. g(Z)) < for some 9 and that Po is a density. z) = I. Y z )? (f) In model (d).p~y). Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem. Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z).z)lw(y.3.. In Example 1. 0) = Oa' Ix('H). 25. < 00 for all e.z). Suppose Xl.z) be a positive realvalued function. 0 > 0. . z) ~ ep(y. 23.0> O. . z) = z(1 . 2..density. Oax"l exp( Ox"). 24. z) and c is the constant that makes Po a density. . Zz).. 1 2 y' . (a) Let w(y.4.5 1.g(z)]'lw(y. (a) p(x. Assume that Eow (Y.. .. we say that there is a 50% overlap between Y 1 and Yz. Problems for Section 1. 0 (b) p(x. 0) = Ox'. Goals.e)' = Y' . 1 X n be a sample from a Poisson. 00 (b) Suppose that given Z = z.4. (a) Show directly that 1 z:::: 1 Xi is sufficient for 8. n > 2. .2eY + e' < 2(Y' + e'). 0 > 0. p(e). Y ~ B(n. Let Xl.6(O. This is thebeta. 0 < z < I. 1).z). s).e)' Hint: Whatever be Y and c.~(1 . 0) = < X < 1.g(z») is called weighted squared prediction error.x > 0. and (ii) w(y.14). . 3. This is known as the Weibull density.e' < (Y . > a. suppose that Zl and Z. \ (b) Establish the same result using the factorization theorem. Find Po(Z) when (i) w(y..15) yields (1. population where e > O. a > O. a > O. 1. where Po(Y.') and W ~ N(po. Let Xi = 1 if the ith item drawn is bad.84 Statistical Models. Zz and ~V have the same variance (T2. Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad. x  . . Then [y .4. Find the 22.6(r. (c) p(x. . Verify that solving (1. are N(p. In this case what is Corr(Y.l . if b1 = b2 and Zl. and = 0 otherwise. and suppose that Z has the beta. z "5). density. show that the MSPE of the optimal predictor is . Show that EY' < 00 if and only if E(Y . and Performance Criteria Chapter 1 (e) In the preceding model (d). optimal predictor of Yz given (Yi.
This is known as the Pareto density. In each case, find a real-valued sufficient statistic for $\theta$, $a$ fixed.

4. (a) Show that $T_1$ and $T_2$ are equivalent statistics if, and only if, we can write $T_2 = H(T_1)$ for some 1-1 transformation $H$ of the range of $T_1$ into the range of $T_2$. Which of the following statistics are equivalent? (Prove or disprove.)
(b) $\prod_{i=1}^n x_i$ and $\sum_{i=1}^n \log x_i$, $x_i > 0$
(c) $\sum_{i=1}^n x_i$ and $\sum_{i=1}^n \log x_i$, $x_i > 0$
(d) $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$ and $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n (x_i - \bar x)^2\right)$
(e) $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^3\right)$ and $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n (x_i - \bar x)^3\right)$.
5. Let $\theta = (\theta_1, \theta_2)$ be a bivariate parameter. Suppose that $T_1(X)$ is sufficient for $\theta_1$ whenever $\theta_2$ is fixed and known, whereas $T_2(X)$ is sufficient for $\theta_2$ whenever $\theta_1$ is fixed and known. Assume that $\theta_1, \theta_2$ vary independently, $\theta_1 \in \Theta_1$, $\theta_2 \in \Theta_2$, and that the set $S = \{x : p(x, \theta) > 0\}$ does not depend on $\theta$.
(a) Show that if $T_1$ and $T_2$ do not depend on $\theta_2$ and $\theta_1$, respectively, then $(T_1(X), T_2(X))$ is sufficient for $\theta$.
(b) Exhibit an example in which $(T_1(X), T_2(X))$ is sufficient for $\theta$, $T_1(X)$ is sufficient for $\theta_1$ whenever $\theta_2$ is fixed and known, but $T_2(X)$ is not sufficient for $\theta_2$ when $\theta_1$ is fixed and known.

6. Let $X$ take on the specified values $v_1, \ldots, v_k$ with probabilities $\theta_1, \ldots, \theta_k$, respectively. Suppose that $X_1, \ldots, X_n$ are independently and identically distributed as $X$. Suppose that $\theta = (\theta_1, \ldots, \theta_k)$ is unknown and may range over the set $\Theta = \{(\theta_1, \ldots, \theta_k) : \theta_i \ge 0,\ 1 \le i \le k,\ \sum_{i=1}^k \theta_i = 1\}$. Let $N_j$ be the number of $X_i$ which equal $v_j$.
(a) What is the distribution of $(N_1, \ldots, N_k)$?
(b) Show that $N = (N_1, \ldots, N_{k-1})$ is sufficient for $\theta$.

7. Let $X_1, \ldots, X_n$ be a sample from a population with density $p(x, \theta)$ given by
$$p(x, \theta) = \begin{cases} \dfrac{1}{\sigma} \exp\left\{-\dfrac{x - \mu}{\sigma}\right\} & \text{if } x \ge \mu \\ 0 & \text{otherwise.} \end{cases}$$
Here $\theta = (\mu, \sigma)$ with $-\infty < \mu < \infty$, $\sigma > 0$.
(a) Show that $\min(X_1, \ldots, X_n)$ is sufficient for $\mu$ when $\sigma$ is fixed.
(b) Find a one-dimensional sufficient statistic for $\sigma$ when $\mu$ is fixed.
(c) Exhibit a two-dimensional sufficient statistic for $\theta$.

8. Let $X_1, \ldots, X_n$ be a sample from some continuous distribution $F$ with density $f$, which is unknown. Treating $f$ as a parameter, show that the order statistics $X_{(1)}, \ldots, X_{(n)}$ (cf. Problem B.2.8) are sufficient for $f$.
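9. Let $X_1, \ldots, X_n$ be a sample from a population with density
$$f_\theta(x) = \begin{cases} a(\theta) h(x) & \text{if } \theta_1 \le x \le \theta_2 \\ 0 & \text{otherwise} \end{cases}$$
where $h(x) \ge 0$, $\theta = (\theta_1, \theta_2)$ with $-\infty < \theta_1 < \theta_2 < \infty$, and $a(\theta) = \left[\int_{\theta_1}^{\theta_2} h(x)\,dx\right]^{-1}$ is assumed to exist. Find a two-dimensional sufficient statistic for this problem and apply your result to the $U[\theta_1, \theta_2]$ family of distributions.

10. Suppose $X_1, \ldots, X_n$ are i.i.d. with density $f(x, \theta) = \frac{1}{2} e^{-|x - \theta|}$. Show that $(X_{(1)}, \ldots, X_{(n)})$, the order statistics, are minimal sufficient.
Hint: $\frac{\partial}{\partial\theta} \log L_X(\theta) = \sum_{i=1}^n \mathrm{sgn}(X_i - \theta)$, $\theta \notin \{X_1, \ldots, X_n\}$, which determines $X_{(1)}, \ldots, X_{(n)}$.

11. Let $X_1, X_2, \ldots, X_n$ be a sample from the uniform, $U(0, \theta)$, distribution. Show that $X_{(n)} = \max\{X_i : 1 \le i \le n\}$ is minimal sufficient for $\theta$.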
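The sufficiency claim in Problem 11 can be illustrated numerically. The sketch below is only an illustration, not a proof: the function name `uniform_likelihood`, the three samples, and the grid of $\theta$ values are all assumptions made for the example. It evaluates the $U(0, \theta)$ likelihood on a grid and compares samples that share, or do not share, the same maximum.

```python
def uniform_likelihood(sample, theta):
    """Likelihood of an i.i.d. U(0, theta) sample at a given theta > 0."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    # The joint density is theta**(-n) when every observation lies in [0, theta].
    inside = all(0.0 <= x <= theta for x in sample)
    return theta ** (-len(sample)) if inside else 0.0


# Hypothetical samples: a and b share the same size and maximum; c has a larger maximum.
sample_a = [0.3, 1.7, 0.9, 1.2]
sample_b = [1.7, 0.1, 0.2, 1.6]
sample_c = [0.3, 1.9, 0.9, 1.2]

grid = [0.5 + 0.1 * k for k in range(30)]
curve_a = [uniform_likelihood(sample_a, t) for t in grid]
curve_b = [uniform_likelihood(sample_b, t) for t in grid]
curve_c = [uniform_likelihood(sample_c, t) for t in grid]

print(curve_a == curve_b)  # same maximum, same likelihood curve on the grid
print(curve_a == curve_c)  # different maximum, different curve on the grid
```

Samples with equal size and equal maximum give the same likelihood function, which is what the factorization theorem predicts; minimality still has to be argued separately.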
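12. Dynkin, Lehmann, Scheffé's Theorem. Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ where $P_\theta$ is discrete concentrated on $\mathcal{X} = \{x_1, x_2, \ldots\}$. Let $p(x, \theta) \equiv P_\theta[X = x] \equiv L_x(\theta) > 0$ on $\mathcal{X}$. Show that $\dfrac{L_X(\cdot)}{L_X(\theta_0)}$ is minimal sufficient.
Hint: Apply the factorization theorem.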
13. Suppose that $X = (X_1, \ldots, X_n)$ is a sample from a population with continuous distribution function $F(x)$. If $F(x)$ is $N(\mu, \sigma^2)$, $T(X) = (\bar X, \hat\sigma^2)$, where $\hat\sigma^2 = n^{-1} \sum (X_i - \bar X)^2$, is sufficient, and $S(X) = (X'_{(1)}, \ldots, X'_{(n)})$, where $X'_{(i)} = (X_{(i)} - \bar X)/\hat\sigma$, is "irrelevant" (ancillary) for $(\mu, \sigma^2)$. However, $S(X)$ is exactly what is needed to estimate the "shape" of $F(x)$ when $F(x)$ is unknown. The shape of $F$ is represented by the equivalence class $\mathcal{F} = \{F((\cdot - a)/b) : b > 0, a \in R\}$. Thus a distribution $G$ has the same shape as $F$ iff $G \in \mathcal{F}$. For instance, one "estimator" of this shape is the scaled empirical distribution function
$$\hat F_s(x) = \begin{cases} 0, & x < x'_{(1)} \\ j/n, & x'_{(j)} \le x < x'_{(j+1)},\ j = 1, \ldots, n-1 \\ 1, & x \ge x'_{(n)}. \end{cases}$$
Show that for fixed $x$, $\hat F_s((x - \bar x)/\hat\sigma)$ converges in probability to $F(x)$. Here we are using $F$ to represent $\mathcal{F}$ because every member of $\mathcal{F}$ can be obtained from $F$.
14. Kolmogorov's Theorem. We are given a regular model with $\Theta$ finite.
(a) Suppose that a statistic $T(X)$ has the property that for any prior distribution on $\theta$, the posterior distribution of $\theta$ depends on $x$ only through $T(x)$. Show that $T(X)$ is sufficient.
(b) Conversely show that if $T(X)$ is sufficient, then, for any prior distribution, the posterior distribution depends on $x$ only through $T(x)$.
Hint: Apply the factorization theorem.
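Direction (b) of Problem 14 can be checked on a small case. The sketch below is a minimal illustration under stated assumptions: a Bernoulli model, a three-point parameter set, an arbitrary prior, and the helper name `posterior` are all hypothetical choices, not part of the problem. Exact rational arithmetic is used so that posteriors can be compared with `==`.

```python
from fractions import Fraction


def posterior(x, thetas, prior):
    """Posterior over a finite parameter set for i.i.d. Bernoulli(theta) data x."""
    weights = []
    for theta, p in zip(thetas, prior):
        likelihood = Fraction(1)
        for xi in x:
            likelihood *= theta if xi == 1 else 1 - theta
        weights.append(p * likelihood)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical finite parameter set and prior (assumptions for illustration).
thetas = [Fraction(1, 5), Fraction(1, 2), Fraction(4, 5)]
prior = [Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)]

# x_a and x_b have the same value of T(x) = sum(x); x_c does not.
x_a = [1, 0, 0, 1, 0]
x_b = [0, 0, 1, 0, 1]
x_c = [1, 1, 1, 0, 0]

post_a = posterior(x_a, thetas, prior)
post_b = posterior(x_b, thetas, prior)
post_c = posterior(x_c, thetas, prior)

print(post_a == post_b)  # equal sums give equal posteriors
print(post_a == post_c)  # different sums give different posteriors here
```

Changing `prior` to any other strictly positive weights leaves the first comparison true, which is the content of (b) for this model.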
15. Let $X_1, \ldots, X_n$ be a sample from $f(x - \theta)$, $\theta \in R$. Show that the order statistics are minimal sufficient when $f$ is the Cauchy density $f(t) = 1/[\pi(1 + t^2)]$.

16. Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be independently distributed according to $N(\mu, \sigma^2)$ and $N(\eta, \tau^2)$, respectively. Find minimal sufficient statistics for the following three cases:
(i) $\mu, \eta, \sigma, \tau$ are arbitrary: $-\infty < \mu, \eta < \infty$, $0 < \sigma, \tau$.
(ii) $\sigma = \tau$ and $\mu, \eta, \sigma$ are arbitrary.
(iii) $\mu = \eta$ and $\mu, \sigma, \tau$ are arbitrary.

17. In Example 1.5.4, express $t_1$ as a function of $L_x(0, 1)$ and $L_x(1, 1)$.

Problems for Section 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose $X_1, \ldots, X_n$ is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of $X$ forms a one-parameter exponential family. Identify $\eta$, $B$, $T$, and $h$.

3. Let $X$ be the number of failures before the first success in a sequence of Bernoulli trials with probability of success $\theta$. Then $P_\theta[X = k] = (1 - \theta)^k \theta$, $k = 0, 1, 2, \ldots$ This is called the geometric distribution ($\mathcal{G}(\theta)$).
(a) Show that the family of geometric distributions is a one-parameter exponential family with $T(x) = x$.
(b) Deduce from Theorem 1.6.1 that if $X_1, \ldots, X_n$ is a sample from $\mathcal{G}(\theta)$, then the distributions of $\sum_{i=1}^n X_i$ form a one-parameter exponential family.
(c) Show that $\sum_{i=1}^n X_i$ in part (b) has a negative binomial distribution with parameters $(n, \theta)$ defined by
$$P_\theta\left[\sum_{i=1}^n X_i = k\right] = \binom{n + k - 1}{k} (1 - \theta)^k \theta^n, \quad k = 0, 1, 2, \ldots$$
(The negative binomial distribution is that of the number of failures before the $n$th success in a sequence of Bernoulli trials with probability of success $\theta$.)
Hint: By Theorem 1.6.1, $P_\theta\left[\sum_{i=1}^n X_i = k\right] = c_k (1 - \theta)^k \theta^n$, $0 < \theta < 1$. If
$$\sum_{k=0}^\infty c_k \omega^k = \frac{1}{(1 - \omega)^n}, \quad 0 < \omega < 1,$$
then
$$c_k = \frac{1}{k!} \frac{d^k}{d\omega^k} (1 - \omega)^{-n} \Big|_{\omega = 0}.$$
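The identity in Problem 3(c) can be verified exactly for small cases before attempting the proof. The sketch below is an illustration only; the function names (`geometric_pmf`, `convolve`, `sum_of_geometrics_pmf`, `negative_binomial_pmf`) and the chosen values of $n$ and $\theta$ are assumptions. It convolves the geometric frequency function with itself $n - 1$ times and compares the result with the stated negative binomial formula, using exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def geometric_pmf(theta, kmax):
    """P[X = k] = (1 - theta)^k * theta for k = 0, ..., kmax."""
    return [(1 - theta) ** k * theta for k in range(kmax + 1)]


def convolve(p, q, kmax):
    """Frequency function of the sum of two independent counts, for k <= kmax."""
    return [sum(p[i] * q[k - i] for i in range(k + 1)) for k in range(kmax + 1)]


def sum_of_geometrics_pmf(n, theta, kmax):
    """Frequency function of X_1 + ... + X_n for i.i.d. geometric X_i."""
    single = geometric_pmf(theta, kmax)
    total = single
    for _ in range(n - 1):
        total = convolve(total, single, kmax)
    return total


def negative_binomial_pmf(n, theta, kmax):
    """The formula stated in part (c)."""
    return [comb(n + k - 1, k) * (1 - theta) ** k * theta ** n
            for k in range(kmax + 1)]


theta = Fraction(1, 3)
print(sum_of_geometrics_pmf(3, theta, 8) == negative_binomial_pmf(3, theta, 8))
```

Truncating at `kmax` loses nothing here, because the probability of a sum equal to $k$ only involves terms with indices at most $k$.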
4. Which of the following families of distributions are exponential families? (Prove or disprove.)
(a) The $U(0, \theta)$ family
(b) $p(x, \theta) = \{\exp[-2\log\theta + \log(2x)]\}\,1[x \in (0, \theta)]$
(c) $p(x, \theta) = \frac{1}{9}$, $x \in \{0.1 + \theta, \ldots, 0.9 + \theta\}$
(d) The $N(\theta, \theta^2)$ family, $\theta > 0$
(e) $p(x, \theta) = 2(x + \theta)/(1 + 2\theta)$, $0 < x < 1$, $\theta > 0$
(f) $p(x, \theta)$ is the conditional frequency function of a binomial, $B(n, \theta)$, variable $X$, given that $X > 0$.
5. Show that the following families of distributions are two-parameter exponential families and identify the functions $\eta$, $B$, $T$, and $h$.
(a) The beta family.
(b) The gamma family.

6. Let $X$ have the Dirichlet distribution, $\mathcal{D}(\alpha)$, of Problem 1.2.15. Show the distribution of $X$ form an $r$-parameter exponential family and identify $\eta$, $B$, $T$, and $h$.

7. Let $X = ((X_1, Y_1), \ldots, (X_n, Y_n))$ be a sample from a bivariate normal population. Show that the distributions of $X$ form a five-parameter exponential family and identify $\eta$, $B$, $T$, and $h$.

8. Show that the family of distributions of Example 1.5.3 is not a one-parameter exponential family.
Hint: If it were, there would be a set $A$ such that $p(x, \theta) > 0$ on $A$ for all $\theta$.

9. Prove the analogue of Theorem 1.6.1 for discrete $k$-parameter exponential families.

10. Suppose that $f(x, \theta)$ is a positive density on the real line, which is continuous in $x$ for each $\theta$ and such that if $(X_1, X_2)$ is a sample of size 2 from $f(\cdot, \theta)$, then $X_1 + X_2$ is sufficient for $\theta$. Show that $f(\cdot, \theta)$ corresponds to a one-parameter exponential family of distributions with $T(x) = x$.
Hint: There exist functions $g(t, \theta)$, $h(x_1, x_2)$ such that $\log f(x_1, \theta) + \log f(x_2, \theta) = g(x_1 + x_2, \theta) + h(x_1, x_2)$. Fix $\theta_0$ and let $r(x, \theta) = \log f(x, \theta) - \log f(x, \theta_0)$, $q(x, \theta) = g(x, \theta) - g(x, \theta_0)$. Then, $q(x_1 + x_2, \theta) = r(x_1, \theta) + r(x_2, \theta)$, and hence, $[r(x_1, \theta) - r(0, \theta)] + [r(x_2, \theta) - r(0, \theta)] = r(x_1 + x_2, \theta) - r(0, \theta)$.

11. Use Theorems 1.6.2 and 1.6.3 to obtain moment-generating functions for the sufficient statistics when sampling from the following distributions.
(a) normal, $\theta = (\mu, \sigma^2)$
(b) gamma, $\Gamma(p, \lambda)$, $\theta = \lambda$, $p$ fixed
(c) binomial
(d) Poisson
(e) negative binomial (see Problem 1.6.3)
(f) gamma, $\Gamma(p, \lambda)$, $\theta = (p, \lambda)$.
12. Show directly using the definition of the rank of an exponential family that the multinomial distribution, $\mathcal{M}(n; \theta_1, \ldots, \theta_k)$, $0 < \theta_j < 1$, $1 \le j \le k$, $\sum_{j=1}^k \theta_j = 1$, is of rank $k - 1$.

13. Show that in Theorem 1.6.3, the condition that $\mathcal{E}$ has nonempty interior is equivalent to the condition that $\mathcal{E}$ is not contained in any $(k - 1)$-dimensional hyperplane.

14. Construct an exponential family of rank $k$ for which $\mathcal{E}$ is not open and $\dot A$ is not defined on all of $\mathcal{E}$. Show that if $k = 1$ and $\mathcal{E}^0 \ne \emptyset$ and $\dot A$, $\ddot A$ are defined on all of $\mathcal{E}$, then Theorem 1.6.3 continues to hold.

15. Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ where $P_\theta$ is discrete and concentrated on $\mathcal{X} = \{x_1, x_2, \ldots\}$, and let $p(x, \theta) = P_\theta[X = x]$. Show that if $\mathcal{P}$ is a (discrete) canonical exponential family generated by $(T, h)$ and $\mathcal{E}^0 \ne \emptyset$, then $T$ is minimal sufficient.
Hint: $\frac{\partial}{\partial \eta_j} \log L_x(\eta) = T_j(x) - E_\eta T_j(X)$. Use Problem 1.5.12.
16. Life testing. Let $X_1, \ldots, X_n$ be independently distributed with exponential density $(2\theta)^{-1} e^{-x/2\theta}$ for $x \ge 0$, and let the ordered $X$'s be denoted by $Y_1 \le Y_2 \le \cdots \le Y_n$. It is assumed that $Y_1$ becomes available first, then $Y_2$, and so on, and that observation is continued until $Y_r$ has been observed. This might arise, for example, in life testing where each $X$ measures the length of life of, say, an electron tube, and $n$ tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where $n$ is the number of atoms, and observation is continued until $r$ $\alpha$-particles have been emitted. Show that
(i) The joint distribution of $Y_1, \ldots, Y_r$ is an exponential family with density
$$\frac{1}{(2\theta)^r} \frac{n!}{(n - r)!} \exp\left[-\frac{\sum_{i=1}^r y_i + (n - r) y_r}{2\theta}\right], \quad 0 \le y_1 \le \cdots \le y_r.$$
(ii) The distribution of $\left[\sum_{i=1}^r Y_i + (n - r) Y_r\right]/\theta$ is $\chi^2$ with $2r$ degrees of freedom.
(iii) Let $Y_1, Y_2, \ldots$ denote the time required until the first, second, ... event occurs in a Poisson process with parameter $1/2\theta'$ (see A.16). Then $Z_1 = Y_1/\theta'$, $Z_2 = (Y_2 - Y_1)/\theta'$, $Z_3 = (Y_3 - Y_2)/\theta'$, ... are independently distributed as $\chi^2$ with 2 degrees of freedom, and the joint density of $Y_1, \ldots, Y_r$ is an exponential family with density
$$\frac{1}{(2\theta')^r} \exp\left(-\frac{y_r}{2\theta'}\right), \quad 0 \le y_1 \le \cdots \le y_r.$$
The distribution of $Y_r/\theta'$ is again $\chi^2$ with $2r$ degrees of freedom.
(iv) The same model arises in the application to life testing if the number $n$ of tubes is held constant by replacing each burned-out tube with a new one, and if $Y_1$ denotes the time at which the first tube burns out, $Y_2$ the time at which the second tube burns out, and so on, measured from some fixed time.
[(ii): The random variables $Z_i = (n - i + 1)(Y_i - Y_{i-1})/\theta$ $(i = 1, \ldots, r)$, with $Y_0 = 0$, are independently distributed as $\chi^2$ with 2 degrees of freedom, and $\left[\sum_{i=1}^r Y_i + (n - r) Y_r\right]/\theta = \sum_{i=1}^r Z_i$.]
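Claim (ii) of Problem 16 lends itself to a quick Monte Carlo sanity check. The sketch below is a rough illustration, not a derivation: the function names, the values $n = 10$, $r = 4$, $\theta = 3$, the number of replications, and the seed are all assumptions. It simulates the censored statistic and reports its sample mean and variance, which can be compared with the mean $2r$ and variance $4r$ of a $\chi^2$ variable with $2r$ degrees of freedom.

```python
import random


def censored_statistic(n, r, theta, rng):
    """Return [sum_{i<=r} Y_i + (n - r) Y_r] / theta for one simulated experiment."""
    # Lifetimes have density (2 theta)^{-1} exp(-x / (2 theta)), i.e. mean 2 theta;
    # random.expovariate takes the rate, which is 1 / (2 theta).
    lifetimes = sorted(rng.expovariate(1.0 / (2.0 * theta)) for _ in range(n))
    observed = lifetimes[:r]  # only the r smallest lifetimes are seen
    return (sum(observed) + (n - r) * observed[-1]) / theta


def simulate(n=10, r=4, theta=3.0, reps=20000, seed=0):
    """Sample mean and variance of the censored statistic over many replications."""
    rng = random.Random(seed)
    values = [censored_statistic(n, r, theta, rng) for _ in range(reps)]
    mean = sum(values) / reps
    var = sum((v - mean) ** 2 for v in values) / (reps - 1)
    return mean, var


mean, var = simulate()
print(mean, var)  # compare with 2r = 8 and 4r = 16
```

Matching two moments does not establish the distribution, of course; the bracketed hint in the problem gives the actual argument through the normalized spacings.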
17. Suppose that $(T_{k \times 1}, h)$ generate a canonical exponential family $\mathcal{P}$ with parameter $\eta_{k \times 1}$ and $\mathcal{E} = R^k$. Let
(a) Show that $\mathcal{Q}$ is the exponential family generated by $\Pi_L T$ and $h \exp\{c^T T\}$, where $\Pi_L$ is the projection matrix of $T$ onto $L = \{\eta : \eta = B\theta + c\}$.
(b) Show that if $\mathcal{P}$ has full rank $k$ and $B$ is of rank $l$, then $\mathcal{Q}$ has full rank $l$.
Hint: If $B$ is of rank $l$, you may assume
18. Suppose $Y_1, \ldots, Y_n$ are independent with $Y_i \sim N(\beta_1 + \beta_2 z_i, \sigma^2)$, where $z_1, \ldots, z_n$ are covariate values not all equal. (See Example 1.6.6.) Show that the family has rank 3. Give the mean vector and the variance matrix of $T$.

19. Logistic Regression. We observe $(z_1, Y_1), \ldots, (z_n, Y_n)$ where the $Y_1, \ldots, Y_n$ are independent, $Y_i \sim B(n_i, \lambda_i)$. The success probability $\lambda_i$ depends on the characteristics $z_i$ of the $i$th subject, for example, on the covariate vector $z_i = (\text{age}, \text{height}, \text{blood pressure})^T$. The function $l(u) = \log[u/(1 - u)]$ is called the logit function. In the logistic linear regression model it is assumed that $l(\lambda_i) = z_i^T \beta$ where $\beta = (\beta_1, \ldots, \beta_d)^T$ and $z_i$ is $d \times 1$. Show that $Y = (Y_1, \ldots, Y_n)^T$ follow an exponential model with rank $d$ iff $z_1, \ldots, z_d$ are not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9).

20. (a) In part II of the proof of Theorem 1.6.4, fill in the details of the arguments that $\mathcal{Q}$ is generated by $(\eta_1 - \eta_0)^T T$ and that $\sim$(ii) = $\sim$(i).
(b) Fill in the details of part III of the proof of Theorem 1.6.4.

21. Find $\mu(\eta) \equiv E_\eta T(X)$ for the gamma, $\Gamma(\alpha, \lambda)$, distribution, where $\theta = (\alpha, \lambda)$.
22. Let $X_1, \ldots, X_n$ be a sample from the $k$-parameter exponential family distribution (1.6.10). Let $T = \left(\sum_{i=1}^n T_1(X_i), \ldots, \sum_{i=1}^n T_k(X_i)\right)$ and let
$$S = \{(\eta_1(\theta), \ldots, \eta_k(\theta)) : \theta \in \Theta\}.$$
Show that if $S$ contains a subset of $k + 1$ vectors $v_0, \ldots, v_k$ so that $v_i - v_0$, $1 \le i \le k$, are not collinear (linearly independent), then $T$ is minimally sufficient for $\theta$.

23. Using (1.6.20), find a conjugate family of distributions for the gamma and beta families.
(a) With one parameter fixed.
(b) With both parameters free.
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter $\theta = (\theta_1, \theta_2)$ where $\theta_1 = E_\theta(X)$, $\theta_2 = 1/(\mathrm{Var}_\theta X)$ (cf. Problem 1.2.12).

25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with $\sigma^2$ known. Find a conjugate family of prior distributions for $(\beta_1, \beta_2)^T$.

26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15.

27. Let $\mathcal{P}$ denote the canonical exponential family generated by $T$ and $h$. For any $\eta_0 \in \mathcal{E}$, set $h_0(x) = q(x, \eta_0)$ where $q$ is given by (1.6.9). Show that $\mathcal{P}$ is also the canonical exponential family generated by $T$ and $h_0$.
28. Exponential families are maximum entropy distributions. The entropy $h(f)$ of a random variable $X$ with density $f$ is defined by
$$h(f) = E(-\log f(X)) = -\int_{-\infty}^{\infty} [\log f(x)] f(x)\,dx.$$
This quantity arises naturally in information theory; see Section 2.2.2 and Cover and Thomas (1991). Let $S = \{x : f(x) > 0\}$.
(a) Show that the canonical $k$-parameter exponential family density
$$f(x, \eta) = \exp\left\{\eta_0 + \sum_{j=1}^k \eta_j r_j(x) - A(\eta)\right\}, \quad x \in S$$
maximizes $h(f)$ subject to the constraints
$$f(x) \ge 0, \quad \int_S f(x)\,dx = 1, \quad \int_S f(x) r_j(x)\,dx = a_j, \quad 1 \le j \le k,$$
where $\eta_0, \ldots, \eta_k$ are chosen so that $f$ satisfies the constraints.
Hint: You may use Lagrange multipliers. Maximize the integrand.
(b) Find the maximum entropy densities when $r_j(x) = x^j$ and (i) $S = (0, \infty)$, $k = 1$, $a_1 > 0$; (ii) $S = R$, $k = 2$, $a_1 \in R$, $a_2 > 0$; (iii) $S = R$, $k = 3$, $a_1 \in R$, $a_2 > 0$, $a_3 \in R$.
n n
Tj
=
LYii>
i=l
1 <j <Pi
Tjl =
LJ'ijJ'iI.
i=l
1 <j< l<p
where Y i = (Yi!, ... , Yip). Show that <: is open and that this family is of rank pcp + 3)/2. Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = pcp + 3)/2 statistics Tj(Y) ~ Yj, 1 < j < p, and Tj,(Y) = YjYi, 1 <j < I < p,
generate $N_p(\mu, \Sigma)$. As $\Sigma$ ranges over all $p \times p$ symmetric positive definite matrices, so does $\Sigma^{-1}$. Next establish that for symmetric matrices $M$,
$$\int \exp\{-u^T M u\}\,du < \infty \ \text{ iff } M \text{ is positive definite}$$
by using the spectral decomposition (see B.10.1.2)
$$M = \sum_{j=1}^{p} \lambda_j e_j e_j^T \quad \text{for } e_1, \ldots, e_p \text{ orthogonal}, \ \lambda_j \in R.$$
To show that the family has full rank $m$, use induction on $p$ to show that if $Z_1, \ldots, Z_p$ are i.i.d. $N(0, 1)$ and if $B_{p \times p} = (b_{jl})$ is symmetric, then
$$P\Big(\sum_{j=1}^{p} a_j Z_j + \sum_{j,l} b_{jl} Z_j Z_l = c\Big) = P(a^T Z + Z^T B Z = c) = 0$$
unless $a = 0$, $B = 0$, $c = 0$. Next recall (Appendix B.6) that since $Y \sim N_p(\mu, \Sigma)$, then $Y = SZ$ for some nonsingular $p \times p$ matrix $S$.
30. Show that if $X_1, \ldots, X_n$ are i.i.d. $N_p(\theta, \Sigma_0)$ given $\theta$ where $\Sigma_0$ is known, then the $N_p(\lambda, \Gamma)$ family is conjugate to $N_p(\theta, \Sigma_0)$, where $\lambda$ varies freely in $R^p$ and $\Gamma$ ranges over all $p \times p$ symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let $\{(\mu_j, \tau_j) : 1 \le j \le k\}$ be a given collection of pairs with $\mu_j \in R$, $\tau_j > 0$. Let $(\mu, \tau)$ be a random pair with $\lambda_j = P((\mu, \tau) = (\mu_j, \tau_j))$, $0 < \lambda_j < 1$, $\sum_{j=1}^{k} \lambda_j = 1$. Let $\theta$ be a random variable whose conditional distribution given $(\mu, \tau) = (\mu_j, \tau_j)$ is normal, $N(\mu_j, \tau_j^2)$. Consider the model $X = \theta + \epsilon$, where $\theta$ and $\epsilon$ are independent and $\epsilon \sim N(0, \sigma_0^2)$, $\sigma_0^2$ known. Note that $\theta$ has the prior density
$$\pi(\theta) = \sum_{j=1}^{k} \lambda_j \varphi_{\tau_j}(\theta - \mu_j) \tag{1.7.4}$$
where $\varphi_\tau$ denotes the $N(0, \tau^2)$ density. Also note that $(X \mid \theta)$ has the $N(\theta, \sigma_0^2)$ distribution.
(a) Find the posterior
$$\pi(\theta \mid x) = \sum_{j=1}^{k} P((\mu, \tau) = (\mu_j, \tau_j) \mid x)\, \pi(\theta \mid (\mu_j, \tau_j), x)$$
and write it in the form
$$\sum_{j=1}^{k} \lambda_j(x)\, \varphi_{\tau_j(x)}(\theta - \mu_j(x))$$
for appropriate $\lambda_j(x)$, $\tau_j(x)$ and $\mu_j(x)$. This shows that (1.7.4) defines a conjugate prior for the $N(\theta, \sigma_0^2)$ distribution.
(b) Let $X_i = \theta + \epsilon_i$, $1 \le i \le n$, where $\theta$ is as previously and $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma_0^2)$. Find the posterior $\pi(\theta \mid x_1, \ldots, x_n)$, and show that it belongs to class (1.7.4). Hint: Consider the sufficient statistic for $p(x \mid \theta)$.
32. A Hierarchical Binomial-Beta Model. Let $\{(r_j, s_j) : 1 \le j \le k\}$ be a given collection of pairs with $r_j > 0$, $s_j > 0$, let $(R, S)$ be a random pair with $P(R = r_j, S = s_j) = \lambda_j$, $0 < \lambda_j < 1$, $\sum_{j=1}^{k} \lambda_j = 1$, and let $\theta$ be a random variable whose conditional density $\pi(\theta, r, s)$ given $R = r$, $S = s$ is beta, $\beta(r, s)$. Consider the model in which $(X \mid \theta)$ has the binomial, $B(n, \theta)$, distribution. Note that $\theta$ has the prior density
$$\pi(\theta) = \sum_{j=1}^{k} \lambda_j \pi(\theta, r_j, s_j). \tag{1.7.5}$$
Find the posterior
$$\pi(\theta \mid x) = \sum_{j=1}^{k} P(R = r_j, S = s_j \mid x)\, \pi(\theta \mid (r_j, s_j), x)$$
and show that it can be written in the form $\sum_j \lambda_j(x)\, \pi(\theta, r_j(x), s_j(x))$ for appropriate $\lambda_j(x)$, $r_j(x)$ and $s_j(x)$. This shows that (1.7.5) defines a class of conjugate priors for the $B(n, \theta)$ distribution.
33. Let $p(x, \eta)$ be a one parameter canonical exponential family generated by $T(x) = x$ and $h(x)$, $x \in \mathcal{X} \subset R$, and let $\psi(x)$ be a nonconstant, nondecreasing function. Show that $E_\eta \psi(X)$ is strictly increasing in $\eta$. Hint:
$$\mathrm{Cov}_\eta(\psi(X), X) = \tfrac{1}{2} E\{(X - X')[\psi(X) - \psi(X')]\}$$
where $X$ and $X'$ are independent identically distributed as $X$ (see A.11.12).
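34. Let $(X_1, \ldots, X_n)$ be a stationary Markov chain with two states 0 and 1. That is,
$$P[X_i = \epsilon_i \mid X_1 = \epsilon_1, \ldots, X_{i-1} = \epsilon_{i-1}] = P[X_i = \epsilon_i \mid X_{i-1} = \epsilon_{i-1}] = p_{\epsilon_{i-1}\epsilon_i}$$
where
$$\begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix}$$
is the matrix of transition probabilities. Suppose further that
(i) $p_{00} = p_{11} = p$, so that $p_{10} = p_{01} = 1 - p$.
(ii) $P[X_1 = 0] = P[X_1 = 1] = \tfrac{1}{2}$.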
(a) Show that if $0 < p < 1$ is unknown this is a full rank, one-parameter exponential family with $T = N_{00} + N_{11}$ where $N_{ij} \equiv$ the number of transitions from $i$ to $j$. For example, 01011 has $N_{01} = 2$, $N_{11} = 1$, $N_{00} = 0$, $N_{10} = 1$.
(b) Show that $E(T) = (n - 1)p$ (by the method of indicators or otherwise).
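The transition counts in Problem 34 can be checked numerically. The following is a minimal Python sketch, not part of the text: the function names are hypothetical, and `expected_T` simply enumerates all $2^n$ paths under assumptions (i) and (ii), so it is only practical for small $n$.

```python
from collections import Counter
from itertools import product

def transition_counts(chain):
    """Count transitions i -> j in a 0/1 sequence given as a string or list."""
    states = [int(s) for s in chain]
    return Counter(zip(states, states[1:]))

def sufficient_statistic(chain):
    """T = N00 + N11: the number of steps on which the chain keeps its state."""
    counts = transition_counts(chain)
    return counts[(0, 0)] + counts[(1, 1)]

def expected_T(n, p):
    """E(T) by brute-force enumeration of all 2^n paths (small n only).

    Under (i) and (ii) a path has probability (1/2) * p^T * (1-p)^(n-1-T).
    """
    total = 0.0
    for path in product([0, 1], repeat=n):
        t = sufficient_statistic(path)
        total += 0.5 * p ** t * (1 - p) ** (n - 1 - t) * t
    return total

counts = transition_counts("01011")
print(counts[(0, 1)], counts[(1, 1)], counts[(0, 0)], counts[(1, 0)])
print(expected_T(5, 0.3), (5 - 1) * 0.3)
```

The first printed line reproduces the counts quoted in part (a); the second compares the enumerated expectation with $(n-1)p$ from part (b).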
35. A Conjugate Prior for the Two-Sample Problem. Suppose that $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ are independent $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ samples, respectively. Consider the prior $\pi$ for which for some $r > 0$, $k > 0$, $r\sigma^{-2}$ has a $\chi^2_k$ distribution and given $\sigma^2$, $\mu_1$ and $\mu_2$ are independent with $N(\xi_1, \sigma^2/k_1)$ and $N(\xi_2, \sigma^2/k_2)$ distributions, respectively, where $\xi_j \in R$, $k_j > 0$, $j = 1, 2$. Show that $\pi$ is a conjugate prior.
36. The inverse Gaussian density, $IG(\mu, \lambda)$, is
$$f(x, \mu, \lambda) = [\lambda/2\pi]^{1/2} x^{-3/2} \exp\{-\lambda(x - \mu)^2 / 2\mu^2 x\}, \quad x > 0, \ \mu > 0, \ \lambda > 0.$$
(a) Show that this is an exponential family generated by $T(X) = -\tfrac{1}{2}(X, X^{-1})^T$ and $h(x) = (2\pi)^{-1/2} x^{-3/2}$.
(b) Show that the canonical parameters $\eta_1, \eta_2$ are given by $\eta_1 = \mu^{-2}\lambda$, $\eta_2 = \lambda$, and that $A(\eta_1, \eta_2) = -\left[\tfrac{1}{2}\log(\eta_2) + \sqrt{\eta_1 \eta_2}\right]$, $\mathcal{E} = [0, \infty) \times (0, \infty)$.
(c) Find the moment-generating function of $T$ and show that $E(X) = \mu$, $\mathrm{Var}(X) = \mu^3 \lambda^{-1}$, $E(X^{-1}) = \mu^{-1} + \lambda^{-1}$, $\mathrm{Var}(X^{-1}) = (\lambda\mu)^{-1} + 2\lambda^{-2}$.
(d) Suppose $\mu = \mu_0$ is known. Show that the gamma family, $\Gamma(\alpha, \beta)$, is a conjugate prior.
(e) Suppose that $\lambda = \lambda_0$ is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to $\mu$. That is, $\Omega$ defined in (1.6.19) is empty.
(f) Suppose that $\mu$ and $\lambda$ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, $\Omega$ defined in (1.6.19) is empty.
37. Let $X_1, \ldots, X_n$ be i.i.d. as $X \sim N_p(\theta, \Sigma_0)$ where $\Sigma_0$ is known. Show that the conjugate prior generated by (1.6.20) is the $N_p(\eta_0, \tau_0^2 I)$ family, where $\eta_0$ varies freely in $R^p$, $\tau_0^2 > 0$ and $I$ is the $p \times p$ identity matrix.
38. Let $X_i = (Z_i, Y_i)^T$ be i.i.d. as $X = (Z, Y)^T$, $1 \le i \le n$, where $X$ has the density of Example 1.6.3. Write the density of $X_1, \ldots, X_n$ as a canonical exponential family and identify $T$, $h$, $A$, and $\mathcal{E}$. Find the expected value and variance of the sufficient statistic.
39. Suppose that $Y_1, \ldots, Y_n$ are independent, $Y_i \sim N(\mu_i, \sigma^2)$, $n \ge 4$.
(a) Write the distribution of $Y_1, \ldots, Y_n$ in canonical exponential family form. Identify $T$, $h$, $\eta$, $A$, and $\mathcal{E}$.
(b) Next suppose that $\mu_i$ depends on the value $z_i$ of some covariate and consider the submodel defined by the map $\eta : (\theta_1, \theta_2, \theta_3)^T \to (\mu^T, \sigma^2)^T$ where $\eta$ is determined by
$$\mu_i = \exp\{\theta_1 + \theta_2 z_i\}, \quad z_1 < z_2 < \cdots < z_n; \qquad \sigma^2 = \theta_3$$
where $\theta_1 \in R$, $\theta_2 \in R$, $\theta_3 > 0$. This model is sometimes used when $\mu_i$ is restricted to be positive. Show that $p(y, \theta)$ as given by (1.6.12) is a curved exponential family model with $l = 3$.
40. Suppose $Y_1, \ldots, Y_n$ are independent exponentially, $\mathcal{E}(\lambda_i)$, distributed survival times, $n \ge 3$.
(a) Write the distribution of $Y_1, \ldots, Y_n$ in canonical exponential family form. Identify $T$, $h$, $\eta$, $A$, and $\mathcal{E}$.
(b) Recall that $\mu_i = E(Y_i) = \lambda_i^{-1}$. Suppose $\mu_i$ depends on the value $z_i$ of a covariate. Because $\mu_i > 0$, $\mu_i$ is sometimes modeled as
$$\mu_i = \exp\{\theta_1 + \theta_2 z_i\}, \quad i = 1, \ldots, n$$
where not all the $z$'s are equal. Show that $p(y, \theta)$ as given by (1.6.12) is a curved exponential family model with $l = 2$.
1.8 NOTES
Note for Section 1.1
(1) For the measure theoretically minded we can assume more generally that the $P_\theta$ are all dominated by a $\sigma$ finite measure $\mu$ and that $p(x, \theta)$ denotes $dP_\theta/d\mu$, the Radon Nikodym derivative.
Notes for Section 1.3
(1) More natural in the sense of measuring the Euclidean distance between the estimate $\widehat{\theta}$ and the "truth" $\theta$. Squared error gives much more weight to those $\widehat{\theta}$ that are far away from $\theta$ than those close to $\theta$.
(2) We define the lower boundary of a convex set simply to be the set of all boundary points $r$ such that the set lies completely on or above any tangent to the set at $r$.
Note for Section 1.4
(1) Source: Hodges, Jr., J. L., D. Kretch, and R. S. Crutchfield. Statlab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975.
Notes for Section 1.6
(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particles; see Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory; see Cover and Thomas (1991).
(2) The restriction that $x \in R^q$ and that these families be discrete or continuous is artificial. In general if $\mu$ is a $\sigma$ finite measure on the sample space $\mathcal{X}$, $p(x, \theta)$ as given by (1.6.1) can be taken to be the density of $X$ with respect to $\mu$; see Lehmann (1997), for instance. This permits consideration of data such as images, positions, and spheres (e.g., the Earth), and so on.
Note for Section 1.7
(1) $u^T M u > 0$ for all $p \times 1$ vectors $u \ne 0$.
1.9 REFERENCES
BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.
BERMAN, S. M., "A Stochastic Model for the Distribution of HIV Latency Time Based on T4 Counts," Biometrika, 77, 733-741 (1990).
BICKEL, P. J., "Using Residuals Robustly I: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist. 6, 266-291 (1978).
BLACKWELL, D. AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York: Wiley, 1954.
BOX, G. E. P., "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion)," J. Royal Statist. Soc. A 143, 383-430 (1979).
BROWN, L., Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, IMS Lecture Notes-Monograph Series, Hayward, 1986.
CARROLL, R. J. AND D. RUPPERT, Transformation and Weighting in Regression New York: Chapman and Hall, 1988.
COVER, T. M. AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.
DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1969.
DOKSUM, K. A. AND A. SAMAROV, "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression," Ann. Statist. 23, 1443-1473 (1995).
FERGUSON, T. S., Mathematical Statistics New York: Academic Press, 1967.
FEYNMAN, R. P., The Feynman Lectures on Physics, v. 1, R. P. Feynman, R. B. Leighton, and M. Sands, Eds., Ch. 40 Statistical Mechanics of Physics Reading, MA: Addison-Wesley, 1963.
GRENANDER, U. AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York: Wiley, 1957.
HODGES, JR., J. L., D. KRETCH AND R. S. CRUTCHFIELD, Statlab: An Empirical Introduction to Statistics New York: McGraw-Hill, 1975.
KENDALL, M. G. AND A. STUART, The Advanced Theory of Statistics, Vols. II, III New York: Hafner Publishing Co., 1961, 1966.
LEHMANN, E. L., "A Theory of Some Multiple Decision Problems, I and II," Ann. Math. Statist. 22, 1-25, 547-572 (1957).
LEHMANN, E. L., "Model Specification: The Views of Fisher and Neyman, and Later Developments," Statist. Science 5, 160-168 (1990).
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.
MANDEL, J., The Statistical Analysis of Experimental Data New York: J. Wiley & Sons, 1964.
NORMAND, S.-L. AND K. A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Editors: S. E. Ahmed and N. Reid. New York: Springer, Lecture Notes in Statistics, 2000.
PEARSON, K., "On the General Theory of Skew Correlation and Nonlinear Regression," Proc. Roy. Soc. London 71, 303 (1905). (Draper's Research Memoirs, Dulan & Co., Biometrics Series II.)
RAIFFA, H. AND R. SCHLAIFFER, Applied Statistical Decision Theory, Division of Research, Graduate School of Business Administration, Harvard University, Boston, 1961.
SAVAGE, L. J., The Foundations of Statistics, J. Wiley & Sons, New York, 1954.
SAVAGE, L. J. ET AL., The Foundation of Statistical Inference London: Methuen & Co., 1962.
SNEDECOR, G. W. AND W. G. COCHRAN, Statistical Methods, 8th Ed. Ames, IA: Iowa State University Press, 1989.
WETHERILL, G. B. AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
Chapter 2
METHODS OF ESTIMATION
2.1 BASIC HEURISTICS OF ESTIMATION
2.1.1 Minimum Contrast Estimates; Estimating Equations
Our basic framework is as before, that is, $X \in \mathcal{X}$, $X \sim P \in \mathcal{P}$, usually parametrized as $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. In this parametric case, how do we select reasonable estimates for $\theta$ itself? That is, how do we find a function $\widehat{\theta}(X)$ of the vector observation $X$ that in some sense "is close" to the unknown $\theta$? The fundamental heuristic is typically the following. We consider a function that we shall call a contrast function
$$\rho : \mathcal{X} \times \Theta \to R$$
and define
$$D(\theta_0, \theta) \equiv E_{\theta_0} \rho(X, \theta).$$
As a function of $\theta$, $D(\theta_0, \theta)$ measures the (population) discrepancy between $\theta$ and the true value $\theta_0$ of the parameter. In order for $\rho$ to be a contrast function we require that $D(\theta_0, \theta)$ is uniquely minimized for $\theta = \theta_0$. That is, if $P_{\theta_0}$ were true and we knew $D(\theta_0, \theta)$ as a function of $\theta$, we could obtain $\theta_0$ as the minimizer. Of course, we don't know the truth so this is inoperable, but in a very weak sense (unbiasedness), $\rho(X, \theta)$ is an estimate of $D(\theta_0, \theta)$. So it is natural to consider $\widehat{\theta}(X)$ minimizing $\rho(X, \theta)$. This is the most general form of the minimum contrast estimate we shall consider in the next section.
Now suppose $\Theta$ is Euclidean $\subset R^d$, the true $\theta_0$ is an interior point of $\Theta$, and $\theta \to D(\theta_0, \theta)$ is smooth. Then we expect
$$\nabla_\theta D(\theta_0, \theta)\big|_{\theta = \theta_0} = 0 \tag{2.1.1}$$
where $\nabla$ denotes the gradient. Arguing heuristically again we are led to estimates $\widehat{\theta}$ that solve
$$\nabla_\theta \rho(X, \theta) = 0. \tag{2.1.2}$$
The equations (2.1.2) define a special form of estimating equations.
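To make the heuristic concrete, here is a minimal Python sketch under an assumption that is not in the text: the contrast is taken to be the squared-error contrast $\rho(X, \theta) = \sum_i (X_i - \theta)^2$ for a real parameter $\theta$, and the data are hypothetical. For this choice the estimating equation (2.1.2) reads $-2\sum_i (X_i - \theta) = 0$, whose root is the sample mean.

```python
def contrast(xs, theta):
    """Squared-error contrast rho(X, theta) = sum_i (x_i - theta)^2 (illustrative choice)."""
    return sum((x - theta) ** 2 for x in xs)

def contrast_gradient(xs, theta):
    """Derivative of the contrast in theta; setting it to zero gives the estimating equation."""
    return -2.0 * sum(x - theta for x in xs)

def minimum_contrast_estimate(xs):
    """Closed-form root of the estimating equation for this contrast: the sample mean."""
    return sum(xs) / len(xs)

xs = [2.0, 3.5, 1.0, 4.5]  # hypothetical observations
theta_hat = minimum_contrast_estimate(xs)

# Cross-check by brute force: minimize the contrast over a grid of candidate values.
grid = [i / 100 for i in range(0, 601)]
best_on_grid = min(grid, key=lambda t: contrast(xs, t))

print(theta_hat, best_on_grid, contrast_gradient(xs, theta_hat))
```

The grid search and the root of the estimating equation agree here because this contrast is smooth and convex in $\theta$; for other contrasts the two routes need not be equally convenient.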
More generally, suppose we are given a function $\Psi : \mathcal{X} \times R^d \to R^d$, $\Psi = (\psi_1, \ldots, \psi_d)^T$, and define
$$V(\theta_0, \theta) = E_{\theta_0} \Psi(X, \theta). \tag{2.1.3}$$
Suppose $V(\theta_0, \theta) = 0$ has $\theta_0$ as its unique solution for all $\theta_0 \in \Theta$. Then we say $\widehat{\theta}$ solving
$$\Psi(X, \widehat{\theta}) = 0 \tag{2.1.4}$$
is an estimating equation estimate. Evidently, there is a substantial overlap between the two classes of estimates. Here is an example to be pursued later.
Example 2.1.1. Least Squares. Consider the parametric version of the regression model of Example 1.1.4 with $\mu(z) = g(\beta, z)$, $\beta \in R^d$, where the function $g$ is known. Here the data are $X = \{(z_i, Y_i) : 1 \le i \le n\}$ where $Y_1, \ldots, Y_n$ are independent. A natural(1) function $\rho(X, \beta)$ to consider is the squared Euclidean distance between the vector $Y$ of observed $Y_i$ and the vector expectation of $Y$, $\mu(z) = (g(\beta, z_1), \ldots, g(\beta, z_n))^T$. That is, we take
$$\rho(X, \beta) = |Y - \mu|^2 = \sum_{i=1}^{n} [Y_i - g(\beta, z_i)]^2. \tag{2.1.5}$$
Strictly speaking $\mathcal{P}$ is not fully defined here and this is a point we shall explore later. But, for convenience, suppose we postulate that the $\epsilon_i$ of Example 1.1.4 are i.i.d. $N(0, \sigma_0^2)$. Then $\beta$ parametrizes the model and we can compute (see Problem 2.1.16),
$$D(\beta_0, \beta) = E_{\beta_0} \rho(X, \beta) = n\sigma_0^2 + \sum_{i=1}^{n} [g(\beta_0, z_i) - g(\beta, z_i)]^2, \tag{2.1.6}$$
which is indeed minimized at $\beta = \beta_0$ and uniquely so if and only if the parametrization is identifiable. An estimate $\widehat{\beta}$ that minimizes $\rho(X, \beta)$ exists if $g(\beta, z)$ is continuous and
$$\lim\{|g(\beta, z)| : |\beta| \to \infty\} = \infty$$
(Problem 2.1.10). The estimate $\widehat{\beta}$ is called the least squares estimate.
If, further, $g(\beta, z)$ is differentiable in $\beta$, then $\widehat{\beta}$ satisfies the equation (2.1.2) or equivalently the system of estimating equations,
$$\sum_{i=1}^{n} \frac{\partial g}{\partial \beta_j}(\widehat{\beta}, z_i)\, Y_i = \sum_{i=1}^{n} \frac{\partial g}{\partial \beta_j}(\widehat{\beta}, z_i)\, g(\widehat{\beta}, z_i), \quad 1 \le j \le d. \tag{2.1.7}$$
In the important linear case,
$$g(\beta, z_i) = \sum_{j=1}^{d} z_{ij} \beta_j \quad \text{and} \quad z_i = (z_{i1}, \ldots, z_{id})^T$$
the system becomes
$$\sum_{i=1}^{n} z_{ij} Y_i = \sum_{k=1}^{d} \Big(\sum_{i=1}^{n} z_{ij} z_{ik}\Big) \widehat{\beta}_k, \quad 1 \le j \le d, \tag{2.1.8}$$
the normal equations. These equations are commonly written in matrix form
$$Z_D^T Y = Z_D^T Z_D \widehat{\beta} \tag{2.1.9}$$
where $Z_D \equiv \|z_{ij}\|_{n \times d}$ is the design matrix. Least squares, thus, provides a first example of both minimum contrast and estimating equation methods.
We return to the remark that this estimating method is well defined even if the $\epsilon_i$ are not i.i.d. $N(0, \sigma_0^2)$. In fact, once defined we have a method of computing a statistic $\widehat{\beta}$ from the data $X = \{(z_i, Y_i), 1 \le i \le n\}$, which can be judged on its merits whatever the true $P$ governing $X$ is. This very important example is pursued further in Section 2.2 and Chapter 6.
Here is another basic estimating equation example.
Example 2.1.2. Method of Moments (MOM). Suppose $X_1, \ldots, X_n$ are i.i.d. as $X \sim P_\theta$, $\theta \in R^d$ and $\theta$ is identifiable. Suppose that $\mu_1(\theta), \ldots, \mu_d(\theta)$ are the first $d$ moments of the population we are sampling from. Thus, we assume the existence of
$$\mu_j(\theta) = \mu_j = E_\theta(X^j), \quad 1 \le j \le d.$$
Define the $j$th sample moment $\widehat{\mu}_j$ by
$$\widehat{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j, \quad 1 \le j \le d.$$
To apply the method of moments to the problem of estimating $\theta$, we need to be able to express $\theta$ as a continuous function $g$ of the first $d$ moments. Thus, suppose
$$\theta \to (\mu_1(\theta), \ldots, \mu_d(\theta))$$
is 1-1 from $R^d$ to $R^d$. The method of moments prescribes that we estimate $\theta$ by the solution of
$$\widehat{\mu}_j = \mu_j(\widehat{\theta}), \quad 1 \le j \le d$$
if it exists. The motivation of this simplest estimating equation example is the law of large numbers: For $X \sim P_\theta$, $\widehat{\mu}_j$ converges in probability to $\mu_j(\theta)$.
More generally, if we want to estimate a $R^k$-valued function $q(\theta)$ of $\theta$, we obtain a MOM estimate of $q(\theta)$ by expressing $q(\theta)$ as a function of any of the first $d$ moments $\mu_1, \ldots, \mu_d$ of $X$, say $q(\theta) = h(\mu_1, \ldots, \mu_d)$, $d \ge k$, and then using $h(\widehat{\mu}_1, \ldots, \widehat{\mu}_d)$ as the estimate of $q(\theta)$.
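The recipe above can be sketched in a few lines of Python for a two-parameter case. The choice of the gamma model $\Gamma(\alpha, \lambda)$, with $\mu_1 = \alpha/\lambda$ and $\mu_2 = \alpha(1+\alpha)/\lambda^2$, is made here for illustration; the function names, the seed, and the simulated data are assumptions of this sketch, not part of the text.

```python
import random

def sample_moment(xs, j):
    """jth sample moment: (1/n) * sum_i x_i^j."""
    return sum(x ** j for x in xs) / len(xs)

def gamma_mom(xs):
    """Method of moments estimates (alpha_hat, lambda_hat) for a gamma sample.

    Matching mu_1 = alpha/lambda and mu_2 = alpha(1+alpha)/lambda^2 gives
    alpha = mu_1^2 / sigma^2 and lambda = mu_1 / sigma^2 with sigma^2 = mu_2 - mu_1^2.
    """
    m1 = sample_moment(xs, 1)
    m2 = sample_moment(xs, 2)
    var = m2 - m1 ** 2
    return m1 ** 2 / var, m1 / var

# Simulated data (hypothetical): random.gammavariate takes shape and SCALE = 1/lambda.
rng = random.Random(0)
alpha, lam = 3.0, 2.0
xs = [rng.gammavariate(alpha, 1.0 / lam) for _ in range(20000)]
print(gamma_mom(xs))
```

With a large simulated sample the two estimates should land near the values used to generate the data, which is the law of large numbers motivation stated above.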
For instance, consider a study in which the survival time $X$ is modeled to have a gamma distribution, $\Gamma(\alpha, \lambda)$, with density
$$[\lambda^\alpha / \Gamma(\alpha)]\, x^{\alpha - 1} \exp\{-\lambda x\}, \quad x > 0; \ \alpha > 0, \ \lambda > 0.$$
In this case $\theta = (\alpha, \lambda)$, $\mu_1 = E(X) = \alpha/\lambda$, and $\mu_2 = E(X^2) = \alpha(1 + \alpha)/\lambda^2$. Solving for $\theta$ gives
$$\alpha = (\mu_1/\sigma)^2, \quad \widehat{\alpha} = (\bar{X}/\widehat{\sigma})^2;$$
$$\lambda = \mu_1/\sigma^2, \quad \widehat{\lambda} = \bar{X}/\widehat{\sigma}^2$$
where $\sigma^2 = \mu_2 - \mu_1^2$ and $\widehat{\sigma}^2 = n^{-1} \sum X_i^2 - \bar{X}^2$. In this example, the method of moment estimator is not unique. We can, for instance, express $\theta$ as a function of $\mu_1$ and $\mu_3 = E(X^3)$ and obtain a method of moment estimator based on $\widehat{\mu}_1$ and $\widehat{\mu}_3$ (Problem 2.1.11).
Algorithmic issues
We note that, in general, neither minimum contrast estimates nor estimating equation solutions can be obtained in closed form. There are many algorithms for optimization and root finding that can be employed. An algorithm for estimating equations frequently used when computation of $M(X, \cdot) \equiv D\Psi(X, \cdot) \equiv \left\|\frac{\partial \psi_i}{\partial \theta_j}(X, \cdot)\right\|_{d \times d}$ is quick and $M$ is nonsingular with high probability is the Newton-Raphson algorithm. It is defined by initializing with $\widehat{\theta}_0$, then setting
$$\widehat{\theta}_{j+1} = \widehat{\theta}_j - [M(X, \widehat{\theta}_j)]^{-1} \Psi(X, \widehat{\theta}_j). \tag{2.1.10}$$
This algorithm and others will be discussed more extensively in Section 2.4 and in Chapter 6, in particular Problem 6.6.10.
2.1.2 The Plug-In and Extension Principles
We can view the method of moments as an example of what we call the plug-in (or substitution) and extension principles, two other basic heuristics particularly applicable in the i.i.d. case. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments.
Example 2.1.3. Frequency Plug-in(2) and Extension. Suppose we observe multinomial trials in which the values $v_1, \ldots, v_k$ of the population being sampled are known, but their respective probabilities $p_1, \ldots, p_k$ are completely unknown. If we let $X_1, \ldots, X_n$ be i.i.d. as $X$ and
$$N_i \equiv \text{number of indices } j \text{ such that } X_j = v_i,$$
then the natural estimate of $p_i = P[X = v_i]$ suggested by the law of large numbers is $N_i/n$, the proportion of sample values equal to $v_i$.
As an illustration consider a population of men whose occupations fall in one of five different job categories, 1, 2, 3, 4, or 5. Here $k = 5$, $v_i = i$, $i = 1, \ldots, 5$, $p_i$ is the proportion of men in the population in the $i$th job category and $N_i/n$ is the sample proportion in this category. Here is some job category data (Mosteller, 1968).

Job Category       1      2      3      4      5
N_i               23     84    289    217     95    n = sum of N_i = 708
p-hat_i         0.03   0.12   0.41   0.31   0.13    sum of p-hat_i = 1

for Danish men whose fathers were in category 3, together with the estimates $\widehat{p}_i = N_i/n$.
Next consider the more general problem of estimating a continuous function $q(p_1, \ldots, p_k)$ of the population proportions. The frequency plug-in principle simply proposes to replace the unknown population frequencies $p_1, \ldots, p_k$ by the observable sample frequencies $N_1/n, \ldots, N_k/n$. That is, use
$$T(X_1, \ldots, X_n) = q\Big(\frac{N_1}{n}, \ldots, \frac{N_k}{n}\Big) \tag{2.1.11}$$
to estimate $q(p_1, \ldots, p_k)$. For instance, suppose that in the previous job category table, categories 4 and 5 correspond to blue-collar jobs, whereas categories 2 and 3 correspond to white-collar jobs. We would be interested in estimating
$$q(p_1, \ldots, p_5) = (p_4 + p_5) - (p_2 + p_3),$$
the difference in the proportions of blue-collar and white-collar workers. If we use the frequency substitution principle, the estimate is
$$T(X_1, \ldots, X_n) = \Big(\frac{N_4}{n} + \frac{N_5}{n}\Big) - \Big(\frac{N_2}{n} + \frac{N_3}{n}\Big),$$
which in our case is $0.44 - 0.53 = -0.09$.
Equivalently, let $P$ denote $p = (p_1, \ldots, p_k)$ with $p_i = P[X = v_i]$, $1 \le i \le k$, and think of this model as $\mathcal{P} = \{\text{all probability distributions } P \text{ on } \{v_1, \ldots, v_k\}\}$. Then $q(p)$ can be identified with a parameter $\nu : \mathcal{P} \to R$, that is, $\nu(P) = (p_4 + p_5) - (p_2 + p_3)$, and the frequency plug-in principle simply says to replace $P = (p_1, \ldots, p_k)$ in $\nu(P)$ by $\widehat{P} = \left(\frac{N_1}{n}, \ldots, \frac{N_k}{n}\right)$, the multinomial empirical distribution of $X_1, \ldots, X_n$.
Now suppose that the proportions $p_1, \ldots, p_k$ do not vary freely but are continuous functions of some $d$-dimensional parameter $\theta = (\theta_1, \ldots, \theta_d)$ and that we want to estimate a component of $\theta$ or more generally a function $q(\theta)$. Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type.
Example 2.1.4. Hardy-Weinberg Equilibrium. Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles. If we assume the three different genotypes are identifiable, we are led to suppose that there are three types of individuals whose frequencies are given by the so-called Hardy-Weinberg proportions
$$p_1 = \theta^2, \quad p_2 = 2\theta(1 - \theta), \quad p_3 = (1 - \theta)^2, \quad 0 < \theta < 1. \tag{2.1.12}$$
If $N_i$ is the number of individuals of type $i$ in the sample of size $n$, then $(N_1, N_2, N_3)$ has a multinomial distribution with parameters $(n, p_1, p_2, p_3)$ given by (2.1.12). Suppose
we want to estimate $\theta$, the frequency of one of the alleles. Because $\theta = \sqrt{p_1}$, we can use the principle we have introduced and estimate by $\sqrt{N_1/n}$. Note, however, that we can also write $\theta = 1 - \sqrt{p_3}$ and, thus, $1 - \sqrt{N_3/n}$ is also a plausible estimate of $\theta$.
In general, suppose that we want to estimate a continuous $R^l$-valued function $q$ of $\theta$. If $p_1, \ldots, p_k$ are continuous functions of $\theta$, we can usually express $q(\theta)$ as a continuous function of $p_1, \ldots, p_k$, that is,
$$q(\theta) = h(p_1(\theta), \ldots, p_k(\theta)), \tag{2.1.13}$$
with $h$ defined and continuous on
$$\mathcal{P} = \Big\{(p_1, \ldots, p_k) : p_i \ge 0, \ \sum_{i=1}^{k} p_i = 1\Big\}.$$
Given $h$ we can apply the extension principle to estimate $q(\theta)$ as
$$T(X_1, \ldots, X_n) = h\Big(\frac{N_1}{n}, \ldots, \frac{N_k}{n}\Big). \tag{2.1.14}$$
As we saw in the Hardy-Weinberg case, the representation (2.1.13) and estimate (2.1.14) are not unique. We shall consider in Chapters 3 (Example 3.4.4) and 5 how to choose among such estimates.
We can think of the extension principle alternatively as follows. Let
$$\mathcal{P}_0 = \{P_\theta = (p_1(\theta), \ldots, p_k(\theta)) : \theta \in \Theta\}$$
be a submodel of $\mathcal{P}$. Now $q(\theta)$ can be identified if $\theta$ is identifiable by a parameter $\nu : \mathcal{P}_0 \to R$ given by $\nu(P_\theta) = q(\theta)$. Then (2.1.13) defines an extension of $\nu$ from $\mathcal{P}_0$ to $\mathcal{P}$ via $\bar{\nu} : \mathcal{P} \to R$ where $\bar{\nu}(P) \equiv h(p)$ and $\bar{\nu}(P) = \nu(P)$ for $P \in \mathcal{P}_0$.
The plug-in and extension principles can be abstractly stated as follows:
Plug-in principle. If we have an estimate $\widehat{P}$ of $P \in \mathcal{P}$ such that $\widehat{P} \in \mathcal{P}$ and $\nu : \mathcal{P} \to T$ is a parameter, then $\nu(\widehat{P})$ is the plug-in estimate of $\nu$. In particular, in the i.i.d. case if $\mathcal{P}$ is the space of all distributions of $X$ and $X_1, \ldots, X_n$ are i.i.d. as $X \sim P$, the empirical distribution $\widehat{P}$ of $X$ given by
$$\widehat{P}[X \in A] = \frac{1}{n} \sum_{i=1}^{n} 1(X_i \in A) \tag{2.1.15}$$
is, by the law of large numbers, a natural estimate of $P$ and $\nu(\widehat{P})$ is a plug-in estimate of $\nu(P)$ in this nonparametric context. For instance, if $X$ is real and $F(x) = P(X \le x)$ is the distribution function (d.f.), consider $\nu_\alpha(P) = \tfrac{1}{2}[F^{-1}(\alpha) + F_U^{-1}(\alpha)]$, where $\alpha \in (0, 1)$ and
$$F^{-1}(\alpha) = \inf\{x : F(x) \ge \alpha\}, \quad F_U^{-1}(\alpha) = \sup\{x : F(x) \le \alpha\}, \tag{2.1.16}$$
then $\nu_\alpha(P)$ is the $\alpha$th population quantile $x_\alpha$. Here $x_{1/2} = \nu_{1/2}(P)$ is called the population median. A natural estimate is the $\alpha$th sample quantile
$$\widehat{x}_\alpha = \tfrac{1}{2}[\widehat{F}^{-1}(\alpha) + \widehat{F}_U^{-1}(\alpha)], \tag{2.1.17}$$
where $\widehat{F}$ is the empirical d.f. Here $\widehat{x}_{1/2}$ is called the sample median.
For a second example, if $X$ is real and $\mathcal{P}$ is the class of distributions with $E|X|^j < \infty$, then the plug-in estimate of the $j$th moment $\nu(P) = \mu_j = E(X^j)$ in this nonparametric context is the $j$th sample moment $\nu(\widehat{P}) = \int x^j\,d\widehat{F}(x) = n^{-1} \sum_{i=1}^{n} X_i^j$.
Extension principle. Suppose $\mathcal{P}_0$ is a submodel of $\mathcal{P}$ and $\widehat{P}$ is an element of $\mathcal{P}$ but not necessarily $\mathcal{P}_0$ and suppose $\nu : \mathcal{P}_0 \to T$ is a parameter. If $\bar{\nu} : \mathcal{P} \to T$ is an extension of $\nu$ in the sense that $\bar{\nu}(P) = \nu(P)$ on $\mathcal{P}_0$, then $\bar{\nu}(\widehat{P})$ is an extension (and plug-in) estimate of $\nu(P)$.
With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plug-in estimates for multinomial trials because
$$\mu_j(\theta) = \sum_{i=1}^{k} v_i^j p_i(\theta) = h(p(\theta)) = \nu(P_\theta)$$
where
$$h(p) = \sum_{i=1}^{k} v_i^j p_i = \bar{\nu}(P),$$
$$\widehat{\mu}_j = \frac{1}{n} \sum_{l=1}^{n} X_l^j = \sum_{i=1}^{k} v_i^j \frac{N_i}{n} = h\Big(\frac{N}{n}\Big) = \bar{\nu}(\widehat{P})$$
and $\widehat{P}$ is the empirical distribution. This reasoning extends to the general i.i.d. case (Problem 2.1.12) and to more general method of moment estimates (Problem 2.1.13). As stated, these principles are general. However, they are mainly applied in the i.i.d. case, but see Problem 2.1.14.
Remark 2.1.1. The plug-in and extension principles are used when $P_\theta$, $\nu$, and $\bar{\nu}$ are continuous. For instance, in the multinomial examples 2.1.3 and 2.1.4, $P_\theta$ as given by the Hardy-Weinberg $p(\theta)$ is a continuous map from $\Theta = [0, 1]$ to $\mathcal{P}$, $\nu(P_\theta) = q(\theta) = h(p(\theta))$ is a continuous map from $\Theta$ to $R$ and $\bar{\nu}(P) = h(p)$ is a continuous map from $\mathcal{P}$ to $R$.
Remark 2.1.2. The plug-in and extension principles must be calibrated with the target parameter. For instance, let $\mathcal{P}_0$ be the class of distributions of $X = \theta + \epsilon$ where $\theta \in R$ and the distribution of $\epsilon$ ranges over the class of symmetric distributions with mean zero. Let $\nu(P)$ be the mean of $X$ and let $\mathcal{P}$ be the class of distributions of $X = \theta + \epsilon$ where $\theta \in R$ and the distribution of $\epsilon$ ranges over the class of distributions with mean zero. In this case both $\bar{\nu}_1(P) = E_P(X)$ and $\bar{\nu}_2(P) =$ "median of $P$" satisfy $\bar{\nu}(P) = \nu(P)$, $P \in \mathcal{P}_0$, but only $\bar{\nu}_1(\widehat{P}) = \bar{X}$ is a sensible estimate of $\nu(P)$, $P \notin \mathcal{P}_0$, because when $P$ is not symmetric, the sample median $\bar{\nu}_2(\widehat{P})$ does not converge in probability to $E_P(X)$.
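As a numerical companion to the frequency plug-in estimate (2.1.11) and to the identity $\widehat{\mu}_j = \sum_i v_i^j N_i/n$, here is a minimal Python sketch using the job category counts of Example 2.1.3. The function names are hypothetical, and the expanded `sample` list is reconstructed from the counts purely for the cross-check.

```python
# Job category counts from Example 2.1.3 (values v_i = i, counts N_i).
counts = {1: 23, 2: 84, 3: 289, 4: 217, 5: 95}
n = sum(counts.values())
p_hat = {v: c / n for v, c in counts.items()}  # sample frequencies N_i / n

def blue_minus_white(p):
    """q(p) = (p4 + p5) - (p2 + p3): blue-collar minus white-collar proportion."""
    return (p[4] + p[5]) - (p[2] + p[3])

def moment_from_frequencies(p, j):
    """h(p) = sum_i v_i^j p_i, the jth moment written as a function of the frequencies."""
    return sum(v ** j * pv for v, pv in p.items())

def sample_moment(xs, j):
    """jth sample moment computed directly from the observations."""
    return sum(x ** j for x in xs) / len(xs)

# Rebuild a sample with exactly these counts to compare the two moment formulas.
sample = [v for v, c in counts.items() for _ in range(c)]

print(n, blue_minus_white(p_hat))
print(moment_from_frequencies(p_hat, 2), sample_moment(sample, 2))
```

The plug-in value printed on the first line is computed from unrounded frequencies, so it differs slightly from the figure obtained in the text from proportions rounded to two decimals; the second line shows the two ways of computing the second sample moment agreeing.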
Here are three further simple examples illustrating reasonable and unreasonable MOM estimates.
Example 2.1.5. Suppose that $X_1, \ldots, X_n$ is a $N(\mu, \sigma^2)$ sample as in Example 1.1.2 with assumptions (1)-(4) holding. The method of moments estimates of $\mu$ and $\sigma^2$ are $\bar{X}$ and $\widehat{\sigma}^2$.
Example 2.1.6. Suppose $X_1, \ldots, X_n$ are the indicators of a set of Bernoulli trials with probability of success $\theta$. Because $\mu_1(\theta) = \theta$ the method of moments leads to the natural estimate of $\theta$, $\bar{X}$, the frequency of successes. To estimate the population variance $\theta(1 - \theta)$ we are led by the first moment to the estimate, $\bar{X}(1 - \bar{X})$. Because we are dealing with (unrestricted) Bernoulli trials, these are the frequency plug-in (substitution) estimates (see Problem 2.1.1).
Example 2.1.7. Estimating the Size of a Population (continued). In Example 1.5.3 where $X_1, \ldots, X_n$ are i.i.d. $U\{1, 2, \ldots, \theta\}$, we find $\mu = E_\theta(X_i) = \tfrac{1}{2}(\theta + 1)$. Thus, $\theta = 2\mu - 1$ and $2\bar{X} - 1$ is a method of moments estimate of $\theta$. This is clearly a foolish estimate if $X_{(n)} = \max X_i > 2\bar{X} - 1$ because in this model $\theta$ is always at least as large as $X_{(n)}$.
As we have seen, there are often several method of moments estimates for the same $q(\theta)$. For example, if we are sampling from a Poisson population with parameter $\theta$, then $\theta$ is both the population mean and the population variance. The method of moments can lead to either the sample mean or the sample variance. Moreover, because $p_0 = P(X = 0) = \exp\{-\theta\}$, a frequency plug-in estimate of $\theta$ is $-\log \widehat{p}_0$, where $\widehat{p}_0$ is $n^{-1} \sum_{i=1}^{n} 1[X_i = 0]$. We will make a selection among such procedures in Chapter 3.
Remark 2.1.3. What are the good points of the method of moments and frequency plug-in?
(a) They generally lead to procedures that are easy to compute and are, therefore, valuable as preliminary estimates in algorithms that search for more efficient estimates. See Section 2.4.
(b) If the sample size is large, these estimates are likely to be close to the value estimated (consistency). This minimal property is discussed in Section 5.2.
It does turn out that there are "best" frequency plug-in estimates, those obtained by the method of maximum likelihood, a special type of minimum contrast and estimating equation method. Unfortunately, as we shall see in Section 2.4, they are often difficult to compute. Algorithms for their computation will be introduced in Section 2.4.
Discussion. When we consider optimality principles, we may arrive at different types of estimates than those discussed in this section. For instance, as we shall see in Chapter 3, estimation of $\theta$ real with quadratic loss and Bayes priors lead to procedures that are data weighted averages of $\theta$ values rather than minimizers of functions $\rho(\theta, X)$. Plug-in is not the optimal way to go for the Bayes, minimax, or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. However, a saving grace becomes apparent in Chapters 5 and 6. If the model fits, for large amounts of data, optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions, the plug-in principle is justified, and there are best extensions.
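The warning attached to the population-size example (a method of moments estimate $2\bar{X} - 1$ that can fall below the largest observation) is easy to see numerically. The following Python sketch uses hypothetical function names and a made-up sample; it only illustrates the inequality $X_{(n)} > 2\bar{X} - 1$ discussed above.

```python
def mom_population_size(xs):
    """Method of moments estimate 2 * mean - 1 for the discrete uniform on {1, ..., theta}."""
    return 2.0 * sum(xs) / len(xs) - 1.0

def is_foolish(xs):
    """True when the MOM estimate falls below the largest observed value,
    which is impossible for the true theta in this model."""
    return max(xs) > mom_population_size(xs)

xs = [1, 2, 3, 20]  # hypothetical sample with one large observation
print(mom_population_size(xs), max(xs), is_foolish(xs))
```

A single large observation is enough to make the moment-based estimate contradict the data, which is why such estimates are best treated as easy-to-compute starting points.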
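Summary. We consider principles that suggest how we can use the outcome $X$ of an experiment to estimate unknown parameters.
For the model $\{P_\theta : \theta \in \Theta\}$ a contrast $\rho$ is a function from $\mathcal{X} \times \Theta$ to $R$ such that the discrepancy
$$D(\theta_0, \theta) = E_{\theta_0} \rho(X, \theta), \quad \theta \in \Theta \subset R^d$$
is uniquely minimized at the true value $\theta = \theta_0$ of the parameter. A minimum contrast estimator is a minimizer of $\rho(X, \theta)$, and the contrast estimating equations are $\nabla_\theta \rho(X, \theta) = 0$.
For data $\{(z_i, Y_i) : 1 \le i \le n\}$ with $Y_i$ independent and $E(Y_i) = g(\beta, z_i)$, $1 \le i \le n$, where $g$ is a known function and $\beta \in R^d$ is a vector of unknown regression coefficients, a least squares estimate of $\beta$ is a minimizer of
$$\rho(X, \beta) = \sum_{i=1}^{n} [Y_i - g(\beta, z_i)]^2.$$
For this contrast, when $g(\beta, z) = z^T \beta$, the associated estimating equations are called the normal equations and are given by $Z_D^T Y = Z_D^T Z_D \beta$, where $Z_D = \|z_{ij}\|_{n \times d}$ is called the design matrix.
Suppose $X \sim P$. The plug-in estimate (PIE) for a vector parameter $\nu = \nu(P)$ is obtained by setting $\widehat{\nu} = \nu(\widehat{P})$ where $\widehat{P}$ is an estimate of $P$. When $\widehat{P}$ is the empirical probability distribution $\widehat{P}_E$ defined by $\widehat{P}_E(A) = n^{-1} \sum_{i=1}^{n} 1[X_i \in A]$, then $\widehat{\nu}$ is called the empirical PIE. If $P = P_\theta$, $\theta \in \Theta$, is parametric and a vector $q(\theta)$ is to be estimated, we find a parameter $\nu$ such that $\nu(P_\theta) = q(\theta)$ and call $\nu(\widehat{P})$ a plug-in estimator of $q(\theta)$. Method of moment estimates are empirical PIEs based on $\nu(P) = (\mu_1, \ldots, \mu_d)^T$ where $\mu_j = E(X^j)$, $1 \le j \le d$. In the multinomial case the frequency plug-in estimators are empirical PIEs based on $\nu(P) = (p_1, \ldots, p_k)$, where $p_j$ is the probability of the $j$th category, $1 \le j \le k$.
Let $\mathcal{P}_0$ and $\mathcal{P}$ be two statistical models for $X$ with $\mathcal{P}_0 \subset \mathcal{P}$. An extension $\bar{\nu}$ of $\nu$ from $\mathcal{P}_0$ to $\mathcal{P}$ is a parameter satisfying $\bar{\nu}(P) = \nu(P)$, $P \in \mathcal{P}_0$. If $\widehat{P}$ is an estimate of $P$ with $\widehat{P} \in \mathcal{P}$, $\bar{\nu}(\widehat{P})$ is called the extension plug-in estimate of $\nu(P)$. The general principles are shown to be related to each other.
2.2 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS
2.2.1 Least Squares and Weighted Least Squares
Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement. It is of great importance in many areas of statistics such as the analysis of variance and regression theory. In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6.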
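For the linear case $g(\beta, z) = z^T \beta$, the normal equations $Z_D^T Y = Z_D^T Z_D \beta$ can be formed and solved directly. The Python sketch below is an illustration only: the function names and the small data set are assumptions, and the plain Gaussian elimination used here is a stand-in for whatever numerically careful solver one would use in practice.

```python
def normal_equations(Z, Y):
    """Return (Z^T Z, Z^T Y) for an n x d design matrix Z given as a list of rows."""
    n, d = len(Z), len(Z[0])
    ZtZ = [[sum(Z[i][j] * Z[i][k] for i in range(n)) for k in range(d)] for j in range(d)]
    ZtY = [sum(Z[i][j] * Y[i] for i in range(n)) for j in range(d)]
    return ZtZ, ZtY

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("Z^T Z is singular: the design matrix lacks full column rank")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, d):
            factor = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * d
    for r in reversed(range(d)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, d))
        x[r] = (M[r][d] - tail) / M[r][r]
    return x

def least_squares(Z, Y):
    """Least squares estimate of beta obtained by solving the normal equations."""
    return solve(*normal_equations(Z, Y))

# Hypothetical data lying exactly on the line y = 1 + 2z; rows of Z are (1, z_i).
Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]
print(least_squares(Z, Y))
```

The singularity check corresponds to the case in which the columns of the design matrix are linearly dependent, so that the normal equations do not determine $\beta$ uniquely.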
For the model {Pe : () E e} a contrast p is a function from X x H to R such that the discrepancy D(lJo. An extension ii of v from Po to P is a parameter satisfying v(P) = v(P). where ZD = IIziJ·llnxd is called the design matrix. I < i < n. f. 0 E e. If P is an estimate of P with PEP. is parametric and a vector q( 8) is to be estimated. the associated estimating equations are called the normal equations and are given by Z1 Y = ZtZD{3. For data {(Zi.
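As a minimal sketch of how the normal equations $Z_D^T Y = Z_D^T Z_D \beta$ can be solved when $Z_D$ has rank $d$, the following pure-Python code forms $Z_D^T Z_D$ and $Z_D^T Y$ and applies Gaussian elimination. The function names and the data are hypothetical and chosen only for illustration; this is not the numerically preferred route for large $d$.

```python
def solve_linear_system(a, b):
    """Solve a x = b for a square matrix a by Gaussian elimination with partial pivoting."""
    d = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("Z_D^T Z_D is singular: the design matrix is not of rank d")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, d):
            factor = m[r][col] / m[col][col]
            for c in range(col, d + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, d))
        x[i] = (m[i][d] - s) / m[i][i]
    return x


def least_squares(z, y):
    """Return the solution beta of the normal equations Z_D^T Z_D beta = Z_D^T Y."""
    n, d = len(z), len(z[0])
    ztz = [[sum(z[i][j] * z[i][k] for i in range(n)) for k in range(d)] for j in range(d)]
    zty = [sum(z[i][j] * y[i] for i in range(n)) for j in range(d)]
    return solve_linear_system(ztz, zty)


# Hypothetical data lying exactly on y = 1 + 2z; the first column of the design is the intercept.
z_design = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y_obs = [1.0, 3.0, 5.0, 7.0]
beta_hat = least_squares(z_design, y_obs)
print(beta_hat)
```

Because the hypothetical responses lie exactly on a line, the residual sum of squares is zero at the fitted coefficients, which gives a quick sanity check of the solver.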
z).zt}. Note that the joint distribution H of (C}. Y) ~ PEP = {Alljnintdistributions of(Z. . That is. Z)f. j. .::2 : In Example 2. 13) = LIY. 1 < i < j < n 1 > 0\ .. z.f3d and is often applied in situations in which specification of the model beyond (2. that is. This follows "I . This is frequently the case for studies in the social and biological sciences.. " (2.z. . that is. • . E«.2.g(l3.:0::d=':Of.1 we considered the nonlinear (and linear) Gaussian model Po given by Y. .. The contrast p(X.3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3.4 with <j simply defined as Y. where Ci = g(l3. (3)  E p p(X.d. If we consider this model. z. 1 < i < n. Z could be educationallevel and Y income.2. .2) Are least squares estimates still reasonable? For P in the semiparametric model P. 13 E R d }.) ..2. For instance.En) is any distribution satisfying the GaussMarkov assumptions. and 13 ~ (g(l3.) + <" I < i < n n (2. z. I<i<n.6) Var( €i) = u 2 COV(fi.Y) such thatE(Y I Z = z) = g(j3. The least squares method of estimation applies only to the parameters {31' ..2.g(l3.g(l3.(z.108_~ ~_ _~ ~ ~cMc.6) is difficult Sometimes z can be viewed as the realization of a population variable Z.zn)f is 11.) = g(l3. a 2 and H unknown.g(l3.::.13) n n i=1  L Varp«. as (Z.4}{2. .)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3. = 0.:. .E:::'::'.::'hc. < i < n.) =0.. NCO.3) continues to be valid. the model is semiparametric with {3.z. .C=ha:::p::. we can compute still Dp(l3o.).g(l3.i. fj) because (2.. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj.d.m::a.1. I Zj = Zj).) = 0. j..5) (2. I) are i.»)2. Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ. .2. .2. E«.) = E(Y.2. 1_' . 1 < j < n.2. aJ) and (3 ranges over R d or an open subset. (Z" Y.) + L[g(l3 o.i. Yi). I' ...1. " . is modeled as a sample from a joint distribution.E(Y.o=nC..2).=.2. 
as in (a) of Example 1. 13 = I3(P) is the miniml2er of E(Y .z. which satisfies (but is not fully specified by) (2.2.) i.1.. (Zi. The estimates continue to be reasonable under the GaussMarkov assumptions. .4) (2. then 13 has an interpretation as a parameter on p. (2.). .:1. . i=l (2. Zn»)T is 1.
from Theorem 1.4.1. In this case we recognize the LSE $\hat\beta$ as simply being the usual plug-in estimate $\beta(\hat P)$, where $\hat P$ is the empirical distribution assigning mass $n^{-1}$ to each of the $n$ pairs $(Z_i, Y_i)$.

As we noted in Example 2.1.1 the most commonly used $g$ in these models is $g(\beta, z) = z^T\beta$, which, in conjunction with (2.2.1), (2.2.4), (2.2.5) and (2.2.6), is called the linear (multiple) regression model. For the data $\{(z_i, Y_i);\ i = 1, \ldots, n\}$ we write this model in matrix form as
\[ Y = Z_D\beta + \epsilon \]
where $Z_D = \|z_{ij}\|$ is the design matrix. We continue our discussion for this important special case for which explicit formulae and theory have been derived. For nonlinear cases we can use numerical methods to solve the estimating equations (2.1.7). See also Problem 2.2.41, Chapter 6, and Seber and Wild (1989). The linear model is often the default model for a number of reasons:

(1) If the range of the $z$'s is relatively small and $\mu(z)$ is smooth, we can approximate $\mu(z)$ by
\[ \mu(z) \approx \mu(z_0) + \sum_{j=1}^d \frac{\partial\mu}{\partial z_j}(z_0)(z_j - z_{0j}), \]
for $z_0 = (z_{01}, \ldots, z_{0d})^T$ an interior point of the domain. We can then treat $\mu(z_0) - \sum_{j=1}^d \frac{\partial\mu}{\partial z_j}(z_0)z_{0j}$ as an unknown $\beta_0$ and identify $\frac{\partial\mu}{\partial z_j}(z_0)$ with $\beta_j$ to give an approximate $(d + 1)$-dimensional linear model with $z_0 = 1$ and $z_j$ as before and
\[ \mu(z) = \sum_{j=0}^d \beta_j z_j. \]
This type of approximation is the basis for nonlinear regression analysis based on local polynomials, see Ruppert and Wand (1994), Fan and Gijbels (1996), and Volume II.

(2) If, as we discussed earlier, we are in a situation in which it is plausible to assume that $(Z_i, Y_i)$ are a sample from a $(d + 1)$-dimensional distribution and the covariates that are the coordinates of $Z$ are continuous, a further modeling step is often taken and it is assumed that $(Z_1, \ldots, Z_d, Y)^T$ has a nondegenerate multivariate Gaussian distribution $N_{d+1}(\mu, \Sigma)$. In that case, as we have seen in Section 1.4, $E(Y \mid Z = z)$ can be written as
\[ \mu(z) = \beta_0 + \sum_{j=1}^d \beta_j z_j \qquad (2.2.7) \]
where
\[ (\beta_1, \ldots, \beta_d)^T = \Sigma_{ZZ}^{-1}\Sigma_{ZY} \qquad (2.2.8) \]
\[ \beta_0 = \mu_Y - \sum_{j=1}^d \beta_j\mu_j. \qquad (2.2.9) \]
Furthermore, $\epsilon \equiv Y - \mu(Z)$ is independent of $Z$ and has a $N(0, \sigma^2)$ distribution where
\[ \sigma^2 = \sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}. \]
Therefore, given $Z_i = z_i$, $1 \le i \le n$, we have a Gaussian linear regression model for $Y_i$, $1 \le i \le n$.

Estimation of $\beta$ in the linear regression model. We have already argued in Example 2.1.1 that, if the parametrization $\beta \to Z_D\beta$ is identifiable, the least squares estimate, $\hat\beta$, exists and is unique and satisfies the normal equations (2.1.8). The parametrization is identifiable if and only if $Z_D$ is of rank $d$ or equivalently if $Z_D^T Z_D$ is of full rank $d$; see Problem 2.2.25. In that case, necessarily, the solution of the normal equations can be given "explicitly" by
\[ \hat\beta = [Z_D^T Z_D]^{-1} Z_D^T Y. \qquad (2.2.10) \]
Here are some examples.

Example 2.2.1. In the measurement model in which $Y_i$ is the determination of a constant $\beta_1$, $d = 1$, $g(z, \beta_1) = \beta_1$, $\frac{\partial}{\partial\beta_1} g(z, \beta_1) = 1$ and the normal equation is $\sum_{i=1}^n (y_i - \beta_1) = 0$, whose solution is $\hat\beta_1 = (1/n)\sum_{i=1}^n y_i = \bar y$. □

Example 2.2.2. We want to find out how increasing the amount $z$ of a certain chemical or fertilizer in the soil increases the amount $y$ of that chemical in the plants grown in that soil. For certain chemicals and plants, the relationship between $z$ and $y$ can be approximated well by a linear equation $y = \beta_1 + \beta_2 z$ provided $z$ is restricted to a reasonably small interval. If we run several experiments with the same $z$ using plants and soils that are as nearly identical as possible, we will find that the values of $y$ will not be the same. For this reason, we assume that for a given $z$, $Y$ is random with a distribution $P(y \mid z)$.

Following are the results of an experiment to which a regression model can be applied (Snedecor and Cochran, 1967, p. 139). Nine samples of soil were treated with different amounts $Z$ of phosphorus. $Y$ is the amount of phosphorus found in corn plants grown for 38 days in the different samples of soil.

z_i:   1    4    5    9   11   13   23   23   28
Y_i:  64   71   54   81   76   93   77   95  109

The points $(z_i, Y_i)$ and an estimate of the line $\beta_1 + \beta_2 z$ are plotted in Figure 2.2.1.

We want to estimate $\beta_1$ and $\beta_2$. The normal equations are
\[ \sum_{i=1}^n (y_i - \beta_1 - \beta_2 z_i) = 0, \qquad \sum_{i=1}^n z_i(y_i - \beta_1 - \beta_2 z_i) = 0. \qquad (2.2.11) \]
When the $z_i$'s are not all equal, we get the solutions
\[ \hat\beta_2 = \frac{\sum_{i=1}^n (z_i - \bar z)y_i}{\sum_{i=1}^n (z_i - \bar z)^2} = \frac{\sum_{i=1}^n (z_i - \bar z)(y_i - \bar y)}{\sum_{i=1}^n (z_i - \bar z)^2} \qquad (2.2.12) \]
and
\[ \hat\beta_1 = \bar y - \hat\beta_2\bar z \qquad (2.2.13) \]
where $\bar z = (1/n)\sum_{i=1}^n z_i$, and $\bar y = (1/n)\sum_{i=1}^n y_i$.

The line $y = \hat\beta_1 + \hat\beta_2 z$ is known as the sample regression line or line of best fit of $y_1, \ldots, y_n$ on $z_1, \ldots, z_n$. Geometrically, if we measure the distance between a point $(z_i, y_i)$ and a line $y = a + bz$ vertically by $d_i = |y_i - (a + bz_i)|$, then the regression line minimizes the sum of the squared distances to the $n$ points $(z_1, y_1), \ldots, (z_n, y_n)$. The vertical distances $\hat\epsilon_i = [y_i - (\hat\beta_1 + \hat\beta_2 z_i)]$ are called the residuals of the fit, $i = 1, \ldots, n$. The line $y = \hat\beta_1 + \hat\beta_2 z$ is an estimate of the best linear MSPE predictor $a_1 + b_1 z$ of Theorem 1.4.3. This connection to prediction explains the use of vertical distances in regression. The regression line for the phosphorus data is given in Figure 2.2.1. Here $\hat\beta_1 = 61.58$ and $\hat\beta_2 = 1.42$. □

Figure 2.2.1. Scatter plot $\{(z_i, y_i);\ i = 1, \ldots, n\}$ and sample regression line for the phosphorus data. $\hat\epsilon_6$ is the residual for $(z_6, y_6)$.

Remark 2.2.1. The linear regression model is considerably more general than appears at first sight. For instance, suppose we select $p$ real-valued functions of $z$, $g_1, \ldots, g_p$, $p \ge d$, and postulate that $\mu(z)$ is a linear combination of $g_1(z), \ldots, g_p(z)$; that is
\[ \mu(z) = \sum_{j=1}^p \theta_j g_j(z). \]
Then we are still dealing with a linear regression model because we can define $w_{p \times 1} =$
$(g_1(z), \ldots, g_p(z))^T$ as our covariate and consider the linear model
\[ Y_i = \sum_{j=1}^p \theta_j w_{ij} + \epsilon_i, \quad 1 \le i \le n \]
where $w_{ij} \equiv g_j(z_i)$. For instance, if $d = 1$ and we take $g_j(z) = z^j$, $0 \le j \le 2$, we arrive at quadratic regression, $Y_i = \theta_0 + \theta_1 z_i + \theta_2 z_i^2 + \epsilon_i$; see Problem 2.2.24 for more on polynomial regression.

Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter. We return to this in Volume II.

Weighted least squares. In Example 2.2.2 and many similar situations it may not be reasonable to assume that the variances of the errors $\epsilon_i$ are the same for all levels $z_i$ of the covariate variable. However, we may be able to characterize the dependence of $\mathrm{Var}(\epsilon_i)$ on $z_i$ at least up to a multiplicative constant. That is, we can write
\[ \mathrm{Var}(\epsilon_i) = w_i\sigma^2 \qquad (2.2.14) \]
where $\sigma^2$ is unknown as before, but the $w_i$ are known weights. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic). The method of least squares may not be appropriate because (2.2.5) fails. Note that the variables
\[ \tilde Y_i \equiv \frac{Y_i}{\sqrt{w_i}} = \frac{g(\beta, z_i)}{\sqrt{w_i}} + \frac{\epsilon_i}{\sqrt{w_i}}, \quad 1 \le i \le n, \]
are sufficient for the $Y_i$ and that $\mathrm{Var}(\epsilon_i/\sqrt{w_i}) = w_i\sigma^2/w_i = \sigma^2$. Thus, if we set $\tilde g(\beta, z_i) = g(\beta, z_i)/\sqrt{w_i}$, $\tilde\epsilon_i = \epsilon_i/\sqrt{w_i}$, then
\[ \tilde Y_i = \tilde g(\beta, z_i) + \tilde\epsilon_i, \quad 1 \le i \le n \qquad (2.2.15) \]
and the $\tilde Y_i$ satisfy the assumption (2.2.5). The weighted least squares estimate of $\beta$ is now the value $\hat\beta$, which for given $\tilde y_i = y_i/\sqrt{w_i}$ minimizes
\[ \sum_{i=1}^n [\tilde y_i - \tilde g(\beta, z_i)]^2 = \sum_{i=1}^n \frac{1}{w_i}[y_i - g(\beta, z_i)]^2 \qquad (2.2.16) \]
as a function of $\beta$.

Example 2.2.3. Weighted Linear Regression. Consider the case in which $d = 2$, $z_{i1} = 1$, $z_{i2} = z_i$, and $g(\beta, z_i) = \beta_1 + \beta_2 z_i$, $i = 1, \ldots, n$. We need to find the values $\hat\beta_1$ and $\hat\beta_2$ of $\beta_1$ and $\beta_2$ that minimize
\[ \sum_{i=1}^n v_i[y_i - (\beta_1 + \beta_2 z_i)]^2 \qquad (2.2.17) \]
where $v_i = 1/w_i$. This problem may be solved by setting up analogues to the normal equations (2.1.8). We can also use the results on prediction in Section 1.4 as follows. Let $(Z^*, Y^*)$ denote a pair of discrete random variables with possible values $(z_1, y_1), \ldots, (z_n, y_n)$ and probability distribution given by
\[ P[(Z^*, Y^*) = (z_i, y_i)] = u_i, \quad i = 1, \ldots, n \]
where
\[ u_i = v_i \Big/ \sum_{i=1}^n v_i, \quad i = 1, \ldots, n. \]
If $\mu_l(Z^*) = \beta_1 + \beta_2 Z^*$ denotes a linear predictor of $Y^*$ based on $Z^*$, then its MSPE is given by
\[ E[Y^* - \mu_l(Z^*)]^2 = \sum_{i=1}^n u_i[y_i - (\beta_1 + \beta_2 z_i)]^2. \]
It follows that the problem of minimizing (2.2.17) is equivalent to finding the best linear MSPE predictor of $Y^*$. Thus, using Theorem 1.4.3,
\[ \hat\beta_2 = \frac{\mathrm{Cov}(Z^*, Y^*)}{\mathrm{Var}(Z^*)} = \frac{\sum_{i=1}^n u_i z_i y_i - (\sum_{i=1}^n u_i y_i)(\sum_{i=1}^n u_i z_i)}{\sum_{i=1}^n u_i z_i^2 - (\sum_{i=1}^n u_i z_i)^2} \qquad (2.2.18) \]
and
\[ \hat\beta_1 = E(Y^*) - \hat\beta_2 E(Z^*) = \sum_{i=1}^n u_i y_i - \hat\beta_2\sum_{i=1}^n u_i z_i. \]
This computation suggests, as we make precise in Problem 2.2.26, that weighted least squares estimates are also plug-in estimates. □

Next consider finding the $\hat\beta$ that minimizes (2.2.16) for $g(\beta, z_i) = z_i^T\beta$ and for general $d$. By following the steps of Example 2.1.1 leading to (2.1.7), we find (Problem 2.2.27) that $\hat\beta$ satisfy the weighted least squares normal equations
\[ Z_D^T W^{-1} Y = (Z_D^T W^{-1} Z_D)\hat\beta \qquad (2.2.19) \]
where $W = \mathrm{diag}(w_1, \ldots, w_n)$ and $Z_D = \|z_{ij}\|_{n \times d}$ is the design matrix. When $Z_D$ has rank $d$ and $w_i > 0$, $1 \le i \le n$, we can write
\[ \hat\beta = (Z_D^T W^{-1} Z_D)^{-1} Z_D^T W^{-1} Y. \qquad (2.2.20) \]

Remark 2.2.2. More generally, we may allow for correlation between the errors $\{\epsilon_i\}$. That is, suppose $\mathrm{Var}(\epsilon) = \sigma^2 W$ for some invertible matrix $W_{n \times n}$. Then it can be shown (Problem 2.2.28) that the model $Y = Z_D\beta + \epsilon$ can be transformed to one satisfying (2.2.1) and (2.2.4)-(2.2.6). Moreover, when $g(\beta, z) = z^T\beta$, the $\hat\beta$ minimizing the least squares contrast in this transformed model is given by (2.2.19) and (2.2.20).
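The prediction argument of Example 2.2.3 translates directly into a few lines of code. The sketch below is illustrative: the function name is hypothetical, and the second set of weights ($w_i = z_i$) is an assumption made only to show a heteroscedastic fit. With equal weights the formulas reduce to ordinary least squares, so the phosphorus data of Example 2.2.2 should return values close to the $\hat\beta_1 = 61.58$ and $\hat\beta_2 = 1.42$ quoted there.

```python
def weighted_line_fit(z, y, w):
    """Weighted least squares fit of y = b1 + b2 * z when Var(eps_i) = w_i * sigma^2.

    Uses b2 = Cov(Z*, Y*) / Var(Z*) and b1 = E(Y*) - b2 * E(Z*), where (Z*, Y*)
    takes the value (z_i, y_i) with probability u_i = v_i / sum(v), v_i = 1 / w_i.
    """
    v = [1.0 / wi for wi in w]
    total = sum(v)
    u = [vi / total for vi in v]
    ez = sum(ui * zi for ui, zi in zip(u, z))
    ey = sum(ui * yi for ui, yi in zip(u, y))
    cov_zy = sum(ui * zi * yi for ui, zi, yi in zip(u, z, y)) - ez * ey
    var_z = sum(ui * zi * zi for ui, zi in zip(u, z)) - ez ** 2
    b2 = cov_zy / var_z
    b1 = ey - b2 * ez
    return b1, b2


# Phosphorus data of Example 2.2.2.
z = [1, 4, 5, 9, 11, 13, 23, 23, 28]
y = [64, 71, 54, 81, 76, 93, 77, 95, 109]

# Equal weights: ordinary least squares.
b1_equal, b2_equal = weighted_line_fit(z, y, [1.0] * len(z))

# Hypothetical weights w_i = z_i (error variance proportional to the covariate).
b1_w, b2_w = weighted_line_fit(z, y, [float(zi) for zi in z])
print(round(b1_equal, 2), round(b2_equal, 2))
```

Changing the weights changes which observations dominate the fit; observations with small $w_i$ (small error variance) receive large $u_i$.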
which again enables us to analyze the behavior of $\hat\theta$ using known properties of sums of independent random variables. Evidently, there may be solutions of (2.2.27) that are not maxima or only local maxima, and as we have seen in Example 2.2.4, situations with $\hat\theta$ well defined but (2.2.27) doesn't make sense. Nevertheless, the dual point of view of (2.2.22) and (2.2.27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section.

Here are two simple examples with $\theta$ real.

Example 2.2.6. Consider a population with three kinds of individuals labeled 1, 2, and 3 and occurring in the Hardy-Weinberg proportions
\[ p(1, \theta) = \theta^2, \quad p(2, \theta) = 2\theta(1 - \theta), \quad p(3, \theta) = (1 - \theta)^2 \]
where $0 < \theta < 1$ (see Example 2.1.4). If we observe a sample of three individuals and obtain $x_1 = 1$, $x_2 = 2$, $x_3 = 1$, then
\[ L_x(\theta) = p(1, \theta)p(2, \theta)p(1, \theta) = 2\theta^5(1 - \theta). \]
The likelihood equation is
\[ \frac{\partial}{\partial\theta} l_x(\theta) = \frac{5}{\theta} - \frac{1}{1 - \theta} = 0, \]
which has the unique solution $\hat\theta = \frac{5}{6}$. Because
\[ \frac{\partial^2}{\partial\theta^2} l_x(\theta) = -\frac{5}{\theta^2} - \frac{1}{(1 - \theta)^2} < 0 \]
for all $\theta \in (0, 1)$, $\frac{5}{6}$ maximizes $L_x(\theta)$. In general, let $n_1$, $n_2$, and $n_3$ denote the number of $\{x_1, \ldots, x_n\}$ equal to 1, 2 and 3, respectively. Then the same calculation shows that if $2n_1 + n_2$ and $n_2 + 2n_3$ are both positive, the maximum likelihood estimate exists and is given by
\[ \hat\theta(x) = \frac{2n_1 + n_2}{2n}. \qquad (2.2.28) \]
If $2n_1 + n_2$ is zero, the likelihood is $(1 - \theta)^{2n}$, which is maximized by $\theta = 0$, so the MLE does not exist because $\Theta = (0, 1)$. Similarly, the MLE does not exist if $n_2 + 2n_3 = 0$. □

Example 2.2.7. Let $X$ denote the number of customers arriving at a service counter during $n$ hours. If we make the usual simplifying assumption that the arrivals form a Poisson process, then $X$ has a Poisson distribution with parameter $n\lambda$, where $\lambda$ represents the expected number of arrivals in an hour or, equivalently, the rate of arrival. In practice, $\lambda$ is an unknown positive constant and we wish to estimate $\lambda$ using $X$. Here $X$ takes on values $\{0, 1, 2, \ldots\}$ with probabilities,
\[ p(x, \lambda) = \frac{e^{-\lambda n}(\lambda n)^x}{x!}, \quad x = 0, 1, \ldots \qquad (2.2.29) \]
The likelihood equation is
\[ \frac{\partial}{\partial\lambda} l_x(\lambda) = \frac{x}{\lambda} - n = 0, \]
which has the unique solution $\hat\lambda = x/n$. If $x$ is positive, this estimate is the MLE of $\lambda$ (see (2.2.30)). If $x = 0$ the MLE does not exist. However, the maximum is approached as $\lambda \downarrow 0$. □

To apply the likelihood equation successfully we need to know when a solution is an MLE. A sufficient condition, familiar from calculus, is that $l$ be concave in $\theta$. If $l$ is twice differentiable, this is well known to be equivalent to
\[ \frac{\partial^2}{\partial\theta^2} l_x(\theta) \le 0, \qquad (2.2.30) \]
for all $\theta$. This is the condition we applied in Example 2.2.6. A similar condition applies for vector parameters.

Example 2.2.8. Multinomial Trials. As in Example 1.6.7, consider an experiment with $n$ i.i.d. trials in which each trial can produce a result in one of $k$ categories. Let $X_i = j$ if the $i$th trial produces a result in the $j$th category, let $\theta_j = P(X_i = j)$ be the probability of the $j$th category, and let $N_j = \sum_{i=1}^n 1[X_i = j]$ be the number of observations in the $j$th category. We assume that $n \ge k - 1$. Then, for an experiment in which we observe $n_j = \sum_{i=1}^n 1[x_i = j]$, $p(x, \theta) = \prod_{j=1}^k \theta_j^{n_j}$, and
\[ l_x(\theta) = \sum_{j=1}^k n_j\log\theta_j, \quad \theta \in \Theta = \Big\{\theta : \theta_j \ge 0, \sum_{j=1}^k \theta_j = 1\Big\}. \qquad (2.2.31) \]
To obtain the MLE $\hat\theta$ we consider $l$ as a function of $\theta_1, \ldots, \theta_{k-1}$ with
\[ \theta_k = 1 - \sum_{j=1}^{k-1}\theta_j. \qquad (2.2.32) \]
We first consider the case with all the $n_j$ positive. Then $p(x, \theta) = 0$ if any of the $\theta_j$ are zero; thus, the MLE must have all $\hat\theta_j > 0$, and must satisfy the likelihood equations
\[ \frac{\partial}{\partial\theta_j} l_x(\theta) = \frac{\partial}{\partial\theta_j}\sum_{l=1}^k n_l\log\theta_l = \sum_{l=1}^k \frac{n_l}{\theta_l}\frac{\partial\theta_l}{\partial\theta_j} = 0, \quad j = 1, \ldots, k - 1. \]
By (2.2.32), $\partial\theta_k/\partial\theta_j = -1$, and the equation becomes $\hat\theta_k/\hat\theta_j = n_k/n_j$. Next use (2.2.32) to find
\[ \hat\theta_j = \frac{n_j}{n}, \quad j = 1, \ldots, k. \]
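The closed form $\hat\theta_j = n_j/n$ can be checked numerically before turning to the concavity argument. The sketch below is illustrative only: the counts are hypothetical, and the comparison against random points of the simplex is a sanity check, not a proof.

```python
import math
import random


def multinomial_loglik(theta, counts):
    """l_x(theta) = sum_j n_j * log(theta_j); categories with n_j = 0 contribute nothing."""
    return sum(n * math.log(t) for n, t in zip(counts, theta) if n > 0)


def multinomial_mle(counts):
    """Closed-form MLE theta_j = n_j / n."""
    n = sum(counts)
    return [nj / n for nj in counts]


counts = [5, 3, 2]  # hypothetical counts, all positive
theta_hat = multinomial_mle(counts)

# Compare with 1000 random points in the open simplex (fixed seed for reproducibility).
rng = random.Random(0)
competitors = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in counts]
    total = sum(raw)
    competitors.append([r / total for r in raw])

loglik_at_mle = multinomial_loglik(theta_hat, counts)
max_competitor = max(multinomial_loglik(t, counts) for t in competitors)
print(theta_hat, loglik_at_mle >= max_competitor)
```

No random competitor attains a larger log likelihood than the relative frequencies, as the likelihood equations predict.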
To show that this $\hat\theta$ maximizes $l_x(\theta)$, we check the concavity of $l_x(\theta)$: let $1 \le r \le k - 1$, $1 \le j \le k - 1$, then
\[ \frac{\partial}{\partial\theta_r}\frac{\partial}{\partial\theta_j} l_x(\theta) = \frac{\partial}{\partial\theta_r}\Big(\frac{n_j}{\theta_j} - \frac{n_k}{\theta_k}\Big) = \begin{cases} -\Big(\dfrac{n_r}{\theta_r^2} + \dfrac{n_k}{\theta_k^2}\Big) < 0, & r = j \\[2mm] -\dfrac{n_k}{\theta_k^2} < 0, & r \ne j. \end{cases} \qquad (2.2.33) \]
It follows that in this $n_j > 0$, $\theta_j > 0$ case, $l_x(\theta)$ is strictly concave and $\hat\theta$ is the unique maximizer of $l_x(\theta)$. Next suppose that $n_j = 0$ for some $j$. Then $\hat\theta$ with $\hat\theta_j = n_j/n$, $j = 1, \ldots, k$, is still the unique MLE of $\theta$. See Problem 2.2.30. The $0 < \theta_j < 1$, $1 \le j \le k$, version of this example will be considered in the exponential family case in Section 2.3. □

Example 2.2.9. Suppose that $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\mu$ and $\sigma^2$ both unknown. Using the concavity argument, we find that for $n \ge 2$ the unique MLEs of $\mu$ and $\sigma^2$ are $\hat\mu = \bar X$ and $\hat\sigma^2 = n^{-1}\sum_{i=1}^n (X_i - \bar X)^2$ (Problem 2.2.11(a)). □

Maximum likelihood and least squares

We conclude with the link between least squares and maximum likelihood. Suppose the model $\mathcal{P}_0$ of Example 2.1.1 holds and $X = (Y_1, \ldots, Y_n)^T$. Then
\[ l_X(\beta) = \log\prod_{i=1}^n \frac{1}{\sigma_0}\varphi\Big(\frac{Y_i - g(\beta, z_i)}{\sigma_0}\Big) = -\frac{n}{2}\log(2\pi\sigma_0^2) - \frac{1}{2\sigma_0^2}\sum_{i=1}^n [Y_i - g(\beta, z_i)]^2. \qquad (2.2.34) \]
Evidently maximizing $l_X(\beta)$ is equivalent to minimizing $\sum_{i=1}^n [Y_i - g(\beta, z_i)]^2$. Thus, least squares estimates are maximum likelihood for the particular model $\mathcal{P}_0$. As we have seen and shall see more in Chapter 6, these estimates viewed as an algorithm applied to the set of data $X$ make sense much more generally. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of $\beta$ for the model $Y_i$ independent $N(g(\beta, z_i), w_i\sigma_0^2)$, $1 \le i \le n$. More generally, we can consider minimizing $\sum_{i,j}[Y_i - g(\beta, z_i)][Y_j - g(\beta, z_j)]w_{ij}$ where $W = \|w_{ij}\|_{n \times n}$ is a symmetric positive definite matrix, as maximum likelihood estimates for $\beta$ when $Y$ is distributed as $N_n((g(\beta, z_1), \ldots, g(\beta, z_n))^T, \sigma_0^2 W^{-1})$; see Problem 2.2.28.

Summary. In Section 2.2.1 we consider least squares estimators (LSEs) obtained by minimizing a contrast of the form $\sum_{i=1}^n [Y_i - g_i(\beta)]^2$, where $E(Y_i) = g_i(\beta)$, $i = 1, \ldots, n$, $g_i$ are known functions and $\beta$ is a parameter to be estimated from the independent observations $Y_1, \ldots, Y_n$, where $\mathrm{Var}(Y_i)$ does not depend on $i$. This approach is applied to experiments in which for the $i$th case in a study the mean of the response $Y_i$ depends on
a set of available covariate values $z_{i1}, \ldots, z_{id}$. In particular we consider the case with $g_i(\beta) = \sum_{j=1}^d z_{ij}\beta_j$ and give the LSE of $\beta$ in the case in which $\|z_{ij}\|_{n \times d}$ is of rank $d$. Extensions to weighted least squares, which are appropriate when $\mathrm{Var}(Y_i)$ depends on $i$ or the $Y$'s are correlated, are given. In Section 2.2.2 we consider maximum likelihood estimators (MLEs) $\hat\theta$ that are defined as maximizers of the likelihood $L_x(\theta) = p(x, \theta)$. These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and Kullback-Leibler information divergence. In the case of independent response variables $Y_i$ that are modeled to have a $N(g_i(\beta), \sigma^2)$ distribution, it is shown that the MLEs coincide with the LSEs.

2.3 MAXIMUM LIKELIHOOD IN MULTIPARAMETER EXPONENTIAL FAMILIES

Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly. This is largely a consequence of the strict concavity of the log likelihood in the natural parameter $\eta$, though the results of Theorems 1.6.3 and 1.6.4 and Corollaries 1.6.1 and 1.6.2 and other exponential family properties also play a role. Concavity also plays a crucial role in the analysis of algorithms in the next section. Properties that derive solely from concavity are given in Proposition 2.3.1.

We start with a useful general framework and lemma. Suppose $\Theta \subset R^p$ is an open set. Let $\partial\Theta = \bar\Theta - \Theta$ be the boundary of $\Theta$, where $\bar\Theta$ denotes the closure of $\Theta$ in $[-\infty, \infty]^p$. That is, $\partial\Theta$ is the set of points outside of $\Theta$ that can be obtained as limits of points in $\Theta$, including all points with $\pm\infty$ as a coordinate. For instance, if $X \sim N(\theta_1, \theta_2)$, $\Theta = R \times R^+$ and
\[ \partial\Theta = \{(a, b) : a = \pm\infty,\ 0 \le b \le \infty\} \cup \{(a, b) : a \in R,\ b \in \{0, \infty\}\}. \]
In general, for a sequence $\{\theta_m\}$ of points from $\Theta$ open, we define $\theta_m \to \partial\Theta$ as $m \to \infty$ to mean that for any subsequence $\{\theta_{m_k}\}$ either $\theta_{m_k} \to t$ with $t \notin \Theta$, or $\theta_{m_k}$ diverges with $|\theta_{m_k}| \to \infty$, as $k \to \infty$, where $|\cdot|$ denotes the Euclidean norm. For instance, in the $N(\theta_1, \theta_2)$ case, $(a, m^{-1})$, $(m, b)$, $(-m, b)$, $(a, m)$, $(m, m^{-1})$ all tend to $\partial\Theta$ as $m \to \infty$.

Lemma 2.3.1. Suppose we are given a function $l : \Theta \to R$ where $\Theta \subset R^p$ is open and $l$ is continuous. Suppose also that
\[ \lim\{l(\theta) : \theta \to \partial\Theta\} = -\infty. \qquad (2.3.1) \]
Then there exists $\hat\theta \in \Theta$ such that
\[ l(\hat\theta) = \max\{l(\theta) : \theta \in \Theta\}. \]

Proof. See Problem 2.3.5.

Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2.3.1 only. Formally,
Proposition 2.3.1. Suppose $X \sim \{P_\theta : \theta \in \Theta\}$, $\Theta$ open $\subset R^p$, with corresponding densities $p(x, \theta)$. If further $l_x(\theta) \equiv \log p(x, \theta)$ is strictly concave and $l_x(\theta) \to -\infty$ as $\theta \to \partial\Theta$, then the MLE $\hat\theta(x)$ exists and is unique.

Proof. From (B.9) we know that $\theta \to l_x(\theta)$ is continuous on $\Theta$. By Lemma 2.3.1, $\hat\theta(x)$ exists. To prove uniqueness note that if $\hat\theta_1$ and $\hat\theta_2$ are distinct maximizers, then $l_x(\frac{1}{2}(\hat\theta_1 + \hat\theta_2)) > \frac{1}{2}(l_x(\hat\theta_1) + l_x(\hat\theta_2)) = l_x(\hat\theta_1)$, a contradiction. □

Applications of this theorem are given in Problems 2.3.8 and 2.3.12.

We can now prove the following.

Theorem 2.3.1. Suppose $\mathcal{P}$ is the canonical exponential family generated by $(T, h)$ and that

(i) The natural parameter space, $\mathcal{E}$, is open.

(ii) The family is of rank $k$.

Let $x$ be the observed data vector and set $t_0 = T(x)$.

(a) If $t_0 \in R^k$ satisfies(1)
\[ P[c^T T(X) > c^T t_0] > 0 \quad \text{for all } c \ne 0 \qquad (2.3.2) \]
then the MLE $\hat\eta$ exists, is unique, and is a solution to the equation
\[ \dot A(\eta) = E_\eta(T(X)) = t_0. \qquad (2.3.3) \]

(b) Conversely, if $t_0$ doesn't satisfy (2.3.2), then the MLE doesn't exist and (2.3.3) has no solution.

We, thus, have a necessary and sufficient condition for existence and uniqueness of the MLE given the data.

Define the convex support of a probability $P$ to be the smallest convex set $C$ such that $P(C) = 1$.

Corollary 2.3.1. Suppose the conditions of Theorem 2.3.1 hold. If $C_T$ is the convex support of the distribution of $T(X)$, then $\hat\eta$ exists and is unique iff $t_0 \in C_T^0$ where $C_T^0$ is the interior of $C_T$.

Proof of Theorem 2.3.1. We give the proof for the continuous case.

Existence and Uniqueness of the MLE $\hat\eta$. Without loss of generality we can suppose $h(x) = p(x, \eta_0)$ for some reference $\eta_0 \in \mathcal{E}$ (see Problem 1.6.27). Furthermore, we may also assume that $t_0 = T(x) = 0$ because $\mathcal{P}$ is the same as the exponential family generated by $T(x) - t_0$. Then, if $l_x(\eta) \equiv \log p(x, \eta)$ with $T(x) = 0$,
\[ l_x(\eta) = -A(\eta). \]
We show that if $\{\eta_m\}$ has no subsequence converging to a point in $\mathcal{E}$, then $l_x(\eta_m) \to -\infty$, which implies existence of $\hat\eta$ by Lemma 2.3.1. Write $\eta_m = \lambda_m u_m$, $u_m = \eta_m/\|\eta_m\|$, $\lambda_m = \|\eta_m\|$.
Then, if $\{\eta_m\}$ has no subsequence converging in $\mathcal{E}$ it must have a subsequence $\{\eta_{m_k}\}$ that obeys either case 1 or 2 as follows.

Case 1: $\lambda_{m_k} \to \infty$, $u_{m_k} \to u$. Write $E_0$ for $E_{\eta_0}$ and $P_0$ for $P_{\eta_0}$. Then
\[ \liminf_k E_0 e^{\lambda_{m_k}u_{m_k}^T T(X)} \ge \liminf_k e^{\lambda_{m_k}\delta} P_0[u_{m_k}^T T(X) > \delta] \ge \liminf_k e^{\lambda_{m_k}\delta} P_0[u^T T(X) > \delta] = \infty \]
because for some $\delta > 0$, $P_0[u^T T(X) > \delta] > 0$. So we have
\[ A(\eta_{m_k}) = \log E_0 e^{\eta_{m_k}^T T(X)} \to \infty \quad \text{and} \quad l_x(\eta_{m_k}) \to -\infty. \]

Case 2: $\lambda_{m_k} \to \lambda$, $u_{m_k} \to u$. Then $\lambda u \notin \mathcal{E}$ by assumption. So
\[ \liminf_k E_0 e^{\lambda_{m_k}u_{m_k}^T T(X)} \ge E_0 e^{\lambda u^T T(X)} = \infty. \]
In either case $\lim_{m_k} l_x(\eta_{m_k}) = -\infty$. Because any subsequence of $\{\eta_m\}$ has no subsequence converging in $\mathcal{E}$ we conclude $l_x(\eta_m) \to -\infty$ and $\hat\eta$ exists. It is unique and satisfies (2.3.3) by Theorem 1.6.4.

Nonexistence: if (2.3.2) fails, there exists $c \ne 0$ such that $P_0[c^T T \le 0] = 1 \Rightarrow E_\eta(c^T T(X)) \le 0$, for all $\eta$. If $\hat\eta$ exists then $E_{\hat\eta}T = 0 \Rightarrow E_{\hat\eta}(c^T T) = 0 \Rightarrow P_{\hat\eta}[c^T T = 0] = 1$, contradicting the assumption that the family is of rank $k$. □

Proof of Corollary 2.3.1. By (B.9.1) a point $t_0$ belongs to the interior $C^0$ of a convex set $C$ iff there exist points in $C^0$ on either side of it, that is, iff, for every $d \ne 0$, both $\{t : d^T t > d^T t_0\} \cap C^0$ and $\{t : d^T t < d^T t_0\} \cap C^0$ are nonempty open sets. The equivalence of (2.3.2) and Corollary 2.3.1 follow. □

Example 2.3.1. The Gaussian Model. Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, $\mu \in R$, $\sigma^2 > 0$. As we observed in Example 1.6.5, this is the exponential family generated by $T(X) = (\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2)$ and 1. Evidently, $C_T = R \times R^+$. For $n \ge 2$, $T(X)$ has a density and, thus, $C_T = C_T^0$ and the MLE always exists. For $n = 1$, $C_T^0 = \emptyset$ because $T(X_1)$ is always a point on the parabola $T_2 = T_1^2$ and the MLE does not exist. This is equivalent to the fact that if $n = 1$ the formal solution to the likelihood equations gives $\hat\sigma^2 = 0$, which is impossible. □

In fact, existence of MLEs when $T$ has a continuous case density is a general phenomenon.

Theorem 2.3.2. Suppose the conditions of Theorem 2.3.1 hold and $T_{k \times 1}$ has a continuous case density on $R^k$. Then the MLE $\hat\eta$ exists with probability 1 and necessarily satisfies (2.3.3).
Proof. The boundary of a convex set necessarily has volume 0 (Problem 2.3.9), thus, if $T$ has a continuous case density $p_T(t)$, then
\[ P[T \in \partial C_T] = \int_{\partial C_T} p_T(t)\,dt = 0 \]
and the result follows from Corollary 2.3.1. □

Remark 2.3.1. From Theorem 1.6.3 we know that $E_\eta T(X) = \dot A(\eta)$. Thus, using (2.3.3), the MLE $\hat\eta$ in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2.1.13 and the next example). When method of moments and frequency substitution estimates are not unique, the maximum likelihood principle in many cases selects the "best" estimate among them. For instance, in the Hardy-Weinberg examples 2.1.4 and 2.2.6, $\hat\theta_1 = \sqrt{n_1/n}$, $\hat\theta_2 = 1 - \sqrt{n_3/n}$ and $\hat\theta_3 = (2n_1 + n_2)/2n$ are frequency substitution estimates (Problem 2.1.1), but only $\hat\theta_3$ is a MLE. In Example 3.4.4 we will see that $\hat\theta_3$ is, in a certain sense, the best estimate of $\theta$.

A nontrivial application of Theorem 2.3.2 follows.

Example 2.3.2. The Two-Parameter Gamma Family. Suppose $X_1, \ldots, X_n$ are i.i.d. with density $g_{p,\lambda}(x) = \frac{\lambda^p}{\Gamma(p)}e^{-\lambda x}x^{p-1}$, $x > 0$, $p > 0$, $\lambda > 0$. This is a rank 2 canonical exponential family generated by $T = (\sum \log X_i, \sum X_i)$, $h(x) = x^{-1}$, with
\[ \eta_1 = p, \quad \eta_2 = -\lambda, \quad A(\eta_1, \eta_2) = n(\log\Gamma(\eta_1) - \eta_1\log(-\eta_2)) \]
by Problem 2.3.2(a). The likelihood equations are equivalent to (Problem 2.3.2(b))
\[ \frac{\Gamma'(\hat p)}{\Gamma(\hat p)} - \log\hat\lambda = \overline{\log X} \qquad (2.3.4) \]
\[ \frac{\hat p}{\hat\lambda} = \bar X \qquad (2.3.5) \]
where $\overline{\log X} \equiv \frac{1}{n}\sum_{i=1}^n \log X_i$. It is easy to see that if $n \ge 2$, $T$ has a density. We conclude from Theorem 2.3.2 that (2.3.4) and (2.3.5) have a unique solution with probability 1. How to find such nonexplicit solutions is discussed in Section 2.4. □

If $T$ is discrete MLEs need not exist. Here is an example.

Example 2.3.3. Multinomial Trials. We follow the notation of Example 1.6.7. The statistic of rank $k - 1$ which generates the family is $T_{(k-1)} = (T_1, \ldots, T_{k-1})^T$, where $T_j(X) = \sum_{i=1}^n 1(X_i = j)$, $1 \le j \le k$. We assume $n \ge k - 1$ and verify using Theorem 2.3.1 that in this case MLEs of $\eta_j = \log(\lambda_j/\lambda_k)$, $1 \le j \le k - 1$, where $0 < \lambda_j \equiv P[X = j] < 1$, exist iff all $T_j > 0$. They are determined by $\hat\lambda_j = T_j/n$, $1 \le j \le k$. To see this note that $T_j > 0$, $1 \le j \le k$ iff $0 < T_j < n$, $1 \le j \le k$. Thus, if we write $c^T t_0 = \sum\{c_j t_{j0} : c_j > 0\} + \sum\{c_j t_{j0} : c_j < 0\}$ we can increase $c^T t_0$ by replacing a $t_{j0}$ by $t_{j0} + 1$ in the first sum or a $t_{j0}$ by $t_{j0} - 1$ in the second. Because the resulting value of $t$ is possible if $0 < t_{j0} < n$, $1 \le j \le k$, and one of the two sums is nonempty because $c \ne 0$, we see that (2.3.2) holds.
On the other hand, if any $T_j = 0$ or $n$, $1 \le j \le k - 1$, we can obtain a contradiction to (2.3.2) by taking $c_i = -1(i = j)$, $1 \le i \le k - 1$, when $T_j = 0$ (and $c_i = 1(i = j)$ when $T_j = n$). The remaining case $T_k = 0$ gives a contradiction if $c = (1, 1, \ldots, 1)^T$. Alternatively we can appeal to Corollary 2.3.1 directly (Problem 2.3.10). □

Remark 2.3.2. In Example 2.2.8 we saw that in the multinomial case with the closed parameter set $\{\lambda_j : \lambda_j \ge 0, \sum_{j=1}^k \lambda_j = 1\}$, $n \ge k - 1$, the MLEs of $\lambda_j$, $j = 1, \ldots, k$, exist and are unique. However, when we put the multinomial in canonical exponential family form, our parameter set is open. Similarly, note that in the Hardy-Weinberg Example 2.2.6, if $2n_1 + n_2 = 0$, the MLE does not exist if $\Theta = (0, 1)$, whereas if $\Theta = [0, 1]$ it does exist and is unique. □

The argument of Example 2.3.3 can be applied to determine existence in cases for which (2.3.3) does not have a closed-form solution as in Example 1.6.8; see Problem 2.3.1 and Haberman (1974).

In some applications, for example, the bivariate normal case (Problem 2.3.13), the following corollary to Theorem 2.3.1 is useful.

Corollary 2.3.2. Consider the exponential family
\[ p(x, \theta) = h(x)\exp\Big\{\sum_{j=1}^k c_j(\theta)T_j(x) - B(\theta)\Big\}, \quad x \in \mathcal{X},\ \theta \in \Theta. \]
Let $C^0$ denote the interior of the range of $(c_1(\theta), \ldots, c_k(\theta))^T$ and let $x$ be the observed data. If the equations
\[ E_\theta T_j(X) = T_j(x), \quad j = 1, \ldots, k \]
have a solution $\hat\theta(x) \in C^0$, then it is the unique MLE of $\theta$.

When $\mathcal{P}$ is not an exponential family both existence and unicity of MLEs become more problematic. The following result can be useful. Let $\mathcal{Q} = \{P_\theta : \theta \in \Theta\}$, $\Theta$ open $\subset R^m$, $m \le k - 1$, be a curved exponential family
\[ p(x, \theta) = \exp\{c^T(\theta)T(x) - A(c(\theta))\}h(x). \qquad (2.3.6) \]
Suppose $c : \Theta \to \mathcal{E} \subset R^k$ has a differential $\dot c(\theta) \equiv \big\|\frac{\partial c_j}{\partial\theta_i}(\theta)\big\|_{m \times k}$ on $\Theta$. Here $\mathcal{E}$ is the natural parameter space of the exponential family $\mathcal{P}$ generated by $(T, h)$. Then

Theorem 2.3.3. If $\mathcal{P}$ above satisfies the condition of Theorem 2.3.1, $c(\Theta)$ is closed in $\mathcal{E}$ and $T(x) = t_0$ satisfies (2.3.2) so that the MLE $\hat\eta$ in $\mathcal{P}$ exists, then so does the MLE $\hat\theta$ in $\mathcal{Q}$ and it satisfies the likelihood equation
\[ \dot c(\hat\theta)(t_0 - \dot A(c(\hat\theta))) = 0. \qquad (2.3.7) \]
Note that $c(\hat\theta) \in c(\Theta)$ and is in general not $\hat\eta$. Unfortunately strict concavity of $l_x$ is not inherited by curved exponential families, and unicity can be lost; take $c$ not one-to-one for instance.
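The proof is sketched in Problem 2.3.11.

Example 2.3.4. Gaussian with Fixed Signal to Noise. As in Example 1.6.9, suppose $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\mu/\sigma = \lambda_0 > 0$ known. This is a curved exponential family with $c_1(\mu) = \lambda_0^2/\mu$, $c_2(\mu) = -\lambda_0^2/(2\mu^2)$, $\mu > 0$, corresponding to $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/(2\sigma^2)$. Evidently $c(\Theta) = \{(\eta_1, \eta_2) : \eta_2 = -\frac{1}{2}\eta_1^2\lambda_0^{-2},\ \eta_1 > 0,\ \eta_2 < 0\}$, which is closed in $\mathcal{E} = \{(\eta_1, \eta_2) : \eta_1 \in R,\ \eta_2 < 0\}$. As a consequence of Theorems 2.3.2 and 2.3.3, we can conclude that an MLE $\hat\mu$ always exists and satisfies (2.3.7) if $n \ge 2$. We find
\[ \dot c(\mu) = \lambda_0^2(-\mu^{-2}, \mu^{-3}), \]
and from Example 1.6.5
\[ \dot A(\eta) = \tfrac{1}{2}n\big(-\eta_1/\eta_2,\ \eta_1^2/(2\eta_2^2) - 1/\eta_2\big)^T. \]
Thus, with $t_1 = \sum x_i$ and $t_2 = \sum x_i^2$, Equation (2.3.7) becomes
\[ \lambda_0^2(-\mu^{-2}, \mu^{-3})\big(t_1 - n\mu,\ t_2 - n(\mu^2 + \lambda_0^{-2}\mu^2)\big)^T = 0, \]
which with $\hat\mu_2 = n^{-1}\sum x_i^2$ simplifies to
\[ \mu^2 + \lambda_0^2\bar x\mu - \lambda_0^2\hat\mu_2 = 0 \]
\[ \hat\mu_\pm = \tfrac{1}{2}\Big[-\lambda_0^2\bar x \pm \lambda_0\sqrt{\lambda_0^2\bar x^2 + 4\hat\mu_2}\Big]. \]
Note that $\hat\mu_+\hat\mu_- = -\lambda_0^2\hat\mu_2 < 0$, which implies $\hat\mu_+ > 0$, $\hat\mu_- < 0$. Because $\mu > 0$, the solution we seek is $\hat\mu_+$. □

Example 2.3.5. Location-Scale Regression. Suppose that $Y_{j1}, \ldots, Y_{jm}$, $j = 1, \ldots, n$, are $n$ independent random samples, where $Y_{jl} \sim N(\mu_j, \sigma_j^2)$. Using Examples 1.6.5 and 1.6.10, we see that the distribution of $\{Y_{jl} : j = 1, \ldots, n,\ l = 1, \ldots, m\}$ is a $2n$-parameter canonical exponential family with $\eta_i = \mu_i/\sigma_i^2$, $\eta_{n+i} = -1/(2\sigma_i^2)$, $i = 1, \ldots, n$, generated by $h(Y) = 1$ and
\[ T(Y) = \Big(\sum_{l=1}^m Y_{1l}, \ldots, \sum_{l=1}^m Y_{nl}, \sum_{l=1}^m Y_{1l}^2, \ldots, \sum_{l=1}^m Y_{nl}^2\Big)^T. \]
Next suppose, as in Example 1.6.10, that
\[ \mu_i = \theta_1 + \theta_2 z_i, \quad \sigma_i^2 = \theta_3(\theta_1 + \theta_2 z_i)^2, \quad z_1 < \cdots < z_n \]
where $z_1, \ldots, z_n$ are given constants. Now $p(y, \theta)$ is a curved exponential family of the form (2.3.6) with
\[ c_i(\theta) = \theta_3^{-1}(\theta_1 + \theta_2 z_i)^{-1}, \quad c_{n+i}(\theta) = -\tfrac{1}{2}\theta_3^{-1}(\theta_1 + \theta_2 z_i)^{-2}, \quad i = 1, \ldots, n. \]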
If $m \ge 2$, then the full $2n$-parameter model satisfies the conditions of Theorem 2.3.1. Let $\mathcal{E}$ be the canonical parameter set for this full model and let
\[ \Theta = \{\theta : \theta_1 \in R,\ \theta_2 \in R,\ \theta_3 > 0\}. \]
Then $c(\Theta)$ is closed in $\mathcal{E}$ and we can conclude that for $m \ge 2$, an MLE $\hat\theta$ of $\theta$ exists and $\hat\theta$ satisfies (2.3.7). □

Summary. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with $\mathcal{E}$ open (Theorem 2.3.1 and Corollary 2.3.1). These results lead to a necessary condition for existence of the MLE in curved exponential families but without a guarantee of unicity or sufficiency. Finally, the basic property making Theorem 2.3.1 work, strict concavity, is isolated and shown to apply to a broader class of models.

2.4 ALGORITHMIC ISSUES

As we have seen, even in the context of canonical multiparameter exponential families, such as the two-parameter gamma, MLEs may not be given explicitly by formulae but only implicitly as the solutions of systems of nonlinear equations. In fact, even in the classical regression model with design matrix $Z_D$ of full rank $d$, the formula (2.1.10) for $\hat\beta$ is easy to write down symbolically but not easy to evaluate if $d$ is at all large because inversion of $Z_D^T Z_D$ requires on the order of $nd^2$ operations to evaluate each of $d(d + 1)/2$ terms with $n$ operations to get $Z_D^T Z_D$ and then, if implemented as usual, order $d^3$ operations to invert. The packages that produce least squares estimates do not in fact use formula (2.1.10).

It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. However, in this section, we will discuss three algorithms of a type used in different statistical contexts both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all, at various times, entrust ourselves.

We begin with the bisection and coordinate ascent methods, which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2.3.1.

2.4.1 The Method of Bisection

The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in $k$-parameter exponential families. Given $f$ continuous on $(a, b)$, $f \uparrow$ strictly, $f(a+) < 0 < f(b-)$, then, by the intermediate value theorem, there exists unique $x^* \in (a, b)$ such that $f(x^*) = 0$. Here, in pseudocode, is the bisection algorithm to find $x^*$.

Given tolerance $\epsilon > 0$ for $|x_{\text{final}} - x^*|$:

Find $x_0 < x_1$, $f(x_0) < 0 < f(x_1)$ by taking $|x_0|$, $|x_1|$ large enough. Initialize $x^+_{\text{old}} = x_1$, $x^-_{\text{old}} = x_0$.
(1) If $|x^+_{\text{old}} - x^-_{\text{old}}| < 2\varepsilon$, $x_{\text{final}} = \tfrac{1}{2}(x^+_{\text{old}} + x^-_{\text{old}})$ and return $x_{\text{final}}$.
(2) Else, $x_{\text{new}} = \tfrac{1}{2}(x^+_{\text{old}} + x^-_{\text{old}})$.
(3) If $f(x_{\text{new}}) = 0$, $x_{\text{final}} = x_{\text{new}}$.
(4) If $f(x_{\text{new}}) < 0$, $x^-_{\text{old}} = x_{\text{new}}$.
(5) If $f(x_{\text{new}}) > 0$, $x^+_{\text{old}} = x_{\text{new}}$.
Go to (1).
End

Lemma 2.4.1. The bisection algorithm stops at a solution $x_{\text{final}}$ such that

$$|x_{\text{final}} - x^*| \le \varepsilon.$$

Proof. If $x_m$ is the $m$th iterate of $x_{\text{new}}$,

(1) $|x_m - x_{m-1}| \le \tfrac{1}{2}|x_{m-1} - x_{m-2}| \le \cdots \le \dfrac{1}{2^{m-1}}|x_1 - x_0|$.

Moreover, by the intermediate value theorem,

(2) $x^*$ lies in the current interval $[x^-_{\text{old}}, x^+_{\text{old}}]$, of which $x_m$ is an endpoint and $x_{m+1}$ the midpoint, so that $|x_{m+1} - x^*| \le |x_{m+1} - x_m|$ for all $m$.

Therefore,

(3) $|x_{m+1} - x^*| \le 2^{-m}|x_1 - x_0|$

and $x_m \to x^*$ as $m \to \infty$. Moreover, for $m = \log_2(|x_1 - x_0|/\varepsilon)$, $|x_{m+1} - x^*| \le \varepsilon$. □

If desired one could evidently also arrange it so that, in addition, $|f(x_{\text{final}})| \le \varepsilon$.

From this lemma we can deduce the following.

Theorem 2.4.1. Let $p(x \mid \eta)$ be a one-parameter canonical exponential family generated by $(T, h)$, satisfying the conditions of Theorem 2.3.1 and $T = t_0 \in C_T^0$, the interior $(a, b)$ of the convex support of $p_T$. Then, the MLE $\hat\eta$, which exists and is unique by Theorem 2.3.1, may be found (to tolerance $\varepsilon$) by the method of bisection applied to

$$f(\eta) \equiv E_\eta T(X) - t_0.$$

Proof. By Theorem 1.6.4, $f'(\eta) = \text{Var}_\eta T(X) > 0$ for all $\eta$, so that $f$ is strictly increasing and continuous and necessarily, because $\hat\eta$ exists, $f(a+) < 0 < f(b-)$. □
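The bisection steps (1) to (5) and their use in Theorem 2.4.1 can be sketched in Python as follows. This is a minimal illustration rather than a library routine: the names `bisect_root` and `poisson_canonical_mle` are hypothetical, and the Poisson family (canonical parameter $\eta = \log\lambda$, $T = \sum x_i$, $E_\eta T = ne^\eta$) is chosen here only because its MLE $\log\bar{x}$ is known in closed form and can be used as a check. The bracket $[-20, 20]$ is an assumption that suits moderate counts.

```python
import math


def bisect_root(f, x0, x1, eps=1e-10):
    """Root of a continuous, strictly increasing f with f(x0) < 0 < f(x1)."""
    if not (x0 < x1 and f(x0) < 0.0 < f(x1)):
        raise ValueError("need x0 < x1 with f(x0) < 0 < f(x1)")
    lo, hi = x0, x1                      # x_old^- and x_old^+
    while hi - lo > 2.0 * eps:           # step (1): stop once the bracket is short
        mid = 0.5 * (lo + hi)            # step (2): midpoint
        if not lo < mid < hi:            # bracket is down to adjacent floats
            break
        fmid = f(mid)
        if fmid == 0.0:                  # step (3): exact root found
            return mid
        if fmid < 0.0:                   # step (4): root lies to the right
            lo = mid
        else:                            # step (5): root lies to the left
            hi = mid
    return 0.5 * (lo + hi)


def poisson_canonical_mle(xs, eps=1e-10):
    """MLE of eta = log(lambda) for i.i.d. Poisson counts with sum(xs) > 0."""
    n, t0 = len(xs), float(sum(xs))

    def f(eta):
        # f(eta) = E_eta T(X) - t0 with T(X) = sum of the observations
        return n * math.exp(eta) - t0

    return bisect_root(f, -20.0, 20.0, eps)
```

For counts with sample mean 2, `poisson_canonical_mle` should agree with `math.log(2.0)` to within the tolerance, which is the guarantee of Lemma 2.4.1.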
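Example 2.4.1. The Shape Parameter Gamma Family. Let $X_1, \ldots, X_n$ be i.i.d. $\Gamma(\theta, 1)$,

$$p(x, \theta) = \Gamma^{-1}(\theta)x^{\theta - 1}e^{-x}, \qquad x > 0,\ \theta > 0. \tag{2.4.1}$$

Because $T(X) = \sum_{i=1}^n \log X_i$ has a density for all $n$ the MLE always exists. It solves the equation

$$\frac{\Gamma'(\theta)}{\Gamma(\theta)} = \frac{T(X)}{n},$$

which by Theorem 2.4.1 can be evaluated by bisection. This example points to another hidden difficulty. The function $\Gamma(\theta) = \int_0^\infty x^{\theta - 1}e^{-x}\,dx$ needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method. However, it is in fact available to high precision in standard packages such as NAG or MATLAB. In fact, bisection itself is a defined function in some packages. □

2.4.2 Coordinate Ascent

The problem we consider is to solve numerically, for a canonical $k$-parameter exponential family,

$$E_\eta(T(X)) = \dot A(\eta) = t_0$$

when the MLE $\hat\eta \equiv \hat\eta(t_0)$ exists. Here is the algorithm, which is slow, but as we shall see, always converges to $\hat\eta$.

The case $k = 1$: see Theorem 2.4.1.

The general case: Initialize

$$\hat\eta^0 = (\hat\eta_1^0, \ldots, \hat\eta_k^0).$$

Solve

for $\hat\eta_1^1$: $\dfrac{\partial}{\partial\eta_1}A(\eta_1, \hat\eta_2^0, \ldots, \hat\eta_k^0) = t_1$

for $\hat\eta_2^1$: $\dfrac{\partial}{\partial\eta_2}A(\hat\eta_1^1, \eta_2, \hat\eta_3^0, \ldots, \hat\eta_k^0) = t_2$

...

for $\hat\eta_k^1$: $\dfrac{\partial}{\partial\eta_k}A(\hat\eta_1^1, \hat\eta_2^1, \ldots, \eta_k) = t_k$.

Set

$$\hat\eta^{01} \equiv (\hat\eta_1^1, \hat\eta_2^0, \ldots, \hat\eta_k^0), \qquad \hat\eta^{02} \equiv (\hat\eta_1^1, \hat\eta_2^1, \hat\eta_3^0, \ldots, \hat\eta_k^0),$$

and so on, and finally

$$\hat\eta^{0k} \equiv \hat\eta^{(1)} = (\hat\eta_1^1, \ldots, \hat\eta_k^1).$$

Repeat, getting $\hat\eta^{(r)}$, $r \ge 1$, eventually.

Notes:

(1) In practice, we would again set a tolerance to be, say $\varepsilon$, for each of the $\hat\eta^{jl}$, $1 \le l \le k$, in cycle $j$ and stop possibly in midcycle as soon as

$$|\hat\eta^{jl} - \hat\eta^{j(l-1)}| \le \varepsilon.$$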
(2) Notice that $\dfrac{\partial A}{\partial\eta_l}(\hat\eta_1^j, \ldots, \hat\eta_{l-1}^j, \eta_l, \hat\eta_{l+1}^{j-1}, \ldots, \hat\eta_k^{j-1})$ is the expectation of $T_l(X)$ in the one-parameter exponential family model with all parameters save $\eta_l$ assumed known. Thus, the algorithm may be viewed as successive fitting of one-parameter families. We pursue this discussion next.

Theorem 2.4.2. If $\hat\eta^{(r)}$ are as above, (i), (ii) of Theorem 2.3.1 hold and $t_0 \in C_T^0$, then

$$\hat\eta^{(r)} \to \hat\eta \quad \text{as } r \to \infty.$$

Proof. We give a series of steps. Let $l(\eta) = t_0^T\eta - A(\eta) + \log h(x)$, the log likelihood.

(1) $l(\hat\eta^{ij})$ is nondecreasing in $j$ for $i$ fixed and in $i$. If $1 \le j < k$, $\hat\eta^{ij}$ and $\hat\eta^{i(j+1)}$ differ in only one coordinate for which $\hat\eta^{i(j+1)}$ maximizes $l$. Therefore, $\lim_{i,j} l(\hat\eta^{ij}) = \lambda$ (say) exists and is $> -\infty$.

(2) The sequence $(\hat\eta^{i1}, \ldots, \hat\eta^{ik})$ has a convergent subsequence in $\bar{\mathcal{E}} \times \cdots \times \bar{\mathcal{E}}$,

$$(\hat\eta^{i_n 1}, \ldots, \hat\eta^{i_n k}) \to (\eta^1, \ldots, \eta^k).$$

But $\eta^j \in \mathcal{E}$, $1 \le j \le k$. Else $\lim_i l(\hat\eta^{ij}) = -\infty$ for some $j$.

(3) $l(\eta^j) = \lambda$ for all $j$ because the sequence of likelihoods is monotone.

(4) $\dfrac{\partial l}{\partial\eta_j}(\eta^j) = 0$ because $\dfrac{\partial l}{\partial\eta_j}(\hat\eta^{i_n j}) = 0$ for all $n$.

(5) Because $\eta^1$, $\eta^2$ differ only in the second coordinate, (3) and (4) imply $\eta^1 = \eta^2$. Continuing, $\eta^1 = \cdots = \eta^k$. Here we use the strict concavity of $l$.

(6) By (4) and (5), $\dot A(\eta^1) = t_0$. Hence, $\eta^1$ is the unique MLE.

To complete the proof notice that if $\hat\eta^{(r_k)}$ is any subsequence of $\hat\eta^{(r)}$ that converges to $\hat\eta^*$ (say) then, by (1), $l(\hat\eta^*) = \lambda$. Because $l(\eta^1) = \lambda$ and the MLE is unique, $\hat\eta^* = \eta^1 = \hat\eta$. By a standard argument it follows that $\hat\eta^{(r)} \to \hat\eta$. □

Example 2.4.2. The Two-Parameter Gamma Family (continued). We use the notation of Example 2.3.2. For $n \ge 2$ we know the MLE exists. We can initialize with the method of moments estimate from Example 2.1.2, $\hat\lambda^{(0)} = \bar{X}/\hat\sigma^2$, $\hat p^{(0)} = \bar{X}^2/\hat\sigma^2$. We now use bisection to get $\hat p^{(1)}$ solving

$$\frac{\Gamma'(\hat p^{(1)})}{\Gamma(\hat p^{(1)})} = \overline{\log X} + \log\hat\lambda^{(0)}$$

and then $\hat\lambda^{(1)} = \hat p^{(1)}/\bar{X}$, $\hat\eta^{(1)} = (\hat p^{(1)}, -\hat\lambda^{(1)})$. Continuing in this way we can get arbitrarily close to $\hat\eta$. This two-dimensional problem is essentially no harder than the one-dimensional problem of Example 2.4.1 because the equation leading to $\hat\lambda_{\text{new}}$ given $\hat p_{\text{old}}$, (2.3.5), is computationally explicit and simple. Whenever we can obtain such steps in algorithms, they result in substantial savings of time. □

It is natural to ask what happens if, in fact, the MLE $\hat\eta$ doesn't exist; that is, $t_0 \notin C_T^0$. Fortunately in these cases the algorithm, as it should, refuses to converge (in $\eta$ space!); see Problem 2.4.2.

We note some important generalizations. Consider a point we noted in Example 2.4.2: For some coordinates $l$, $\hat\eta_l^j$ can be explicit. Suppose that this is true for each $l$. Then each step of the iteration both within cycles and from cycle to cycle is quick. Suppose that we can write $\eta^T = (\eta_1^T, \ldots, \eta_r^T)$ where $\eta_j$ has dimension $d_j$ and $\sum_{j=1}^r d_j = k$ and the problem of obtaining $\hat\eta_l(t_0, \eta_j;\ j \ne l)$ can be solved in closed form. The case we have just discussed has $d_1 = \cdots = d_r = 1$, $r = k$. Then it is easy to see that Theorem 2.4.2 has a generalization with cycles of length $r$, each of whose members can be evaluated easily. A special case of this is the famous Deming-Stephan proportional fitting of contingency tables algorithm; see Bishop, Fienberg, and Holland (1975), for instance, and Problems 2.4.9-2.4.10.

Next consider the setting of Proposition 2.3.1 in which $l_x(\theta)$, the log likelihood for $\theta \in \Theta$ open $\subset R^p$, is strictly concave. If $\hat\theta(x)$ exists and $l_x$ is differentiable, the method extends straightforwardly. Solve

$$\frac{\partial l_x}{\partial\theta_j}(\theta_1^1, \ldots, \theta_{j-1}^1, \theta_j, \theta_{j+1}^0, \ldots, \theta_p^0) = 0$$

by the method of bisection in $\theta_j$ to get $\theta_j^1$ for $j = 1, \ldots, p$, iterate and proceed. Figure 2.4.1 illustrates the process. See also Problem 2.4.7.

Figure 2.4.1. The coordinate ascent algorithm. The graph shows log likelihood contours, that is, values of $(\theta_1, \theta_2)^T$ where the log likelihood is constant. At each stage with one coordinate fixed, find that member of the family of contours to which the vertical (or horizontal) line is tangent. Change other coordinates accordingly.

The coordinate ascent algorithm can be slow if the contours in Figure 2.4.1 are not close to spherical. It can be speeded up at the cost of further computation by Newton's method, which we now sketch.
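The cycle of Example 2.4.2 (bisection for $\hat p$ with $\hat\lambda$ fixed, then the explicit update $\hat\lambda = \hat p/\bar{X}$) can be sketched in Python as below. This is an illustration under stated assumptions, not a production routine: the function names are hypothetical, the digamma function $\Gamma'/\Gamma$ is approximated by the standard recurrence $\psi(x) = \psi(x+1) - 1/x$ followed by its asymptotic series (a package routine could be substituted), and the stopping rule compares successive cycles.

```python
import math


def digamma(x):
    """psi(x) = Gamma'(x)/Gamma(x) for x > 0 (recurrence plus asymptotic series)."""
    acc = 0.0
    while x < 10.0:                      # psi(x) = psi(x + 1) - 1/x
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 * (1.0 / 252 - inv2 / 240.0)))
    return acc + math.log(x) - 0.5 / x - series


def solve_increasing(f, eps=1e-12):
    """Bisection for the root of a strictly increasing f on (0, infinity)."""
    lo, hi = 1e-8, 1.0
    while f(hi) <= 0.0:                  # expand until the root is bracketed
        lo, hi = hi, 2.0 * hi
    while hi - lo > 2.0 * eps * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gamma_mle_coordinate_ascent(xs, tol=1e-10, max_cycles=10000):
    """Return (p_hat, lambda_hat) for i.i.d. Gamma(p, lambda) data, n >= 2."""
    n = len(xs)
    xbar = sum(xs) / n
    mean_log = sum(math.log(x) for x in xs) / n
    var = sum((x - xbar) ** 2 for x in xs) / n
    lam, p = xbar / var, xbar * xbar / var       # method of moments start
    for _ in range(max_cycles):
        # coordinate 1: with lambda fixed, solve psi(p) = mean(log x) + log(lambda)
        target = mean_log + math.log(lam)
        p_new = solve_increasing(lambda t: digamma(t) - target)
        # coordinate 2: with p fixed, the equation for lambda is explicit
        lam_new = p_new / xbar
        done = abs(p_new - p) <= tol and abs(lam_new - lam) <= tol
        p, lam = p_new, lam_new
        if done:
            break
    return p, lam
```

At convergence both likelihood equations, $\psi(\hat p) = \overline{\log X} + \log\hat\lambda$ and $\hat\lambda = \hat p/\bar{X}$, should hold up to the tolerance. Because the two coordinates are strongly dependent in this family, the number of cycles can be large, which is the slowness the text attributes to non-spherical contours.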
2.4.3 The Newton-Raphson Algorithm

An algorithm that, in general, can be shown to be faster than coordinate ascent, when it converges, is the Newton-Raphson method. This method requires computation of the inverse of the Hessian, which may counterbalance its advantage in speed of convergence when it does converge. Here is the method: If $\hat\eta_{\text{old}}$ is the current value of the algorithm, then

$$\hat\eta_{\text{new}} = \hat\eta_{\text{old}} - \ddot A^{-1}(\hat\eta_{\text{old}})(\dot A(\hat\eta_{\text{old}}) - t_0). \tag{2.4.2}$$

The rationale here is simple. If $\hat\eta_{\text{old}}$ is close to the root $\hat\eta$ of $\dot A(\hat\eta) = t_0$, then by expanding $\dot A(\hat\eta)$ around $\hat\eta_{\text{old}}$, we obtain

$$t_0 - \dot A(\hat\eta_{\text{old}}) = \dot A(\hat\eta) - \dot A(\hat\eta_{\text{old}}) \approx \ddot A(\hat\eta_{\text{old}})(\hat\eta - \hat\eta_{\text{old}}).$$

$\hat\eta_{\text{new}}$ is the solution for $\hat\eta$ to the approximation equation given by the right- and left-hand sides. If $\hat\eta_{\text{old}}$ is close enough to $\hat\eta$, this method is known to converge to $\hat\eta$ at a faster rate than coordinate ascent; see Dahlquist, Björk, and Anderson (1974). A hybrid of the two methods that always converges and shares the increased speed of the Newton-Raphson method is given in Problem 2.4.7.

Newton's method also extends to the framework of Proposition 2.3.1. In this case, if $l(\theta)$ denotes the log likelihood, the argument that led to (2.4.2) gives

$$\hat\theta_{\text{new}} = \hat\theta_{\text{old}} - \ddot l^{-1}(\hat\theta_{\text{old}})\dot l(\hat\theta_{\text{old}}). \tag{2.4.3}$$

Example 2.4.3. Let $X_1, \ldots, X_n$ be a sample from the logistic distribution with d.f.

$$F(x, \theta) = [1 + \exp\{-(x - \theta)\}]^{-1}.$$

The density is

$$f(x, \theta) = \frac{\exp\{-(x - \theta)\}}{[1 + \exp\{-(x - \theta)\}]^2}.$$

We find

$$\dot l(\theta) = n - 2\sum_{i=1}^n \exp\{-(X_i - \theta)\}F(X_i, \theta),$$

$$\ddot l(\theta) = -2\sum_{i=1}^n f(X_i, \theta) < 0.$$

The Newton-Raphson method can be implemented by taking $\hat\theta_{\text{old}} = \hat\theta_{\text{MOM}} = \bar{X}$. □

The Newton-Raphson algorithm has the property that for large $n$, $\hat\eta_{\text{new}}$ after only one step behaves approximately like the MLE. We return to this property in Problem 6.6.10.

When likelihoods are nonconcave, methods such as bisection, coordinate ascent, and Newton-Raphson are still employed, though there is a distinct possibility of nonconvergence or convergence to a local rather than global maximum. A one-dimensional problem in which such difficulties arise is given in Problem 2.4.13. Many examples and important issues and methods are discussed, for instance, in Chapter 6 of Dahlquist, Björk, and Anderson (1974).
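As an illustration of (2.4.3) in one dimension, the logistic location problem of Example 2.4.3 can be coded directly from the score and second derivative of the log likelihood. The sketch below is only that: the function names, the tolerance and the iteration cap are assumptions of this illustration, and the code uses the identities $\exp\{-(x-\theta)\}F(x,\theta) = 1 - F(x,\theta)$ and $f(x,\theta) = F(x,\theta)(1 - F(x,\theta))$, which follow from the definitions of $F$ and $f$ for the logistic distribution.

```python
import math


def logistic_score_and_hessian(xs, theta):
    """Return (l'(theta), l''(theta)) for the logistic location model."""
    score, hess = float(len(xs)), 0.0
    for x in xs:
        F = 1.0 / (1.0 + math.exp(-(x - theta)))   # F(x, theta)
        score -= 2.0 * (1.0 - F)                   # exp{-(x - theta)} F = 1 - F
        hess -= 2.0 * F * (1.0 - F)                # f(x, theta) = F (1 - F)
    return score, hess


def logistic_mle_newton(xs, tol=1e-12, max_iter=50):
    """Newton-Raphson for the logistic location MLE, started at the sample mean."""
    theta = sum(xs) / len(xs)
    for _ in range(max_iter):
        score, hess = logistic_score_and_hessian(xs, theta)
        step = score / hess
        theta -= step                              # one-dimensional form of (2.4.3)
        if abs(step) <= tol:
            break
    return theta
```

For data symmetric about a point the score vanishes there, so the routine should return the center of symmetry; for other moderate data sets a handful of iterations typically suffices from the sample mean, although, as the text notes, convergence is not guaranteed from an arbitrary start.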
2.4.4 The EM (Expectation/Maximization) Algorithm

There are many models that have the following structure. There are ideal observations, $X \sim P_\theta$ with density $p(x, \theta)$, $\theta \in \Theta \subset R^d$. Their log likelihood $l_{p,x}(\theta)$ is "easy" to maximize. Say there is a closed-form MLE or at least $l_{p,x}(\theta)$ is concave in $\theta$. Unfortunately, we observe $S \equiv S(X) \sim Q_\theta$ with density $q(s, \theta)$ where $l_{q,s}(\theta) = \log q(s, \theta)$ is difficult to maximize; the function is not concave, difficult to compute, and so on. A fruitful way of thinking of such problems is in terms of $S$ as representing part of $X$, the rest of $X$ is "missing" and its "reconstruction" is part of the process of estimating $\theta$ by maximum likelihood. The algorithm was formalized with many examples in Dempster, Laird, and Rubin (1977), though an earlier general form goes back to Baum, Petrie, Soules, and Weiss (1970). We give a few examples of situations of the foregoing type in which it is used, and its main properties. For detailed discussion we refer to Little and Rubin (1987) and MacLachlan and Krishnan (1997). A prototypical example follows.

Example 2.4.4. Lumped Hardy-Weinberg Data. As in Example 2.2.6, let $X_i$, $i = 1, \ldots, n$, be a sample from a population in Hardy-Weinberg equilibrium for a two-allele locus, $X_i = (\epsilon_{i1}, \epsilon_{i2}, \epsilon_{i3})$, where

$$P_\theta[X = (1, 0, 0)] = \theta^2, \quad P_\theta[X = (0, 1, 0)] = 2\theta(1 - \theta), \quad P_\theta[X = (0, 0, 1)] = (1 - \theta)^2, \quad 0 < \theta < 1.$$

What is observed, however, is not $X$ but $S$ where

$$S_i = X_i, \quad 1 \le i \le m; \qquad S_i = (\epsilon_{i1} + \epsilon_{i2}, \epsilon_{i3}), \quad m + 1 \le i \le n. \tag{2.4.4}$$

Evidently, $S = S(X)$ where $S(X)$ is given by (2.4.4). This could happen if, for some individuals, the homozygotes of one type ($\epsilon_{i1} = 1$) could not be distinguished from the heterozygotes ($\epsilon_{i2} = 1$). The log likelihood of $S$ now is

$$l_{q,s}(\theta) = \sum_{i=1}^m [2\epsilon_{i1}\log\theta + \epsilon_{i2}\log 2\theta(1 - \theta) + 2\epsilon_{i3}\log(1 - \theta)] + \sum_{i=m+1}^n [(\epsilon_{i1} + \epsilon_{i2})\log(1 - (1 - \theta)^2) + 2\epsilon_{i3}\log(1 - \theta)], \tag{2.4.5}$$

a function that is of curved exponential family form. It does turn out that in this simplest case an explicit maximum likelihood solution is still possible, but the computation is clearly not as simple as in the original Hardy-Weinberg canonical exponential family example. If we suppose (say) that observations $S_1, \ldots, S_m$ are not $X_i$ but $(\epsilon_{i1}, \epsilon_{i2} + \epsilon_{i3})$, then explicit solution is in general not possible. Yet the EM algorithm, with an appropriate starting point, leads us to an MLE if it exists in both cases. □

Here is another important example.
Example 2.4.5. Mixture of Gaussians. Suppose $S_1, \ldots, S_n$ is a sample from a population $P$ whose density is modeled as a mixture of two Gaussian densities,

$$p(s, \theta) = (1 - \lambda)\varphi_{\sigma_1}(s - \mu_1) + \lambda\varphi_{\sigma_2}(s - \mu_2),$$

where $\theta = (\lambda, (\mu_i, \sigma_i), i = 1, 2)$ and $0 < \lambda < 1$, $\sigma_1, \sigma_2 > 0$, $\mu_1, \mu_2 \in R$ and $\varphi_\sigma(s) = \frac{1}{\sigma}\varphi\left(\frac{s}{\sigma}\right)$. It is not obvious that this falls under our scheme but let

$$X_i = (\Delta_i, S_i), \quad 1 \le i \le n, \tag{2.4.6}$$

where $\Delta_i$ are independent identically distributed with $P_\theta[\Delta_i = 1] = \lambda = 1 - P_\theta[\Delta_i = 0]$. Suppose that given $\Delta = (\Delta_1, \ldots, \Delta_n)$, the $S_i$ are independent with

$$\mathcal{L}_\theta(S_i \mid \Delta) = \mathcal{L}_\theta(S_i \mid \Delta_i) = N((1 - \Delta_i)\mu_1 + \Delta_i\mu_2,\ (1 - \Delta_i)\sigma_1^2 + \Delta_i\sigma_2^2).$$

That is, $\Delta_i$ tells us whether to sample from $N(\mu_1, \sigma_1^2)$ ($\Delta_i = 0$) or $N(\mu_2, \sigma_2^2)$ ($\Delta_i = 1$). It is easy to see (Problem 2.4.11) that under $\theta$, $S$ has the marginal distribution given previously. Thus, we can think of $S$ as $S(X)$ where $X$ is given by (2.4.6).

This five-parameter model is very rich, permitting up to two modes and scales. The log likelihood similarly can have a number of local maxima and can tend to $\infty$ as $\theta$ tends to the boundary of the parameter space (Problem 2.4.12). Although MLEs do not exist in these models, a local maximum close to the true $\theta_0$ turns out to be a good "proxy" for the nonexistent MLE. The EM algorithm can lead to such a local maximum. □

The EM Algorithm. Here is the algorithm. Let

$$J(\theta \mid \theta_0) \equiv E_{\theta_0}\left(\log\frac{p(X, \theta)}{p(X, \theta_0)}\ \Big|\ S(X) = s\right) \tag{2.4.7}$$

where we suppress dependence on $s$.

Initialize with $\theta_{\text{old}} = \theta_0$.

The first (E) step of the algorithm is to compute $J(\theta \mid \theta_{\text{old}})$ for as many values of $\theta$ as needed. If this is difficult, the EM algorithm is probably not suitable.

The second (M) step is to maximize $J(\theta \mid \theta_{\text{old}})$ as a function of $\theta$. Again, if this step is difficult, EM is not particularly appropriate.

Then we set $\theta_{\text{new}} = \arg\max J(\theta \mid \theta_{\text{old}})$, reset $\theta_{\text{old}} = \theta_{\text{new}}$ and repeat the process.

As we shall see, in important situations, including the examples we have given, the M step is easy and the E step doable.

The rationale behind the algorithm lies in the following formulas, which we give for $\theta$ real and which can be justified easily in the case that $\mathcal{X}$ is finite (Problem 2.4.12):

$$\frac{q(s, \theta)}{q(s, \theta_0)} = E_{\theta_0}\left(\frac{p(X, \theta)}{p(X, \theta_0)}\ \Big|\ S(X) = s\right) \tag{2.4.8}$$

and

$$\frac{\partial}{\partial\theta}\log q(s, \theta)\Big|_{\theta = \theta_0} = E_{\theta_0}\left(\frac{\partial}{\partial\theta}\log p(X, \theta)\ \Big|\ S(X) = s\right)\Big|_{\theta = \theta_0} \tag{2.4.9}$$

for all $\theta_0$ (under suitable regularity conditions). Note that (2.4.9) follows from (2.4.8) by taking logs in (2.4.8), differentiating and exchanging $E_{\theta_0}$ and differentiation with respect to $\theta$ at $\theta_0$. Because, formally,

$$\frac{\partial J(\theta \mid \theta_0)}{\partial\theta} = E_{\theta_0}\left(\frac{\partial}{\partial\theta}\log p(X, \theta)\ \Big|\ S(X) = s\right) \tag{2.4.10}$$

and, hence,

$$\frac{\partial J(\theta \mid \theta_0)}{\partial\theta}\Big|_{\theta_0} = \frac{\partial}{\partial\theta}\log q(s, \theta_0), \tag{2.4.11}$$

it follows that a fixed point $\tilde\theta$ of the algorithm satisfies the likelihood equation,

$$\frac{\partial}{\partial\theta}\log q(s, \tilde\theta) = 0. \tag{2.4.12}$$

The main reason the algorithm behaves well follows.

Lemma 2.4.2. If $\theta_{\text{new}}$, $\theta_{\text{old}}$ are as defined earlier and $S(X) = s$,

$$q(s, \theta_{\text{new}}) \ge q(s, \theta_{\text{old}}). \tag{2.4.13}$$

Equality holds in (2.4.13) iff the conditional distribution of $X$ given $S(X) = s$ is the same for $\theta_{\text{new}}$ as for $\theta_{\text{old}}$ and $\theta_{\text{old}}$ maximizes $J(\theta \mid \theta_{\text{old}})$.

Proof. We give the proof in the discrete case. However, the result holds whenever the quantities in $J(\theta \mid \theta_0)$ can be defined in a reasonable fashion. In the discrete case we appeal to the product rule. For $x \in \mathcal{X}$, $S(x) = s$,

$$p(x, \theta) = q(s, \theta)r(x \mid s, \theta) \tag{2.4.14}$$

where $r(\cdot \mid \cdot, \theta)$ is the conditional frequency function of $X$ given $S(X) = s$. Then

$$J(\theta \mid \theta_0) = \log\frac{q(s, \theta)}{q(s, \theta_0)} + E_{\theta_0}\left\{\log\frac{r(X \mid s, \theta)}{r(X \mid s, \theta_0)}\ \Big|\ S(X) = s\right\}. \tag{2.4.15}$$

If $\theta_0 = \theta_{\text{old}}$, $\theta = \theta_{\text{new}}$,

$$\log\frac{q(s, \theta_{\text{new}})}{q(s, \theta_{\text{old}})} = J(\theta_{\text{new}} \mid \theta_{\text{old}}) - E_{\theta_{\text{old}}}\left\{\log\frac{r(X \mid s, \theta_{\text{new}})}{r(X \mid s, \theta_{\text{old}})}\ \Big|\ S(X) = s\right\}. \tag{2.4.16}$$

Now, $J(\theta_{\text{new}} \mid \theta_{\text{old}}) \ge J(\theta_{\text{old}} \mid \theta_{\text{old}}) = 0$ by definition of $\theta_{\text{new}}$. On the other hand,

$$-E_{\theta_{\text{old}}}\left\{\log\frac{r(X \mid s, \theta_{\text{new}})}{r(X \mid s, \theta_{\text{old}})}\ \Big|\ S(X) = s\right\} \ge 0 \tag{2.4.17}$$

by Shannon's inequality, Lemma 2.2.1. □
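To see the E and M steps and the monotonicity property (2.4.13) at work, the sketch below runs EM on the two-component Gaussian mixture of Example 2.4.5. The text does not spell out the updates for this model; the code uses the standard ones (posterior weights in the E step, weighted means and variances in the M step), which is an assumption of this illustration. Here $\Delta_i = 1$ labels the component with parameters $(\mu_2, \sigma_2)$, matching the weight $\lambda$ in the mixture density, and all function names are hypothetical.

```python
import math


def normal_pdf(s, mu, sigma):
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_loglik(data, lam, mu1, s1, mu2, s2):
    """Observed-data log likelihood log q(s, theta)."""
    return sum(math.log((1.0 - lam) * normal_pdf(s, mu1, s1)
                        + lam * normal_pdf(s, mu2, s2)) for s in data)


def em_step(data, lam, mu1, s1, mu2, s2):
    # E step: w_i = P(Delta_i = 1 | S_i = s_i) under the current parameters
    w = []
    for s in data:
        a = (1.0 - lam) * normal_pdf(s, mu1, s1)
        b = lam * normal_pdf(s, mu2, s2)
        w.append(b / (a + b))
    # M step: weighted means and variances maximize J(theta | theta_old)
    n, w2 = len(data), sum(w)
    w1 = n - w2
    mu1 = sum((1.0 - wi) * s for wi, s in zip(w, data)) / w1
    mu2 = sum(wi * s for wi, s in zip(w, data)) / w2
    v1 = sum((1.0 - wi) * (s - mu1) ** 2 for wi, s in zip(w, data)) / w1
    v2 = sum(wi * (s - mu2) ** 2 for wi, s in zip(w, data)) / w2
    return w2 / n, mu1, math.sqrt(v1), mu2, math.sqrt(v2)


def run_em(data, theta, n_iter=50):
    """Iterate EM from theta = (lam, mu1, sigma1, mu2, sigma2); keep the log likelihoods."""
    history = [mixture_loglik(data, *theta)]
    for _ in range(n_iter):
        theta = em_step(data, *theta)
        history.append(mixture_loglik(data, *theta))
    return theta, history
```

The recorded log likelihoods should never decrease from one iteration to the next, which is what (2.4.13) asserts. Because the likelihood is unbounded near the boundary of the parameter space, a practical implementation would usually guard against a variance collapsing toward zero; this sketch does not.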
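The most important and revealing special case of this lemma follows.

Theorem 2.4.3. Suppose $\{P_\theta : \theta \in \Theta\}$ is a canonical exponential family generated by $(T, h)$ satisfying the conditions of Theorem 2.3.1. Let $S(X)$ be any statistic, then

(a) The EM algorithm consists of the alternation

$$\dot A(\theta_{\text{new}}) = E_{\theta_{\text{old}}}(T(X) \mid S(X) = s) \tag{2.4.18}$$

$$\theta_{\text{old}} = \theta_{\text{new}}. \tag{2.4.19}$$

If a solution of (2.4.18) exists it is necessarily unique.

(b) If the sequence of iterates $\{\hat\theta_m\}$ so obtained is bounded and the equation

$$\dot A(\theta) = E_\theta(T(X) \mid S(X) = s) \tag{2.4.20}$$

has a unique solution, then it converges to a limit $\hat\theta^*$, which is necessarily a local maximum of $q(s, \theta)$.

Proof. In this case,

$$J(\theta \mid \theta_0) = E_{\theta_0}\{(\theta - \theta_0)^T T(X) - (A(\theta) - A(\theta_0)) \mid S(X) = s\} = (\theta - \theta_0)^T E_{\theta_0}(T(X) \mid S(X) = s) - (A(\theta) - A(\theta_0)). \tag{2.4.21}$$

Differentiating with respect to $\theta$ and setting the derivative equal to zero gives (2.4.18), so Part (a) follows. Part (b) is more difficult. A proof due to Wu (1983) is sketched in Problem 2.4.16. □

Example 2.4.4 (continued). $X$ is distributed according to the exponential family

$$p(x, \theta) = \exp\{\eta(2N_{1n}(x) + N_{2n}(x)) - A(\eta)\}h(x) \tag{2.4.22}$$

where

$$\eta = \log\left(\frac{\theta}{1 - \theta}\right), \quad h(x) = 2^{N_{2n}(x)}, \quad A(\eta) = 2n\log(1 + e^\eta)$$

and $N_{jn} = \sum_{i=1}^n \epsilon_{ij}(x_i)$, $1 \le j \le 3$. Now,

$$A'(\eta) = 2n\theta, \tag{2.4.23}$$

$$E_\theta(2N_{1n} + N_{2n} \mid S) = 2N_{1m} + N_{2m} + E_\theta\left(\sum_{i=m+1}^n (2\epsilon_{i1} + \epsilon_{i2})\ \Big|\ \epsilon_{i1} + \epsilon_{i2},\ m + 1 \le i \le n\right). \tag{2.4.24}$$

Under the assumption that the process that causes lumping is independent of the values of the $\epsilon_{ij}$,

$$P_\theta[\epsilon_{ij} = 1 \mid \epsilon_{i1} + \epsilon_{i2} = 0] = 0, \quad 1 \le j \le 2,$$

$$P_\theta[\epsilon_{i1} = 1 \mid \epsilon_{i1} + \epsilon_{i2} = 1] = \frac{\theta^2}{\theta^2 + 2\theta(1 - \theta)} = \frac{\theta^2}{1 - (1 - \theta)^2} = 1 - P_\theta[\epsilon_{i2} = 1 \mid \epsilon_{i1} + \epsilon_{i2} = 1].$$

Thus, we see, after some simplification, that

$$E_\theta(2N_{1n} + N_{2n} \mid S) = 2N_{1m} + N_{2m} + \frac{2}{2 - \theta}M_n \tag{2.4.25}$$

where

$$M_n = \sum_{i=m+1}^n (\epsilon_{i1} + \epsilon_{i2}).$$

Thus, the EM iteration is

$$\hat\theta_{\text{new}} = \frac{2N_{1m} + N_{2m}}{2n} + \frac{M_n}{n}(2 - \hat\theta_{\text{old}})^{-1}. \tag{2.4.26}$$

It may be shown directly (Problem 2.4.12) that if $2N_{1m} + N_{2m} > 0$ and $M_n > 0$, then $\hat\theta_m$ converges to the unique root of

$$\theta^2 - \left(2 + \frac{2N_{1m} + N_{2m}}{2n}\right)\theta + \frac{2N_{1m} + N_{2m} + M_n}{n} = 0$$

in $(0, 1)$, which is indeed the MLE when $S$ is observed. □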
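The iteration (2.4.26) is simple enough to run directly and to compare with the root in $(0, 1)$ of the quadratic obtained by setting $\hat\theta_{\text{new}} = \hat\theta_{\text{old}} = \theta$ in (2.4.26). The sketch below does both; the function names, argument names and the counts used in any check are hypothetical, with `lumped_12` playing the role of $M_n$ and `lumped_3` counting the lumped observations with $\epsilon_{i3} = 1$.

```python
import math


def em_lumped_hardy_weinberg(n1m, n2m, n3m, lumped_12, lumped_3,
                             theta=0.5, tol=1e-12, max_iter=1000):
    """Iterate (2.4.26): fully observed genotype counts plus lumped counts."""
    n = n1m + n2m + n3m + lumped_12 + lumped_3
    a = (2 * n1m + n2m) / (2.0 * n)          # (2 N_1m + N_2m) / (2n)
    b = lumped_12 / float(n)                 # M_n / n
    for _ in range(max_iter):
        theta_new = a + b / (2.0 - theta)    # E step and M step combined
        if abs(theta_new - theta) <= tol:
            return theta_new
        theta = theta_new
    return theta


def mle_from_quadratic(n1m, n2m, n3m, lumped_12, lumped_3):
    """Root in (0, 1) of theta^2 - (2 + a) theta + (2a + b) = 0, the fixed point of (2.4.26)."""
    n = n1m + n2m + n3m + lumped_12 + lumped_3
    a = (2 * n1m + n2m) / (2.0 * n)
    b = lumped_12 / float(n)
    disc = (2.0 + a) ** 2 - 4.0 * (2.0 * a + b)
    return 0.5 * ((2.0 + a) - math.sqrt(disc))
```

The two functions should agree to within the iteration tolerance. Since the derivative of the map $\theta \mapsto a + b/(2 - \theta)$ is $b/(2 - \theta)^2 < 1$ on $(0, 1)$, the iteration is a contraction there and converges quickly.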
Example 2.4.6. Let $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ be i.i.d. as $(Z, Y)$, where $(Z, Y) \sim N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. Suppose that some of the $Z_i$ and some of the $Y_i$ are missing as follows: For $1 \le i \le n_1$ we observe both $Z_i$ and $Y_i$, for $n_1 + 1 \le i \le n_2$, we observe only $Z_i$, and for $n_2 + 1 \le i \le n$, we observe only $Y_i$. In this case a set of sufficient statistics is

$$T_1 = \bar{Z}, \quad T_2 = \bar{Y}, \quad T_3 = n^{-1}\sum_{i=1}^n Z_i^2, \quad T_4 = n^{-1}\sum_{i=1}^n Y_i^2, \quad T_5 = n^{-1}\sum_{i=1}^n Z_iY_i.$$

The observed data are

$$S = \{(Z_i, Y_i) : 1 \le i \le n_1\} \cup \{Z_i : n_1 + 1 \le i \le n_2\} \cup \{Y_i : n_2 + 1 \le i \le n\}.$$

To compute $E_\theta(T \mid S = s)$, where $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, we note that for the cases with $Z_i$ and/or $Y_i$ observed, the conditional expected values equal their observed values. For other cases we use the properties of the bivariate normal distribution (Appendix B.4 and Section 1.4), to conclude

$$E_\theta(Y_i \mid Z_i) = \mu_2 + \rho\sigma_2(Z_i - \mu_1)/\sigma_1$$

$$E_\theta(Y_i^2 \mid Z_i) = [\mu_2 + \rho\sigma_2(Z_i - \mu_1)/\sigma_1]^2 + (1 - \rho^2)\sigma_2^2$$

$$E_\theta(Z_iY_i \mid Z_i) = [\mu_2 + \rho\sigma_2(Z_i - \mu_1)/\sigma_1]Z_i$$

with the corresponding $Z$ on $Y$ regression equations when conditioning on $Y_i$ (Problem 2.4.1). This completes the E-step. For the M-step, compute (Problem 2.4.1)

$$\dot A(\theta) = E_\theta T = (\mu_1,\ \mu_2,\ \sigma_1^2 + \mu_1^2,\ \sigma_2^2 + \mu_2^2,\ \sigma_1\sigma_2\rho + \mu_1\mu_2).$$

We take $\hat\theta_{\text{old}} = \hat\theta_{\text{MOM}}$, where $\hat\theta_{\text{MOM}}$ is the method of moment estimates $(\hat\mu_1, \hat\mu_2, \hat\sigma_1^2, \hat\sigma_2^2, r)$ (Problem 2.1.8) of $\theta$ based on the observed data, and find (Problem 2.4.1) that the M-step produces

$$\hat\mu_{1,\text{new}} = T_1(\hat\theta_{\text{old}}), \quad \hat\mu_{2,\text{new}} = T_2(\hat\theta_{\text{old}}), \quad \hat\sigma^2_{1,\text{new}} = T_3(\hat\theta_{\text{old}}) - \tilde T_1^2, \quad \hat\sigma^2_{2,\text{new}} = T_4(\hat\theta_{\text{old}}) - \tilde T_2^2, \tag{2.4.27}$$

$$\hat\rho_{\text{new}} = \frac{T_5(\hat\theta_{\text{old}}) - \tilde T_1\tilde T_2}{\{[T_3(\hat\theta_{\text{old}}) - \tilde T_1^2][T_4(\hat\theta_{\text{old}}) - \tilde T_2^2]\}^{1/2}}$$

where $T_j(\theta)$ denotes $T_j$ with missing values replaced by the values computed in the E-step and $\tilde T_j = T_j(\hat\theta_{\text{old}})$, $j = 1, 2$. Now the process is repeated with $\hat\theta_{\text{MOM}}$ replaced by $\hat\theta_{\text{new}}$. □
Because the E-step, in the context of Example 2.4.6, involves imputing missing values, the EM algorithm is often called multiple imputation.

Remark 2.4.1. Note that if $S(X) = X$, then $J(\theta \mid \theta_0)$ is $\log[p(X, \theta)/p(X, \theta_0)]$, which as a function of $\theta$ is maximized where the contrast $-\log p(X, \theta)$ is minimized. Also note that, in general, $-E_{\theta_0}[J(\theta \mid \theta_0)]$ is the Kullback-Leibler divergence (2.2.23).

Summary. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all one-parameter canonical exponential families with $\mathcal{E}$ open (when it exists). We then, in Section 2.4.2, use this algorithm as a building block for the general coordinate ascent algorithm, which yields with certainty the MLEs in $k$-parameter canonical exponential families with $\mathcal{E}$ open when it exists. Important variants of and alternatives to this algorithm, including the Newton-Raphson method, are discussed and introduced in Section 2.4.3 and the problems. Finally in Section 2.4.4 we derive and discuss the important EM algorithm and its basic properties.

2.5 PROBLEMS AND COMPLEMENTS

Problems for Section 2.1

1. Consider a population made up of three different types of individuals occurring in the Hardy-Weinberg proportions $\theta^2$, $2\theta(1 - \theta)$ and $(1 - \theta)^2$, respectively, where $0 < \theta < 1$.

(a) Show that $T_3 = N_1/n + N_2/2n$ is a frequency substitution estimate of $\theta$.

(b) Using the estimate of (a), what is a frequency substitution estimate of the odds ratio $\theta/(1 - \theta)$?

(c) Suppose $X$ takes the values $-1, 0, 1$ with respective probabilities $p_1, p_2, p_3$ given by the Hardy-Weinberg proportions. By considering the first moment of $X$, show that $T_3$ is a method of moment estimate of $\theta$.

2. Consider $n$ systems with failure times $X_1, \ldots, X_n$ assumed to be independent and identically distributed with exponential, $\mathcal{E}(\lambda)$, distributions.

(a) Find the method of moments estimate of $\lambda$ based on the first moment.

(b) Find the method of moments estimate of $\lambda$ based on the second moment.

(c) Combine your answers to (a) and (b) to get a method of moment estimate of $\lambda$ based on the first two moments.

(d) Find the method of moments estimate of the probability $P(X_1 \ge 1)$ that one system will last at least a month.

3. Suppose that i.i.d. $X_1, \ldots, X_n$ have a beta, $\beta(\alpha_1, \alpha_2)$ distribution. Find the method of moments estimates of $\alpha = (\alpha_1, \alpha_2)$ based on the first two moments.
Hint: See Problem B.2.5.

4. Let $X_1, \ldots, X_n$ be a sample from a population with distribution function $F$ and frequency function or density $p$. The empirical distribution function $\hat F$ is defined by $\hat F(x) = [\text{No. of } X_i \le x]/n$. If $q(\theta)$ can be written in the form $q(\theta) = s(F)$ for some function $s$ of $F$ we define the empirical substitution principle estimate of $q(\theta)$ to be $s(\hat F)$.

(a) Show that in the finite discrete case, empirical substitution estimates coincide with frequency substitution estimates.
Hint: Express $F$ in terms of $p$ and $\hat F$ in terms of

$$\hat p(x) = \frac{\text{No. of } X_i = x}{n}.$$

(b) Show that in the continuous case $X \sim \hat F$ means that $X = X_i$ with probability $1/n$.

(c) Show that the empirical substitution estimate of the $j$th moment $\mu_j$ is the $j$th sample moment $\hat\mu_j$.
Hint: Write $m_j = \int_{-\infty}^{\infty} x^j\,dF(x)$ or $m_j = E_F(X^j)$ where $X \sim F$.

(d) For $t_1 < \cdots < t_k$, find the joint frequency function of $\hat F(t_1), \ldots, \hat F(t_k)$.
Hint: Consider $(N_1, \ldots, N_{k+1})$ where $N_1 = n\hat F(t_1)$, $N_2 = n(\hat F(t_2) - \hat F(t_1))$, $\ldots$, $N_{k+1} = n(1 - \hat F(t_k))$.

5. Let $X_{(1)} \le \cdots \le X_{(n)}$ be the order statistics of a sample $X_1, \ldots, X_n$. (See Problem B.2.8.) There is a one-to-one correspondence between the empirical distribution function $\hat F$ and the order statistics in the sense that, given the order statistics we may construct $\hat F$ and given $\hat F$, we know the order statistics. Give the details of this correspondence.

6. The $j$th cumulant $\hat c_j$ of the empirical distribution function is called the $j$th sample cumulant and is a method of moments estimate of the cumulant $c_j$. Give the first three sample cumulants. See A.12.

7. Let $X_1, \ldots, X_n$ be the indicators of $n$ Bernoulli trials with probability of success $\theta$.

(a) Show that $\bar{X}$ is a method of moments estimate of $\theta$.

(b) Exhibit method of moments estimates for $\text{Var}_\theta\bar{X} = \theta(1 - \theta)/n$ first using only the first moment and then using only the second moment of the population. Show that these estimates coincide.

(c) Argue that in this case all frequency substitution estimates of $q(\theta)$ must agree with $q(\bar{X})$.

8. Let $(Z_1, Y_1), (Z_2, Y_2), \ldots, (Z_n, Y_n)$ be a set of independent and identically distributed random vectors with common distribution function $F$. The natural estimate of $F(s, t)$ is the bivariate empirical distribution function $\hat F(s, t)$, which we define by

$$\hat F(s, t) = \frac{\text{Number of vectors } (Z_i, Y_i) \text{ such that } Z_i \le s \text{ and } Y_i \le t}{n}.$$
(a) Show that $\hat F(\cdot, \cdot)$ is the distribution function of a probability $\hat P$ on $R^2$ assigning mass $1/n$ to each point $(Z_i, Y_i)$.

(b) Define the sample product moment of order $(i, j)$, the sample covariance, the sample correlation, and so on, as the corresponding characteristics of the distribution $\hat F$. Show that the sample product moment of order $(i, j)$ is given by

$$\frac{1}{n}\sum_{k=1}^n Z_k^iY_k^j.$$

The sample covariance is given by

$$\frac{1}{n}\sum_{k=1}^n (Z_k - \bar{Z})(Y_k - \bar{Y}) = \frac{1}{n}\sum_{k=1}^n Z_kY_k - \bar{Z}\bar{Y},$$

where $\bar{Z}$, $\bar{Y}$ are the sample means of $Z_1, \ldots, Z_n$ and $Y_1, \ldots, Y_n$, respectively. The sample correlation coefficient is given by

$$r = \frac{\sum_{k=1}^n (Z_k - \bar{Z})(Y_k - \bar{Y})}{\sqrt{\sum_{k=1}^n (Z_k - \bar{Z})^2\sum_{k=1}^n (Y_k - \bar{Y})^2}}.$$

All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates. (See Problem 2.1.17.) Note that it follows from (A.11.19) that $-1 \le r \le 1$.

9. Suppose $X = (X_1, \ldots, X_n)$ where the $X_i$ are independent $N(0, \sigma^2)$.

(a) Find an estimate of $\sigma^2$ based on the second moment.

(b) Construct an estimate of $\sigma$ using the estimate of part (a) and the equation $\sigma = \sqrt{\sigma^2}$.

(c) Use the empirical substitution principle to construct an estimate of $\sigma$ using the relation $E(|X_1|) = \sigma\sqrt{2/\pi}$.

10. In Example 2.1.1, suppose that $g(\beta, z)$ is continuous in $\beta$ and that $|g(\beta, z)|$ tends to $\infty$ as $|\beta|$ tends to $\infty$. Show that the least squares estimate exists.
Hint: Set $c = \rho(X, 0)$. There exists a compact set $K$ such that for $\beta$ in the complement of $K$, $\rho(X, \beta) > c$. Since $\rho(X, \beta)$ is continuous on $K$, the result follows.

11. In Example 2.1.2 with $X \sim \Gamma(\alpha, \lambda)$, find the method of moments estimate based on $\hat\mu_1$ and $\hat\mu_3$.
Hint: See Problem B.2.4.

12. Let $X_1, \ldots, X_n$ be i.i.d. as $X \sim P_\theta$, $\theta \in \Theta \subset R^d$, with $\theta$ identifiable. Suppose $X$ has possible values $v_1, \ldots, v_k$ and that $q(\theta)$ can be written as

$$q(\theta) = h(\mu_1(\theta), \ldots, \mu_r(\theta))$$

for some $R^k$-valued function $h$. Show that the method of moments estimate $\hat q = h(\hat\mu_1, \ldots, \hat\mu_r)$ can be written as a frequency plug-in estimate.

13. General method of moment estimates(1). Suppose $X_1, \ldots, X_n$ are i.i.d. as $X \sim P_\theta$, with $\theta \in \Theta \subset R^d$ and $\theta$ identifiable. Let $g_1, \ldots, g_r$ be given linearly independent functions and write

$$\mu_j(\theta) = E_\theta(g_j(X)), \qquad \hat\mu_j = n^{-1}\sum_{i=1}^n g_j(X_i), \qquad j = 1, \ldots, r.$$

Suppose that $X$ has possible values $v_1, \ldots, v_k$ and that

$$q(\theta) = h(\mu_1(\theta), \ldots, \mu_r(\theta))$$

for some $R^k$-valued function $h$.

(a) Show that the method of moments estimate $\hat q = h(\hat\mu_1, \ldots, \hat\mu_r)$ is a frequency plug-in estimate.

(b) Suppose $\{P_\theta : \theta \in \Theta\}$ is the $k$-parameter exponential family given by (1.6.10). Let $g_j(X) = T_j(X)$, $1 \le j \le k$. In the following cases, find the method of moments estimates

(i) Beta, $\beta(1, \theta)$
(ii) Beta, $\beta(\theta, 1)$
(iii) Rayleigh, $p(x, \theta) = (x/\theta^2)\exp(-x^2/2\theta^2)$, $x > 0$, $\theta > 0$
(iv) Gamma, $\Gamma(p, \theta)$, $p$ fixed
(v) Inverse Gaussian, $IG(\mu, \lambda)$, $\theta = (\mu, \lambda)$. See Problem 1.6.36.

Hint: Use Corollary 1.6.1.

14. When the data are not i.i.d., it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments. Consider the Gaussian AR(1) model of Example 1.1.5.

(a) Use $E(X_i)$ to give a method of moments estimate of $\mu$.

(b) Suppose $\mu = \mu_0$ and $\beta = b$ are fixed. Use $E(U_i^2)$, where

$$U_i = (X_i - \mu_0)\Big/\left(\sum_{j=0}^{i-1} b^{2j}\right)^{1/2},$$

to give a method of moments estimate of $\sigma^2$.

(c) If $\mu$ and $\sigma^2$ are fixed, can you give a method of moments estimate of $\beta$?

15. Hardy-Weinberg with six genotypes. In a large natural population of plants (Mimulus guttatus) there are three possible alleles $S$, $I$, and $F$ at one locus resulting in six genotypes labeled $SS$, $II$, $FF$, $SI$, $SF$, and $IF$. Let $\theta_1$, $\theta_2$, and $\theta_3$ denote the probabilities of $S$, $I$, and $F$, respectively, where $\sum_{j=1}^3 \theta_j = 1$. The Hardy-Weinberg model specifies that the six genotypes have probabilities

Genotype       1            2            3            4                  5                  6
Genotype       SS           II           FF           SI                 SF                 IF
Probability    $\theta_1^2$   $\theta_2^2$   $\theta_3^2$   $2\theta_1\theta_2$   $2\theta_1\theta_3$   $2\theta_2\theta_3$

Let $N_j$ be the number of plants of genotype $j$ in a sample of $n$ independent plants, $1 \le j \le 6$ and let $\hat p_j = N_j/n$. Show that

$$\hat\theta_1 = \hat p_1 + \tfrac{1}{2}\hat p_4 + \tfrac{1}{2}\hat p_5$$
$$\hat\theta_2 = \hat p_2 + \tfrac{1}{2}\hat p_4 + \tfrac{1}{2}\hat p_6$$
$$\hat\theta_3 = \hat p_3 + \tfrac{1}{2}\hat p_5 + \tfrac{1}{2}\hat p_6$$

are frequency plug-in estimates of $\theta_1$, $\theta_2$, and $\theta_3$.

16. Establish (2.1.6).
Hint: $[Y_i - g(\beta, z_i)] = [Y_i - g(\beta_0, z_i)] + [g(\beta_0, z_i) - g(\beta, z_i)]$.

17. Multivariate method of moments. For a vector $X = (X_1, \ldots, X_q)$ of observations, let the moments be

$$m_{jkrs} = E(X_r^jX_s^k), \qquad j \ge 0,\ k \ge 0;\ r, s = 1, \ldots, q.$$

For independent identically distributed $X_i = (X_{i1}, \ldots, X_{iq})$, $i = 1, \ldots, n$, we define the empirical or sample moment to be

$$\hat m_{jkrs} = \frac{1}{n}\sum_{i=1}^n X_{ir}^jX_{is}^k, \qquad j \ge 0,\ k \ge 0;\ r, s = 1, \ldots, q.$$

If $\theta = (\theta_1, \ldots, \theta_m)$ can be expressed as a function of the moments, the method of moments estimate $\hat\theta$ of $\theta$ is obtained by replacing $m_{jkrs}$ by $\hat m_{jkrs}$. Let $X = (Z, Y)$ and $\theta = (a_1, b_1)$, where $(Z, Y)$ and $(a_1, b_1)$ are as in Theorem 1.4.3. Show that method of moments estimators of the parameters $b_1$ and $a_1$ in the best linear predictor are

$$\hat b_1 = \frac{n^{-1}\sum Z_iY_i - \bar{Z}\bar{Y}}{n^{-1}\sum Z_i^2 - (\bar{Z})^2}, \qquad \hat a_1 = \bar{Y} - \hat b_1\bar{Z}.$$

Problems for Section 2.2

1. An object of unit mass is placed in a force field of unknown constant intensity $\theta$. Readings $Y_1, \ldots, Y_n$ are taken at times $t_1, \ldots, t_n$ on the position of the object. The reading $Y_i$ differs from the true position $(\theta/2)t_i^2$ by a random error $\epsilon_i$. We suppose the $\epsilon_i$ to have mean 0 and be uncorrelated with constant variance. Find the LSE of $\theta$.
2. Show that the formulae of Example 2.2.2 may be derived from Theorem 1.4.3, if we consider the distribution assigning mass $1/n$ to each of the points $(z_1, y_1), \ldots, (z_n, y_n)$.

3. Suppose that observations $Y_1, \ldots, Y_n$ have been taken at times $z_1, \ldots, z_n$ and that the linear regression model holds. A new observation $Y_{n+1}$ is to be taken at time $z_{n+1}$. What is the least squares estimate based on $Y_1, \ldots, Y_n$ of the best (MSPE) predictor of $Y_{n+1}$?

4. Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points $(z_i, y_i)$, $i = 1, \ldots, n$, in fact, all lie on a line.

5. The regression line minimizes the sum of the squared vertical distances from the points $(z_1, y_1), \ldots, (z_n, y_n)$. Find the line that minimizes the sum of the squared perpendicular distance to the same points.
Hint: Write the lines in the form

$$\frac{(z - \bar{z})}{\hat\sigma} = \hat\rho\frac{(y - \bar{y})}{\hat\tau}.$$

6. (a) Let $Y_1, \ldots, Y_n$ be independent random variables with equal variances such that $E(Y_i) = \alpha z_i$ where the $z_i$ are known constants. Find the least squares estimate of $\alpha$.

(b) Relate your answer to the formula for the best zero intercept linear predictor of Section 1.4.

7. Show that the least squares estimate is always defined and satisfies the equations (2.1.5) provided that $g$ is differentiable with respect to $\beta_i$, $1 \le i \le d$, the range $\{g(z_1, \beta), \ldots, g(z_n, \beta),\ \beta \in R^d\}$ is closed, and $\beta$ ranges over $R^d$.

8. Find the least squares estimates for the model $Y_i = \theta_1 + \theta_2z_i + \epsilon_i$ with $\epsilon_i$ as given by (2.2.4)-(2.2.6) under the restrictions $\theta_1 \ge 0$, $\theta_2 \le 0$.

9. Suppose $Y_i = \theta_1 + \epsilon_i$, $i = 1, \ldots, n_1$ and $Y_i = \theta_2 + \epsilon_i$, $i = n_1 + 1, \ldots, n_1 + n_2$, where $\epsilon_1, \ldots, \epsilon_{n_1 + n_2}$ are independent $N(0, \sigma^2)$ variables. Find the least squares estimates of $\theta_1$ and $\theta_2$.

10. Let $X_1, \ldots, X_n$ denote a sample from a population with one of the following densities or frequency functions. Find the MLE of $\theta$.

(a) $f(x, \theta) = \theta e^{-\theta x}$, $x \ge 0$; $\theta > 0$. (exponential density)

(b) $f(x, \theta) = \theta c^\theta x^{-(\theta + 1)}$, $x \ge c$; $c$ constant $> 0$; $\theta > 0$. (Pareto density)

(c) $f(x, \theta) = c\theta^cx^{-(c + 1)}$, $x \ge \theta$; $c$ constant $> 0$; $\theta > 0$. (Pareto density)

(d) $f(x, \theta) = \sqrt{\theta}x^{\sqrt{\theta} - 1}$, $0 \le x \le 1$, $\theta > 0$. (beta, $\beta(\sqrt{\theta}, 1)$, density)

(e) $f(x, \theta) = (x/\theta^2)\exp\{-x^2/2\theta^2\}$, $x > 0$; $\theta > 0$. (Rayleigh density)

(f) $f(x, \theta) = \theta cx^{c - 1}\exp\{-\theta x^c\}$, $x \ge 0$; $c$ constant $> 0$; $\theta > 0$. (Weibull density)

11. Suppose that $X_1, \ldots, X_n$, $n \ge 2$, is a sample from a $N(\mu, \sigma^2)$ distribution.

(a) Show that if $\mu$ and $\sigma^2$ are unknown, $\mu \in R$, $\sigma^2 > 0$, then the unique MLEs are $\hat\mu = \bar{X}$ and $\hat\sigma^2 = n^{-1}\sum_{i=1}^n (X_i - \bar{X})^2$.

(b) Suppose $\mu$ and $\sigma^2$ are both known to be nonnegative but otherwise unspecified. Find maximum likelihood estimates of $\mu$ and $\sigma^2$.

12. Let $X_1, \ldots, X_n$, $n \ge 2$, be independently and identically distributed with density

$$f(x, \theta) = \frac{1}{\sigma}\exp\{-(x - \mu)/\sigma\}, \qquad x \ge \mu,$$

where $\theta = (\mu, \sigma^2)$, $-\infty < \mu < \infty$, $\sigma^2 > 0$.

(a) Find maximum likelihood estimates of $\mu$ and $\sigma^2$.

(b) Find the maximum likelihood estimate of $P_\theta[X_1 \ge t]$ for $t > \mu$.
Hint: You may use Problem 2.2.16(b).

13. Let $X_1, \ldots, X_n$ be a sample from a $U[\theta - \frac{1}{2}, \theta + \frac{1}{2}]$ distribution. Show that any $T$ such that $X_{(n)} - \frac{1}{2} \le T \le X_{(1)} + \frac{1}{2}$ is a maximum likelihood estimate of $\theta$. (We write $U[a, b]$ to make $p(a) = p(b) = (b - a)^{-1}$ rather than 0.)

14. If $n = 1$ in Example 2.1.5 show that no maximum likelihood estimate of $\theta = (\mu, \sigma^2)$ exists.

15. Suppose that $T(X)$ is sufficient for $\theta$ and that $\hat\theta(X)$ is an MLE of $\theta$. Show that $\hat\theta$ depends on $X$ through $T(X)$ only provided that $\hat\theta$ is unique.
Hint: Use the factorization theorem (Theorem 1.5.1).

16. (a) Let $X \sim P_\theta$, $\theta \in \Theta$ and let $\hat\theta$ denote the MLE of $\theta$. Suppose that $h$ is a one-to-one function from $\Theta$ onto $h(\Theta)$. Define $\eta = h(\theta)$ and let $f(x, \eta)$ denote the density or frequency function of $X$ in terms of $\eta$ (i.e., reparametrize the model using $\eta$). Show that the MLE of $\eta$ is $h(\hat\theta)$ (i.e., MLEs are unaffected by reparametrization, they are equivariant under one-to-one transformations).

(b) Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, $\Theta \subset R^p$, $p \ge 1$, be a family of models for $X \in \mathcal{X} \subset R^d$. Let $q$ be a map from $\Theta$ onto $\Omega$, $\Omega \subset R^k$, $1 \le k \le p$. Show that if $\hat\theta$ is a MLE of $\theta$, then $q(\hat\theta)$ is an MLE of $\omega = q(\theta)$.
Hint: Let $\Theta(\omega) = \{\theta \in \Theta : q(\theta) = \omega\}$, then $\{\Theta(\omega) : \omega \in \Omega\}$ is a partition of $\Theta$, and $\hat\theta$ belongs to only one member of this partition, say $\Theta(\hat\omega)$. Because $q$ is onto $\Omega$, for each $\omega \in \Omega$ there is $\theta \in \Theta$ such that $\omega = q(\theta)$. Thus, the MLE of $\omega$ is by definition

$$\hat\omega_{\text{MLE}} = \arg\sup_{\omega \in \Omega}\sup\{L_X(\theta) : \theta \in \Theta(\omega)\}.$$
Now show that $\hat\omega_{MLE} = \hat\omega = q(\hat\theta)$.

17. Censored Geometric Waiting Times. If time is measured in discrete periods, a model that is often used for the time $X$ to failure of an item is

$$P_\theta[X = k] = \theta^{k-1}(1 - \theta), \quad k = 1, 2, \ldots$$

where $0 < \theta < 1$. Suppose that we only record the time of failure, if failure occurs on or before time $r$ and otherwise just note that the item has lived at least $(r + 1)$ periods. Thus, we observe $Y_1, \ldots, Y_n$ which are independent, identically distributed, and have common frequency function,

$$f(k, \theta) = \theta^{k-1}(1 - \theta), \quad k = 1, \ldots, r,$$

$$f(r + 1, \theta) = 1 - P_\theta[X \le r] = 1 - \sum_{k=1}^{r} \theta^{k-1}(1 - \theta) = \theta^{r}.$$

(We denote by "$r + 1$" survival for at least $(r + 1)$ periods.) Let $M$ = number of indices $i$ such that $Y_i = r + 1$. Show that the maximum likelihood estimate of $\theta$ based on $Y_1, \ldots, Y_n$ is

$$\hat\theta(Y) = \frac{\sum_{i=1}^{n} Y_i - n}{\sum_{i=1}^{n} Y_i - M}.$$

18. Derive maximum likelihood estimates in the following models.

(a) The observations are indicators of Bernoulli trials with probability of success $\theta$. We want to estimate $\theta$ and $\mathrm{Var}_\theta X_1 = \theta(1 - \theta)$.

(b) The observations are $X_1$ = the number of failures before the first success, $X_2$ = the number of failures between the first and second successes, and so on, in a sequence of binomial trials with probability of success $\theta$. We want to estimate $\theta$.

19. Let $X_1, \ldots, X_n$ be independently distributed with $X_i$ having a $N(\theta_i, 1)$ distribution, $1 \le i \le n$.

(a) Find maximum likelihood estimates of the $\theta_i$ under the assumption that these quantities vary freely.

(b) Solve the problem of part (a) for $n = 2$ when it is known that $\theta_1 \le \theta_2$. A general solution of this and related problems may be found in the book by Barlow, Bartholomew, Bremner, and Brunk (1972).

20. In the "life testing" problem 1.6.16(i), find the MLE of $\theta$.

21. (Kiefer-Wolfowitz) Suppose $(X_1, \ldots, X_n)$ is a sample from a population with density

$$f(x, \theta) = \frac{9}{10\sigma}\, \varphi\!\left(\frac{x - \mu}{\sigma}\right) + \frac{1}{10}\, \varphi(x - \mu)$$

where $\varphi$ is the standard normal density and $\theta = (\mu, \sigma^2) \in \Theta = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\ 0 < \sigma^2 < \infty\}$. Show that maximum likelihood estimates do not exist, but
9 3. N. Tool life data Peed 1 Speed 1 1 Life 54.'.0 0. and assume that zt'·· .< J. Show that the MLE of = (fl.ji. Y.0 11. . and only if. (1') where e n ii' = L(Xi . Let Xl.2.a 2 ) and NU". 1 1 1 o o .2. Weisberg.6.0 2. 23.p(x.z!. 1 <k< p}.Xm and YI .4)(2. . Suppose X has a hypergeometric.Xj for i =F j and that n > 2. 1985). Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer. " Yn be two independent samples from N(J.146 Methods of Estimation Chapter 2 that snp. .t. Ii equals one of the numbers 22. 7 ..0 I .J..5 86.. II I I :' 24.8 2.. . where'i satisfy (2.1 2.2 3. J2 J2 o o o I 3. Hint: Coosider the mtio L(b + 1. (1') is If = (X. . Polynomial Regression. Suppose Y.0 5.. TABLE 2.1.8 3. respectively. .' wherej E J andJisasubsetof {UI . x)/ L(b.Y)' i=1 j=1 /(rn + n).I' fl.8 14.(Zi) + 'i.4 o o o o o o o o o o o o o Life 20.jp): 0 <j. and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise. where [t] is the largest integer that is < t. . 1i(b.tl..up(X.5 66. = fl.2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3. Assume that Xi =I=. n).a 2 ) = Sllp~.6).2 4.0"2) if.. Xl.x n . distribution. . i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution).5 0. the following data were obtained (from S..2 4.8 0. x) as a function of b.X)' + L(Y. (1') populations. Set zl ~ .5 I' I I .
0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models.6.  27.13)/6.6) with g({3.. .8 and let v(z. . Let (Z. (b) Let P be the empirical probability . Consider the model (2.3.Z)]'). = (cutting speed . (2.z. weighted least squares estimates are plugin estimates. 26.5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life.ZD is of rank d. let v(z.4)(2. Y)Z') and E(v(Z.6) with g({3.y)dzdy. Y)[Y .6)). Derive the weighted least squares nonnal equations (2. (a) Show tha!. z) ing are equivalent. (2. .19).y)!(z.1. E{v(Z.(P)Z of Y is defined as the minimizer of t = zT {3. (a) Let (Z'.1. Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix. Two models are contemplated (a) Y ~ 130 + pIZI + p.900)/300. Y)Y') are finite.Y') have density v(z. in Problem 2..2. Show that p. y). + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +. Y')/Var Z' and pI(P) = E(Y*) 11. Set Y = Wly. The best linear weighted mean squared prediction error predictor PI (P) + p.. However. this has to be balanced against greater variability in the estimated coefficients.(P) = Cov(Z'.l be a square root matrix of W. 25. z) ~ Z D{3.4)(2. Being larger. the second model provides a better approximation. Y) have joint probability P with joint density ! (z. .Y = Z D{3+€ satisfy the linear regression mooel (2.. Both of these models are approximations to the true mechanism generating the data.(P) and p. z.2.2.(P) coincide with PI and 13.. Use these estimated coefficients to compute the values of the contrast function (2. Show that the follow ZDf3 is identifiable.y)!(z.2.2.2. y) > 0 be a weight funciton such that E(v(Z. (12 unknown.2. Consider the model Y = Z Df3 + € where € has covariance matrix (12 W.y)/c where c = I Iv(z. l/Var(Y I Z = z). .1). This will be discussed in Volume II. of Example 2. (c) Zj.(P)E(Z·).5) for (a) and (b).1).2. 28.Section 2. Let W. ZD = Wz ZD and € = WZ€. 
ZI ~ (feed rate .y) defined .(b l + b. Show that (3.1 (see (B. ~. That is. (a) The parameterization f3 (b) ZD is of rank d.
918 2. = L.2. Then = n2 = . . \ (e) Find the weighted least squares estimate of p..a. i = 1.476 3..689 4.391 0.081 2.511 3.788 2.6) is given by (2.196 2.) (c) Find a matrix A such that enxl = Anx(n+l)ECn+l)xl' (d) Find the covariance matrix W of e.075 4.669 7. I Yi is (/1 + Vi). .. . .jf3j for given covariate values {z.921 2.393 3.038 3.091 3. k. .053 4. Show that the MLE of .916 6.093 1..100 3. suppose some of the nj are zero.480 5.676 5.858 3.. /1i + a}. (See Problem 2. . (a) Show that E(1'.6) T 1 (Y .665 2..148 ~ Methods of Estimation Chapter 2 (b) Show that if Z  D has rank d.301 2.6)T(y . ~ ~ ~Show that the MLE of OJ is 0 with OJ = nj In.d. in this model the optimal " MSPE predictor of the future Yi+ 1 given the past YI . i = 1. Chon) shonld be read row by row.5.455 9.). where 1'.392 4..156 5..nk > 0.856 2. J.582 2. = nq = 0.703 4.) ~ ~ (I' + 1'.ZD.457 2..723 1. The data (courtesy S. . Consider the model Yi = 11 + ei. 2. 0) = II j=q+l k 0.397 4.071 0.131 5. then the  f3 that minimizes ZD. Is ji.. In the multinomial Example 2. where El" . .611 4. ' I (f) The following data give the elapsed times YI . Suppose YI .097 1. Yn are independent with Yi unifonnly distributed on [/1i . That is.046 2.716 7.860 5.i.834 3. Let ei = (€i + {HI )/2.093 5. Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore.908 1..958 10. j = 1.968 2.1.564 1.229 4. with mean zero and variance a 2 • The ei are called moving average errors.020 8.~=l Z.379 2. n.. 1'. . = (Y  29.182 0. .2. k.453 1. 31.666 3.870 30.968 9.ZD..039 9.058 3.6) W.17.300 6.20). .054 1. which vanishes if 8 j = 0 for any j = q + 1.. <J > 0.ZD. 4 (b) Show that Y is a multivariate method of moments estimate of p.8.'.249 1. . _'. Use a weighted least squares c~mputer routine to compute the weighted least squares estimate Ii of /1. different from Y? I I TABLE 2.360 1.. .971 0. .(Y .+l I Y .599 0.019 6.1.. 
Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay.114 1.€n+l are i.274 5.064 5. n.... nq+1 > p(x.j}. Hint: Suppose without loss of generality that ni 0.155 4.. .
34.. the sample median yis defined as ~ [Y(. The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj). . . (a) Show Ihat if Ill[ < 1. An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass . ~ 33.. Yn are independent with Vi having the Laplace density 1 2"exp {[Y. Find the MLE of A and give its mean and variance. the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l).O) Hint: See Problem 2.28Id(F.. be i. Let X.l1./3p that minimizes the maximum absolute value conrrast function maxi IYi . be the observations and set II = ~ (Xl . as (Z. (3p. (a) Show that the MLE of ((31.x. If n is even.>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}. .6.Jitl.i..32(b).) Suppose 1'. = L Ix. 35. (See (2. Show that the sample median ii is the minimizer of L~ 1 IYi ... . Give the MLE when . Hint: Use Problem 1.. Y( n) denotes YI . x E R. Let B = arg max Lx (8) be "the" MLE. i < j. then the MLE exists and is unique. " . 8 E R.Ili I and then setting (j = n~I L~ I IYi . a) is obtained by finding (31.) + Y(r+l)] where r = ~n.3. ..{3p. + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x .)..(1 + x'j.tLil and then setting a = max t IYi .  Illi < 1.. . . . Suppose YI ... y)T where Y = Z + v"XW. .{:i.5 Problems and Complements 149 (f3I' .. ./Jr ~ ~ and iii. where Jii = L:J=l Zij{3j. be the Cauchy density. = I' for each i. Let g(x) ~ 1/". A > 0..I/a}. 1). . .2. be i.Section 2. and x. a)T is obtained by finding 131..d.J. Yn ordered from smallest to largest.d. F)(x) *F denotes convolution. ~ f3I.7 with Y having the empirical distribution F. Z and W are independent N(O. XHL i& at each point x'. Show that XHL is a plugin estimate of BHL.:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x..& at each Xi. . where Iii = ~ L:j:=1 ZtjPj· 32..i.8).I'.1. 
These least absolute deviation estimates (LADEs).d.(3p that minimizes the least absolute deviation contrast function L~ I IYi ..4.17).fin are called (b) If n is odd. . . Let x. with density g(x ... . .. . Hint: See Example 1. let Xl and X.
n. Let Xi denote the number of hits at a certain Web site on day i. 51.h(y) .0). and positive everywhere. that 8 1 E. N(B. . Sl..) be a random samplefrom the distribution with density f(x. B) is and that the KullbackLiebler divergence between p(x. Define ry = h(B) and let p' (x. Ii I. The likelihood function is given by Lx(B) g(xI . . P(nA). If we write h = logg. o Ii !n 'I . X. 9 is continuous. ry) = p(x. = I ! I!: Ii.f)) in the likelihood equation. then h"(y) > 0 for some nonzero y.i. . .B)g(i . Let K(Bo.. ~ (d) Use (c) to show that if t> E (a. symmetric about 0. BI ) (p' (x. B) denote their joint density.B)g(x. Show that (a) The likelihood is symmetric about ~ x. Bo) is tn( B. 37.150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. where x E Rand (} E R. where Al +. Assume :1. Let 9 be a probability density on R satisfying the following three conditions: I. ryl Show that o n ». Let X ~ P" B E 8. Also assume that S.. B) = g(xB).. . 9 is twice continuously differentiable everywhere except perhaps at O. distribution.B) g(x + t> . . ryo) and p' (x.. > h(y) . .b). . and 8 2 are independent.x2)/2. such that for every y E (a. Assume that 5 = L:~ I Xi has a Poisson.t> .j'. B ) and p( X. a < b. then the MLE is not unique. j = n + I. i = 1. 1 n + m. Let Vj and W j denote the number of hits of type 1 and 2 on day j.d.)/2 and t> = (XI . ~ (XI + x. (a. 2. then B is not unique. Suppose h is a II function from 8 Onto = h(8).\2 = A..+~l Wj have P(rnAl) and P(mA2) distributions. Let (XI..+~l Vi and 8 2 = E.. 3.. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . 1985). On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). 38.B I ) (K'(ryo. .. ~ Let B ~ arg max Lx (B) be "the" MLE.ryt» denote the KullbackLeibler divergence between p( x. Show that the entropy of p(x. (7') and let p(x. . . 'I i 39.B). Problem 35 can be generalized as follows (Dhannadhikari and JoagDev. 36. 
Let Xl and X2 be the observed values of Xl and X z and write j.. and 5.Xn are i.B )'/ (7'. Suppose X I.h(y . (b) Either () = x or () is not unique. . Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x . h I (ry») denote the density or frequency function of X for the ry parametrization. b). B) and pix. Find the MLEs of Al and A2 hased on 5.
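The two-observation Cauchy likelihood of Problems 35 and 36 can be explored numerically. The sketch below is an illustration, not part of the original problem set. It assumes the density $g(x) = 1/\pi(1 + x^2)$ and writes $\bar x = (x_1 + x_2)/2$, $\Delta = (x_1 - x_2)/2$. Setting the derivative of $(1 + (\Delta - u)^2)(1 + (\Delta + u)^2)$ with respect to $u = \theta - \bar x$ equal to zero gives $u = 0$ or $u^2 = \Delta^2 - 1$, so for $|\Delta| > 1$ the likelihood has two equal maxima at $\bar x \pm \sqrt{\Delta^2 - 1}$. The function names are hypothetical.

```python
import math


def cauchy_loglik(theta, x1, x2):
    """Log likelihood of a Cauchy location sample of size two."""
    return -sum(math.log(math.pi * (1.0 + (x - theta) ** 2)) for x in (x1, x2))


def cauchy_mle_two_obs(x1, x2):
    """Return the list of maximizers of the two-observation Cauchy likelihood.

    With xbar = (x1 + x2)/2 and delta = (x1 - x2)/2, the likelihood equation
    factors as (theta - xbar) * ((theta - xbar)**2 - (delta**2 - 1)) = 0.
    """
    xbar = 0.5 * (x1 + x2)
    delta = 0.5 * (x1 - x2)
    if abs(delta) <= 1.0:
        # Single stationary point: the MLE is unique and equals xbar.
        return [xbar]
    # Two symmetric maximizers; xbar itself is then a local minimum.
    r = math.sqrt(delta * delta - 1.0)
    return [xbar - r, xbar + r]


if __name__ == "__main__":
    for pair in [(1.0, 0.0), (4.0, 0.0)]:
        roots = cauchy_mle_two_obs(*pair)
        print(pair, roots, [cauchy_loglik(t, *pair) for t in roots])
```

Comparing the log likelihood at the returned points with its value at $\bar x$ shows directly why the midpoint stops being the maximizer once the observations are more than two units apart.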
= log a .2.Section 2. n where A = (a. Aj) + Ei. 42. . n > 4..7) for a. I' E R.1. 41. + C. A) = g(z.. exp{x/O. and 1'. Variation in the population is modeled on the log scale by using the model logY. . 1). . = L X. For the ca. [3 > 0.5 Problems and Complements 151 40. Suppose Xl. Seber and Wild.. i = 1. . < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case. give the lea...j ~ x <0 1. j 1 1 cxp{ x/Oj}. o + 0. where 0 (a) Show that a solution to this equation is of the form y (a. 1959.I[X. (.~t square estimating equations (2. .)/o]}" (b) Suppose we have observations (t I .olog{1 + exp[[3(t.0). (tn. and g(t.. where OJ > O. T. .1. a.)/o)} (72. . i = 1. x > 0. n.1.  J1. 1'..1'. 1 En are uncorrelated with mean 0 and variance estimating equations (2.. [3..p.O) ~ a {l+exp[[3(tJ1. Give the least squares where El" •. = LX. j = 1.5. . . on a population of a large number of organisms. yd 1 ••• . 1').}. 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0. and /1. [3. > OJ and T. .7) for estimating a. .I[X. 6. and fj. h(z.c n are uncorrelated with mean zero and variance (72. [3. . X n be a sample from the generalized Laplace distribution with density 1 O + 0. An example of a neural net model is Vi = L j=l p h(Zij. Let Xl. a > 0. 0). [3. /3. . a get. Yn). . 0 > O. (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism.) Show that statistics. The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards..~e p = 1. 1 X n satisfy the autoregressive model of Example 1.
l.. = If C2 > 0. r(A.. the bound is sharp and is attained only if Yi x. S.3 1.0 .. ..l. Hint: I . = OJ.. fi exists iff (Yl . C2' y' 1 1 for (a) Show that the density of X = (Xl. (One way to do this is to find a matrix A such that enxl = Anxn€nx 1. Suppose Y I . + 0. = L(CI + C.. .) Then find the weighted least square estimate of f. I < j < 6. Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p.). Give details of the proof or Corollary 2...(8 .)I(c.x. 0 for Xi < _£I. " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4.15.'" £n)T of autoregression errors. find the covariance matrix W of the vector € = (€1. Let = {(01. > 0.. < I} and let 03 = 1.y.3. .) 1 II(Z'i /1)2 (b) If j3 is known.5).0.. .3. There exists a compact set K c e such that 1(8) < c for all () not in K. (b) Show that the likelihood equations are equivalent to (2.x.d. t·. . write Xi = j if the ith plant has genotype j.. _ C2' 2. Is this also the MLE of /1? Problems for Section 2...152 Methods of Estimation Chapter 2 (aJ If /' is known. 1 Xl <i< n. + 8. . I . Let Xl. L x. • II ~. Under what conditions on (Xl 1 .. fi) = a + fix. < Show that the MLE of a.3.fi) = 1log p pry. .l  L: ft)(Xi /.4) and (2. Yn) is not a sequence of 1's followed by all O's or the reverse.1. < . This set K will have a point where the max is attained. In a sample of n independent plants. X n be i. . Yn are independent PlY. + C1 > 0). show that the MLE of fi is jj =  2. gamma. a.1 > _£l.. . .Xi)Y' < L(C1 + c. .i. = 1] = p(x"a.3.J. Prove Lenama 2. Consider the HardyWeinberg model with the six genotypes given in Problem 2. '12 = A and where r denotes the gamma function. Xn· Ip (x.02): 01 > 0.p).. Hint: Let C = 1(0). 3. n > 2. n i=l n i=l n i=1 n i=l Cl LYi + C. .. (X.. .::.
a(i) + a. 12. 0 < Zl < .0).A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00. X n be Ll. _..6. 8. and assume forw wll > 0 so that w is strictly convex. . and Zi is the income of the person whose duration time is ti..8) . Prove Theorem 2.1} 0 OJ = (b) Give an algorithm such that starting at iP = 0. .Xn E RP be i. But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations. .l E R. . w( ±oo) = 00. Show that the boundary of a convex C set in Rk has volume 0. .3.'~ve a unique solution (. (b) Show that if a = 1.3. the MLE 8 exists but is not unique if n is even. Let Y I . .Section 2. Let Xl. Then {'Ij} has a subsequence that converges to a point '10 E f..3. See also Problem 1.. ji(i) + ji.1 (a) ~ ~ c(a) exp{ Ix . (O.lkIl :Ij >0..en. 0:0 = 1. ~fo (X u I..40. [(Ai).. the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i . ... u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm. (0. Use Corollary 2.t).1 <j < ac k1.. . then it must contain a sphere and the center of the sphere is an interior point by (B.9. .O. Suppose Y has an exponential. . Hint: The kpoints (0.n. MLEs of ryj exist iffallTj > 0. < Zn Zn. Hint: If BC has positive volume. distribution where Iti = E(Y. Yn denote the duration times of n independent visits to a Web site. > 3. 1 <j< k1. Let XI.Ld.0. e E RP.. . < Zn. n > 2.) ~ Ail = exp{a + ilz..5 Problems and Complements 153 n 6.. fe(x) wherec. .ill T exists and is unique. with density... Zj < . if n > 2.6.1 to show that in the multinomial Example 2.. • It) w' ( Xi It) . 9. .fl.3. < Show that the MLE of (a. 10. " (a) Show that if Cl: > ~ 1..z=7 11. J.d. Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 . . In the heterogenous regression Example 1.}.n) are the vertices of the :Ij <n}. (3) Show that.0).10 with that the MLE exists and is unique. C! > 0.3.1). the MLE 8 exists and is unique.' ...0 < ZI < . . convex set {(Ij. show 7..
J1. . E" . (b) Reparametrize by a = ..2. p) population. Note: You may use without proof (see Appendix B. show that the estimates of J11. . it is unique. O'?. "i . (Xn . EeT = (J11. b) ~ I w( aXi . then the coordinate ascent algorithm doesn't converge to a member of t:.)(Yi . . 2. and p = [~(Xi . (a) Show that the MLEs of a'f. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model i: y ~ ZDJ3 + <.154 Methods of Estimation Chapter 2 (c) Show that for the logistic distribution Fo(x) [1 + exp{ X}]I.1)".d.i. ~J ' .) has a density you may assume that > 0.l and CT. 3..2 are assumed to be known are = (lin) L:. See.3. respectively. EM for bivariate data. 1985. > 0. il' i. IPl < 1. J12. 0'1 Problems for Section 2.J1. for example. provided that n > 3. 8aob :1 13.) I . ag.) Hint: (a) Thefunction D( a. {t2.'2). 7if)F 8 08 D 802 vb2 2 2 > (a'D)2 ' then D'IS strictIy convex. b) = x if either ao = 0 or 00 or bo = ±oo. b) and lim(a. Y. Show that if T is minimal and t: is open and the MLE doesn't exist. . En i.2).n log a is strictly convex in (a. Chapter 10. Apply Corollary 2...3. . pal0'2 + J11J.b) .6. .8.(Yi . ..b)_(ao. rank(ZD) = k.J1. I (Xi . ar 1 a~.4 1. P coincide with the method of moments estimates of Problem 2. Let (X10 Yd.. O'~ + J1~.1. O'~ + J1i.. complete the Estep by finding E(Zl I l'i) and E(ZiYi I Yd· (b) In Example 2. (See Example 2. verify the Mstep by showing that E(Zi I Yi). (I't') IfEPD 8a2 a2 D > 0 an d > 0.2)/M'''2] iI I'. (b) If n > 5 and J11 and /12 are unknown.4. Golnb and Van Loan.:. (i) If a strictly convex function has a minimum. b successively. (Check that you are describing the GaussSeidel iterative method for solving a system of linear equations.' ! (a) In the bivariate nonnal Example 2.J1.bo) D(a.. O'~. L:.2)2. = (lin) L:.:. N(O.6. /12. .9).4.. W is strictly convex and give the likelihood equations for f. and p when J1.4. Yn ) be a sample from a N(/11. 
b = : and consider varying a.1 and J1.. Hint: (b) Because (XI.
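Problem 3 of Section 2.4 asks what coordinate ascent does in the Gaussian linear model. As a minimal sketch, assuming a full-rank design matrix stored as a list of rows, maximizing the Gaussian log likelihood in one coefficient at a time amounts to solving one normal equation at a time, which is the Gauss-Seidel iteration for $Z_D^T Z_D \beta = Z_D^T y$. The function name, the starting value and the fixed number of sweeps are illustrative choices, not taken from the text.

```python
def coordinate_ascent_ls(Z, y, sweeps=200):
    """Coordinate ascent for least squares, i.e. Gauss-Seidel on the normal equations.

    Z is a list of n rows of length d (assumed to have rank d), y a list of length n.
    """
    n, d = len(Z), len(Z[0])
    # Normal equations G beta = c with G = Z^T Z and c = Z^T y.
    G = [[sum(Z[i][j] * Z[i][k] for i in range(n)) for k in range(d)] for j in range(d)]
    c = [sum(Z[i][j] * y[i] for i in range(n)) for j in range(d)]
    beta = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # Maximize the log likelihood in beta_j with the other coordinates
            # held at their current values: solve the j-th normal equation.
            s = sum(G[j][k] * beta[k] for k in range(d) if k != j)
            beta[j] = (c[j] - s) / G[j][j]
    return beta


if __name__ == "__main__":
    Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
    y = [1.0, 3.0, 5.0, 7.0]
    print(coordinate_ascent_ls(Z, y))
```

Because $Z_D^T Z_D$ is symmetric positive definite when the design has full rank, the sweeps converge to the least squares estimate; a fixed sweep count is used here only to keep the sketch short.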
Y (~ A 2:7 I(Yi  Y)'  O"~)/ (0". thus. I n ) 8"(10)"" (a) Show that P(X = x MLE exists and is unique.B)n[n .(I. .and Msteps of the EM algorithm for this problem. 1 and (a) Show that X .n. B= (A. Let (Ii.5 Problems and Complements 155 4.3) to show that the NewtonRaphson algorithm gives ~ _ 81 ~ 8 ~ _ ~ _ B(I. is distributed according to an exponential ryl < n} = i£(1 I) uzuy. Hint: Use Bayes rule. 1]. also the trait)... For families in which one member has the disease.I ~ + (1  B)n] .P..[lt ~ 1] = A ~ 1 . 8new .B)"]'[(1 . as the first approximation to the maximum likelihood . ~2 = log C\) + (b) Deduce that T is minimal sufficient. (a) Justify the following crude estimates of Jt and A._x(1.[lt ~ OJ. Because it is known that X > 1. IX > 1) = \ X ) 1 (1 6)" . Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it.{(Ii.B)[I.O"~). 6.Ii). Suppose that in a family of n members in which one has the disease (and.B)n} _ nB'(I. 2 (uJ L }iIi + k 2: Y.? (b) Give as explicitly as possible the E. 1') E (0.1) x Rwhere P. Yd : 1 < i family with T afi I a? known. . 1 < i < n. B E [0. Suppose the Ii in Problem 4 are not observed. I' >. j = 0. (c) Give explicitly the maximum likelihood estimates of Jt and 5. B) variable.x = 1.. i ~ 1'. 2:Ji) . be independent and identically distributed according to P6.2B)x + nB'] _. Y1 '" N(fLl a}). Y'i).[1 . where 8 = 80l d and 8 1 estimate of 8.. X is the number of members who have the trait.(1 . Do you see any problems with >. the model often used for X is that it has the conditional distribution of a B( n. it is desired to estimate the proportion 8 that has the genetic trait.Section 2.(1 _ B)n]{x nB .. given X>l. and given II = j.4. and that the (b) Use (2. when they exist.
1 (1 + (x e..9)')1 Suppose X. W).2 noting that the sequence of iterates {rymJ is bounded and. Let Xl.1 generated by N++c. . (1) 10gPabc = /lac = Pabe. 1 < b < B.X3 be independent observations from the Cauchy distribution about f(x. . Hint: Apply the argument of the proof of Theorem 2.2. W = c] 1 < a < A. X 3 = a. X Methods of Estimation Chapter 2 = 2. Show that the sequence defined by this algorithm converges to the MLE if it exists. * maximizes = iJ(A') t. V. Consider the following algorithm under the conditions of Theorem 2. = 1. Let and iJnew where ). . n N++ c ++c  .. b1 c and then are . X n be i.4.Na+c. + Vbc where 00 < /l. 7. c]' Show that this holds iff PIU ~ a. 8.e. Let Xl.b. X. N +bc given by < N ++c for all a.N+bc where N abc = #{i : Xi = (a.A(iJ(A)). Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between p and a. hence. . Define TjD as before.c Pabc = l.4. (a) Deduce that depending on where bisection is started the sequence of iterates may converge to one or the other of the local maxima (b) Make a similar study of the NewtonRaphson method in this case. iff U and V are independent given W. 9.v ~ b I W = c] = P[U = a I W = c]P[V = b I W ~ i. V = b.c)} and "+" indicates summation over the ind~x. PIU = a. the sequence (11ml ijm+l) has a convergent subse quence. v < 00. 1 <c < C and La. N .i.X 2 .156 (c) [f n = 5. v vary freely is an exponential family of rank (C . find (}l of (b) above using (} =   x/n as a preliminary estimate. (h) Show that the family of distributions obtained by letting 1'.9) = ".b. (a) Suppose for aU a. (c) Show that the MLEs exist iff 0 < N a+c . . N++ c N a +c N+ bc Pabc = . ~ 0. where X = (U. b1 C.d.1) + C(A + B2) = C(A + B1) .riJ(A) .
1)(B .I) ~ AB + AC + BC . but now (2) logPabc = /lac + Vbc + 'Yab where J1.a>!b".b'. (b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model. Initialize: pd: O) = N a ++ N_jo/>+ N++ c abc nnn Pabc d2) dl) Nab+ Pabc dO) n n Pab+ d1) dl) dO) Pabc d3) N a + c Pabc Pa+c N+bc Pabc d 2) Pabc n P+bc d2)· Reinitialize with ~t~.:.=_ L ellalblp~~~'CI a'.4.. Justify formula (2.c' obtained by fixing the "b.Section 2.1) + (A .N++c/A. Show that the algorithm converges to the MLE if it exists and di· verges otheIWise.N++ c / B.5 has the specified mixtnre of Gaussian distribution. 10. f vary freely. Suppose X is as in Problem 9. 12. Hint: P. 11.I)(C . (a) Show that this is an exponential family of rank A + B + C .(x) :1:::: 1(S(x) ~ = s).I) + (B . c" parameters.4. Hint: Note that because {p~~~} belongs to the model so do all subsequent iterates and that ~~~ is the MLE for the exponential family Pabc = ellauplO) _=. (a) Show that S in Example 2. Nl+co (c) The model implies Pabc = P+bcPa±c/P++c and use the likelihood equations.8).and Msteps of the EM algorithm in this case. (b) Give explicitly the E. v. N±bc .(A + B + C).I)(C .[X = x I SeX) = s] = 13.c. Let f. c" and "a.5 Problems and Complements 157 Hint: (b) Consider N a+ c .3 + (A . = fo(x  9) where fo(x) 3'P(x) + 3'P(x ..a) 1 2 .
17. .6. but not on Yi." .3 for the actual MLE in that example. (2) The frequency plugin estimates are sometimes called Fisher consistent. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986). The assumption underlying the computations in the EM algorithm is that the conditional probability that a component X j of the data vector X is missing given the rest of the data vector is not a function of X j . N(O.. In Example 2. Ii .4. Then using the Estep to impute values for the missing Y's would greatly unclerpredict the actual V's because all the V's in the imputation would have Y < 2.2. 15.1 (1) "Natural" now was not so natural in the eighteenth century when the least squares principle was introduced by Legendre and Gauss. NOTES ..6.and Msteps of the EM algorithm for estimating (ftl. That is. Establish part (b) of Theorem 2. Complete the E.. A. . .p is the N(O} '1) density. . Verify the fonnula given in Example 2.~ go. For example.158 Methods of Estimation Chapter 2 and r.{Xj }. in Example 2. 2 ).5. al = a2 = 1 and p = 0. ! .. we observe only}li. {32). . the process determining whether X j is missing is independent of X j . if a is sufficiently large.) underpredicts Y. Hint: Show that {(19 m. the "missingness" of Vi is independent of Yi. For instance. Show for 11 = 1 that bisection may lead to a local maximum of the likelihood.4. Establish the last claim in part (2) of the proof of Theorem 2. If /12 = 1.n}. 18. consider the model I I: ! I I = J31 + J32 Z i + Ei I are i. this assumption may not be satisfied. I I' ."" En. thus. = {(Zil Yi) Yi :i = 1.6 . Zn are ii. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected.4. suppose Yi is missing iff Vi < 2.5.i. {31 1 a~. (J*) and necessarily g. and independent of El. These considerations lead essentially to maximum likelihood estimates. Hint: Show that {( 8m . 
necessarily 0* is the global maximizer. ~ . Bm + I)} has a subsequence converging to (B* . suppose all subjects with Yi > 2 drop out of the study.d.3. If Vi represents the seriousness of a disease.. • . find the probability that E(Y.4. I Z. For X . Bm+d} has a subsequence converging to (0" JJ*) and. That is. the probability that Vi is missing may depend on Zi.4. R.d. given X .. Limitations a/the EM Algorithm. . ~ 16. 14. EM and Regression.. En a an 2. N(ftl. given Zi. This condition is called missing at random. . Zl. Suppose that for 1 < i < m we observe both Zi and Yi and for m + 1 < i < n. where E) . a 2 . Hint: Use the canonical nature of the family and openness of E. Noles for Section 2.
199200 (1985). M. Acad.. "Y.. T. P[T(X) E Al o for all or for no PEP. CAMPBELL. "The Meaning of Least in Least Squares.. La. 1972. a multivariate version of minimum contrasts estimates are often called generalized method of moment estimates. 1975.Section 2. Fisher 1950) New York: J.7 REFERENCES BARLOW. W. GIlBELS. R. Statistical Inference Under Order Restrictions New York: Wiley. VAN LOAN. Discrete Multivariate Analysis: Theory and Practice Cambridge. DEMPSTER.. T. see Cover and Thomas (1991). BAUM.5 (1) In the econometrics literature (e. BJORK. 2. ANOH. E. G.. Elements oflnfonnation Theory New York: Wiley. c. WEISS. C. DHARMADHlKARI. 2433 (1964).loAGDEV. AND P. S. and MacKinlay. t 996. F.g. "On the Mathematical Foundations of Theoretical Statistics. G.2. S. Note for Section 2. for any A. GoLUB. 1922. HABERMAN. M. J. La. A. LAIRD. Roy. "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm. Appendix A. ANDERSON. M. A. Campbell. M.. M. R. Numerical Analysis New York: Prentice Hall. SOULES. 1. BARTHOLOMEW. COVER. L. AND A. E. 1991. B. RUBIN. Matrix Computations Baltimore: John Hopkins University Press.2 (I) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964). 1997. M. 41. 39. D. AND N. MA: MIT Press. AND D." J. J." Journal Wash. BREMNER. Soc. G. Wiley and Sons. . A." The American Statistician. (2) For further properties of KullbackLeibler divergence.. BRUNK. Local Polynomial Modelling and Its Applications London: Chapman and Hall.. AND C.. Sciences. 1974. A. AND J. BISHOP. AND K. 54. HOLLAND. 0. 1985. "Y. H.." reprinted in Contributions to MatheltUltical Statistics (by R. NJ: Princeton University Press. THOMAS. AND N. A.1974. MACKINLAY. FAN. B. "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains:' Am!. PETRIE.• AND I. EiSENHART. DAHLQUIST... A. Statist. FEINBERG. 1997). Math. "Examples of Nonunique Maximum Likelihood Estimators. 
The Analysis ofFrequency Data Chicago: University of Chicago Press.7 References 159 Notes for Section 2.. S. Statist. FISHER. E. 138 (1977). 164171 (1970).1. The Econometrics ofFinancial Markets Princeton. W. Note for Section 2.3 (I) Recall that in an exponential family. J.
KOLMOGOROV, A. N., "On the Shannon Theory of Information Transmission in the Case of Continuous Signals," IRE Trans. Inform. Theory, IT-2, 102-108 (1956).

LITTLE, R. J. A., AND D. B. RUBIN, Statistical Analysis with Missing Data New York: J. Wiley, 1987.

MACLACHLAN, G. J., AND T. KRISHNAN, The EM Algorithm and Extensions New York: Wiley, 1997.

MOSTELLER, F., "Association and Estimation in Contingency Tables," J. Amer. Statist. Assoc., 63, 1-28 (1968).

RICHARDS, F. J., "A Flexible Growth Function for Empirical Use," J. Exp. Botany, 10, 290-300 (1959).

RUPPERT, D., AND M. P. WAND, "Multivariate Locally Weighted Least Squares Regression," Ann. Statist., 22, 1346-1370 (1994).

SEBER, G. A. F., AND C. J. WILD, Nonlinear Regression New York: Wiley, 1989.

SHANNON, C. E., "A Mathematical Theory of Communication," Bell System Tech. Journal, 27, 379-423, 623-656 (1948).

SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 6th ed. Ames, IA: Iowa State University Press, 1967.

STIGLER, S., The History of Statistics Cambridge, MA: Harvard University Press, 1986.

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.

WU, C. F. J., "On the Convergence Properties of the EM Algorithm," Ann. Statist., 11, 95-103 (1983).
Chapter 3

MEASURES OF PERFORMANCE, NOTIONS OF OPTIMALITY, AND OPTIMAL PROCEDURES

3.1 INTRODUCTION

Here we develop the theme of Section 1.3, which is how to appraise and select among decision procedures. In Sections 3.2 and 3.3 we show how the important Bayes and minimax criteria can in principle be implemented. However, actual implementation is limited. Our examples are primarily estimation of a real parameter. In Section 3.4, we study, in the context of estimation, the relation of the two major decision theoretic principles to the non-decision theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness. We also discuss other desiderata that strongly compete with decision theoretic optimality, in particular computational simplicity and robustness. We return to these themes in Chapter 6, after similarly discussing testing and confidence bounds, in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case.

3.2 BAYES PROCEDURES

Recall from Section 1.3 that if we specify a parametric model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, action space $\mathcal{A}$, loss function $l(\theta, a)$, then for data $X \sim P_\theta$ and any decision procedure $\delta$ randomized or not we can define its risk function, $R(\cdot, \delta) : \Theta \to R^{+}$ by

$$R(\theta, \delta) = E_\theta\, l(\theta, \delta(X)).$$

We think of $R(\cdot, \delta)$ as measuring a priori the performance of $\delta$ for this model. Strict comparison of $\delta_1$ and $\delta_2$ on the basis of the risks alone is not well defined unless $R(\theta, \delta_1) \le R(\theta, \delta_2)$ for all $\theta$ or vice versa. However, by introducing a Bayes prior density (say) $\pi$ for $\theta$ comparison becomes unambiguous by considering the scalar Bayes risk,

$$r(\pi, \delta) = E R(\theta, \delta) = E\, l(\theta, \delta(X)), \qquad (3.2.1)$$
where $(\theta, X)$ is given the joint distribution specified by (1.2.3). Recall also that we can define

$$R(\pi) = \inf\{r(\pi, \delta) : \delta \in \mathcal{D}\}, \tag{3.2.2}$$

the Bayes risk of the problem, and that in Section 1.3 we showed how, in an example, we could identify the Bayes rules $\delta^*_\pi$ such that

$$r(\pi, \delta^*_\pi) = R(\pi). \tag{3.2.3}$$

In this section we shall show systematically how to construct Bayes rules. This exercise is interesting and important even if we do not view $\pi$ as reflecting an implicitly believed in prior distribution on $\theta$. After all, if $\pi$ is a density and $\Theta \subset R$,

$$r(\pi, \delta) = \int R(\theta, \delta)\,\pi(\theta)\,d\theta, \tag{3.2.4}$$

and $\pi$ may express that we care more about the values of the risk in some rather than other regions of $\Theta$. For testing problems the hypothesis is often treated as more important than the alternative. We may have vague prior notions such as "$|\theta| > 5$ is physically implausible" if, for instance, $\theta$ denotes mean height of people in meters. If $\pi$ is then thought of as a weight function roughly reflecting our knowledge, it is plausible that $\delta^*_\pi$, if computable, will behave reasonably even if our knowledge is only roughly right. Clearly, $\pi(\theta) \equiv c$ plays a special role ("equal weight"), though (Problem 3.2.4) the parametrization plays a crucial role here. It is in fact clear that prior and loss function cannot be separated out clearly either. Thus, considering $l_1(\theta, a)$ and $\pi_1(\theta)$ is equivalent to considering $l_2(\theta, a) = \pi_1(\theta)\, l_1(\theta, a)$ and $\pi_2(\theta) \equiv 1$. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). We don't pursue them further except in Problem 3.2.5, and instead turn to construction of Bayes procedures.

We first consider the problem of estimating $q(\theta)$ with quadratic loss, $l(\theta, a) = (q(\theta) - a)^2$, using a nonrandomized decision rule $\delta$. Suppose $\theta$ is a random variable (or vector) with (prior) frequency function or density $\pi(\theta)$. Our problem is to find the function $\delta$ of $X$ that minimizes $r(\pi, \delta) = E(q(\theta) - \delta(X))^2$. This is just the problem of finding the best mean squared prediction error (MSPE) predictor of $q(\theta)$ given $X$ (see Remark 1.4.5). Using our results on MSPE prediction, we find that either $r(\pi, \delta) = \infty$ for all $\delta$ or the Bayes rule $\delta^*$ is given by

$$\delta^*(X) = E[q(\theta) \mid X]. \tag{3.2.5}$$

This procedure is called the Bayes estimate for squared error loss. In view of formulae (1.2.8) for the posterior density and frequency functions, we can give the Bayes estimate a more explicit form. In the continuous case with $\theta$ real valued and prior density $\pi$,

$$\delta^*(x) = \frac{\int_{-\infty}^{\infty} q(\theta)\, p(x \mid \theta)\, \pi(\theta)\, d\theta}{\int_{-\infty}^{\infty} p(x \mid \theta)\, \pi(\theta)\, d\theta}. \tag{3.2.6}$$

In the discrete case, as usual, we just need to replace the integrals by sums. Here is an example.
Example 3.2.1. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior. Suppose that we want to estimate the mean $\theta$ of a normal distribution with known variance $\sigma^2$ on the basis of a sample $X_1, \ldots, X_n$. If we choose the conjugate prior $N(\eta_0, \tau^2)$ as in Example 1.6.12, we obtain the posterior distribution

$$N\!\left( \frac{\eta_0/\tau^2 + n\bar{x}/\sigma^2}{n/\sigma^2 + 1/\tau^2},\ \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1} \right).$$

The Bayes estimate is just the mean of the posterior distribution,

$$\delta^*(X) = \eta_0 \left[\frac{1/\tau^2}{n/\sigma^2 + 1/\tau^2}\right] + \bar{X} \left[\frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}\right]. \tag{3.2.7}$$

Its Bayes risk (the MSPE of the predictor) is just

$$r(\pi, \delta^*) = E(\theta - E(\theta \mid X))^2 = E[E((\theta - E(\theta \mid X))^2 \mid X)] = E\!\left[\frac{\sigma^2/n}{1 + \sigma^2/(n\tau^2)}\right] = \frac{1}{n/\sigma^2 + 1/\tau^2}.$$

No finite choice of $\eta_0$ and $\tau^2$ will lead to $\bar{X}$ as a Bayes estimate. But $\bar{X}$ is the limit of such estimates as prior knowledge becomes "vague" ($\tau \to \infty$ with $\eta_0$ fixed). In fact, $\bar{X}$ is the estimate that (3.2.6) yields if we substitute the prior "density" $\pi(\theta) \equiv 1$ (Problem 3.2.1). Such priors with $\int \pi(\theta)\,d\theta = \infty$ or $\sum \pi(\theta) = \infty$ are called improper. The resulting Bayes procedures are also called improper.

Formula (3.2.7) reveals the Bayes estimate in the proper case to be a weighted average

$$w\eta_0 + (1 - w)\bar{X}$$

of the estimate to be used when there are no observations, that is, $\eta_0$, and $\bar{X}$, with weights inversely proportional to the Bayes risks of these two estimates. Because the Bayes risk of $\bar{X}$, $\sigma^2/n$, tends to 0 as $n \to \infty$, the Bayes estimate corresponding to the prior density $N(\eta_0, \tau^2)$ differs little from $\bar{X}$ for $n$ large. In fact, $\bar{X}$ is approximately a Bayes estimate for any one of these prior distributions in the sense that $[r(\pi, \bar{X}) - r(\pi, \delta^*)]/r(\pi, \delta^*) \to 0$ as $n \to \infty$. For more on this, see Section 5.5. □

We now turn to the problem of finding Bayes rules for general action spaces $A$ and loss functions $l$. To begin with we consider only nonrandomized rules. If we look at the proof of Theorem 1.4.1, we see that the key idea is to consider what we should do given $X = x$. Thus, $E(Y \mid X)$ is the best predictor because $E(Y \mid X = x)$ minimizes the conditional MSPE $E((Y - a)^2 \mid X = x)$ as a function of the action $a$. Applying the same idea in the general Bayes decision problem, we form the posterior risk

$$r(a \mid x) = E(l(\theta, a) \mid X = x).$$

This quantity $r(a \mid x)$ is what we expect to lose if $X = x$ and we use action $a$. Intuitively, we should, for each $x$, take that action $a = \delta^*(x)$ that makes $r(a \mid x)$ as small as possible. This action need not exist nor be unique if it does exist.
However, we have the following result.

Proposition 3.2.1. Suppose that there exists a function $\delta^*(x)$ such that

$$r(\delta^*(x) \mid x) = \inf\{r(a \mid x) : a \in A\}. \tag{3.2.8}$$

Then $\delta^*$ is a Bayes rule.

Proof. As in the proof of Theorem 1.4.1, we obtain for any $\delta$

$$r(\pi, \delta) = E[l(\theta, \delta(X))] = E[E(l(\theta, \delta(X)) \mid X)]. \tag{3.2.9}$$

But, by (3.2.8),

$$E[l(\theta, \delta(X)) \mid X = x] = r(\delta(x) \mid x) \ge r(\delta^*(x) \mid x) = E[l(\theta, \delta^*(X)) \mid X = x].$$

Therefore,

$$E[l(\theta, \delta(X)) \mid X] \ge E[l(\theta, \delta^*(X)) \mid X],$$

and the result follows from (3.2.9). □

As a first illustration, consider the oil-drilling example (Example 1.3.5) with prior $\pi(\theta_1) = 0.2$, $\pi(\theta_2) = 0.8$. Suppose we observe $x = 0$. Then the posterior distribution of $\theta$ is, by (1.2.8),

$$\pi(\theta_1 \mid X = 0) = \tfrac{1}{9}, \qquad \pi(\theta_2 \mid X = 0) = \tfrac{8}{9}.$$

Thus, the posterior risks of the actions $a_1$, $a_2$, and $a_3$ are

$$r(a_1 \mid 0) = \tfrac{1}{9}\, l(\theta_1, a_1) + \tfrac{8}{9}\, l(\theta_2, a_1) = 10.67, \qquad r(a_2 \mid 0) = 2, \qquad r(a_3 \mid 0) = 5.89.$$

Therefore, $a_2$ has the smallest posterior risk and, if $\delta^*$ is the Bayes rule, $\delta^*(0) = a_2$. Similarly,

$$r(a_1 \mid 1) = 8.35, \qquad r(a_2 \mid 1) = 3.74, \qquad r(a_3 \mid 1) = 5.70,$$

and we conclude that $\delta^*(1) = a_2$. Therefore, $\delta^* = \delta_5$ as we found previously. The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all competing procedures.

More generally consider the following class of situations.

Example 3.2.2. Bayes Procedures When $\Theta$ and $A$ Are Finite. Let $\Theta = \{\theta_0, \ldots, \theta_p\}$, $A = \{a_0, \ldots, a_q\}$, let $w_{ij} \ge 0$ be given constants, and let the loss incurred when $\theta_i$ is true and action $a_j$ is taken be given by

$$l(\theta_i, a_j) = w_{ij}.$$
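Let $\pi(\theta)$ be a prior distribution assigning mass $\pi_i$ to $\theta_i$, so that $\pi_i \ge 0$, $i = 0, \ldots, p$, and $\sum_{i=0}^{p} \pi_i = 1$. Suppose, moreover, that $X$ has density or frequency function $p(x \mid \theta)$ for each $\theta$. Then, by (1.2.8), the posterior probabilities are

$$P[\theta = \theta_i \mid X = x] = \frac{\pi_i\, p(x \mid \theta_i)}{\sum_j \pi_j\, p(x \mid \theta_j)}$$

and, thus,

$$r(a_j \mid x) = \frac{\sum_i w_{ij}\, \pi_i\, p(x \mid \theta_i)}{\sum_i \pi_i\, p(x \mid \theta_i)}. \tag{3.2.10}$$

The optimal action $\delta^*(x)$ has

$$r(\delta^*(x) \mid x) = \min_{0 \le j \le q} r(a_j \mid x).$$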
Here are two interesting specializations.

(a) Classification: Suppose that $p = q$, we identify $a_j$ with $\theta_j$, $j = 0, \ldots, p$, and let

$$w_{ij} = 1, \quad i \ne j, \qquad w_{ii} = 0.$$

This can be thought of as the classification problem in which we have $p + 1$ known disjoint populations and a new individual $X$ comes along who is to be classified in one of these categories. In this case,

$$r(\theta_i \mid x) = P[\theta \ne \theta_i \mid X = x]$$

and minimizing $r(\theta_i \mid x)$ is equivalent to the reasonable procedure of maximizing the posterior probability,

$$P[\theta = \theta_i \mid X = x] = \frac{\pi_i\, p(x \mid \theta_i)}{\sum_j \pi_j\, p(x \mid \theta_j)}.$$

(b) Testing: Suppose $p = q = 1$, $\pi_0 = \pi$, $\pi_1 = 1 - \pi$, $0 < \pi < 1$, $a_0$ corresponds to deciding $\theta = \theta_0$ and $a_1$ to deciding $\theta = \theta_1$. This is a special case of the testing formulation of Section 1.3 with $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$. The Bayes rule is then to

decide $\theta = \theta_1$ if $(1 - \pi)\, p(x \mid \theta_1) > \pi\, p(x \mid \theta_0)$,
decide $\theta = \theta_0$ if $(1 - \pi)\, p(x \mid \theta_1) < \pi\, p(x \mid \theta_0)$,

and decide either $a_0$ or $a_1$ if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing between $a_0$ and $a_1$ if equality occurs. As we let $\pi$ vary between zero and one, we obtain what is called the class of Neyman-Pearson tests, which provides the solution to the problem of minimizing $P$(type II error) given $P$(type I error) $\le \alpha$. This is treated further in Chapter 4. □
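As a numerical check on (3.2.10), the following Python sketch computes posterior risks and the minimizing action when $\Theta$ and $A$ are finite. The function names are ours. The loss table and the likelihoods $p(x \mid \theta)$ are assumed to be those of the oil-drilling example (Example 1.3.5), which is not reproduced in this section; with these assumed values the code gives the posterior risks quoted just before Example 3.2.2.

```python
def posterior(prior, lik_x):
    """Posterior probabilities pi_i p(x|theta_i) / sum_j pi_j p(x|theta_j)."""
    joint = [p * l for p, l in zip(prior, lik_x)]
    total = sum(joint)
    return [j / total for j in joint]


def posterior_risks(prior, lik_x, loss):
    """r(a_j | x) = sum_i w_ij * P[theta = theta_i | X = x], as in (3.2.10)."""
    post = posterior(prior, lik_x)
    n_actions = len(loss[0])
    return [sum(post[i] * loss[i][j] for i in range(len(post)))
            for j in range(n_actions)]


def bayes_action(prior, lik_x, loss):
    """Index j of an action minimizing the posterior risk r(a_j | x)."""
    risks = posterior_risks(prior, lik_x, loss)
    return min(range(len(risks)), key=risks.__getitem__)


# Assumed inputs (taken to match Example 1.3.5; not stated in this section).
prior = [0.2, 0.8]                    # pi(theta_1), pi(theta_2)
loss = [[0, 10, 5],                   # l(theta_1, a_1), l(theta_1, a_2), l(theta_1, a_3)
        [12, 1, 6]]                   # l(theta_2, a_1), l(theta_2, a_2), l(theta_2, a_3)
lik = {0: [0.3, 0.6],                 # p(x = 0 | theta_1), p(x = 0 | theta_2)
       1: [0.7, 0.4]}                 # p(x = 1 | theta_1), p(x = 1 | theta_2)

for x in (0, 1):
    risks = posterior_risks(prior, lik[x], loss)
    print(x, [round(r, 2) for r in risks], "best action index:",
          bayes_action(prior, lik[x], loss))
```

Only the posterior for the observed $x$ is needed, so the Bayes risks of the competing decision rules never have to be enumerated.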
To complete our illustration of the utility of Proposition 3.2.1, we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.

Example 3.2.3. Bayes Estimation of the Probability of Success in $n$ Bernoulli Trials. Suppose that we wish to estimate $\theta$ using $X_1, \ldots, X_n$, the indicators of $n$ Bernoulli trials with probability of success $\theta$. We shall consider the loss function $l$ given by

$$l(\theta, a) = \frac{(\theta - a)^2}{\theta(1 - \theta)}, \quad 0 < \theta < 1, \ a \text{ real}. \tag{3.2.11}$$

This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for $\theta$ close to zero, this $l(\theta, a)$ is close to the relative squared error $(\theta - a)^2/\theta$. It makes $\bar{X}$ have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5.

By sufficiency we need only consider the number of successes, $S$. Suppose now that we have a prior distribution. Then, if all terms on the right-hand side are finite,

$$r(a \mid k) = E\!\left[\frac{(\theta - a)^2}{\theta(1 - \theta)} \,\Big|\, S = k\right] = E\!\left[\frac{\theta}{1 - \theta} \,\Big|\, S = k\right] - 2a\, E\!\left[\frac{1}{1 - \theta} \,\Big|\, S = k\right] + a^2 E\!\left[\frac{1}{\theta(1 - \theta)} \,\Big|\, S = k\right]. \tag{3.2.12}$$

Minimizing this parabola in $a$, we find our Bayes procedure is given by

$$\delta^*(k) = \frac{E(1/(1 - \theta) \mid S = k)}{E(1/\theta(1 - \theta) \mid S = k)}, \tag{3.2.13}$$

provided the denominator is not zero. For convenience let us now take as prior density the density $b_{r,s}(\theta)$ of the beta distribution $\beta(r, s)$. In Example 1.2.1 we showed that this leads to a $\beta(k + r, n + s - k)$ posterior distribution for $\theta$ if $S = k$. If $1 \le k \le n - 1$ and $n \ge 2$, then all quantities in (3.2.12) and (3.2.13) are finite, and

$$\delta^*(k) = \frac{\int_0^1 (1/(1 - \theta))\, b_{k+r,\,n-k+s}(\theta)\, d\theta}{\int_0^1 (1/\theta(1 - \theta))\, b_{k+r,\,n-k+s}(\theta)\, d\theta} = \frac{B(k + r,\ n - k + s - 1)}{B(k + r - 1,\ n - k + s - 1)} = \frac{k + r - 1}{n + s + r - 2}, \tag{3.2.14}$$

where we are using the notation B.2.11 of Appendix B. If $k = 0$, it is easy to see that $a = 0$ is the only $a$ that makes $r(a \mid k) < \infty$. Thus, $\delta^*(0) = 0$. Similarly, we get $\delta^*(n) = 1$. If we assume a uniform prior density ($r = s = 1$), we see that the Bayes procedure is the usual estimate, $\bar{X}$. This is not the case for quadratic loss (see Problem 3.2.2). □
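The ratio of beta functions in (3.2.14) can be checked numerically. The sketch below is ours, not part of the original development: it evaluates the two posterior expectations in (3.2.13) through log-gamma functions, using the standard identity $E[\theta^{u}(1-\theta)^{v}] = B(a+u, b+v)/B(a, b)$ for a $\beta(a, b)$ distribution, and compares the result with the closed form $(k + r - 1)/(n + s + r - 2)$. The function names are hypothetical.

```python
import math


def log_beta(a, b):
    """log B(a, b) for a, b > 0, via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bayes_estimate_weighted(k, n, r=1.0, s=1.0):
    """Ratio (3.2.13) under a beta(r, s) prior, valid for 1 <= k <= n - 1."""
    if not 1 <= k <= n - 1:
        raise ValueError("the ratio form requires 1 <= k <= n - 1")
    a, b = k + r, n - k + s                      # posterior is beta(a, b)
    # E[1/(1-theta)]           = B(a, b-1)   / B(a, b)
    # E[1/(theta (1-theta))]   = B(a-1, b-1) / B(a, b)
    num = math.exp(log_beta(a, b - 1) - log_beta(a, b))
    den = math.exp(log_beta(a - 1, b - 1) - log_beta(a, b))
    return num / den


def closed_form(k, n, r=1.0, s=1.0):
    """Right-hand side of (3.2.14)."""
    return (k + r - 1) / (n + s + r - 2)


print(bayes_estimate_weighted(3, 10), closed_form(3, 10))            # uniform prior
print(bayes_estimate_weighted(3, 10, 2.0, 3.0), closed_form(3, 10, 2.0, 3.0))
```

With $r = s = 1$ both functions return $k/n$, the usual estimate.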
"Real" computation of Bayes procedures
The closed forms of (3.2.6) and (3.2.10) make the computation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that $\theta = (\theta_1, \ldots, \theta_p)$ has a hierarchically defined prior density,

$$\pi(\theta_1, \theta_2, \ldots, \theta_p) = \pi_1(\theta_1)\, \pi_2(\theta_2 \mid \theta_1) \cdots \pi_p(\theta_p \mid \theta_{p-1}). \tag{3.2.15}$$

Here is an example.

Example 3.2.4. The random effects model we shall study in Volume II has

$$X_{ij} = \mu + \Delta_i + \epsilon_{ij}, \quad 1 \le i \le I, \ 1 \le j \le J, \tag{3.2.16}$$

where the $\epsilon_{ij}$ are i.i.d. $N(0, \sigma_e^2)$, and $\mu$ and the vector $\Delta = (\Delta_1, \ldots, \Delta_I)$ are independent of $\{\epsilon_{ij} : 1 \le i \le I,\ 1 \le j \le J\}$ with $\Delta_1, \ldots, \Delta_I$ i.i.d. $N(0, \sigma_\Delta^2)$ and $\mu \sim N(\mu_0, \sigma_\mu^2)$. Here the $X_{ij}$ can be thought of as measurements on individual $i$ and $\Delta_i$ is an "individual" effect. If we now put a prior distribution on $(\mu, \sigma_e^2, \sigma_\Delta^2)$ making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by $\theta = (\mu, \sigma_e^2, \sigma_\Delta^2, \Delta_1, \ldots, \Delta_I)$ with the $X_{ij} \mid \theta$ independent $N(\mu + \Delta_i, \sigma_e^2)$. Then $p(x \mid \theta) = \prod_{i,j} \varphi_{\sigma_e}(x_{ij} - \mu - \Delta_i)$ and

$$\pi(\theta) = \pi_1(\mu)\, \pi_2(\sigma_e^2)\, \pi_3(\sigma_\Delta^2) \prod_{i=1}^{I} \varphi_{\sigma_\Delta}(\Delta_i), \tag{3.2.17}$$

where $\varphi_\sigma$ denotes the $N(0, \sigma^2)$ density.

In such a context a loss function frequently will single out some single coordinate $\theta_s$ (e.g., $\Delta_1$ in (3.2.17)), and to compute $r(a \mid x)$ we will need the posterior distribution of $\Delta_1 \mid x$. But this is obtainable from the posterior distribution of $\theta$ given $X = x$ only by integrating out $\theta_j$, $j \ne s$, and if $p$ is large this is intractable. In recent years so-called Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable and the use of Bayesian methods has spread. We return to the topic in Volume II. □
Linear Bayes estimates
When the problem of computing $r(\pi, \delta)$ and $\delta_\pi$ is daunting, an alternative is to consider a class $\tilde{\mathcal{D}}$ of procedures for which $r(\pi, \delta)$ is easy to compute and then to look for $\tilde{\delta}_\pi \in \tilde{\mathcal{D}}$ that minimizes $r(\pi, \delta)$ for $\delta \in \tilde{\mathcal{D}}$. An example is linear Bayes estimates where, in the case of squared error loss $[q(\theta) - a]^2$, the problem is equivalent to minimizing the mean squared prediction error among functions of the form $a + \sum_j b_j X_j$. If in (1.4.14) we identify $q(\theta)$ with $Y$ and $X$ with $Z$, the solution is

$$\tilde{\delta}(X) = E q(\theta) + [X - E(X)]^T \beta,$$

where $\beta$ is as defined in Section 1.4. For example, if in the model (3.2.16), (3.2.17) we set $q(\theta) = \Delta_1$, we can find the linear Bayes estimate of $\Delta_1$ by using 1.4.6 and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of $\Delta_1$ is

$$\delta_L(X) = E(\Delta_1) + [X - \mu]^T \beta, \tag{3.2.18}$$

where $E(\Delta_1) = 0$, $X = (X_{11}, \ldots, X_{1J})^T$, $\mu = E(X)$, and $\beta = \Sigma_{XX}^{-1} \Sigma_{X\Delta_1}$. For the given model,

$$\operatorname{Var}(X_{1j}) = E \operatorname{Var}(X_{1j} \mid \theta) + \operatorname{Var} E(X_{1j} \mid \theta) = E(\sigma_e^2) + \sigma_\mu^2 + \sigma_\Delta^2,$$

$$\operatorname{Cov}(X_{1j}, X_{1k}) = E \operatorname{Cov}(X_{1j}, X_{1k} \mid \theta) + \operatorname{Cov}(E(X_{1j} \mid \theta), E(X_{1k} \mid \theta)) = 0 + \operatorname{Cov}(\mu + \Delta_1, \mu + \Delta_1) = \sigma_\mu^2 + \sigma_\Delta^2, \quad j \ne k,$$

$$\operatorname{Cov}(\Delta_1, X_{1j}) = E \operatorname{Cov}(X_{1j}, \Delta_1 \mid \theta) + \operatorname{Cov}(E(X_{1j} \mid \theta), E(\Delta_1 \mid \theta)) = 0 + \sigma_\Delta^2 = \sigma_\Delta^2.$$

From these calculations we find $\beta$ and $\delta_L(X)$. We leave the details to Problem 3.2.10.

Linear Bayes procedures are useful in actuarial science; see, for example, Bühlmann (1970) and Norberg (1986).
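To make the last step concrete, here is a small Python sketch, ours and offered only as an illustration, that assembles $\Sigma_{XX}$ and $\Sigma_{X\Delta_1}$ from the three moment formulas for $X = (X_{11}, \ldots, X_{1J})^T$ and solves for $\beta$. The argument `var_e` stands for $E(\sigma_e^2)$, `mu0` for the prior mean of $\mu$ (so that $E(X_{1j}) = \mu_0$), and all names and numerical inputs are hypothetical. A plain Gaussian elimination is used to keep the example dependency free.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]      # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def linear_bayes_delta1(x1, mu0, var_e, var_mu, var_delta):
    """Linear Bayes estimate of Delta_1 from x1 = (x_11, ..., x_1J), as in (3.2.18)."""
    J = len(x1)
    # Var(X_1j) = E(sigma_e^2) + sigma_mu^2 + sigma_Delta^2 on the diagonal,
    # Cov(X_1j, X_1k) = sigma_mu^2 + sigma_Delta^2 off the diagonal.
    sigma_xx = [[var_mu + var_delta + (var_e if j == k else 0.0)
                 for k in range(J)] for j in range(J)]
    sigma_xd = [var_delta] * J                        # Cov(Delta_1, X_1j)
    beta = solve(sigma_xx, sigma_xd)
    estimate = sum(b * (x - mu0) for b, x in zip(beta, x1))   # E(Delta_1) = 0
    return estimate, beta


estimate, beta = linear_bayes_delta1([5.0, 6.0, 7.0, 8.0], mu0=5.0,
                                     var_e=1.0, var_mu=2.0, var_delta=3.0)
print(estimate, beta)
```

Because $\Sigma_{XX}$ is exchangeable here, every coordinate of $\beta$ equals $\sigma_\Delta^2 / (E(\sigma_e^2) + J(\sigma_\mu^2 + \sigma_\Delta^2))$, which the numerical solution should reproduce.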
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier, the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is the (usually improper) prior $\pi(\theta) \equiv c$. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than computing modes, and that again accounts in part for the popularity of maximum likelihood.

An important property of the MLE is equivariance: An estimating method $M$ producing the estimate $\hat{\theta}_M$ is said to be equivariant with respect to reparametrization if for every one-to-one function $h$ from $\Theta$ to $\Omega = h(\Theta)$, the estimate of $\omega \equiv h(\theta)$ is $\hat{\omega}_M = h(\hat{\theta}_M)$; that is, $\widehat{(h(\theta))}_M = h(\hat{\theta}_M)$. In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss, then the Bayes procedure $\hat{\theta}_B = E(\theta \mid X)$ is not equivariant for nonlinear transformations because

$$E(h(\theta) \mid X) \ne h(E(\theta \mid X))$$

for nonlinear $h$ (e.g., Problem 3.2.3).

The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is

$$r_\Theta(a \mid x) = \sum_{\theta \in \Theta} [\theta - a]^2\, \pi(\theta \mid x). \tag{3.2.19}$$

If we set $\omega = h(\theta)$ for $h$ one-to-one onto $\Omega = h(\Theta)$, then $\omega$ has prior $\lambda(\omega) \equiv \pi(h^{-1}(\omega))$ and in the $\omega$ parametrization, the posterior Bayes risk is

$$r_\Omega(a \mid x) = \sum_{\omega \in \Omega} [\omega - a]^2\, \lambda(\omega \mid x) = \sum_{\theta \in \Theta} [h(\theta) - a]^2\, \pi(\theta \mid x). \tag{3.2.20}$$

Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, $r_\Omega(a \mid x) \ne r_\Theta(h^{-1}(a) \mid x)$.
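A three-point posterior is enough to see the failure of equivariance numerically. In the sketch below, which is ours, the support points, posterior weights, and the transformation $h(\theta) = \theta^2$ (one-to-one on the positive support chosen) are all made up for illustration.

```python
def posterior_mean(values, probs, h=None):
    """E[h(theta) | x] for a discrete posterior; h defaults to the identity."""
    if h is None:
        return sum(v * p for v, p in zip(values, probs))
    return sum(h(v) * p for v, p in zip(values, probs))


def square(t):
    """The reparametrization omega = h(theta) = theta^2."""
    return t * t


thetas = [0.1, 0.5, 0.9]          # hypothetical support of the posterior
post = [0.2, 0.5, 0.3]            # hypothetical posterior probabilities

theta_bayes = posterior_mean(thetas, post)            # E(theta | x)
omega_bayes = posterior_mean(thetas, post, square)    # E(h(theta) | x)
plug_in = square(theta_bayes)                         # h(E(theta | x))

print(theta_bayes, omega_bayes, plug_in)
```

For $h(\theta) = \theta^2$ the gap `omega_bayes - plug_in` is exactly the posterior variance of $\theta$, so the two estimates agree only when the posterior is degenerate.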
Loss functions of the form $l(\theta, a) = Q(P_\theta, P_a)$ are necessarily equivariant. The Kullback-Leibler divergence $K(\theta, a)$, $\theta, a \in \Theta$, is an example of such a loss function. It satisfies $K_\Omega(\omega, a) = K_\Theta(\theta, h^{-1}(a))$; thus, with this loss function,

$$r_\Omega(a \mid x) = r_\Theta(h^{-1}(a) \mid x).$$

See Problem 2.2.38. In the discrete case using $K$ means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.10)). In the $N(\theta, \sigma_0^2)$ case the KL (Kullback-Leibler) loss $K(\theta, a)$ is $\frac{n}{2\sigma_0^2}(a - \theta)^2$ (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families

$$K(\eta, a) = \sum_{j=1}^{k} [\eta_j - a_j]\, E_\eta T_j + A(a) - A(\eta).$$

Moreover, if we can find the KL loss Bayes estimate $\hat{\eta}_{BKL}$ of the canonical parameter $\eta$ and if $\eta \equiv c(\theta) : \Theta \to \mathcal{E}$ is one-to-one, then the KL loss Bayes estimate of $\theta$ in the general exponential family is $\hat{\theta}_{BKL} = c^{-1}(\hat{\eta}_{BKL})$.

For instance, in Example 3.2.1 where $\mu$ is the mean of a normal distribution and the prior is normal, we found the squared error Bayes estimate $\hat{\mu}_B = w\eta_0 + (1 - w)\bar{X}$, where $\eta_0$ is the prior mean and $w$ is a weight. Because the KL loss is equivalent to squared error for the canonical parameter $\mu$, if $\omega = h(\mu)$, then $\hat{\omega}_{BKL} = h(\hat{\mu}_{BKL})$, where $\hat{\mu}_{BKL} = w\eta_0 + (1 - w)\bar{X}$.

Bayes procedures based on the Kullback-Leibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume II.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1969; Lindley, 1965; Savage, 1954) who argue on normative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior $\pi$. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally, as we consider all possible priors, that the class $\mathcal{D}_0$ of Bayes procedures and their limits is complete in the sense that for any $\delta \in \mathcal{D}$ there is a $\delta_0 \in \mathcal{D}_0$ such that $R(\theta, \delta_0) \le R(\theta, \delta)$ for all $\theta$.

Summary. We show how Bayes procedures can be obtained for certain problems by computing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures requires sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
3.3 MINIMAX PROCEDURES

In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worst-case analysis; the true $\theta$ is one that is as "hard" as possible. That is, $\delta_1$ is better than $\delta_2$ from a minimax point of view if $\sup_\theta R(\theta, \delta_1) < \sup_\theta R(\theta, \delta_2)$, and $\delta^*$ is said to be minimax if

$$\sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta).$$

Here $\theta$ and $\delta$ are taken to range over $\Theta$ and $\mathcal{D} = \{$all possible decision procedures (possibly randomized)$\}$ while $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. It is fruitful to consider proper subclasses of $\mathcal{D}$ and subsets of $\mathcal{P}$, but we postpone this discussion.

The nature of this criterion and its relation to Bayesian optimality is clarified by considering a so-called zero sum game played by two players $N$ (Nature) and $S$ (the statistician). The statistician has at his or her disposal the set $\mathcal{D}$ of all randomized decision procedures whereas Nature has at her disposal all prior distributions $\pi$ on $\Theta$. For the basic game, $S$ picks $\delta$ without $N$'s knowledge, $N$ picks $\pi$ without $S$'s knowledge, and then all is revealed and $S$ pays $N$

$$r(\pi, \delta) = \int R(\theta, \delta)\, d\pi(\theta),$$

where the notation $\int R(\theta, \delta)\, d\pi(\theta)$ stands for $\int R(\theta, \delta)\, \pi(\theta)\, d\theta$ in the continuous case and $\sum_j R(\theta_j, \delta)\, \pi(\theta_j)$ in the discrete case. $S$ tries to minimize his or her loss, $N$ to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are attained. There are two related partial information games that are important.
I: $N$ is told the choice $\delta$ of $S$ before picking $\pi$ and $S$ knows the rules of the game. Then $N$ naturally picks $\pi_\delta$ such that

$$r(\pi_\delta, \delta) = \sup_\pi r(\pi, \delta), \tag{3.3.1}$$

that is, $\pi_\delta$ is least favorable against $\delta$. Knowing the rules of the game, $S$ naturally picks $\delta^*$ such that

$$r(\pi_{\delta^*}, \delta^*) = \sup_\pi r(\pi, \delta^*) = \inf_\delta \sup_\pi r(\pi, \delta). \tag{3.3.2}$$

We claim that $\delta^*$ is minimax. To see this we note first that

$$r(\pi, \delta) = \int R(\theta, \delta)\, d\pi(\theta) \le \sup_\theta R(\theta, \delta)$$

for all $\pi$, $\delta$. On the other hand, if $R(\theta_\delta, \delta) = \sup_\theta R(\theta, \delta)$, then if $\pi_\delta$ is point mass at $\theta_\delta$, $r(\pi_\delta, \delta) = R(\theta_\delta, \delta)$ and we conclude that

$$\sup_\pi r(\pi, \delta) = \sup_\theta R(\theta, \delta), \tag{3.3.3}$$

and our claim follows.

II: $S$ is told the choice $\pi$ of $N$ before picking $\delta$ and $N$ knows the rules of the game. Then $S$ naturally picks $\delta_\pi$ such that

$$r(\pi, \delta_\pi) = \inf_\delta r(\pi, \delta).$$

That is, $\delta_\pi$ is a Bayes procedure for $\pi$. Then $N$ should pick $\pi^*$ such that

$$r(\pi^*, \delta_{\pi^*}) = \sup_\pi r(\pi, \delta_\pi) = \sup_\pi \inf_\delta r(\pi, \delta). \tag{3.3.4}$$

For obvious reasons, $\pi^*$ is called a least favorable (to $S$) prior distribution. As we shall see by example, although the right-hand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be unique.

The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both $\Theta$ and $\mathcal{D}$ are finite, then:

(a) $\underline{v} = \sup_\pi \inf_\delta r(\pi, \delta)$ and $\bar{v} = \inf_\delta \sup_\pi r(\pi, \delta)$ are both assumed by (say) $\pi^*$ (least favorable) and $\delta^*$ (minimax), respectively. Further,

$$\underline{v} = r(\pi^*, \delta^*) = \bar{v}. \tag{3.3.5}$$

$\underline{v}$ and $\bar{v}$ are called the lower and upper values of the basic game. When $\underline{v} = \bar{v} = v$ (say), $v$ is called the value of the game.

Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification and testing when $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$ (Example 3.2.2) but is too restrictive in its assumption for the great majority of inference problems. A generalization due to Wald and Karlin (see Karlin, 1959) states that the conclusions of the theorem remain valid if $\Theta$ and $\mathcal{D}$ are compact subsets of Euclidean spaces. There are more far-reaching generalizations but, as we shall see later, without some form of compactness of $\Theta$ and/or $\mathcal{D}$, although equality of $\underline{v}$ and $\bar{v}$ holds quite generally, existence of least favorable priors and/or minimax procedures may fail.

The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on $\Theta$ and $\mathcal{D}$ and are easy to prove.
Proposition 3.3.1. Suppose $\delta^{**}$, $\pi^{**}$ can be found such that

$$\delta^{**} = \delta_{\pi^{**}}, \qquad \pi^{**} = \pi_{\delta^{**}}, \tag{3.3.6}$$

that is, $\delta^{**}$ is Bayes against $\pi^{**}$ and $\pi^{**}$ is least favorable against $\delta^{**}$. Then $\underline{v} = \bar{v} = r(\pi^{**}, \delta^{**})$. That is, $\pi^{**}$ is least favorable and $\delta^{**}$ is minimax.

To utilize this result we need a characterization of $\pi_\delta$. This is given by

Proposition 3.3.2. $\pi_\delta$ is least favorable against $\delta$ iff

$$\pi_\delta\{\theta : R(\theta, \delta) = \sup_{\theta'} R(\theta', \delta)\} = 1. \tag{3.3.7}$$

That is, $\pi_\delta$ assigns probability only to points $\theta$ at which the function $R(\cdot, \delta)$ is maximal.

Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that $\pi_\delta$ may not be unique. In particular, if $R(\theta, \delta) \equiv$ constant, that is, the rule has constant risk, then all $\pi$ are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
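Proof of Proposition 3.3.1. Note first that we always have

$$\underline{v} \le \bar{v} \tag{3.3.8}$$

because, trivially,

$$\inf_\delta r(\pi, \delta) \le r(\pi, \delta') \tag{3.3.9}$$

for all $\pi$, $\delta'$. Hence,

$$\underline{v} = \sup_\pi \inf_\delta r(\pi, \delta) \le \sup_\pi r(\pi, \delta') \tag{3.3.10}$$

for all $\delta'$, and $\underline{v} \le \inf_{\delta'} \sup_\pi r(\pi, \delta') = \bar{v}$. On the other hand, by hypothesis,

$$\underline{v} \ge \inf_\delta r(\pi^{**}, \delta) = r(\pi^{**}, \delta^{**}) = \sup_\pi r(\pi, \delta^{**}) \ge \bar{v}. \tag{3.3.11}$$

Combining (3.3.8) and (3.3.11) we conclude that

$$\underline{v} = \inf_\delta r(\pi^{**}, \delta) = r(\pi^{**}, \delta^{**}) = \sup_\pi r(\pi, \delta^{**}) = \bar{v} \tag{3.3.12}$$

as advertised. □

Proof of Proposition 3.3.2. $\pi$ is least favorable for $\delta$ iff

$$E_\pi R(\theta, \delta) = \int R(\theta, \delta)\, d\pi(\theta) = \sup_\pi r(\pi, \delta). \tag{3.3.13}$$

But by (3.3.3),

$$\sup_\pi r(\pi, \delta) = \sup_\theta R(\theta, \delta). \tag{3.3.14}$$

Because $E_\pi R(\theta, \delta) \le \sup_\theta R(\theta, \delta)$, (3.3.13) is possible iff (3.3.7) holds. □

Putting the two propositions together we have the following.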
Theorem 3.3.2. Suppose $\delta^*$ has $\sup_\theta R(\theta, \delta^*) = r < \infty$. If there exists a prior $\pi^*$ such that $\delta^*$ is Bayes for $\pi^*$ and $\pi^*\{\theta : R(\theta, \delta^*) = r\} = 1$, then $\delta^*$ is minimax.
Example 3.3.1. Minimax Estimation in the Binomial Case. Suppose $S$ has a $B(n, \theta)$ distribution and $\bar{X} = S/n$, as in Example 3.2.3. Let $l(\theta, a) = (\theta - a)^2/\theta(1 - \theta)$, $0 < \theta < 1$. For this loss function,

$$R(\theta, \bar{X}) = \frac{E(\bar{X} - \theta)^2}{\theta(1 - \theta)} = \frac{\theta(1 - \theta)}{n\theta(1 - \theta)} = \frac{1}{n},$$

and $\bar{X}$ does have constant risk. Moreover, we have seen in Example 3.2.3 that $\bar{X}$ is Bayes when $\theta$ is $U(0, 1)$. By Theorem 3.3.2 we conclude that $\bar{X}$ is minimax and, by Proposition 3.3.2, the uniform distribution is least favorable.

For the usual quadratic loss neither of these assertions holds. The minimax estimate is

$$\delta^*(S) = \frac{S + \frac{1}{2}\sqrt{n}}{n + \sqrt{n}} = \frac{\sqrt{n}}{\sqrt{n} + 1}\, \bar{X} + \frac{1}{\sqrt{n} + 1} \cdot \frac{1}{2}.$$

This estimate does have constant risk and is Bayes against a $\beta(\sqrt{n}/2, \sqrt{n}/2)$ prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as $n \to \infty$ of the ratio of the risks of $\delta^*$ and $\bar{X}$ is $> 1$ for every $\theta \ne \frac{1}{2}$. At $\theta = \frac{1}{2}$ the ratio tends to 1. Details are left to Problem 3.3.4. □
°
°°
L(x,O,v)
and decides p,
=
=
exp
°
" {
2"EXi   , a 2a
nv'} >
17l'
if L(x,O,v) <:17l' 7l'
174
Measures of Performance
Chapter 3
This test is equivalent to deciding f.t
= v (Problem 3.3.1) if. and only if,
"yn
1 ;;;:EXi>t,
T=
, , ,
where,
t
If we call this test d,r.
=
" [log " + 2 nv'] ;;;: vyn a
17r
2
'
R(O,J. )
1 ~ <I>(t)
<I>
= <1>( t)
R(v,o,)
(t  vf)
To get a minimax test we must have R(O, 61r ) = R( V, 6ft), which is equivalent to
v.,fii t = t  "'
h •
or
"
','
I ,~
~.
, ,
•
v.,fii t= . 20
•
Because this value of t corresponds to 7r = the intuitive test, which decides JJ only ifT > ~[Eo(T) + Ev(T)J, is indeed minimax.
!.
= v if and
0
'I
•
If is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem
e
3.3.2.
Theorem 3,3,3, Let 0' be a rule such that sup.R(O,o') = r < 00, let {"d denote a sequence of pn'or distributions such that 'lrk;{8 : R(B,o*) = r} = 1, and let Tk = inffJ r( ?rie, J), where r( 7fkl 0) denotes the Bayes risk wrt 'lrk;. If
Tk Task 00,
(3.3.i5)
then J* is minimax. Proof Because r( "k, 0') = r
supR(B, 0') = rk + 0(1)
•
where 0(1)
~
0 as k
~ 00.
But hy (3.3.13) for any competitor 0
supR(O,o) > E.,(R(B,o)) > rk ~supR(O,o') 0(1).
•
•
(3.3.16)
,
'.
If we let k _ suP. R(B, 0').
00
the lefthand side of (3.3.16) is unchanged, whereas the right tends to
0
j
1
•
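Example 3.3.3. Normal Mean. We now show that $\bar{X}$ is minimax in Example 3.2.1. Identify $\pi_k$ with the $N(\eta_0, \tau^2)$ prior where $k = \tau^2$. Then

$$R(\theta, \bar{X}) = \frac{\sigma^2}{n},$$

whereas the Bayes risk of the Bayes rule of Example 3.2.1 is

$$\inf_\delta r(\pi_k, \delta) = \frac{\tau^2}{(\sigma^2/n) + \tau^2} \cdot \frac{\sigma^2}{n} = \frac{\sigma^2}{n} - \frac{\sigma^2/n}{(\sigma^2/n) + \tau^2} \cdot \frac{\sigma^2}{n}.$$

Because $(\sigma^2/n)/((\sigma^2/n) + \tau^2) \to 0$ as $\tau^2 \to \infty$, we can conclude that $\bar{X}$ is minimax. □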
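The limiting argument of Example 3.3.3 is easy to tabulate. The Python sketch below is ours; it evaluates the weight $w$ of (3.2.7) and the Bayes risk $1/(n/\sigma^2 + 1/\tau^2)$ for increasing prior variance, with $n = 4$ and $\sigma^2 = 2$ chosen arbitrarily.

```python
def prior_weight(n, sigma2, tau2):
    """Weight w on the prior mean eta_0 in the Bayes estimate (3.2.7)."""
    return (1.0 / tau2) / (n / sigma2 + 1.0 / tau2)


def bayes_estimate(xbar, eta0, n, sigma2, tau2):
    """Posterior mean w * eta_0 + (1 - w) * xbar for the N(eta_0, tau2) prior."""
    w = prior_weight(n, sigma2, tau2)
    return w * eta0 + (1.0 - w) * xbar


def bayes_risk(n, sigma2, tau2):
    """Bayes risk of the Bayes rule: 1 / (n / sigma2 + 1 / tau2)."""
    return 1.0 / (n / sigma2 + 1.0 / tau2)


n, sigma2 = 4, 2.0                 # illustrative values; risk of the sample mean is 0.5
for tau2 in (0.1, 1.0, 10.0, 1e3, 1e6):
    print(tau2, prior_weight(n, sigma2, tau2), bayes_risk(n, sigma2, tau2))
```

As $\tau^2$ grows, the weight on $\eta_0$ tends to 0 and the Bayes risk increases to $\sigma^2/n$, the constant risk of $\bar{X}$, which is the condition (3.3.15) of Theorem 3.3.3.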
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose $X_1, \ldots, X_n$ are i.i.d. $F \in \mathcal{F}$, where

$$\mathcal{F} = \{F : \operatorname{Var}_F(X_1) \le M\}.$$

Then $\bar{X}$ is minimax for estimating $\theta(F) \equiv E_F(X_1)$ with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let $\pi_k$ be a prior distribution on $\mathcal{F}$ constructed as follows:(1)

(i) $\pi_k\{F : \operatorname{Var}_F(X_1) \ne M\} = 0$.

(ii) $\pi_k\{F : F \ne N(\mu, M) \text{ for some } \mu\} = 0$.

(iii) $F$ is chosen by first choosing $\mu = \theta(F)$ from a $N(0, k)$ distribution and then taking $F = N(\theta(F), M)$.

Evidently, the Bayes risk is now the same as in Example 3.3.3 with $\sigma^2 = M$. Because, evidently,

$$\max_{\mathcal{F}} R(F, \bar{X}) = \max_{\mathcal{F}} \frac{\operatorname{Var}_F(X_1)}{n} = \frac{M}{n},$$

Theorem 3.3.3 applies and the result follows. □

Minimax procedures and symmetry
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" $\theta$. There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and Casella (1998), for instance. We shall discuss this approach somewhat, by example, in Chapter 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading.

Summary. We introduce the minimax principle in the context of the theory of games. Using this framework we connect minimaxity and Bayes methods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
estimates that ignore the data. .1.. 176 Measures of Performance Chapter 3 . In the nonBayesian framework.. We show that Bayes rules with constant risk. the game is said to have a value v. in Section 1. When Do is the class of linear procedures and I is quadratic Joss. and then see if within the Do we can find 8* E Do that is best according to the "gold standard. in many cases.1 Unbiased Estimation. This result is extended to rules that are limits of Bayes rules with constant risk and we use it to show that x is a minimax rule for squared error loss in the N (0 . This approach has early on been applied to parametric families Va.L and (72 when XI. are minimax. 5') for aile. Von Neumann's Theorem states that if e and D are both finite. there is a least favorable prior 1[" and a minimax rule 8* such that J* is the Bayes rule for n* and 1r* maximizes the Bayes risk of J* over all priors. according to these criteria. I X n are d. then the game of S versus N has a value 11. for instance.4. Do C D. for example. Ii I More specifically. E Do that minimizes the Bayes risk with respect to a prior 1r among all J E Do.d. such as 5(X) = q(Oo). Obviously. which can't be beat for 8 = 80 but can obviously be arbitrarily terrible. A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called leas! favorable.. 3. . symmetry. When v = ii.3. on other grounds. then in estimating a linear function of the /3J.2. Moreover.2. D.4 UNBIASED ESTIMATION AND RISK INEQUALITIES 3. the notion of bias of an estimate O(X) of a parameter q(O) in a model P {Po: 0 E e} as = Biaso(5) = Eo5(X)  q(O). 0'5) model. it is natural to consider the computationally simple class of linear estimates. looking for the procedure 0. and so on. This approach coupled with the principle of unbiasedness we now introduce leads to the famous GaussMarkov theorem proved in Section 6.. v equals the Bayes risk of the Bayes rule 8* for the prior 1r*." R(0. 
An estimate such that Biase (8) 0 is called unbiased. The lower (upper) value v(v) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk. Bayes and minimaxity. This notion has intuitive appeal. we can also take this point of view with humbler aims. computational ease. . the solution is given in Section 3. (72) = . Survey Sampling In the previous two sections we have considered two decision theoretic optimality principles. or more generally with constant risk over the support of some prior. S(Y) = L:~ 1 diYi . 5) > R(0. We introduced. N (Il. An alternative approach is to specify a proper subclass of procedures. I I .I . II .• •••. for which it is possible to characterize and. compute procedures (in particular estimates) that are best in the class of all procedures. we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule 8 and N selects a prior 1['.1 . all 5 E Va. if Y is postulated as following a linear regression model with E(Y) = zT (3 as in Section 2. The most famous unbiased estimates are the familiar estimates of f. ruling out.6.
The most famous unbiased estimates are the familiar estimates of $\mu$ and $\sigma^2$ when $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ given by (see Example 1.3.3 and Problem 1.3.8)

$$\hat\mu = \bar X \tag{3.4.1}$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2. \tag{3.4.2}$$

Because for unbiased estimates mean square error and variance coincide we call an unbiased estimate $\delta^*(X)$ of $q(\theta)$ that has minimum MSE among all unbiased estimates for all $\theta$, UMVU (uniformly minimum variance unbiased). As we shall see shortly for $\bar X$ and in Volume 2 for $s^2$, these are both UMVU.

Unbiased estimates play a particularly important role in survey sampling.

Example 3.4.1. Unbiased Estimates in Survey Sampling. Suppose we wish to sample from a finite population, for instance, a census unit, to determine the average value of a variable (say) monthly family income during a time between two censuses and suppose that we have available a list of families in the unit with family incomes at the last census. Write $x_1, \dots, x_N$ for the unknown current family incomes and correspondingly $u_1, \dots, u_N$ for the known last census incomes. We ignore difficulties such as families moving. We let $X_1, \dots, X_n$ denote the incomes of a sample of $n$ families drawn at random without replacement. This leads to the model with $x = (x_1, \dots, x_N)$ as parameter

$$P_x[\{X_1, \dots, X_n\} = \{a_1, \dots, a_n\}] = \frac{1}{\binom{N}{n}} \text{ if } \{a_1, \dots, a_n\} \subset \{x_1, \dots, x_N\}, \quad = 0 \text{ otherwise.} \tag{3.4.3}$$

We want to estimate the parameter $\bar x = \frac{1}{N}\sum_{j=1}^N x_j$. It is easy to see that the natural estimate $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ is unbiased (Problem 3.4.14) and has

$$\mathrm{MSE}(\bar X) = \mathrm{Var}_x(\bar X) = \frac{\sigma_x^2}{n}\left(1 - \frac{n-1}{N-1}\right) \tag{3.4.4}$$

where

$$\sigma_x^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \bar x)^2. \tag{3.4.5}$$

This method of sampling does not use the information contained in $u_1, \dots, u_N$. One way to do this, reflecting the probable correlation between $(u_1, \dots, u_N)$ and $(x_1, \dots, x_N)$, is to estimate by a regression estimate

$$\hat{\bar X}_R = \bar X - b(\bar U - \bar u) \tag{3.4.6}$$

where $b$ is a prespecified positive constant, $U_i$ is the last census income corresponding to $X_i$, and $\bar u = \frac{1}{N}\sum_{i=1}^N u_i$, $\bar U = \frac{1}{n}\sum_{i=1}^n U_i$. Clearly for each $b$, $\hat{\bar X}_R$ is also unbiased.
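The unbiasedness of $\bar X$ and the variance formula (3.4.4) can be checked exactly on a small population by enumerating every sample of size $n$. The following is a minimal sketch, not part of the original text; the population `incomes` and the function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations


def srs_mean_moments(x, n):
    """Exact mean and variance of the sample mean over all size-n subsets of x."""
    means = [Fraction(sum(s), n) for s in combinations(x, n)]
    expectation = sum(means) / len(means)
    variance = sum((m - expectation) ** 2 for m in means) / len(means)
    return expectation, variance


def formula_variance(x, n):
    """Right-hand side of (3.4.4), with sigma_x^2 defined as in (3.4.5)."""
    N = len(x)
    xbar = Fraction(sum(x), N)
    sigma2 = sum((xi - xbar) ** 2 for xi in x) / N
    return sigma2 / n * (1 - Fraction(n - 1, N - 1))


# Hypothetical population of N = 6 current incomes (illustrative values only).
incomes = [12, 30, 7, 45, 21, 18]
expectation, variance = srs_mean_moments(incomes, 3)
population_mean = Fraction(sum(incomes), len(incomes))
```

Exact rational arithmetic is used so that the comparison with (3.4.4) is an identity rather than a floating point approximation.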
If the correlation of $U_i$ and $X_i$ is positive and $b < 2\,\mathrm{Cov}(U, X)/\mathrm{Var}(U)$, this will be a better estimate than $\bar X$ and the best choice of $b$ is $b_{opt} = \mathrm{Cov}(U, X)/\mathrm{Var}(U)$ (Problem 3.4.19). The value of $b_{opt}$ is unknown but can be estimated by

$$\hat b_{opt} = \frac{\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)(U_i - \bar U)}{\frac{1}{N}\sum_{j=1}^N (u_j - \bar u)^2}.$$

The resulting estimate is no longer unbiased but behaves well for large samples; see Problem 5.3.11.

An alternative approach to using the $u_j$ is to not sample all units with the same probability. Specifically let $0 < \pi_1, \dots, \pi_N < 1$ with $\sum_{j=1}^N \pi_j = n$. For each unit $1, \dots, N$ toss a coin with probability $\pi_j$ of landing heads and select $x_j$ if the coin lands heads. The result is a sample $S = \{X_1, \dots, X_M\}$ of random size $M$ such that $E(M) = n$ (Problem 3.4.15). If the $\pi_j$ are not all equal, $\bar X$ is not unbiased but the following estimate known as the Horvitz-Thompson estimate is:

$$\hat{\bar X}_{HT} = \frac{1}{N}\sum_{i=1}^M \frac{X_i}{\pi_{J_i}} \tag{3.4.7}$$

where $J_i$ is defined by $X_i = x_{J_i}$. To see this write

$$\hat{\bar X}_{HT} = \frac{1}{N}\sum_{j=1}^N \frac{x_j}{\pi_j}\, 1(x_j \in S).$$

Because $\pi_j = P[x_j \in S]$ by construction unbiasedness follows. A natural choice of $\pi_j$ is proportional to the last census income, $\pi_j = n u_j / \sum_{k=1}^N u_k$. This makes it more likely for big incomes to be included and is intuitively desirable. It is possible to avoid the undesirable random sample size of these schemes and yet have specified $\pi_j$. The Horvitz-Thompson estimate then stays unbiased. Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems. □

Discussion. Unbiasedness is also used in stratified sampling theory (see Problem 1.3.4). However, outside of sampling, the unbiasedness principle has largely fallen out of favor for a number of reasons.

(i) Typically unbiased estimates do not exist; see Bickel and Lehmann (1969) and Problem 3.4.18, for instance.

(ii) Bayes estimates are necessarily biased (see Problem 3.4.20) and minimax estimates often are.

(iii) Unbiased estimates do not obey the attractive equivariance property. If $\hat\theta$ is unbiased for $\theta$, $q(\hat\theta)$ is biased for $q(\theta)$ unless $q$ is linear. They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later.
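Returning to the Horvitz-Thompson estimate (3.4.7) of Example 3.4.1, its unbiasedness under independent coin tosses can be verified exactly for a small population by summing over all $2^N$ inclusion patterns. This is a sketch under stated assumptions: the values in `x` and `u` are hypothetical, and the inclusion probabilities are taken proportional to `u` with sum `n`, as one natural choice.

```python
from itertools import product


def ht_estimate(x, pi, selected):
    """Horvitz-Thompson estimate of the population mean for one inclusion pattern."""
    N = len(x)
    return sum(x[j] / pi[j] for j in range(N) if selected[j]) / N


def exact_expectation(x, pi, estimator):
    """Expectation of an estimator over all 2^N patterns of independent coin tosses."""
    total = 0.0
    for pattern in product([0, 1], repeat=len(x)):
        prob = 1.0
        for j, s in enumerate(pattern):
            prob *= pi[j] if s else 1.0 - pi[j]
        total += prob * estimator(x, pi, pattern)
    return total


# Hypothetical current incomes x_j and last-census incomes u_j.
x = [12, 30, 7, 45, 21, 18]
u = [10, 28, 9, 40, 20, 13]
n = 2
pi = [n * uj / sum(u) for uj in u]      # inclusion probabilities, sum equals n
expected_ht = exact_expectation(x, pi, ht_estimate)
expected_size = exact_expectation(x, pi, lambda x_, pi_, sel: float(sum(sel)))
```

The second expectation illustrates that the sample size $M$ is random with $E(M) = n$.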
Nevertheless, as we shall see in Chapters 5 and 6, good estimates in large samples are approximately unbiased. We expect that $|\mathrm{Bias}_\theta(\hat\theta_n)| / \mathrm{Var}_\theta^{1/2}(\hat\theta_n) \to 0$ or equivalently $\mathrm{Var}_\theta(\hat\theta_n)/\mathrm{MSE}_\theta(\hat\theta_n) \to 1$ as $n \to \infty$. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. The arguments will be based on asymptotic versions of the important inequalities in the next subsection.

Finally, unbiased estimates are still in favor when it comes to estimating residual variances. For instance, in the linear regression model $Y = Z_D\beta + \epsilon$ of Section 2.2, the variance $\sigma^2 = \mathrm{Var}(\epsilon_i)$ is estimated by the unbiased estimate $s^2 = \hat\epsilon^T\hat\epsilon/(n - p)$ where $\hat\epsilon = (Y - Z_D\hat\beta)$, $\hat\beta$ is the least squares estimate, and $p$ is the number of coefficients in $\beta$. This preference of $s^2$ over the MLE $\hat\sigma^2 = \hat\epsilon^T\hat\epsilon/n$ is in accord with optimal behavior when both the number of observations and number of parameters are large. See Problem 3.4.9.

3.4.2 The Information Inequality

The one-parameter case

We will develop a lower bound for the variance of a statistic, which can be used to show that an estimate is UMVU. The lower bound is interesting in its own right, has some decision theoretic applications, and appears in the asymptotic optimality theory of Section 5.4.

We suppose throughout that we have a regular parametric model and further that $\Theta$ is an open subset of the line. From this point on we will suppose $p(x, \theta)$ is a density. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous-case theorems given later. We make two regularity assumptions on the family $\{P_\theta : \theta \in \Theta\}$.

(I) The set $A = \{x : p(x, \theta) > 0\}$ does not depend on $\theta$. For all $x \in A$, $\theta \in \Theta$, $\frac{\partial}{\partial\theta}\log p(x, \theta)$ exists and is finite.

(II) If $T$ is any statistic such that $E_\theta(|T|) < \infty$ for all $\theta \in \Theta$, then the operations of integration and differentiation by $\theta$ can be interchanged in $\int T(x)p(x, \theta)\,dx$. That is, for integration over $R^q$,

$$\frac{\partial}{\partial\theta}\left[\int T(x)p(x, \theta)\,dx\right] = \int T(x)\frac{\partial}{\partial\theta}p(x, \theta)\,dx \tag{3.4.8}$$

whenever the right-hand side of (3.4.8) is finite.

Note that in particular (3.4.8) is assumed to hold if $T(x) = 1$ for all $x$, and we can interchange differentiation and integration in $\int p(x, \theta)\,dx$.

Assumption II is practically useless as written. What is needed are simple sufficient conditions on $p(x, \theta)$ for II to hold. Some classical conditions may be found in Apostol (1974), p. 167. Simpler assumptions can be formulated using Lebesgue integration theory. For instance, suppose I holds. Then II holds provided that for all $T$ such that $E_\theta(|T|) < \infty$ for all $\theta$, the integrals

$$\int T(x)\left[\frac{\partial}{\partial\theta}p(x, \theta)\right]dx \quad\text{and}\quad \int \left|T(x)\left[\frac{\partial}{\partial\theta}p(x, \theta)\right]\right|dx$$

are continuous functions(3) of $\theta$.
It is not hard to check (using Laplace transform theory) that a one-parameter exponential family quite generally satisfies Assumptions I and II.

Proposition 3.4.1. If $p(x, \theta) = h(x)\exp\{\eta(\theta)T(x) - B(\theta)\}$ is an exponential family and $\eta(\theta)$ has a nonvanishing continuous derivative on $\Theta$, then I and II hold.

For instance, suppose $X_1, \dots, X_n$ is a sample from a $N(\theta, \sigma^2)$ population, where $\sigma^2$ is known. Then (see Table 1.6.1) $\eta(\theta) = \theta/\sigma^2$ and I and II are satisfied. Similarly, I and II are satisfied for samples from gamma and beta distributions with one parameter fixed.

If I holds it is possible to define an important characteristic of the family $\{P_\theta\}$, the Fisher information number, which is denoted by $I(\theta)$ and given by

$$I(\theta) = E_\theta\left(\frac{\partial}{\partial\theta}\log p(X, \theta)\right)^2 = \int\left(\frac{\partial}{\partial\theta}\log p(x, \theta)\right)^2 p(x, \theta)\,dx. \tag{3.4.9}$$

Note that $0 \le I(\theta) \le \infty$.

Lemma 3.4.1. Suppose that I and II hold and that

$$E\left|\frac{\partial}{\partial\theta}\log p(X, \theta)\right| < \infty.$$

Then

$$E_\theta\left(\frac{\partial}{\partial\theta}\log p(X, \theta)\right) = 0 \tag{3.4.10}$$

and, thus,

$$I(\theta) = \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log p(X, \theta)\right). \tag{3.4.11}$$

Proof.

$$E_\theta\left(\frac{\partial}{\partial\theta}\log p(X, \theta)\right) = \int\left\{\left[\frac{\partial}{\partial\theta}p(x, \theta)\right]\Big/ p(x, \theta)\right\}p(x, \theta)\,dx = \int\frac{\partial}{\partial\theta}p(x, \theta)\,dx = \frac{\partial}{\partial\theta}\int p(x, \theta)\,dx = 0. \qquad\Box$$

Example 3.4.2. Suppose $X_1, \dots, X_n$ is a sample from a Poisson $\mathcal{P}(\theta)$ population. Then

$$\frac{\partial}{\partial\theta}\log p(x, \theta) = \frac{\sum_{i=1}^n x_i}{\theta} - n \quad\text{and}\quad I(\theta) = \mathrm{Var}\left(\frac{\sum_{i=1}^n X_i}{\theta}\right) = \frac{1}{\theta^2}\,n\theta = \frac{n}{\theta}. \qquad\Box$$
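The Poisson computation of Example 3.4.2 can be checked numerically for a single observation, where the score is $x/\theta - 1$ and the information should equal $1/\theta$. The sketch below is illustrative only; the truncation point `kmax` is an assumption that is harmless for moderate $\theta$, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, theta):
    """Poisson frequency function, computed on the log scale for stability."""
    return math.exp(k * math.log(theta) - theta - math.lgamma(k + 1))


def mean_score(theta, kmax=200):
    """E of the score k/theta - 1 for one observation (truncated sum); see (3.4.10)."""
    return sum((k / theta - 1.0) * poisson_pmf(k, theta) for k in range(kmax))


def fisher_info_one_obs(theta, kmax=200):
    """E of the squared score for one observation (truncated sum); see (3.4.9)."""
    return sum((k / theta - 1.0) ** 2 * poisson_pmf(k, theta) for k in range(kmax))


theta = 2.5
info_one = fisher_info_one_obs(theta)   # should be close to 1 / theta
info_sample = 10 * info_one             # information in a sample of size n = 10
```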
Here is the main result of this section.

Theorem 3.4.1 (Information Inequality). Let $T(X)$ be any statistic such that $\mathrm{Var}_\theta(T(X)) < \infty$ for all $\theta$. Denote $E_\theta(T(X))$ by $\psi(\theta)$. Suppose that I and II hold and $0 < I(\theta) < \infty$. Then for all $\theta$, $\psi(\theta)$ is differentiable and

$$\mathrm{Var}_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}. \tag{3.4.12}$$

Proof. Using I and II we obtain,

$$\psi'(\theta) = \int T(x)\frac{\partial}{\partial\theta}p(x, \theta)\,dx = \int T(x)\left(\frac{\partial}{\partial\theta}\log p(x, \theta)\right)p(x, \theta)\,dx. \tag{3.4.13}$$

By (A.11.14) and Lemma 3.4.1,

$$\psi'(\theta) = \mathrm{Cov}\left(\frac{\partial}{\partial\theta}\log p(X, \theta),\, T(X)\right). \tag{3.4.14}$$

Now let us apply the correlation (Cauchy-Schwarz) inequality (A.11.16) to the random variables $\frac{\partial}{\partial\theta}\log p(X, \theta)$ and $T(X)$. We get

$$|\psi'(\theta)| \le \sqrt{\mathrm{Var}(T(X))\,\mathrm{Var}\left(\frac{\partial}{\partial\theta}\log p(X, \theta)\right)}. \tag{3.4.15}$$

The theorem follows because, by Lemma 3.4.1, $\mathrm{Var}\left(\frac{\partial}{\partial\theta}\log p(X, \theta)\right) = I(\theta)$. □

The lower bound given in the information inequality depends on $T(X)$ through $\psi(\theta)$. If we consider the class of unbiased estimates of $q(\theta) = \theta$, we obtain a universal lower bound given by the following.

Corollary 3.4.1. Suppose the conditions of Theorem 3.4.1 hold and $T$ is an unbiased estimate of $\theta$. Then

$$\mathrm{Var}_\theta(T(X)) \ge \frac{1}{I(\theta)}. \tag{3.4.16}$$

The number $1/I(\theta)$ is often referred to as the information or Cramér-Rao lower bound for the variance of an unbiased estimate of $\psi(\theta)$.

Here's another important special case.

Proposition 3.4.2. Suppose that $X = (X_1, \dots, X_n)$ is a sample from a population with density $f(x, \theta)$, $\theta \in \Theta$, and that the conditions of Theorem 3.4.1 hold. Let $I_1(\theta) = E\left(\frac{\partial}{\partial\theta}\log f(X_1, \theta)\right)^2$, then

$$I(\theta) = nI_1(\theta) \quad\text{and}\quad \mathrm{Var}_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{nI_1(\theta)}. \tag{3.4.17}$$
Proof. This is a consequence of Lemma 3.4.1 and

$$I(\theta) = \mathrm{Var}\left[\frac{\partial}{\partial\theta}\log p(X, \theta)\right] = \mathrm{Var}\left[\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(X_i, \theta)\right] = \sum_{i=1}^n\mathrm{Var}\left[\frac{\partial}{\partial\theta}\log f(X_i, \theta)\right] = nI_1(\theta). \qquad\Box$$

$I_1(\theta)$ is often referred to as the information contained in one observation. We have just shown that the information $I(\theta)$ in a sample of size $n$ is $nI_1(\theta)$.

Next we note how we can apply the information inequality to the problem of unbiased estimation. If the family $\{P_\theta\}$ satisfies I and II and if there exists an unbiased estimate $T^*$ of $\psi(\theta)$ such that $\mathrm{Var}_\theta[T^*(X)] = [\psi'(\theta)]^2/I(\theta)$ for all $\theta \in \Theta$, then $T^*$ is UMVU as an estimate of $\psi$.

Example 3.4.2. (Continued). For a sample from a $\mathcal{P}(\theta)$ distribution, the MLE is $\hat\theta = \bar X$. Because $\bar X$ is unbiased and $\mathrm{Var}(\bar X) = \theta/n$, $\bar X$ is UMVU.

Example 3.4.3. Suppose $X_1, \dots, X_n$ is a sample from a normal distribution with unknown mean $\theta$ and known variance $\sigma^2$. As we previously remarked, the conditions of the information inequality are satisfied. By Corollary 3.4.1 we see that the conclusion that $\bar X$ is UMVU follows if

$$\mathrm{Var}(\bar X) = \frac{1}{nI_1(\theta)}. \tag{3.4.18}$$

Now $\mathrm{Var}(\bar X) = \sigma^2/n$, whereas if $\varphi$ denotes the $N(0, 1)$ density, then

$$I_1(\theta) = E\left[\frac{\partial}{\partial\theta}\log\frac{1}{\sigma}\varphi\left(\frac{X_1 - \theta}{\sigma}\right)\right]^2 = E\left[\frac{X_1 - \theta}{\sigma^2}\right]^2 = \frac{1}{\sigma^2},$$

and (3.4.18) follows. Note that because $\bar X$ is UMVU whatever may be $\sigma^2$, we have in fact proved that $\bar X$ is UMVU even if $\sigma^2$ is unknown. □

We can similarly show (Problem 3.4.1) that if $X_1, \dots, X_n$ are the indicators of $n$ Bernoulli trials with probability of success $\theta$, then $\bar X$ is a UMVU estimate of $\theta$. These are situations in which $X$ follows a one-parameter exponential family. This is no accident.

Theorem 3.4.2. Suppose that the family $\{P_\theta : \theta \in \Theta\}$ satisfies assumptions I and II and there exists an unbiased estimate $T^*$ of $\psi(\theta)$, which achieves the lower bound of Theorem 3.4.1 for every $\theta$. Then $\{P_\theta\}$ is a one-parameter exponential family with density or frequency function of the form

$$p(x, \theta) = h(x)\exp[\eta(\theta)T^*(x) - B(\theta)]. \tag{3.4.19}$$

Conversely, if $\{P_\theta\}$ is a one-parameter exponential family of the form (1.6.1) with natural sufficient statistic $T(X)$ and $\eta(\theta)$ has a continuous nonvanishing derivative on $\Theta$, then $T(X)$ achieves the information inequality bound and is a UMVU estimate of $E_\theta(T(X))$.
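The Bernoulli remark above can be illustrated numerically: the exact variance of $\bar X$, computed from the binomial distribution of $\sum X_i$, coincides with the bound $1/(nI_1(\theta))$ of (3.4.17). This is a small sketch with hypothetical function names, not a proof.

```python
import math


def var_of_mean_bernoulli(theta, n):
    """Exact variance of the sample mean of n Bernoulli(theta) trials."""
    pmf = [math.comb(n, s) * theta ** s * (1 - theta) ** (n - s) for s in range(n + 1)]
    mean = sum(p * s / n for s, p in enumerate(pmf))
    return sum(p * (s / n - mean) ** 2 for s, p in enumerate(pmf))


def info_one_bernoulli(theta):
    """I_1(theta): expected squared score, where the score is x/theta - (1-x)/(1-theta)."""
    score_at_1 = 1.0 / theta
    score_at_0 = -1.0 / (1.0 - theta)
    return theta * score_at_1 ** 2 + (1.0 - theta) * score_at_0 ** 2


theta, n = 0.3, 12
variance = var_of_mean_bernoulli(theta, n)
bound = 1.0 / (n * info_one_bernoulli(theta))
```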
Proof. We start with the first assertion. Our argument is essentially that of Wijsman (1973). By (3.4.14) and the conditions for equality in the correlation inequality (A.11.16) we know that $T^*$ achieves the lower bound for all $\theta$ if, and only if, there exist functions $a_1(\theta)$ and $a_2(\theta)$ such that

$$\frac{\partial}{\partial\theta}\log p(X, \theta) = a_1(\theta)T^*(X) + a_2(\theta) \tag{3.4.20}$$

with $P_\theta$ probability 1 for each $\theta$. From this equality of random variables we shall show that $P_\theta[X \in A^*] = 1$ for all $\theta$ where

$$A^* = \left\{x : \frac{\partial}{\partial\theta}\log p(x, \theta) = a_1(\theta)T^*(x) + a_2(\theta) \text{ for all } \theta \in \Theta\right\}. \tag{3.4.21}$$

Upon integrating both sides of (3.4.20) with respect to $\theta$ we get (3.4.19).

The passage from (3.4.20) to (3.4.19) is highly technical. However, it is necessary. Here is the argument. If $A_\theta$ denotes the set of $x$ for which (3.4.20) holds, then (3.4.20) guarantees $P_\theta(A_\theta) = 1$ and assumption I guarantees $P_{\theta'}(A_\theta) = 1$ for all $\theta'$ (Problem 3.4.6). Let $\theta_1, \theta_2, \dots$ be a denumerable dense subset of $\Theta$. Note that if $A^{**} = \cap_m A_{\theta_m}$, $P_{\theta'}(A^{**}) = 1$ for all $\theta'$. Suppose without loss of generality that $T^*(x_1) \ne T^*(x_2)$ for $x_1, x_2 \in A^{**}$. By solving for $a_1, a_2$ in

$$\frac{\partial}{\partial\theta}\log p(x_j, \theta) = a_1(\theta)T^*(x_j) + a_2(\theta) \tag{3.4.22}$$

for $j = 1, 2$, we see that $a_1, a_2$ are linear combinations of $\partial\log p(x_j, \theta)/\partial\theta$, $j = 1, 2$ and, hence, continuous in $\theta$. But now if $x$ is such that

$$\frac{\partial}{\partial\theta}\log p(x, \theta) = a_1(\theta)T^*(x) + a_2(\theta) \tag{3.4.23}$$

for all $\theta_1, \theta_2, \dots$ and both sides are continuous in $\theta$, then (3.4.23) must hold for all $\theta$. Thus, $A^{**} = A^*$ and the result follows.

Conversely in the exponential family case (1.6.1) we assume without loss of generality (Problem 3.4.3) that we have the canonical case with $\eta(\theta) = \theta$ and $B(\theta) = A(\theta) = \log\int h(x)\exp\{\theta T(x)\}\,dx$. Then

$$\frac{\partial}{\partial\theta}\log p(X, \theta) = T(X) - A'(\theta) \tag{3.4.24}$$

so that

$$I(\theta) = \mathrm{Var}_\theta(T(X) - A'(\theta)) = \mathrm{Var}_\theta T(X) = A''(\theta). \tag{3.4.25}$$

But $\psi(\theta) = A'(\theta)$ and, thus, the information bound is $[A''(\theta)]^2/A''(\theta) = A''(\theta) = \mathrm{Var}_\theta(T(X))$ so that $T(X)$ achieves the information bound as an estimate of $E_\theta T(X)$. □
Example 3.4.4. In the Hardy-Weinberg model of Examples 2.1.4 and 2.2.6,

$$p(x, \theta) = 2^{n_2}\exp\{(2n_1 + n_2)\log\theta + (2n_3 + n_2)\log(1 - \theta)\} = 2^{n_2}\exp\{(2n_1 + n_2)[\log\theta - \log(1 - \theta)] + 2n\log(1 - \theta)\}$$

where we have used the identity $(2n_1 + n_2) + (2n_3 + n_2) = 2n$. Because this is an exponential family, Theorem 3.4.2 implies that $T = (2N_1 + N_2)/2n$ is UMVU for estimating

$$E(T) = (2n)^{-1}[2n\theta^2 + 2n\theta(1 - \theta)] = \theta.$$

This $T$ coincides with the MLE $\hat\theta$ of Example 2.2.6. The variance of $\hat\theta$ can be computed directly using the moments of the multinomial distribution of $(N_1, N_2, N_3)$, or by transforming $p(x, \theta)$ to canonical form by setting $\eta = \log[\theta/(1 - \theta)]$ and then using Theorem 1.6.2. A third method would be to use $\mathrm{Var}(\hat\theta) = 1/I(\theta)$ and formula (3.4.25). We find (Problem 3.4.7) $\mathrm{Var}(\hat\theta) = \theta(1 - \theta)/2n$. □

Note that by differentiating (3.4.24), we have

$$\frac{\partial^2}{\partial\theta^2}\log p(X, \theta) = -A''(\theta).$$

By (3.4.25) we obtain

$$I(\theta) = -E_\theta\frac{\partial^2}{\partial\theta^2}\log p(X, \theta). \tag{3.4.26}$$

It turns out that this identity also holds outside exponential families.

Proposition 3.4.3. Suppose $p(\cdot, \theta)$ satisfies in addition to I and II: $p(\cdot, \theta)$ is twice differentiable and interchange between integration and differentiation is permitted. Then (3.4.26) holds.

Proof. We need only check that

$$\frac{\partial^2}{\partial\theta^2}\log p(x, \theta) = \frac{1}{p(x, \theta)}\frac{\partial^2}{\partial\theta^2}p(x, \theta) - \left(\frac{\partial}{\partial\theta}\log p(x, \theta)\right)^2 \tag{3.4.27}$$

and integrate both sides with respect to $p(x, \theta)$. □

Example 3.4.2. (Continued). For a sample from a $\mathcal{P}(\theta)$ distribution

$$E_\theta\left(-\frac{\partial^2}{\partial\theta^2}\log p(X, \theta)\right) = \theta^{-2}E\left(\sum_{i=1}^n X_i\right) = \frac{n}{\theta},$$

which equals $I(\theta)$. □

Discussion. It often happens, for instance, in the $U(0, \theta)$ example, that I and II fail to hold, although UMVU estimates exist. See Volume II. Even worse, as Theorem 3.4.2 suggests, in many situations, assumptions I and II are satisfied and UMVU estimates of $\psi(\theta)$ exist, but the variance of the best estimate is not equal to the bound $[\psi'(\theta)]^2/I(\theta)$. Sharpenings of the information inequality are available but don't help in general.
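For the Hardy-Weinberg model of Example 3.4.4, the three facts used above (unbiasedness of $T$, $\mathrm{Var}(T) = \theta(1-\theta)/2n$, and the identity (3.4.26)) can be confirmed by direct enumeration of the multinomial distribution of $(N_1, N_2, N_3)$. The sketch below is illustrative; the values of `n` and `theta` and the helper names are arbitrary choices.

```python
import math


def hw_cells(n, theta):
    """Yield (n1, n2, n3, probability) for the Hardy-Weinberg multinomial."""
    p1, p2, p3 = theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2
    for n1 in range(n + 1):
        for n2 in range(n + 1 - n1):
            n3 = n - n1 - n2
            coef = math.factorial(n) // (
                math.factorial(n1) * math.factorial(n2) * math.factorial(n3))
            yield n1, n2, n3, coef * p1 ** n1 * p2 ** n2 * p3 ** n3


def hw_summary(n, theta):
    """Return E(T), Var(T) and -E of the second derivative of log p."""
    mean = var = info = 0.0
    for n1, n2, n3, prob in hw_cells(n, theta):
        mean += prob * (2 * n1 + n2) / (2 * n)
    for n1, n2, n3, prob in hw_cells(n, theta):
        var += prob * ((2 * n1 + n2) / (2 * n) - mean) ** 2
        # second derivative of (2n1+n2) log(theta) + (2n3+n2) log(1-theta)
        second = -(2 * n1 + n2) / theta ** 2 - (2 * n3 + n2) / (1 - theta) ** 2
        info += prob * (-second)
    return mean, var, info


n, theta = 8, 0.3
mean_T, var_T, info = hw_summary(n, theta)
```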
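Extensions to models in which $\theta$ is multidimensional are considered next.

The multiparameter case

We will extend the information lower bound to the case of several parameters, $\theta = (\theta_1, \dots, \theta_d)$. In particular, we will find a lower bound on the variance of an estimator $\hat\theta_1 = T$ of $\theta_1$ when the parameters $\theta_2, \dots, \theta_d$ are unknown. We assume that $\Theta$ is an open subset of $R^d$ and that $\{p(x, \theta) : \theta \in \Theta\}$ is a regular parametric model with conditions I and II satisfied when differentiation is with respect to $\theta_j$, $j = 1, \dots, d$. Let $p(x, \theta)$ denote the density or frequency function of $X$ where $X \in \mathcal{X} \subset R^q$.

The (Fisher) information matrix is defined as

$$I_{d\times d}(\theta) = (I_{jk}(\theta))_{1\le j\le d,\ 1\le k\le d}, \tag{3.4.28}$$

where

$$I_{jk}(\theta) = E\left(\frac{\partial}{\partial\theta_j}\log p(X, \theta)\,\frac{\partial}{\partial\theta_k}\log p(X, \theta)\right). \tag{3.4.29}$$

Proposition 3.4.4. Under the conditions in the opening paragraph,

(a)

$$E_\theta\left(\frac{\partial}{\partial\theta_j}\log p(X, \theta)\right) = 0, \quad I_{jk}(\theta) = \mathrm{Cov}_\theta\left(\frac{\partial}{\partial\theta_j}\log p(X, \theta),\, \frac{\partial}{\partial\theta_k}\log p(X, \theta)\right), \quad 1 \le j, k \le d. \tag{3.4.30}$$

That is,

$$E_\theta(\nabla_\theta\log p(X, \theta)) = 0 \quad\text{and}\quad I(\theta) = \mathrm{Var}(\nabla_\theta\log p(X, \theta)). \tag{3.4.31}$$

(b) If $X_1, \dots, X_n$ are i.i.d. as $X$, then $\mathbf{X} = (X_1, \dots, X_n)^T$ has information matrix $nI_1(\theta)$ where $I_1$ is the information matrix of $X$.

(c) If, in addition, $p(\cdot, \theta)$ is twice differentiable and double integration and differentiation under the integral sign can be interchanged,

$$I(\theta) = -\left(E_\theta\left(\frac{\partial^2}{\partial\theta_j\partial\theta_k}\log p(X, \theta)\right)\right), \quad 1 \le j \le d,\ 1 \le k \le d. \tag{3.4.32}$$

Proof. The arguments follow the $d = 1$ case and are left to the problems.

Example 3.4.5. Suppose $X \sim N(\mu, \sigma^2)$, $\theta = (\mu, \sigma^2)$. Then

$$\log p(x, \theta) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(x - \mu)^2$$

$$I_{11}(\theta) = -E\left[\frac{\partial^2}{\partial\mu^2}\log p(x, \theta)\right] = E[\sigma^{-2}] = \sigma^{-2}$$

$$I_{12}(\theta) = -E\left[\frac{\partial}{\partial\sigma^2}\frac{\partial}{\partial\mu}\log p(x, \theta)\right] = \sigma^{-4}E(x - \mu) = 0 = I_{21}(\theta)$$

$$I_{22}(\theta) = -E\left[\frac{\partial^2}{(\partial\sigma^2)^2}\log p(x, \theta)\right] = -\sigma^{-4}/2 + \sigma^{-6}E(x - \mu)^2 = \sigma^{-4}/2.$$

Thus, in this case

$$I(\theta) = \begin{pmatrix}\sigma^{-2} & 0\\ 0 & \sigma^{-4}/2\end{pmatrix}. \tag{3.4.33}$$
□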
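The information matrix (3.4.33) can also be obtained from the definition (3.4.29) as the expected outer product of the score vector. The sketch below evaluates that expectation by Gauss-Hermite quadrature, which is exact here because the entries are low-degree polynomials in the standardized variable; the function name and the quadrature degree are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def normal_information(mu, sigma2, deg=20):
    """E[score score^T] for one N(mu, sigma2) observation, theta = (mu, sigma2)."""
    z, w = hermegauss(deg)          # nodes and weights for the weight exp(-z^2 / 2)
    w = w / w.sum()                 # normalize so the weights average over N(0, 1)
    x = mu + np.sqrt(sigma2) * z
    score = np.stack([
        (x - mu) / sigma2,                                     # d/d mu of log p
        -0.5 / sigma2 + (x - mu) ** 2 / (2.0 * sigma2 ** 2),   # d/d sigma^2 of log p
    ])
    return (score * w) @ score.T


info = normal_information(mu=1.0, sigma2=4.0)
expected = np.diag([1.0 / 4.0, 1.0 / (2.0 * 4.0 ** 2)])
```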
Example 3.4.6. Canonical $k$-Parameter Exponential Family. Suppose

$$p(x, \theta) = \exp\left\{\sum_{j=1}^k T_j(x)\theta_j - A(\theta)\right\}h(x) \tag{3.4.34}$$

$\theta \in \Theta$ open. The conditions I, II are easily checked and because

$$\nabla_\theta\log p(x, \theta) = T(X) - \dot A(\theta),$$

then $I(\theta) = \mathrm{Var}_\theta T(X)$. By (3.4.30) and Corollary 1.6.1,

$$I(\theta) = \mathrm{Var}_\theta T(X) = \ddot A(\theta). \tag{3.4.35}$$
□

Next suppose $\hat\theta_1 = T$ is an estimate of $\theta_1$ with $\theta_2, \dots, \theta_d$ assumed unknown. Let $\psi(\theta) = E_\theta T(X)$ and let $\dot\psi(\theta) = \nabla\psi(\theta)$ be the $d \times 1$ vector of partial derivatives. Then

Theorem 3.4.3. Assume the conditions of the opening paragraph hold and suppose that the matrix $I(\theta)$ is nonsingular. Then for all $\theta$, $\dot\psi(\theta)$ exists and

$$\mathrm{Var}_\theta(T(X)) \ge \dot\psi^T(\theta)I^{-1}(\theta)\dot\psi(\theta). \tag{3.4.36}$$

Proof. We will use the prediction inequality $\mathrm{Var}(Y) \ge \mathrm{Var}(\mu_L(Z))$, where $\mu_L(Z)$ denotes the optimal MSPE linear predictor of $Y$, that is,

$$\mu_L(Z) = \mu_Y + (Z - \mu_Z)^T\Sigma_{ZZ}^{-1}\Sigma_{ZY}. \tag{3.4.37}$$

Now set $Y = T(X)$, $Z = \nabla_\theta\log p(x, \theta)$. Then

$$\mathrm{Var}_\theta(T(X)) \ge \Sigma_{ZY}^T I^{-1}(\theta)\Sigma_{ZY} \tag{3.4.38}$$

where $\Sigma_{ZY} = E_\theta(T\nabla_\theta\log p(X, \theta)) = \nabla_\theta E_\theta(T(X))$ and the last equality follows from the argument in (3.4.13). □

Here are some consequences of this result.

Example 3.4.6. (continued). UMVU Estimates in Canonical Exponential Families. Suppose the conditions of Example 3.4.6 hold. We claim that each of $T_j(X)$ is a UMVU estimate of $E_\theta T_j(X)$.
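This is a different claim than $T_j(X)$ is UMVU for $E_\theta T_j(X)$ if $\theta_i$, $i \ne j$, are known. To see our claim note that in our case

$$\psi(\theta) = \frac{\partial A}{\partial\theta_1}, \quad \dot\psi^T(\theta) = \left(\frac{\partial^2 A}{\partial\theta_1^2}, \dots, \frac{\partial^2 A}{\partial\theta_1\partial\theta_k}\right) \tag{3.4.39}$$

where, without loss of generality, we let $j = 1$. We have already computed in Proposition 3.4.4

$$I^{-1}(\theta) = \left(\frac{\partial^2 A}{\partial\theta_i\partial\theta_j}\right)^{-1}_{k\times k}. \tag{3.4.40}$$

We claim that in this case

$$\dot\psi^T(\theta)I^{-1}(\theta)\dot\psi(\theta) = \frac{\partial^2 A}{\partial\theta_1^2} \tag{3.4.41}$$

because $\dot\psi^T(\theta)$ is the first row of $I(\theta)$ and, hence, $\dot\psi^T(\theta)I^{-1}(\theta) = (1, 0, \dots, 0)$. But $\frac{\partial^2}{\partial\theta_1^2}A(\theta)$ is just $\mathrm{Var}_\theta T_1(X)$. □

Example 3.4.7. Multinomial Trials. In the multinomial Example 1.6.6 with $X_1, \dots, X_n$ i.i.d. as $X$ and $\lambda_j = P(X = j)$, $j = 1, \dots, k$, we transformed the multinomial model $\mathcal{M}(n, \lambda_1, \dots, \lambda_k)$ to the canonical form

$$p(x, \theta) = \exp\{T^T(x)\theta - A(\theta)\}$$

where $T^T(x) = (T_1(x), \dots, T_{k-1}(x))$,

$$T_j(X) = \sum_{i=1}^n 1[X_i = j], \quad \mathbf{X} = (X_1, \dots, X_n)^T, \quad \theta = (\theta_1, \dots, \theta_{k-1})^T,$$

$$\theta_j = \log(\lambda_j/\lambda_k),\ j = 1, \dots, k - 1, \quad A(\theta) = n\log\left(1 + \sum_{j=1}^{k-1}e^{\theta_j}\right).$$

Note that

$$\frac{\partial}{\partial\theta_j}A(\theta) = \frac{ne^{\theta_j}}{1 + \sum_{l=1}^{k-1}e^{\theta_l}} = n\lambda_j = E(T_j(X))$$

$$\frac{\partial^2}{\partial\theta_j^2}A(\theta) = \frac{ne^{\theta_j}\left(1 + \sum_{l=1}^{k-1}e^{\theta_l} - e^{\theta_j}\right)}{\left(1 + \sum_{l=1}^{k-1}e^{\theta_l}\right)^2} = n\lambda_j(1 - \lambda_j) = \mathrm{Var}(T_j(X)).$$

Thus, by Theorem 3.4.3, the lower bound on the variance of an unbiased estimator of $\psi_j(\theta) = E(n^{-1}T_j(X)) = \lambda_j$ is $\lambda_j(1 - \lambda_j)/n$. But because $N_j/n$ is unbiased and has variance $\lambda_j(1 - \lambda_j)/n$, then $N_j/n$ is UMVU for $\lambda_j$. □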
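The derivative calculations in Example 3.4.7 and the claim (3.4.41) can be checked numerically. The sketch below forms the Hessian of $A(\theta)$ by central differences and compares it with the multinomial covariance matrix of $(T_1, \dots, T_{k-1})$; the finite-difference helper, the step size `h`, and the chosen $\lambda$ are assumptions made for illustration.

```python
import numpy as np


def A(theta, n):
    """Log-partition function n * log(1 + sum_j exp(theta_j)) of the multinomial."""
    return n * np.log1p(np.exp(theta).sum())


def hessian(f, theta, h=1e-4):
    """Central-difference approximation to the matrix of second partial derivatives."""
    d = len(theta)
    eye = np.eye(d) * h
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(theta + eye[i] + eye[j]) - f(theta + eye[i] - eye[j])
                       - f(theta - eye[i] + eye[j]) + f(theta - eye[i] - eye[j])) / (4 * h * h)
    return H


n = 10
lam = np.array([0.2, 0.3, 0.5])                  # hypothetical cell probabilities, k = 3
theta = np.log(lam[:2] / lam[2])                 # canonical parameters theta_j
info = hessian(lambda t: A(t, n), theta)         # I(theta), the Hessian of A
cov_T = n * (np.diag(lam[:2]) - np.outer(lam[:2], lam[:2]))

psi_dot = info[:, 0]                             # gradient of psi = dA/dtheta_1
bound = psi_dot @ np.linalg.solve(info, psi_dot) # left-hand side of (3.4.41)
```

Here `bound` should reproduce $n\lambda_1(1 - \lambda_1)$, the variance of $T_1$, up to finite-difference error.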
5. But it does not follow that n~1 L:(Xi . These and other examples and the implications of Theorem 3. and robustness to model departures.42) are d x d matrices. 3. Summary.3 whose proof is left to Problem 3.B)a > Ofor all .42) where A > B means aT(A . The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure.3 are 0 explored in the problems.4. We derive the information inequa!ity in oneparameter models and show how it can be used to establish that in a canonical exponential family. LX. 1 j Here is an important extension of Theorem 3.pd(OW = (~(O)) dxd ."xl' Note that both sides of (3.4.1 Computation Speed of computation and numerical stability issues have been discussed briefly in Section 2.i. The Normal Case.4. 3. . Using inequalities from prediction theory. T(X) is the UMVU estimate of its expectation. Also note that ~ o unbiased '* VarOO > r'(O). reasonable estimates are asymptotically unbiased. .4.5 NON DECISION THEORETIC CRITERIA In practice. We study the important application of the unbiasedness principle in survey sampling. N(IL. " il 'I. . ~ In Chapters 5 and 6 we show that in smoothly parametrized models.4.3 hold and • is a ddimensional statistic.. Suppose that the conditions afTheorem 3.8.. . If Xl.. • Theorem 3.d.. 188 MeasureS of Performance Chapter 3 j Example 3. interpretability of the procedure. features other than the risk function are also of importance in selection of a procedure.4.4. (72) then X is UMVU for J1 and ~ is UMVU for p. . X n are i.2 + (J2.4.X)2 is UMVU for 0 2 . Asymptotic analogues oftbese inequalities are sharp and lead to the notion and construction of efficient estimates. Let 1/J(O) ~EO(T(X))dXl and "'(0) = (1/Jl(O). They are dealt with extensively 'in books on numerical analysis such as Dahlquist. we show how the infomlation inequality can be extended to the multiparameter case.4. . Then J (3.21. . 
even if the loss function and model are well specified.4.. We establish analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal." .
Closed form versus iteratively computed estimates

At one level closed form is clearly preferable. For instance, a method of moments estimate of $(\lambda, p)$ in Example 2.3.2 is given by

$$\hat\lambda = \frac{\bar X}{\hat\sigma^2}, \quad \hat p = \frac{\bar X^2}{\hat\sigma^2}$$

where $\hat\sigma^2$ is the empirical variance (Problem 2.2.11). It is clearly easier to compute than the MLE. Of course, with ever faster computers a difference at this level is irrelevant. But it reappears when the data sets are big and the number of parameters large.

On the other hand, consider the Gaussian linear model of Example 2.1.1. Then least squares estimates are given in closed form by equation (2.2.10). The closed form here is deceptive because inversion of a $d \times d$ matrix takes on the order of $d^3$ operations when done in the usual way and can be numerically unstable. It is in fact faster and better to solve equation (2.1.9) by, say, Gaussian elimination for the particular $Z_D^T Y$.

Faster versus slower algorithms

Consider estimation of the MLE $\hat\theta$ in a general canonical exponential family as in Section 2.3. It may be shown that, in the algorithm we discuss in Section 2.4.2, if we seek to take enough steps $J$ so that $|\hat\theta^{(J)} - \hat\theta| \le \varepsilon < 1$ then $J$ is of the order of $\log\frac{1}{\varepsilon}$ (Problem 3.5.1). On the other hand, at least if started close enough to $\hat\theta$, the Newton-Raphson method in which the $j$th iterate, $\hat\theta^{(j)} = \hat\theta^{(j-1)} + \ddot A^{-1}(\hat\theta^{(j-1)})(T(X) - \dot A(\hat\theta^{(j-1)}))$, takes on the order of $\log\log\frac{1}{\varepsilon}$ steps (Problem 3.5.2). The improvement in speed may however be spurious since $\ddot A^{-1}$ is costly to compute if $d$ is large, though the same trick as in computing least squares estimates can be used.

The interplay between estimated variance and computation

As we have seen in special cases in Examples 3.4.3 and 3.4.4, estimates of parameters based on samples of size $n$ have standard deviations of order $n^{-1/2}$. It follows that striving for numerical accuracy of order smaller than $n^{-1/2}$ is wasteful. Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the constants involved.
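To make the Newton-Raphson discussion concrete, here is a minimal sketch for one canonical exponential family, the multinomial with $A(\theta) = n\log(1 + \sum_j e^{\theta_j})$. It is an illustration under stated assumptions rather than a prescribed algorithm: the starting point $\theta = 0$, the tolerance, and the counts are arbitrary, and the linear system is solved directly instead of forming $\ddot A^{-1}$, in the spirit of the least squares remark.

```python
import numpy as np


def newton_mle(T, n, tol=1e-12, max_steps=50):
    """Newton-Raphson for the canonical multinomial family.

    T holds the counts of the first k-1 categories; n is the number of trials.
    Each step solves A_ddot(theta) delta = T - A_dot(theta) and moves theta by delta.
    """
    theta = np.zeros(len(T))
    for step in range(1, max_steps + 1):
        e = np.exp(theta)
        lam = e / (1.0 + e.sum())
        grad = T - n * lam                                  # T(X) - A_dot(theta)
        hess = n * (np.diag(lam) - np.outer(lam, lam))      # A_ddot(theta)
        delta = np.linalg.solve(hess, grad)                 # avoids an explicit inverse
        theta = theta + delta
        if np.max(np.abs(delta)) < tol:
            return theta, step
    return theta, max_steps


counts = np.array([3.0, 5.0, 12.0])            # hypothetical category counts, n = 20
theta_hat, steps = newton_mle(counts[:2], counts.sum())
closed_form = np.log(counts[:2] / counts[2])   # log(T_j / T_k), the known MLE here
```

In this family the MLE is available in closed form, which makes it a convenient check on the iteration; the small number of steps reflects the fast local convergence described above.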
3.5.2 Interpretability

Suppose that in the normal $N(\mu, \sigma^2)$ Example 2.1.5 we are interested in the parameter $\mu/\sigma$. This parameter, the signal-to-noise ratio, for this population of measurements has a clear interpretation. Its maximum likelihood estimate $\bar X/\hat\sigma$ continues to have the same intuitive interpretation as an estimate of $\mu/\sigma$ even if the data are a sample from a distribution with mean $\mu$ and variance $\sigma^2$ other than the normal. On the other hand, suppose we initially postulate a model in which the data are a sample from a gamma, $\Gamma(p, \lambda)$, distribution. Then $E(X)/\sqrt{\mathrm{Var}(X)} = (p/\lambda)(p/\lambda^2)^{-1/2} = p^{1/2}$. We can now use the MLE $\hat p^{1/2}$, which as we shall see later (Section 5.4) is for $n$ large a more precise estimate than $\bar X/\hat\sigma$ if this model is correct. However, the form of this estimate is complex and if the model is incorrect it no longer is an appropriate estimate of $E(X)/[\mathrm{Var}(X)]^{1/2}$. We return to this in Section 5.5.

3.5.3 Robustness

Finally, we turn to robustness.

This is an issue easy to point to in practice but remarkably difficult to formalize appropriately. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied. However, what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). We consider three situations

(a) The problem dictates the parameter. For instance, the Hardy-Weinberg parameter $\theta$ has a clear biological interpretation and is the parameter for the experiment described in Example 2.1.4. Similarly, economists often work with median housing prices, that is, the parameter $\nu$ that has half of the population prices on either side (formally, $\nu$ is any value such that $P(X \le \nu) \ge \frac{1}{2}$, $P(X \ge \nu) \ge \frac{1}{2}$). Alternatively, they may be interested in total consumption of a commodity such as coffee, say $\theta = N\mu$, where $N$ is the population size and $\mu$ is the expected consumption of a randomly drawn individual.

(b) We imagine that the random variable $X^*$ produced by the random experiment we are interested in has a distribution that follows a "true" parametric model with an interpretable parameter $\theta$, but we do not necessarily observe $X^*$. The actual observation $X$ is $X^*$ contaminated with "gross errors"; see the following discussion. But $\theta$ is still the target in which we are interested.

(c) We have a qualitative idea of what the parameter is, but there are several parameters that satisfy this qualitative notion. This idea has been developed by Bickel and Lehmann (1975a, 1975b, 1976) and Doksum (1975), among others. For instance, we may be interested in the center of a population, and both the mean $\mu$ and median $\nu$ qualify. See Problem 3.5.13.

We will consider situations (b) and (c).

Gross error models

Most measurement and recording processes are subject to gross errors, anomalous values that arise because of human error (often in recording) or instrument malfunction. To be a bit formal, suppose that if $n$ measurements $X^* = (X_1^*, \dots, X_n^*)$ could be taken without gross errors then $P^* \in \mathcal{P}^*$ would be an adequate approximation to the distribution of $X^*$ (i.e., we could suppose $X^* \sim P^* \in \mathcal{P}^*$). However, if gross errors occur, we observe not $X^*$ but $X = (X_1, \dots, X_n)$ where most of the $X_i = X_i^*$, but there are a few wild values.
Now suppose we want to estimate $\theta(P^*)$ and use $\hat\theta(X_1, \dots, X_n)$ knowing that $\hat\theta(X_1^*, \dots, X_n^*)$ is a good estimate. Informally $\hat\theta(X_1, \dots, X_n)$ will continue to be a good or at least reasonable estimate if its value is not greatly affected by the $X_i \ne X_i^*$, the gross errors. Again informally we shall call such procedures robust. Formal definitions require model specification, specification of the gross error mechanism, and definitions of insensitivity to gross errors. Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6. However, two notions, the sensitivity curve and the breakdown point, make sense for fixed $n$. The breakdown point will be discussed in Volume II. We next define and examine the sensitivity curve in the context of the Gaussian location model, Example 1.1.2, and then more generally.

Consider the one-sample symmetric location model $\mathcal{P}$ defined by

$$X_i = \mu + \epsilon_i, \quad i = 1, \dots, n, \tag{3.5.1}$$

where the errors are independent, identically distributed, and symmetric about 0 with common density $f$ and d.f. $F$. If the error distribution is normal, $\bar X$ is the best estimate in a variety of senses.

In our new formulation it is the $X_i^*$ that obey (3.5.1). A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the $\epsilon_i$ still i.i.d. but with common distribution function $F$ and density $f$ of the form

$$f(x) = (1 - \lambda)\frac{1}{\sigma}\varphi\left(\frac{x}{\sigma}\right) + \lambda h(x). \tag{3.5.2}$$

Here $h$ is the density of the gross errors and $\lambda$ is the probability of making a gross error. This corresponds to,

$$X_i = X_i^* \text{ with probability } 1 - \lambda, \quad X_i = Y_i \text{ with probability } \lambda,$$

where $Y_i$ has density $h(y - \mu)$ and $(X_i^*, Y_i)$ are i.i.d. Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of $X^*$. Further assumptions that are commonly made are that $h$ has a particular form, for example, $h = \frac{1}{K\sigma}\varphi\left(\frac{\cdot}{K\sigma}\right)$ where $K \gg 1$ or more generally that $h$ is an unknown density symmetric about 0. Then the gross error model is semiparametric, $\mathcal{P} = \{f(\cdot - \mu) : f$ satisfies (3.5.2) for some $h$ such that $h(x) = h(-x)$ for all $x\}$. $P_{(\mu, f)} \in \mathcal{P}$ iff $X_1, \dots, X_n$ are i.i.d. with common density $f(x - \mu)$, where $f$ satisfies (3.5.2). The advantage of this formulation is that $\mu$ remains identifiable. That is, it is the center of symmetry of $P_{(\mu, f)}$ for all such $P$. Unfortunately, the assumption that $h$ is itself symmetric about 0 seems patently untenable for gross errors. However, if we drop the symmetry assumption, we encounter one of the basic difficulties in formulating robustness in situation (b). Without $h$ symmetric the quantity $\mu$ is not a parameter, so it is unclear what we are estimating. That is, it is possible to have $P_{(\mu_1, f_1)} = P_{(\mu_2, f_2)}$ for $\mu_1 \ne \mu_2$ (Problem 3.5.18). Is $\mu_1$ or $\mu_2$ our goal? On the other hand, in situation (c), we do not need the symmetry assumption. We return to these issues in Chapter 6.
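A small simulation can illustrate the gross error model (3.5.2) with $h$ taken to be a normal density with standard deviation $K\sigma$. The sketch compares the mean squared errors of the sample mean and the sample median as estimates of $\mu$; all parameter values (`lam`, `K`, `n`, `reps`, `seed`) are illustrative assumptions, and the outcome is a Monte Carlo estimate rather than an exact result.

```python
import numpy as np


def simulate_mse(mu=0.0, sigma=1.0, lam=0.1, K=10.0, n=50, reps=2000, seed=0):
    """Monte Carlo MSE of the sample mean and median under the gross error model.

    Each observation is N(mu, sigma^2) with probability 1 - lam and
    N(mu, (K * sigma)^2) with probability lam (a symmetric gross error density).
    """
    rng = np.random.default_rng(seed)
    gross = rng.random((reps, n)) < lam                   # which observations are gross errors
    scale = np.where(gross, K * sigma, sigma)
    x = mu + scale * rng.standard_normal((reps, n))
    mse_mean = np.mean((x.mean(axis=1) - mu) ** 2)
    mse_median = np.mean((np.median(x, axis=1) - mu) ** 2)
    return mse_mean, mse_median


contaminated = simulate_mse(lam=0.1)    # some gross errors
clean = simulate_mse(lam=0.0)           # pure normal errors
```

With contamination the median should show the smaller error, while in the uncontaminated normal case the mean should do better.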
The sensitivity curve

At this point we ask: Suppose that an estimate $T(X_1, \dots, X_n) = \theta(\hat F)$, where $\hat F$ is the empirical d.f., is appropriate for the symmetric location model, $\mathcal{P}$, in particular, has the plug-in property, $\theta(P_{(\mu, f)}) = \mu$ for all $P_{(\mu, f)} \in \mathcal{P}$. How sensitive is it to the presence of gross errors among $X_1, \dots, X_n$? An interesting way of studying this due to Tukey (1972) and Hampel (1974) is the sensitivity curve defined as follows for plug-in estimates (which are well defined for all sample sizes $n$).

We start by defining the sensitivity curve for general plug-in estimates. Suppose that $X \sim P$ and that $\theta = \theta(P)$ is a parameter. The empirical plug-in estimate of $\theta$ is $\hat\theta = \theta(\hat P)$ where $\hat P$ is the empirical probability distribution. See Section 2.1.2, (2.1.16), and (2.1.17). The sensitivity curve of $\hat\theta$ is defined as

$$SC(x; \hat\theta) = n[\hat\theta(x_1, \dots, x_{n-1}, x) - \hat\theta(x_1, \dots, x_{n-1})],$$

where $x_1, \dots, x_{n-1}$ represents an observed sample of size $n - 1$ from $P$ and $x$ represents an observation that (potentially) comes from a distribution different from $P$. We are interested in the shape of the sensitivity curve, not its location. In our examples we shall, therefore, shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas. Often this is done by fixing $x_1, \dots, x_{n-1}$ as an "ideal" sample of size $n - 1$ for which the estimator $\hat\theta$ gives us the right value of the parameter and then we see what the introduction of a potentially deviant $n$th observation $x$ does to the value of $\hat\theta$.

We return to the location problem with $\theta$ equal to the mean $\mu = E(X)$. Because the estimators we consider are location invariant, that is, $\hat\theta(X_1, \dots, X_n) - \mu = \hat\theta(X_1 - \mu, \dots, X_n - \mu)$, and because $E(X_j - \mu) = 0$, we take $\mu = 0$ without loss of generality. Now fix $x_1, \dots, x_{n-1}$ so that their mean has the ideal value zero. This is equivalent to shifting the SC vertically to make its value at $x = 0$ equal to zero. See Problem 3.5.14. Then

$$SC(x; \bar x) = n\left(\frac{x_1 + \cdots + x_{n-1} + x}{n}\right) = x.$$

Thus, the sample mean is arbitrarily sensitive to gross error: a large gross error can throw the mean off entirely. Are there estimates that are less sensitive?
(ii) In the symmetric location model (3.5.1), $\nu$ coincides with $\mu$ and $\tilde x$ is an empirical plug-in estimate of $\mu$.

(iii) The sample median is the MLE when we assume the common density $f(x)$ of the errors $\{\varepsilon_i\}$ in (3.5.1) is the Laplace (double exponential) density

$$f(x) = \frac{1}{2\tau}\exp\{-|x|/\tau\},$$

a density having substantially heavier tails than the normal. See Problems 2.2.32 and 3.5.9.

The sensitivity curve of the median is as follows: If, say, $n = 2k + 1$ is odd and the median of $x_1, \ldots, x_{n-1}$ is $(x^{(k)} + x^{(k+1)})/2 = 0$, we obtain

$$SC(x; \tilde x) = \begin{cases} n x^{(k)} = -n x^{(k+1)} & \text{for } x < x^{(k)} \\ n x & \text{for } x^{(k)} \le x \le x^{(k+1)} \\ n x^{(k+1)} & \text{for } x > x^{(k+1)} \end{cases}$$

where $x^{(1)} \le \cdots \le x^{(n-1)}$ are the ordered $x_1, \ldots, x_{n-1}$.

[Figure 3.5.1. The sensitivity curves of the mean and median.]

Although the median behaves well when gross errors are expected, its performance at the normal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of $\bar X$. The sensitivity curve in Figure 3.5.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when $x$ is near $\mu$. A class of estimates providing such intermediate behavior and including both the mean and the median has been known since the eighteenth century. Let $0 \le \alpha < \frac{1}{2}$. We define the $\alpha$ trimmed mean, $\bar X_\alpha$, by

$$\bar X_\alpha = \frac{X_{([n\alpha]+1)} + \cdots + X_{(n - [n\alpha])}}{n - 2[n\alpha]} \tag{3.5.3}$$

where $[n\alpha]$ is the largest integer $\le n\alpha$ and $X_{(1)} < \cdots < X_{(n)}$ are the ordered observations. That is, we throw out the "outer" $[n\alpha]$ observations on either side and take the average of the rest. The estimates can be justified on plug-in grounds (see Problem 3.5.5). For more sophisticated arguments see Huber (1981). Note that if $\alpha = 0$, $\bar X_\alpha = \bar X$, whereas as $\alpha \uparrow \frac{1}{2}$, $\bar X_\alpha \to \tilde X$. For instance, suppose we take as our data the differences in Table 3.5.1. If $[n\alpha] = [(n - 1)\alpha]$ and the trimmed mean of $x_1, \ldots, x_{n-1}$ is zero, the sensitivity curve of an $\alpha$ trimmed mean is sketched in Figure 3.5.2. (The middle portion is the line $y = x(1 - 2[n\alpha]/n)^{-1}$.)

[Figure 3.5.2. The sensitivity curve of the trimmed mean.]

Intuitively we expect that if there are no gross errors, that is, $f = \varphi$, the mean is better than any trimmed mean with $\alpha > 0$ including the median, which corresponds approximately to $\alpha = \frac{1}{2}$. This can be verified in terms of asymptotic variances (MSEs); see Problem 5.4.1. However, the sensitivity curve calculation points to an equally intuitive conclusion. If $f$ is symmetric about 0 but has "heavier tails" (see Problem 3.5.8) than the Gaussian density, for example, the Laplace density, $f(x) = \frac{1}{2}e^{-|x|}$, or even more strikingly the Cauchy, $f(x) = 1/\pi(1 + x^2)$, then the trimmed means for $\alpha > 0$ and even the median can be much better than the mean, infinitely better in the case of the Cauchy; see Problem 5.4.1 again.

Which $\alpha$ should we choose in the trimmed mean? There seems to be no simple answer. The range $0.10 \le \alpha \le 0.20$ seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the normal distribution. See Andrews, Bickel, Hampel, Huber, Rogers, and Tukey (1972). There has also been some research into procedures for which $\alpha$ is chosen using the observations. For a discussion of these and other forms of "adaptation," see Jaeckel (1971), Huber (1972), and Hogg (1974).
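The sensitivity curves of the mean, the median, and the trimmed mean discussed above can be evaluated numerically. The following is a minimal sketch, not part of the original text: the function names `sensitivity_curve` and `trimmed_mean` are illustrative, and the "ideal" sample of size $n - 1 = 4$ is the one used in Problem 3.5.7(a) (expected values of $N(0, 1)$ order statistics), which has mean, median, and trimmed mean equal to zero.

```python
from statistics import mean, median


def trimmed_mean(xs, alpha):
    """Alpha-trimmed mean as in (3.5.3): drop the [n*alpha] smallest and
    [n*alpha] largest observations and average the rest (0 <= alpha < 1/2)."""
    n = len(xs)
    g = int(n * alpha)  # [n*alpha]; beware floating-point rounding of n*alpha
    kept = sorted(xs)[g:n - g]
    return sum(kept) / len(kept)


def sensitivity_curve(estimator, ideal, x):
    """SC(x) = n * [estimator(ideal + [x]) - estimator(ideal)], n = len(ideal) + 1."""
    n = len(ideal) + 1
    return n * (estimator(list(ideal) + [x]) - estimator(ideal))


# "Ideal" sample of size n - 1 = 4 (from Problem 3.5.7(a)).
ideal = [-1.03, -0.30, 0.30, 1.03]

for x in (-100.0, -0.2, 0.0, 0.2, 100.0):
    sc_mean = sensitivity_curve(mean, ideal, x)
    sc_median = sensitivity_curve(median, ideal, x)
    sc_trim = sensitivity_curve(lambda s: trimmed_mean(s, 0.25), ideal, x)
    print(f"x={x:8.1f}  mean={sc_mean:9.3f}  median={sc_median:7.3f}  trimmed={sc_trim:7.3f}")
```

With this sample the curve for the mean grows without bound in $x$, whereas the curves for the median and the trimmed mean level off once $x$ passes the central order statistics, in line with the piecewise formulas given above.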
Gross errors or outlying data points affect estimates in a variety of situations. We next consider two estimates of the spread in the population as well as estimates of quantiles; other examples will be given in the problems. If we are interested in the spread of the values in a population, then the variance $\sigma^2$ or standard deviation $\sigma$ is typically used. A fairly common quick and simple alternative is the IQR (interquartile range) defined as $\tau = x_{.75} - x_{.25}$, where $x_\alpha$ has $100\alpha$ percent of the values in the population on its left (formally, $x_\alpha$ is any value such that $P(X \le x_\alpha) \ge \alpha$, $P(X \ge x_\alpha) \ge 1 - \alpha$). $x_\alpha$ is called an $\alpha$th quantile, and $x_{.75}$ and $x_{.25}$ are called the upper and lower quartiles. The IQR is often calibrated so that it equals $\sigma$ in the $N(\mu, \sigma^2)$ model. Because $\tau = 2 \times (.674)\sigma$, the scale measure used is $0.742(x_{.75} - x_{.25})$.

Example 3.5.1. Spread. Let $\theta(P) = \mathrm{Var}(X) = \sigma^2$ denote the variance in a population and let $X_1, \ldots, X_n$ denote a sample from that population. Then $\hat\sigma_n^2 = n^{-1}\sum_{i=1}^{n}(X_i - \bar X)^2$ is the empirical plug-in estimate of $\sigma^2$. To simplify our expression we shift the horizontal axis so that $\sum_{i=1}^{n-1} x_i = 0$. Write $\bar x_n = n^{-1}\sum_{i=1}^{n} x_i = n^{-1}x$; then

$$SC(x; \hat\sigma^2) = n\left(\hat\sigma_n^2 - \hat\sigma_{n-1}^2\right) = \frac{n-1}{n}x^2 - \hat\sigma_{n-1}^2 \cong x^2 - \hat\sigma_{n-1}^2 \tag{3.5.4}$$

where the approximation is valid for $x$ fixed, $n \to \infty$ (Problem 3.5.10). It is clear that $\hat\sigma_n^2$ is very sensitive to large outlying $|x|$ values. $\Box$

Example 3.5.2. Quantiles and the IQR. Let $\theta(P) = x_\alpha$ denote an $\alpha$th quantile of the distribution of $X$, $0 < \alpha < 1$, and let $\hat x_\alpha$ denote the $\alpha$th sample quantile (see (2.1.16)). If $n\alpha$ is an integer, say $k$, the $\alpha$th sample quantile is $\hat x_\alpha = \frac{1}{2}[x_{(k)} + x_{(k+1)}]$, and at sample size $n - 1$, $\hat x_\alpha = x^{(k)}$, where $x^{(1)} \le \cdots \le x^{(n-1)}$ are the ordered $x_1, \ldots, x_{n-1}$; thus, for $2 \le k \le n - 2$,

$$SC(x; \hat x_\alpha) = \begin{cases} \frac{1}{2}\left[x^{(k-1)} - x^{(k)}\right], & x \le x^{(k-1)} \\ \frac{1}{2}\left[x - x^{(k)}\right], & x^{(k-1)} \le x \le x^{(k+1)} \\ \frac{1}{2}\left[x^{(k+1)} - x^{(k)}\right], & x \ge x^{(k+1)}. \end{cases} \tag{3.5.5}$$

Clearly, $\hat x_\alpha$ is not sensitive to outlying $x$'s.

Next consider the sample IQR

$$\hat\tau = \hat x_{.75} - \hat x_{.25}.$$

Then we can write

$$SC(x; \hat\tau) = SC(x; \hat x_{.75}) - SC(x; \hat x_{.25})$$

and the sample IQR is robust with respect to outlying gross errors $x$. $\Box$

Remark 3.5.1. The sensitivity of the parameter $\theta(F)$ to $x$ can be measured by the influence function, which is defined by

$$IF(x; \theta, F) = \lim_{\epsilon \downarrow 0} I_\epsilon(x; \theta, F)$$

where

$$I_\epsilon(x; \theta, F) = \epsilon^{-1}\left[\theta\left((1 - \epsilon)F + \epsilon\Delta_x\right) - \theta(F)\right]$$

and $\Delta_x$ is the distribution function of point mass at $x$ ($\Delta_x(t) = 1[t \ge x]$). It is easy to see that (Problem 3.5.15)

$$SC(x; \hat\theta) = I_{1/n}(x; \theta, \hat F_{n-1})$$

where $\hat F_{n-1}$ denotes the empirical distribution based on $x_1, \ldots, x_{n-1}$. We will return to the influence function in Volume II. It plays an important role in functional expansions of estimates.

Discussion. Other aspects of robustness, in particular the breakdown point, have been studied extensively and a number of procedures proposed and implemented. Unfortunately these procedures tend to be extremely demanding computationally, although this difficulty appears to be being overcome lately. An exposition of this point of view and some of the earlier procedures proposed is in Hampel, Ronchetti, Rousseeuw, and Stahel (1983).

Summary. We discuss briefly nondecision theoretic considerations for selecting procedures, including interpretability and computability. Most of the section focuses on robustness, discussing the difficult issues of identifiability. The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean, trimmed mean, median, and other procedures.
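The contrast between the plug-in variance and the IQR in Examples 3.5.1 and 3.5.2 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the sample `ideal` is made up (chosen so that it sums to zero, as in Example 3.5.1), and `sample_quantile` uses an assumed convention that agrees with the two cases stated in Example 3.5.2 (the midpoint of $x_{(k)}$ and $x_{(k+1)}$ when $n\alpha = k$ is an integer, and the order statistic of rank $\lceil n\alpha \rceil$ otherwise).

```python
import math
from statistics import pvariance  # pvariance divides by n: the plug-in variance


def sensitivity_curve(estimator, ideal, x):
    """SC(x) = n * [estimator(ideal + [x]) - estimator(ideal)], n = len(ideal) + 1."""
    n = len(ideal) + 1
    return n * (estimator(list(ideal) + [x]) - estimator(ideal))


def sample_quantile(xs, alpha):
    """Assumed convention for the alpha-th sample quantile, 0 < alpha < 1."""
    s = sorted(xs)
    na = len(s) * alpha
    k = round(na)
    if abs(na - k) < 1e-9:            # n*alpha is an integer k
        return 0.5 * (s[k - 1] + s[k])
    return s[math.ceil(na) - 1]       # otherwise take rank ceil(n*alpha)


def iqr(xs):
    """Sample interquartile range: upper sample quartile minus lower."""
    return sample_quantile(xs, 0.75) - sample_quantile(xs, 0.25)


# Hypothetical "ideal" sample of size n - 1 = 7 whose values sum to zero.
ideal = [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]
n = len(ideal) + 1

for x in (5.0, 50.0, 500.0):
    sc_var = sensitivity_curve(pvariance, ideal, x)
    closed_form = (n - 1) / n * x ** 2 - pvariance(ideal)  # exact form behind (3.5.4)
    sc_iqr = sensitivity_curve(iqr, ideal, x)
    print(f"x={x:6.1f}  SC(var)={sc_var:12.3f}  closed form={closed_form:12.3f}  SC(IQR)={sc_iqr:6.3f}")
```

The variance curve grows like $x^2$, while the IQR curve stops changing once $x$ lies beyond the upper quartile region, which is the sense in which the sample IQR is robust to outlying gross errors.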
0)/0(0). ~ ~ 4.O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite. Suppose 1(0.0) and give the Bayes estimate of q(O). 0) = (0 . .3. In the Bernoulli Problem 3. Check whether q(OB) = E(q(IJ) I x).0 E e. Compute the limit of e( J. 1 X n is a N(B. where R( 71') is the Bayes risk.24. 71') = R(7f) /r( 71'. In Problem 3.6 Problems and Complements 197 3. Show that (h) Let 1(0. J). under what condition on S does the Bayes rule exist and what is the Bayes rule? 5. give the MLE of the Bernoulli variance q(0) = 0(1 . J) ofJ(x) = X in Example 3. we found that (S + 1)/(n + 2) is the Bayes rule. Hint: See Problem 1. 3.O) = p(x I 0)71'(0) = c(x)7f(0 .0). x) where c(x) =..O)P. density. which is called the odds ratio (for success). E = R.2 o e 1. (a) Show that the joint density of X and 0 is f(x.4.2 with unifonn prior on the probabilility of success O.1.r 7f(0)p(x I O)dO. (3(r.. = (0 o)'lw(O) for some weight function w(O) > 0. where OB is the Bayes estimate of O. (72) sample and 71" is the improper prior 71"(0) = 1. Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin. Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior. 2. Find the Bayes risk r(7f. respectively..2 preceeding. .2.2. a(O) > 0.a)' jBQ(1..2. Show that if Xl. (X I 0 = 0) ~ p(x I 0). s). That is. Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule.4. (c) In Example 3. is preferred to 0. Consider the relative risk e( J. lf we put a (improper) uniform prior on A. Let X I. the parameter A = 01(1 .3).0)' and that the prinf7r(O) is the beta.Section 3.2. 71') as . if 71' and 1 are changed to 0(0)71'(0) and 1(0. In some studies (see Section 6. then the improper Bayes rule for squared error loss is 6"* (x) = X. 6. the Bayes rule does not change. Suppose IJ ~ 71'(0). 0) is the quadratic loss (0 . . 
0) the Bayes rule is where fo(x.Xn be the indicators of n Bernoulli trials with success probability O. change the loss function to 1(0.6 PROBLEMS AND COMPLEMENTS Problems for Section 3.
) with loss function I( B. and that 9 is random with aN(rlO. B = (B" . . and that 0 has the Dirichlet distribution D( a)... (. Let 0 = ftc . . B. 76) distribution. For the following problems.2(d)(i) and (ii). (E. compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O.\(B) = r . 7. .3.. (b) Problem 1..198 (a) T + 00.aj)/a5(ao + 1). 8. a = (a" . i: agency specifies a number f > a such that if f) E (E. . to a close approximation.3. (a) Problem 1. find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x)...O)  I(B.=l CjUj .3. where E(X) = O. Suppose tbat N lo .B. N{O. where is known. where CI. A regulatory " ! I i· .. . Hint: If 0 ~ D(a). then the generic and brandname drugs are. Suppose we have a sample Xl. where ao = L:J=l Qj.E).)T. Measures of Performance Chapter 3 (b) n . 00. given 0 = B are multinomial M(n. • 9. .Xn ) we want to decide whether or not 0 E (e.I(d). (Use these results. c2 > 0 I.\(B) = I(B. Note that ).• N. Var(Oj) = aj(ao . I) = difference in loss of acceptance and rejection of bioequivalence. 1). One such function (Lindley. find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule. . (Bj aj)2. Bioequivalence trials are used to test whether a generic drug is. .. Assume that given f).. (c) Problem 1. .15. Set a{::} Bioequivalent 1 {::} Not Bioequivalent . a) = (q(B) .9j ) = aiO:'j/a5(aa + 1). I j l .(0) should be negative when 0 E (E. a) = Lj~. (a) If I(B. equivalent to a namebrand drug. E).d. On the basis of X = (X I.19(c). Let q( 0) = L:. . Xl.ft B be the difference in mean effect of the generic and namebrand drugs. E) and positive when f) 1998) is 1. Find the Bayes decision rule... a) = [q(B)a]'. B . Cr are given constants. do not derive them. . and Cov{8j . There are two possible actions: 0'5 a a with losses I( B. B).. E).exp {  2~' B' } . bioequivalent. by definition.2..a)' / nj~. I ~' . 
defined in Problem 1.Xu are Li. then E(O)) = aj/ao. a r )T.O'5). .  I . " X n of differences in the effect of generic and namebrand effects fora certain drug. (c) We want to estimate the vector (B" . 0) and I( B. (e) a 2 + 00.) (b) When the loss function is I(B..
Yo) = inf g(xo. yo) is a saddle point of 9 if g(xo.3. 0.\(±€. RP. Con 10.. 0) and l(().2.y = (YI.. (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3. . Yo) is in the interior of S x T.. + = 0 and it 00"). S x T ~ R. For the model defined by (3.17).) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l((). representing x = (Xl. A point (xo. find (a) the linear Bayes estimate of ~l.Yo) = {} (Xo.16) and (3. and 9 is twice differentiable. (b) the linear Bayes estimate of fl.Section 3.Yp).) + ~}" . . Discuss the preceding decision rule for this "prior.1." (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b). . {} (Xo. (a) Show that a necessary condition for (xo.1) < (T6(n) + c'){log(rg(~)+.Yo) Xi {}g {}g Yj = 0. (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00"). . ..Xm). Suppose 9 . (Xv.6. (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3.1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3...3 1. y). S T Suppose S and T are subsets of Rm. Yo) = sup g(x. 1) are not constant.2. Yo) to be a saddle point is that. v) > ?r/(l ?r) is equivalent to T > t.6.2. Any two functions with difference "\(8) are possible loss functions at a = 0 and I. Hint: See Example 3. respectively.2 show that L(x. 2.6 Problems and Complements 199 where 0 < r < 1. Note that ... In Example 3.
(b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is.. 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = .Xn ) . I • > 1 for 0 i ~.PIB = Bo]. 1 <j. iij.2. Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo. . Bd.a.d< p. ._ I)'. A = {O. Yc Yd foraHI < i. I(Bi.thesimplex. f. n: I>l is minimax..i. Let S ~ 8(n.:. and o'(S) = (S + 1 2 ..o•• ) =R(B1 . 0») equals 1 when B = ~.1(0. Hint: See Problem 3.c. B1 ) Ip(X.0•• ). Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g. B ) ~ p(X.i) =0. • . Suppose i I(Bi.j)=Wij>O. the test rule 1511" given by o. L:. &* is minimax.n12). and show that this limit ( . f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta.= 1 .<7') and 1(<7'. B ) and suppose that Lx (B o. (b) Show that limn_ooIR(O.2. o')IR(O..n).n12.n)/(n + .. Suppose e = {Bo. Let X" .. 4. BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1. prior. Thus.=1 CijXiYj with XE 8 m . X n be i. 5. the conclusion of von Neumann's theorem holds.200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo.andg(x. .1 < i < m.j=O. y ESp.. and 1 . . and that the mndel is regnlar. I}. 2 J'(X1 . fJ( .)w".0).l... o(S) = X = Sin.a)'. Let Lx (Bo.y) ~ L~l 12.d.. Yo . N(I'. (b) Suppose Sm = (x: Xi > 0. • " = '!.Yo) > 0 8 8 8 X a 8 Xb 0. d) = (a) Show that if I' is known to be 0 (!. B1 ) >  (l. a) ~ (0 ." 1 Xi ~ l}.. i. I j 3.b < m. .(X) = 1 if Lx (B o.
. ik).. .. 6. .. Let Xi beindependentN(I'i. show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible. and iI..\.. ik is an arbitrary unknown permutation of 1. ftk) = (Il?l'"'' J1. . Show that if = (I'I''''. Write X  • "i)' 1(1'.tj. M (n. < /12 is a known set of values. . a) = (.1)1 L(Xi . . . X k be independent with means f. PI. / .I). . . . respectively. 00.tn I(i.a? 1. . LetA = {(iI. ik) is smaller than that of (i'l"'" i~). 8. . " a. where (Ill... . I' (X"". .. N k ) has a multinomial.. dk)T. Rk) where Rj is the rank of Xi.Pi.j.. = tj. Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l . j 7. 1 < j < k. o'(X) is also minimax and R(I'.\.k... .X)' and.).d) where qj ~ 1 . jl~ < . b. X k ) = (R l . .» = Show that the minimax rule is to take L l. .. For instance...).\) distribution and 1(.a < b" = 'lb. Show that X has a Poisson (. . . 1). . .. > jm). Permutations of I. 0') = (1. (jl. Jlk.?. 1 = t j=l (di .. < im.) . prior. hence. .\ .~~n X < Pi < < R(I'. Xk)T.X)' is best among all rules of the form oc(X) = c L(Xi . 1 <i< k. See Volume II.I'kf. 0 1. o(X) ~ n~l L(Xi . Let k .j..Section 3_6 Problems and Complements 201 (b) If I' ~ 0.2. Remark: Stein (1956) has shown that if k > 3.1. b = ttl. Hint: Consider the gamma. an d R a < R b.3.k} I«i). 9.29). d) = L(d..... Then X is nurnmax. . . o(X 1. that both the MLE and the estimate 8' = (n . . J.X)' are inadmissible. . that is.P. Show that if (N). .. f(k. distribution. d = (d). See Problem 1. i=l then o(X) = X is minimax... ttl .. X is no longer unique minimax. then is minimax for the loss function r: l(p. Hint: (a) Consider a gamma prior on 0 = 1/<7'. .Pi)' Pjqj <j < k.12. . .o) for alII'... . . (c) Use (B. Let Xl.. .. h were t·.. . .j.l . .. Xl· (e) Show that if I' is unknown.
Hint: Consider the risk function of o~ No.Xo)T has a multinomial. of Xi < X • 1 + 1 o. For a given x we want to estimate the proportion F(x) of the population to the left of x.15. Let Xi(i = 1..1'»2 of these estimates is bounded for all nand 1'. Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss.i. 1 . 13. X = (X"". distribution.8... v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) . Define OiflXI v'n < d < v'n d v'n d d X .. q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q.. BO l) ~ (k .... Let the loss " function be the KullbackLeibler divergence lp(B. See also Problem 3...3. B(n. . X n be independent N(I'. . Let K(po. 1). M(n.i f X> . B j > 0.d.. 10. I: • . . 14. B).4. See Problem 3.. distribution. .B). . LB= j 01 1. with unknown distribution F. Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B. n) be i.BO)T.2.. Pk.202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI. ..a) and let the prior be the uniform prior 1r(B" .=1 Show that the Bayes estimate is (Xi . ~ I I .. .1r) = Show that the marginal density of X. ! p(x) = J po(x)1r(B)dB. Suppose that given 6 ~ B.._..... X has a binomial. J K(po.. .d with density defined in Problem 1.I)!.q)1r(B)dB.. d X+ifX 11. (b) How does the risk of these estimates compare to that of X? 12.. Suppose that given 6 = B = (B . .. I: I ir'z . ....2... .. Let X" .. + l)j(n + k). a) is the posterior mean E(6 I X)..
J') < R( (J. Show that B q(ry) = Bp(h. h1(ry)) denote the model in the new parametrization. ry).. 0) and q(x.).X is called the mutual information between 0 and X. a" (J. the Fisher information lower bound is equivariant. ao) + (1 . J). p(X) IO.x =. K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality. . if I(O. We shall say a loss function is convex. for any ao. respectively.tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible. Show that X is an UMVU estimate of O.1 E: 1 (Xi  J.f [E. 0. X n are Ll. that is. .4.4 1.. Reparametrize the model by setting ry = h(O) and let q(x. . {log PB(X)}] K(O)d(J. 0) and B(n. (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations.2) with I' . Show that if 1(0.4. then That is. ry) = p(x.k(p. (ry»).Section 3. .~). 4. X n be the indicators of n Bernoulli trials with success probability B. a) is convex and J'(X) = E( J(X) I I(X». a.. Give the Bayes rules for squared error in these three cases. Show that in theN(O. Let A = 3.6 Problems and Complements 203 minimizes k( q. Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij. Jeffrey's "Prior. and O~ (1 . . R. Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable. then E(g(X) > g(E(X». 0) cases. N(I"o." A density proportional to VJp(O) is called Jeffrey's prior.1"0 known. It is often improper. K) . .'? /1 as in (3. Problems for Section 3.alai) < al(O. .d. .. Prove Proposition 3. 15..O)!. Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient. a < a < 1. then R(O.12) for the two parametrizations p(x. Let X I. .aao + (1 . 2. N(I".1 .. Equivariance.4. Hint: k(q. 7f) and that the minimum is I(J. 0) with 0 E e c R. Show that (a) (b) cr5 = n. Fisher infonnation is not equivariant under increasing transformations of the parameter. p( x. suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8). 
Jeffrey's priors are proportional to 1. . Let X . .a )I( 0. Suppose X I.. (b) Equivariance of the Fisher Information Bound. S.
0) = Oc"I(x > 0). For instance.ZDf3)/(n . . ( 2).\ = a + bB is UMVU for estimating . • 13. . 8. Show that 8 = (Y . . Ct. 2 ~ ~ 10. ~ ~ ~ (a) Write the model for Yl give the sufficient statistic. . a ~ i 12. . Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1)... B(O. 7.\ = a + bOo 11.3. 9.Ii . Establish the claims of Example 3. compute Var(O) using each ofthe three methods indicated.o. .' . 204 Hint: See Problem 3. In Example 3. find the bias o f ao' 6. and ..5(b).. . . a ~ (c) Suppose that Zi = log[i/(n + 1)1. Does X achieve the infonnation . . Show that assumption I implies that if A . 1).1.. X n ) is a sample drawn without replacement from an unknown finite population {Xl. . n. . 0) for any set E. then = 1 forallO.P..Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate. Let X" . .4. {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J. Let F denote the class of densities with mean 0. Show that a density that minimizes the Fisher infonnation over F is f(x. then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· . Suppose Yl . Let a and b be constants. Hint: Consider T(X) = X in Theorem 3.2 (0 > 0) that satisfy the conditions of the information inequality.. (a) Find the MLE of 1/0. Show that if (Xl.11(9) as n ~ give the limit of n times the lower bound on the variances of and (J. Hint: Use the integral approximation to sums. I I .{x : p(x. Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0. Y n in twoparameter canonical exponential form and (b) Let 0 = (a.4. Show that . 14..4.2. . . . X n be a sample from the beta. i = 1.1 and variance 0.. 
Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered.XN }.. =f. 00.4. distribution.13 E R.8. P. Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji.. Suppose (J is UMVU for estimating fJ.p) is an unbiased estimate of (72 in the linear regression model of Section 2.ZDf3)T(y . Find lim n.
. :/i.. B(n.X)/Var(U).) Suppose the Uj can be relabeled into strata {xkd.. X is not II Bayes estimate for any prior 1r.511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ . . 18. Suppose UI. B(n. 17. More generally only polynomials of degree n in p are unbiasedly estimable. _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk.. . UN are as in Example 3.4).4. even for sampling without replacement in each stratum. 8). k = 1. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population. 16. K. G 19..1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o. (b) Deduce that if p.l. 20. . (c) Explain how it is possible if Po is binomial.". Stratified Sampling. Define ~ {Xkl.. = N(O.4..4. Suppose X is distributed accordihg to {p. ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss. that ~ is a Bayes estimate for O. 7. then E( M) ~ L 1r j=1 N J ~ n.. = 1. (See also Problem 1. .4.. 1 < i < h. Show that is not unbiasedly estimable. XK.} and X =K  1".Xkl. Show that X k given by (3.1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n. .3. Let 7fk = ~ and suppose 7fk = 1 < k < K. distribution..p). k=l K ~1rkXk. 15. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~. : 0 E for (J such that E((J2) < 00.• . (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl. Show that if M is the expected sample size.~ for all k. . Let X have a binomial.Section 3. P[o(X) = 91 = 1. if and only if. 
.~).6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U.6 Problems and Complements 205 (b) The variance of X is given by (3. 'L~ 1 h = N.
5. Hint: It is equivalent to show that. Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3. B). If a = 0.4. 1 X n1 C. for all X!.5) to plot the sensitivity curve of the 1. that is. . . = 2k is even. give and plot the sensitivity curves of the lower quartile X. use (3. J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6.l)a is an integer. Show that the a trimmed mean XCII.25 and (n . for all adx 1. B) is differentiable fat anB > x. Show that. B») ~ °and the information bound is infinite. E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) . 22. is an empirical plugin estimate of ~ Here case. the upper quartile X."" F (ij) Vat (:0 logp(X.9)' 21. Let X ~ U(O. however.4. Note that logp(x. If n IQR. give and plot the sensitivity curve of the median. B). 1 2.25 and net =k 3.. An estimate J(X) is said to be shift or translation equivariant if. 4.75' and the IQR. with probability I for each B. and we can thus define moments of 8/8B log p(x.4. B) be the uniform distribution on (0. Var(a T 6) > aT (¢(9)II(9). Yet show eand has finite variance. • Problems for Section 35 I is an integer. Show that the sample median X is an empirical plugin estimate of the population median v. Regularity Conditions are Needed for the Information Inequality. Prove Theorem 3. (iii) 2X is unbiased for ./(9)a [¢ T (9)a]T II (9)[¢ T (9)a]...206 Hint: Given E(ii(X) 19) ~ 9. If a = 0.25.3. 5. T .
5.. is symmetrically distributed about O.. The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00. xH L is translation equivariant and antisymmetric.6.30. Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite.JL) where JL is unknown and Xi . I)order statistics). 7. .30. X a arc translation equivariant and antisymmetric. . (b) Show that   .67. .). (See Problem 3.) ~ 8.6 Problems and Complements 207 It is antisymmetric if for all Xl. i < j. .. plot the sensitivity curves of the mean.1..3. F(x .Section 3. . X n is a sample from a population with dJ. J is an unbiased estimate of 11.  XI/0.. . median. X are unbiased estimates of the center of symmetry of a symmetric distribution. then (i. Xu. For x > . and xiflxl < k kifx > k kifx<k. . '& is an estimate of scale. k t 0 to the median.(a) Show that X. One reasonable choice for k is k = 1.1. . X.5 and for (j is. (7= moo 1 IX. The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj).e. In . Its properties are similar to those of the trimmed mean. and the HodgesLehmann estimate. (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1. Deduce that X. (b) Suppose Xl... trimmed mean with a = 1/4. It has the advantage that there is no trimming proportion Q that needs to be subjectively specified.03 (these are expected values of four N(O. <i< n Show that (a) k = 00 corresponds to X..03.
75 .x}2.6).) has heavier tails than f() if g(x) is above f(x) for Ixllarge. . Let JJo be a hypothesized mean for a certain population.X. 1 Xnl to have sample mean zero. 00. .7(a). thus. This problem may be done on the computer.1)1 L:~ 1(Xi . and Cauchy distributions. 9. Let 1'0 = 0 and choose the ideal sample Xl. (e) If k < 00. 11.) are two densities with medians v zero and identical scale parameters 7. where S2 = (n . Show that SC(x. with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O. P(IXI > 3) and P(IXI > 4) for the nonnal.d. Laplace. X 12.. the standard deviation does not exist. In what follows adjust f and 9 to have v = 0 and 'T = 1. 00. The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \.i.25. Use a fixed known 0"0 in place of Ci.Xn arei. Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00. .5. . we will use the IQR scale parameter T = X. Suppose L:~i Xi ~ . (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr. [.11. . iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00..208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0.." . and "" "'" (b) the Iratio of Problem 3. thenXk is the MLEofB when Xl..5.. .~ !~ t . Location Parameters.• 13. For the ideal sample of Problem 3. ii. then limlxj>00 SC(x. •• .O)/ao) where fo(x) for Ixl for Ixl <k > k.5. In the case of the Cauchy density. Xk) is a finite constant. with density fo((. n is fixed. plot the sensitivity curve of (a) iTn . (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10. Let X be a random variable with continuous distribution function F. and is fixed. (d) Xk is translation equivariant and antisymmetric (see Problem 3. (b) Find thetml probabilities P(IXI > 2). The (student) tratio is defined as 1= v'n(x I'o)/s. = O. If f(·) and g(. we say that g(.
Let Y denote a random variable with continuous distribution function G. and sample trimmed mean are shift and scale equivariant.(k) is a (d) For 0 < " < 1. i/(F)] is the value of some location parameter.O.1: (a) Show that SC(x.a +bxn ) ~ a + b8n (XI. ~ = d> 0. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y. ~ ~ 15. compare SC(x. c E R. x n _. SC(a + bx. 8. let H(x) be the distribution function whose inverse is H'(a) = ![xax. (b) Show that the mean Ji. it is called a location parameter. . An estimate ()n is said to be shift and scale equivariant if for all xl.).xn )..t. 8. then v(F) and i/( F) are location parameters. median v. b > 0. Fn _ I ).8. if F is also strictly increasing.(F) = !(xa + XI").:. i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F). n (b) In the following cases. Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F). (jx.5. (a) Show that if F is symmetric about c and () is a location parameter. i/(F)] and.F). where T is the median of the distributioo of IX location parameter.. c + dO. .5. and trimmed population mean JiOl (see Problem 3.]. . Hint: For the second part. · . and ordcr preserving. then for a E R. ~ (b) Write the Be as SC(x. 0 < " < 1. 8) = IF~ (x.8. F) lim n _ oo SC(x.6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :. ~ ~ 8n (a +bxI. Show that J1. v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v.5.. ([v(F). (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :. 8) and ~ IF(x.xn . X is said to be stochastically smaller than Y if = G(t) for all t E R. a + bx.... 8. then ()(F) = c. b > 0.67 and tPk is defined in Problem 3.Xn_l) to show its dependence on Xnl (Xl.Section 3. the ~ = bdSC(x.. a.. ._. (e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval. ~ (a) Show that the sample mean.. .) 14._. any point in [v(F). 
In this case we write X " Y. .... sample median.) That is. xnd. In Remark 3. F(t) ~ P(X < t) > P(Y < t) antisymmctric. . Show that if () is shift and location equivariant. If 0 is scale and shift equivariant. ~ se is shift invariant and scale equivariant. let v" = v.t )) = 0. and note thatH(xi/(F» < F(x) < H(xv(F».(F): 0 <" < 1/2}.5) are location parameters. .  vl/O.. Also note that H(x) is symmetric about zero.
BI < C1(1(i. The NewtonRaphson method in this case is . F)! ~ 0 in the cases (i). This is. (b) ~ ~ .Bilarge enough. Because priority of discovery is now given to the French mathematician M.I F(x. ) (a) Show by example that for suitable t/J and 1(1(0) . 81" < 0. · . I : . then fJ.1 is commonly known as the Cramer~Rao inequality.OJ then l(I(i) . B E B.. . we in general must take on the order of log ~ steps. also true of the method of coordinate ascent.2).1) Hint: (a) Try t/J(x) = Alogx with A> 1.1/.' > 0.3 (1) A technical problem is to give the class S of subsets of:F for which we can assign probability (the measurable sets).210 (i) Measures of Performance Chapter 3 O(F) ~ 1'1' ~ J . . and we seek the unique solution (j of 'IjJ(8) = O. then J. and (iii) preceding? ~ 16. ~ ~ 17. ~ I(x  = X n . Frechet.rdF(l:). Let d = 1 and suppose that 'I/J is twice continuously differentiable. • • . Show that in the bisection method.t is identifiable. is not identifiable.5. . We define S as the . (ii).8) .4. In the gross error model (3. {BU)} do not converge.7 NOTES Note for Section 3.field generated by SA. show that (a) If h is a density that is symmetric about zero. (1) The result of Theorem 3. (b) Show that there exists. IlFfdF(x). Assume that F is strictly increasing. • . 0. we shall • I' . (b) If no assumptions are made about h. ..4 I . 18. A. (ii) O(F) = (iii) e(F) "7. 3.B ~ {F E :F : Pp(A) E B}. . I I Notes for Section 3. in order to be certain that the Jth iterate (j J is within e of the desired () such that 'IjJ( fJ) = 0. C < 00. consequently.0 > 0 (depending on t/J) such that if lOCO) . (e) Does n j [SC(x. where B is the class of Borel sets.
TUKEY. Point Estimation Using the KullbackLeibler Loss Function and MML. Statist. ofStatist. BICKEL. M.. M. LEHMANN." Ann. BERNARDO. Location. R.8 References 211 follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement. Reading. 11391158 (1976). MA: AddisonWesley. F. NJ: Princeton University Press.. BICKEL.thematical Methods in Risk Theory Heidelberg: Springer Verlag.. P. I. P. p(x. AND E. A. 15231535 (1969). "Measures of Location and AsymmeUy. (3) The continuity of the first integral ensures that : (J [rO )00 . H. P. 1998. (2) Note that this inequality is true but uninteresting if f(O) = 00 (and 1/J'(0) is finite) or if Var. >. BICKEL. R... WALLACE. 40. 0. Numen'cal Analysis New York: Prentice Hall. "Descriptive Statistics for Nonparametric Models. "Unbiased Estimation in Convex. H. M.8 REFERENCES ANDREWS. Introduction. BJORK. BAXTER. 3. Robust Estimates of Location: Sun>ey and Advances Princeton. OLIVER. LEHMANN. BERGER. SMITH. DAHLQUIST. HUBER. in Proceedings of the Second Pa. 3. BICKEL. DoKSUM. F. J. 2. Statistical Decision Theory and Bayesian Analysis New York: Springer.. III. 2nd ed.. S. 1122 (1975). Families. ROGERS. 10381044 (l975a).10451069 (1975b). 1. 4. L. HAMPEL.. DOWE. T.. AND J. 1974. AND A. "Descriptive Statistics for Nonparametric Models. P." Ann. Statist. 1974. D. A. AND E. J.] = Jroo. Statist.. H. ANDERSON. BICKEL. J. LEHMANN. Jroo T(x) :>. DE GROOT. M." Ann. AND E. J. BOHLMANN.(T(X)) = 00. P. D. . LEHMANN.)dXd>. J. 3. K. Optimal Statistical Decisions New York: McGrawHill. W. 1970... AND N.Section 3.8).. F. Ma. J. AND E." Ann. II. 1972.. P. 1994.. Mathematical Analysis..cific Asian Conference on Knowledge Discovery and Data Mintng Melbourne: SpringerVerlag. APoSTOL. AND C. G. Math. Statist.4. Bayesian Theory New York: Wiley. 1969.." Scand.. "Descriptive Statistics for Nonparametric Models.. 
Jroo T(x) [~p(X' 0)] dx roo Joo 8(J 00 00 00 for all (J whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in (4) The finiteness of Var8(T(X)) and f(O) imply that 1/J'(0) is finite by the covariance interpretation given in (3. 1985. A. W. Dispersion.
A. 49. "Boostrap Estimate of KullbackLeibler Information for Model Selection. "Decision Analysis and BioequivaIence Trials. "The Influence Curve and Its Role in Robust Estimation. Actuarial J. 240251 (1987)." Statistica Sinica." J. C.. • • 1 ..." Ann.. HOGG. HUBER.. Third Berkeley Symposium on Math.. E.. "Adaptive Robust Procedures. TuKEY. 223239 (1987). Wiley & Sons. . . A. M. 136141 (1998).. The Foundations afStatistics New York: J.. H... LINDLEY. L. HUBER... AND W. R. "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution. StatiSl. . Soc. P... RISSANEN. 2nd ed. Assoc. I. MA: AddisonWesley. Mathematical Methods and Theory in Games. Wiley & Sons. 909927 (1974).10411067 (1972). 10201034 (1971). StrJtist. Part II: Inference.. Math. R. E. "Stochastic Complexity (With Discussions):' J. S. Assoc. 42. 1965. R. B. London: Oxford University Press. J. RoyalStatist." J. STEIN.. 13. JAECKEL. New York: Springer.. Y. Stlltist. 2Q4. 1.. Amer. AND B. • NORBERG. Soc. L. Exploratory Data Analysis Reading. "Robust Estimates of Location.. Royal Statist." Ann." Statistical Science..222 (1986).212 Measures of Performance Chapter 3 HAMPEL. Amer. Robust Statistics New York: Wiley. L. D. CASELLA. Assoc. WALLACE. 1986. P. Cambridge University Press. Part I: Probability. RONCHEro. and Economics Reading. E. University of California Press. AND P. L. 49. 383393 (1974).. Math." J. Statist." Ann. j I JEFFREYS. 43. 7. "Robust Statistics: A Review. AND G. Stu/ist. 1959. 69. FREEMAN. 1998." Proc.. J. S. "On the Attainment of the CramerRao Lower Bound. Introduction to Probability and Statistics from a Bayesian Point of View. Theory ofPoint Estimation.V. W. HAMPEL.n. (2000). D. SHIBATA. YU. ROUSSEUW. Amer... Math.. SAVAGE. LINDLEY. P. 197206 (1956). 1954.. MA: AddisonWesley. Programming. STAHEL. 538542(1973). 69. 1981. R. 2nd ed. and Probability. KARLIN. Statist. R. Robust Statistics: The Approach Based on Influence Functions New York: J. 1986. B. London. Stall'." Scand. 
"Estimation and Inference by Compact Coding (With Discussions). LEHMANN. WUSMAN. E. "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification. 1. Theory ofProbability. Testing Statistical Hypotheses New York: Springer. F. HANSEN. 1972.375394 (1997). "Model Selection and the Principle of Mimimum Description Length. LEHMANN. C." J.. H. . 1948.
Chapter 4

TESTING AND CONFIDENCE REGIONS: BASIC THEORY

4.1 INTRODUCTION

In Sections 1.3, 3.2, and 3.3 we defined the testing problem abstractly, treating it as a decision theory problem in which we are to decide whether $P \in \mathcal{P}_0$ or $\mathcal{P}_1$ or, parametrically, whether $\theta \in \Theta_0$ or $\Theta_1$ if $\mathcal{P}_j = \{P_\theta : \theta \in \Theta_j\}$, where $\mathcal{P}_0$, $\mathcal{P}_1$ or $\Theta_0$, $\Theta_1$ are a partition of the model $\mathcal{P}$ or, respectively, the parameter space $\Theta$.

This framework is natural if, as is often the case, we are trying to get a yes or no answer to important questions in science, medicine, public policy, and indeed most human activities, and we have data providing some evidence one way or the other.

As we have seen, in examples such as 1.1.3 the questions are sometimes simple and the type of data to be gathered under our control. Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial, perform a survey, or more generally construct an experiment that yields data $X$ in $\mathcal{X} \subset R^q$, modeled by us as having distribution $P_\theta$, $\theta \in \Theta$, where $\Theta$ is partitioned into $\{\Theta_0, \Theta_1\}$ with $\Theta_0$ and $\Theta_1$ corresponding, respectively, to answering "no" or "yes" to the preceding questions.

Usually, the situation is less simple. The design of the experiment may not be under our control, what is an appropriate stochastic model for the data may be questionable, and what $\Theta_0$ and $\Theta_1$ correspond to in terms of the stochastic model may be unclear. Here are two examples that illustrate these issues.

Example 4.1.1. Sex Bias in Graduate Admissions at Berkeley. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. They initially tabulated $N_{m1}$, $N_{f1}$, the numbers of admitted male and female applicants, and the corresponding numbers $N_{m0}$, $N_{f0}$ of denied applicants. If $n$ is the total number of applicants, it might be tempting to model $(N_{m1}, N_{m0}, N_{f1}, N_{f0})$ by a multinomial, $\mathcal{M}(n, p_{m1}, p_{m0}, p_{f1}, p_{f0})$, distribution. But this model is suspect because in fact we are looking at the population of all applicants here, not a sample. Accepting this model provisionally, what does the
hypothesis of no sex bias correspond to? Again it is natural to translate this into

$$P[\text{Admit} \mid \text{Male}] = \frac{p_{m1}}{p_{m1} + p_{m0}} = P[\text{Admit} \mid \text{Female}] = \frac{p_{f1}}{p_{f1} + p_{f0}}.$$

But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability

$$\frac{p_{m1}}{p_{m1} + p_{m0}} = \frac{p_{f1}}{p_{f1} + p_{f0}}.$$

In fact, as is discussed in a paper by Bickel, Hammel, and O'Connell (1975), admissions are performed at the departmental level and rates of admission differ significantly from department to department. If departments "use different coins," then the data are naturally decomposed into $N = (N_{m1d}, N_{m0d}, N_{f1d}, N_{f0d},\ d = 1, \ldots, D)$, where $N_{m1d}$ is the number of male admits to department $d$, and so on. Our multinomial assumption now becomes $N \sim \mathcal{M}(n, p_{m1d}, p_{m0d}, p_{f1d}, p_{f0d},\ d = 1, \ldots, D)$. In these terms the hypothesis of "no bias" can now be translated into:

$$H : \frac{p_{m1d}}{p_{m1d} + p_{m0d}} = \frac{p_{f1d}}{p_{f1d} + p_{f0d}}, \quad \text{for } d = 1, \ldots, D.$$

This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate,

$$\frac{p_{m1} + p_{f1}}{p_{m1} + p_{f1} + p_{m0} + p_{f0}}.$$

In fact, the same data can lead to opposite conclusions regarding these hypotheses, a phenomenon called Simpson's paradox. The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis.

Example 4.1.2. Mendel's Peas. In one of his famous experiments laying the foundation of the quantitative theory of genetics, Mendel crossed peas heterozygous for a trait with two alleles, one of which was dominant. The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). In a modern formulation, if there were $n$ dominant offspring (seeds), the natural model is to assume, if the inheritance ratio can be arbitrary, that $N_{AA}$, the number of homozygous dominants, has a binomial, $\mathcal{B}(n, p)$, distribution. The hypothesis of dominant inheritance corresponds to $H : p = \frac{1}{3}$ with the alternative $K : p \neq \frac{1}{3}$. It was noted by Fisher as reported in Jeffreys (1961) that in this experiment the observed fraction $\frac{m}{n}$ was much closer to $\frac{1}{3}$ than might be expected under the hypothesis that $N_{AA}$ has a binomial, $\mathcal{B}(n, \frac{1}{3})$, distribution,

$$P\left[\left|\frac{N_{AA}}{n} - \frac{1}{3}\right| \le \left|\frac{m}{n} - \frac{1}{3}\right|\right] = 7 \times 10^{-5}.$$

Fisher conjectured that rather than believing that such a very extraordinary event occurred it is more likely that the numbers were made to "agree with theory" by an overzealous assistant. That is, either $N_{AA}$ cannot really be thought of as stochastic or any stochastic
model needs to permit distributions other than $\mathcal{B}(n, p)$, for instance, $(1 - \epsilon)\delta_{n/3} + \epsilon\,\mathcal{B}(n, p)$, where $1 - \epsilon$ is the probability that the assistant fudged the data and $\delta_{n/3}$ is point mass at $\frac{n}{3}$.

What the second of these examples suggests is often the case. The set of distributions corresponding to one answer, say $\Theta_0$, is better defined than the alternative answer $\Theta_1$. That a treatment has no effect is easier to specify than what its effect is; see, for instance, our discussion of constant treatment effect in Example 1.1.3. In science generally a theory typically closely specifies the type of distribution $P$ of the data $X$ as, say, $P = P_\theta$, $\theta \in \Theta_0$. If the theory is false, it's not clear what $P$ should be as in the preceding Mendel example. These considerations lead to the asymmetric formulation that saying $P \in \mathcal{P}_0$ ($\theta \in \Theta_0$) corresponds to acceptance of the hypothesis $H : P \in \mathcal{P}_0$ and $P \in \mathcal{P}_1$ corresponds to rejection, sometimes written as $K : P \in \mathcal{P}_1$.(1)

As we have stated earlier, acceptance and rejection can be thought of as actions $a = 0$ or $1$, and we are then led to the natural 0-1 loss $l(\theta, a) = 0$ if $\theta \in \Theta_a$ and $1$ otherwise. Moreover, recall that a decision procedure in the case of a test is described by a test function $\delta : x \to \{0, 1\}$ or critical region $C = \{x : \delta(x) = 1\}$, the set of points for which we reject.

It is convenient to distinguish between two structural possibilities for $\Theta_0$ and $\Theta_1$: If $\Theta_0$ consists of only one point, we call $\Theta_0$ and $H$ simple. When $\Theta_0$ contains more than one point, $\Theta_0$ and $H$ are called composite. The same conventions apply to $\Theta_1$ and $K$.

We illustrate these ideas in the following example.

Example 4.1.3. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug. Suppose that we know from past experience that a fixed proportion $\theta_0 = 0.3$ recover from the disease with the old drug. What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug. To investigate this question we would have to perform a random experiment. Most simply we would sample $n$ patients, administer the new drug, and then base our decision on the observed sample $X = (X_1, \ldots, X_n)$, where $X_i$ is 1 if the $i$th patient recovers and 0 otherwise. Thus, suppose we observe $S = \sum X_i$, the number of recoveries among the $n$ randomly selected patients who have been administered the new drug.(2) If we let $\theta$ be the probability that a patient to whom the new drug is administered recovers and the population of (present and future) patients is thought of as infinite, then $S$ has a $\mathcal{B}(n, \theta)$ distribution. If we suppose the new drug is at least as effective as the old, then $\Theta = [\theta_0, 1]$, where $\theta_0$ is the probability of recovery using the old drug. Now $\Theta_0 = \{\theta_0\}$ and $H$ is simple; $\Theta_1$ is the interval $(\theta_0, 1]$ and $K$ is composite. In situations such as this one we shall simplify notation and write $H : \theta = \theta_0$, $K : \theta > \theta_0$. If we allow for the possibility that the new drug is less effective than the old, then $\Theta_0 = [0, \theta_0]$ and $\Theta_0$ is composite. It will turn out that in most cases the solution to testing problems with $\Theta_0$ simple also solves the composite $\Theta_0$ problem. See Remark 4.1.1.

In this example with $\Theta_0 = \{\theta_0\}$ it is reasonable to reject $H$ if $S$ is "much" larger than what would be expected by chance if $H$ is true and the value of $\theta$ is $\theta_0$. Thus, we reject $H$ if $S$ exceeds or equals some integer, say $k$, and accept $H$ otherwise. That is, in the
terminology of Section 1.3, our critical region $C$ is $\{X : S \ge k\}$ and the test function or rule is $\delta_k(X) = 1\{S \ge k\}$ with

$$P_I = \text{probability of type I error} = P_{\theta_0}(S \ge k)$$

$$P_{II} = \text{probability of type II error} = P_\theta(S < k), \quad \theta > \theta_0.$$

The constant $k$ that determines the critical region is called the critical value.

In most problems it turns out that the tests that arise naturally have the kind of structure we have just described. There is a statistic $T$ that "tends" to be small, if $H$ is true, and large, if $H$ is false. We call $T$ a test statistic. (Other authors consider test statistics $T$ that tend to be small, when $H$ is false. $-T$ would then be a test statistic in our sense.) We select a number $c$ and our test is to calculate $T(x)$ and then reject $H$ if $T(x) \ge c$ and accept $H$ otherwise. The value $c$ that completes our specification is referred to as the critical value of the test. Note that a test statistic generates a family of possible tests as $c$ varies. We will discuss the fundamental issue of how to choose $T$ in Sections 4.2, 4.3, and later chapters. We now turn to the prevalent point of view on how to choose $c$.

The Neyman Pearson Framework

The Neyman Pearson approach rests on the idea that, of the two errors, one can be thought of as more important. By convention this is chosen to be the type I error and that in turn determines what we call $H$ and what we call $K$. Given this position, how reasonable is this point of view? In the medical setting of Example 4.1.3 this asymmetry appears reasonable. It has also been argued that, generally in science, announcing that a new phenomenon has been observed when in fact nothing has happened (the so-called null hypothesis) is more serious than missing something new that has in fact occurred. We do not find this persuasive, but if this view is accepted, it again reasonably leads to a Neyman Pearson formulation.

As we noted in Examples 4.1.1 and 4.1.2, asymmetry is often also imposed because one of $\Theta_0$, $\Theta_1$ is much better defined than its complement and/or the distribution of statistics $T$ under $\Theta_0$ is easy to compute. In that case rejecting the hypothesis at level $\alpha$ is interpreted as a measure of the weight of evidence we attach to the falsity of $H$. For instance, testing techniques are used in searching for regions of the genome that resemble other regions that are known to have significant biological activity. One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. To determine significant values of these statistics a (more complicated) version of the following is done. Thresholds (critical values) are set so that if the matches occur at random (i.e., matches at one position are independent of matches at other positions) and the probability of a match is $\frac{1}{2}$, then the probability of exceeding the threshold (type I error) is smaller than $\alpha$. No one really believes that $H$ is true and possible types of alternatives are vaguely known at best, but computation under $H$ is easy. The Neyman Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then, as we shall see in Sections 4.2 and 4.3, suggesting what test statistics it is best to use.
There is an important class of situations in which the Neyman Pearson framework is inappropriate, such as the quality control Example 1.1.1. Indeed, it is too limited in any situation in which, even though there are just two actions, we can attach, even nominally, numbers to the two losses that are not equal and/or depend on $\theta$. See Problem 3.2.9. Finally, in the Bayesian framework with a prior distribution on the parameter, the approach of Example 3.2.2(b) is the one to take in all cases with $\Theta_0$ and $\Theta_1$ simple.

Here are the elements of the Neyman Pearson story. Begin by specifying a small number $\alpha > 0$ such that probabilities of type I error greater than $\alpha$ are undesirable. Then restrict attention to tests that in fact have the probability of rejection less than or equal to $\alpha$ for all $\theta \in \Theta_0$. Such tests are said to have level (of significance) $\alpha$, and we speak of rejecting $H$ at level $\alpha$. The values $\alpha = 0.01$ and $0.05$ are commonly used in practice. Because a test of level $\alpha$ is also of level $\alpha' > \alpha$, it is convenient to give a name to the smallest level of significance of a test. This quantity is called the size of the test and is the maximum probability of type I error. That is, if we have a test statistic $T$ and use critical value $c$, our test has size $\alpha(c)$ given by

$$\alpha(c) = \sup\{P_\theta[T(X) \ge c] : \theta \in \Theta_0\}. \qquad (4.1.1)$$

Now $\alpha(c)$ is nonincreasing in $c$ and typically $\alpha(c) \uparrow 1$ as $c \downarrow -\infty$ and $\alpha(c) \downarrow 0$ as $c \uparrow \infty$. In that case, if $0 < \alpha < 1$, there exists a unique smallest $c$ for which $\alpha(c) \le \alpha$. This is the critical value we shall use, if our test statistic is $T$ and we want level $\alpha$. It is referred to as the level $\alpha$ critical value. In Example 4.1.3 with $\delta(X) = 1\{S \ge k\}$, $\theta_0 = 0.3$ and $n = 10$, we find from binomial tables the level 0.05 critical value 6 and the test has size $\alpha(6) = P_{\theta_0}(S \ge 6) = 0.0473$.

Once the level or critical value is fixed, the probabilities of type II error as $\theta$ ranges over $\Theta_1$ are determined. By convention $1 - P[\text{type II error}]$ is usually considered. Specifically,

Definition 4.1.1. The power of a test against the alternative $\theta$ is the probability of rejecting $H$ when $\theta$ is true.

Thus, the power is 1 minus the probability of type II error. It can be thought of as the probability that the test will "detect" that the alternative $\theta$ holds. The power is a function of $\theta$ on $\Theta_1$. If $\Theta_0$ is composite as well, then the probability of type I error is also a function of $\theta$. Both the power and the probability of type I error are contained in the power function, which is defined for all $\theta \in \Theta$ by

$$\beta(\theta) = \beta(\theta, \delta) = P_\theta[\text{Rejection}] = P_\theta[\delta(X) = 1] = P_\theta[T(X) \ge c].$$

If $\theta \in \Theta_0$, $\beta(\theta, \delta)$ is just the probability of type I error, whereas if $\theta \in \Theta_1$, $\beta(\theta, \delta)$ is the power against $\theta$.

Example 4.1.3 (continued). Here

$$\beta(\theta, \delta_k) = P(S \ge k) = \sum_{j=k}^{n} \binom{n}{j} \theta^j (1 - \theta)^{n-j}.$$

A plot of this function for $n = 10$, $\theta_0 = 0.3$, $k = 6$ is given in Figure 4.1.1.
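As a rough numerical check on the binomial calculations of Example 4.1.3, the following sketch evaluates the power function $P_\theta(S \ge k)$ directly and searches for the level $\alpha$ critical value. The function names are illustrative choices, not notation from the text, and the search simply tries $k = 0, 1, \ldots$ in turn.

```python
from math import comb


def binomial_power(theta, n, k):
    """Power function P_theta(S >= k) for S ~ B(n, theta)."""
    return sum(comb(n, j) * theta**j * (1 - theta) ** (n - j)
               for j in range(k, n + 1))


def level_alpha_critical_value(theta0, n, alpha):
    """Smallest integer k whose size P_theta0(S >= k) is at most alpha."""
    for k in range(n + 2):  # k = n + 1 gives size 0, so the loop always returns
        if binomial_power(theta0, n, k) <= alpha:
            return k


if __name__ == "__main__":
    k = level_alpha_critical_value(0.3, 10, 0.05)
    print("critical value:", k)
    print("size:", round(binomial_power(0.3, 10, k), 4))
    print("power at theta = 0.5:", round(binomial_power(0.5, 10, k), 4))
```

With $n = 10$, $\theta_0 = 0.3$, and $\alpha = 0.05$ this should reproduce the critical value 6 and the size 0.0473 quoted from binomial tables; the last line evaluates the power at the illustrative alternative $\theta = 0.5$.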
Figure 4.1.1. Power function of the level 0.05 one-sided test $\delta_k$ of $H : \theta = 0.3$ versus $K : \theta > 0.3$ for the $\mathcal{B}(10, \theta)$ family of distributions. The power is plotted as a function of $\theta$, $k = 6$ and the size is 0.0473.

Note that in this example the power at $\theta = \theta_1 > 0.3$ is the probability that the level 0.05 test will detect an improvement of the recovery rate from 0.3 to $\theta_1 > 0.3$. When $\theta_1$ is 0.5, a 67% improvement, this probability is only 0.3770. What is needed to improve on this situation is a larger sample size $n$. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. We return to this question in Section 4.3.

Remark 4.1.1. From Figure 4.1.1 it appears that the power function is increasing (a proof will be given in Section 4.3). It follows that the level and size of the test are unchanged if instead of $\Theta_0 = \{\theta_0\}$ we used $\Theta_0 = [0, \theta_0]$. That is,

$$\alpha(k) = \sup\{P_\theta[T(X) \ge k] : \theta \in \Theta_0\} = P_{\theta_0}[T(X) \ge k].$$

Example 4.1.4. One-Sided Tests for the Mean of a Normal Distribution with Known Variance. Suppose that $X = (X_1, \ldots, X_n)$ is a sample from a $\mathcal{N}(\mu, \sigma^2)$ population with $\sigma^2$ known. (The $\sigma^2$ unknown case is treated in Section 4.5.) We want to test $H : \mu \le 0$ versus $K : \mu > 0$. This problem arises when we want to compare two treatments or a treatment and control (nothing) and both treatments are administered to the same subject. For instance, suppose we want to see if a drug induces sleep. We might, for each of a group of $n$ randomly selected patients, record sleeping time without the drug (or after the administration of a placebo) and then after some time administer the drug and record sleeping time again. Let $X_i$ be the difference between the time slept after administration of the drug and time slept without administration of the drug by the $i$th patient. If we assume $X_1, \ldots, X_n$ are normally distributed with mean $\mu$ and variance $\sigma^2$, then the drug effect is measured by $\mu$ and $H$ is the hypothesis that the drug has no effect or is detrimental, whereas $K$ is the alternative that it has some positive effect.
2 and 4.nX/ (J.1.1. T(X(l)). #. 1) distribution.1. The power function of the test with critical value c is p " [Vii (X !") > e _ Vii!"] (J (J 1.P.9).i. This minimum distance principle is essentially what underlies Examples 4. N~A is the MLE of p and d(N~A.1.   = . Given a test statistic T(X) we need to determine critical values and eventually the power of the resulting tests. o The Heuristics of Test Construction When hypotheses are expressed in terms of an estimable parameter H : (j E eo c RP. . it is natural to reject H for large values of X. The smallest c for whieh ~(c) < C\' is obtained by setting q:...o versus p.~I. < T(B+l) are the ordered T(X). in any case. has level a if £0 is continuous and (B + 1)(1 . In all of these cases.2. In Example 4.eo) = IN~A .Section 4.• T(X(B)). T(X(B)) from £0.V.1.(T(X)) doesn't depend on 0 for 0 E 8 0 . if we generate i.3. Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward.. . ~ estimates(jandd(~.y) : YES}.a) is the (1. £0.a) quantile oftheN(O. But it occurs also in more interesting situations such as testing p. where d is the Euclidean (or some equivalent) distance and d(x.p) because ~(z) (4..d. sup{(J(p) : p < OJ ~ (J(O) = ~(c). This occurs if 8 0 is simple as in Example 4.. (Tn) for (j E 8 0 is usually invariance The key feature of situations in which under the action of a group of transformations_ See Lehmann (1997) and Volume II for discussions of this property. p ~ P[AA]. Because (J(jL) a(e) ~ is increasing.co =.5.1. (12) observations with both parameters unknown (the t tests of Example 4.2) = 1.1. and we have available a good estimate (j of (j. However. In Example 4.co . The task of finding a critical value is greatly simplified if £.(jo) = (~ (jo)+ where y+ = Y l(y > 0).~ (c . S) inf{d(x.~(z). That is. eo). .o if we have H(p. the common distribution of T(X) under (j E 8 0 . where T(l) < . 
it is clear that a reasonable test statistic is d((j.a) is an integer (Problem 4.I') ~ ~ (c + V. T(X(lI)..1 and Example 4. which generates the same family of critical regions.1.1 Introduction 219 Because X tends to be larger under K than under H.3. Rejecting for large values of this statistic is equivalent to rejecting for large values of X. critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods.. . = P.3. has a closed form and is tabled. then the test that rejects iff T(X) > T«B+l)(la)).5).( c) = C\' or e = z(a) where z(a) = z(1 . It is convenient to replace X by the test statistic T(X) = .
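For the one-sided normal test of Example 4.1.4, the critical value $z(1 - \alpha)$ and the power function (4.1.2) are available in closed form, so they are easy to evaluate numerically. The sketch below does this with Python's standard library; the function names and the sample values of $n$, $\sigma$, and $\mu$ are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # the N(0, 1) distribution


def critical_value(alpha):
    """Level-alpha critical value z(1 - alpha) for T = sqrt(n) * xbar / sigma."""
    return STD_NORMAL.inv_cdf(1 - alpha)


def power(mu, n, sigma, alpha):
    """beta(mu) = Phi(-c + sqrt(n) * mu / sigma) with c = z(1 - alpha), as in (4.1.2)."""
    c = critical_value(alpha)
    return STD_NORMAL.cdf(-c + sqrt(n) * mu / sigma)


if __name__ == "__main__":
    # Hypothetical design: n = 25 paired differences, sigma = 1, level 0.05.
    for mu in (-0.2, 0.0, 0.2, 0.5):
        print(mu, round(power(mu, n=25, sigma=1.0, alpha=0.05), 4))
```

At $\mu = 0$ the computed power equals $\alpha$, which is the statement $\alpha(c) = \beta(0) = \Phi(-c)$; for $\mu < 0$ it is smaller and for $\mu > 0$ larger, reflecting the monotonicity of $\beta(\mu)$.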
Example 4.1.5. Goodness of Fit Tests. Let $X_1, \ldots, X_n$ be i.i.d. as $X \sim F$, where $F$ is continuous. Consider the problem of testing $H : F = F_0$ versus $K : F \neq F_0$. Let $\hat F$ denote the empirical distribution and consider the sup distance between the hypothesis $F_0$ and the plug-in estimate of $F$, the empirical distribution function $\hat F$, as a test statistic

$$D_n = \sup_x |\hat F(x) - F_0(x)|.$$

It can be shown (Problem 4.1.7) that $D_n$, which is called the Kolmogorov statistic, can be written as

$$D_n = \max_{i=1,\ldots,n} \max\left\{\frac{i}{n} - F_0(x_{(i)}),\ F_0(x_{(i)}) - \frac{i-1}{n}\right\} \qquad (4.1.3)$$

where $x_{(1)} < \cdots < x_{(n)}$ is the ordered observed sample, that is, the order statistics. This statistic has the following distribution-free property:

Proposition 4.1.1. The distribution of $D_n$ under $H$ is the same for all continuous $F_0$. In particular, $P_{F_0}(D_n \le d) = P_U(D_n \le d)$, where $U$ denotes the $\mathcal{U}(0, 1)$ distribution.

Proof. Set $U_i = F_0(X_i)$; then by Problem B.3.4, $U_i \sim \mathcal{U}(0, 1)$. Also

$$\hat F(x) = n^{-1}\sum 1\{X_i \le x\} = n^{-1}\sum 1\{F_0(X_i) \le F_0(x)\} = n^{-1}\sum 1\{U_i \le F_0(x)\} = \hat U(F_0(x))$$

where $\hat U$ denotes the empirical distribution function of $U_1, \ldots, U_n$. As $x$ ranges over $R$, $u = F_0(x)$ ranges over $(0, 1)$; thus,

$$D_n = \sup_{0 < u < 1} |\hat U(u) - u|$$

and the result follows.

Note that the hypothesis here is simple so that for any one of these hypotheses $F = F_0$, the distribution can be simulated (or exhibited in closed form). What is remarkable is that it is independent of which $F_0$ we consider. This is again a consequence of invariance properties (Lehmann, 1997).

The distribution of $D_n$ has been thoroughly studied for finite and large $n$. In particular, for $n > 80$, and $h_n(t) = t/(\sqrt{n} + 0.12 + 0.11/\sqrt{n})$, close approximations to the size $\alpha$ critical values $k_\alpha$ are $h_n(1.628)$, $h_n(1.358)$, and $h_n(1.224)$ for $\alpha = .01$, $.05$, and $.10$, respectively.

Example 4.1.6. Goodness of Fit to the Gaussian Family. Suppose $X_1, \ldots, X_n$ are i.i.d. $F$ and the hypothesis is $H : F = \Phi\left(\frac{\cdot - \mu}{\sigma}\right)$ for some $\mu$, $\sigma$, which is evidently composite. We can proceed as in Example 4.1.5, rewriting $H : F(\mu + \sigma x) = \Phi(x)$ for all $x$, where $\mu = E_F(X_1)$, $\sigma^2 = \mathrm{Var}_F(X_1)$. The natural estimate of the parameter $F(\mu + \sigma x)$ is
$\hat F(\bar X + \hat\sigma x)$, where $\bar X$ and $\hat\sigma^2$ are the MLEs of $\mu$ and $\sigma^2$. Applying the sup distance again, we obtain the statistic

$$T_n = \sup_x |\hat F(\bar X + \hat\sigma x) - \Phi(x)| = \sup_x |\hat G(x) - \Phi(x)|$$

where $\hat G$ is the empirical distribution of $(\Delta_1, \ldots, \Delta_n)$ with $\Delta_i = (X_i - \bar X)/\hat\sigma$. But, under $H$, the joint distribution of $(\Delta_1, \ldots, \Delta_n)$ doesn't depend on $\mu$, $\sigma^2$ and is that of

$$(Z_i - \bar Z) \Big/ \left(\frac{1}{n}\sum_{i=1}^n (Z_i - \bar Z)^2\right)^{1/2}, \quad 1 \le i \le n,$$

where $Z_1, \ldots, Z_n$ are i.i.d. $\mathcal{N}(0, 1)$. (See Section B.3.2.) Thus, $T_n$ has the same distribution $\mathcal{L}_0$ under $H$, whatever be $\mu$ and $\sigma^2$, and the critical value may be obtained by simulating i.i.d. observations $Z_i$, $1 \le i \le n$, from $\mathcal{N}(0, 1)$, then computing the $T_n$ corresponding to those $Z_i$. We do this $B$ times independently, thereby obtaining $T_{n1}, \ldots, T_{nB}$. Now the Monte Carlo critical value is the $[(B + 1)(1 - \alpha) + 1]$th order statistic among $T_n, T_{n1}, \ldots, T_{nB}$.

The p-Value: The Test Statistic as Evidence

Different individuals faced with the same testing problem may have different criteria of size. Experimenter I may be satisfied to reject the hypothesis $H$ using a test with size $\alpha = 0.05$, whereas experimenter II insists on using $\alpha = 0.01$. It is then possible that experimenter I rejects the hypothesis $H$, whereas experimenter II accepts $H$ on the basis of the same outcome $x$ of an experiment. If the two experimenters can agree on a common test statistic $T$, this difficulty may be overcome by reporting the outcome of the experiment in terms of the observed size or p-value or significance probability of the test. This quantity is a statistic that is defined as the smallest level of significance $\alpha$ at which an experimenter using $T$ would reject on the basis of the observed outcome $x$. That is, if the experimenter's critical value corresponds to a test of size less than the p-value, $H$ is not rejected; otherwise, $H$ is rejected.

Consider, for instance, Example 4.1.4. If we observe $X = x = (x_1, \ldots, x_n)$, we would reject $H$ if, and only if, $\alpha$ satisfies

$$T(x) = \frac{\sqrt{n}\,\bar x}{\sigma} \ge -z(\alpha)$$

or, upon applying $\Phi$ to both sides, if, and only if, $\alpha \ge \Phi(-T(x))$. Therefore, if $X = x$, the p-value is

$$\Phi(-T(x)) = \Phi\left(-\frac{\sqrt{n}\,\bar x}{\sigma}\right). \qquad (4.1.4)$$

Considered as a statistic the p-value is $\Phi(-\sqrt{n}\,\bar X/\sigma)$.
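Returning to the Monte Carlo recipe of Example 4.1.6, here is a minimal sketch of how the statistic $T_n$ and the simulated test might be computed. It assumes the pooled order-statistic rule (reject iff $T_n$ exceeds the $(B+1)(1-\alpha)$th smallest of the $B + 1$ pooled values, which without ties is the same as $T_n$ being at least the $[(B+1)(1-\alpha)+1]$th); the function names, the seed, and the two-point test sample are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def gaussian_fit_statistic(xs):
    """T_n = sup_x |G_hat(x) - Phi(x)|, G_hat the empirical df of (x_i - xbar) / sigma_hat."""
    n = len(xs)
    xbar = sum(xs) / n
    sigma_hat = sqrt(sum((x - xbar) ** 2 for x in xs) / n)  # MLE, divisor n
    deltas = sorted((x - xbar) / sigma_hat for x in xs)
    # The sup is attained at a jump of G_hat, so compare Phi with both sides
    # of each jump, in the same way as formula (4.1.3).
    return max(max(i / n - PHI(d), PHI(d) - (i - 1) / n)
               for i, d in enumerate(deltas, start=1))


def simulate_null(n, B, rng):
    """B independent copies of T_n computed from N(0, 1) samples of size n."""
    return [gaussian_fit_statistic([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(B)]


def monte_carlo_reject(t_obs, sims, alpha):
    """Reject iff t_obs exceeds the (B + 1)(1 - alpha)th smallest pooled value."""
    pooled = sorted(sims + [t_obs])
    k = round((len(sims) + 1) * (1 - alpha))  # assumed to be an integer
    return t_obs > pooled[k - 1]


if __name__ == "__main__":
    rng = random.Random(0)
    sims = simulate_null(n=100, B=199, rng=rng)
    two_point = [0.0] * 50 + [1.0] * 50  # clearly non-Gaussian sample
    print(monte_carlo_reject(gaussian_fit_statistic(two_point), sims, alpha=0.05))
```

Because $T_n$ depends on the data only through the standardized residuals $\Delta_i$, the same simulated values can be reused for any data set of the same size $n$.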
In general, let $X$ be a $q$ dimensional random vector. We will show that we can express the p-value simply in terms of the function $\alpha(\cdot)$ defined in (4.1.3). Suppose that we observe $X = x$. Then if we use critical value $c$, we would reject $H$ if, and only if, $T(x) \ge c$. Thus, the largest critical value $c$ for which we would reject is $c = T(x)$. But the size of a test with critical value $c$ is just $\alpha(c)$ and $\alpha(c)$ is decreasing in $c$. Thus, the smallest $\alpha$ for which we would reject corresponds to the largest $c$ for which we would reject and is just $\alpha(T(x))$. We have proved the following.

Proposition 4.1.1. The p-value is $\alpha(T(X))$.

This is in agreement with (4.1.4). Similarly in Example 4.1.3,

$$\alpha(s) = P_{\theta_0}(S \ge s) \approx 1 - \Phi\left(\frac{s - \frac{1}{2} - n\theta_0}{[n\theta_0(1-\theta_0)]^{1/2}}\right) \tag{4.1.5}$$

for $\min\{n\theta_0, n(1-\theta_0)\} \ge 5$, and the p-value is $\alpha(s)$ where $s$ is the observed value of $S$. The normal approximation is used for the p-value also.

The p-value is used extensively in situations of the type we described earlier, when $H$ is well defined, but $K$ is not, so that type II error considerations are unclear. In this context, to quote Fisher (1958), "The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p. 80).

The p-value can be thought of as a standardized version of our original statistic; that is, $\alpha(T)$ is on the unit interval and when $H$ is simple and $T$ has a continuous distribution, $\alpha(T)$ has a uniform, $U(0, 1)$, distribution (Problem 4.1.5).

It is possible to use p-values to combine the evidence relating to a given hypothesis $H$ provided by several different independent experiments producing different kinds of data. For example, if $r$ experimenters use continuous test statistics $T_1, \ldots, T_r$ to produce p-values $\alpha(T_1), \ldots, \alpha(T_r)$, then if $H$ is simple Fisher (1958) proposed using

$$\hat{T} = -2 \sum_{j=1}^{r} \log \alpha(T_j) \tag{4.1.6}$$

to test $H$. The statistic $\hat{T}$ has a chi-square distribution with $2r$ degrees of freedom (Problem 4.1.6). Thus, $H$ is rejected if $\hat{T} \ge x_{1-\alpha}$ where $x_{1-\alpha}$ is the $(1-\alpha)$th quantile of the $\chi^2_{2r}$ distribution. Various methods of combining the data from different experiments in this way are discussed by van Zwet and Oosterhoff (1967). More generally, these kinds of issues are currently being discussed under the rubric of data-fusion and meta-analysis (e.g., see Hedges and Olkin, 1985).
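Fisher's combination rule (4.1.6) can be sketched in a few lines. This is a minimal illustration, not code from the text: the function names and the three input p-values are hypothetical, and the chi-square tail probability uses the closed form available for even degrees of freedom, $P(\chi^2_{2r} \ge t) = e^{-t/2} \sum_{j=0}^{r-1} (t/2)^j / j!$.

```python
import math

def chi2_sf_even_df(t, r):
    """P(chi-square with 2r degrees of freedom >= t), closed form for even df."""
    half = t / 2.0
    term = 1.0   # (t/2)^j / j! for j = 0
    total = 1.0
    for j in range(1, r):
        term *= half / j
        total += term
    return math.exp(-half) * total

def fisher_combine(p_values):
    """Return (T_hat, combined p-value) for independent p-values under a simple H."""
    r = len(p_values)
    t_hat = -2.0 * sum(math.log(p) for p in p_values)
    return t_hat, chi2_sf_even_df(t_hat, r)

# Hypothetical p-values from r = 3 independent experiments.
t_hat, p_combined = fisher_combine([0.08, 0.11, 0.20])
print(round(t_hat, 3), round(p_combined, 4))
```

With a single experiment the combined p-value reduces to the original one, which is a quick sanity check on the formula.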
The combination of p-values just described gives an example in which the hypothesis specifies a distribution completely; that is, under $H$, $\alpha(T_j)$ has a $U(0, 1)$ distribution. This is an instance of testing goodness of fit; that is, we test whether the distribution of $X$ is different from a specified $F_0$.

Summary. We introduce the basic concepts and terminology of testing statistical hypotheses and give the Neyman-Pearson framework. In particular, we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter $\theta$ belongs to $\Theta_0$ or $\Theta_1$, where $\Theta_0$ and $\Theta_1$ are disjoint subsets of the parameter space $\Theta$. We introduce the basic concepts of simple and composite hypotheses, (null) hypothesis $H$ and alternative (hypothesis) $K$, test functions, critical regions, test statistics, type I error, type II error, significance level, size, power, power function, and p-value. In the Neyman-Pearson framework, we specify a small number $\alpha$ and construct tests that have at most probability (significance level) $\alpha$ of rejecting $H$ (deciding $K$) when $H$ is true; then, subject to this restriction, we try to maximize the probability (power) of rejecting $H$ when $K$ is true.

4.2 CHOOSING A TEST STATISTIC: THE NEYMAN-PEARSON LEMMA

We have seen how a hypothesis-testing problem is defined and how performance of a given test $\delta$, or equivalently, a given test statistic $T$, is measured in the Neyman-Pearson theory. Typically a test statistic is not given but must be chosen on the basis of its performance. In Sections 3.2 and 3.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. In this section we will consider the problem of finding the level $\alpha$ test that has the highest possible power. Such a test and the corresponding test statistic are called most powerful (MP).

We start with the problem of testing a simple hypothesis $H : \theta = \theta_0$ versus a simple alternative $K : \theta = \theta_1$. In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by

$$L(x, \theta_0, \theta_1) = \frac{p(x, \theta_1)}{p(x, \theta_0)}$$

where $p(x, \theta)$ is the density or frequency function of the random vector $X$. The statistic $L$ takes on the value $\infty$ when $p(x, \theta_1) > 0$, $p(x, \theta_0) = 0$; and, by convention, equals 0 when both numerator and denominator vanish.

The statistic $L$ is reasonable for testing $H$ versus $K$ with large values of $L$ favoring $K$ over $H$. For instance, in the binomial example (4.1.3),

$$L(x, \theta_0, \theta_1) = (\theta_1/\theta_0)^S [(1-\theta_1)/(1-\theta_0)]^{n-S} = [\theta_1(1-\theta_0)/\theta_0(1-\theta_1)]^S [(1-\theta_1)/(1-\theta_0)]^n, \tag{4.2.1}$$

which is large when $S = \sum X_i$ is large, and $S$ tends to be large when $K : \theta = \theta_1 > \theta_0$ is true.
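The claim that the binomial likelihood ratio grows with $S$ when $\theta_1 > \theta_0$ can be checked numerically. The sketch below is illustrative only; the function name and the values $n = 10$, $\theta_0 = 0.3$, $\theta_1 = 0.5$ are assumptions chosen for the demonstration.

```python
def binomial_lr(s, n, theta0, theta1):
    """Simple likelihood ratio p(x, theta1) / p(x, theta0) for n Bernoulli trials with S = s."""
    return (theta1 / theta0) ** s * ((1 - theta1) / (1 - theta0)) ** (n - s)

# Illustrative values (assumed): any theta1 > theta0 gives the same qualitative picture.
n, theta0, theta1 = 10, 0.3, 0.5
ratios = [binomial_lr(s, n, theta0, theta1) for s in range(n + 1)]

# The ratio is strictly increasing in s, so "L large" and "S large" define the same tests.
is_increasing = all(a < b for a, b in zip(ratios, ratios[1:]))
print(is_increasing)
```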
We call $\varphi_k$ a likelihood ratio or Neyman-Pearson (NP) test (function) if for some $0 \le k \le \infty$ we can write the test function $\varphi_k$ as

$$\varphi_k(x) = \begin{cases} 1 & \text{if } L(x, \theta_0, \theta_1) > k \\ 0 & \text{if } L(x, \theta_0, \theta_1) < k \end{cases}$$

with $\varphi_k(x)$ any value in $(0, 1)$ if equality occurs. Note (Section 3.2) that $\varphi_k$ is a Bayes rule with $k = \pi/(1-\pi)$, where $\pi$ denotes the prior probability of $\{\theta_0\}$. We show that in addition to being Bayes optimal, $\varphi_k$ is MP for level $E_{\theta_0}\varphi_k(X)$.

Because we want results valid for all possible test sizes $\alpha$ in $[0, 1]$, we consider randomized tests $\varphi$, which are tests that may take values in $(0, 1)$. If $0 < \varphi(x) < 1$ for the observation vector $x$, the interpretation is that we toss a coin with probability of heads $\varphi(x)$ and reject $H$ iff the coin shows heads. (See also Section 1.3.) For instance, if we want size $\alpha = .05$ in Example 4.1.3 with $n = 10$ and $\theta_0 = 0.3$, we choose $\varphi(x) = 0$ if $S < 5$, $\varphi(x) = 1$ if $S > 5$, and $\varphi(x) = [0.05 - P(S > 5)]/P(S = 5) = .0262$ if $S = 5$. Such randomized tests are not used in practice. They are only used to show that with randomization, likelihood ratio tests are unbeatable no matter what the size $\alpha$ is.

Theorem 4.2.1. (Neyman-Pearson Lemma). (a) If $\alpha > 0$ and $\varphi_k$ is a size $\alpha$ likelihood ratio test, then $\varphi_k$ is MP in the class of level $\alpha$ tests.
(b) For each $0 \le \alpha \le 1$ there exists an MP size $\alpha$ likelihood ratio test provided that randomization is permitted, $0 < \varphi(x) < 1$, for some $x$.
(c) If $\varphi$ is an MP level $\alpha$ test, then it must be a level $\alpha$ likelihood ratio test; that is, there exists $k$ such that

$$P_\theta[\varphi(X) \ne \varphi_k(X),\ L(X, \theta_0, \theta_1) \ne k] = 0 \tag{4.2.2}$$

for $\theta = \theta_0$ and $\theta = \theta_1$.

Proof. (a) Let $E_i$ denote $E_{\theta_i}$, $i = 0, 1$, and suppose $\varphi$ is a level $\alpha$ test, that is, $E_0\varphi(X) \le \alpha = E_0\varphi_k(X)$. We want to show $E_1[\varphi_k(X) - \varphi(X)] \ge 0$. To this end consider

$$E_1[\varphi_k(X) - \varphi(X)] - kE_0[\varphi_k(X) - \varphi(X)] = E_0\left\{[\varphi_k(X) - \varphi(X)]\left[\frac{p(X, \theta_1)}{p(X, \theta_0)} - k\right]\right\} + E_1\{[\varphi_k(X) - \varphi(X)]\,1\{p(X, \theta_0) = 0\}\} = I + II \text{ (say)}, \tag{4.2.3}$$

where $I = E_0\{\varphi_k(X)[L(X, \theta_0, \theta_1) - k] - \varphi(X)[L(X, \theta_0, \theta_1) - k]\}$.

Because $L(x, \theta_0, \theta_1) - k$ is $< 0$ or $\ge 0$ according as $\varphi_k(x)$ is 0 or in $(0, 1]$, and because $0 \le \varphi(x) \le 1$, then $I \ge 0$. Note that $\alpha > 0$ implies $k < \infty$ and, thus, $\varphi_k(x) = 1$ if $p(x, \theta_0) = 0$. It follows that $II \ge 0$. Finally, using (4.2.3), we have shown that

$$E_1[\varphi_k(X) - \varphi(X)] \ge kE_0[\varphi_k(X) - \varphi(X)] \ge 0. \tag{4.2.4}$$
(b) If $\alpha = 0$, $k = \infty$ makes $\varphi_k$ MP size $\alpha$. If $\alpha = 1$, $k = 0$ makes $E_1\varphi_k(X) = 1$ and $\varphi_k$ is MP size $\alpha$. Next consider $0 < \alpha < 1$. Let $P_i$ denote $P_{\theta_i}$, $i = 0, 1$. Because $P_0[L(X, \theta_0, \theta_1) = \infty] = 0$, then there exists $k < \infty$ such that

$$P_0[L(X, \theta_0, \theta_1) > k] \le \alpha \quad \text{and} \quad P_0[L(X, \theta_0, \theta_1) \ge k] \ge \alpha.$$

If $P_0[L(X, \theta_0, \theta_1) = k] = 0$, then $\varphi_k$ is MP size $\alpha$. If not, define

$$\varphi_k(x) = \frac{\alpha - P_0[L(X, \theta_0, \theta_1) > k]}{P_0[L(X, \theta_0, \theta_1) = k]}$$

on the set $\{x : L(x, \theta_0, \theta_1) = k\}$. Now $\varphi_k$ is MP size $\alpha$.

(c) Let $x \in \{x : p(x, \theta_1) > 0\}$; then to have equality in (4.2.4) we need to have $\varphi(x) = \varphi_k(x) = 1$ when $L(x, \theta_0, \theta_1) > k$ and have $\varphi(x) = \varphi_k(x) = 0$ when $L(x, \theta_0, \theta_1) < k$. It follows that (4.2.2) holds for $\theta = \theta_1$. The same argument works for $x \in \{x : p(x, \theta_0) > 0\}$ and $\theta = \theta_0$. □

It follows from the Neyman-Pearson lemma that an MP test has power at least as large as its level; that is,

Corollary 4.2.1. If $\varphi$ is an MP level $\alpha$ test, then $E_{\theta_1}\varphi(X) \ge \alpha$ with equality iff $p(\cdot, \theta_0) = p(\cdot, \theta_1)$.

Proof. See Problem 4.2.7.

Remark 4.2.1. Let $\pi$ denote the prior probability of $\theta_0$ so that $(1 - \pi)$ is the prior probability of $\theta_1$. Then the posterior probability of $\theta_1$ is

$$\pi(\theta_1 \mid x) = \frac{(1-\pi)p(x, \theta_1)}{(1-\pi)p(x, \theta_1) + \pi p(x, \theta_0)} = \frac{(1-\pi)L(x, \theta_0, \theta_1)}{(1-\pi)L(x, \theta_0, \theta_1) + \pi}. \tag{4.2.5}$$

If $\delta_\pi$ denotes the Bayes procedure of Example 3.2.2(b), then, when $\pi = k/(k+1)$, $\delta_\pi = \varphi_k$. Moreover, we conclude from (4.2.5) that this $\delta_\pi$ decides $\theta_1$ or $\theta_0$ according as $\pi(\theta_1 \mid x)$ is larger than or smaller than 1/2.

Part (a) of the lemma can, for $0 < \alpha < 1$, also be easily argued from this Bayes property of $\varphi_k$ (Problem 4.2.10).

Here is an example illustrating calculation of the most powerful level $\alpha$ test $\varphi_k$.

Example 4.2.1. Consider Example 3.3.2 where $X = (X_1, \ldots, X_n)$ is a sample of $n$ $N(\mu, \sigma^2)$ random variables with $\sigma^2$ known and we test $H : \mu = 0$ versus $K : \mu = v$, where $v$ is a known signal. We found

$$L(X, 0, v) = \exp\left\{\frac{v}{\sigma^2}\sum_{i=1}^{n} X_i - \frac{nv^2}{2\sigma^2}\right\}.$$

Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. Therefore,

$$T(X) = \sqrt{n}\,\frac{\bar{X}}{\sigma} = \frac{\sigma}{v\sqrt{n}}\left[\log L(X, 0, v) + \frac{nv^2}{2\sigma^2}\right]$$
is also optimal for this problem. But $T$ is the test statistic we proposed in Example 4.1.4. From our discussion there we know that for any specified $\alpha$, the test that rejects if, and only if,

$$T \ge z(1 - \alpha) \tag{4.2.6}$$

has probability of type I error $\alpha$.

The power of this test is, by (4.1.2), $\Phi(z(\alpha) + (v\sqrt{n}/\sigma))$. By the Neyman-Pearson lemma this is the largest power available with a level $\alpha$ test. Thus, if we want the probability of detecting a signal $v$ to be at least a preassigned value $\beta$ (say, .90 or .95), then we solve $\Phi(z(\alpha) + (v\sqrt{n}/\sigma)) = \beta$ for $n$ and find that we need to take $n = (\sigma/v)^2[z(1-\alpha) + z(\beta)]^2$. This is the smallest possible $n$ for any size $\alpha$ test. □

An interesting feature of the preceding example is that the test defined by (4.2.6) that is MP for a specified signal $v$ does not depend on $v$: The same test maximizes the power for all possible signals $v > 0$. Such a test is called uniformly most powerful (UMP). We will discuss the phenomenon further in the next section.

The following important example illustrates, among other things, that the UMP test phenomenon is largely a feature of one-dimensional parameter problems.

Example 4.2.2. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Function. Suppose $X \sim N(\mu_j, \Sigma_j)$, $\theta_j = (\mu_j, \Sigma_j)$, $j = 0, 1$. The likelihood ratio test for $H : \theta = \theta_0$ versus $K : \theta = \theta_1$ is based on

$$L(\theta_0, \theta_1) = \frac{\det^{1/2}(\Sigma_0)\exp\{-\frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1)\}}{\det^{1/2}(\Sigma_1)\exp\{-\frac{1}{2}(x - \mu_0)^T\Sigma_0^{-1}(x - \mu_0)\}}.$$

Rejecting $H$ for $L$ large is equivalent to rejecting for

$$Q \equiv (X - \mu_0)^T\Sigma_0^{-1}(X - \mu_0) - (X - \mu_1)^T\Sigma_1^{-1}(X - \mu_1)$$

large. Particularly important is the case $\Sigma_0 = \Sigma_1$ when "$Q$ large" is equivalent to "$F \equiv (\mu_1 - \mu_0)^T\Sigma_0^{-1}X$ large." The function $F$ is known as the Fisher discriminant function. It is used in a classification context in which $\theta_0$, $\theta_1$ correspond to two known populations and we desire to classify a new observation $X$ as belonging to one or the other. We return to this in Volume II. Note that in general the test statistic $L$ depends intrinsically on $\mu_0$, $\mu_1$. However if, say, $\mu_1 = \mu_0 + \lambda\Delta_0$, $\lambda > 0$ and $\Sigma_1 = \Sigma_0$, then, if $\mu_0$, $\Delta_0$, $\Sigma_0$ are known, a UMP (for all $\lambda$) test exists and is given by: Reject if

$$\Delta_0^T\Sigma_0^{-1}(X - \mu_0) \ge c \tag{4.2.7}$$

where $c = z(1 - \alpha)[\Delta_0^T\Sigma_0^{-1}\Delta_0]^{1/2}$ (Problem 4.2.8). If $\Delta_0 = (1, 0, \ldots, 0)^T$ and $\Sigma_0 = I$, then this test rule is to reject $H$ if $X_1$ is large; however, if $\Sigma_0 \ne I$, this is no longer the case (Problem 4.2.9). In this example we have assumed that $\theta_0$ and $\theta_1$ for the two populations are known. If this is not the case, they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population covariances.
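The linear rejection rule "reject if $\Delta_0^T\Sigma_0^{-1}(X - \mu_0) \ge z(1-\alpha)[\Delta_0^T\Sigma_0^{-1}\Delta_0]^{1/2}$" can be sketched for the two-dimensional case. This is only an illustration under stated assumptions: the helper names and the covariance matrices below are hypothetical, and the 2x2 inverse is written out by hand to keep the example dependency-free.

```python
from statistics import NormalDist

def solve_2x2(sigma, v):
    """Return sigma^{-1} v for a 2x2 matrix sigma given as ((a, b), (c, d))."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    return ((d * v[0] - b * v[1]) / det, (-c * v[0] + a * v[1]) / det)

def ump_rule(delta0, sigma0, alpha):
    """Weights w = Sigma0^{-1} Delta0 and cutoff c: reject H if w . (x - mu0) >= c."""
    w = solve_2x2(sigma0, delta0)
    quad = delta0[0] * w[0] + delta0[1] * w[1]   # Delta0^T Sigma0^{-1} Delta0
    c = NormalDist().inv_cdf(1 - alpha) * quad ** 0.5
    return w, c

def reject(x, mu0, w, c):
    """Apply the rule to one observation x."""
    return w[0] * (x[0] - mu0[0]) + w[1] * (x[1] - mu0[1]) >= c

# Identity covariance: only the first coordinate enters the statistic.
w_id, c_id = ump_rule((1.0, 0.0), ((1.0, 0.0), (0.0, 1.0)), 0.05)
# Correlated coordinates (assumed correlation 0.8): the second coordinate now gets a nonzero weight.
w_corr, c_corr = ump_rule((1.0, 0.0), ((1.0, 0.8), (0.8, 1.0)), 0.05)
print(w_id, w_corr)
```

The second call shows why "reject when $X_1$ is large" stops being the rule once $\Sigma_0 \ne I$: the statistic also uses $X_2$ to remove the part of $X_1$ explained by correlation.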
Summary. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis $H : \theta = \theta_0$ versus the simple alternative $K : \theta = \theta_1$. The Neyman-Pearson lemma, which states that the size $\alpha$ SLR test is uniquely most powerful (MP) in the class of level $\alpha$ tests, is established. We note the connection of the MP test to the Bayes procedure of Section 3.2 for deciding between $\theta_0$ and $\theta_1$. Two examples in which the MP test does not depend on $\theta_1$ are given. Such tests are said to be UMP (uniformly most powerful).

4.3 UNIFORMLY MOST POWERFUL TESTS AND MONOTONE LIKELIHOOD RATIO MODELS

We saw in the two Gaussian examples of Section 4.2 that UMP tests for one-dimensional parameter problems exist. This phenomenon is not restricted to the Gaussian case as the next example illustrates. Before we give the example, here is the general definition of UMP:

Definition 4.3.1. A level $\alpha$ test $\varphi^*$ is uniformly most powerful (UMP) for $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta_1$ if

$$\beta(\theta, \varphi^*) \ge \beta(\theta, \varphi) \text{ for all } \theta \in \Theta_1, \tag{4.3.1}$$

for any other level $\alpha$ test $\varphi$.

Example 4.3.1. Testing for a Multinomial Vector. Suppose that $(N_1, \ldots, N_k)$ has a multinomial $M(n, \theta_1, \ldots, \theta_k)$ distribution with frequency function,

$$p(n_1, \ldots, n_k, \theta) = \frac{n!}{n_1! \cdots n_k!}\,\theta_1^{n_1} \cdots \theta_k^{n_k}$$

where $n_1, \ldots, n_k$ are integers summing to $n$. With such data we often want to test a simple hypothesis $H : \theta_1 = \theta_{10}, \ldots, \theta_k = \theta_{k0}$. For instance, if a match in a genetic breeding experiment can result in $k$ types, $n$ offspring are observed, and $N_i$ is the number of offspring of type $i$, then $(N_1, \ldots, N_k) \sim M(n, \theta_1, \ldots, \theta_k)$. The simple hypothesis would correspond to the theory that the expected proportion of offspring of types $1, \ldots, k$ are given by $\theta_{10}, \ldots, \theta_{k0}$. Usually the alternative to $H$ is composite. However, there is sometimes a simple alternative theory $K : \theta_1 = \theta_{11}, \ldots, \theta_k = \theta_{k1}$. In this case, the likelihood ratio $L$ is

$$L = \prod_{i=1}^{k}\left(\frac{\theta_{i1}}{\theta_{i0}}\right)^{N_i}.$$

Here is an interesting special case: Suppose $\theta_{j0} > 0$ for all $j$, $0 < \epsilon < 1$ and for some fixed integer $l$ with $1 \le l \le k$

$$\theta_{l1} = \epsilon\theta_{l0}; \quad \theta_{j1} = \rho\theta_{j0},\ j \ne l, \tag{4.3.2}$$

where $\rho = (1 - \theta_{l0})^{-1}(1 - \epsilon\theta_{l0})$.
That is, under the alternative, type $l$ is less frequent than under $H$ and the conditional probabilities of the other types given that type $l$ has not occurred are the same under $K$ as they are under $H$. Then $L = \rho^{n - N_l}\epsilon^{N_l} = \rho^n(\epsilon/\rho)^{N_l}$. Because $\epsilon < 1$ implies that $\rho > \epsilon$, we conclude that the MP test rejects $H$, if and only if, $N_l \le c$. Critical values for level $\alpha$ are easily determined because $N_l \sim B(n, \theta_{l0})$ under $H$. Moreover, for $\alpha = P(N_l \le c)$, this test is UMP for testing $H$ versus $K : \theta \in \Theta_1 = \{\theta : \theta \text{ is of the form (4.3.2) with } 0 < \epsilon < 1\}$. Note that because $l$ can be any of the integers $1, \ldots, k$, we get radically different best tests depending on which $\theta_i$ we assume to be $\theta_{l0}$ under $H$. □

Typically the MP test of $H : \theta = \theta_0$ versus $K : \theta = \theta_1$ depends on $\theta_1$ and the test is not UMP. However, we have seen three models where, in the case of a real parameter, there is a statistic $T$ such that the test with critical region $\{x : T(x) \ge c\}$ is UMP. This is part of a general phenomenon we now describe.

Definition 4.3.2. The family of models $\{P_\theta : \theta \in \Theta\}$ with $\Theta \subset R$ is said to be a monotone likelihood ratio (MLR) family if for $\theta_1 < \theta_2$ the distributions $P_{\theta_1}$ and $P_{\theta_2}$ are distinct and the ratio $p(x, \theta_2)/p(x, \theta_1)$ is an increasing function of $T(x)$.

Example 4.3.2 (Example 4.1.3 continued). In this i.i.d. Bernoulli case, set $s = \sum_{i=1}^{n} x_i$, then

$$p(x, \theta) = \theta^s(1 - \theta)^{n-s} = (1 - \theta)^n[\theta/(1 - \theta)]^s$$

and the model is by (4.2.1) MLR in $s$.

Example 4.3.3. Consider the one-parameter exponential family model

$$p(x, \theta) = h(x)\exp\{\eta(\theta)T(x) - B(\theta)\}.$$

If $\eta(\theta)$ is strictly increasing in $\theta \in \Theta$, then this family is MLR. Example 4.2.1 is of this form with $T(x) = \sqrt{n}\bar{x}/\sigma$ and $\eta(\mu) = \sqrt{n}\mu/\sigma$, where $\sigma$ is known. □

Define the Neyman-Pearson (NP) test function

$$\delta_t(x) = \begin{cases} 1 & \text{if } T(x) > t \\ 0 & \text{if } T(x) < t \end{cases} \tag{4.3.3}$$

with $\delta_t(x)$ any value in $(0, 1)$ if $T(x) = t$. Consider the problem of testing $H : \theta = \theta_0$ versus $K : \theta = \theta_1$ with $\theta_0 < \theta_1$. If $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R$, is an MLR family in $T(x)$, then $L(x, \theta_0, \theta_1) = h(T(x))$ for some increasing function $h$. Thus, $\delta_t$ equals the likelihood ratio test $\varphi_{h(t)}$ and is MP. Because $\delta_t$ does not depend on $\theta_1$, it is UMP at level $\alpha = E_{\theta_0}\delta_t(X)$ for testing $H : \theta = \theta_0$ versus $K : \theta > \theta_0$, in fact.

Theorem 4.3.1. Suppose $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R$, is an MLR family in $T(x)$.
(1) For each $t \in (0, \infty)$, the power function $\beta(\theta) = E_\theta\delta_t(X)$ is increasing in $\theta$.
(2) If $E_{\theta_0}\delta_t(X) = \alpha > 0$, then $\delta_t$ is UMP level $\alpha$ for testing $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$.
Proof. (1) follows from $\delta_t = \varphi_{h(t)}$ and Corollary 4.2.1 by noting that for any $\theta_1 < \theta_2$, $\delta_t$ is MP at level $E_{\theta_1}\delta_t(X)$ for testing $H : \theta = \theta_1$ versus $K : \theta = \theta_2$. To show (2), recall that we have seen that $\delta_t$ maximizes the power for testing $H : \theta = \theta_0$ versus $K : \theta > \theta_0$ among the class of tests with level $\alpha = E_{\theta_0}\delta_t(X)$. If $\theta < \theta_0$, then by (1), $E_\theta\delta_t(X) \le \alpha$ and $\delta_t$ is of level $\alpha$ for $H : \theta \le \theta_0$. Because the class of tests with level $\alpha$ for $H : \theta \le \theta_0$ is contained in the class of tests with level $\alpha$ for $H : \theta = \theta_0$, and because $\delta_t$ maximizes the power over this larger class, $\delta_t$ is UMP for $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$. □

The following useful result follows immediately.

Corollary 4.3.1. Suppose $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R$, is an MLR family in $T(x)$. If the distribution function $F_0$ of $T(X)$ under $X \sim P_{\theta_0}$ is continuous and if $t(1 - \alpha)$ is a solution of $F_0(t) = 1 - \alpha$, then the test that rejects $H$ if and only if $T(x) \ge t(1 - \alpha)$ is UMP level $\alpha$ for testing $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$.

Example 4.3.4. Testing Precision. Suppose $X_1, \ldots, X_n$ is a sample from a $N(\mu, \sigma^2)$ population, where $\mu$ is a known standard, and we are interested in the precision $\sigma^{-1}$ of the measurements $X_1, \ldots, X_n$. For instance, we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard. Because the most serious error is to judge the precision adequate when it is not, we test $H : \sigma \ge \sigma_0$ versus $K : \sigma < \sigma_0$, where $\sigma_0^{-1}$ represents the minimum tolerable precision. Let $S = \sum_{i=1}^{n}(X_i - \mu)^2$, then

$$p(x, \theta) = \exp\left\{-\frac{S}{2\sigma^2} - \frac{n}{2}\log(2\pi\sigma^2)\right\}.$$

This is a one-parameter exponential family and, indexed by the precision $\sigma^{-1}$, is MLR in $T = -S$. The UMP level $\alpha$ test rejects $H$ if and only if $S \le s(\alpha)$ where $s(\alpha)$ is such that $P_{\sigma_0}(S \le s(\alpha)) = \alpha$. If we write

$$\frac{S}{\sigma_0^2} = \sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma_0}\right)^2$$

we see that $S/\sigma_0^2$ has a $\chi^2_n$ distribution. Thus, the critical constant $s(\alpha)$ is $\sigma_0^2 x_n(\alpha)$, where $x_n(\alpha)$ is the $\alpha$th quantile of the $\chi^2_n$ distribution. □

Example 4.3.5. Quality Control. Suppose that, as in Example 1.1.1, $X$ is the observed number of defectives in a sample of $n$ chosen at random without replacement from a lot of $N$ items containing $b$ defectives, where $b = N\theta$. If the inspector making the test considers lots with $b_0 = N\theta_0$ defectives or more unsatisfactory, she formulates the hypothesis $H$ as $\theta \ge \theta_0$, the alternative $K$ as $\theta < \theta_0$, and specifies an $\alpha$ such that the probability of rejecting $H$ (keeping a bad lot) is at most $\alpha$. If $\alpha$ is a value taken on by the distribution of $X$, we now show that the test $\delta^*$ which rejects $H$ if, and only if, $X \le h(\alpha)$, where $h(\alpha)$ is the $\alpha$th quantile of the hypergeometric, $\mathcal{H}(N\theta_0, N, n)$, distribution, is UMP level $\alpha$. For simplicity suppose that $b_0 \ge n$, $N - b_0 \ge n$. Then, if $N\theta_1 = b_1 < b_0$ and $0 \le x \le b_1$, (1.1.1) yields

$$L(x, \theta_0, \theta_1) = \frac{b_1(b_1 - 1)\cdots(b_1 - x + 1)(N - b_1)\cdots(N - b_1 - n + x + 1)}{b_0(b_0 - 1)\cdots(b_0 - x + 1)(N - b_0)\cdots(N - b_0 - n + x + 1)}.$$
Note that $L(x, \theta_0, \theta_1) = 0$ for $b_1 < x \le n$. Thus, for $0 \le x \le b_1 - 1$,

$$\frac{L(x + 1, \theta_0, \theta_1)}{L(x, \theta_0, \theta_1)} = \frac{b_1 - x}{b_0 - x} \cdot \frac{(N - n + 1) - (b_0 - x)}{(N - n + 1) - (b_1 - x)} < 1.$$

Therefore, $L$ is decreasing in $x$ and the hypergeometric model is an MLR family in $T(x) = -x$. It follows that $\delta^*$ is UMP level $\alpha$. The critical values for the hypergeometric distribution are available on statistical calculators and software. □

Power and Sample Size

In the Neyman-Pearson framework we choose the test whose size is small. That is, we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis $H$ is small. On the other hand, we would also like large power $\beta(\theta)$ when $\theta \in \Theta_1$; that is, we want the probability of correctly detecting an alternative $K$ to be large. However, as seen in Figure 4.1.1 and formula (4.1.2), this is, in general, not possible for all parameters in the alternative $\Theta_1$. In both these cases, $H$ and $K$ are of the form $H : \theta \le \theta_0$ and $K : \theta > \theta_0$, and the powers are continuous increasing functions with $\lim_{\theta \downarrow \theta_0}\beta(\theta) = \alpha$. By Corollary 4.3.1, this is a general phenomenon in MLR family models with $p(x, \theta)$ continuous in $\theta$.

This continuity of the power shows that not too much significance can be attached to acceptance of $H$, if all points in the alternative are of equal significance: We can find $\theta > \theta_0$ sufficiently close to $\theta_0$ so that $\beta(\theta)$ is arbitrarily close to $\beta(\theta_0) = \alpha$. For such $\theta$ the probability of falsely accepting $H$ is almost $1 - \alpha$.

This is not serious in practice if we have an indifference region. This is a subset of the alternative on which we are willing to tolerate low power. In our normal example 4.1.4 we might be uninterested in values of $\mu$ in $(0, \Delta)$ for some small $\Delta > 0$ because such improvements are negligible. Thus, $(0, \Delta)$ would be our indifference region. Off the indifference region, we want guaranteed power as well as an upper bound on the probability of type I error. In our example this means that in addition to the indifference region and level $\alpha$, we specify $\beta$ close to 1 and would like to have $\beta(\mu) \ge \beta$ for all $\mu \ge \Delta$. This is possible for arbitrary $\beta < 1$ only by making the sample size $n$ large enough. In Example 4.1.4 because $\beta(\mu)$ is increasing, the appropriate $n$ is obtained by solving

$$\beta(\Delta) = \Phi(z(\alpha) + \sqrt{n}\Delta/\sigma) = \beta$$

for sample size $n$. This equation is equivalent to

$$z(\alpha) + \sqrt{n}\Delta/\sigma = z(\beta)$$

whose solution is

$$n = (\Delta/\sigma)^{-2}[z(1 - \alpha) + z(\beta)]^2.$$

Note that a small signal-to-noise ratio $\Delta/\sigma$ will require a large sample size $n$.
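The sample size calculation for the one-sided Gaussian test with known $\sigma$ can be sketched as follows. The function names and the numerical inputs ($\Delta/\sigma = 0.5$, $\alpha = .05$, $\beta = .90$) are assumptions for illustration; the formulas are $\beta(\Delta) = \Phi(z(\alpha) + \sqrt{n}\Delta/\sigma)$ and $n = (\Delta/\sigma)^{-2}[z(1-\alpha) + z(\beta)]^2$, rounded up to an integer.

```python
import math
from statistics import NormalDist

def power(n, delta, sigma, alpha):
    """beta(delta) = Phi(z(alpha) + sqrt(n) * delta / sigma) for the one-sided z test."""
    nd = NormalDist()
    return nd.cdf(nd.inv_cdf(alpha) + math.sqrt(n) * delta / sigma)

def sample_size(delta, sigma, alpha, beta):
    """Smallest integer n giving power at least beta at mu = delta."""
    nd = NormalDist()
    n_real = (delta / sigma) ** -2 * (nd.inv_cdf(1 - alpha) + nd.inv_cdf(beta)) ** 2
    return math.ceil(n_real)

# Illustrative inputs (assumed): signal-to-noise ratio 0.5, level .05, target power .90.
n_needed = sample_size(delta=0.5, sigma=1.0, alpha=0.05, beta=0.90)
print(n_needed, power(n_needed, 0.5, 1.0, 0.05))
```

Halving the signal-to-noise ratio multiplies the required $n$ by four, which is the quantitative content of the remark about small $\Delta/\sigma$.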
Dual to the problem of not having enough power is that of having too much. It is natural to associate statistical significance with practical significance so that a very low p-value is interpreted as evidence that the alternative that holds is physically significant, that is, far from the hypothesis. Formula (4.1.2) shows that, if $n$ is very large and/or $\sigma$ is small, we can have very great power for alternatives very close to 0. This problem arises particularly in goodness-of-fit tests (see Example 4.1.5), when we test the hypothesis that a very large sample comes from a particular distribution. Such hypotheses are often rejected even though for practical purposes "the fit is good enough." The reason is that $n$ is so large that unimportant small discrepancies are picked up. There are various ways of dealing with this problem. They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is $\alpha$. In Example 4.1.4 this would mean rejecting $H$ if, and only if,

$$\sqrt{n}\,\frac{\bar{X}}{\sigma} \ge z(1 - \alpha) + \sqrt{n}\,\frac{\Delta}{\sigma}.$$

As a further example and precursor to Section 5.4.4, we next show how to find the sample size that will "approximately" achieve desired power $\beta$ for the size $\alpha$ test in the binomial example.

Example 4.3.6 (Example 4.1.3 continued). Our discussion uses the classical normal approximation to the binomial distribution. First, to achieve approximate size $\alpha$, we solve $\beta(\theta_0) = P_{\theta_0}(S \ge s) = \alpha$ for $s$ using (4.1.4) and find the approximate critical value

$$s_0 = n\theta_0 + \frac{1}{2} + z(1 - \alpha)[n\theta_0(1 - \theta_0)]^{1/2}.$$

Again using the normal approximation, we find

$$\beta(\theta) = P_\theta(S \ge s_0) = \Phi\left(\frac{n\theta + \frac{1}{2} - s_0}{[n\theta(1 - \theta)]^{1/2}}\right).$$

Now consider the indifference region $(\theta_0, \theta_1)$, where $\theta_1 = \theta_0 + \Delta$, $\Delta > 0$. We solve $\beta(\theta_1) = \beta$ for $n$ and find the approximate solution

$$n = (\theta_1 - \theta_0)^{-2}\left\{z(1 - \alpha)[\theta_0(1 - \theta_0)]^{1/2} + z(\beta)[\theta_1(1 - \theta_1)]^{1/2}\right\}^2.$$

For instance, if $\alpha = .05$, $\beta = .90$, $\theta_0 = 0.3$, and $\theta_1 = 0.35$, we need

$$n = (0.05)^{-2}\left\{1.645 \times [0.3(0.7)]^{1/2} + 1.282 \times [0.35(0.65)]^{1/2}\right\}^2 \approx 745.6.$$

Thus, the size .05 binomial test of $H : \theta = 0.3$ requires approximately 746 observations to have probability .90 of detecting the 17% increase in $\theta$ from 0.3 to 0.35. The power actually achieved by the level .05 test at $\theta = .35$ can be computed exactly from the binomial distribution, for instance with the SPLUS package. □

Our discussion can be generalized. Suppose $\theta$ is a vector. Often there is a function $q(\theta)$ such that $H$ and $K$ can be formulated as $H : q(\theta) \le q_0$ and $K : q(\theta) > q_0$. Now let
$q_1 > q_0$ be a value such that we want to have power $\beta(\theta)$ at least $\beta$ when $q(\theta) \ge q_1$. The set $\{\theta : q_0 < q(\theta) < q_1\}$ is our indifference region. For each $n$ suppose we have a level $\alpha$ test for $H$ versus $K$ based on a suitable test statistic $T$. Suppose that $\beta(\theta)$ depends on $\theta$ only through $q(\theta)$ and is a continuous increasing function of $q(\theta)$, and also increases to 1 for fixed $\theta \in \Theta_1$ as $n \to \infty$. To achieve level $\alpha$ and power at least $\beta$, first let $c_0$ be the smallest number $c$ such that

$$P_{\theta_0}[T \ge c] \le \alpha.$$

Then let $n$ be the smallest integer such that

$$P_{\theta_1}[T \ge c_0] \ge \beta$$

where $\theta_0$ is such that $q(\theta_0) = q_0$ and $\theta_1$ is such that $q(\theta_1) = q_1$. This procedure can be applied, for instance, to the F test of the linear model in Section 6.1 by taking $q(\theta)$ equal to the noncentrality parameter governing the distribution of the statistic under the alternative. Implicit in this calculation is the assumption that $P_{\theta_1}[T \ge c_0]$ is an increasing function of $n$.

We have seen in Example 4.1.5 that a particular test statistic can have a fixed distribution $\mathcal{L}_0$ under the hypothesis. It may also happen that the distribution of $T_n$ as $\theta$ ranges over $\Theta_1$ is determined by a one-dimensional parameter $\lambda(\theta)$ so that $\Theta_0 = \{\theta : \lambda(\theta) = 0\}$ and $\Theta_1 = \{\theta : \lambda(\theta) > 0\}$ and $\mathcal{L}_\theta(T_n) = \mathcal{L}_{\lambda(\theta)}(T_n)$ for all $\theta$. The theory we have developed demonstrates that if $\mathcal{L}_\lambda(T_n)$ is an MLR family, then rejecting for large values of $T_n$ is UMP among all tests based on $T_n$. Reducing the problem to choosing among such tests comes from invariance consideration that we do not enter into until Volume II. However, we illustrate what can happen with a simple example.

Example 4.3.7. Testing Precision Continued. Suppose that in the Gaussian model of Example 4.3.4, $\mu$ is unknown. Then the MLE of $\sigma^2$ is $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ as in Example 2.2.9. Although $H : \sigma = \sigma_0$ is now composite, the distribution of $T_n = n\hat{\sigma}^2/\sigma_0^2$ is $\chi^2_{n-1}$, independent of $\mu$. Thus, the critical value for testing $H : \sigma = \sigma_0$ versus $K : \sigma < \sigma_0$ and rejecting $H$ if $T_n$ is small, is the $\alpha$ percentile of $\chi^2_{n-1}$. It is evident from the argument of Example 4.3.3 that this test is UMP for $H : \sigma \ge \sigma_0$ versus $K : \sigma < \sigma_0$ among all tests depending on $\hat{\sigma}^2$ only. □

Complete Families of Tests

The Neyman-Pearson framework is based on using the 0-1 loss function. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions $l(\theta, a)$, $a \in A = \{0, 1\}$, $\theta \in \Theta$, that are not 0-1. For instance, for $\Theta_1 = (\theta_0, \infty)$, we may consider $l(\theta, 0) = (\theta - \theta_0)$, $\theta \in \Theta_1$. In general, when testing $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$, a reasonable class of loss functions are those that satisfy

$$\begin{aligned} l(\theta, 1) - l(\theta, 0) &> 0 \quad \text{for } \theta < \theta_0 \\ l(\theta, 1) - l(\theta, 0) &< 0 \quad \text{for } \theta > \theta_0. \end{aligned} \tag{4.3.4}$$
The class $D$ of decision procedures is said to be complete$^{(1),(2)}$ if for any decision rule $\varphi$ there exists $\delta \in D$ such that

$$R(\theta, \delta) \le R(\theta, \varphi) \text{ for all } \theta \in \Theta. \tag{4.3.5}$$

That is, if the model is correct and loss function is appropriate, then any procedure not in the complete class can be matched or improved at all $\theta$ by one in the complete class. Thus, it isn't worthwhile to look outside of complete classes. In the following the decision procedures are test functions.

Theorem 4.3.2. Suppose $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R$, is an MLR family in $T(x)$ and suppose the loss function $l(\theta, a)$ satisfies (4.3.4), then the class of tests of the form (4.3.3) with $E_\theta\delta_t(X) = \alpha$, $0 \le \alpha \le 1$, is complete.

Proof. The risk function of any test rule $\varphi$ is

$$R(\theta, \varphi) = E_\theta\{\varphi(X)l(\theta, 1) + [1 - \varphi(X)]l(\theta, 0)\} = E_\theta\{l(\theta, 0) + [l(\theta, 1) - l(\theta, 0)]\varphi(X)\}.$$

Let $\delta_t(X)$ be such that, for some $\theta_0$, $E_{\theta_0}\delta_t(X) = E_{\theta_0}\varphi(X) > 0$. If $E_\theta\varphi(X) \equiv 0$ for all $\theta$ then $\delta_\infty(X)$ clearly satisfies (4.3.5). Now $\delta_t$ is UMP for $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$ by Theorem 4.3.1 and, hence,

$$R(\theta, \delta_t) - R(\theta, \varphi) = (l(\theta, 1) - l(\theta, 0))(E_\theta(\delta_t(X)) - E_\theta(\varphi(X))) \le 0 \text{ for } \theta > \theta_0. \tag{4.3.6}$$

But $1 - \delta_t$ is similarly UMP for $H : \theta \ge \theta_0$ versus $K : \theta < \theta_0$ (Problem 4.3.12) and, hence, $E_\theta(1 - \delta_t(X)) = 1 - E_\theta\delta_t(X) \ge 1 - E_\theta\varphi(X)$ for $\theta < \theta_0$. Thus, (4.3.5) holds for all $\theta$. □

Summary. We consider models $\{P_\theta : \theta \in \Theta\}$ for which there exist tests that are most powerful for every $\theta$ in a composite alternative $\Theta_1$ (UMP tests). For $\theta$ real, a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing $\theta_0$ versus $\theta_1$ is an increasing function of a statistic $T(x)$ for every $\theta_0 < \theta_1$. For MLR models, the test that rejects $H : \theta \le \theta_0$ for large values of $T(x)$ is UMP for $K : \theta > \theta_0$. In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from $H$. Finally, we show that for MLR models, the class of NP tests is complete in the sense that for loss functions other than the 0-1 loss function, the risk of any procedure can be matched or improved by an NP test. We also show how, when UMP tests do not exist, locally most powerful (LMP) tests in some cases can be found.

4.4 CONFIDENCE BOUNDS, INTERVALS, AND REGIONS

We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter $\theta$ is a
member of a specified set $\Theta_0$. Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability $1 - \alpha$. As an illustration consider Example 4.1.4 where $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. Suppose that $\mu$ represents the mean increase in sleep among patients administered a drug. Then we can use the experimental outcome $X = (X_1, \ldots, X_n)$ to establish a lower bound $\underline{\mu}(X)$ for $\mu$ with a prescribed probability $(1 - \alpha)$ of being correct. In the non-Bayesian framework, $\mu$ is a constant, and we look for a statistic $\underline{\mu}(X)$ that satisfies $P(\underline{\mu}(X) \le \mu) = 1 - \alpha$ with $1 - \alpha$ equal to .95 or some other desired level of confidence. In our example this is achieved by writing

$$P\left(\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \le z(1 - \alpha)\right) = 1 - \alpha.$$

By solving the inequality inside the probability for $\mu$, we find

$$P(\bar{X} - \sigma z(1 - \alpha)/\sqrt{n} \le \mu) = 1 - \alpha$$

and $\underline{\mu}(X) = \bar{X} - \sigma z(1 - \alpha)/\sqrt{n}$ is a lower bound with $P(\underline{\mu}(X) \le \mu) = 1 - \alpha$. We say that $\underline{\mu}(X)$ is a lower confidence bound with confidence level $1 - \alpha$.

Similarly, as in (1.3.8), we may be interested in an upper bound on a parameter. In the $N(\mu, \sigma^2)$ example this means finding a statistic $\bar{\mu}(X)$ such that $P(\bar{\mu}(X) \ge \mu) = 1 - \alpha$; and a solution is

$$\bar{\mu}(X) = \bar{X} + \sigma z(1 - \alpha)/\sqrt{n}.$$

Here $\bar{\mu}(X)$ is called an upper level $(1 - \alpha)$ confidence bound for $\mu$.

Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find $a$ such that the probability that the interval $[\bar{X} - a, \bar{X} + a]$ contains $\mu$ is $1 - \alpha$. We find such an interval by noting

$$P\left(\left|\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}\right| \le z(1 - \tfrac{1}{2}\alpha)\right) = 1 - \alpha$$

and solving the inequality inside the probability for $\mu$. This gives

$$P(\mu^-(X) \le \mu \le \mu^+(X)) = 1 - \alpha$$

where

$$\mu^{\pm}(X) = \bar{X} \pm \sigma z(1 - \tfrac{1}{2}\alpha)/\sqrt{n}.$$

We say that $[\mu^-(X), \mu^+(X)]$ is a level $(1 - \alpha)$ confidence interval for $\mu$.

In general, if $\nu = \nu(P)$, $P \in \mathcal{P}$, is a parameter, and $X \sim P$, $X \in R^q$, it may not be possible for a bound or interval to achieve exactly probability $(1 - \alpha)$ for a prescribed $(1 - \alpha)$ such as .95. In this case, we settle for a probability at least $(1 - \alpha)$. That is,

Definition 4.4.1. A statistic $\underline{\nu}(X)$ is called a level $(1 - \alpha)$ lower confidence bound for $\nu$ if for every $P \in \mathcal{P}$,

$$P[\underline{\nu}(X) \le \nu] \ge 1 - \alpha.$$

Similarly, $\bar{\nu}(X)$ is called a level $(1 - \alpha)$ upper confidence bound for $\nu$ if for every $P \in \mathcal{P}$,

$$P[\bar{\nu}(X) \ge \nu] \ge 1 - \alpha.$$

Moreover, the random interval $[\underline{\nu}(X), \bar{\nu}(X)]$ formed by a pair of statistics $\underline{\nu}(X)$, $\bar{\nu}(X)$ is a level $(1 - \alpha)$ or a $100(1 - \alpha)\%$ confidence interval for $\nu$ if, for all $P \in \mathcal{P}$,

$$P[\underline{\nu}(X) \le \nu \le \bar{\nu}(X)] \ge 1 - \alpha.$$

The quantities on the left are called the probabilities of coverage and $(1 - \alpha)$ is called a confidence level.

For a given bound or interval, the confidence level is clearly not unique because any number $(1 - \alpha') \le (1 - \alpha)$ will be a confidence level if $(1 - \alpha)$ is. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level. Note that in the case of intervals this is just

$$\inf\{P[\underline{\nu}(X) \le \nu \le \bar{\nu}(X)],\ P \in \mathcal{P}\}$$

(i.e., the minimum probability of coverage). For the normal measurement problem we have just discussed the probability of coverage is independent of $P$ and equals the confidence coefficient.

Example 4.4.1. The (Student) t Interval and Bounds. Let $X_1, \ldots, X_n$ be a sample from a $N(\mu, \sigma^2)$ population, and assume initially that $\sigma^2$ is known. In the preceding discussion we used the fact that $Z(\mu) = \sqrt{n}(\bar{X} - \mu)/\sigma$ has a $N(0, 1)$ distribution to obtain a confidence interval for $\mu$ by solving $-z(1 - \tfrac{1}{2}\alpha) \le Z(\mu) \le z(1 - \tfrac{1}{2}\alpha)$ for $\mu$. In this process $Z(\mu)$ is called a pivot. In general, finding confidence intervals (or bounds) often involves finding appropriate pivots. Now we turn to the $\sigma^2$ unknown case and propose the pivot $T(\mu)$ obtained by replacing $\sigma$ in $Z(\mu)$ by its estimate $s$, where

$$s^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$

That is, we will need the distribution of

$$T(\mu) = \frac{\sqrt{n}(\bar{X} - \mu)}{s}.$$

Now $Z(\mu) = \sqrt{n}(\bar{X} - \mu)/\sigma$ has a $N(0, 1)$ distribution and is, by Theorem B.3.3, independent of $V = (n - 1)s^2/\sigma^2$, which has a $\chi^2_{n-1}$ distribution. We conclude from the definition of the (Student) t distribution in Section B.3.1 that $Z(\mu)/\sqrt{V/(n - 1)} = T(\mu)$ has the t distribution $\mathcal{T}_{n-1}$ whatever be $\mu$ and $\sigma^2$.
Let $t_k(p)$ denote the $p$th quantile of the $\mathcal{T}_k$ distribution. Then

$$P\left(-t_{n-1}(1-\tfrac12\alpha) \le T(\mu) \le t_{n-1}(1-\tfrac12\alpha)\right) = 1-\alpha.$$

Solving the inequality inside the probability for $\mu$, we find

$$P\left[\bar X - s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt n \le \mu \le \bar X + s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt n\right] = 1-\alpha.$$

The shortest level $(1-\alpha)$ confidence interval of the type $\bar X \pm sc/\sqrt n$ is, thus,

$$\left[\bar X - s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt n,\ \bar X + s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt n\right]. \qquad (4.4.1)$$

Similarly, $\bar X - s\,t_{n-1}(1-\alpha)/\sqrt n$ and $\bar X + s\,t_{n-1}(1-\alpha)/\sqrt n$ are natural lower and upper confidence bounds with confidence coefficients $(1-\alpha)$.

To calculate the coefficients $t_{n-1}(1-\tfrac12\alpha)$ and $t_{n-1}(1-\alpha)$, we use a calculator, computer software, or Tables I and II. For instance, if $n = 9$ and $\alpha = 0.01$, we enter Table II to find that the probability that a $\mathcal{T}_{n-1}$ variable exceeds 3.355 is 0.005. Hence, $t_{n-1}(1-\tfrac12\alpha) = 3.355$ and

$$[\bar X - 3.355s/3,\ \bar X + 3.355s/3]$$

is the desired level 0.99 confidence interval.

From the results of Section B.7 (see Problem B.7.12), we see that as $n \to \infty$ the $\mathcal{T}_{n-1}$ distribution converges in law to the standard normal distribution. For the usual values of $\alpha$, we can reasonably replace $t_{n-1}(p)$ by the standard normal quantile $z(p)$ for $n > 120$.

Up to this point, we have assumed that $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$. It turns out that the distribution of the pivot $T(\mu)$ is fairly close to the $\mathcal{T}_{n-1}$ distribution if the $X$'s have a distribution that is nearly symmetric and whose tails are not much heavier than the normal. In this case the interval (4.4.1) has confidence coefficient close to $1-\alpha$. On the other hand, for very skew distributions such as the $\chi^2$ with few degrees of freedom, or very heavy-tailed distributions such as the Cauchy, the confidence coefficient of (4.4.1) can be much larger than $1-\alpha$. The properties of confidence intervals such as (4.4.1) in non-Gaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5. See Figure 5.3.1. If we assume $\sigma^2 < \infty$, the interval will have probability $(1-\alpha)$ in the limit as $n \to \infty$. □

Example 4.4.2. Confidence Intervals and Bounds for the Variance of a Normal Distribution. Suppose that $X_1, \ldots, X_n$ is a sample from a $N(\mu, \sigma^2)$ population. By Theorem B.3.1, $V(\sigma^2) = (n-1)s^2/\sigma^2$ has a $\chi^2_{n-1}$ distribution and can be used as a pivot. Thus, if we let $x_{n-1}(p)$ denote the $p$th quantile of the $\chi^2_{n-1}$ distribution, and if $\alpha_1 + \alpha_2 = \alpha$, then

$$P\left(x_{n-1}(\alpha_1) \le V(\sigma^2) \le x_{n-1}(1-\alpha_2)\right) = 1-\alpha.$$

By solving the inequality inside the probability for $\sigma^2$ we find that

$$\left[(n-1)s^2/x_{n-1}(1-\alpha_2),\ (n-1)s^2/x_{n-1}(\alpha_1)\right] \qquad (4.4.2)$$

is a confidence interval with confidence coefficient $(1-\alpha)$.
The length of this interval is random. There is a unique choice of $\alpha_1$ and $\alpha_2$, which uniformly minimizes expected length among all intervals of this type. It may be shown that for $n$ large, taking $\alpha_1 = \alpha_2 = \tfrac12\alpha$ is not far from optimal (Tate and Klett, 1959).

The pivot $V(\sigma^2)$ similarly yields the respective lower and upper confidence bounds $(n-1)s^2/x_{n-1}(1-\alpha)$ and $(n-1)s^2/x_{n-1}(\alpha)$.

In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for $\sigma^2$ do not have confidence coefficient $1-\alpha$ even in the limit as $n \to \infty$. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution, which typically is unknown. In Problem 4.4.16 we give an interval with correct limiting coverage probability. □

The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in n Bernoulli Trials. If $X_1, \ldots, X_n$ are the indicators of $n$ Bernoulli trials with probability of success $\theta$, then $\bar X$ is the MLE of $\theta$. There is no natural "exact" pivot based on $\bar X$ and $\theta$. However, by the De Moivre-Laplace theorem, $\sqrt n(\bar X - \theta)/\sqrt{\theta(1-\theta)}$ has approximately a $N(0,1)$ distribution. If we use this function as an "approximate" pivot and let $\approx$ denote "approximate equality," we can write

$$P\left[\left|\frac{\sqrt n(\bar X - \theta)}{\sqrt{\theta(1-\theta)}}\right| \le z(1-\tfrac12\alpha)\right] \approx 1-\alpha.$$

Let $k_\alpha = z(1-\tfrac12\alpha)$ and observe that this is equivalent to

$$P\left[(\bar X - \theta)^2 \le \frac{k_\alpha^2}{n}\theta(1-\theta)\right] = P[g(\theta, \bar X) \le 0] \approx 1-\alpha$$

where

$$g(\theta, \bar X) = \left(1 + \frac{k_\alpha^2}{n}\right)\theta^2 - \left(2\bar X + \frac{k_\alpha^2}{n}\right)\theta + \bar X^2.$$

For fixed $0 \le \bar X \le 1$, $g(\theta, \bar X)$ is a quadratic polynomial with two real roots. In terms of $S = n\bar X$, they are (1)

$$\underline\theta(X) = \left\{S + \frac{k_\alpha^2}{2} - k_\alpha\sqrt{[S(n-S)/n] + k_\alpha^2/4}\right\} \Big/ (n + k_\alpha^2)$$
$$\bar\theta(X) = \left\{S + \frac{k_\alpha^2}{2} + k_\alpha\sqrt{[S(n-S)/n] + k_\alpha^2/4}\right\} \Big/ (n + k_\alpha^2). \qquad (4.4.3)$$

Because the coefficient of $\theta^2$ in $g(\theta, \bar X)$ is greater than zero,

$$[\theta : g(\theta, \bar X) \le 0] = [\underline\theta(X) \le \theta \le \bar\theta(X)], \qquad (4.4.4)$$

so that $[\underline\theta(X), \bar\theta(X)]$ is an approximate level $(1-\alpha)$ confidence interval for $\theta$.
We can similarly show that the endpoints of the level $(1-2\alpha)$ interval are approximate upper and lower level $(1-\alpha)$ confidence bounds. These intervals and bounds are satisfactory in practice for the usual levels, if the smaller of $n\theta$, $n(1-\theta)$ is at least 6. For small $n$, it is better to use the exact level $(1-\alpha)$ procedure developed in Section 4.5. A discussion is given in Brown, Cai, and Das Gupta (2000).

Note that in this example we can determine the sample size needed for desired accuracy. For instance, consider the market researcher whose interest is the proportion $\theta$ of a population that will buy a product. He draws a sample of $n$ potential customers, calls willingness to buy success, and uses the preceding model. He can then determine how many customers should be sampled so that (4.4.4) has length 0.02 and is a confidence interval with confidence coefficient approximately 0.95. To see this, note that the length, say $l$, of the interval is

$$l = 2k_\alpha\left\{\sqrt{[S(n-S)/n] + k_\alpha^2/4}\right\}(n + k_\alpha^2)^{-1}.$$

Now use the fact that

$$S(n-S)/n = \tfrac14 n - n^{-1}(S - \tfrac12 n)^2 \le \tfrac14 n \qquad (4.4.5)$$

to conclude that

$$l \le k_\alpha/\sqrt{n + k_\alpha^2}. \qquad (4.4.6)$$

Thus, to bound $l$ above by $l_0 = 0.02$, we choose $n$ so that $k_\alpha(n + k_\alpha^2)^{-1/2} = l_0$. That is, we choose

$$n = \left(\frac{k_\alpha}{l_0}\right)^2 - k_\alpha^2.$$

In this case, $1 - \tfrac12\alpha = 0.975$, $k_\alpha = z(1-\tfrac12\alpha) = 1.96$, and we can achieve the desired length 0.02 by choosing $n$ so that

$$n = \left(\frac{1.96}{0.02}\right)^2 - (1.96)^2 = 9{,}600.16, \text{ or } n = 9{,}601.$$

This formula for the sample size is very crude because (4.4.5) is used and it is only good when $\theta$ is near 1/2. Better results can be obtained if one has upper or lower bounds on $\theta$ such as $\theta \le \theta_0 < \tfrac12$, $\theta \ge \theta_1 > \tfrac12$. See Problem 4.4.4.

Another approximate pivot for this example is $\sqrt n(\bar X - \theta)/\sqrt{\bar X(1-\bar X)}$. This leads to the simple interval

$$\bar X \pm z(1-\tfrac12\alpha)\sqrt{\bar X(1-\bar X)}/\sqrt n. \qquad (4.4.7)$$

See Brown, Cai, and Das Gupta (2000) for a discussion. □
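The roots (4.4.3) and the crude sample-size rule derived from (4.4.6) are easy to check numerically. The following is a minimal sketch, not part of the original text; the function names `approx_interval` and `sample_size` are hypothetical, and the value of $k_\alpha$ is passed in by the caller (1.96 below, as in the market research illustration).

```python
import math


def approx_interval(s, n, k):
    """Roots of g(theta, xbar) = 0 as in (4.4.3); k plays the role of z(1 - alpha/2)."""
    center = s + k * k / 2.0
    half = k * math.sqrt(s * (n - s) / n + k * k / 4.0)
    denom = n + k * k
    return (center - half) / denom, (center + half) / denom


def sample_size(k, target_length):
    """Smallest integer n with k / sqrt(n + k^2) <= target_length, from (4.4.6)."""
    return math.ceil((k / target_length) ** 2 - k * k)


# Hypothetical data: 40 successes in 100 trials, k = 1.96.
lower, upper = approx_interval(40, 100, 1.96)
n_needed = sample_size(1.96, 0.02)
```

The length `upper - lower` never exceeds `1.96 / math.sqrt(100 + 1.96**2)`, which is the bound (4.4.6), and `sample_size(1.96, 0.02)` reproduces the value 9,601 obtained in the text.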
Confidence Regions for Functions of Parameters

We can define confidence regions for a function $q(\theta)$ as random subsets of the range of $q$ that cover the true value of $q(\theta)$ with probability at least $(1-\alpha)$. Note that if $C(X)$ is a level $(1-\alpha)$ confidence region for $\theta$, then $q(C(X)) = \{q(\theta) : \theta \in C(X)\}$ is a level $(1-\alpha)$ confidence region for $q(\theta)$.

Example 4.4.4. Let $X_1, \ldots, X_n$ denote the number of hours a sample of internet subscribers spend per week on the Internet. Suppose $X_1, \ldots, X_n$ is modeled as a sample from an exponential, $\mathcal{E}(\theta^{-1})$, distribution, and suppose we want a confidence interval for the population proportion $P(X \ge x)$ of subscribers that spend at least $x$ hours per week on the Internet. Here $q(\theta) = 1 - F(x) = \exp\{-x/\theta\}$. By a problem in Appendix B, $2n\bar X/\theta$ has a chi-square, $\chi^2_{2n}$, distribution. By using $2n\bar X/\theta$ as a pivot we find the $(1-\alpha)$ confidence interval

$$2n\bar X/x(1-\tfrac12\alpha) \le \theta \le 2n\bar X/x(\tfrac12\alpha)$$

where $x(\beta)$ denotes the $\beta$th quantile of the $\chi^2_{2n}$ distribution. Let $\underline\theta$ and $\bar\theta$ denote the lower and upper boundaries of this interval, then

$$\exp\{-x/\underline\theta\} \le q(\theta) \le \exp\{-x/\bar\theta\}$$

is a confidence interval for $q(\theta)$ with confidence coefficient $(1-\alpha)$. □

If $q$ is not 1-1, this technique is typically wasteful. That is, we can find confidence regions for $q(\theta)$ entirely contained in $q(C(X))$ with confidence level $(1-\alpha)$. For instance, we will later give confidence regions $C(X)$ for pairs $\theta = (\theta_1, \theta_2)^T$. In this case, if $q(\theta) = \theta_1$, $q(C(X))$ is larger than the confidence set obtained by focusing on $\theta_1$ alone.

Confidence Regions of Higher Dimension

We can extend the notion of a confidence interval for one-dimensional functions $q(\theta)$ to $r$-dimensional vectors $q(\theta) = (q_1(\theta), \ldots, q_r(\theta))$. Suppose $\underline q_j(X)$ and $\bar q_j(X)$ are real-valued. Then the $r$-dimensional random rectangle

$$I(X) = \{q(\theta) : \underline q_j(X) \le q_j(\theta) \le \bar q_j(X),\ j = 1, \ldots, r\}$$

is said to be a level $(1-\alpha)$ confidence region, if the probability that it covers the unknown but fixed true $(q_1(\theta), \ldots, q_r(\theta))$ is at least $(1-\alpha)$. We write this as

$$P[q(\theta) \in I(X)] \ge 1-\alpha.$$

Note that if $I_j(X) = [\underline T_j, \bar T_j]$ is a level $(1-\alpha_j)$ confidence interval for $q_j(\theta)$ and if the pairs $(\underline T_1, \bar T_1), \ldots, (\underline T_r, \bar T_r)$ are independent, then the rectangle $I(X) = I_1(X) \times \cdots \times I_r(X)$ has level

$$\prod_{j=1}^r (1-\alpha_j). \qquad (4.4.8)$$
Thus, if we choose $\alpha_j = 1 - (1-\alpha)^{1/r}$, $j = 1, \ldots, r$, then $I(X)$ has confidence level $(1-\alpha)$.

An approach that works even if the $I_j$ are not independent is to use Bonferroni's inequality (A.2.7). According to this inequality,

$$P[q(\theta) \in I(X)] \ge 1 - \sum_{j=1}^r P[q_j(\theta) \notin I_j(X)] \ge 1 - \sum_{j=1}^r \alpha_j.$$

Thus, if we choose $\alpha_j = \alpha/r$, $j = 1, \ldots, r$, then $I(X)$ has confidence level $1-\alpha$. That is, an $r$-dimensional confidence rectangle is in this case automatically obtained from the one-dimensional intervals.

Example 4.4.5. Confidence Rectangle for the Parameters of a Normal Distribution. Suppose $X_1, \ldots, X_n$ is a $N(\mu, \sigma^2)$ sample and we want a confidence rectangle for $(\mu, \sigma^2)$. From Example 4.4.1,

$$I_1(X) = \bar X \pm s\,t_{n-1}(1-\tfrac14\alpha)/\sqrt n$$

is a confidence interval for $\mu$ with confidence coefficient $(1-\tfrac12\alpha)$. From Example 4.4.2,

$$I_2(X) = \left[\frac{(n-1)s^2}{x_{n-1}(1-\tfrac14\alpha)},\ \frac{(n-1)s^2}{x_{n-1}(\tfrac14\alpha)}\right]$$

is a reasonable confidence interval for $\sigma^2$ with confidence coefficient $(1-\tfrac12\alpha)$. Thus, $I_1(X) \times I_2(X)$ is a level $(1-\alpha)$ confidence rectangle for $(\mu, \sigma^2)$. The exact confidence coefficient can also be obtained; see Problem 4.4.15. □

The method of pivots can also be applied to $\infty$-dimensional parameters such as $F$.

Example 4.4.6. Suppose $X_1, \ldots, X_n$ are i.i.d. as $X \sim P$, and we are interested in the distribution function $F(t) = P(X \le t)$; that is, $\nu(P) = F(\cdot)$. We assume that $F$ is continuous, in which case (Proposition 4.1.1) the distribution of

$$D_n(F) = \sup_{t \in R} |\hat F(t) - F(t)|$$

does not depend on $F$ and is known (Example 4.1.5). That is, $D_n(F)$ is a pivot. Let $d_\alpha$ be chosen such that $P_F(D_n(F) \le d_\alpha) = 1-\alpha$. Then by solving $D_n(F) \le d_\alpha$ for $F$, we find that a simultaneous in $t$ size $1-\alpha$ confidence region $C(x)(\cdot)$ is the confidence band which, for each $t \in R$, consists of the interval

$$C(x)(t) = \left(\max\{0, \hat F(t) - d_\alpha\},\ \min\{1, \hat F(t) + d_\alpha\}\right).$$

We have shown

$$P(C(X)(t) \ni F(t) \text{ for all } t \in R) = 1-\alpha$$

for all $P \in \mathcal{P}$ = set of $P$ with $P(-\infty, t]$ continuous in $t$. □
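The two ways of splitting $\alpha$ among the $r$ one-dimensional intervals of a confidence rectangle, the independent-case choice $\alpha_j = 1 - (1-\alpha)^{1/r}$ and the Bonferroni choice $\alpha_j = \alpha/r$, can be compared numerically. This is an illustrative sketch only; the function names are hypothetical and the values $\alpha = 0.05$, $r = 2$ are chosen for the example.

```python
def independent_allocation(alpha, r):
    """alpha_j = 1 - (1 - alpha)**(1/r), so that the product in (4.4.8) equals 1 - alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / r)


def bonferroni_allocation(alpha, r):
    """alpha_j = alpha / r, so that the Bonferroni bound 1 - sum(alpha_j) equals 1 - alpha."""
    return alpha / r


alpha, r = 0.05, 2
a_ind = independent_allocation(alpha, r)
a_bon = bonferroni_allocation(alpha, r)

level_ind = (1.0 - a_ind) ** r   # level of the rectangle when the intervals are independent
bound_bon = 1.0 - r * a_bon      # Bonferroni lower bound, valid without independence
```

Both allocations guarantee level $1-\alpha$ for the rectangle. The Bonferroni $\alpha_j$ is slightly smaller, so the individual intervals are slightly wider; that is the price of not requiring independence.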
We can apply the notions studied in Examples 4.4.4 and 4.4.5 to give confidence regions for scalar or vector parameters in nonparametric models.

Example 4.4.7. A Lower Confidence Bound for the Mean of a Nonnegative Random Variable. Suppose $X_1, \ldots, X_n$ are i.i.d. as $X$ and that $X$ has a density $f(t) = F'(t)$, which is zero for $t < 0$ and nonzero for $t > 0$. By integration by parts, if $\mu = \mu(F) = \int_0^\infty t f(t)\,dt$ exists, then

$$\mu = \int_0^\infty [1 - F(t)]\,dt.$$

Let $\hat F^-(t)$ and $\hat F^+(t)$ be the lower and upper simultaneous confidence boundaries of Example 4.4.6. Then a $(1-\alpha)$ lower confidence bound for $\mu$ is $\underline\mu$ given by

$$\underline\mu = \int_0^\infty [1 - \hat F^+(t)]\,dt \qquad (4.4.9)$$

because for $C(X)$ as in Example 4.4.6, $\underline\mu = \inf\{\mu(F) : F \in C(X)\} = \mu(\hat F^+)$ and $\sup\{\mu(F) : F \in C(X)\} = \mu(\hat F^-) = \infty$; see Problem 4.4.19. □

Intervals for the case $F$ supported on an interval (see Problem 4.4.18) arise in accounting practice (see Bickel, 1992) where such bounds are discussed and shown to be asymptotically strictly conservative.

Summary. We define lower and upper confidence bounds (LCBs and UCBs), confidence intervals, and more generally confidence regions. In a parametric model $\{P_\theta : \theta \in \Theta\}$, a level $1-\alpha$ confidence region for a parameter $q(\theta)$ is a set $C(x)$ depending only on the data $x$ such that the probability under $P_\theta$ that $C(X)$ covers $q(\theta)$ is at least $1-\alpha$ for all $\theta \in \Theta$. For a nonparametric class $\mathcal{P} = \{P\}$ and parameter $\nu = \nu(P)$, we similarly require $P(C(X) \ni \nu) \ge 1-\alpha$ for all $P \in \mathcal{P}$. We derive the (Student) t interval for $\mu$ in the $N(\mu, \sigma^2)$ model with $\sigma^2$ unknown, and we derive an exact confidence interval for the binomial parameter. In a nonparametric setting we derive a simultaneous confidence interval for the distribution function $F(t)$ and the mean of a positive variable $X$.

4.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS

Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least $1-\alpha$. Acceptance regions of statistical tests are, for a given hypothesis $H$, subsets of the sample space with probability of accepting $H$ at least $1-\alpha$ when $H$ is true. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses.

We begin by illustrating the duality in the following example.

Example 4.5.1. Two-Sided Tests for the Mean of a Normal Distribution. Suppose that an established theory postulates the value $\mu_0$ for a certain physical constant. A scientist has reasons to believe that the theory is incorrect and measures the constant $n$ times obtaining measurements $X_1, \ldots, X_n$.
Knowledge of his instruments leads him to assume that the $X_i$ are independent and identically distributed normal random variables with mean $\mu$ and variance $\sigma^2$. If any value of $\mu$ other than $\mu_0$ is a possible alternative, then it is reasonable to formulate the problem as that of testing $H : \mu = \mu_0$ versus $K : \mu \ne \mu_0$.

We can base a size $\alpha$ test on the level $(1-\alpha)$ confidence interval (4.4.1) we constructed for $\mu$ as follows. We accept $H$, if and only if, the postulated value $\mu_0$ is a member of the level $(1-\alpha)$ confidence interval

$$\left[\bar X - s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt n,\ \bar X + s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt n\right]. \qquad (4.5.1)$$

If we let $T = \sqrt n(\bar X - \mu_0)/s$, then our test accepts $H$, if and only if, $-t_{n-1}(1-\tfrac12\alpha) \le T \le t_{n-1}(1-\tfrac12\alpha)$. Because $P_{\mu_0}[|T| = t_{n-1}(1-\tfrac12\alpha)] = 0$ the test is equivalently characterized by rejecting $H$ when $|T| \ge t_{n-1}(1-\tfrac12\alpha)$. This test is called two-sided because it rejects for both large and small values of the statistic $T$. In contrast to the tests of Example 4.1.4, it has power against parameter values on either side of $\mu_0$.

Because the same interval (4.4.1) is used for every $\mu_0$ we see that we have, in fact, generated a family of level $\alpha$ tests $\{\delta(X, \mu)\}$ where

$$\delta(X, \mu) = \begin{cases} 1 & \text{if } \sqrt n\,|\bar X - \mu|/s \ge t_{n-1}(1-\tfrac12\alpha) \\ 0 & \text{otherwise.} \end{cases} \qquad (4.5.2)$$

These tests correspond to different hypotheses, $\delta(X, \mu_0)$ being of size $\alpha$ only for the hypothesis $H : \mu = \mu_0$.

Conversely, by starting with the test (4.5.2) we obtain the confidence interval (4.5.1) by finding the set of $\mu$ where $\delta(X, \mu) = 0$.

We achieve a similar effect, generating a family of level $\alpha$ tests, if we start out with (say) the level $(1-\alpha)$ LCB $\bar X - t_{n-1}(1-\alpha)s/\sqrt n$ and define $\delta^*(X, \mu)$ to equal 1 if, and only if, $\bar X - t_{n-1}(1-\alpha)s/\sqrt n \ge \mu$. Evidently,

$$P_{\mu_0}[\delta^*(X, \mu_0) = 1] = P_{\mu_0}\left[\sqrt n\,\frac{\bar X - \mu_0}{s} \ge t_{n-1}(1-\alpha)\right] = \alpha. \qquad \square$$

These are examples of a general phenomenon. Consider the general framework where the random vector $X$ takes values in the sample space $\mathcal{X} \subset R^q$ and $X$ has distribution $P \in \mathcal{P}$. Let $\nu = \nu(P)$ be a parameter that takes values in the set $\mathcal{N}$. For instance, in Example 4.4.1, $\mu = \mu(P)$ takes values in $\mathcal{N} = (-\infty, \infty)$, in Example 4.4.2, $\sigma^2 = \sigma^2(P)$ takes values in $\mathcal{N} = (0, \infty)$, and in Example 4.4.5, $(\mu, \sigma^2)$ takes values in $\mathcal{N} = (-\infty, \infty) \times (0, \infty)$. For a function space example, consider $\nu(P) = F$, as in Example 4.4.6, where $F$ is the distribution function of $X_i$. Here an example of $\mathcal{N}$ is the class of all continuous distribution functions. Let $S = S(X)$ be a map from $\mathcal{X}$ to subsets of $\mathcal{N}$, then $S$ is a $(1-\alpha)$ confidence region for $\nu$ if the probability that $S(X)$ contains $\nu$ is at least $(1-\alpha)$, that is

$$P[\nu \in S(X)] \ge 1-\alpha, \text{ all } P \in \mathcal{P}.$$
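The duality of Example 4.5.1 can be checked directly: a value $\mu_0$ is accepted by the two-sided t test exactly when it lies in the t interval. The sketch below is illustrative only. The data are hypothetical, the function names are not from the text, and the quantile $t_8(0.995) = 3.355$ is the table value quoted in Example 4.4.1, supplied by the caller rather than computed. Rejection is coded with a strict inequality so that it matches membership in the closed interval; the boundary case has probability zero.

```python
import math
from statistics import fmean, stdev


def t_interval(xs, t_quantile):
    """Interval xbar +/- s * t / sqrt(n); stdev uses the divisor n - 1, as s does."""
    n = len(xs)
    xbar, s = fmean(xs), stdev(xs)
    half = s * t_quantile / math.sqrt(n)
    return xbar - half, xbar + half


def two_sided_test(xs, mu0, t_quantile):
    """Return 1 (reject H: mu = mu0) or 0 (accept), based on |T| versus the quantile."""
    n = len(xs)
    t_stat = math.sqrt(n) * (fmean(xs) - mu0) / stdev(xs)
    return 1 if abs(t_stat) > t_quantile else 0


# Hypothetical measurements, n = 9, alpha = 0.01.
data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0]
lo, hi = t_interval(data, 3.355)

# The set of accepted mu0 on a grid coincides with the grid points inside the interval.
grid = [9.0 + 0.05 * i for i in range(41)]
accepted = [m for m in grid if two_sided_test(data, m, 3.355) == 0]
inside = [m for m in grid if lo <= m <= hi]
```

Here `accepted` and `inside` are the same list, which is the finite-grid version of the statement that the confidence interval is the set of $\mu_0$ for which $H$ is accepted.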
Next consider the testing framework where we test the hypothesis $H = H_{\nu_0} : \nu = \nu_0$ for some specified value $\nu_0$. Suppose we have a test $\delta(X, \nu_0)$ with level $\alpha$. Then the acceptance region

$$A(\nu_0) = \{x : \delta(x, \nu_0) = 0\}$$

is a subset of $\mathcal{X}$ with probability at least $1-\alpha$. For some specified $\nu_0$, $H$ may be accepted, for other specified $\nu_0$, $H$ may be rejected. Consider the set of $\nu_0$ for which $H_{\nu_0}$ is accepted; this is a random set contained in $\mathcal{N}$ with probability at least $1-\alpha$ of containing the true value of $\nu(P)$ whatever be $P$. Conversely, if $S(X)$ is a level $1-\alpha$ confidence region for $\nu$, then the test that accepts $H_{\nu_0}$ if and only if $\nu_0$ is in $S(X)$, is a level $\alpha$ test for $H_{\nu_0}$.

Formally, let $\mathcal{P}_{\nu_0} = \{P : \nu(P) = \nu_0\}$, $\nu_0 \in \mathcal{N}$. We have the following.

Theorem 4.5.1. Duality Theorem. Let $S(X) = \{\nu_0 \in \mathcal{N} : X \in A(\nu_0)\}$, then

$$P[X \in A(\nu_0)] \ge 1-\alpha \text{ for all } P \in \mathcal{P}_{\nu_0}$$

if and only if $S(X)$ is a $1-\alpha$ confidence region for $\nu$.

We next apply the duality theorem to MLR families:

Theorem 4.5.2. Suppose $X \sim P_\theta$ where $\{P_\theta : \theta \in \Theta\}$ is MLR in $T = T(X)$ and suppose that the distribution function $F_\theta(t)$ of $T$ under $P_\theta$ is continuous in each of the variables $t$ and $\theta$ when the other is fixed. If the equation $F_\theta(t) = 1-\alpha$ has a solution $\underline\theta_\alpha(t)$ in $\Theta$, then $\underline\theta_\alpha(T)$ is a lower confidence bound for $\theta$ with confidence coefficient $1-\alpha$. Similarly, any solution $\bar\theta_\alpha(T)$ of $F_\theta(T) = \alpha$ with $\bar\theta_\alpha \in \Theta$ is an upper confidence bound for $\theta$ with coefficient $(1-\alpha)$. Moreover, if $\alpha_1 + \alpha_2 < 1$, then $[\underline\theta_{\alpha_1}, \bar\theta_{\alpha_2}]$ is a confidence interval for $\theta$ with confidence coefficient $1 - (\alpha_1 + \alpha_2)$.

Proof. By Corollary 4.3.1, the acceptance region of the UMP size $\alpha$ test of $H : \theta = \theta_0$ versus $K : \theta > \theta_0$ can be written

$$A(\theta_0) = \{x : T(x) \le t_{\theta_0}(1-\alpha)\}$$

where $t_{\theta_0}(1-\alpha)$ is the $1-\alpha$ quantile of $F_{\theta_0}$. By the duality theorem, if

$$S(t) = \{\theta \in \Theta : t \le t_\theta(1-\alpha)\},$$

then $S(T)$ is a $1-\alpha$ confidence region for $\theta$. By applying $F_\theta$ to both sides of $t \le t_\theta(1-\alpha)$, we find

$$S(t) = \{\theta \in \Theta : F_\theta(t) \le 1-\alpha\}.$$

By Theorem 4.3.1, the power function $P_\theta(T \ge t) = 1 - F_\theta(t)$ for a test with critical constant $t$ is increasing in $\theta$. That is, $F_\theta(t)$ is decreasing in $\theta$. It follows that $F_\theta(t) \le 1-\alpha$ iff $\theta \ge \underline\theta_\alpha(t)$ and $S(t) = [\underline\theta_\alpha, \infty)$. The proofs for the upper confidence bound and interval follow by the same type of argument. □

We next give connections between confidence bounds, acceptance regions, and p-values for MLR families: Let $t$ denote the observed value $t = T(x)$ of $T(X)$ for the datum $x$, let $\alpha(t, \theta_0)$ denote the p-value for the UMP size $\alpha$ test of $H : \theta = \theta_0$ versus $K : \theta > \theta_0$, and let

$$A^*(\theta) = T(A(\theta)) = \{T(x) : x \in A(\theta)\}.$$
Corollary 4.5.1. Under the conditions of Theorem 4.5.2,

$$A^*(\theta) = (-\infty, t_\theta(1-\alpha)],$$
$$S(t) = \{\theta : \alpha(t, \theta) \ge \alpha\} = [\underline\theta_\alpha(t), \infty).$$

Proof. The p-value is

$$\alpha(t, \theta) = P_\theta(T \ge t) = 1 - F_\theta(t).$$

We have seen in the proof of Theorem 4.5.2 that $1 - F_\theta(t)$ is increasing in $\theta$. Because $F_\theta(t)$ is a distribution function, $1 - F_\theta(t)$ is decreasing in $t$. The result follows. □

In general, let $\alpha(t, \nu_0)$ denote the p-value of a test $\delta(T, \nu_0) = 1[T \ge c]$ of $H : \nu = \nu_0$ based on a statistic $T = T(X)$ with observed value $t = T(x)$. Then the set

$$C = \{(t, \nu) : \alpha(t, \nu) > \alpha\} = \{(t, \nu) : \delta(t, \nu) = 0\}$$

gives the pairs $(t, \nu)$ where, for the given $t$, $\nu$ will be accepted; and for the given $\nu$, $t$ is in the acceptance region. We call $C$ the set of compatible $(t, \theta)$ points. In the $(t, \theta)$ plane, vertical sections of $C$ are the confidence regions $S(t)$ whereas horizontal sections are the acceptance regions $A^*(\nu) = \{t : \delta(t, \nu) = 0\}$. We illustrate these ideas using the example of testing $H : \mu = \mu_0$ when $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. Let $T = \bar X$, then

$$C = \left\{(t, \mu) : |t - \mu| \le \sigma z(1-\tfrac12\alpha)/\sqrt n\right\}.$$

Figure 4.5.1 shows the set $C$, a confidence region $S(t_0)$, and an acceptance set $A^*(\mu_0)$ for this example.

Example 4.5.2. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. Let $X_1, \ldots, X_n$ be the indicators of $n$ binomial trials with probability of success $\theta$. For $\alpha \in (0, 1)$, we seek reasonable exact level $(1-\alpha)$ upper and lower confidence bounds and confidence intervals for $\theta$. To find a lower confidence bound for $\theta$ our preceding discussion leads us to consider level $\alpha$ tests for $H : \theta \le \theta_0$, $\theta_0 \in (0, 1)$. We shall use some of the results derived in Example 4.1.3. Let $k(\theta_0, \alpha)$ denote the critical constant of a level $\alpha$ test of $H$. The corresponding level $(1-\alpha)$ confidence region is given by

$$C(X_1, \ldots, X_n) = \{\theta : S \le k(\theta, \alpha) - 1\},$$

where $S = \sum_{i=1}^n X_i$.

To analyze the structure of the region we need to examine $k(\theta, \alpha)$. We claim that

(i) $k(\theta, \alpha)$ is nondecreasing in $\theta$.

(ii) $k(\theta, \alpha) \to k(\theta_0, \alpha)$ if $\theta \uparrow \theta_0$.
(iii) $k(\theta, \alpha)$ increases by exactly 1 at its points of discontinuity.

(iv) $k(0, \alpha) = 1$ and $k(1, \alpha) = n + 1$.

Figure 4.5.1. The shaded region is the compatibility set $C$ for the two-sided test of $H_{\mu_0} : \mu = \mu_0$ in the normal model. $S(t_0)$ is a confidence interval for $\mu$ for a given value $t_0$ of $T$, whereas $A^*(\mu_0)$ is the acceptance region for $H_{\mu_0}$.

To prove (i) note that it was shown in Theorem 4.3.1(i) that $P_\theta[S \ge j]$ is nondecreasing in $\theta$ for fixed $j$. Clearly, it is also nonincreasing in $j$ for fixed $\theta$. Therefore, $\theta_1 < \theta_2$ and $k(\theta_1, \alpha) > k(\theta_2, \alpha)$ would imply that

$$\alpha \ge P_{\theta_2}[S \ge k(\theta_2, \alpha)] \ge P_{\theta_2}[S \ge k(\theta_1, \alpha) - 1] \ge P_{\theta_1}[S \ge k(\theta_1, \alpha) - 1] > \alpha,$$

a contradiction.

The assertion (ii) is a consequence of the following remarks. If $\theta_0$ is a discontinuity point of $k(\theta, \alpha)$, let $j$ be the limit of $k(\theta, \alpha)$ as $\theta \uparrow \theta_0$. Then $P_\theta[S \ge j] \le \alpha$ for all $\theta < \theta_0$ and, hence, $P_{\theta_0}[S \ge j] \le \alpha$. On the other hand, if $\theta > \theta_0$, $P_\theta[S \ge j] > \alpha$. Therefore, $P_{\theta_0}[S \ge j] = \alpha$ and $j = k(\theta_0, \alpha)$. The claims (iii) and (iv) are left as exercises.

From (i), (ii), (iii), and (iv) we see that, if we define

$$\underline\theta(S) = \inf\{\theta : k(\theta, \alpha) = S + 1\},$$

then

$$C(X) = \begin{cases} (\underline\theta(S), 1] & \text{if } S > 0 \\ [0, 1] & \text{if } S = 0 \end{cases}$$

and $\underline\theta(S)$ is the desired level $(1-\alpha)$ LCB for $\theta$. (2)
Figure 4.5.2 portrays the situation. From our discussion, when $S > 0$, then $k(\underline\theta(S), \alpha) = S$ and, therefore, we find $\underline\theta(S)$ as the unique solution of the equation,

$$\sum_{r=S}^{n} \binom{n}{r}\theta^r(1-\theta)^{n-r} = \alpha.$$

When $S = 0$, $\underline\theta(S) = 0$. Similarly, we define

$$\bar\theta(S) = \sup\{\theta : j(\theta, \alpha) = S - 1\}$$

where $j(\theta, \alpha)$ is given by,

$$\sum_{r=0}^{j(\theta,\alpha)} \binom{n}{r}\theta^r(1-\theta)^{n-r} \le \alpha < \sum_{r=0}^{j(\theta,\alpha)+1} \binom{n}{r}\theta^r(1-\theta)^{n-r}.$$

Then $\bar\theta(S)$ is a level $(1-\alpha)$ UCB for $\theta$ and when $S < n$, $\bar\theta(S)$ is the unique solution of

$$\sum_{r=0}^{S} \binom{n}{r}\theta^r(1-\theta)^{n-r} = \alpha.$$

When $S = n$, $\bar\theta(S) = 1$. Putting the bounds $\underline\theta(S)$, $\bar\theta(S)$ together we get the confidence interval $[\underline\theta(S), \bar\theta(S)]$ of level $(1-2\alpha)$. These intervals can be obtained from computer packages that use algorithms based on the preceding considerations. As might be expected, if $n$ is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3.

Figure 4.5.2. Plot of $k(\theta, 0.16)$ for $n = 2$.
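The two defining equations for the exact bounds of Example 4.5.2 can be solved by bisection, because the upper binomial tail is increasing in $\theta$ and the lower tail is decreasing. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and this is not the algorithm of any particular package.

```python
import math


def upper_tail(n, s, theta):
    """P_theta[S >= s] for S binomial(n, theta)."""
    return sum(math.comb(n, r) * theta ** r * (1.0 - theta) ** (n - r)
               for r in range(s, n + 1))


def lower_tail(n, s, theta):
    """P_theta[S <= s] for S binomial(n, theta)."""
    return sum(math.comb(n, r) * theta ** r * (1.0 - theta) ** (n - r)
               for r in range(0, s + 1))


def solve(f, target, increasing, tol=1e-12):
    """Bisection on [0, 1] for f(theta) = target, with f monotone in theta."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if (f(mid) < target) == increasing:
            lo = mid   # the root lies to the right of mid
        else:
            hi = mid   # the root lies to the left of mid
    return (lo + hi) / 2.0


def exact_lcb(n, s, alpha):
    """Level (1 - alpha) lower bound: solves P_theta[S >= s] = alpha; 0 when s = 0."""
    if s == 0:
        return 0.0
    return solve(lambda th: upper_tail(n, s, th), alpha, increasing=True)


def exact_ucb(n, s, alpha):
    """Level (1 - alpha) upper bound: solves P_theta[S <= s] = alpha; 1 when s = n."""
    if s == n:
        return 1.0
    return solve(lambda th: lower_tail(n, s, th), alpha, increasing=False)
```

For the setting of Figure 4.5.2 ($n = 2$, $\alpha = 0.16$) the equations can be solved by hand: with $S = 2$ the lower bound satisfies $\theta^2 = 0.16$, and with $S = 1$ it satisfies $1 - (1-\theta)^2 = 0.16$. These values are the jump points of $k(\theta, 0.16)$ and can be used to check the sketch.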
Applications of Confidence Intervals to Comparisons and Selections

We have seen that confidence intervals lead naturally to two-sided tests. However, two-sided tests seem incomplete in the sense that if $H : \theta = \theta_0$ is rejected in favor of $K : \theta \ne \theta_0$, we usually want to know whether $\theta > \theta_0$ or $\theta < \theta_0$. For instance, suppose $\theta$ is the expected difference in blood pressure when two treatments, A and B, are given to high blood pressure patients. Because we do not know whether A or B is to be preferred, we test $H : \theta = 0$ versus $K : \theta \ne 0$. If $H$ is rejected, it is natural to carry the comparison of A and B further by asking whether $\theta < 0$ or $\theta > 0$. If we decide $\theta < 0$, then we select A as the better treatment, and vice versa.

The problem of deciding whether $\theta = \theta_0$, $\theta < \theta_0$, or $\theta > \theta_0$ is an example of a three-decision problem and is a special case of the decision problems considered in Chapters 1 and 3. Here we consider the simple solution suggested by the level $(1-\alpha)$ confidence interval $I$:

1. Make no judgment as to whether $\theta < \theta_0$ or $\theta > \theta_0$ if $I$ contains $\theta_0$;
2. Decide $\theta < \theta_0$ if $I$ is entirely to the left of $\theta_0$; and
3. Decide $\theta > \theta_0$ if $I$ is entirely to the right of $\theta_0$. (4.5.3)

Example 4.5.3. Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. In Section 4.4 we considered the level $(1-\alpha)$ confidence interval $\bar X \pm \sigma z(1-\tfrac12\alpha)/\sqrt n$ for $\mu$. Using this interval and (4.5.3) we obtain the following three-decision rule based on $T = \sqrt n(\bar X - \mu_0)/\sigma$:

Do not reject $H : \mu = \mu_0$ if $|T| \le z(1-\tfrac12\alpha)$.
Decide $\mu < \mu_0$ if $T < -z(1-\tfrac12\alpha)$.
Decide $\mu > \mu_0$ if $T > z(1-\tfrac12\alpha)$.

Thus, the two-sided test can be regarded as the first step in the decision procedure where if $H$ is not rejected, we make no claims of significance, but if $H$ is rejected, we decide whether this is because $\mu$ is smaller or larger than $\mu_0$. For this three-decision rule, the probability of falsely claiming significance of either $\mu < \mu_0$ or $\mu > \mu_0$ is bounded above by $\tfrac12\alpha$. To see this consider first the case $\mu \ge \mu_0$. Then the wrong decision "$\mu < \mu_0$" is made when $T < -z(1-\tfrac12\alpha)$. This event has probability

$$P[T < -z(1-\tfrac12\alpha)] = \Phi\left(-z(1-\tfrac12\alpha) - \frac{\sqrt n(\mu - \mu_0)}{\sigma}\right) \le \Phi(-z(1-\tfrac12\alpha)) = \tfrac12\alpha.$$

Similarly, when $\mu \le \mu_0$, the probability of the wrong decision is at most $\tfrac12\alpha$. Therefore, by using this kind of procedure in a comparison or selection problem, we can control the probabilities of a wrong selection by setting the $\alpha$ of the parent test or confidence interval. We can use the two-sided tests and confidence intervals introduced in later chapters in similar fashions. □

Summary. We explore the connection between tests of statistical hypotheses and confidence regions. If $\delta(x, \nu_0)$ is a level $\alpha$ test of $H : \nu = \nu_0$, then the set $S(x)$ of $\nu_0$ where $\delta(x, \nu_0) = 0$ is a level $(1-\alpha)$ confidence region for $\nu_0$.
If $S(x)$ is a level $(1-\alpha)$ confidence region for $\nu$, then the test that accepts $H : \nu = \nu_0$ when $\nu_0 \in S(x)$ is a level $\alpha$ test. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. We also give a connection between confidence intervals, two-sided tests, and the three-decision problem of deciding whether a parameter $\theta$ is $\theta_0$, less than $\theta_0$, or larger than $\theta_0$, where $\theta_0$ is a specified value.

4.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS

In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. We next show that for a certain notion of accuracy of confidence bounds, which is connected to the power of the associated one-sided tests, optimality of the tests translates into accuracy of the bounds.

If $\underline\theta$ and $\underline\theta^*$ are two competing level $(1-\alpha)$ lower confidence bounds for $\theta$, they are both very likely to fall below the true $\theta$. But we also want the bounds to be close to $\theta$. Thus, we say that the bound with the smaller probability of being far below $\theta$ is more accurate. Formally, for $X \in \mathcal{X} \subset R^q$, the following is true.

Definition 4.6.1. A level $(1-\alpha)$ LCB $\underline\theta^*$ of $\theta$ is said to be more accurate than a competing level $(1-\alpha)$ LCB $\underline\theta$ if, and only if, for any fixed $\theta$ and all $\theta' < \theta$,

$$P_\theta[\underline\theta^*(X) \le \theta'] \le P_\theta[\underline\theta(X) \le \theta']. \qquad (4.6.1)$$

Similarly, a level $(1-\alpha)$ UCB $\bar\theta^*$ is more accurate than a competitor $\bar\theta$ if, and only if, for any fixed $\theta$ and all $\theta' > \theta$,

$$P_\theta[\bar\theta^*(X) \ge \theta'] \le P_\theta[\bar\theta(X) \ge \theta']. \qquad (4.6.2)$$

Lower confidence bounds $\underline\theta^*$ satisfying (4.6.1) for all competitors are called uniformly most accurate as are upper confidence bounds satisfying (4.6.2) for all competitors. Note that $\underline\theta^*$ is a uniformly most accurate level $(1-\alpha)$ LCB for $\theta$ if and only if $-\underline\theta^*$ is a uniformly most accurate level $(1-\alpha)$ UCB for $-\theta$.

Example 4.6.1 (Examples 3.3.2 and 4.2.1 continued). Suppose $X = (X_1, \ldots, X_n)$ is a sample of $N(\mu, \sigma^2)$ random variables with $\sigma^2$ known. A level $\alpha$ test of $H : \mu = \mu_0$ vs $K : \mu > \mu_0$ rejects $H$ when $\sqrt n(\bar X - \mu_0)/\sigma \ge z(1-\alpha)$. The dual lower confidence bound is $\underline\mu_1(X) = \bar X - z(1-\alpha)\sigma/\sqrt n$. As shown in the problems, a competing lower confidence bound is $\underline\mu_2(X) = X_{(k)}$, where $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denotes the ordered $X_1, \ldots, X_n$ and $k$ is defined by $P(S \ge k) = 1-\alpha$ for a binomial, $B(n, \tfrac12)$, random variable $S$. Which lower bound is more accurate? It does turn out that $\underline\mu_1(X)$ is more accurate than $\underline\mu_2(X)$ and is, in fact, uniformly most accurate in the $N(\mu, \sigma^2)$ model. This is a consequence of the following theorem, which reveals that (4.6.1) is nothing more than a comparison of power functions. □
Theorem 4.6.1. Let $\underline\theta^*$ be a level $(1-\alpha)$ LCB for $\theta$, a real parameter, such that for each $\theta_0$ the associated test whose critical function $\delta^*(x, \theta_0)$ is given by

$$\delta^*(x, \theta_0) = \begin{cases} 1 & \text{if } \underline\theta^*(x) > \theta_0 \\ 0 & \text{otherwise} \end{cases}$$

is UMP level $\alpha$ for $H : \theta = \theta_0$ versus $K : \theta > \theta_0$. Then $\underline\theta^*$ is uniformly most accurate at level $(1-\alpha)$.

Proof. Let $\underline\theta$ be a competing level $(1-\alpha)$ LCB. Define $\delta(x, \theta_0)$ by

$$\delta(x, \theta_0) = 0 \text{ if, and only if, } \underline\theta(x) \le \theta_0.$$

Then $\delta(X, \theta_0)$ is a level $\alpha$ test for $H : \theta = \theta_0$ versus $K : \theta > \theta_0$. Because $\delta^*(X, \theta_0)$ is UMP level $\alpha$ for $H : \theta = \theta_0$ versus $K : \theta > \theta_0$, for $\theta_1 > \theta_0$ we must have

$$E_{\theta_1}(\delta(X, \theta_0)) \le E_{\theta_1}(\delta^*(X, \theta_0))$$

or

$$P_{\theta_1}[\underline\theta(X) > \theta_0] \le P_{\theta_1}[\underline\theta^*(X) > \theta_0].$$

Identify $\theta_0$ with $\theta'$ and $\theta_1$ with $\theta$ in the statement of Definition 4.6.1 and the result follows. □

If we apply the result and Example 4.2.1 to Example 4.6.1, we find that $\bar X - z(1-\alpha)\sigma/\sqrt n$ is uniformly most accurate. However, $X_{(k)}$ does have the advantage that we don't have to know $\sigma$ or even the shape of the density $f$ of $X_i$ to apply it. Also, the robustness considerations of Section 3.5 favor $X_{(k)}$ (see Example 3.5.2).

Uniformly most accurate (UMA) bounds turn out to have related nice properties. For instance (see Problem 4.6.7 for the proof), they have the smallest expected "distance" to $\theta$:

Corollary 4.6.1. Suppose $\underline\theta^*(X)$ is UMA level $(1-\alpha)$ lower confidence bound for $\theta$. Let $\underline\theta(X)$ be any other $(1-\alpha)$ lower confidence bound, then

$$E_\theta\{(\theta - \underline\theta^*(X))^+\} \le E_\theta\{(\theta - \underline\theta(X))^+\}$$

for all $\theta$ where $a^+ = a$, if $a \ge 0$, and 0 otherwise.

We can extend the notion of accuracy to confidence bounds for real-valued functions of an arbitrary parameter. We define $\underline q^*$ to be a uniformly most accurate level $(1-\alpha)$ LCB for $q(\theta)$ if, and only if, for any other level $(1-\alpha)$ LCB $\underline q$,

$$P_\theta[\underline q^* \le q(\theta')] \le P_\theta[\underline q \le q(\theta')]$$

whenever $q(\theta') < q(\theta)$. Most accurate upper confidence bounds are defined similarly.

Example 4.6.2. Bounds for the Probability of Early Failure of Equipment. Let $X_1, \ldots, X_n$ be the times to failure of $n$ pieces of equipment where we assume that the $X_i$ are independent $\mathcal{E}(\lambda)$ variables. We want a uniformly most accurate level $(1-\alpha)$ upper confidence bound $\bar q^*$ for $q(\lambda) = 1 - e^{-\lambda t_0}$, the probability of early failure of a piece of equipment.
We begin by finding a uniformly most accurate level $(1-\alpha)$ UCB $\lambda^*$ for $\lambda$. To find $\lambda^*$ we invert the family of UMP level $\alpha$ tests of $H : \lambda \ge \lambda_0$ versus $K : \lambda < \lambda_0$. By Problem 4.6.8, the UMP test accepts $H$ if

$$\sum_{i=1}^{n} X_i < x_{2n}(1-\alpha)/2\lambda_0 \qquad (4.6.3)$$

or equivalently if

$$\lambda_0 < \frac{x_{2n}(1-\alpha)}{2\sum_{i=1}^{n} X_i}$$

where $x_{2n}(1-\alpha)$ is the $(1-\alpha)$ quantile of the $\chi^2_{2n}$ distribution. Therefore, the confidence region corresponding to this test is $(0, \lambda^*)$ where $\lambda^* = x_{2n}(1-\alpha)/2\sum_{i=1}^{n} X_i$ is, by Theorem 4.6.1, a uniformly most accurate level $(1-\alpha)$ UCB for $\lambda$ and, because $q$ is strictly increasing in $\lambda$, it follows that $q(\lambda^*)$ is a uniformly most accurate level $(1-\alpha)$ UCB for the probability of early failure.

Discussion

We have only considered confidence bounds. The situation with confidence intervals is more complicated. Considerations of accuracy lead us to ask that, subject to the requirement that the confidence level is $(1-\alpha)$, the confidence interval be as short as possible. Of course, the length $\bar T - \underline{T}$ is random and it can be shown that in most situations there is no confidence interval of level $(1-\alpha)$ that has uniformly minimum length among all such intervals. There are, however, some large sample results in this direction (see Wilks, 1962, pp. 374-376). If we turn to the expected length $E_\theta(\bar T - \underline{T})$ as a measure of precision, the situation is still unsatisfactory because, in general, there does not exist a member of the class of level $(1-\alpha)$ intervals that has minimum expected length for all $\theta$. However, as in the estimation problem, we can restrict attention to certain reasonable subclasses of level $(1-\alpha)$ intervals for which members with uniformly smallest expected length exist. Thus, Neyman defines unbiased confidence intervals of level $(1-\alpha)$ by the property that

$$P_\theta[\underline{T} \le q(\theta) \le \bar T] \ge P_\theta[\underline{T} \le q(\theta') \le \bar T]$$

for every $\theta$, $\theta'$. That is, the interval must be at least as likely to cover the true value of $q(\theta)$ as any other value. Pratt (1961) showed that in many of the classical problems of estimation there exist level $(1-\alpha)$ confidence intervals that have uniformly minimum expected length among all level $(1-\alpha)$ unbiased confidence intervals. In particular, the intervals developed in Example 4.5.1 have this property.

Confidence intervals obtained from two-sided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes. These topics are discussed in Lehmann (1997).

Summary. By using the duality between one-sided tests and confidence bounds we show that confidence bounds based on UMP level $\alpha$ tests are uniformly most accurate (UMA) level $(1-\alpha)$ in the sense that, in the case of lower bounds, they are less likely than other level $(1-\alpha)$ lower confidence bounds to fall below any value $\theta'$ below the true $\theta$.
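The upper confidence bound for an exponential failure rate obtained by inverting the UMP tests can be sketched numerically. The sketch below is illustrative only: it assumes that "early failure" means failure before a fixed time `t0`, so that $q(\lambda) = 1 - e^{-\lambda t_0}$; the function names and the failure times are hypothetical. The chi-square quantile is computed from the closed-form CDF available for even degrees of freedom, so only the standard library is needed.

```python
import math

def chi2_cdf_even(x, n):
    """CDF at x of the chi-square distribution with 2n degrees of freedom."""
    if x <= 0:
        return 0.0
    half = x / 2.0
    term = total = 1.0                  # k = 0 term of sum_{k<n} (x/2)^k / k!
    for k in range(1, n):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_quantile_even(p, n):
    """p-th quantile of chi-square with 2n degrees of freedom, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, n) < p:     # grow the bracket until it holds the quantile
        hi *= 2.0
    for _ in range(200):                # enough halvings to reach float precision
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rate_ucb(times, alpha):
    """lambda* = x_{2n}(1 - alpha) / (2 * sum of the failure times)."""
    return chi2_quantile_even(1.0 - alpha, len(times)) / (2.0 * sum(times))

def early_failure_ucb(times, alpha, t0):
    """q(lambda*) with the assumed form q(lambda) = 1 - exp(-lambda * t0)."""
    return 1.0 - math.exp(-rate_ucb(times, alpha) * t0)

times = [1.2, 0.7, 3.1, 2.4, 0.9]       # hypothetical failure times
print(rate_ucb(times, 0.05), early_failure_ucb(times, 0.05, t0=0.5))
```

Because $q$ is increasing, the bound for the failure probability is just $q$ applied to the bound for $\lambda$; no separate inversion is required.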
4.7 FREQUENTIST AND BAYESIAN FORMULATIONS

We have so far focused on the frequentist formulation of confidence bounds and intervals where the data $X \in \mathcal{X} \subset R^q$ are random while the parameters are fixed but unknown. A consequence of this approach is that once a numerical interval has been computed from experimental data, no probability statement can be attached to this interval. Instead, the interpretation of a $100(1-\alpha)\%$ confidence interval is that if we repeated an experiment indefinitely each time computing a $100(1-\alpha)\%$ confidence interval, then $100(1-\alpha)\%$ of the intervals would contain the true unknown parameter value.

In the Bayesian formulation of Sections 1.2 and 1.6.3, what are called level $(1-\alpha)$ credible bounds and intervals are subsets of the parameter space which are given probability at least $(1-\alpha)$ by the posterior distribution of the parameter given the data. Suppose that, given $\theta$, $X$ has distribution $P_\theta$, $\theta \in \Theta \subset R$, and that $\theta$ has the prior probability distribution $\Pi$.

Definition 4.7.1. Let $\Pi(\cdot \mid x)$ denote the posterior probability distribution of $\theta$ given $X = x$, then $\underline{\theta}$ and $\bar\theta$ are level $(1-\alpha)$ lower and upper credible bounds for $\theta$ if they respectively satisfy

$$\Pi(\underline{\theta} \le \theta \mid x) \ge 1-\alpha, \qquad \Pi(\theta \le \bar\theta \mid x) \ge 1-\alpha.$$

Turning to Bayesian credible intervals and regions, it is natural to consider the collection of $\theta$ that is "most likely" under the distribution $\Pi(\theta \mid x)$. Thus,

Definition 4.7.2. Let $\pi(\cdot \mid x)$ denote the density of $\theta$ given $X = x$, then

$$C_k = \{\theta : \pi(\theta \mid x) \ge k\}$$

is called a level $(1-\alpha)$ credible region for $\theta$ if $\Pi(C_k \mid x) \ge 1-\alpha$.

If $\pi(\theta \mid x)$ is unimodal, then $C_k$ will be an interval of the form $[\underline{\theta}, \bar\theta]$. We next give such an example.

Example 4.7.1. Suppose that given $\mu$, $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma_0^2)$ with $\sigma_0^2$ known, and that $\mu \sim N(\mu_0, \tau_0^2)$, with $\mu_0$ and $\tau_0^2$ known. Then, from Example 1.1.12, the posterior distribution of $\mu$ given $X_1, \ldots, X_n$ is $N(\hat\mu_B, \hat\sigma_B^2)$, with

$$\hat\mu_B = \frac{\frac{n}{\sigma_0^2}\bar x + \frac{1}{\tau_0^2}\mu_0}{\frac{n}{\sigma_0^2} + \frac{1}{\tau_0^2}}, \qquad \hat\sigma_B^2 = \frac{1}{\frac{n}{\sigma_0^2} + \frac{1}{\tau_0^2}}.$$

It follows that the level $1-\alpha$ lower and upper credible bounds for $\mu$ are

$$\underline{\mu} = \hat\mu_B - z_{1-\alpha}\frac{\sigma_0}{\sqrt{n}\left(1 + \frac{\sigma_0^2}{n\tau_0^2}\right)^{1/2}}, \qquad \bar\mu = \hat\mu_B + z_{1-\alpha}\frac{\sigma_0}{\sqrt{n}\left(1 + \frac{\sigma_0^2}{n\tau_0^2}\right)^{1/2}}$$
while the level $(1-\alpha)$ credible interval is $[\mu^-, \mu^+]$ with

$$\mu^{\pm} = \hat\mu_B \pm z_{1-\frac{1}{2}\alpha}\frac{\sigma_0}{\sqrt{n}\left(1 + \frac{\sigma_0^2}{n\tau_0^2}\right)^{1/2}}.$$

Compared to the frequentist interval $\bar X \pm z_{1-\frac{1}{2}\alpha}\sigma_0/\sqrt{n}$, the center $\hat\mu_B$ of the Bayesian interval is pulled in the direction of $\mu_0$, where $\mu_0$ is a prior guess of the value of $\mu$. See Example 1.3.4 for sources of such prior guesses. Note that as $\tau_0 \to \infty$, the Bayesian interval tends to the frequentist interval; however, the interpretations of the intervals are different.

Example 4.7.2. Suppose that given $\sigma^2$, $X_1, \ldots, X_n$ are i.i.d. $N(\mu_0, \sigma^2)$ where $\mu_0$ is known. Let $\lambda = \sigma^{-2}$ and suppose $\lambda$ has the gamma $\Gamma(\frac{1}{2}a, \frac{1}{2}b)$ density

$$\pi(\lambda) \propto \lambda^{\frac{1}{2}(a-2)}\exp\{-\tfrac{1}{2}b\lambda\}, \quad \lambda > 0$$

where $a > 0$, $b > 0$ are known parameters. Then, by Problem 1.2.12, given $x_1, \ldots, x_n$, $(t+b)\lambda$ has a $\chi^2_{a+n}$ distribution, where $t = \sum(x_i - \mu_0)^2$. Let $x_{a+n}(\alpha)$ denote the $\alpha$th quantile of the $\chi^2_{a+n}$ distribution. Then $\underline{\lambda} = x_{a+n}(\alpha)/(t+b)$ is a level $(1-\alpha)$ lower credible bound for $\lambda$ and

$$\bar\sigma_B^2 = (t+b)/x_{a+n}(\alpha)$$

is a level $(1-\alpha)$ upper credible bound for $\sigma^2$. Compared to the frequentist bound $(n-1)s^2/x_{n-1}(\alpha)$ of Example 4.4.2, $\bar\sigma_B^2$ is shifted in the direction of the reciprocal $b/a$ of the mean of $\pi(\lambda)$.

We shall analyze Bayesian credible regions further in Chapter 5.

Summary. In the Bayesian framework we define bounds and intervals, called level $(1-\alpha)$ credible bounds and intervals, that determine subsets of the parameter space that are assigned probability at least $(1-\alpha)$ by the posterior distribution of the parameter $\theta$ given the data $x$. In the case of a normal prior $\pi(\theta)$ and normal model $p(x \mid \theta)$, the level $(1-\alpha)$ credible interval is similar to the frequentist interval except it is pulled in the direction $\mu_0$ of the prior mean and it is a little narrower. However, the interpretations are different: In the frequentist confidence interval, the probability of coverage is computed with the data $X$ random and $\theta$ fixed, whereas in the Bayesian credible interval, the probability of coverage is computed with $X = x$ fixed and $\theta$ random with probability distribution $\Pi(\theta \mid X = x)$.

4.8 PREDICTION INTERVALS

In Section 1.4 we discussed situations in which we want to predict the value of a random variable $Y$. In addition to point prediction of $Y$, it is desirable to give an interval $[\underline{Y}, \bar Y]$ that contains the unknown value $Y$ with prescribed probability $(1-\alpha)$. For instance, a doctor administering a treatment with delayed effect will give patients a time interval $[\underline{T}, \bar T]$ in which the treatment is likely to take effect. Similarly, we may want an interval for the
future GPA of a student or a future value of a portfolio. We define a level $(1-\alpha)$ prediction interval as an interval $[\underline{Y}, \bar Y]$ based on data $X$ such that $P(\underline{Y} \le Y \le \bar Y) \ge 1-\alpha$. The problem of finding prediction intervals is similar to finding confidence intervals using a pivot:

Example 4.8.1. The (Student) t Prediction Interval. As in Example 4.4.1, let $X_1, \ldots, X_n$ be i.i.d. as $X \sim N(\mu, \sigma^2)$. We want a prediction interval for $Y = X_{n+1}$, which is assumed to be also $N(\mu, \sigma^2)$ and independent of $X_1, \ldots, X_n$. Let $\hat Y = \hat Y(X)$ denote a predictor based on $X = (X_1, \ldots, X_n)$. Then $\hat Y$ and $Y$ are independent and the mean squared prediction error (MSPE) of $\hat Y$ is

$$MSPE(\hat Y) = E(\hat Y - Y)^2 = E([\hat Y - \mu] - [Y - \mu])^2 = E[\hat Y - \mu]^2 + \sigma^2.$$

Note that $\hat Y$ can be regarded as both a predictor of $Y$ and as an estimate of $\mu$, and when we do so, $MSPE(\hat Y) = MSE(\hat Y) + \sigma^2$, where MSE denotes the estimation theory mean squared error. It follows that, in this case, the optimal estimator when it exists is also the optimal predictor. In Example 3.4.8, we found that in the class of unbiased estimators, $\bar X$ is the optimal estimator. We define a predictor $Y^*$ to be prediction unbiased for $Y$ if $E(Y^* - Y) = 0$, and can conclude that in the class of prediction unbiased predictors, the optimal MSPE predictor is $\hat Y = \bar X$.

We next use the prediction error $\hat Y - Y$ to construct a pivot that can be used to give a prediction interval. Note that

$$\hat Y - Y = \bar X - X_{n+1} \sim N(0, [n^{-1} + 1]\sigma^2).$$

Moreover, $s^2 = (n-1)^{-1}\sum_{1}^{n}(X_i - \bar X)^2$ is independent of $\bar X$ by Theorem B.3.3 and independent of $X_{n+1}$ by assumption. It follows that

$$Z_p(Y) = \frac{\hat Y - Y}{\sqrt{n^{-1}+1}\,\sigma}$$

has a $N(0,1)$ distribution and is independent of $V = (n-1)s^2/\sigma^2$, which has a $\chi^2_{n-1}$ distribution. Thus, by the definition of the (Student) t distribution in Section B.3.1,

$$T_p(Y) = \frac{Z_p(Y)}{\sqrt{V/(n-1)}} = \frac{\hat Y - Y}{\sqrt{n^{-1}+1}\,s}$$

has the t distribution, $\mathcal{T}_{n-1}$. By solving $-t_{n-1}(1-\frac{1}{2}\alpha) \le T_p(Y) \le t_{n-1}(1-\frac{1}{2}\alpha)$ for $Y$, we find the $(1-\alpha)$ prediction interval

$$Y = \bar X \pm \sqrt{n^{-1}+1}\; s\, t_{n-1}(1-\tfrac{1}{2}\alpha). \qquad (4.8.1)$$

Note that $T_p(Y)$ acts as a prediction interval pivot in the same way that $T(\mu)$ acts as a confidence interval pivot in Example 4.4.1. Also note that the prediction interval is much wider than the confidence interval (4.4.1). In fact, it can be shown using the methods of
Chapter 5 that the width of the confidence interval (4.4.1) tends to zero in probability at the rate $n^{-1/2}$, whereas the width of the prediction interval tends to $2\sigma z(1-\frac{1}{2}\alpha)$. Moreover, the confidence level of (4.4.1) is approximately correct for large $n$ even if the sample comes from a nonnormal distribution, whereas the level of the prediction interval (4.8.1) is not $(1-\alpha)$ in the limit as $n \to \infty$ for samples from non-Gaussian distributions.

We next give a prediction interval that is valid from samples from any population with a continuous distribution.

Example 4.8.2. Suppose $X_1, \ldots, X_n$ are i.i.d. as $X \sim F$, where $F$ is a continuous distribution function with positive density $f$ on $(a, b)$, $-\infty \le a < b \le \infty$. Let $X_{(1)} < \cdots < X_{(n)}$ denote the order statistics of $X_1, \ldots, X_n$. We want a prediction interval for $Y = X_{n+1} \sim F$, where $X_{n+1}$ is independent of the data $X_1, \ldots, X_n$. Set $U_i = F(X_i)$, $i = 1, \ldots, n+1$, then, by Problem B.2.12, $U_1, \ldots, U_{n+1}$ are i.i.d. uniform, $U(0,1)$. Let $U_{(1)} < \cdots < U_{(n)}$ be $U_1, \ldots, U_n$ ordered, then

$$P(X_{(j)} \le X_{n+1} \le X_{(k)}) = P(U_{(j)} \le U_{n+1} \le U_{(k)})$$
$$= \int P(u \le U_{n+1} \le v \mid U_{(j)} = u, U_{(k)} = v)\,dH(u,v)$$
$$= \int (v - u)\,dH(u,v) = E(U_{(k)}) - E(U_{(j)})$$

where $H$ is the joint distribution of $U_{(j)}$ and $U_{(k)}$. By Problem B.2.9, $E(U_{(i)}) = i/(n+1)$; thus,

$$P(X_{(j)} \le X_{n+1} \le X_{(k)}) = \frac{k-j}{n+1}. \qquad (4.8.2)$$

It follows that $[X_{(j)}, X_{(k)}]$ with $k = n+1-j$ is a level $\alpha = (n+1-2j)/(n+1)$ prediction interval for $X_{n+1}$. This interval is a distribution-free prediction interval. See Problem 4.8.5 for a simpler proof of (4.8.2).

Bayesian Predictive Distributions

Suppose that $\theta$ is random with $\theta \sim \pi$ and that given $\theta = \theta$, $X_1, \ldots, X_{n+1}$ are i.i.d. $p(X \mid \theta)$. Here $X_1, \ldots, X_n$ are observable and $X_{n+1}$ is to be predicted. The posterior predictive distribution $Q(\cdot \mid x)$ of $X_{n+1}$ is defined as the conditional distribution of $X_{n+1}$ given $x = (x_1, \ldots, x_n)$; that is, $Q(\cdot \mid x)$ has in the continuous case density

$$q(x_{n+1} \mid x) = \int_\Theta \prod_{i=1}^{n+1} p(x_i \mid \theta)\pi(\theta)\,d\theta \Big/ \int_\Theta \prod_{i=1}^{n} p(x_i \mid \theta)\pi(\theta)\,d\theta$$

with a sum replacing the integral in the discrete case. Now $[\underline{Y}_B, \bar Y_B]$ is said to be a level $(1-\alpha)$ Bayesian prediction interval for $Y = X_{n+1}$ if

$$Q(\underline{Y}_B \le Y \le \bar Y_B \mid x) \ge 1-\alpha.$$
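The distribution-free interval of Example 4.8.2 can be checked by simulation. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, and the exponential population is an arbitrary choice made only because the result should hold for any continuous distribution.

```python
import random

def order_stat_prediction_interval(sample, j):
    """Return (X_(j), X_(k), level) with k = n + 1 - j and level = (k - j) / (n + 1)."""
    n = len(sample)
    k = n + 1 - j
    if not 1 <= j < k:
        raise ValueError("need 1 <= j < (n + 1) / 2")
    xs = sorted(sample)
    return xs[j - 1], xs[k - 1], (k - j) / (n + 1)

def simulated_coverage(n, j, trials=20000, seed=0):
    """Fraction of trials in which a fresh observation falls in [X_(j), X_(n+1-j)]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]   # any continuous F would do
        lower, upper, _ = order_stat_prediction_interval(sample, j)
        future = rng.expovariate(1.0)                       # plays the role of X_{n+1}
        if lower <= future <= upper:
            hits += 1
    return hits / trials

# With n = 19 and j = 1 the interval [X_(1), X_(19)] has nominal level 18/20.
print(order_stat_prediction_interval(list(range(1, 20)), 1))
print(simulated_coverage(19, 1))
```

The simulated frequency should sit close to $(k-j)/(n+1)$ whatever continuous population is substituted for the exponential one.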
Example 4.8.3. Consider Example 3.2.1 where $(X_i \mid \theta) \sim N(\theta, \sigma_0^2)$, $\sigma_0^2$ known, and $\pi(\theta)$ is $N(\eta_0, \tau^2)$, $\tau^2$ known. A sufficient statistic based on the observables $X_1, \ldots, X_n$ is $T = \bar X = n^{-1}\sum_{i=1}^{n} X_i$, and it is enough to derive the marginal distribution of $Y = X_{n+1}$ from the joint distribution of $\bar X$, $X_{n+1}$ and $\theta$, where $\bar X$ and $X_{n+1}$ are independent. Note that

$$E[(X_{n+1} - \theta)\theta] = E\{E(X_{n+1} - \theta)\theta \mid \theta = \theta\} = 0.$$

Thus, $X_{n+1} - \theta$ and $\theta$ are uncorrelated and, by Theorem B.4.1, independent. To obtain the predictive distribution, note that given $\bar X = t$, $X_{n+1} - \theta$ and $\theta$ are still uncorrelated and independent. Thus, if we let $\mathcal{L}$ denote "distribution of," then

$$\mathcal{L}\{X_{n+1} \mid \bar X = t\} = \mathcal{L}\{(X_{n+1} - \theta) + \theta \mid \bar X = t\} = N(\hat\mu_B, \sigma_0^2 + \hat\sigma_B^2)$$

where, from Example 4.7.1,

$$\hat\sigma_B^2 = \frac{1}{\frac{n}{\sigma_0^2} + \frac{1}{\tau^2}}, \qquad \hat\mu_B = (\hat\sigma_B^2/\tau^2)\eta_0 + (n\hat\sigma_B^2/\sigma_0^2)\bar x.$$

It follows that a level $(1-\alpha)$ Bayesian prediction interval for $Y$ is $[Y_B^-, Y_B^+]$ with

$$Y_B^{\pm} = \hat\mu_B \pm z(1-\tfrac{1}{2}\alpha)\sqrt{\sigma_0^2 + \hat\sigma_B^2}. \qquad (4.8.3)$$

To consider the frequentist properties of the Bayesian prediction interval (4.8.3) we compute its probability limit under the assumption that $X_1, \ldots, X_n$ are i.i.d. $N(\theta, \sigma_0^2)$. Because $\hat\sigma_B^2 \to 0$, $(n\hat\sigma_B^2/\sigma_0^2) \to 1$, and $\bar X \stackrel{P}{\to} \theta$ as $n \to \infty$, we find that the interval (4.8.3) converges in probability to $\theta \pm z(1-\frac{1}{2}\alpha)\sigma_0$ as $n \to \infty$. This is the same as the probability limit of the frequentist interval (4.8.1).

The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box, 1983).

Summary. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least $(1-\alpha)$. In the case of a normal sample of size $n+1$ with only $n$ variables observable, we construct the Student t prediction interval for the unobservable variable. For a sample of size $n+1$ from a continuous distribution we show how the order statistics can be used to give a distribution-free prediction interval. The Bayesian formulation is based on the posterior predictive distribution which is the conditional distribution of the unobservable variable given the observable variables. The Bayesian prediction interval is derived for the normal model with a normal prior.

4.9 LIKELIHOOD RATIO PROCEDURES

4.9.1 Introduction

Up to this point, the results and examples in this chapter deal mostly with one-parameter problems in which it sometimes is possible to find optimal procedures. However, even in
the case in which $\theta$ is one-dimensional, optimal procedures may not exist. For instance, if $X_1, \ldots, X_n$ is a sample from a $N(\mu, \sigma^2)$ population with $\sigma^2$ known, there is no UMP test for testing $H : \mu = \mu_0$ vs $K : \mu \ne \mu_0$. To see this, note that it follows from Example 4.2.1 that if $\mu_1 > \mu_0$, the MP level $\alpha$ test $\delta_\alpha(X)$ rejects $H$ for $T > z(1-\alpha)$, where $T = \sqrt{n}(\bar X - \mu_0)/\sigma$. On the other hand, if $\mu_1 < \mu_0$, the MP level $\alpha$ test $\varphi_\alpha(X)$ rejects $H$ if $T \le z(\alpha)$. Because $\delta_\alpha(x) \ne \varphi_\alpha(x)$, by the uniqueness of the NP test (Theorem 4.2.1(c)), there can be no UMP test of $H : \mu = \mu_0$ vs $K : \mu \ne \mu_0$.

In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6. We start with a generalization of the Neyman-Pearson statistic $p(x, \theta_1)/p(x, \theta_0)$. Suppose that $X = (X_1, \ldots, X_n)$ has density or frequency function $p(x, \theta)$ and we wish to test $H : \theta \in \Theta_0$ vs $K : \theta \in \Theta_1$. The test statistic we want to consider is the likelihood ratio given by

$$L(x) = \frac{\sup\{p(x, \theta) : \theta \in \Theta_1\}}{\sup\{p(x, \theta) : \theta \in \Theta_0\}}.$$

Tests that reject $H$ for large values of $L(x)$ are called likelihood ratio tests.

To see that this is a plausible statistic, recall from Section 2.2.2 that we think of the likelihood function $L(\theta, x) = p(x, \theta)$ as a measure of how well $\theta$ "explains" the given sample $x = (x_1, \ldots, x_n)$. So, if $\sup\{p(x, \theta) : \theta \in \Theta_1\}$ is large compared to $\sup\{p(x, \theta) : \theta \in \Theta_0\}$, then the observed sample is best explained by some $\theta \in \Theta_1$, and conversely. Also note that $L(x)$ coincides with the optimal test statistic $p(x, \theta_1)/p(x, \theta_0)$ when $\Theta_0 = \{\theta_0\}$, $\Theta_1 = \{\theta_1\}$. In particular cases, and for large samples, likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6.

In the cases we shall consider, $p(x, \theta)$ is a continuous function of $\theta$ and $\Theta_0$ is of smaller dimension than $\Theta = \Theta_0 \cup \Theta_1$ so that the likelihood ratio equals the test statistic

$$\lambda(x) = \frac{\sup\{p(x, \theta) : \theta \in \Theta\}}{\sup\{p(x, \theta) : \theta \in \Theta_0\}} \qquad (4.9.1)$$

whose computation is often simple. Note that in general $\lambda(x) = \max(L(x), 1)$.

We are going to derive likelihood ratio tests in several important testing problems. Although the calculations differ from case to case, the basic steps are always the same.

1. Calculate the MLE $\hat\theta$ of $\theta$.
2. Calculate the MLE $\hat\theta_0$ of $\theta$ where $\theta$ may vary only over $\Theta_0$.
3. Form $\lambda(x) = p(x, \hat\theta)/p(x, \hat\theta_0)$.
4. Find a function $h$ that is strictly increasing on the range of $\lambda$ such that $h(\lambda(X))$ has a simple form and a tabled distribution under $H$. Because $h(\lambda(X))$ is equivalent to $\lambda(X)$, we specify the size $\alpha$ likelihood ratio test through the test statistic $h(\lambda(X))$ and its $(1-\alpha)$th quantile obtained from the table.
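The four steps can be walked through for the simplest case mentioned in this section, a normal sample with known $\sigma$ and $H : \mu = \mu_0$. This is only an illustrative sketch under that assumption; the function names and the data are hypothetical. Here $\log\lambda = T^2/2$, so $h(\lambda) = \sqrt{2\log\lambda} = |T|$, whose null distribution is that of the absolute value of a standard normal variable.

```python
import math
from statistics import NormalDist

def log_likelihood(xs, mu, sigma):
    """Log of the N(mu, sigma^2) likelihood of the sample xs."""
    n = len(xs)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * sigma ** 2))

def lr_test_known_sigma(xs, mu0, sigma, alpha):
    n = len(xs)
    mu_hat = sum(xs) / n                                   # step 1: unrestricted MLE
    mu_hat0 = mu0                                          # step 2: MLE over Theta_0 = {mu0}
    log_lam = (log_likelihood(xs, mu_hat, sigma)
               - log_likelihood(xs, mu_hat0, sigma))       # step 3, on the log scale
    h = math.sqrt(max(0.0, 2.0 * log_lam))                 # step 4: h(lambda) = |T|
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)         # (1 - alpha) quantile of |N(0,1)|
    return log_lam, h, crit, h >= crit

xs = [0.3, 1.1, -0.4, 2.0, 0.9]                            # hypothetical sample
log_lam, h, crit, reject = lr_test_known_sigma(xs, mu0=0.0, sigma=1.0, alpha=0.05)
t = math.sqrt(len(xs)) * (sum(xs) / len(xs) - 0.0) / 1.0   # T = sqrt(n)(xbar - mu0)/sigma
print(log_lam, h, abs(t), crit, reject)
```

Computing $\lambda$ from the two maximized log likelihoods and then comparing $h$ with $|T|$ makes the equivalence in step 4 visible rather than assumed.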
We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions, bounds, and so on. For instance, we can invert the family of size $\alpha$ likelihood ratio tests of the point hypothesis $H : \theta = \theta_0$ and obtain the level $(1-\alpha)$ confidence region

$$C(x) = \{\theta : p(x, \theta) \ge [c(\theta)]^{-1}\sup_\theta p(x, \theta)\} \qquad (4.9.2)$$

where $\sup_\theta$ denotes sup over $\theta \in \Theta$ and the critical constant $c(\theta)$ satisfies

$$P_{\theta_0}\left[\frac{\sup_\theta p(X, \theta)}{p(X, \theta_0)} \ge c(\theta_0)\right] = \alpha.$$

It is often approximately true (see Chapter 6) that $c(\theta)$ is independent of $\theta$. In that case, $C(x)$ is just the set of all $\theta$ whose likelihood is on or above some fixed value dependent on the data. An example is discussed in Section 4.9.2.

This section includes situations in which $\theta = (\theta_1, \theta_2)$ where $\theta_1$ is the parameter of interest and $\theta_2$ is a nuisance parameter. We shall obtain likelihood ratio tests for hypotheses of the form $H : \theta_1 = \theta_{10}$, which are composite because $\theta_2$ can vary freely. The family of such level $\alpha$ likelihood ratio tests obtained by varying $\theta_{10}$ can also be inverted and yield confidence regions for $\theta_1$. To see how the process works we refer to the specific examples in Sections 4.9.2-4.9.5.

4.9.2 Tests for the Mean of a Normal Distribution-Matched Pair Experiments

Suppose $X_1, \ldots, X_n$ form a sample from a $N(\mu, \sigma^2)$ population in which both $\mu$ and $\sigma^2$ are unknown. An important class of situations for which this model may be appropriate occurs in matched pair experiments. Here are some examples. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age, diet, and other factors. We are interested in expected differences in responses due to the treatment effect. In order to reduce differences due to the extraneous factors, we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. We can regard twins as being matched pairs. After the matching, the experiment proceeds as follows. In the $i$th pair one patient is picked at random (i.e., with probability $\frac{1}{2}$) and given the treatment, while the second patient serves as control and receives a placebo. Response measurements are taken on the treated and control members of each pair.

Studies in which subjects serve as their own control can also be thought of as matched pair experiments. That is, we measure the response of a subject when under treatment and when not under treatment. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo, sales performance before and after a course in salesmanship, mileage of cars with and without a certain ingredient or adjustment, and so on.

Let $X_i$ denote the difference between the treated and control responses for the $i$th pair. If the treatment and placebo have the same effect, the difference $X_i$ has a distribution that is
o.0}. B) was solved in Example 3. = E(X 1 ) denote the mean difference between the response of the treated and control subjects.B): B E 8} = p(x.0 )2 .0) 2  a2 n] = 0. B) at (fJ.5. . =Ie fJ. However. The problem of finding the supremum of p(x. n i=l is the maximum likelihood estimate of B.L ( X i .0.3. Our null hypothesis of no treatment effect is then H : fJ. which has the immediate solution ao ~2 = . B) : B E 8 0 } boils down to finding the maximum likelihood estimate a~ of a 2 when fJ. To this we have added the nonnality assumption. = fJ. as discussed in Section 4.0 as an established standard for an old treatment. TwoSided Tests We begin by considering K : fJ. The test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the nonnality assumption is not satisfied. the test can be modified into a threedecision rule that decides whether there is a significant positive or negative effect. 'for the purpose of referring to the duality between testing and confidence procedures. This corresponds to the alternative "The treatment has some effect. e). 8 0 = {(fJ" Under our assumptions.0. Let fJ. where B=(x. = fJ. a 2 ) : fJ.L Xi 1 ~( n i=l .a)= ~ ~2 (1 ~ 1 ~ n i=l 2 ) . we test H : fJ.258 Testing and Confidence Regions Chapter 4 symmetric about zero.fJ. = fJ. a~). The likelihood equation is oa 2 a logp(x. Finding sup{p(x. where we think of fJ.=1 fJ.6. We found that sup{p(x. = O. Form of the TwoSided Tests Let B = (fJ" a 2 ). B) 1[1~ ="2 a 4 L(Xi ." However.0 is known and then evaluating p(x.L X i ' . as representing the treatment effect. We think of fJ. good or bad.X) .
By Theorem 2.3.1, $\hat\sigma_0^2$ gives the maximum of $p(x, \theta)$ for $\theta \in \Theta_0$. The test statistic $\lambda(x)$ is equivalent to $\log\lambda(x)$, which thus equals

$$\log\lambda(x) = \log p(x, \hat\theta) - \log p(x, (\mu_0, \hat\sigma_0^2))$$
$$= \left\{-\frac{n}{2}[(\log 2\pi) + (\log\hat\sigma^2)] - \frac{n}{2}\right\} - \left\{-\frac{n}{2}[(\log 2\pi) + (\log\hat\sigma_0^2)] - \frac{n}{2}\right\}$$
$$= \frac{n}{2}\log(\hat\sigma_0^2/\hat\sigma^2).$$

Our test rule, therefore, rejects $H$ for large values of $(\hat\sigma_0^2/\hat\sigma^2)$. To simplify the rule further we use the following equation, which can be established by expanding both sides.

$$\hat\sigma_0^2 = \hat\sigma^2 + (\bar x - \mu_0)^2$$

Therefore,

$$(\hat\sigma_0^2/\hat\sigma^2) = 1 + (\bar x - \mu_0)^2/\hat\sigma^2.$$

Because $s^2 = (n-1)^{-1}\sum(x_i - \bar x)^2 = n\hat\sigma^2/(n-1)$, $\hat\sigma_0^2/\hat\sigma^2$ is a monotone increasing function of $|T_n|$ where

$$T_n = \frac{\sqrt{n}(\bar x - \mu_0)}{s}.$$

Therefore, the likelihood ratio tests reject for large values of $|T_n|$. Because $T_n$ has a $\mathcal{T}_{n-1}$ distribution under $H$ (see Example 4.4.1), the size $\alpha$ critical value is $t_{n-1}(1-\frac{1}{2}\alpha)$ and we can use calculators or software that gives quantiles of the t distribution, or Table III, to find the critical value. For instance, suppose $n = 25$ and we want $\alpha = 0.05$. Then we would reject $H$ if, and only if, $|T_n| \ge 2.064$.

One-Sided Tests

The two-sided formulation is natural if two treatments, A and B, are considered to be equal before the experiment is performed. However, if we are comparing a treatment and control, the relevant question is whether the treatment creates an improvement. Thus, the testing problem $H : \mu \le \mu_0$ versus $K : \mu > \mu_0$ (with $\mu_0 = 0$) is suggested. The statistic $T_n$ is equivalent to the likelihood ratio statistic $\lambda$ for this problem. A proof is sketched in Problem 4.9.2. In Problem 4.9.11 we argue that $P_\delta[T_n \ge t]$ is increasing in $\delta$, where $\delta = \sqrt{n}(\mu - \mu_0)/\sigma$. Therefore, the test that rejects $H$ for

$$T_n \ge t_{n-1}(1-\alpha),$$

is of size $\alpha$ for $H : \mu \le \mu_0$. Similarly, the size $\alpha$ likelihood ratio test for $H : \mu \ge \mu_0$ versus $K : \mu < \mu_0$ rejects $H$ if, and only if,

$$T_n \le t_{n-1}(\alpha).$$
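The relation between the likelihood ratio and $T_n$ can be verified numerically. The sketch below is illustrative: the function name is hypothetical, and the sample consists of the ten B - A sleep differences as read from the Cushny and Peebles data example of this section (treat the transcription as an assumption). It checks that $\log\lambda = \frac{n}{2}\log(1 + T_n^2/(n-1))$, which is the identity $\hat\sigma_0^2/\hat\sigma^2 = 1 + (\bar x - \mu_0)^2/\hat\sigma^2$ rewritten in terms of $T_n$.

```python
import math

def one_sample_lr(xs, mu0):
    """Return (xbar, s^2, T_n, log lambda) for H: mu = mu0 with sigma^2 unknown."""
    n = len(xs)
    xbar = sum(xs) / n
    sig2_hat = sum((x - xbar) ** 2 for x in xs) / n     # unrestricted MLE of sigma^2
    sig2_hat0 = sum((x - mu0) ** 2 for x in xs) / n     # MLE of sigma^2 when mu = mu0
    s2 = n * sig2_hat / (n - 1)                         # unbiased sample variance
    t_n = math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)
    log_lam = 0.5 * n * math.log(sig2_hat0 / sig2_hat)
    return xbar, s2, t_n, log_lam

# B - A sleep differences for the ten patients (assumed transcription of the data example)
diffs = [1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4]
xbar, s2, t_n, log_lam = one_sample_lr(diffs, mu0=0.0)
n = len(diffs)
print(round(xbar, 2), round(s2, 3), round(t_n, 2))
print(log_lam, 0.5 * n * math.log(1.0 + t_n ** 2 / (n - 1)))   # the two agree
print(abs(t_n) >= 3.25)   # 3.25 is the value of t_9(0.995) quoted in the text
```

Since $\log\lambda$ is an increasing function of $T_n^2$, rejecting for large $\lambda$ and rejecting for large $|T_n|$ define the same test.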
Power Functions

To discuss the power of these tests, we need to introduce the noncentral t distribution with $k$ degrees of freedom and noncentrality parameter $\delta$. This distribution, denoted by $\mathcal{T}_{k,\delta}$, is by definition the distribution of $Z/\sqrt{V/k}$ where $Z$ and $V$ are independent and have $N(\delta, 1)$ and $\chi^2_k$ distributions, respectively. The density of $Z/\sqrt{V/k}$ is given in Problem 4.9.12. To derive the distribution of $T_n$, note that from Section B.3, we know that $\sqrt{n}(\bar X - \mu)/\sigma$ and $(n-1)s^2/\sigma^2$ are independent and that $(n-1)s^2/\sigma^2$ has a $\chi^2_{n-1}$ distribution. Because $E[\sqrt{n}(\bar X - \mu_0)/\sigma] = \sqrt{n}(\mu - \mu_0)/\sigma$ and $\mathrm{Var}(\sqrt{n}(\bar X - \mu_0)/\sigma) = 1$, $\sqrt{n}(\bar X - \mu_0)/\sigma$ has $N(\delta, 1)$ distribution, with $\delta = \sqrt{n}(\mu - \mu_0)/\sigma$. Thus, the ratio

$$\frac{\sqrt{n}(\bar X - \mu_0)/\sigma}{\sqrt{\frac{(n-1)s^2/\sigma^2}{n-1}}} = T_n$$

has a $\mathcal{T}_{n-1,\delta}$ distribution, and the power can be obtained from computer software or tables of the noncentral t distribution. Note that the distribution of $T_n$ depends on $\theta = (\mu, \sigma^2)$ only through $\delta$.

The power functions of the one-sided tests are monotone in $\delta$ (Problem 4.9.11) just as the power functions of the corresponding tests of Example 4.2.1 are monotone in $\sqrt{n}\mu/\sigma$.

We can control both probabilities of error by selecting the sample size $n$ large provided we consider alternatives of the form $|\delta| \ge \delta_1 > 0$ in the two-sided case and $\delta \ge \delta_1$ or $\delta \le -\delta_1$ in the one-sided cases. Computer software will compute $n$.

If we consider alternatives of the form $(\mu - \mu_0) \ge \Delta$, say, we can no longer control both probabilities of error by choosing the sample size. The reason is that, whatever be $n$, by making $\sigma$ sufficiently large we can force the noncentrality parameter $\delta = \sqrt{n}(\mu - \mu_0)/\sigma$ as close to 0 as we please and, thus, bring the power arbitrarily close to $\alpha$. We have met similar difficulties (Problem 4.4.7) when discussing confidence intervals. A Stein solution, in which we estimate $\sigma$ for a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with $|\mu - \mu_0| \ge \Delta$, is possible (Lehmann, 1997, p. 260, Problem 17). With this solution, however, we may be required to take more observations than we can afford on the second stage.

Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

$$C(X) = \left\{\mu : \left|\sqrt{n}(\bar X - \mu)/s\right| \le t_{n-1}(1-\tfrac{1}{2}\alpha)\right\}.$$

We recognize $C(X)$ as the confidence interval of Example 4.4.1. Similarly the one-sided tests lead to the lower and upper confidence bounds of Example 4.4.1.
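The claim that the distribution of $T_n$, and hence the power, depends on $(\mu, \sigma^2)$ only through $\delta = \sqrt{n}(\mu - \mu_0)/\sigma$ can be illustrated by simulation. This is a rough Monte Carlo sketch, not a substitute for noncentral t tables or software; the function names and parameter values are hypothetical, and the critical value 2.064 is the one quoted in the text for $n = 25$, $\alpha = 0.05$.

```python
import math
import random

def t_statistic(xs, mu0):
    """T_n = sqrt(n) (xbar - mu0) / s."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return math.sqrt(n) * (xbar - mu0) / s

def simulated_power(mu, sigma, mu0, n, crit, trials=10000, seed=1):
    """Monte Carlo estimate of P(|T_n| >= crit) for N(mu, sigma^2) samples of size n."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        if abs(t_statistic(xs, mu0)) >= crit:
            rejections += 1
    return rejections / trials

n, crit = 25, 2.064
p_a = simulated_power(0.5, 1.0, 0.0, n, crit)    # delta = 5 * 0.5 / 1  = 2.5
p_b = simulated_power(5.0, 10.0, 0.0, n, crit)   # delta = 5 * 5.0 / 10 = 2.5
p_0 = simulated_power(0.0, 1.0, 0.0, n, crit)    # delta = 0: this is the size
print(p_a, p_b, p_0)
```

The two alternatives have very different means and variances but the same $\delta$, so their estimated powers should agree, while the estimate at $\delta = 0$ should be near $\alpha$.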
Data Example

As an illustration of these procedures, consider the following data due to Cushny and Peebles (see Fisher, 1958, p. 121) giving the difference B - A in sleep gained using drugs A and B on 10 patients. This is a matched pair experiment with each subject serving as its own control.

Patient i     1     2     3     4     5     6     7     8     9    10
A           0.7  -1.6  -0.2  -1.2  -0.1   3.4   3.7   0.8   0.0   2.0
B           1.9   0.8   1.1   0.1  -0.1   4.4   5.5   1.6   4.6   3.4
B - A       1.2   2.4   1.3   1.3   0.0   1.0   1.8   0.8   4.6   1.4

If we denote the difference as $x$'s, then $\bar x = 1.58$, $s^2 = 1.513$, and $|T_n| = 4.06$. Because $t_9(0.995) = 3.25$, we conclude at the 1% level of significance that the two drugs are significantly different. The 0.99 confidence interval for the mean difference $\mu$ between treatments is [0.32, 2.84]. It suggests that not only are the drugs different but in fact B is better than A because no hypothesis $\mu = \mu' < 0$ is accepted at this level. (See also (4.5.3).)

4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations

We often want to compare two populations with distribution $F$ and $G$ on the basis of two independent samples $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$, one from each population. For instance, suppose we wanted to test the effect of a certain drug on some biological variable (e.g., blood pressure). Then $X_1, \ldots, X_{n_1}$ could be blood pressure measurements on a sample of patients given a placebo, while $Y_1, \ldots, Y_{n_2}$ are the measurements on a sample given the drug. For quantitative measurements such as blood pressure, height, weight, length, volume, temperature, and so forth, it is usually assumed that $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ are independent samples from $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ populations, respectively.

The preceding assumptions were discussed in Example 1.1.3. A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.

Tests

We first consider the problem of testing $H : \mu_1 = \mu_2$ versus $K : \mu_1 \ne \mu_2$. In the control versus treatment example, this is the problem of determining whether the treatment has any effect.

Let $\theta = (\mu_1, \mu_2, \sigma^2)$. Then $\Theta_0 = \{\theta : \mu_1 = \mu_2\}$ and $\Theta_1 = \{\theta : \mu_1 \ne \mu_2\}$. The log of the likelihood of $(X, Y) = (X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2})$ is

$$\log p(x, y, \theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(\sum_{i=1}^{n_1}(x_i - \mu_1)^2 + \sum_{j=1}^{n_2}(y_j - \mu_2)^2\right)$$

where $n = n_1 + n_2$. As in Section 4.9.2 the likelihood function and its log are maximized over $\Theta$ by the maximum likelihood estimate $\hat\theta$. In Problem 4.9.6, it is shown that $\hat\theta =$
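$(\bar X, \bar Y, \tilde\sigma^2)$, where

$$\tilde\sigma^2 = \frac{1}{n}\left[\sum_{i=1}^{n_1}(X_i - \bar X)^2 + \sum_{j=1}^{n_2}(Y_j - \bar Y)^2\right].$$

When $\mu_1 = \mu_2 = \mu$, our model reduces to the one-sample model of Section 4.9.2. Thus, the maximum of $p$ over $\Theta_0$ is obtained for $\theta = (\hat\mu, \hat\mu, \tilde\sigma_0^2)$, where

$$\hat\mu = \frac{1}{n}\left[\sum_{i=1}^{n_1} X_i + \sum_{j=1}^{n_2} Y_j\right]$$

and

$$\tilde\sigma_0^2 = \frac{1}{n}\left[\sum_{i=1}^{n_1}(X_i - \hat\mu)^2 + \sum_{j=1}^{n_2}(Y_j - \hat\mu)^2\right].$$

If we use the identities

$$\frac{1}{n}\sum_{i=1}^{n_1}(X_i - \hat\mu)^2 = \frac{1}{n}\sum_{i=1}^{n_1}(X_i - \bar X)^2 + \frac{n_1}{n}(\bar X - \hat\mu)^2$$
$$\frac{1}{n}\sum_{j=1}^{n_2}(Y_j - \hat\mu)^2 = \frac{1}{n}\sum_{j=1}^{n_2}(Y_j - \bar Y)^2 + \frac{n_2}{n}(\bar Y - \hat\mu)^2$$

obtained by writing $[X_i - \hat\mu]^2 = [(X_i - \bar X) + (\bar X - \hat\mu)]^2$ and expanding, we find that the log likelihood ratio statistic

$$\log\lambda(x, y) = \frac{n}{2}\log(\tilde\sigma_0^2/\tilde\sigma^2)$$

is equivalent to the test statistic $|T|$ where

$$T = \sqrt{\frac{n_1 n_2}{n}}\,\frac{\bar Y - \bar X}{s}$$

and

$$s^2 = \frac{n\tilde\sigma^2}{n-2} = \frac{1}{n-2}\left[\sum_{i=1}^{n_1}(X_i - \bar X)^2 + \sum_{j=1}^{n_2}(Y_j - \bar Y)^2\right].$$

To complete specification of the size $\alpha$ likelihood ratio test we show that $T$ has $\mathcal{T}_{n-2}$ distribution when $\mu_1 = \mu_2$. By Theorem B.3.3

$$\frac{1}{\sigma}\bar X,\quad \frac{1}{\sigma}\bar Y,\quad \frac{1}{\sigma^2}\sum_{i=1}^{n_1}(X_i - \bar X)^2 \quad\text{and}\quad \frac{1}{\sigma^2}\sum_{j=1}^{n_2}(Y_j - \bar Y)^2$$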
are independent and distributed as $N(\mu_1/\sigma, 1/n_1)$, $N(\mu_2/\sigma, 1/n_2)$, $\chi^2_{n_1-1}$, $\chi^2_{n_2-1}$, respectively. We conclude from this remark and the additive property of the $\chi^2$ distribution that $(n-2)s^2/\sigma^2$ has a $\chi^2_{n-2}$ distribution, and that $\sqrt{n_1 n_2/n}\,(\bar Y - \bar X)/\sigma$ has a $N(0,1)$ distribution and is independent of $(n-2)s^2/\sigma^2$. Therefore, by definition, $T \sim \mathcal{T}_{n-2}$ under $H$ and the resulting two-sided, two-sample t test rejects if, and only if,

$$|T| \ge t_{n-2}(1-\tfrac{1}{2}\alpha).$$

As usual, corresponding to the two-sided test, there are two one-sided tests with critical regions,

$$T \ge t_{n-2}(1-\alpha)$$

for $H : \mu_2 \le \mu_1$ and

$$T \le t_{n-2}(\alpha)$$

for $H : \mu_1 \le \mu_2$. We can show that these tests are likelihood ratio tests for these hypotheses. It is also true that these procedures are of size $\alpha$ for their respective hypotheses. As in the one-sample case, this follows from the fact that, if $\mu_1 \ne \mu_2$, $T$ has a noncentral t distribution with noncentrality parameter,

$$\delta_2 = E\left[\sqrt{n_1 n_2/n}\,(\bar Y - \bar X)/\sigma\right] = \sqrt{n_1 n_2/n}\,(\mu_2 - \mu_1)/\sigma.$$

Confidence Intervals

To obtain confidence intervals for $\mu_2 - \mu_1$ we naturally look at likelihood ratio tests for the family of testing problems $H : \mu_2 - \mu_1 = \Delta$ versus $K : \mu_2 - \mu_1 \ne \Delta$. As for the special case $\Delta = 0$, we find a simple equivalent statistic $|T(\Delta)|$ where

$$T(\Delta) = \sqrt{n_1 n_2/n}\,(\bar Y - \bar X - \Delta)/s.$$

If $\mu_2 - \mu_1 = \Delta$, $T(\Delta)$ has a $\mathcal{T}_{n-2}$ distribution and inversion of the tests leads to the interval

$$\bar Y - \bar X \pm t_{n-2}(1-\tfrac{1}{2}\alpha)\,s\sqrt{n/n_1 n_2} \qquad (4.9.3)$$

for $\mu_2 - \mu_1$. Similarly, one-sided tests lead to the upper and lower endpoints of the interval as $1-\frac{1}{2}\alpha$ likelihood confidence bounds.
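The two-sample statistic $T$ and the interval (4.9.3) can be computed directly. The sketch below is illustrative: the function name is hypothetical, the six log permeability values are those of the Hald data example in this section as I read them (treat the transcription as an assumption), and the quantile $t_4(0.975) = 2.776$ is the one quoted in the text rather than computed.

```python
import math

def two_sample_t(xs, ys, t_quantile):
    """Pooled two-sample t statistic and the interval (4.9.3) for mu2 - mu1."""
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    xbar, ybar = sum(xs) / n1, sum(ys) / n2
    ss = sum((x - xbar) ** 2 for x in xs) + sum((y - ybar) ** 2 for y in ys)
    s2 = ss / (n - 2)                                   # pooled variance estimate s^2
    s = math.sqrt(s2)
    diff = ybar - xbar
    t = math.sqrt(n1 * n2 / n) * diff / s
    half = t_quantile * s * math.sqrt(n / (n1 * n2))    # half-width of (4.9.3)
    return diff, s2, t, half

x = [1.845, 1.790, 2.042]     # machine 1, log permeability (assumed transcription)
y = [1.583, 1.627, 1.282]     # machine 2
diff, s2, t, half = two_sample_t(x, y, t_quantile=2.776)
print(round(diff, 3), round(s2, 4), round(t, 3), round(half, 3))
print(abs(t) >= 2.776)        # reject H at the 5% level?
```

The interval $\text{diff} \pm \text{half}$ excludes 0 exactly when $|T|$ exceeds the quantile, which is the test-interval duality used throughout this section.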
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results in terms of logarithms were (from Hald, 1952, p. 472)

x (machine 1)   1.845   1.790   2.042
y (machine 2)   1.583   1.627   1.282

We test the hypothesis $H$ of no difference in expected log permeability. $H$ is rejected if $|T| \ge t_{n-2}(1-\frac{1}{2}\alpha)$. Here $\bar y - \bar x = -0.395$, $s^2 = 0.0264$, and $T = -2.977$. Because $t_4(0.975) = 2.776$, we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is

$$-0.395 \pm 0.368.$$

On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level $(1-\alpha)$ confidence interval has probability at most $\frac{1}{2}\alpha$ of making the wrong selection.

4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the $X$'s and $Y$'s have different variances. For instance, a treatment that increases mean response may increase the variance of the responses. If normality holds, we are led to a model where $X_1, \ldots, X_{n_1}$; $Y_1, \ldots, Y_{n_2}$ are two independent $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ samples. As a first step we may still want to compare mean responses for the $X$ and $Y$ populations. This is the Behrens-Fisher problem.

Suppose first that $\sigma_1^2$ and $\sigma_2^2$ are known. The log likelihood, except for an additive constant, is

$$-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n_1}(x_i - \mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{n_2}(y_j - \mu_2)^2.$$

The MLEs of $\mu_1$ and $\mu_2$ are, thus, $\hat\mu_1 = \bar x$ and $\hat\mu_2 = \bar y$ for $(\mu_1, \mu_2) \in R \times R$. When $\mu_1 = \mu_2 = \mu$, setting the derivative of the log likelihood equal to zero yields the MLE
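$$\hat\mu = \frac{\sum_{i=1}^{n_1} x_i + \gamma\sum_{j=1}^{n_2} y_j}{n_1 + \gamma n_2}$$

where $\gamma = \sigma_1^2/\sigma_2^2$. It follows that

$$\lambda(x, y) = \frac{p(x, y, \bar x, \bar y)}{p(x, y, \hat\mu, \hat\mu)}.$$

By writing

$$\sum_{i=1}^{n_1}(x_i - \hat\mu)^2 = \sum_{i=1}^{n_1}(x_i - \bar x)^2 + n_1(\hat\mu - \bar x)^2$$
$$\sum_{j=1}^{n_2}(y_j - \hat\mu)^2 = \sum_{j=1}^{n_2}(y_j - \bar y)^2 + n_2(\hat\mu - \bar y)^2$$

we obtain

$$\lambda(x, y) = \exp\left\{\frac{n_1}{2\sigma_1^2}(\hat\mu - \bar x)^2 + \frac{n_2}{2\sigma_2^2}(\hat\mu - \bar y)^2\right\}.$$

Next we compute

$$\hat\mu - \bar x = \gamma n_2(\bar y - \bar x)/(n_1 + \gamma n_2).$$

Similarly, $\hat\mu - \bar y = n_1(\bar x - \bar y)/(n_1 + \gamma n_2)$. It follows that the likelihood ratio test is equivalent to the statistic $|D|/\sigma_D$, where $D = \bar Y - \bar X$ and $\sigma_D^2$ is the variance of $D$, that is

$$\sigma_D^2 = \mathrm{Var}(\bar X) + \mathrm{Var}(\bar Y) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$

Thus, $D/\sigma_D$ has a $N(\Delta/\sigma_D, 1)$ distribution, where $\Delta = \mu_2 - \mu_1$. Because $\sigma_D^2$ is unknown, it must be estimated. An unbiased estimate is

$$s_D^2 = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}.$$

It is natural to try and use $D/s_D$ as a test statistic for the one-sided hypothesis $H : \mu_2 \le \mu_1$, $|D|/s_D$ as a test statistic for $H : \mu_1 = \mu_2$, and more generally $(D - \Delta)/s_D$ to generate confidence procedures. Unfortunately the distribution of $(D - \Delta)/s_D$ depends on $\sigma_1^2/\sigma_2^2$ for fixed $n_1$, $n_2$. For large $n_1$, $n_2$, by Slutsky's theorem and the central limit theorem, $(D - \Delta)/s_D$ has approximately a standard normal distribution (Problem 5.3.28). For small and moderate $n_1$, $n_2$ an approximation to the distribution of $(D - \Delta)/s_D$ due to Welch (1949) works well.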
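Both the known-variance likelihood ratio and Welch's approximation are easy to compute. The sketch below is illustrative only: the function names and data are hypothetical, and the degrees of freedom use the standard Welch form $k = [c^2/(n_1-1) + (1-c)^2/(n_2-1)]^{-1}$ with $c = (s_1^2/n_1)/s_D^2$. It also checks numerically that, for known variances, $\log\lambda = D^2/2\sigma_D^2$, which is why the test is equivalent to $|D|/\sigma_D$.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def welch(xs, ys):
    """Return D = ybar - xbar, s_D, and Welch's approximate degrees of freedom k."""
    n1, n2 = len(xs), len(ys)
    v1, v2 = sample_variance(xs) / n1, sample_variance(ys) / n2
    s_d2 = v1 + v2                                   # unbiased estimate of Var(D)
    c = v1 / s_d2
    k = 1.0 / (c ** 2 / (n1 - 1) + (1.0 - c) ** 2 / (n2 - 1))
    return mean(ys) - mean(xs), math.sqrt(s_d2), k

def log_lr_known_variances(xs, ys, var1, var2):
    """log lambda for H: mu1 = mu2 when sigma_1^2 = var1 and sigma_2^2 = var2 are known."""
    n1, n2 = len(xs), len(ys)
    gamma = var1 / var2
    xbar, ybar = mean(xs), mean(ys)
    mu_hat = (n1 * xbar + gamma * n2 * ybar) / (n1 + gamma * n2)   # restricted MLE
    return (n1 * (mu_hat - xbar) ** 2 / (2.0 * var1)
            + n2 * (mu_hat - ybar) ** 2 / (2.0 * var2))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical samples
ys = [2.0, 3.0, 4.0, 5.0, 6.0]
d, s_d, k = welch(xs, ys)
sigma_d2 = 2.0 / len(xs) + 0.5 / len(ys)             # Var(D) for var1 = 2, var2 = 0.5
print(d, s_d, k)
print(log_lr_known_variances(xs, ys, 2.0, 0.5), d ** 2 / (2.0 * sigma_d2))
```

With equal sample sizes and equal sample variances, $c = \frac{1}{2}$ and $k$ reduces to $2(n_1 - 1) = n - 2$, the pooled two-sample degrees of freedom.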
9. u~ > O. the maximum error in size being bounded by 0. which works well if the variances are equal or nl = n2. mice. Wang (1971) has shown the approximation to be very good for Q: = 0.) are sampled from a population and two numerical characteristics are measured on each case. then we end up with a bivariate random sample (XI. (Xn' Yn).9.5 likelihood Ratio Procedures for Bivariate Normal Distributions If n subjects (persons.3. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the BehrensFisher problem.J1.u~. The LR procedure derived in Section 4.. ur Testing Independence.266 Testing and Confidence Regions Chapter 4 Let c = Sf/nlsb' Then Welch's approximation is Tk where k c2 [ nl 1 (1 C)2]1 + n2 .1. Some familiar examples are: X test score on English exam. Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X.3. Y = age at death. fields.3. . Note that Welch's solution works whether the variances are equal or not. X percentage of fat in score on mathematics exam. uI 4.28.01.2 . TwoSided Tests . etc. X = average cigarette consumption per day in grams. .003. If we have a sample as before and assume the bivariate normal model for (X. N(j1. with > 0.3 and Problem 5. . X test tistical studies.05 and Q: 0. Y). Y = blood pressure. Confidence Intervals for p The question "Are two random variables X and Y independent?" arises in many staweight. Y) have a joint bivariate normal distribution.. our problem becomes that of testing H : p O. can unfortunately be very misleading if =1= u~ and nl =1= n2. Yd. Y diet. Y cholesterol level in blood. the critical value is obtained by linear interpolation in the t tables or using computer software.1 When k is not an integer. See Figure 5. machines. p).
The unrestricted maximum likelihood estimate $\hat\theta$ was given in Problem 2.3.13 as $(\bar x, \bar y, \hat\sigma_1^2, \hat\sigma_2^2, \hat\rho)$, where

$$\hat\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2, \qquad \hat\sigma_2^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar y)^2, \qquad \hat\rho = \left[\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)\right]\Big/ n\hat\sigma_1\hat\sigma_2.$$

$\hat\rho$ is called the sample correlation coefficient and satisfies $-1 \le \hat\rho \le 1$ (Problem 2.1.8). When $\rho = 0$ we have two independent samples, and $\hat\theta_0$ can be obtained by separately maximizing the likelihood of $X_1, \ldots, X_n$ and that of $Y_1, \ldots, Y_n$. We have $\hat\theta_0 = (\bar x, \bar y, \hat\sigma_1^2, \hat\sigma_2^2, 0)$ and the log of the likelihood ratio statistic becomes

$$\log \lambda(x) = -\frac{n}{2}\log(1 - \hat\rho^2). \qquad (4.9.4)$$

Thus, $\log \lambda(x)$ is an increasing function of $\hat\rho^2$, and the likelihood ratio tests reject $H$ for large values of $|\hat\rho|$.

To obtain critical values we need the distribution of $\hat\rho$ or an equivalent statistic under $H$. Now, $\hat\rho$ is unchanged if each $X_i$ is replaced by $U_i$ and each $Y_i$ by $V_i$, where $U_i = (X_i - \mu_1)/\sigma_1$, $V_i = (Y_i - \mu_2)/\sigma_2$. Because $(U_1, V_1), \ldots, (U_n, V_n)$ is a sample from the $N(0, 0, 1, 1, \rho)$ distribution, the distribution of $\hat\rho$ depends on $\rho$ only. If $\rho = 0$, then by Problem B.4.8,

$$T_n = \frac{\sqrt{n - 2}\,\hat\rho}{\sqrt{1 - \hat\rho^2}} \qquad (4.9.5)$$

has a $\mathcal{T}_{n-2}$ distribution. Because $|T_n|$ is an increasing function of $|\hat\rho|$, the two-sided likelihood ratio tests can be based on $|T_n|$ and the critical values obtained from Table II.

When $\rho = 0$, the distribution of $\hat\rho$ is available on computer packages. There is no simple form for the distribution of $\hat\rho$ (or $T_n$) when $\rho \neq 0$. A normal approximation is available. See Example 5.3.6.

Qualitatively, for any $\alpha$, the power function of the LR test is symmetric about $\rho = 0$ and increases continuously from $\alpha$ to 1 as $\rho$ goes from 0 to 1. Therefore, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
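The quantities used by the two-sided test can be computed directly from paired data. The sketch below evaluates the sample correlation coefficient $\hat\rho$, the statistic $T_n$ of (4.9.5), and $\log\lambda = -\frac{n}{2}\log(1 - \hat\rho^2)$. The function names and the toy data are hypothetical and serve only as an illustration; critical values would still come from $t$ tables or software.

```python
import math


def sample_correlation(x, y):
    """Sample correlation coefficient rho-hat of the pairs (x_i, y_i)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    # The factors of n in the MLE form cancel, leaving this ratio.
    return sxy / math.sqrt(sxx * syy)


def t_statistic(rho_hat, n):
    """T_n = sqrt(n - 2) * rho_hat / sqrt(1 - rho_hat^2)."""
    return math.sqrt(n - 2) * rho_hat / math.sqrt(1.0 - rho_hat ** 2)


def log_likelihood_ratio(rho_hat, n):
    """log lambda = -(n / 2) * log(1 - rho_hat^2), increasing in rho_hat^2."""
    return -0.5 * n * math.log(1.0 - rho_hat ** 2)


# Hypothetical toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
rho_hat = sample_correlation(x, y)
t_n = t_statistic(rho_hat, len(x))
```

Under $H : \rho = 0$ the value `t_n` would be compared with quantiles of the $t$ distribution with $n - 2$ degrees of freedom.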
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood, we would test $H : \rho = 0$ versus $K : \rho > 0$ or $H : \rho \le 0$ versus $K : \rho > 0$. It can be shown that $\hat\rho$ is equivalent to the likelihood ratio statistic for testing $H : \rho \le 0$ versus $K : \rho > 0$ and similarly that $-\hat\rho$ corresponds to the likelihood ratio statistic for $H : \rho \ge 0$ versus $K : \rho < 0$. We can show that $P_\rho[\hat\rho \ge c]$ is an increasing function of $\rho$ for fixed $c$ (Problem 4.9.15). Therefore, we obtain size $\alpha$ tests for each of these hypotheses by setting the critical value so that the probability of type I error is $\alpha$ when $\rho = 0$. The power functions of these tests are monotone.

Confidence Bounds and Intervals

Usually testing independence is not enough and we want bounds and intervals for $\rho$ giving us an indication of what departure from independence is present. To obtain lower confidence bounds, we can start by constructing size $\alpha$ likelihood ratio tests of $H : \rho = \rho_0$ versus $K : \rho > \rho_0$. These tests can be shown to be of the form "Accept if, and only if, $\hat\rho \le c(\rho_0)$" where $P_{\rho_0}[\hat\rho \le c(\rho_0)] = 1 - \alpha$. We obtain $c(\rho)$ either from computer software or by the approximation of Chapter 5. Because $c$ can be shown to be monotone increasing in $\rho$, inversion of this family of tests leads to level $1 - \alpha$ lower confidence bounds. We can similarly obtain $1 - \alpha$ upper confidence bounds and, by putting two level $1 - \frac{1}{2}\alpha$ bounds together, we obtain a commonly used confidence interval for $\rho$. These intervals do not correspond to the inversion of the size $\alpha$ LR tests of $H : \rho = \rho_0$ versus $K : \rho \neq \rho_0$ but rather of the "equal-tailed" test that rejects if, and only if, $\hat\rho \ge d(\rho_0)$ or $\hat\rho \le c(\rho_0)$ where $P_{\rho_0}[\hat\rho \le d(\rho_0)] = P_{\rho_0}[\hat\rho \ge c(\rho_0)] = 1 - \frac{1}{2}\alpha$. However, for large $n$ the equal tails and LR confidence intervals approximately coincide with each other.

Data Example

As an illustration, consider the following bivariate sample of weights $x_i$ of young rats at a certain age and the weight increase $y_i$ during the following week. We want to know whether there is a correlation between the initial weights and the weight increase and formulate the hypothesis $H : \rho = 0$.

(Table of the observed pairs $(x_i, y_i)$.)

Here $\hat\rho = 0.18$ and $T_n = 0.48$. Thus, by using the two-sided test and referring to the tables, we find that there is no evidence of correlation: the $p$-value is bigger than 0.75.

Summary. The likelihood ratio test statistic $\lambda$ is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. We find the likelihood ratio tests and associated confidence procedures for four classical normal models:
(1) Matched pair experiments in which differences are modeled as $N(\mu, \sigma^2)$ and we test the hypothesis that the mean difference $\mu$ is zero. The likelihood ratio test is equivalent to the one-sample (Student) $t$ test.

(2) Two-sample experiments in which two independent samples are modeled as coming from $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ populations, respectively. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the two-sample (Student) $t$ test.

(3) Two-sample experiments in which two independent samples are modeled as coming from $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ populations, respectively. When $\sigma_1^2$ and $\sigma_2^2$ are known, the likelihood ratio test is equivalent to the test based on $|\bar Y - \bar X|$. When $\sigma_1^2$ and $\sigma_2^2$ are unknown, we use $(\bar Y - \bar X)/s_D$, where $s_D$ is an estimate of the standard deviation of $D = \bar Y - \bar X$. Approximate critical values are obtained using Welch's $t$ distribution approximation.

(4) Bivariate sampling experiments in which we have two measurements $X$ and $Y$ on each case in a sample of $n$ cases. We test the hypothesis that $X$ and $Y$ are independent and find that the likelihood ratio test is equivalent to the test based on $|\hat\rho|$, where $\hat\rho$ is the sample correlation coefficient. We also find that the likelihood ratio statistic is equivalent to a $t$ statistic with $n - 2$ degrees of freedom.

4.10 PROBLEMS AND COMPLEMENTS

Problems for Section 4.1

1. Suppose that $X_1, \ldots, X_n$ are independently and identically distributed according to the uniform distribution $U(0, \theta)$. Let $M_n = \max(X_1, \ldots, X_n)$ and let $\delta_c(X) = 1$ if $M_n \ge c$, $= 0$ otherwise.

(a) Compute the power function of $\delta_c$ and show that it is a monotone increasing function of $\theta$.

(b) In testing $H : \theta \le \frac{1}{2}$ versus $K : \theta > \frac{1}{2}$, what choice of $c$ would make $\delta_c$ have size exactly 0.05?

(c) Draw a rough graph of the power function of $\delta_c$ specified in (b) when $n = 20$.

(d) How large should $n$ be so that the $\delta_c$ specified in (b) has power 0.98 for $\theta = \frac{3}{4}$?

(e) If in a sample of size $n = 20$, $M_n = 0.48$, what is the $p$-value?

2. Let $X_1, \ldots, X_n$ denote the times in days to failure of $n$ similar pieces of equipment. Assume the model where $X = (X_1, \ldots, X_n)$ is an $\mathcal{E}(\lambda)$ sample. Consider the hypothesis $H$ that the mean life $1/\lambda = \mu \le \mu_0$.
(a) Use the result of Problem B.3.4 to show that the test with critical region

$$[\bar X \ge \mu_0 x(1 - \alpha)/2n],$$

where $x(1 - \alpha)$ is the $(1 - \alpha)$th quantile of the $\chi^2_{2n}$ distribution, is a size $\alpha$ test.

(b) Give an expression of the power in terms of the $\chi^2_{2n}$ distribution.

(c) Use the central limit theorem to show that $\Phi[(\mu_0 z(\alpha)/\mu) + \sqrt{n}(\mu - \mu_0)/\mu]$ is an approximation to the power of the test in part (a). Draw a graph of the approximate power function.
Hint: Approximate the critical region by $[\bar X \ge \mu_0(1 + z(1 - \alpha)/\sqrt{n})]$.

(d) The following are days until failure of air monitors at a nuclear plant. If $\mu_0 = 25$, give a normal approximation to the significance probability.

Days until failure: 3 150 40 34 32 37 34 2 31 6 5 14 150 27 4 6 27 10 30 37

Is $H$ rejected at level $\alpha = 0.05$?

3. Let $X_1, \ldots, X_n$ be a $\mathcal{P}(\theta)$ sample.

(a) Use the MLE $\bar X$ of $\theta$ to construct a level $\alpha$ test for $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$.

(b) Show that the power function of your test is increasing in $\theta$.

(c) Give an approximate expression for the critical value if $n$ is large and $\theta$ not too close to 0 or $\infty$. (Use the central limit theorem.)

4. Let $X_1, \ldots, X_n$ be a sample from a population with the Rayleigh density

$$f(x, \theta) = (x/\theta^2)\exp\{-x^2/2\theta^2\}, \quad x > 0, \ \theta > 0.$$

(a) Construct a test of $H : \theta = 1$ versus $K : \theta > 1$ with approximate size $\alpha$ using a complete sufficient statistic for this model.
Hint: Use the central limit theorem for the critical value.

(b) Check that your test statistic has greater expected value under $K$ than under $H$.

5. Show that if $H$ is simple and the test statistic $T$ has a continuous distribution, then the $p$-value $\alpha(T)$ has a uniform, $U(0, 1)$, distribution.
Hint: See Problem B.2.12.

6. Suppose that $T_1, \ldots, T_r$ are independent test statistics for the same simple $H$ and that each $T_j$ has a continuous distribution, $j = 1, \ldots, r$. Let $\alpha(T_j)$ denote the $p$-value for $T_j$, $j = 1, \ldots, r$. Show that, under $H$, $\tilde T = -2\sum_{j=1}^{r} \log \alpha(T_j)$ has a $\chi^2_{2r}$ distribution.
Hint: See Problem B.3.4.

7. Establish (4.1.3). Assume that $F_0$ and $F$ are continuous.
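The combined statistic $\tilde T = -2\sum_j \log \alpha(T_j)$ of Problem 6 can be evaluated numerically without tables, because for even degrees of freedom the upper tail of the chi-square distribution has the closed form $P(\chi^2_{2r} > t) = e^{-t/2}\sum_{k=0}^{r-1}(t/2)^k/k!$. The sketch below uses that identity; the function names and the example $p$-values are hypothetical, and independence of the $p$-values is assumed as in the problem.

```python
import math


def fisher_statistic(p_values):
    """T-tilde = -2 * sum of log p-values."""
    return -2.0 * sum(math.log(p) for p in p_values)


def chi2_even_df_upper_tail(t, r):
    """P(chi-square with 2r degrees of freedom > t), via the Erlang tail sum."""
    half = t / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(r))


def combined_p_value(p_values):
    """p-value of the combined test, valid for independent continuous tests."""
    return chi2_even_df_upper_tail(fisher_statistic(p_values), len(p_values))


# Hypothetical p-values from two independent tests of the same H.
combined = combined_p_value([0.05, 0.05])
```

With a single $p$-value the combined $p$-value reduces to that $p$-value, which gives a quick sanity check of the implementation.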
8. (a) Show that the power $P_F[D_n \ge k_\alpha]$ of the Kolmogorov test is bounded below by

$$\sup_x P_F[|\hat F(x) - F_0(x)| \ge k_\alpha].$$

Hint: $D_n \ge |\hat F(x) - F_0(x)|$ for each $x$.

(b) Suppose $F_0$ is $N(0, 1)$ and $F(x) = (1 + \exp(-x/\tau))^{-1}$ where $\tau = \sqrt{3}/\pi$ is chosen so that $\int_{-\infty}^{\infty} x^2\,dF(x) = 1$. (This is the logistic distribution with mean zero and variance 1.) Evaluate the bound $P_F(|\hat F(x) - F_0(x)| \ge k_\alpha)$ for $\alpha = 0.10$, $n = 80$ and $x = 0.5$, 1, and 1.5 using the normal approximation to the binomial distribution of $n\hat F(x)$ and the approximate critical value in Example 4.1.5.

(c) Show that if $F$ and $F_0$ are continuous and $F \neq F_0$, then the power of the Kolmogorov test tends to 1 as $n \to \infty$.

9. In Example 4.1.5, let $\psi(u)$ be a function from $(0, 1)$ to $(0, \infty)$, and let $\alpha > 0$. Define the statistics

$$S_{\psi,\alpha} = \sup_x \psi(F_0(x))|\hat F(x) - F_0(x)|^\alpha$$

$$T_{\psi,\alpha} = \sup_x \psi(\hat F(x))|\hat F(x) - F_0(x)|^\alpha$$

$$U_{\psi,\alpha} = \int \psi(F_0(x))|\hat F(x) - F_0(x)|^\alpha\,dF_0(x)$$

$$V_{\psi,\alpha} = \int \psi(\hat F(x))|\hat F(x) - F_0(x)|^\alpha\,d\hat F(x).$$

(a) For each of these statistics show that the distribution under $H$ does not depend on $F_0$.

(b) When $\psi(u) = 1$ and $\alpha = 2$, $V_{\psi,\alpha}$ is called the Cramér-von Mises statistic. Express the Cramér-von Mises statistic as a sum.

(c) Are any of the four statistics in (a) invariant under location and scale? (See Problem 4.1.10.)

10. (a) Show that the statistic $T_n$ of Example 4.1.6 is invariant under location and scale. That is, if $X_i' = (X_i - a)/b$, $b > 0$, then $T_n(X') = T_n(X)$.

(b) Use part (a) to conclude that $\mathcal{L}_{N(\mu,\sigma^2)}(T_n) = \mathcal{L}_{N(0,1)}(T_n)$.

11. Let $X_1, \ldots, X_n$ be i.i.d. with distribution function $F$ and consider $H : F = F_0$. Suppose that the distribution $\mathcal{L}_0$ of the statistic $T = T(X)$ is continuous under $H$ and that $H$ is rejected for large values of $T$. Let $T^{(1)}, \ldots, T^{(B)}$ be $B$ independent Monte Carlo simulated values of $T$. (In practice these can be obtained by drawing $B$ independent samples $X^{(1)}, \ldots, X^{(B)}$ from $F_0$ on the computer and computing $T^{(j)} = T(X^{(j)})$, $j = 1, \ldots, B$. Here, to get $X$ with distribution $F_0$, generate a $U(0, 1)$ variable on the computer and set $X = F_0^{-1}(U)$ as in Problem B.2.12(b).) Next let $T_{(1)}, \ldots, T_{(B+1)}$ denote $T, T^{(1)}, \ldots, T^{(B)}$ ordered. Show that the test that rejects $H$ iff $T > T_{(B+1-m)}$ has level $\alpha = m/(B + 1)$.
Hint: If $H$ is true, $T(X), T(X^{(1)}), \ldots, T(X^{(B)})$ is a sample of size $B + 1$ from $\mathcal{L}_0$. Use the fact that $T(X)$ is equally likely to be any particular order statistic.
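The Monte Carlo test of Problem 11 is straightforward to simulate. The sketch below implements the rejection rule $T > T_{(B+1-m)}$ and estimates its level empirically. The function names are hypothetical, and the null distribution of $T$ is taken to be $U(0, 1)$ purely as a stand-in; any continuous distribution would serve, since only the ranks matter.

```python
import random


def monte_carlo_rejects(t_obs, simulated, m):
    """Reject H iff T > T_(B+1-m) among the B + 1 ordered values."""
    b = len(simulated)
    ordered = sorted(simulated + [t_obs])
    # ordered[b - m] is the order statistic T_(B+1-m) (1-based index B+1-m).
    return t_obs > ordered[b - m]


def estimate_level(b, m, reps, seed=0):
    """Fraction of rejections when T and the simulated values share one law."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        t_obs = rng.random()  # stand-in for T(X) under H (assumed U(0, 1))
        simulated = [rng.random() for _ in range(b)]
        rejections += monte_carlo_rejects(t_obs, simulated, m)
    return rejections / reps


# With B = 19 and m = 1 the nominal level is m / (B + 1) = 0.05.
level = estimate_level(19, 1, 20000, seed=1)
```

Choosing $B$ so that $m/(B + 1)$ equals the desired $\alpha$ exactly (for example $B = 19$, $m = 1$ for $\alpha = 0.05$) avoids any need for interpolation between order statistics.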
12. Expected $p$-values. Consider a test with critical region of the form $\{T \ge c\}$ for testing $H : \theta = \theta_0$ versus $K : \theta > \theta_0$. Without loss of generality, take $\theta_0 = 0$. Suppose that $T$ has a continuous distribution $F_\theta$, then the $p$-value is

$$U = 1 - F_0(T).$$

(a) Show that if the test has level $\alpha$, the power is

$$\beta(\theta) = P(U \le \alpha) = 1 - F_\theta(F_0^{-1}(1 - \alpha))$$

where $F_0^{-1}(u) = \inf\{t : F_0(t) \ge u\}$.

(b) Define the expected $p$-value as $EPV(\theta) = E_\theta U$. Let $T_0$ denote a random variable with distribution $F_0$, which is independent of $T$. Show that $EPV(\theta) = P(T_0 \ge T)$.
Hint: $P(T_0 \ge T) = \int P(T_0 \ge t \mid T = t) f_\theta(t)\,dt$ where $f_\theta(t)$ is the density of $F_\theta(t)$.

(c) Suppose that for each $\alpha \in (0, 1)$, the UMP test is of the form $1\{T \ge c\}$. Show that the $EPV(\theta)$ for $1\{T \ge c\}$ is uniformly minimal in $\theta > 0$ when compared to the $EPV(\theta)$ for any other test.
Hint: $P(T \le t_0 \mid T_0 = t_0)$ is 1 minus the power of a test with critical value $t_0$.

(d) Consider the problem of testing $H : \mu = \mu_0$ versus $K : \mu > \mu_0$ on the basis of the $N(\mu, \sigma^2)$ sample $X_1, \ldots, X_n$, where $\sigma$ is known. Let $T = \bar X - \mu_0$ and $\theta = \mu - \mu_0$. Show that $EPV(\theta) = \Phi(-\sqrt{n}\,\theta/\sqrt{2}\,\sigma)$, where $\Phi$ denotes the standard normal distribution. (For a recent review of expected $p$-values see Sackrowitz and Samuel-Cahn, 1999.)

Problems for Section 4.2

1. Consider Examples 3.3.2 and 4.2.1. You want to buy one of two systems. One has signal-to-noise ratio $v/\sigma_0 = 2$, the other has $v/\sigma_0 = 1$. The first system costs $\$10^6$, the other $\$10^5$. One second of transmission on either system costs $\$10^3$ each. Whichever system you buy during the year, you intend to test the satellite 100 times. If each time you test, you want the number of seconds of response sufficient to ensure that both probabilities of error are $\le 0.05$, which system is cheaper on the basis of a year's operation?

2. Consider a population with three kinds of individuals labeled 1, 2, and 3 occurring in the Hardy-Weinberg proportions $f(1, \theta) = \theta^2$, $f(2, \theta) = 2\theta(1 - \theta)$, $f(3, \theta) = (1 - \theta)^2$. For a sample $X_1, \ldots, X_n$ from this population, let $N_1$, $N_2$, and $N_3$ denote the number of $X_j$ equal to 1, 2, and 3, respectively. Let $0 < \theta_0 < \theta_1 < 1$.

(a) Show that $L(x, \theta_0, \theta_1)$ is an increasing function of $2N_1 + N_2$.

(b) Show that if $c > 0$ and $\alpha \in (0, 1)$ satisfy $P_{\theta_0}[2N_1 + N_2 \ge c] = \alpha$, then the test that rejects $H$ if, and only if, $2N_1 + N_2 \ge c$ is MP for testing $H : \theta = \theta_0$ versus $K : \theta = \theta_1$.

3. A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time, 5 about 14% of the time, whereas the other
four numbers are equally likely to occur (i.e., with probability .17). Upon being asked to play, the gambler asks that he first be allowed to test his hypothesis by tossing the die $n$ times.

(a) What test statistic should he use if the only alternative he considers is that the die is fair?

(b) Show that if $n = 2$ the most powerful level .0196 test rejects if, and only if, two 5's are obtained.

(c) Using the fact that if $(N_1, \ldots, N_k) \sim \mathcal{M}(n, \theta_1, \ldots, \theta_k)$, then $a_1 N_1 + \cdots + a_k N_k$ has approximately a $N(n\mu, n\sigma^2)$ distribution, where $\mu = \sum_{j=1}^{k} a_j\theta_j$ and $\sigma^2 = \sum_{j=1}^{k} \theta_j(a_j - \mu)^2$, find an approximation to the critical value of the MP level $\alpha$ test for this problem.

4. A formulation of goodness of tests specifies that a test is best if the maximum probability of error (of either type) is as small as possible.

(a) Show that if in testing $H : \theta = \theta_0$ versus $K : \theta = \theta_1$ there exists a critical value $c$ such that

$$P_{\theta_0}[L(X, \theta_0, \theta_1) \ge c] = 1 - P_{\theta_1}[L(X, \theta_0, \theta_1) \ge c]$$

then the likelihood ratio test with critical value $c$ is best in this sense.

(b) Find the test that is best in this sense for Example 4.2.1.

5. A newly discovered skull has cranial measurements $(X, Y)$ known to be distributed either (as in population 0) according to $N(0, 0, 1, 1, 0.6)$ or (as in population 1) according to $N(1, 1, 1, 1, 0.6)$ where all parameters are known. Find a statistic $T(X, Y)$ and a critical value $c$ such that if we use the classification rule, $(X, Y)$ belongs to population 1 if $T \ge c$, and to population 0 if $T < c$, then the maximum of the two probabilities of misclassification $P_0[T \ge c]$, $P_1[T < c]$ is as small as possible.
Hint: Use Problem 4.2.4 and recall (Proposition B.4.2) that linear combinations of bivariate normal random variables are normally distributed.

6. Show that if randomization is permitted, MP-sized $\alpha$ likelihood ratio tests with $0 < \alpha < 1$ have power nondecreasing in the sample size.

7. Prove Corollary 4.2.1.
Hint: The MP test has power at least that of the test with test function $\delta(x) = \alpha$.

8. For $0 < \alpha < 1$, prove Theorem 4.2.1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4.2.1.

9. In Example 4.2.2, derive the UMP test defined by (4.2.7).

10. In Example 4.2.2, if $\Delta_0 = (1, 0, \ldots, 0)^T$ and $\Sigma_0 \neq I$, find the MP test for testing $H : \theta = \theta_0$ versus $K : \theta = \theta_1$.
. Use the normal approximation to the critical value and the probability of rejection.4.<»/2>'0 where X2n(1. • • " 5.1.Xn is a sample from a Weibull distribution with density f(x. that the Xi are a sample from a Poisson distribution with parameter 8.<» is the (1 .3. You want to ensure that if the arrival rate is < 10..3.95 at the alternative value 1/>'1 = 15. (c) Suppose 1/>'0 ~ 12. In Example 4. (a) Show that L~ K: 1/>. .i . but if the arrival rate is > 15. a model often used (see Barlow and Proschan. ]f the equipment is subject to wear. How many days must you observe to ensure that the UMP test of Problem 4. A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and. 1965) is the one where Xl.• n. Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days. i 1 .3. hence.&(>').<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function.: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 . x > O. I i .) 3. Here c is a known positive constant and A > 0 is the parameter of interest. Let Xl. Hint: Show that Xf . Suppose that if 8 < 8 0 it is not worth keeping the counter open. 274 Problems for Section 4. 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! .Xn be the times in months until failure of n similar pieces of equipment. the probability of your deciding to close is also < 0. show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function. i' i . (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2. Consider the foregoing situation of Problem 4. . .. (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00.01.0)"'/[1. 
Find the sample size needed for a level 0. A) = c AcxC1e>'x . > 1/>'0.3 Testing and Confidence Regions Chapter 4 1. Show that if X I.01 lest to have power at least 0... the expected number of arrivals per day. r p(x. the probability of your deciding to stay open is < 0. I i . 0) = ( : ) 0'(1.Xn is a sample from a truncated binomial distribution with • I. .01. .(IOn x = 1•.. • I i • 4. ..1 achieves this? (Use the normal approximation... . . .
•• of Li.. (b) NabeyaMiura Alternative. Y n be the Ll. then P(Y < y) = 1 e  .[1.) It follows that Fisher's method for cQmbining pvalues (see 4. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b). which is independent of XI. we derived the model G(y. . > 1. survival times of a sample of patients receiving an experimental treatment. . For the purpose of modeling. A > O.... y > 0.Fo(y)l".6. X 2. P().1. against F(u)=u B.. > po. Hint: Use the results of Theorem 1.). and consider the alternative with distribution function F(x.d.tive.10 Problems and Complements 275 then 2. where Xl_ a isthe (1 . and let YI . 00 < a < b < 00.1.12. In Problem 1. Suppose that each Xi has the Pareto density f(x. (a) Lehmann Alte~p. To test whether the new treatment is beneficial we test H : ~ < 1 versus K : . X 2 .2 to find the mean and variance of the optimal test statistic.. . .6) is UMP for testing that the pvalues are uniformly distributed. then e.~) = 1.d. survival times with distribution Fa. .O<B<1. 1 ' Y > 0.Q)th quantile of the X~n distribution. A > O. Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a.1. In the goodnessoffit Example 4.O) = cBBx(l+BI. suppose that Fo(x) has a nonzero density on some interval (a. X N e>. .5. 8. Let Xl.B) = Fi!(x)."" X n denote the incomes of n persons chosen at random from a certain population.6. imagine a sequence X I.:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ . 6. Find the UMP test. X N }). Let N be a zerotruncated Poisson.XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0.. (Ii) Show that if we model the distribotion of Y as C(min{X I .. (i) Show that if we model the distribution of Y as C(max{X I . Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo. Assume that Fo has ~ density fo.Fo(y)  }). Show how to find critical values. 7. (See Problem 4.  .. x> c where 8 > 1 and c > O. 
(a) Express mean income JL in terms of e. b).Section 4. . 0 < B < 1. random variable.1.8 > 80 . (b) Find the optimal test statistic for testing H : JL = JLo versus K . ~ > O.6... .
..L and unknown variance (T2. 12. every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t.X)2. x > 0.. e e at l · 6.1).. = .' (cf. Let Xl.. F(x) = 1 . Show that under the assumptions of Theorem 4. Let Xl.. > Er • . Problem 2. . x > 0. Oo)d.' L(x.(O)/ 16' I • 00 p(x..{Oo} = 1 .3. Using a pivot based on 1 (Xi . 0) e9Fo (Y) Fo(Y).3. 'i . I Hint: Consider the class of all Bayes tests of H : () = . Problems for Section 4. eo versus K : B = fh where I i I : 11. Assume that F o has a density !o(Y). The numerator is an increasing function of T(x).0". 1 • .3.  1 j . 1 . 1 X n be a sample from a normal population with unknown mean J. 0. 00 p(x. (a) Show how to construct level (1 .01.j "'" .2..(O) «) L > 1 i The lefthand side equals f6". What would you .0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi ..4 1. J J • 9.(O) 6 . Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1.2.exp( _x B).. Show that the test is not UMP. n announce as your level (1 .0 e6 1 0~0. I .1 and 01 loss..276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y.. Show that under the assumptions of Theorem 4.0) UeB for '.(O) L':x.2 the class of all Bayes tests is complete. j 1 To see whether the new treatment is beneficial.6t is UMP for testing H : 8 80 versus K : 8 < 80 . . We want to test whether F is exponential. . Hint: A Bayes test rejects (accepts) H if J . the denominator decreasing.d. Let Xi (f)/2)t~ + €i. i = 1.exp( x).O)d. .i. . L(x.X)' = 16. 10. Oo)d. F(x) = 1 . Show that under the assumptions of Theorem 4. we test H : {} < 0 versus K : {} > O. with distribution function F(x). O)d. n. 1 X n be i.. where the €i are independent normal random variables with mean 0 and known variance .'? = 2. .52. 0 = 0. Show that the UMP test is based on the statistic L~ I Fo(Y. . 0. or Weibull.' " 2. /..{Od varies between 0 and 1. B> O...).
1] if 8 > 0.(al + (2)) confidence interval for q(8)... with X . . has a 1. minI ii./N...) Hint: Use (A.4. Let Xl. Then take N . there is a shorter interval 7... < 0. . what values should we use for the t! so as to make our interval as short as possible for given a? 3..(2) " . .1.1.c.al).Xn be as in Problem 4. Compare yonr result to the n needed for (4. then [q(X).IN] . X!)/~i 1tl ofB. Show that if Xl. find a fixed length level (b) ItO < t i < I. but we may otherwise choose the t! freely... Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a.X are ij. ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0.a) confidence interval for B. with N being the smallest integer greater than no and greater than or equal to [Sot".a.) LCB and q(X) is a level (1. then the shortest level n (1 .Section 4.4..4.(2) UCB for q(8).n .SOt no l (1. 6. Suppose we want to select a sample size N such that the interval (4. where c is chosen so that the confidence level under the assumed value of 17 2 is 1 . .02. Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji. i = 1. Use calculus.1) based on n = N observations has length at most l for some preassigned length l = 2d. .z(1 .q(X)] is a level (1 .3 we know that 8 < O.d. 0.~a)!. (Define the interval arbitrarily if q > q.1 Show that. . (a) Justify the interval [ii.3). It follows that (1 . N(Ji" 17 2) and al + a2 < a. X + z(1 . Show that if q(X) is a level (1 .4. . although N is random. X+SOt no l (1.2.10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 . Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2.(X) = X .] .IN.a. distribution. where 8. I")/so.a) interval of the fonn [ X .7). 5.1. [0. Suppose that in Example 4. < a.4.3).01 [X .1)] if 8 given by (4. 
What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4.IN(X  = I:[' lX.1. t.n is obtained by taking at = Q2 = a/2 (assume 172 known).. 0.l. of the form Ji. n..~a) /d]2 .~a)/ . Stein's (1945) twostage procedure is the following.no further observations.
it is necessary to take at least Z2 (1 (12/ d 2 observations.  10.O)]l < z (1. X has a N(J1.) I ~i .) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi. ...~ a) upper and lower bounds. (a) If all parameters are unknown. use the result in part (a) to compute an approximate level 0. 1947.jn(X .~a)].95 confidence interval for (J. Hence. in order to have a level (1 . 11. (12.051 (c) Suppose that n (12 = 5.3.1. Show that these two quadruples are each sufficient. (1  ~a) / 2. 1986.4.3. (12 2 9. Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment. .0') for Jt of length at most 2d. X n1 and Yll . i (b) What would be the minimum sample size in part (a) if (). . " 1 a) confidence . = 0.O). 4().3) are indeed approximate level (1 .18) to show that sin 1( v'X)±z (1 .a:) confidence interval of length at most 2d wheri 0'2 is known. (The sticky point of this approach is that we have no control over N.6. The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook. (a) Show that X interval for 8.. (c) If (12.jn is an approximate level (1  • 12./3z (1. 0. (a) Use (A. v.jn is an approximate level (b) If n = 100 and X = 0. . Show that the endpoints of the approximate level (1 . Suppose that it is known that 0 < ~. 0) and X = 81n. T 2 I I' " I'.!a)/ 4. (b) Exhibit a level (1 . 1"2. 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 ... By Theorem B.l4. . (J'2 jk) distribution. sn. we may very likely be forced to take a prohibitively large number of observations.4. . and the fundamental monograph of Wald. I .. is _ independent of X no ' Because N depends only on sno' given N = k. 0") distribution and is independent of sn" 8.: " are known.. .1') has a N(o. Y n2 be two independent samples from N(IJ. > z2 (1  is not known exactly.a) confidence interval for (1/1'). Show that ~(). 
if (J' is large.0)/[0(1.a) confidence interval for sin 1 (v'e). (12) and N(1/.) (lr!d observations are necessary to achieve the aim of part (a). exhibit a fixed length level (1 .. Hint: Set up an inequality for the length and solve for n. Let 8 ~ B(n. ±. find ML estimates of 11.001. and.fN(X . Let XI. but we are sure that (12 < (It.a) interval defined by (4. respectively. d = i II: .. Let 8 ~ B(n.a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a).. Indicate what tables you wdtdd need to calculate the interval.') populations. (a) Show that in Problem 4. I 1 j Hint: [O(X) < OJ = [.
3). Now the result follows from Theorem 8.".)4 < 00 and that'" = Var[(X.2. Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed. See A. XI 16.IUI.9 confidence region for (It.<.4. K.2 is known as the kurtosis coefficient. 1) random variables.J1.3. is replaced by its MOM estimate. cr 2 ) population. In Example 4.) can be approximated by x(" 1) "" (n .30. Compute the (K. Hint: Let t = tn~l (1 .J1.D andn(X{L)2/ cr 2 '" = r (4. (In practice. ~). Hint: By B. 4) are independent. and (b) (4.)/01' ~ (J1.2.1 and. (c) Suppose Xi has a X~ distribution.' 1 (Xi . Show that the confidence coefficient of the rectangle of Example 4.9 confidence interval for cr.3.9 confidence interval for {L. Compare them to the approximate interval given in part (a).' = 2:.)' .J1. . .9 confidence interval for {L x= + cr.(n . 10. it equals O.n(X . and the central limit theorem as given in Appendix A. Z. Assuming that the sample is from a /V({L. Now use the law of large numbers.I) + V2( n1) t z( "1) and x(1 . and X2 = X n _ I (1 . :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution. See Problem 5.".7). (n _1)8'/0' ~ X~l = r(~(nl).2. give a 95% confidence interval for the true proportion of cures 8 using (al (4. 100. 15.1) + V2( n . Find the limit of the distribution ofn.3. Slutsky's theorem. find (a) A level 0. (a) Show that x( "d and x{1.4.1) t z{1 . then BytheproofofTheoremB. known) confidence intervals of part (b) when k = I."') . (c) A level 0.4. If S "' 8(64. 0). K.)'. (d) A level 0.) Hint: (n .4 ) .4/0.10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0.3. Hint: Use Problem 8.1. = 3. In the case where Xi is normal. Suppose that 25 measurements on the breaking strength of a certain alloy yield 11.J1. 14.4 and the fact that X~ = r (k.t ([en . .I). Xl = Xn_l na).5 is (1 _ ~a) 2.Q confidence interval for cr 2 .027 13.1)8'/0') . 
V(o') can be written as a sum of squares of n 1 independentN(O. and 10 000.4.n} and use this distribution to find an approximate 1 .4. (b) A level 0.). but assume that J1.~Q).ia).3.8).' Now use the central limit theorem.4 = E(Xi .Section 4.1 is known.
indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4.i..11 .1. In this case nF(l') = #[X..4.6 with .7.4. Consider Example 4.F(x)l. That is.l).1 (b) V1'(x)[l. X n are i. we want a level (1 .4. (a) For 0 < a < b < I.u) confidence interval for 1". In Example 4.1.9) and the upper boundary I" = 00. pl(O) < X <Fl(b)}. (a) Show that I" • j j j = fa F(x)dx + f. " • j . 19.I I 280 Testing and Confidence Regions Chapter 4 17.F(x)1 is the approximate pivot given in Example 4. (c) For testing H o : F = Fo with F o continuous.d. I (b) Using Example 4.1'(x)] Show that for F continuous. • i. b) for some 00 < a < 0 < b < 00.a) confidence interval for F(x). verify the lower bounary I" given by (4. Assume that f(t) > 0 ifft E (a. i 1 .3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I . as X and that X has density f(t) = F' (t).r fixed..F(x)1 .4.u. . . 18. 1'1(0) < X < 1'.F(x)1 )F(x)[1 . I' .I Bn(F) = sup vnl1'(x) . .3 for deriving a confidence interval for B = F(x). find a level (I . distribution function.95. Suppose Xl. . < xl has a binomial distribotion and ~ vnl1'(x) . !i .F(x)1 Typical choices of a and bare .4.F(x)]dx. define . U(O. define An(F) = sup {vnl1'(X) . )F(x)[l. (b)ForO<o<b< I. Show that for F continuous where U denotes the uniform. It follows that the binomial confidence intervals for B in Example 4.~u) by the value U a determined by Pu(An(U) < u) = 1.6.05 and .
3. .2n2 distribution.90? 4. Hint: Use the results of Problems B. is of level a for testing H : 0 > 00 . X 2 be independent N( 01. then the test that accepts.3.13 show that the power (3(0 1 . 1/ n J is the shortest such confidence interval. Yn2 be independent exponential E( 8) and £(.Xn1 and YI .1 O? 2. N( O .(2) together.a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 . (a) Find c such that Oc of Problem 4.10 Problems and Complements 281 Problems for Section 4. O ) is an increasing 2 function of + B~. with confidence coefficient 1 .th quantile of the .2 that the tests of H : . M n /0. "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function.· (a) If 1(0. = 1 versns K : Il.\. Experience has shown that the exponential assumption is warranted.0.4.) samples.2 are level 0.) denotes the o. Or .1"2nl.Section 4.2). respectively.4 and 8.12 and B.. for H : (12 > (15. 3. of l. (d) Show that [Mn .1. (e) Suppose that n = 16.. if and only if 8(X) > 00 .. respectively. Yf (1 .a) UCB for 0. Show that if 8(X) is a level (1 . (e) Similarly derive the level (1 . How small must an alternative before the size 0. = 1 rejected at level a 0. lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 . [8(X) < 0] :J [8(X) < 00 ].3.. X2) = 1 if and only if Xl + Xi > c.05. (a) Deduce from Problem 4.1 has size a for H : 0 (}"2 be < 00 . . test given in part (a) has power 0. show that [Y f( ~a) I X. a (b) Show that the test with acceptance region [f(~a) for testing H : Il. ~ 01).~a) I Xl is a confidence interval for Il. 5. < XI? < f(1.. x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il. (b) Derive the level (1. .. Give a 90% confidence interval for the ratio ~ of mean life times. and let Il.5. What value of c givessizea? (b) Using Problems B.a) LCB corresponding to Oc of parr (a). 
and consider the prob2 + 8~ > 0 when (}"2 is known.~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant.2).3.2 = VCBs of Example 4. Let Xl. .5 1. . Hint: If 0> 00 ... . Let Xl.al) and (1 . (15 = 1. = 0.5.a..
 j 1 j . >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large. 1 Ct ~r (b) Find the family aftests corresponding to the level (1.2).2). We are given for each possible value 1]0 of1] a level 0: test o(X. X. Show that the conclusions of (aHf) still hold.8) where () and fare unknown. .a) confidence region for the parameter ry and con versely that any level (1 . (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?.282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2). but I(t) = I( t) for all t. (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F. (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing . . N( 020g. (e) Show directly that P. ..~n + ~z(l . O. Suppose () = ('T}. O?. . I . T) where 1] is a parameter of interest and T is a nuisance parameter.0:) LeB for 8 whatever be f satisfying our conditions. we have a location parameter family.a)y'n. (c) Show that Ok(o) (X. I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~. Thus. Let C(X) = (ry : o(X. . L:J ~ ( . (a) Shnw that C(X) is a level (l . l 7.a. 0. Show that PIX(k) < 0 < X(nk+l)] ~ . . and 1 is continuous and positive.0:) confidence interval for 11.. Let X l1 .0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses.00 .4. lif (tllXi > OJ) ootherwise. X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 . k .1 when (72 is unknown.IX(j) < 0 < X(k)] do not depend on 1 nr I. ry) = O}. 6. . ).[XU) (f) Suppose that a = 2(n') < OJ and P. 0. Hint: (c) X.Xn be a sample from a population with density f (t . of Example 4. (b) The sign test of H versus K is given by. respectively. og are independentN (0. 'TJo) of the composite hypothesis H : ry = ryo.
Then the level (1 . ' + P2 l' z(1 1 .1. 00) is = 0'. 1).10 Problems and Complements 283 8. the quantity ~ = 8(8) .z(l.z(1 . Hint: You may use the result of Problem 4. (b) Describe the confidence region obtained by inverting the family (J(X. p)} as in Problem 4.a) = (b) Show that 0 if X < z(1 .Y. Let X ~ N(O. hence. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'. Y. Let P (p.20) if 0 < 0 and.a) .5. Y ~ N(r/. laifO>O <I>(z(1 .a))2 if X > z(1 . Let a(S.2a) confidence interval for 0 of Example 4. 1) and q(O) the ray (X . This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line. 9. Yare independent and X ~ N(v. Show that as 0 ranges from O(S) to 8(S).a.pYI < (1 1 otherwise. Establish (iii) and (iv) of Example 4. Thus. 10. Po) is a size a test of H : p = Po.00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4. 11. p. or).5. 0) ranges from a to a value no smaller than 1 .Section 4.[q(X) < 0 2 ] = 1. 8( S)] be the exact level (1 . Show that the Student t interval (4. ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry. and let () = (ry.O ~ J(X. .a). Define = vl1J.1) is unbiased. (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X . 1).3 and let [O(S).2.2 a) (a) Show that J(X. alll/. Y.5.a) confidence interval [ry(x).a). Note that the region is not necessarily an interval or ray.p) = = 0 ifiX . 12. Suppose X. 7)). Let 1] denote a parameter of interest. let T denote a nuisance parameter.5.5. a( S. that suP.2. if 00 < O(S) (S is inconsistent with H : 0 = 00 ).7. That is.7.
. (b) The quantile sign test Ok of H versus K has critical region {x : L:~ 1 1 [Xi > 0] > k}.1 and Fu 1. F(x). and F+(x) be as in Examples 4. Let Xl. 1 .. That is. k+ ' i . Construct the interval using F. We can proceed as follows.a. Show that P(X(k) < x p < X(nl)) = 1.4.95 could be the 95th percentile of the salaries in a certain profession l or lOOx .7. . il. iI .p).. Thus.284 Testing and Confidence Regions Chapter 4 13. .p) variable and choose k and I such that 1 . lOOx. Then   P(P(x) < F(x) < P+(x)) for all x E (a. ! .b) (a) Show that this statement is equivalent to = 1.a. P(xp < x p < xp for all p E (0.5. . where ! 1 1 heal c: n(1  p) + zl_QVnp(1  pl.6 and 4. k(a) . I • (c) Let x· be a specified number with 0 < F(x') < 1. (See Section 3. vF(xp) [1 F(xp)] " Hint: Note that F(x p ) = p.1» . (t) Show that k and 1in part (e) can be approximated by h (~a) and h (1  ~a) where ! heal is given in part (b).a) LeB for x p whatever be f satisfying our conditions.p) versus K' : P(X > 0) > (1 .p)"j.a) confidence interval for x p whatever be F satisfying our conditions.a = P(k < S < n I + 1) = pi(1. Confidence Regions for QlIalltiles. X n be a sample from a population with continuous distribution F. Let F.. Simultaneous Confidence Regions for Quantiles. (e) Let S denote a B(n.05 could be the fifth percentile of the duration time for a certain disease. L. Suppose we want a distributionfree confidence region for x p valid for all 0 < p < 1.h(a). (a) Show that testing H : x p < 0 versus I< : x p > 0 is equivalent to testing H' : P(X > 0) < (I .) Suppose that p is specified. That is. " II [I .. . be the pth quantile of F. (X(k)l X(n_I») is a level (1 .F(x p)]. i j (d) Deduce that X(n_k(Q)+I) (XU) is the jth order statistic of the sample) is a level (1 . Show that the interval in parts (e) and (0 can be derived from the pivot T(x ) p  . . Detennine the smallest value k = k(a) such that Jk(Q) has level a for H and show that for n large. Show that Jk(X I x'. 0 < p < 1. . 
In Problem 13 preceding we gave a disstributionfree confidence interval for the pth quantile x p for p fixed. Let x p = ~ IFI (p) + Fi! I (P)].. = ynIP(xp) . X n x*) is a level a test for testing H : x p < x* versus K : x p > x*. 14. it is distribution free. (g) Let F(x) denote the empirical distribution.4.
p. where p ~ F(x). Give a distributionfree level (1 .X n .Section 4.T: a <.. . 15. Express the band in terms of critical values for An(F) and the order statistics..Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}.(x) = F ~(Fx(x» .) < p} and.r" ~ inf{x: a < x < b. . TbeaitemativeisthatF_x(t) IF(t) forsomet E R.1. VF(p) = ![x p + xlpl. then L i==l n IIXi < X + t. . . the desired confidence region is the band consisting of the collection of intervals {[.13(g) preceding. We will write Fx for F when we need to distinguish it from the distribution F_ x of X.r < b. n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X.. F f (.X I. (b) Express x p and . where F u and F t .xp]: 0 < l' < I}.10 Problems and Complements 285 where x" ~ SUp{.z.x. X n and . .d.17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p .u are the empirical distributions of U and 1 .. LetFx and F x be the empirical distributions based on the i. Suppose that X has the continuous distribution F.. Hint: Let t. The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R. I). F(x) > pl.i.U with ~ U(O.4.(x)] < Fx(x)] = nF1_u(Fx(x)). (c) Show how the statistic An(F) of Problem 4. X I..1. that is... ..r p in terms of the critical value of the Kolmogorov statistic and the order statistics. (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X.j < F_x(x)] See also Example 4. Suppose X denotes the difference between responses after a subject has been given treatments A and B. F~x) has the same distribution Show that if F x is continuous and H holds. That is. L i==I n I IFx (Xi) . where A is a placebo. Note the similarity to the interval in Problem 4. ~ ~ (a) Consider the test statistic x .5. then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U ).
Give a distributionfree level (1...F1 _u).i. As in Example 1. .F x l (l .. F y ). Fx. <aJ Show that if H holds. then D(Fx..p)] = ~[Fxl(p) + F~l:(p)]. It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H.) ~ < F x (")] ~ ~ = nFu(Fx(x)). he a location parameter as defined in Problem 3.1..Fy ) = maxIFy(t) 'ER "" .a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(.) is a location parameter} of all location parameter values at F.x = 2VF(F(x)). the probability is (1 . we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R. vt = O<p<l sup vt(p) O<p<l where [V. F y ) .l (p) ~ ~[Fxl(p) . .17.. Moreover. treatment A (placebo) responses and let Y1 . (b) Consider the parameter tJ p ( F x. i I f F.. Hint: Let A(x) = FyI (Fx(x)) .a) simultaneous confidence band 15..t::.F~x. then D(Fx .) < Fy(x p )] = nFv (Fx (t)) under H. where do:. I I I 1 1 D(Fx.5.Fy ): 0 < p < 1}. We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y .d. nFx(x) we set ~ 2.Jt: 0 < p < 1) for the curve (5 p (Fx.:~ II[Fx(X.. Hint: Define H by H.) < Fx(t)] = nFu(Fx(t))..x p .1) empirical distributions. Let t (c) A Distdbution and ParameterFree Confidence Interval.286 Testing and Confidence Regions ~ Chapter 4 '1. Hint: nFx(t) = 2. where x p and YP are the pth quan tiles of F x and Fy. let Xl. vt(p)) is the band in part (b). i=I n 1 i t . is the nth quantile of the distribution of D(Fu l F 1 .2A(x) ~ HI(F(x)) l 1 + vF(F(x)). Fx(t)l· . .". The result now follows from the properties of B(·). treatment B responses.d. It follows that if c.. we get a distributionfree level (1 .. Let 8(. To test the hypothesis H that the two treatments are equally effective.:~ I1IFx(X. where F u and F v are independent U(O.:~ 1 1 [Fy(Y. .. Show that for given F E F.) = D(Fu .x (': + A(x)).x. Then H is symmetric about zero.) < Fx(x)) = nFv(Fx(x)). i . = 1 i 16. nFy(t) = 2. 
and by solving D( Fx. Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ . ..(x) ~ R. "" = Yp . for ~. where :F is the class of distribution functions with finite support.a) simultaneous confidence band for A(x) = F~l:(Fx(x)) . (p).(F(x)) .) < do. Properties of this and other bands are given by Doksum.U ).. < F y I (Fx(x))] i=I n = L: l[Fy (Y.3. Fenstad and Aaberge (1977). then  nFy(x + A(X)) = L: I[Y.'" 1 X n be Li. ~ ~ ~ F~x.) : :F VF ~ inf vF(p). 1 Yn be i. Also note that x = H. F y ) has the same distribntion as D(Fu .
) .. Show that B = (2 L X.a) that the interval [8. B(.Fy).0:) confidence bound forO. Show that for given (Fx .10 Problems and Complements 287 Moreover. Fy ).6.2. Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('... Q. t.(Fx . Show that n p is known and 0 O' ~ (2 2~A X. Exhibit the UMA level (1 . . then by solving D(F x.L. F y ) is in [6. then B( Fx . ~ D(Fu •F v ).) 12:7 I if. where is nnknown. Show that if 0(·.(P). the probability is (1 . T n n = (22:7 I X. Fx +a) = O(Fx a.3..) < do: for Ll. where :F is the class of distri butions with finite support. ) is a shift parameter} of the values of all shift parameters at (Fx .)I L ti . if I' = 1I>'. = 22:7 I Xf/X2n( a) is . Let6.6 1. . F y ) > O(Fx " Fy).T. 0 < p < 1. F x ) ~ a and YI > Y. F y ) and b _maxo<p<l 6p(Fx . F y. F y ) < O(Fx . F y ) E F x F. then Y' = Y.6+] contains the shift parameter set {O(Fx. moreover X + 6 < Y' < X + 6. Let do denote a size a critical value for D(Fu .2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8. Fv ). x' 7 R.. tt Hint: Both 0 and 8* are nonnally distributed.·) is a shift parameter. :F x :F O(Fx. 6. . ~ ~ = (e) A Distribution and ParameterFree Confidence Interval.('. (d) Show that E(Y) . nFx ("') = nFu(Fx(x)).Section 4.(X).) I i=l L tt .0:) simultaneouS coofidence band for t.4..). 6+ maxO<p<1 6+(p). .F y ' (p) = 6. Let 6 = minO<1'<16p(Fx. is called a shift parameter if O(Fx. Fy ) . (a) Consider the model of Problem 4. (b) Consider the unbiased estimate of O.) = F (p) .4. Now apply the axioms. It follows that if we set F\~•. Suppose X I.. thea It a unifonnly most accurate level 1 .Xn is a sample from a r (p. O(Fx. t Hint: Set Y' = X + t.    Problems for Section 4.E(X). we find a distributionfree level (1 .(x)) .a) UCB for O.6]. ~) distribution.).2uynz(1 i=! i=l all L i=l n ti is also a level (1. F y.(x) = Fy(x thea D(Fx. and 8 are shift parameters. Show that for the model of Problem 4. Fy.(. 
(c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T. 3.0: UCB for J. 2..= minO<p<l 6. F y )..) ~ + t. Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976).
Suppose [B'.1.\.6. Establish the following result due to Pratt (1961).288 Testing and Confidence Regions Chapter 4 4.Xn are Ll. 0). .6 to V = (B .2 and B. G corresponding to densities j. p.7 1 .' with "> ~. density = B. s). (0 .3. OJ.\ =. Hint: Apply Problem 4. [B. j . Hint: By Problem B.d.12 so that F. J • 6. respectively. • .aj LCB such that I . 1. 2.3 and 4. /3( T.\ is distributed as V /050.4.W < BI = 7. where s = 80+ 2n and W ~ X..i. uniform.d. > 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I .. Prove Corollary 4..2. I'. Poisson. . i . Show that if F(x) < G(x) for all x and E(U).B(n. .a. Suppose that given B Pareto..f:s F.B')+. Hint: Use Examples 4.O] are two level (1 . 3. then E(U) > E(V). n = 1.1.. ~. and for (). s). U = (8 .2" Hint: See Sections 8. P(>. Pa( c.(O . .3.Bj has the F distribution F 2r ."" X n are i. .B) = J== J<>'= p( s.B)+. 0']. and that B has the = se' (t. distribution and that B has beta. distribution with r and s positive integers. e> 0.X.2.B).0 upper and lower confidence bounds for J1 in the model of Problem 4.B). B). X has a binomial..6. .6 for c fixed.l2(b). where Show that if (B.B') < E. F1(tjdt.3. Construct unifoffilly most accurate level 1 . '" J.2. U(O. 5. E(U) = . ·'"1 1 .3). V be random variables with d.a) confidence intervals such that Po[8' < B' < 0'] < p. (B" 0') have joint densities. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for. Suppose that B' is a uniformly most accurate level (I .. Xl. . (b) Suppose that given B = B. f3( T.C.[B < u < O]du.6. 1I"(t) Xl. g. satisfying the conditions of Problem 8. .l. t)dsdt = J<>'= P. In Example 4. . Let U. s > 0. then )" = sB(r(1 . where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t..1 are well defined and strictly increasing.a) upper and lower credible bounds for A. (1: dU) pes.E(V) are finite. 8. . Problems for Section 4.( 0' Hint: E. Suppose that given.6. 
t) is the joint density of (8. s) distribution with T and s integers. t > e. (a) Show that if 8 has a beta. • • • = t) is distributed as W( s.W < B' < OJ for allB' l' B.4. then E. establish that the UMP test has acceptance region (4.) and that.
Problems for Section 4. ". 'P) is the N(x ..Xn ).8 1.y. respectively.. Hint: p( 0 I x.y. Xl... . (a) Let So = E(x. y) and that the joint density of.s') withe' = max{c.a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B. Yn are two independent N (111 \ 'T) and N(112 1 'T) samples. T) = (111.Xn is observable and X n + 1 is to be predicted. .a) upper and lower credible bounds for O.2. r In) density.rlm) density and 7f(/12 1r.r)p(y 1/12. Show that (8 = S IM ~ Tn) ~ Pa(c'. Show fonnaUy that the posteriof?r(O 1 x. (c) Give a level (1 . = PI ~ /12 and'T is 7f(Ll.1. y). r( TnI (c) Set s2 + n..10 Problems and Complements 289 (a)LetM = Sf max{Xj.y. and Y1. y) is aN(y.r 1x. 'T).. Here Xl. r 1x. r) where 7f(Ll I x .x) is aN(x. Hint: 7f(Ll 1x.112.y) = 7f(r I sO)7f(LlI x .. . where (75 is known.l. y) of is (Student) t with m + n .r). (a) Give a level (1 . N(J. .Xm. (b) Show that given r..x)1f(/1.. (b) Find level (1 . Suppose that given 8 = (ttl' J... . r > O. .y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2. y) is obtained by integrating out Tin 7f(Ll.2 degrees of freedom. as X. y) is proportional to p(8)p(x 1/11.0"5).i.Xn + 1 be i.Section 4..x)' + E(Yj y)'. (d) Compare the level (1. /11 and /12 are independent in the posterior distribution p(O X.0:) prediction interval for Xnt 1..1.m} and + n. In particular consider the credible bounds as n + 00.d.. . 1 r. . 1f(/11 1r. 4. . (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll.y) is 1f(r 180)1f(/11 1 r. = So I (m + n  2).. Show thatthe posterior distribution 7f( t 1 x.1 )) distribution. Let Xl. Suppose 8 has the improper priof?r(O) = l/r. .0:) confidence interval for B.
as X . q(ylx)= ( .Xn are observable and we want to predict Xn+l.' .8. Suppose XI..Q) distribution free lower and upper prediction bounds for X n + 1 • < 00. (3(r. .11.(0 I x)dO. give level (I . " u(n+l). ! " • Suppose Xl. n = 100.5. I x) is sometimes called the Polya q(y I x) = J p(y I 0). suppose Xl. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I .. X n such that P(Y < Y) > 1. Find the probability that the Bayesian interval covers the true mean j. • .i. Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 . give level 3.a). X is a binomial.15. distribution...2n distribution. (This q(y distribution...5.05.nl.a) lower and upper prediction bounds for X n + 1 . :[ = . X n+ l are i.8. ..'" . Suppose that Y.·) denotes the beta function.290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4. Un + 1 ordered. Establish (4.i.8.s+nx+my)/B(r+x.. That is.3) by doing a frequentist computation of the probability of coverage.. Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X . 0'5)' Take 0'5 ~ 7 2 = I.a (P(Y < Y) > I . . x > 0.t for M = 5.e. . 0) distribution given (J = 8.a) prediction interval for X n + l . give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a.2) hy using the observation that Un + l is equally likely to be any of the values U(I).. Let X have a binomial. ..d. < u(n+l) denote Ul . A level (1 .. 0). and that (J has a beta. distribution... F. N(Jl. which is not observable.10.Xn are observable and X n + 1 is to be predicted. ~o = 10. B( n.9.. as X where X has the exponential distribution F(x I 0) =I . . b).0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl. i! Problems for Section 4. . .d.• . )B(r+x+y. let U(l) < . and a = .12.. 1 2. 4. 00 < a < b (1 . s). CJ2) with (J2 unknown.. has a B(m. In Example 4..'.x . . 
Suppose that given (J = 0.Xn + 1 be i. B(n. where Xl. 1 X n are i. . Present the results in a table and a graph.10.0'5) with 0'5 known..d. . 5. Give a level (1 . (b) If F is N(p" bounds for X n + 1 .s+n.x ) where B(·. 0 > 0. Then the level of the frequentist interval is 95%.9 1. . (a) If F is N(Jl. Let Xl. random variable.2.9.) Hint: First show that .8).i.8.
0' 4. (i) F(c.f. Show that (a) Likelihood ratio tests are of the form: Reject if.~Q) also approximately satisfy (i) and (ii) of part (a). In Problems 24. = Xnl (1 . and only if. We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if.~Q) approximately satisfy (i) and also (ii) in the sense that the ratio . · . n Cj I L.3. (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B.. .='7"nlog c l n /C2n CIn .C2n !' 1 as n !' 00.. Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise. and only if.( x) ~ 0. log .~Q) n + V2nz(l.Q.(x) = . Xn_1 (~a).(x) Hint: Show that for x < !n. (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12.. 0'2) sample with both JL and 0'2 unknown./(n . where Tn is the t statistic.) . (ii) CI ~ C2 = n logcl/C2. (b) Use the nonnal approximatioh to check that CIn C'n n . ~2 .(Xi . = aD nO' 1". of the X~I distribution.(x) = 0. ° 3.n) and >.. 2 2 L.10 Problems and Complements 291 is an increasing function of (2x . >..3. Hint: log .. let X 1l . TwoSided Tests for Scale...Section 4.(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy. XnI (1 . if Tn < and = (n/2) log(1 + T. if ii' /175 < 1 and = .X) aD· t=l n > C. OneSided Tests for Scale. In testing H : It < 110 versus K . 0'2 < O'~ versus K : 0'2 > O'~.I)) for Tn > 0.log (ii' / (5)1 otherwise.F(CI) ~ 1. Thus.Xn be a N(/L.V2nz(1 . We want to test H . I .Q).(nx). JL > Po show that the onesided. where F is the d. (c) Deduce that the critical values of the commonly used equaltailed test. onesample t test is the likelihood ratio test (fo~ Q <S ~).. 2.
5. The following blood pressures were obtained in a sample of size $n = 5$ from a certain population: 124, 110, 114, 100, 190. Assume the one-sample normal model.

(a) Using the size $\alpha = 0.05$ one-sample $t$ test, can we conclude that the mean blood pressure in the population is significantly larger than 100?

(b) Compute a level 0.95 confidence interval for $\sigma^2$ corresponding to inversion of the equal-tailed tests of Problem 4.9.4.

(c) Compute a level 0.90 confidence interval for the mean blood pressure $\mu$.

6. Let $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ be two independent $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ samples, respectively.

(a) Show that the MLE of $\theta = (\mu_1, \mu_2, \sigma^2)$ is $(\bar X, \bar Y, \tilde\sigma^2)$, where $\tilde\sigma^2$ is as defined in Section 4.9.3.

(b) Consider the problem of testing $H : \mu_1 \le \mu_2$ versus $K : \mu_1 > \mu_2$. Assume $\alpha \le \frac{1}{2}$. Show that the likelihood ratio statistic is equivalent to the two-sample $t$ statistic $T$.

(c) Using the normal approximation $\Phi\big(z(\alpha) + \sqrt{n_1 n_2 / n}\,(\mu_1 - \mu_2)/\sigma\big)$ to the power, find the sample size $n$ needed for the level 0.01 test to have power 0.95 when $n_1 = n_2 = \frac{1}{2} n$ and $(\mu_1 - \mu_2)/\sigma = \frac{1}{2}$.

7. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall. The control measurements ($x$'s) correspond to 0 pounds of mulch per acre, whereas the treatment measurements ($y$'s) correspond to 500 pounds of mulch per acre. Forage production is also measured in pounds per acre.

x    794   1800    576    411    897
y   2012   2477   3498   2092   1808

Assume the two-sample normal model with equal variances.

(a) Find a level 0.95 confidence interval for $\mu_2 - \mu_1$.

(b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use $\alpha = 0.05$.

(c) Find a level 0.90 confidence interval for $\sigma$ by using the pivot $s^2/\sigma^2$.

8. Suppose $X$ has density $p(x, \theta)$, $\theta \in \Theta$, and that $T$ is sufficient for $\theta$. Show that $\lambda(X, \Theta_0, \Theta_1)$ depends on $X$ only through $T$.

9. The normally distributed random variables $X_1, \ldots, X_n$ are said to be serially correlated or to follow an autoregressive model if we can write
\[ X_i = \theta X_{i-1} + \epsilon_i, \quad i = 1, \ldots, n, \]
where $X_0 = 0$ and $\epsilon_1, \ldots, \epsilon_n$ are independent $N(0, \sigma^2)$ random variables.
Section 4.'" p(X.. (Tn.. Fix 0 < a < and <>/[2(1 .[T > tl is an increasing function of 6..6. II.OXi_d 2} i= 1 < Xi < 00. (T~). Define the frequency functions p(x. Y2 )dY2. Suppose that T has a noncentral t. (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl. (b) P. 13. P.and twosided t tests. Xo = o. distribution. X n1 . 0) by the following table.IIZI > tjvlk] is increasing in [61. has density !k . Stein). = 0. 7k. / L:: i 10. 1 {= xj(kl)ellx+(tVx/k'I'ldx. Show that the noncentral t distribution. with all parameters assumed unknown. From the joint distribution of Z and V. (a) P. XZ distributions respectively.2.n.O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if... Condition on V and apply the double expectation theorem. Yl. Hint: Let Z and V be independent and have N(o.. get the joint distribution of Yl = ZI vVlk and Y2 = V. . Yn2 be two independent N(P. and only if. samples from N(J1l.O) for 00 = (Z.(t) = .<» (1 . Then use py.. ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~. Let Xl... X powerful whatever be B. respectively. . for each v > 0. Then. P. J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint.2) In exp{ (I/Z(72) 2)Xi .. (Yl) = f PY"Y. Let e consist of the point I and the interval [0. .[Z > tVV/k1 is increasing in 6. The F Test for Equality of Scale. . . Show that. . has level a and is strictly more 11. Consider the following model. (Yl. I). 7k. 12. i = 1.6. The power functions of one. (An example due to C.10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl.IITI > tl is an increasing function of 161. .<»1 < e < <>.
i=2 i=2 n n S12 =L i=2 n U. using (4.). L Yj'. ! .0 has the same distribution as R. (11..61 310 2. distribution and that critical values can be .1)/(n.8.~ I .4. Let (XI. Yn) be a sampl~ from a bivariateN(O. F > /(1 . Consider the problem of testing H : p = 0 versus K : p i O.V. . (d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4.10. !I " . si = L v. where l _ .nt 1 distribution. . . Hint: Use the transfonnations and Problem B. . 1. Because this conditional distribution does not depend on (U21 •. 0. Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model.'.9.12 337 1. . I ' LX. . Un = Un. V. . p) distribution.1' Sf = L U.9.96 279 2. YI ).68 250 2. as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2. Finally..4. can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I .a/2) or F < f( n/2).4. Let R = S12/SIS" T = 2R/V1 R'.. where f( t) is the tth quantile of the F nz .. x y 254 2. <. . . Un). The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study. i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2. > C. that T is an increasing function of R. and using the arguments of Problems B. then P[p > c] is an increasing function of p for fixed c. Vn ) is a sample from a N(O. . . . vn 1 l i 16. . .1.9. (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where .4) and (4.' ! . i 0 in xi !:.7 and B. · .94 Using the likelihood ratio test for the bivariate nonnal model. Sbow. • . 15. . 1. (a) Show that the LR test of H : af = a~ versus K : ai > and only if.71 240 2.19 315 2.5) that 2 log '(X) V has a distribution.1I(a).1 . .4. a?.62 284 2. . . r> (c) Justify the twosided F test: Reject H if. and use Probl~1ll4.64 298 2. show that given Uz = Uz . 
and only if.X)' (b) Show that (aUa~)F has an F nz obtained from the :F table. 1 . p) distribution..J J .~ I i " 294 Testing and Confidence Regions Chapter 4 .24) implies that this is also the unconditional distribution. ..'. (Xn . 14..7 to conclude that tribution as 8 12 /81 8 2 .8.4. . (1~. where a.nl1 ar is of the fonn: Reject if. • I " and (U" V. (Un.1)]E(Y. note that . 0. p) distribution.Y)'/E(X.. the continuous versiop of (B.0 has the same dis r j . T has a noncentral Tnz distribution with noncentrality parameter p. Argue as in Proplem 4.37 384 2. F ~ [(nl .
17. Consider the bioequivalence example in Problem 3.2.9.

(a) Find the level $\alpha$ LR test for testing $H : \theta \in [-\epsilon, \epsilon]$ versus $K : \theta \notin [-\epsilon, \epsilon]$.

(b) Compare your solution to the Bayesian solution based on a continuous loss function given in Problem 3.2.9. Consider the cases $\eta_0 = 0$, $\tau_0^2 \to \infty$, and $\eta_0 = 0$, $n \to \infty$.

4.11 NOTES

Notes for Section 4.1

(1) The point of view usually taken in science is that of Karl Popper [1968]. Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding. Rejection is more definitive.

(2) We ignore at this time some real-life inadequacies of this experiment such as the placebo effect (see Example 1.1.3).

(3) A good approximation (Durbin, 1973; Stephens, 1974) to the critical value is $c_n(t) = t/(\sqrt{n} - 0.01 + 0.85/\sqrt{n})$, where $t = 1.035$, $0.895$, and $0.819$ for $\alpha = 0.01$, $0.05$, and $0.10$, respectively.

Notes for Section 4.3

(1) Such a class is sometimes called essentially complete. The term complete is then reserved for the class where strict inequality in (4.3.3) holds for some $\theta$ if $\varphi \notin \mathcal{D}$.

(2) The theory of complete and essentially complete families is developed in Wald (1950); see also Ferguson (1967). Essentially, if the parameter space is compact and loss functions are bounded, the class of Bayes procedures is complete. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete.

Notes for Section 4.4

(1) If the continuity correction discussed in Section A.15 is used here, $S$ in $\bar\theta(X)$ would be replaced by $S + \frac{1}{2}$, and $S$ in $\underline\theta(X)$ is replaced by $S - \frac{1}{2}$.

(2) In using $\underline\theta(S)$ as a confidence bound we are using the region $[\underline\theta(S), 1]$. Because the region contains $C(X)$, it also has confidence level $(1 - \alpha)$.

REFERENCES

BARLOW, R. AND F. PROSCHAN, Mathematical Theory of Reliability New York: J. Wiley & Sons, 1965.
BICKEL, P., E. HAMMEL, AND J. W. O'CONNELL, "Is there a sex bias in graduate admissions?" Science, 187, 398-404 (1975).
BOX, G. E. P., Apology for Ecumenism in Statistics and Scientific Inference, Data Analysis and Robustness, G. E. P. Box, T. Leonard, and C. F. Wu, Editors New York: Academic Press, 1983.
BROWN, L. D., T. CAI, AND A. DAS GUPTA, "Interval estimation for a binomial proportion," The American Statistician, 54 (2000).
DOKSUM, K. A. AND G. SIEVERS, "Plotting with confidence: Graphical comparisons of two populations," Biometrika, 63, 421-434 (1976).
DOKSUM, K. A., G. FENSTAD, AND R. AABERGE, "Plots and tests for symmetry," Biometrika, 64, 473-487 (1977).
DURBIN, J., "Distribution theory for tests based on the sample distribution function," Regional Conference Series in Applied Math., 9, SIAM, Philadelphia, Pennsylvania (1973).
FERGUSON, T., Mathematical Statistics. A Decision Theoretic Approach New York: Academic Press, 1967.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner Publishing Company, 1958.
HALD, A., Statistical Theory with Engineering Applications New York: J. Wiley & Sons, 1952.
HEDGES, L. V. AND I. OLKIN, Statistical Methods for Meta-Analysis Orlando, FL: Academic Press, 1985.
JEFFREYS, H., The Theory of Probability Oxford: Oxford University Press, 1961.
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
POPPER, K. R., Conjectures and Refutations; the Growth of Scientific Knowledge, 3rd ed. New York: Harper and Row, 1968.
PRATT, J., "Length of confidence intervals," J. Amer. Statist. Assoc., 56, 549-567 (1961).
SACKROWITZ, H. AND E. SAMUEL-CAHN, "P values as random variables - Expected P values," The American Statistician, 53, 326-331 (1999).
STEIN, C., "A two-sample test for a linear hypothesis whose power is independent of the variance," Ann. Math. Statist., 16, 243-258 (1945).
STEPHENS, M., "EDF statistics for goodness of fit," J. Amer. Statist. Assoc., 69, 730-737 (1974).
TATE, R. F. AND G. W. KLETT, "Optimal confidence intervals for the variance of a normal distribution," J. Amer. Statist. Assoc., 54, 674-682 (1959).
VAN ZWET, W. R. AND J. OSTERHOFF, "On the combination of independent test statistics," Ann. Math. Statist., 38, 659-680 (1967).
WALD, A., Sequential Analysis New York: Wiley, 1947.
WALD, A., Statistical Decision Functions New York: Wiley, 1950.
WANG, Y., "Probabilities of the type I errors of the Welch tests," J. Amer. Statist. Assoc., 66, 605-608 (1971).
WELCH, B., "Further notes on Mrs. Aspin's tables," Biometrika, 36, 243-246 (1949).
WETHERILL, G. B. AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
WILKS, S. S., Mathematical Statistics New York: J. Wiley & Sons, 1962.
Chapter 5

ASYMPTOTIC APPROXIMATIONS

5.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS

Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific $P$ by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample $X_1, \ldots, X_n$ from a distribution $F$, our setting for this section and most of this chapter. If we want to estimate $\mu(F) \equiv E_F X_1$ and use $\bar X$ we can write

$$MSE_F(\bar X) = \frac{\sigma^2(F)}{n}. \tag{5.1.1}$$

This is a highly informative formula, telling us exactly how the MSE behaves as a function of $n$, and calculable for any $F$ and all $n$ by a single one-dimensional integration. However, consider $\mathrm{med}(X_1, \ldots, X_n)$ as an estimate of the population median $\nu(F)$. If $n$ is odd, $\nu(F) = F^{-1}(\tfrac12)$, and $F$ has density $f$ we can write

$$MSE_F(\mathrm{med}(X_1, \ldots, X_n)) = \int_{-\infty}^{\infty} \left(x - F^{-1}(\tfrac12)\right)^2 g_n(x)\,dx \tag{5.1.2}$$

where, from Problem (B.2.13), if $n = 2k + 1$,

$$g_n(x) = n \binom{2k}{k} F^k(x)\,(1 - F(x))^k f(x). \tag{5.1.3}$$

Evaluation here requires only evaluation of $F$ and a one-dimensional integration, but a different one for each $n$ (Problem 5.1.1). Worse, the qualitative behavior of the risk as a function of $n$ and simple parameters of $F$ is not discernible easily from (5.1.2) and (5.1.3). To go one step further, consider evaluation of the power function of the one-sided $t$ test of Chapter 4. If $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ we have seen in Section 4.9.2 that $\sqrt{n}\,\bar X/S$ has a noncentral $t$ distribution with noncentrality parameter $\sqrt{n}\,\mu/\sigma$ and $n - 1$ degrees of freedom. This distribution may be evaluated by a two-dimensional integral using classical functions (Problem 5.1.2) and its qualitative properties are reasonably transparent. But suppose $F$ is not Gaussian. It seems impossible to determine explicitly what happens to the power function because the distribution of $\sqrt{n}\,\bar X/S$ requires the joint distribution of $(\bar X, S)$ and in general this is only representable as an $n$-dimensional integral,

$$P\left[\sqrt{n}\,\frac{\bar X}{S} \le t\right] = \int_A f(x_1) \cdots f(x_n)\,dx_1 \cdots dx_n$$

where

$$A = \left\{(x_1, \ldots, x_n) : \sum_{i=1}^n x_i \le \frac{\sqrt{n}\,t}{\sqrt{n-1}} \left(\sum_{i=1}^n x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^n x_i\Big)^2\right)^{1/2}\right\}.$$

There are two complementary approaches to these difficulties. The first, which occupies us for most of this chapter, is to approximate the risk function under study,

$$R_n(F) \equiv E_F\, l(F, \delta(X_1, \ldots, X_n)),$$

by a function $\tilde R_n(F)$ that is qualitatively simpler to understand and easier to compute. The other, which we explore further in later chapters, is to use the Monte Carlo method. In its simplest form, Monte Carlo is described as follows. Draw $B$ independent "samples" of size $n$, $\{X_{1j}, \ldots, X_{nj}\}$, $1 \le j \le B$, from $F$ using a random number generator and an explicit form for $F$. Approximately evaluate $R_n(F)$ by

$$\hat R_B = \frac{1}{B} \sum_{j=1}^{B} l(F, \delta(X_{1j}, \ldots, X_{nj})). \tag{5.1.4}$$

By the law of large numbers, as $B \to \infty$, $\hat R_B \stackrel{P}{\to} R_n(F)$. Thus, save for the possibility of a very unlikely event, just as in numerical integration, we can approximate $R_n(F)$ arbitrarily closely. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asymptotics briefly in Example 5.3.3.

Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or, more specifically, of distributions of statistics, based on observing $n$ i.i.d. observations $X_1, \ldots, X_n$ as $n \to \infty$. We shall see later that the scope of asymptotics is much greater, but for the time being let's stick to this case as we have until now.

Asymptotics, in this context, always refers to a sequence of statistics

$$\{T_n(X_1, \ldots, X_n)\}_{n \ge 1},$$

for instance the sequence of means $\{\bar X_n\}_{n \ge 1}$, where $\bar X_n = \frac{1}{n} \sum_{i=1}^n X_i$, or the sequence of medians, or it refers to the sequence of their distributions

$$\{\mathcal{L}_F(T_n(X_1, \ldots, X_n))\}_{n \ge 1}.$$

Asymptotic statements are always statements about the sequence. The classical examples are

$$\bar X_n \stackrel{P}{\to} E_F(X_1) \quad \text{or} \quad \mathcal{L}_F\left(\sqrt{n}(\bar X_n - E_F(X_1))\right) \to N(0, \mathrm{Var}_F(X_1)).$$
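As a rough illustration of the Monte Carlo recipe (5.1.4), the sketch below estimates the MSE of the sample median and compares it with a numerical evaluation of (5.1.2) and (5.1.3). The choice $F = N(0, 1)$ (so that $\nu(F) = 0$), the integration limits, the trapezoidal rule, and all function names are assumptions made only for this example; they are not part of the text.

```python
import math
import random
from statistics import NormalDist, median

STD_NORMAL = NormalDist()  # assumption: F = N(0, 1), chosen only for illustration


def median_density(x, n):
    """Density g_n(x) of the sample median as in (5.1.3), for odd n = 2k + 1."""
    k = (n - 1) // 2
    F = STD_NORMAL.cdf(x)
    return n * math.comb(2 * k, k) * F**k * (1.0 - F) ** k * STD_NORMAL.pdf(x)


def median_mse_by_integration(n, lo=-8.0, hi=8.0, steps=4000):
    """Trapezoidal approximation to (5.1.2); nu(F) = 0 for N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x * x * median_density(x, n)
    return total * h


def median_mse_by_monte_carlo(n, B, rng):
    """Monte Carlo estimate (5.1.4) with quadratic loss (d - nu(F))^2."""
    total = 0.0
    for _ in range(B):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += median(sample) ** 2
    return total / B


if __name__ == "__main__":
    rng = random.Random(0)
    n = 11
    print("integration:", median_mse_by_integration(n))
    print("Monte Carlo:", median_mse_by_monte_carlo(n, 20000, rng))
```

The two numbers should agree up to Monte Carlo error of order $B^{-1/2}$; a different $n$ requires a fresh integration and a fresh simulation, which is the point made in the text.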
In theory these limits say nothing about any particular $T_n(X_1, \ldots, X_n)$, but in practice we act as if they do because the $T_n(X_1, \ldots, X_n)$ we consider are closely related as functions of $n$, so that we expect the limit to approximate $T_n(X_1, \ldots, X_n)$ or $\mathcal{L}_F(T_n(X_1, \ldots, X_n))$ (in an appropriate sense). For instance, the weak law of large numbers tells us that, if $E_F|X_1| < \infty$, then

$$\bar X_n \stackrel{P}{\to} \mu \equiv E_F(X_1). \tag{5.1.5}$$

That is (see A.14.1),

$$P_F[|\bar X_n - \mu| \ge \epsilon] \to 0 \tag{5.1.6}$$

for all $\epsilon > 0$. We interpret this as saying that, for $n$ sufficiently large, $\bar X_n$ is approximately equal to its expectation. The trouble is that for any specified degree of approximation, say, $\epsilon = .01$, (5.1.6) does not tell us how large $n$ has to be for the chance of the approximation not holding to this degree (the left-hand side of (5.1.6)) to fall, say, below .01. Is $n \ge 100$ enough or does it have to be $n \ge 100{,}000$? Similarly, the central limit theorem tells us that if $E_F X_1^2 < \infty$, $\mu$ is as above and $\sigma^2 \equiv \mathrm{Var}_F(X_1)$, then

$$P_F\left[\sqrt{n}\,\frac{(\bar X_n - \mu)}{\sigma} \le z\right] \to \Phi(z) \tag{5.1.7}$$

where $\Phi$ is the standard normal d.f. As an approximation, this reads

$$P_F[\bar X_n \le x] \approx \Phi\left(\sqrt{n}\,\frac{(x - \mu)}{\sigma}\right). \tag{5.1.8}$$

Again we are faced with the questions of how good the approximation is for given $n$, $x$, and $P_F$. What we in principle prefer are bounds, which are available in the classical situations of (5.1.6) and (5.1.7). Thus, by Chebychev's inequality, if $E_F X_1^2 < \infty$,

$$P_F[|\bar X_n - \mu| \ge \epsilon] \le \frac{\sigma^2}{n\epsilon^2}. \tag{5.1.9}$$

As a bound this is typically far too conservative. For instance, if $|X_1| \le 1$, the much more delicate Hoeffding bound (B.9.6) gives

$$P_F[|\bar X_n - \mu| \ge \epsilon] \le 2 \exp\left\{-\tfrac12 n\epsilon^2\right\}. \tag{5.1.10}$$

Because $|X_1| \le 1$ implies that $\sigma^2 \le 1$ with $\sigma^2 = 1$ possible (Problem 5.1.3), the right-hand side of (5.1.9) when $\sigma^2$ is unknown becomes $1/n\epsilon^2$. For $\epsilon = .1$, $n = 400$, (5.1.9) is .25 whereas (5.1.10) is $2e^{-2} \approx .27$; the exponential rate of (5.1.10) takes over for larger $n$: for $\epsilon = .1$, $n = 1600$, (5.1.9) is .0625 whereas (5.1.10) is $2e^{-8} \approx .0007$. Further qualitative features of these bounds and relations to approximation (5.1.8) are given in Problem 5.1.4. Similarly, the celebrated Berry-Esseen bound (A.15.11) states that if $E_F|X_1|^3 < \infty$,

$$\sup_x \left| P_F\left[\sqrt{n}\,\frac{(\bar X_n - \mu)}{\sigma} \le x\right] - \Phi(x) \right| \le \frac{C\,E_F|X_1|^3}{\sigma^3 n^{1/2}} \tag{5.1.11}$$

where $C$ is a universal constant known to be $\le 33/4$. Although giving us some idea of how much (5.1.8) differs from the truth, (5.1.11) is again much too conservative generally.(1) The approximation (5.1.8) is typically much better than (5.1.11) suggests.

Bounds for the goodness of approximations have been available for $\bar X_n$ and its distribution to a much greater extent than for nonlinear statistics such as the median. Yet, as we have seen, even here they are not a very reliable guide. Practically one proceeds as follows:

(a) Asymptotic approximations are derived.

(b) Their validity for the given $n$ and $T_n$ for some plausible values of $F$ is tested by numerical integration if possible or Monte Carlo computation.

If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown $F$ generating the data may not be as good.

Asymptotics has another important function beyond suggesting numerical approximations for specific $n$ and $F$. If they are simple, asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. For instance, (5.1.7) says that the behavior of the distribution of $\bar X_n$ is for large $n$ governed (approximately) only by $\mu$ and $\sigma^2$ in a precise way, although the actual distribution depends on $P_F$ in a complicated way. It suggests that qualitatively the risk of $\bar X_n$ as an estimate of $\mu$, for any loss function of the form $l(F, d) = \lambda(|\mu - d|)$ where $\lambda(0) = 0$, $\lambda'(0) > 0$, behaves like $\lambda'(0)(\sigma/\sqrt{n})\sqrt{2/\pi}$ (Problem 5.1.5) and quite generally that risk increases with $\sigma$ and decreases with $n$, which is reasonable.

As we shall see, quite generally, good estimates $\hat\theta_n$ of parameters $\theta(F)$ will behave like $\bar X_n$ does in relation to $\mu$. The estimates $\hat\theta_n$ will be consistent, $\hat\theta_n \stackrel{P}{\to} \theta(F)$, for all $F$ in the model, and asymptotically normal,

$$\mathcal{L}_F\left(\frac{\sqrt{n}[\hat\theta_n - \theta(F)]}{\sigma(\theta, F)}\right) \to N(0, 1) \tag{5.1.12}$$

where $\sigma(\theta, F)$ typically is the standard deviation (SD) of $\sqrt{n}\,\hat\theta_n$ or an approximation to this SD. Consistency will be pursued in Section 5.2 and asymptotic normality via the delta method in Section 5.3. The qualitative implications of results such as these are very important when we consider comparisons between competing procedures. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo.

We now turn to specifics. Section 5.2 deals with consistency of various estimates including maximum likelihood. The arguments apply to vector-valued estimates of Euclidean parameters. In particular, consistency is proved for the estimates of canonical parameters in exponential families. Section 5.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and includes as an application asymptotic normality of the maximum likelihood estimate for one-parameter exponential families. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE $\hat\eta$ of the canonical parameter $\eta$ in exponential families, among other results.
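The probability bounds of this section are easy to tabulate. The sketch below evaluates the right-hand sides of (5.1.9) and (5.1.10) and the two-sided normal approximation implied by (5.1.8). It assumes $|X_1| \le 1$ and, for Chebychev, the worst case $\sigma^2 = 1$ unless another value is passed in; the function names and the capping of each bound at 1 are choices made for this illustration only.

```python
import math
from statistics import NormalDist


def chebyshev_bound(n, eps, sigma2=1.0):
    """Right-hand side of (5.1.9), sigma^2 / (n eps^2), capped at 1."""
    return min(1.0, sigma2 / (n * eps * eps))


def hoeffding_bound(n, eps):
    """Right-hand side of (5.1.10) for |X_1| <= 1, 2 exp(-n eps^2 / 2), capped at 1."""
    return min(1.0, 2.0 * math.exp(-0.5 * n * eps * eps))


def normal_approximation(n, eps, sigma=1.0):
    """Approximation to P[|Xbar_n - mu| >= eps] obtained from (5.1.8)."""
    z = math.sqrt(n) * eps / sigma
    return 2.0 * (1.0 - NormalDist().cdf(z))


if __name__ == "__main__":
    eps = 0.1
    for n in (100, 400, 1600, 6400):
        print(n, chebyshev_bound(n, eps), hoeffding_bound(n, eps),
              normal_approximation(n, eps))
```

Running the loop for a fixed $\epsilon$ shows the qualitative difference between a bound of order $1/n$ and one that decays exponentially in $n$, and how far both can sit above the normal approximation.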
Section 5.4 deals with optimality results for likelihood-based procedures in one-dimensional parameter models. Finally in Section 5.5 we examine the asymptotic behavior of Bayes procedures. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A.14, A.15, and B.7. We will recall relevant definitions from that appendix as we need them, but we shall use results we need from A.14, A.15, and B.7 without further discussion.

Summary. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to $\infty$. In practice, asymptotics are methods of approximating risks, distributions, and other statistical quantities that are not realistically computable in closed form, by quantities that can be so computed. Most asymptotic theory we consider leads to approximations that in the i.i.d. case become increasingly valid as the sample size increases. We also introduce Monte Carlo methods and discuss the interaction of asymptotics, Monte Carlo, and probability bounds.

5.2 CONSISTENCY

5.2.1 Plug-In Estimates and MLEs in Exponential Family Models

Suppose that we have a sample $X_1, \ldots, X_n$ from $P_\theta$ where $\theta \in \Theta$ and want to estimate a real or vector $q(\theta)$. The least we can ask of our estimate $\hat q_n(X_1, \ldots, X_n)$ is that as $n \to \infty$, $\hat q_n \stackrel{P_\theta}{\to} q(\theta)$ for all $\theta$. That is, in accordance with (A.14.1) and (B.7.1), for all $\theta \in \Theta$, $\epsilon > 0$,

$$P_\theta[|\hat q_n - q(\theta)| \ge \epsilon] \to 0 \tag{5.2.1}$$

where $|\cdot|$ denotes Euclidean distance. A stronger requirement is

$$\sup\{P_\theta[|\hat q_n - q(\theta)| \ge \epsilon] : \theta \in \Theta\} \to 0. \tag{5.2.2}$$

Bounds $b(n, \epsilon)$ for $\sup_\theta P_\theta[|\hat q_n - q(\theta)| \ge \epsilon]$ that yield (5.2.2) are preferable and we shall indicate some of qualitative interest when we can. But, with all the caveats of Section 5.1, (5.2.1), which is called consistency of $\hat q_n$ and can be thought of as 0'th order asymptotics, remains central to all asymptotic theory. The stronger statement (5.2.2) is called uniform consistency. If $\Theta$ is replaced by a smaller set $K$, we talk of uniform consistency over $K$.

Example 5.2.1. Means. The simplest example of consistency is that of the mean. If $X_1, \ldots, X_n$ are i.i.d. $P$ where $P$ is unknown but $E_P|X_1| < \infty$ then, by the WLLN,

$$\bar X \stackrel{P}{\to} \mu(P) \equiv E(X_1)$$

and $\mu(\hat P) = \bar X$, where $\hat P$ is the empirical distribution, is a consistent estimate of $\mu(P)$. For $\mathcal{P}$ this large it is not uniformly consistent. (See Problem 5.2.2.) However, if, for instance, $\mathcal{P} = \{P : E_P X_1^2 \le M < \infty\}$, then $\bar X$ is uniformly consistent over $\mathcal{P}$ because, by Chebyshev's inequality, for all $P \in \mathcal{P}$,

$$P[|\bar X - \mu(P)| \ge \epsilon] \le \frac{\mathrm{Var}(\bar X)}{\epsilon^2} \le \frac{M}{n\epsilon^2}.$$

Example 5.2.2. Binomial Variance. Let $X_1, \ldots, X_n$ be the indicators of binomial trials with $P[X_1 = 1] = p$. Then $N = \sum X_i$ has a $B(n, p)$ distribution, $0 \le p \le 1$, and $\hat p = \bar X = N/n$ is a uniformly consistent estimate of $p$. But further, consider the plug-in estimate $\hat p(1 - \hat p)/n$ of the variance of $\hat p$, which is $\frac{1}{n} q(\hat p)$, where $q(p) = p(1 - p)$. Evidently, by A.14.6, $q(\hat p)$ is consistent. Other moments of $X_1$ can be consistently estimated in the same way.

To some extent the plug-in method was justified by consistency considerations and it is not surprising that consistency holds quite generally for frequency plug-in estimates.

Theorem 5.2.1. Suppose that $\mathcal{P} = S = \{(p_1, \ldots, p_k) : 0 \le p_j \le 1,\ 1 \le j \le k,\ \sum_{j=1}^k p_j = 1\}$, the $k$-dimensional simplex, where $p_j = P[X_1 = x_j]$, $1 \le j \le k$, and $\{x_1, \ldots, x_k\}$ is the range of $X_1$. Let $N_j \equiv \sum_{i=1}^n 1(X_i = x_j)$ and $\hat p_j \equiv N_j/n$, and let $\hat p_n \equiv (\hat p_1, \ldots, \hat p_k) \in S$ be the empirical distribution. Suppose that $q : S \to R^p$ is continuous. Then $\hat q_n \equiv q(\hat p_n)$ is a uniformly consistent estimate of $q(p)$.

Proof. By the weak law of large numbers, for all $p$, $\delta > 0$,

$$P_p[|\hat p_n - p| \ge \delta] \to 0.$$

Because $q$ is continuous and $S$ is compact, it is uniformly continuous on $S$. Thus, for every $\epsilon > 0$, there exists $\delta(\epsilon) > 0$ such that $p, p' \in S$, $|p' - p| \le \delta(\epsilon)$, implies $|q(p') - q(p)| < \epsilon$. Then

$$P_p[|\hat q_n - q| \ge \epsilon] \le P_p[|\hat p_n - p| \ge \delta(\epsilon)].$$

But $\sup\{P_p[|\hat p_n - p| \ge \delta] : p \in S\} \le k/4n\delta^2$ (Problem 5.2.1) and the result follows. □

In fact, in this case, we can go further. Suppose the modulus of continuity of $q$, $\omega(q, \delta)$, is defined by

$$\omega(q, \delta) = \sup\{|q(p) - q(p')| : |p - p'| \le \delta\}. \tag{5.2.3}$$

Evidently, $\omega(q, \cdot)$ is increasing in $\delta$ and has the range $[a, b)$ say. If $q$ is continuous, $\omega(q, \delta) \downarrow 0$ as $\delta \downarrow 0$. Let $\omega^{-1} : [a, b] \to R^+$ be defined as the inverse of $\omega$,

$$\omega^{-1}(\epsilon) = \inf\{\delta : \omega(q, \delta) \ge \epsilon\}. \tag{5.2.4}$$

It easily follows (Problem 5.2.3) that

$$\sup\{P_p[|\hat q_n - q(p)| \ge \epsilon] : p \in S\} \le \frac{k}{4n(\omega^{-1}(\epsilon))^2}. \tag{5.2.5}$$

A simple and important result for the case in which $X_1, \ldots, X_n$ are i.i.d. with $X_i \in \mathcal{X}$ is the following:
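Proposition 5.2.1. Let $g \equiv (g_1, \ldots, g_d)$ map $\mathcal{X}$ onto $\mathcal{Y} \subset R^d$. Suppose $E_\theta|g_j(X_1)| < \infty$, $1 \le j \le d$, for all $\theta$; let $m_j(\theta) \equiv E_\theta g_j(X_1)$, $1 \le j \le d$, and let $q(\theta) = h(m(\theta))$, where $h : \mathcal{Y} \to R^p$. Then, if $h$ is continuous,

$$\hat q \equiv h(\bar g) \equiv h\left(\frac{1}{n} \sum_{i=1}^n g(X_i)\right)$$

is a consistent estimate of $q(\theta)$. More generally, if $\nu(P) = h(E_P g(X_1))$ and $\mathcal{P} = \{P : E_P|g(X_1)| < \infty\}$, then $\nu(\hat P) \equiv h(\bar g)$, where $\hat P$ is the empirical distribution, is consistent for $\nu(P)$.

Proof. We need only apply the general weak law of large numbers (for vectors) to conclude that

$$\frac{1}{n} \sum_{i=1}^n g(X_i) \stackrel{P}{\to} E_P g(X_1) \tag{5.2.6}$$

if $E_P|g(X_1)| < \infty$. For consistency of $h(\bar g)$ apply Proposition B.7.1: $U_n \stackrel{P}{\to} U$ implies that $h(U_n) \stackrel{P}{\to} h(U)$ for all continuous $h$. □

Example 5.2.3. Variances and Correlations. Let $X_i = (U_i, V_i)$, $1 \le i \le n$, be i.i.d. $N_2(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, $\sigma_i^2 > 0$, $|\rho| < 1$. Let $g(u, v) = (u, v, u^2, v^2, uv)$ so that $\sum_{i=1}^n g(U_i, V_i)$ is the statistic generating this 5-parameter exponential family. If we let $\theta \equiv (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, then

$$m(\theta) = (\mu_1, \mu_2, \sigma_1^2 + \mu_1^2, \sigma_2^2 + \mu_2^2, \rho\sigma_1\sigma_2 + \mu_1\mu_2).$$

If $h = m^{-1}$, then

$$h(m_1, \ldots, m_5) = \left(m_1, m_2, m_3 - m_1^2, m_4 - m_2^2, (m_5 - m_1 m_2)(m_3 - m_1^2)^{-1/2}(m_4 - m_2^2)^{-1/2}\right),$$

which is well defined and continuous at all points of the range of $m$. We may, thus, conclude by Proposition 5.2.1 that the empirical means, variances, and correlation coefficient are all consistent. Questions of uniform consistency and consistency when $\mathcal{P} = \{$Distributions such that $EU_1^2 < \infty$, $EV_1^2 < \infty$, $\mathrm{Var}(U_1) > 0$, $\mathrm{Var}(V_1) > 0$, $|\mathrm{Corr}(U_1, V_1)| < 1\}$ are discussed in Problem 5.2.4. □

Here is a general consequence of Proposition 5.2.1 and Theorem 2.3.1.

Theorem 5.2.2. Suppose $\mathcal{P}$ is a canonical exponential family of rank $d$ generated by $T$. Let $\eta$, $\mathcal{E}$ and $A(\cdot)$ correspond to $\mathcal{P}$ as in Section 1.6. Suppose $\mathcal{E}$ is open. Then, if $X_1, \ldots, X_n$ are a sample from $P_\eta \in \mathcal{P}$,

(i) $P_\eta[\text{The MLE } \hat\eta \text{ exists}] \to 1$.

(ii) $\hat\eta$ is consistent.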
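A one-parameter special case makes statements (i) and (ii) of Theorem 5.2.2 concrete. The sketch below assumes Bernoulli($p$) trials written in canonical form with $T(x) = x$ and $\eta = \log(p/(1 - p))$; this family and the function names are chosen here for illustration and are not an example worked in the text. In this family the MLE of $\eta$ exists exactly when the sample mean lies strictly between 0 and 1, so the probability in (i) is available in closed form.

```python
import math
import random


def prob_mle_exists(n, p):
    """P[0 < sum X_i < n] for n Bernoulli(p) trials: the MLE of eta exists on this event."""
    return 1.0 - p**n - (1.0 - p) ** n


def canonical_mle(xs):
    """MLE of eta = log(p / (1 - p)); None when the sample mean is 0 or 1."""
    xbar = sum(xs) / len(xs)
    if xbar <= 0.0 or xbar >= 1.0:
        return None  # the sample mean is on the boundary of the convex support
    return math.log(xbar / (1.0 - xbar))


if __name__ == "__main__":
    p = 0.3
    eta = math.log(p / (1.0 - p))
    rng = random.Random(0)
    for n in (5, 50, 500, 5000):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        print(n, prob_mle_exists(n, p), canonical_mle(xs), eta)
```

As $n$ grows the existence probability tends to 1 and the computed MLE settles near the true $\eta$, which is the behavior the theorem asserts in general.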
1.2. Let 0 be a minimum contrast estimate that minimizes I . j 1 Pn(X. (5. which solves 1 n A(1)) = . II) = where.LT(Xi ).1 belong to the interior of the convex support hecause the equation A(1)) = to.2. By definition of the interior of the convex support there exists a ball 8. X n be Li.L[p(Xi .7) I .1I)11: II E 6} ~'O (5. 1 i .. By a classical result. (T(Xl)) must hy Theorem 2. 5. the inverse AI: A(e) ~ e is continuous on 8. Po.1 that On CT the map 17 A(1]) is II and continuous on E. I J Theorem 5. I ! . is a continuous function of a mean of Li. z=l 1 n i.lI) .9) n and Then (} is consistent. D . II) L p(X n. I .= 1 . . .lI o): 111110 1 > e} > D(1I0 .2.2. n LT(Xi ) i=I 21 En. . The argument of the the previous subsection in which a minimum contrast estimate. E1). ..1 that i)(Xj.. i=l 1 n (5. see.j. i I . • .8) I > O. D( 110 . But Tj.3. as usual.1I 0 ) foreverye I . the MLE. II) 0 j =EII. T(Xdl < 6} C CT' By the law of large numbers.3.. . Let Xl. n .Xn ) exists iff ~ L~ 1 T(X i ) = Tn belongs to the interior CT of the convex support of the distribution of Tn. = 1 n p.3. T(Xl).2. if 1)0 is true. Rudin (1987). ...  inf{D(II.2.7) occurs and (i) follows.... A more general argument is given in the following simple theorem whose conditions are hard to check. Suppose 1 n PII sup{l.p(X" II) is uniquely minimized at 11 i=l for all 110 E 6. exists iff the event in (5.2. 0 Hence. and the result follows from Proposition 5.. We showed in Theorem 2.2 Consistency of Minimum Contrast Estimates .d.. I I Proof Recall from Corollary 2..1 to Theorem 2. . .3.D(1I0 . i i • • ! .E1). for example. BEe c Rd.304 Asymptotic Approximations Chapter 5 . vectors evidently used exponential family properties.d. is solved by 110. .3. 7 I . '1 P1)'[n LT(Xi ) E C T) ~ 1. {t: It . Note that. where to = A(1)o) = E1)o T(X 1 ).
IIIJ  IJol > c] (5.2. IIJ . [I ~ L:~ 1 (p(Xi .9) hold for p(x.D(IJo.2. Ie ¥ OJ] ~ Ojoroll j.[max{[ ~ L:~ 1 (p(Xi . But because e is finite.IJor > <} < 0] because the event in (5.8) can often failsee Problem 5.8) follows from the WLLN and PIJ. IJ) . E n ~ [p(Xi .IJol > E] < PIJ [inf{ . Coronary 5. IJ)II ' IJ n i=l E e} > .IJ o)  D(IJo.IJol > <} < 0] (5.2. [inf c e.2. IJ) = logp(x. and (5. IIJ . ifIJ is the MLE.IJO)) ' IIJ IJol > <} n i=l .8) by (i) For ail compact K sup In { c e.1. A simple and important special case is given by the following.8).12) which has probability tending to 0 by (5.2.D(IJ o.14) (ii) For some compact K PIJ.2. 1 n PIJo[inf{. IJo)) . An alternative condition that is readily seen to work more widely is the replacement of (5. then. IJ)I : IJ K } PIJ !l o.[IJ ¥ OJ] = PIJ.2. IJ j ) . .IJ) p(Xi.11) sup{l.2 Consistency 305 proof Note that.Section 5.1 we need only check that (5.IJd ).2. 0) . But for ( > 0 let o ~ ~ inf{D(IJ.2.5. PIJ. /fe e =   Proof Note that for some < > 0.2.2.11) implies that 1 n ~ 0 (5.2. IJj ) .13) By Shannon's Lemma 2. _ I) 1 n PIJ o [16 . (5. IJ) . IJ). o Then (5.11) implies that the righthand side of (5. IJj))1 > <I : 1 < j < d} ~ O.2.2.L[p(Xi . for all 0 > 0.10) By hypothesis. (5.2. ."'[p(X i .p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1.IJ)1 < 00 and the parameterization is identifiable. {IJ" .D(IJo.9) follows from Shannon's lemma.D( IJ o. IJ j )) : 1 <j < d} > <] < d maxi PIJ.2.L(p(Xi.10) tends to O..D(IJo.8) and (5.2.2. IJ) .. PIJ. 0 Condition (5. 2 0 (5. is finite.inf{D(IJo. IJ) n ~ i=l p(X"IJ o)] .lJ o): IIJ IJol > oj. EIJollogp(XI.
. consistency of the MLE may fail if the number of parameters tends to infinity.Il E(X Il).. ~l Theorem 5.2. m Mi) and assume (b) IIh(~)lloo j 1 .3 FIRST. Summary. that sup{PIiBn . A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems.1 The Delta Method for Moments • • We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders.14) is in general difficult.3. VariX. ! ! (ii) EIXd~ < 00 Let E(X1 ) = Il. Sufficient conditions are explored in the problems. J. 5.B(P)I > Ej : PEP} t 0 for all € > O. ~ 5.306 Asymptotic Approximations Chapter 5 We shall see examples in which this modification works in the problems. When the observations are independent but not identically distributed.1) where • .2.3. We denote the jth derivative of h by 00 .d. > 2. If fin is an estimate of B(P).i. sequence of estimates) consistency. We then sketch the extension to functions of vector means. We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families. X valued and for the moment take X = R.33.AND HIGHERORDER ASYMPTOTlCS: THE DelTA METHOD WITH APPLICATIONS We have argued. let Ilglioo = sup{ Ig( t) I : t E R) denote the sup nonn. We introduce the minimal property we require of any estimate (strictly speak ing.1. and assume (i) (a) h is m times differentiable on R. I ~. then = h(ll) + L (j)( ) j=l h . . + Rm (5.8) and (5. . Unifonn consistency for l' requires more. in Section 5. Unfortunately checking conditions such as (5. As usual let Xl. If (i) and (ii) hold.3. we require that On ~ B(P) as n ~ 00. ..) = <7 2 Eh(X) We have the following. Let h : R ~ R. D .Xn be i. see Problem 5. =suPx 1k<~)(x)1 <M < . 
We conclude by studying consistency of the MLE and more generally Me estimates in the case e finite and e Euclidean.3.1 that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk. .
3.3.3. (5.) ~ 0. Moreover. ..1'1. . (n . tr i/o>2 all k where tl . + i r = j. t r 1 and [t] denotes the greatest integer n.1) .4) for all j and (5.. so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r. (b) a unless each integer that appears among {i I . .3. then (a) But E(Xi .. . +'r=j J . In!. . Let I' = E(X.. C [~]! n(n .4) Note that for j eveo.3. . X ij ) least twice. rl l:: .ij } • sup IE(Xi ..1.and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion.. :i1 + .2) where IX' that 1'1 < IX ...3.. _ . 'Zr J . .3) and j odd is given in Problem 5. then there are constants C j > 0 and D j > 0 such (5.5.'" . .i j appears at by Problem 5. The more difficult argument needed for (5.. tI. EIX I'li = E(X _1')' Proof. .. . . . Lemma 5. Xij)1 = Elxdj il.3..[jj2] + 1) where Cj = 1 <r. .. ik > 2. bounded by (d) < t. . The expression in (c) is. and the following lemma.3) for j even..3 First. j > 2.5[jJ2J max {l:: { ..2.'1 +. 21.t" = t1 .3) (5. We give the proof of (5. for j < n/2. If EjX Ilj < 00.3.Section 5..1 < k < r}}.3..
l)h(J. I .6.3.3. respectively.3. and EXt < replaced by G(n').l) + [h(llj'(1')}':.l) 6 + 2h(J.3f2 ) in (5.3f2 ). n (b) Next.5) (b) if E(Xt) < G(n. 00 and 11hC 41 11= < then G(n.+ O(n. 0 1 I . 1 < j < 3. If the conditions of (b) hold. Proof (a) Write 00. apply Theorem 5. Because E(X .ll j < 2j EIXd j and the lemma follows.. then I • I. i • r' Var h(X) = 02[h(11(J.1'11. I 1 .. and (e) applied to (a) imply (5.3. I .1')2 = 0 2 In.3.3.5) can be replaced by Proof For (5.l) + 00 2n I' + G(n3f2).1.3.3.1 with m = 4. By Problem 5.l) and its variance and MSE.JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 . then G(n.3.l)}E(X ..l) + [h(1)]2(J. 1 iln .3. (5.4).l) + {h(21(J. EIX1 . . then h(')( ) 2 0 = h(J.4) for j odd and (5. give approximations to the bias of h(X) as an estimate of h(J.3. I (b) Ifllh(j11l= < 00.308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) .1 with m = 3.6) can be 1 Eh 2 (X) = h'(J. if 1 1 <j (a) Ilh(J)II= < 00. (n li/2] + 1) < nIJf'jj and (c). (d).3.2..2 ).3f') in (5. < 3 and EIXd 3 < 00.1')3 = G(n 2) by (5. (5..3. using Corollary 5. (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00. In general by considering Xi .5) follows.1.1') + {h(2)(J.1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J.5) apply Theorem 5.6) n . .l)E(X ..3. .3. 0 The two most important corollaries of Theorem 5. Then Rm = G(n 2) and also E(X .3) for j even.J.l)h(1)(J. I I I Corollary 5. if I' = O.l)h(J.3.1. Corollary 5.llJ' + G(n3f2) (5. • .
exp( 2') c(') = h(/1).l ).1.and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5.3. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX).3.3. by Corollary 5.3.  . by Corollary 5.Z ".(h(X) .d.7) If h(t) ~ 1 .4 ) exp( 2/t)..Section 5.8) because h(ZI (t) = 4(r· 3 ./1)3 = ~~.Z. as we nonnally would. T!Jus.Z) (5.9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n. A qualitatively simple explanation of this important phenonemon will be given in Theorem 5. (5.1) = 11..= E"X I ~ 11 A.3. and. 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5.2 (Problem (5. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +. Example 5. Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5..i. as the plugin estimate of the parameter h(Jl) then.3. then we may be interested in the warranty failure probability (5.exp( 2It).10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 .3.3.3. which is G(n.h(/1)) h(2~(II):: + O(n.3.3(1_ ')In + G(n.4) E(X . To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5.3. If X [. the bias of h(X) defined by Eh(X) .i.6).t) and X.3. Bias. the MLE of ' is X .1.2 ) 2e.3. where /1.l / Z) unless h(l)(/1) = O. {h(l) (/1)h(ZI U')/13 (5.(h(X)) E. Note an important qualitative feature revealed by these approximations. We will compare E(h(X)) and Var h(X) with their approximations. (1Z 5.1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours.3.h(/1) is G(n. . Thus. X n are i.3. Bias and Variance of the MLE of the Binomial VanOance.3 First.t.2.3.11) Example 5. then heX) is the MLE of 1 .1 with bounds on the remainders. for large n. .5).. 
If heX) is viewed. which is neglible compared to the standard deviation of h(X). when hit) = t(1 .
a Xd d} .p) {(I _ 2p)2 n Thus..2p)' .P )} (n_l)2 n nl n Because 1'3 = p(l.!c[2p(1 n p) . and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i. 0 < i. < m.p)'} + R~ p(l . R'n p(1 ~ p) [(I .j .2p)2p(1 .5) yields E(h(X))  ~ p(1 .3.3..e x ) : i l + .5) is exact as it should be.p) . First calculate Eh(X) = E(X) . " ! .p) .6p(l.3. asSume that h has continuous partial derivatives of order up to m. and will illustrate how accurate (5.E(X') ~ p .p)(I. .2(1 . D " ~ I .3.9d(Xi )f. . + id = m.3 ).2p)')} + R~. . II'.p) + . I . .p) n .' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2.2p) n n +21"(1 .pl. (5. . Let h : R d t R.p(1 .. D 1 i I . Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi).{2(1. • .2p)p(l. ~ Varh(X) = (1.3.. Theorem 5. Next compute Varh(X)=p(I.amh i.10) is in a situation in which the approximation can be checked. • ~ O(n.310 Asymptotic Approximations Chapter 5 B(l. the error of approximation is 1 1 + .[Var(X) + (E(X»2] I nI ~ p(1 .' . n n Because MII(t) = I . 1 < j < { Xl .2t.2p).p)J n p(1 ~ p) [I . (5.2p(1 .p).p)(I. 1 and in this case (5..10) yields .2.p)] n I " . M2) ~ 2. .P) {(1_2 P)2+ 2P(I.p(1 ..p) = p ( l .
by (5. under appropriate conditions (Problem 5. + ~ ~:l (J1.3.12) Var h(Y) ~ [(:.) )' Var(Y.1 Then.3."..(J1.11. Y12 ) (5.h(!. if EIY.) Cov(Y". (5. (J1.3.3.3. ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi.2 ). and (5. h : R ~ R. is to m = 3.14) + (g:. Y 12 ) 3/2). as for the case d = 1.4. 1 < j < d where Y iJ I = 9j(Xi ).8. Similarly. and the appropriate generalization of Lemma 5.) + a.3.I' < 00.3.) + 2 g:'. EI Yl13 < 00 Eh(Y) h(J1.3 First. E xl < 00 and h is differentiable at (5.) + ~ {~~:~ (J1. B.3.))) ~ N(O.)} + O(n (5.3.~ (J1.3.C( v'n(h(X) .Section 5. The most interesting application.1...15) Then .. (J1. 0/ The result follows from the more generally usefullernma.~E(X.13).) Var(Y. then O(n.).»)'var(Y'2)] +O(n') Approximations (5. (7'(h)) where and (7' = VariX.2.3. and J.) Cov(Y". Suppose thot X = R..hUll)~. Lemma 5. .3.3). 1.) gx~ (J1. for d = 2. Theorem 5.and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00.5). The proof is outlined in Problem 5.2 The Delta Method for In law Approximations = As usual we begin with d !.12) can be replaced by O(n.13) Moreover.'. We get.3 / 2 ) in (5.x. Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00. 5.) Var(Y. (5..3.). The results in the next subsection go much further and "explain" the fonn of the approximations we already have.3.3. = EY 1 .14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d .6).3. then Eh(Y) This is a consequence of Taylor's expansion in d variables.
.15) "explains" Lemma 5. V for some constant u.3.312 Asymptotic Approximations Chapter 5 (i) an(Un . I (g) I ~: . Thus. The theorem follows from the central limit theorem letting Un X. for every € > 0 there exists a 8 > 0 such that (a) Iv . PIIUn €  ul < iii ~ 1 • "'. ·i Note that (5. . Therefore. By definition of the derivative. _ F 7 . V .ul < Ii '* Ig(v)  g(u) .N(O.3.ul Note that (i) (b) '* . (e) Using (e).16) Proof. ( 2 ). by hypothesis. hence. I .i.3. from (a).a 2 ). (5.3.• ' " and. V and the result follows. 'j But (e) implies (f) from (b). .g(ll(U)(v  u)1 < <Iv . then EV. for every Ii > 0 (d) j. j . j " But. • • .u)!:.8). u = /}" j \ V . Formally we expect that if Vn !:. ~ EVj (although this need not be true. . see Problems 5.1. = I 0 .17) .7. an(Un . an = n l / 2.. . 0 . ~: . (ii) 9 : R Then > R is d(fferentiable at u with derivative 9(1) (u).u) !:. for every (e) > O.N(O. we expect (5. V.32 and B.3. Consider 1 Vn = v'n(X !") !:.
'" 1 X n1 and Y1 . £ (5.3 First. Consider testing H: /11 = j. "t" Statistics.j odd. i=l If:F = {Gaussian distributions}.. (1  A)a~ . al = Var(X. Yn2 be two independent samples with 1'1 = E(X I ).O < A < 1. then Tn . (1 . 1'2 = E(Y.9. we find (Problem 5. In Example 4. 1 2 a P by Theorem 5.). else EZJ ~ = O.3.= 0 versuS K : 11. Let Xl.t2 versus K : 112 > 111. Let Xl. (a) The OneSample Case. N n (0 Aal.. Now Slutsky's theorem yields (5..s Tn =.3.X) n 2 .l (1. we can obtain the critical value t n. In general we claim that if F E F and H is true.i. Slutsky's theorem.a) critical value (or Zla) is approximately correct if H is true and F is not Gaussian. FE :F where EF(X I ) = '".l distribution. and the foregoing arguments. then S t:. Using the central limit theorem. where g(u.18) In particular this implies not only that t n.1 (1 . Then (5. VarF(Xd = a 2 < 00. 1).and HigherOrder Asymptotics: The Delta Method with Applications 313 where Z ~ N(O.3 we saw that the two sample t statistic Sn vn1n2 (Y X) ' n = = n s nl + n2 has a Tn2 distribution under H when the X's and Y's are normal withar = a~.17) yields O(n J. .3. EZJ > 0.3.> 0 .'" ./2 ). j even = o(nJ/'). Example 5.).Section 5.d.a) for Tn from the Tn .N(O. ••• .TiS X where S 2 = 1 nl L (X.3.+A)al + Aa~) .X" be i. But if j is even.3.2. 1). sn!a).18) because Tn = Un!(sn!a) = g(Un . (b) The TwoSample Case. A statistic for testing the hypothesis H : 11.28) that if nI/n _ A.1 (Ia) + Zla butthat the t n. and S2 _ n nl ( ~ X) 2) n~(X. For the proof note that by the central limit theorem.) and a~ = Var(Y.2 and Slutsky's theorem. v) = u!v.
It follows that if $n_1 = n_2$ or $\sigma_1^2 = \sigma_2^2$, then the critical value $t_{n-2}(1 - \alpha)$ for $S_n$ is approximately correct if $H$ is true and the $X$'s and $Y$'s are not normal.

Monte Carlo Simulation

As mentioned in Section 5.1, approximations based on asymptotic results should be checked by Monte Carlo simulations. We illustrate such simulations for the preceding $t$ tests by generating data from the $\chi^2_d$ distribution $M$ times independently, each time computing the value of the $t$ statistics and then giving the proportion of times out of $M$ that the $t$ statistics exceed the critical values from the $t$ table. Here we use the $\chi^2_d$ distribution because for small to moderate $d$ it is quite different from the normal distribution. Other distributions should also be tried. Figure 5.3.1 shows that for the one-sample $t$ test, when $\alpha = 0.05$, the asymptotic result gives a good approximation when $n \ge 10^{1.5} \approx 32$, and the true distribution $F$ is $\chi^2_d$ with $d \ge 10$. The $\chi^2_2$ distribution is extremely skew, and in this case the $t_{n-1}(0.95)$ approximation is only good for $n \ge 10^{2.5} \approx 316$.

[Figure 5.3.1. One sample: 10000 simulations; chi-square data. Horizontal axis: log10 sample size. Each plotted point represents the results of 10,000 one-sample $t$ tests using $\chi^2_d$ data, where $d$ is either 2, 10, 20, or 50, as indicated in the plot. The simulations are repeated for different sample sizes and the observed significance levels are plotted.]

For the two-sample $t$ tests, Figure 5.3.2 shows that when $\sigma_1^2 = \sigma_2^2$ and $n_1 = n_2$, the $t_{n-2}(1 - \alpha)$ critical value is a very good approximation even for small $n$ and for $X, Y \sim \chi^2_2$.
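The one-sample simulation described in this subsection can be sketched in a few lines. The version below is only an illustration under stated assumptions: it tests $H : \mu = d$ against $\mu > d$ with $\chi^2_d$ data, and it uses the normal critical value $z_{1-\alpha}$ in place of the $t_{n-1}(1 - \alpha)$ value from the $t$ table (the Python standard library has no $t$ quantile function), so for small $n$ its observed levels will sit somewhat above those obtained with the $t$ table. The function names and the default seed are hypothetical.

```python
import math
import random
from statistics import NormalDist


def one_sample_t(xs, mu0):
    """t statistic sqrt(n) (xbar - mu0) / s, with s^2 the unbiased sample variance."""
    n = len(xs)
    xbar = math.fsum(xs) / n
    s2 = math.fsum((x - xbar) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)


def observed_level(n, d, M, alpha, rng):
    """Proportion of M chi-square(d) samples of size n in which H: mu = d is rejected."""
    critical = NormalDist().inv_cdf(1.0 - alpha)  # z_{1-alpha}, an assumption (see lead-in)
    rejections = 0
    for _ in range(M):
        # chi-square with d degrees of freedom is Gamma(shape = d/2, scale = 2)
        xs = [rng.gammavariate(d / 2.0, 2.0) for _ in range(n)]
        if one_sample_t(xs, mu0=d) >= critical:
            rejections += 1
    return rejections / M


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 10, 50):
        for n in (10, 32, 100):
            print(d, n, observed_level(n, d, M=2000, alpha=0.05, rng=rng))
```

Repeating the loop over a grid of sample sizes and plotting the observed level against $\log_{10} n$ gives a plot of the kind the text describes; a larger $M$ reduces the simulation noise, which is of order $\sqrt{\alpha(1 - \alpha)/M}$.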
The approximation is this good when $n_1 = n_2$ and $\sigma_1^2 = \sigma_2^2$ because, in this case, $\bar Y - \bar X = \frac{1}{n_1}\sum_{i=1}^{n_1}(Y_i - X_i)$, and $Y_i - X_i$ have a symmetric distribution. Other Monte Carlo runs (not shown) with $\sigma_1^2 \neq \sigma_2^2$ show that as long as $n_1 = n_2$, the $t_{n-2}(0.95)$ approximation is good for $n_1 \geq 100$, even when the $X$'s and $Y$'s have different $\chi^2_2$ distributions, scaled to have the same means, and $\sigma_2^2 = 12\sigma_1^2$. Moreover, the $t_{n-2}(1-\alpha)$ approximation is good when $n_1 \neq n_2$ and $\sigma_1^2 = \sigma_2^2$. However, as we see from the limiting law of $S_n$ and Figure 5.3.3, when both $n_1 \neq n_2$ and $\sigma_1^2 \neq \sigma_2^2$, then the two-sample $t$ tests with critical region $1\{S_n \geq t_{n-2}(1-\alpha)\}$ do not have approximate level $\alpha$. In this case Monte Carlo studies have shown that the test in Section 4.9.4 based on Welch's approximation works well.

[Figure 5.3.2. Panel title: Two sample; 10000 Simulations; Chi-Square Data; Equal Variances. Horizontal axis: Log10 sample size.]

Figure 5.3.2. Each plotted point represents the results of 10,000 two-sample $t$ tests. For each simulation the two samples are the same size (the size indicated on the x-axis), $\sigma_1^2 = \sigma_2^2$, and the data are $\chi^2_d$ where $d$ is one of 2, 10, or 50. □

Next, in the one-sample situation, let $h(\bar X)$ be an estimate of $h(\mu)$ where $h$ is continuously differentiable at $\mu$, $h^{(1)}(\mu) \neq 0$. By Theorem 5.3.2,

$$\sqrt{n}\,[h(\bar X) - h(\mu)] \xrightarrow{\mathcal{L}} N(0, \sigma^2[h^{(1)}(\mu)]^2).$$

To test the hypothesis $H: h(\mu) = h_0$ versus $K: h(\mu) > h_0$ the natural test statistic is

$$T_n = \frac{\sqrt{n}\,[h(\bar X) - h_0]}{s\,[h^{(1)}(\bar X)]}.$$
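Returning to the two-sample statistic $S_n$: its limiting law in Example 5.3.3(b) makes the failure of the nominal level explicit. If $S_n$ is approximately $N(0, v)$ with $v = [(1-\lambda)\sigma_1^2 + \lambda\sigma_2^2]/[\lambda\sigma_1^2 + (1-\lambda)\sigma_2^2]$, then rejecting for $S_n \geq z_{1-\alpha}$ has asymptotic level $1 - \Phi(z_{1-\alpha}/\sqrt{v})$. The sketch below evaluates this; the function names are illustrative assumptions, and the last configuration mimics the setting described for Figure 5.3.3 (second sample twice the size of the first, so $\lambda = 1/3$).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def limiting_variance(lam, var_x, var_y):
    """Variance of the limiting normal law of the two-sample t statistic S_n,
    where lam is the limit of n1/n and var_x, var_y are the two variances."""
    return ((1 - lam) * var_x + lam * var_y) / (lam * var_x + (1 - lam) * var_y)


def asymptotic_level(alpha, lam, var_x, var_y):
    """Limiting probability that S_n exceeds z_{1-alpha} when H is true."""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    sd = limiting_variance(lam, var_x, var_y) ** 0.5
    return 1 - STD_NORMAL.cdf(z / sd)


# Equal sample sizes or equal variances: the nominal level is recovered.
print(asymptotic_level(0.05, 0.5, 1.0, 9.0))
print(asymptotic_level(0.05, 1 / 3, 1.0, 1.0))
# Second sample twice as large and nine times as variable: the level is off.
print(asymptotic_level(0.05, 1 / 3, 1.0, 9.0))
```

In the last configuration the larger sample also has the larger variance, so the pooled variance estimate is too large and the test is conservative; swapping the roles of the two variances makes it anticonservative instead.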
For the statistic $T_n$ based on $h(\bar X)$, combining Theorem 5.3.2 and Slutsky's theorem, we see that here, too, if $H$ is true,

$$T_n \xrightarrow{\mathcal{L}} N(0, 1)$$

so that $z_{1-\alpha}$ is the asymptotic critical value.

[Figure 5.3.3. Panel title: Two Sample; 10000 Simulations: Gaussian Data, Unequal Variances, 2nd sample 2x bigger. Horizontal axis: Log10 (smaller sample size).]

Figure 5.3.3. Each plotted point represents the results of 10,000 two-sample $t$ tests. For each simulation the two samples differ in size: The second sample is two times the size of the first. The x-axis denotes the size of the smaller of the two samples. The data in the first sample are $N(0, 1)$ and in the second they are $N(0, \sigma^2)$ where $\sigma^2$ takes on the values 1, 3, 6, and 9, as indicated in the plot.

Variance Stabilizing Transformations

Example 5.3.4. In Appendices A and B we encounter several important families of distributions, such as the binomial, Poisson, gamma, and beta, which are indexed by one or more parameters. If we take a sample from a member of one of these families, then the sample mean $\bar X$ will be approximately normally distributed with variance $\sigma^2/n$ depending on the parameters indexing the family considered. We have seen that smooth transformations $h(\bar X)$ are also approximately normally distributed. It turns out to be useful to know transformations $h$, called variance stabilizing, such that $\mathrm{Var}\,h(\bar X)$ is approximately independent of the parameters indexing the family we are considering. From (5.3.6) and
(5.3.13) we see that a first approximation to the variance of $h(\bar X)$ is $\sigma^2[h^{(1)}(\mu)]^2/n$. Thus, finding a variance stabilizing transformation is equivalent to finding a function $h$ such that

$$\sigma^2[h^{(1)}(\mu)]^2 \equiv c \tag{5.3.19}$$

for all $\mu$ and $\sigma$ appropriate to our family. Such a function can usually be found if $\sigma$ depends only on $\mu$, which varies freely. In this case (5.3.19) is an ordinary differential equation.

As an example, suppose that $X_1, \ldots, X_n$ is a sample from a $\mathcal{P}(\lambda)$ family. In this case $\sigma^2 = \lambda$ and $\mathrm{Var}(\bar X) = \lambda/n$. To have $\mathrm{Var}\,h(\bar X)$ approximately constant in $\lambda$, $h$ must satisfy the differential equation $[h^{(1)}(\lambda)]^2\lambda = c > 0$ for some arbitrary $c > 0$. If we require that $h$ is increasing, this leads to $h^{(1)}(\lambda) = \sqrt{c}/\sqrt{\lambda}$, $\lambda > 0$, which has as its solution $h(\lambda) = 2\sqrt{c\lambda} + d$, where $d$ is arbitrary. Thus, $h(t) = \sqrt{t}$ is a variance stabilizing transformation of $\bar X$ for the Poisson family of distributions. Substituting in (5.3.6) we find $\mathrm{Var}(\bar X^{1/2}) \cong 1/4n$ and $\sqrt{n}\,(\bar X^{1/2} - \lambda^{1/2})$ has approximately a $N(0, 1/4)$ distribution. □

One application of variance stabilizing transformations, by their definition, is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. Thus, in the preceding $\mathcal{P}(\lambda)$ case,

$$\sqrt{\bar X} \pm \frac{z(1 - \frac{1}{2}\alpha)}{2\sqrt{n}}$$

is an approximate $1 - \alpha$ confidence interval for $\sqrt{\lambda}$. A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models. Major examples are the generalized linear models of Section 6.5. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II. Some further examples of variance stabilizing transformations are given in the problems.

The notion of such transformations can be extended to the following situation. Suppose $\hat\gamma_n(X_1, \ldots, X_n)$ is an estimate of a real parameter $\gamma$ indexing a family of distributions from which $X_1, \ldots, X_n$ are an i.i.d. sample. Suppose further that

$$\mathcal{L}_\gamma(\sqrt{n}\,(\hat\gamma_n - \gamma)) \to N(0, \sigma^2(\gamma)).$$

Then again, a variance stabilizing transformation $h$ is such that

$$\sqrt{n}\,(h(\hat\gamma_n) - h(\gamma)) \xrightarrow{\mathcal{L}} N(0, c)$$

for all $\gamma$. See Example 5.3.6. Also closely related but different are so-called normalizing transformations. See Problems 5.3.15 and 5.3.16.

Edgeworth Approximations

The normal approximation to the distribution of $\bar X$ utilizes only the first two moments of $\bar X$. Under general conditions (Bhattacharya and Rao, 1976, p. 538) one can improve on
the normal approximation by utilizing the third and fourth moments. Let $F_n$ denote the distribution of $T_n = \sqrt{n}\,(\bar X - \mu)/\sigma$ and let $\gamma_{1n}$ and $\gamma_{2n}$ denote the coefficient of skewness and kurtosis of $T_n$. Then under some conditions,(1)

$$F_n(x) = \Phi(x) - \varphi(x)\left[\frac{1}{6}\gamma_{1n}H_2(x) + \frac{1}{24}\gamma_{2n}H_3(x) + \frac{1}{72}\gamma_{1n}^2 H_5(x)\right] + r_n \tag{5.3.20}$$

where $r_n$ tends to zero at a rate faster than $1/n$ and $H_2$, $H_3$, and $H_5$ are Hermite polynomials defined by

$$H_2(x) = x^2 - 1, \quad H_3(x) = x^3 - 3x, \quad H_5(x) = x^5 - 10x^3 + 15x.$$

The expansion (5.3.20) is called the Edgeworth expansion for $F_n$.

Example 5.3.5. Edgeworth Approximations to the $\chi^2$ Distribution. Suppose $V \sim \chi^2_n$. According to Theorem B.3.1, $V$ has the same distribution as $\sum_{i=1}^n X_i^2$, where the $X_i$ are independent and $X_i \sim N(0, 1)$, $i = 1, \ldots, n$. It follows from the central limit theorem that $T_n = (\sum_{i=1}^n X_i^2 - n)/\sqrt{2n} = (V - n)/\sqrt{2n}$ has approximately a $N(0, 1)$ distribution. To improve on this approximation, we need only compute $\gamma_{1n}$ and $\gamma_{2n}$. We can use Problem B.2.4 to compute

$$\gamma_{1n} = \frac{E(V - n)^3}{(2n)^{3/2}} = \frac{2\sqrt{2}}{\sqrt{n}}, \qquad \gamma_{2n} = \frac{E(V - n)^4}{(2n)^2} - 3 = \frac{12}{n}.$$

Therefore,

$$F_n(x) = \Phi(x) - \varphi(x)\left[\frac{\sqrt{2}}{3\sqrt{n}}(x^2 - 1) + \frac{1}{2n}(x^3 - 3x) + \frac{1}{9n}(x^5 - 10x^3 + 15x)\right] + r_n. \tag{5.3.21}$$

Table 5.3.1 gives this approximation together with the exact distribution and the normal approximation when $n = 10$. □

TABLE 5.3.1. Edgeworth(2) and normal approximations EA and NA to the $\chi^2_{10}$ distribution, $P(T_n \leq x)$, where $T_n$ is a standardized random variable.

[Table 5.3.1 has columns x, Exact, EA, NA; its entries are not legible in this copy.]
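The comparison that Table 5.3.1 reports can be recomputed directly. The sketch below is an illustration rather than the authors' computation: it evaluates the exact distribution of $T_n = (V - n)/\sqrt{2n}$, the Edgeworth approximation of Example 5.3.5 with the remainder $r_n$ dropped, and the normal approximation. The exact distribution function uses the closed form $P(V \leq v) = 1 - e^{-v/2}\sum_{k < n/2}(v/2)^k/k!$, which is valid for even $n$ only; the function names and the grid of $x$ values are assumptions made for the example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def chi2_cdf_even(v, n):
    """Exact P(V <= v) for V chi-square with an EVEN number n of degrees of freedom."""
    if v <= 0:
        return 0.0
    half = v / 2.0
    partial = sum(half ** k / math.factorial(k) for k in range(n // 2))
    return 1.0 - math.exp(-half) * partial


def exact_cdf(x, n):
    """Exact P(T_n <= x) for the standardized variable T_n = (V - n)/sqrt(2n)."""
    return chi2_cdf_even(n + math.sqrt(2.0 * n) * x, n)


def edgeworth_cdf(x, n):
    """Edgeworth approximation to P(T_n <= x), remainder term dropped."""
    correction = (
        math.sqrt(2.0) / (3.0 * math.sqrt(n)) * (x ** 2 - 1)
        + (x ** 3 - 3 * x) / (2.0 * n)
        + (x ** 5 - 10 * x ** 3 + 15 * x) / (9.0 * n)
    )
    return STD_NORMAL.cdf(x) - STD_NORMAL.pdf(x) * correction


def normal_cdf(x):
    """Normal approximation to P(T_n <= x)."""
    return STD_NORMAL.cdf(x)


n = 10
print("    x   Exact      EA      NA")
for x in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"{x:5.1f}  {exact_cdf(x, n):.4f}  {edgeworth_cdf(x, n):.4f}  {normal_cdf(x):.4f}")
```

Because the Edgeworth approximation is not itself a distribution function, it can fall slightly outside $[0, 1]$ far in the tails; the grid above stays in the central range where the comparison is most informative.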
The Multivariate Case

Lemma 5.3.2. Suppose $\{U_n\}$ are $d$-dimensional random vectors and that for some sequence of constants $\{a_n\}$ with $a_n \to \infty$ as $n \to \infty$,

(i) $a_n(U_n - u) \xrightarrow{\mathcal{L}} V_{d \times 1}$ for some $d \times 1$ vector of constants $u$.

(ii) $g: R^d \to R^p$ has a differential $g^{(1)}_{p \times d}(u)$ at $u$.

Then

$$a_n[g(U_n) - g(u)] \xrightarrow{\mathcal{L}} g^{(1)}(u)V.$$

Proof. The proof follows from the arguments of the proof of Lemma 5.3.1.

Example 5.3.6. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. as $(X, Y)$ where $0 < EX^4 < \infty$, $0 < EY^4 < \infty$. Let $\rho^2 = \mathrm{Cov}^2(X, Y)/\sigma_1^2\sigma_2^2$ where $\sigma_1^2 = \mathrm{Var}\,X$, $\sigma_2^2 = \mathrm{Var}\,Y$; and let $r^2 = \hat C^2/\hat\sigma_1^2\hat\sigma_2^2$ where

$$\hat C = n^{-1}\sum (X_i - \bar X)(Y_i - \bar Y), \quad \hat\sigma_1^2 = n^{-1}\sum (X_i - \bar X)^2, \quad \hat\sigma_2^2 = n^{-1}\sum (Y_i - \bar Y)^2.$$

Recall from Section 4.9.5 that in the bivariate normal case the sample correlation coefficient $r$ is the MLE of the population correlation coefficient $\rho$ and that the likelihood ratio test of $H: \rho = 0$ is based on $|r|$. We can write $r^2 = g(\hat C, \hat\sigma_1^2, \hat\sigma_2^2): R^3 \to R$, where $g(u_1, u_2, u_3) = u_1^2/u_2u_3$. Because of the location and scale invariance of $\rho$ and $r$, we can use the transformations $\tilde X_i = (X_i - \mu_1)/\sigma_1$ and $\tilde Y_j = (Y_j - \mu_2)/\sigma_2$ to conclude that without loss of generality we may assume $\mu_1 = \mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 1$, $\rho = E(XY)$. Using the central limit and Slutsky's theorems, we can show (Problem 5.3.9) that $\sqrt{n}\,(\hat C - \rho)$, $\sqrt{n}\,(\hat\sigma_1^2 - 1)$ and $\sqrt{n}\,(\hat\sigma_2^2 - 1)$ jointly have the same asymptotic distribution as $\sqrt{n}\,(U_n - u)$ where

$$U_n = (n^{-1}\textstyle\sum X_iY_i,\ n^{-1}\sum X_i^2,\ n^{-1}\sum Y_i^2)$$

and $u = (\rho, 1, 1)$. Let $\tau^2_{k,j} = \mathrm{Var}(X^kY^j)$ and $\lambda_{k,j,m,l} = \mathrm{Cov}(X^kY^j, X^mY^l)$, then by the central limit theorem

$$\sqrt{n}\,(U_n - u) \to N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} \tau^2_{1,1} & \lambda_{1,1,2,0} & \lambda_{1,1,0,2} \\ \lambda_{1,1,2,0} & \tau^2_{2,0} & \lambda_{2,0,0,2} \\ \lambda_{1,1,0,2} & \lambda_{2,0,0,2} & \tau^2_{0,2} \end{pmatrix}. \tag{5.3.22}$$

Next we compute

$$g^{(1)}(u) = (2u_1/u_2u_3,\ -u_1^2/u_2^2u_3,\ -u_1^2/u_2u_3^2) = (2\rho, -\rho^2, -\rho^2).$$

It follows from Lemma 5.3.2 and (B.5.6) that $\sqrt{n}\,(r^2 - \rho^2)$ is asymptotically normal, $N(0, \sigma_0^2)$, with

$$\sigma_0^2 = g^{(1)}(u)\,\Sigma\,[g^{(1)}(u)]^T = 4\rho^2\tau^2_{1,1} + \rho^4\tau^2_{2,0} + \rho^4\tau^2_{0,2} + 2\{-2\rho^3\lambda_{1,1,2,0} - 2\rho^3\lambda_{1,1,0,2} + \rho^4\lambda_{2,0,0,2}\}.$$
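As a check on the quadratic form $\sigma_0^2 = g^{(1)}(u)\Sigma[g^{(1)}(u)]^T$, one can plug in the moments of a standardized bivariate normal pair, for which the expression should reduce to $4\rho^2(1-\rho^2)^2$. The sketch below does this numerically. The normal moments used ($\mathrm{Var}(XY) = 1 + \rho^2$, $\mathrm{Var}(X^2) = \mathrm{Var}(Y^2) = 2$, $\mathrm{Cov}(XY, X^2) = \mathrm{Cov}(XY, Y^2) = 2\rho$, $\mathrm{Cov}(X^2, Y^2) = 2\rho^2$) are standard facts about the bivariate normal brought in for the illustration, not taken from the text above, and the function names are hypothetical.

```python
def delta_method_variance(grad, cov):
    """Quadratic form grad * cov * grad^T for a gradient (row vector) and a
    covariance matrix given as nested lists."""
    d = len(grad)
    return sum(grad[i] * cov[i][j] * grad[j] for i in range(d) for j in range(d))


def sigma0_squared_normal(rho):
    """Asymptotic variance of sqrt(n) * (r^2 - rho^2) for standardized
    bivariate normal data with correlation rho."""
    # Covariance matrix of (XY, X^2, Y^2) under the bivariate normal (assumed moments).
    cov = [
        [1 + rho ** 2, 2 * rho, 2 * rho],
        [2 * rho, 2.0, 2 * rho ** 2],
        [2 * rho, 2 * rho ** 2, 2.0],
    ]
    # Gradient of g(u1, u2, u3) = u1^2 / (u2 * u3) evaluated at u = (rho, 1, 1).
    grad = [2 * rho, -rho ** 2, -rho ** 2]
    return delta_method_variance(grad, cov)


for rho in (-0.7, 0.0, 0.3, 0.9):
    closed_form = 4 * rho ** 2 * (1 - rho ** 2) ** 2
    print(rho, sigma0_squared_normal(rho), closed_form)
```

The two printed columns should agree up to rounding error for every $\rho$, which is a useful sanity check on the signs in the cross terms of $\sigma_0^2$.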
320 When (X. Theorem 5. . per < c) '" <I>(vn .p) !:.3.g. we see (Problem 5.~a)/vn3} where tanh is the hyperbolic tangent.fii[h(r) .5 (a) ! .24) I Proof. I Y n are independent identically distributed d vectors with ElY Ii 2 < 00.h(p)) is closely approximated by the NCO. has been studied extensively and it has been shown (e. (5. Argue as before using B.rn) N(o.P The approximation based on this transfonnation. Then ~. then u5 .9) Refening to (5.10) that in the bivariate nonnal case a variance stahilizing transformation h(r) with . N(O. 1938) that £(vn . it gives approximations to the power of these tests. .rn) + o(ly . EY 1 = rn.l) distribution. o Here is an extension of Theorem 5.3.h= (h i .1) is achieved hy choosing h(P)=!log(1+ P ). j f· ~.h(p)J !:.Y) ~ N(/li. . and it provides the approximate 100(1 .fii(Y .3.. p=tanh{h(r)±z (1.3.3. c E (1.1'2.UI. and (Prohlem 5. that is. Suppose Y 1.3. .hp )andhhasatotaldifferentialh(1)(rn) f • . David. Asymptotic Approximations Chapter 5 = 4p2(1_ p2)'.a)% confidence interval of fixed length.4. . ! f This expression provides approximations to the critical value of tests of H : p = O. 2 1.3.3. . E) = .19). .fii(r .23) • . (1 _ p2)2). (5..8. .1).3(h(r) . . . which is called Fisher's z. Var Y i = E and h : 0 ~ RP where 0 is an open subset ofRd..3[h(c) . N(O.: ! + h(i)(rn)(y .u 2.p). .mil £ ~ hey) = h(rn) and (b) so that . . ': I = Ilgki(rn)11 pxd.h(p)]).
(c) $\sqrt{n}\,(h(\bar Y) - h(m)) = \sqrt{n}\,h^{(1)}(m)(\bar Y - m) + o_p(1)$. □

Example 5.3.7. $\chi^2$ and Normal Approximation to the Distribution of $\mathcal{F}$ Statistics. Suppose that $X_1, \ldots, X_n$ is a sample from a $N(0, 1)$ distribution. Then according to Corollary B.3.1, the $\mathcal{F}$ statistic

$$T_{k,m} = \frac{(1/k)\sum_{i=1}^k X_i^2}{(1/m)\sum_{i=k+1}^{k+m} X_i^2} \tag{5.3.25}$$

has an $\mathcal{F}_{k,m}$ distribution, where $k + m = n$. Suppose that $n \geq 60$ so that Table IV cannot be used for the distribution of $T_{k,m}$. When $k$ is fixed and $m$ (or equivalently $n = k + m$) is large, we can use Slutsky's theorem (A.14.9) to find an approximation to the distribution of $T_{k,m}$. To show this, we first note that $(1/m)\sum_{i=k+1}^{k+m} X_i^2$ is the average of $m$ independent $\chi^2_1$ random variables. By Theorem B.3.1, the mean of a $\chi^2_1$ variable is $E(Z^2)$, where $Z \sim N(0, 1)$. But $E(Z^2) = \mathrm{Var}(Z) = 1$. Now the weak law of large numbers (A.15.7) implies that as $m \to \infty$,

$$\frac{1}{m}\sum_{i=k+1}^{k+m} X_i^2 \xrightarrow{P} 1.$$

Using the (b) part of Slutsky's theorem, we conclude that for fixed $k$,

$$T_{k,m} \xrightarrow{\mathcal{L}} \frac{1}{k}\sum_{i=1}^k X_i^2$$

as $m \to \infty$. By Theorem B.3.1, $\sum_{i=1}^k X_i^2$ has a $\chi^2_k$ distribution. Thus, when the number of degrees of freedom in the denominator is large, the $\mathcal{F}_{k,m}$ distribution can be approximated by the distribution of $V/k$, where $V \sim \chi^2_k$.

To get an idea of the accuracy of this approximation, check the entries of Table IV against the last row. This row, which is labeled $m = \infty$, gives the quantiles of the distribution of $V/k$. For instance, if $k = 5$ and $m = 60$, then $P[T_{5,60} \leq 2.37] = P[(V/k) \leq 2.21] = 0.95$; that is, the respective upper 0.05 quantiles are 2.37 for the $\mathcal{F}_{5,60}$ distribution and 2.21 for the distribution of $V/k$. See also Figure B.3.1 in which the density of $V/k$, when $k = 10$, is given as the $\mathcal{F}_{10,\infty}$ density.

Next we turn to the normal approximation to the distribution of $T_{k,m}$. Suppose for simplicity that $k = m$ and $k \to \infty$. We write $T_k$ for $T_{k,k}$. The case $m = \lambda k$ for some $\lambda > 0$ is left to the problems. We do not require the $X_i$ to be normal, only that they be i.i.d. with $EX_1 = 0$, $EX_1^2 > 0$ and $EX_1^4 < \infty$. Then, if $\sigma^2 = \mathrm{Var}(X_1)$, we can write,

$$T_k = \frac{(1/k)\sum_{i=1}^k Y_{i1}}{(1/k)\sum_{i=1}^k Y_{i2}} \tag{5.3.26}$$
where $Y_{i1} = X_i^2/\sigma^2$ and $Y_{i2} = X_{k+i}^2/\sigma^2$, $i = 1, \ldots, k$. Equivalently $T_k = h(\bar Y)$ where $Y_i = (Y_{i1}, Y_{i2})^T$, $E(Y_i) = (1, 1)^T$ and $h(u, v) = u/v$. By Theorem 5.3.4, applied to the $k$ vectors $Y_1, \ldots, Y_k$ (so that $k$ plays the role of the sample size),

$$\sqrt{k}\,(T_k - 1) \xrightarrow{\mathcal{L}} N(0, h^{(1)}(\mathbf{1})\,\Sigma\,[h^{(1)}(\mathbf{1})]^T) \tag{5.3.27}$$

where $\mathbf{1} = (1, 1)^T$, $h^{(1)}(u, v) = (1/v, -u/v^2)^T$ and $\Sigma = \mathrm{Var}(Y_{11})J$, where $J$ is the $2 \times 2$ identity. We conclude that

$$\sqrt{k}\,(T_k - 1) \xrightarrow{\mathcal{L}} N(0, 2\mathrm{Var}(Y_{11})).$$

In particular if $X_1 \sim N(0, \sigma^2)$, as $k \to \infty$,

$$\sqrt{k}\,(T_k - 1) \xrightarrow{\mathcal{L}} N(0, 4).$$

In general, when $\min\{k, m\} \to \infty$, the distribution of $\sqrt{\frac{mk}{m+k}}\,(T_{k,m} - 1)$ can be approximated by a $N(0, 2)$ distribution. Thus (Problem 5.3.7), when $X_i \sim N(0, \sigma^2)$,

$$P[T_{k,m} \leq t] = P\left[\sqrt{\tfrac{mk}{m+k}}\,(T_{k,m} - 1) \leq \sqrt{\tfrac{mk}{m+k}}\,(t - 1)\right] \approx \Phi\left(\sqrt{\tfrac{mk}{m+k}}\,(t - 1)/\sqrt{2}\right). \tag{5.3.28}$$

An interesting and important point (noted by Box, 1953) is that unlike the $t$ test, the $F$ test for equality of variances (Problem 5.3.8(a)) does not have robustness of level. Specifically, if $\mathrm{Var}(X_1^2) \neq 2\sigma^4$, the upper $f_{k,m}(1 - \alpha)$ critical value, which by (5.3.28) satisfies

$$z_{1-\alpha} \approx \sqrt{\tfrac{mk}{m+k}}\,(f_{k,m}(1 - \alpha) - 1)/\sqrt{2}$$

or

$$f_{k,m}(1 - \alpha) \approx 1 + \sqrt{\tfrac{2(m+k)}{mk}}\,z_{1-\alpha},$$

is asymptotically incorrect. In general (Problem 5.3.8(c)) one has to use the critical value

$$c_{k,m} = 1 + \sqrt{\tfrac{\kappa(m+k)}{mk}}\,z_{1-\alpha} \tag{5.3.29}$$

where $\kappa = \mathrm{Var}[(X_1 - \mu_1)/\sigma_1]^2$, $\mu_1 = E(X_1)$, and $\sigma_1^2 = \mathrm{Var}(X_1)$. When $\kappa$ is unknown, it can be estimated by the method of moments (Problem 5.3.8(d)). □

5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families

Our final application of the $\delta$-method follows.

Theorem 5.3.5. Suppose $\mathcal{P}$ is a canonical exponential family of rank $d$ generated by $T$ with $\mathcal{E}$ open. Then if $X_1, \ldots, X_n$ are a sample from $P_\eta \in \mathcal{P}$ and $\hat\eta$ is defined as the MLE if it exists and equal to $c$ (some fixed value) otherwise,
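(i) $\hat\eta = \eta + \frac{1}{n}\sum_{i=1}^n \ddot A^{-1}(\eta)(T(X_i) - \dot A(\eta)) + o_{P_\eta}(n^{-1/2})$

(ii) $\mathcal{L}_\eta(\sqrt{n}\,(\hat\eta - \eta)) \to N_d(0, \ddot A^{-1}(\eta))$.

Proof. The result is a consequence of Theorems 5.2.2 and 5.3.4. We showed in the proof of Theorem 5.2.2 that, if $\bar T \equiv \frac{1}{n}\sum_{i=1}^n T(X_i)$, $P_\eta[\bar T \in \dot A(\mathcal{E})] \to 1$ and, hence, $P_\eta[\hat\eta = \dot A^{-1}(\bar T)] \to 1$. Identify $h$ in Theorem 5.3.4 with $\dot A^{-1}$ and $m$ with $\dot A(\eta)$. Note that by B.8.14, if $t = \dot A(\eta)$,

$$D\dot A^{-1}(t) = [D\dot A(\eta)]^{-1}. \tag{5.3.30}$$

But $D\dot A = \ddot A$ by definition and, thus, in our case,

$$h^{(1)}(m) = \ddot A^{-1}(\eta). \tag{5.3.31}$$

Thus, (i) follows from (5.3.23). For (ii) simply note that, in our case, by Corollary 1.6.1, $\Sigma = \mathrm{Var}(T(X_1)) = \ddot A(\eta)$ and, therefore,

$$h^{(1)}(m)\,\Sigma\,[h^{(1)}(m)]^T = \ddot A^{-1}\ddot A\ddot A^{-1}(\eta) = \ddot A^{-1}(\eta). \tag{5.3.32}$$

Hence, (ii) follows from (5.3.24). □

Remark 5.3.1. Recall that $\ddot A(\eta) = \mathrm{Var}_\eta(T) = I(\eta)$ is the Fisher information. Thus, the asymptotic variance matrix $I^{-1}(\eta)$ of $\sqrt{n}\,(\hat\eta - \eta)$ equals the lower bound (3.4.38) on the variance matrix of $\sqrt{n}\,(\tilde\eta - \eta)$ for any unbiased estimator $\tilde\eta$. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.2.1.

Example 5.3.8. Let $X_1, \ldots, X_n$ be i.i.d. as $X$ with $X \sim N(\mu, \sigma^2)$. Then $T_1 = \bar X$ and $T_2 = n^{-1}\sum X_i^2$ are sufficient statistics in the canonical model. Now

$$\sqrt{n}\,[T_1 - \mu,\ T_2 - (\mu^2 + \sigma^2)] \xrightarrow{\mathcal{L}} N(0, 0, I(\eta)) \tag{5.3.33}$$

where, by Example 2.3.4,

$$I(\eta) = \ddot A(\eta) = \frac{1}{2\eta_2^2}\begin{pmatrix} -\eta_2 & \eta_1 \\ \eta_1 & 1 - \eta_1^2/\eta_2 \end{pmatrix}.$$

Here $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/2\sigma^2$, $\hat\eta_1 = \bar X/\hat\sigma^2$, and $\hat\eta_2 = -1/2\hat\sigma^2$ where $\hat\sigma^2 = T_2 - (T_1)^2$. By Theorem 5.3.5,

$$\sqrt{n}\,(\hat\eta_1 - \eta_1,\ \hat\eta_2 - \eta_2) \xrightarrow{\mathcal{L}} N(0, 0, I^{-1}(\eta)).$$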
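Remark 5.3.1 says that the information matrix in the canonical parametrization is the covariance matrix of the sufficient statistic. For the $N(\mu, \sigma^2)$ family this can be checked numerically: the second-derivative matrix of $A(\eta) = -\eta_1^2/(4\eta_2) - \frac{1}{2}\log(-2\eta_2)$ should equal the covariance matrix of $(X, X^2)$. The sketch below is a minimal check under that assumed form of $A(\eta)$ (the usual one for the normal family with $T(x) = (x, x^2)$); the function names and parameter values are hypothetical.

```python
def canonical_parameters(mu, sigma2):
    """Map (mu, sigma^2) to the canonical parameters eta1 = mu/sigma^2, eta2 = -1/(2 sigma^2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)


def information_matrix(eta1, eta2):
    """Second-derivative matrix of A(eta) = -eta1^2/(4 eta2) - 0.5*log(-2 eta2)."""
    scale = 1.0 / (2.0 * eta2 ** 2)
    return [
        [-eta2 * scale, eta1 * scale],
        [eta1 * scale, (1.0 - eta1 ** 2 / eta2) * scale],
    ]


def covariance_of_x_and_x2(mu, sigma2):
    """Covariance matrix of (X, X^2) for X ~ N(mu, sigma^2), from normal moments."""
    return [
        [sigma2, 2.0 * mu * sigma2],
        [2.0 * mu * sigma2, 2.0 * sigma2 ** 2 + 4.0 * mu ** 2 * sigma2],
    ]


mu, sigma2 = 1.5, 2.0  # illustrative parameter values
eta1, eta2 = canonical_parameters(mu, sigma2)
print(information_matrix(eta1, eta2))
print(covariance_of_x_and_x2(mu, sigma2))
```

The two printed matrices should coincide, which is the content of $\ddot A(\eta) = \mathrm{Var}_\eta(T)$ for this family.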
Because $\bar X = T_1$ and $\hat\sigma^2 = T_2 - (T_1)^2$, we can use (5.3.33) and Theorem 5.3.4 to find (Problem 5.3.26)

$$\sqrt{n}\,(\bar X - \mu,\ \hat\sigma^2 - \sigma^2) \xrightarrow{\mathcal{L}} N(0, 0, \Sigma_0)$$

where $\Sigma_0 = \mathrm{diag}(\sigma^2, 2\sigma^4)$. □

Summary. Consistency is 0th-order asymptotics. First-order asymptotics provides approximations to the difference between a quantity tending to a limit and the limit, for instance, the difference between a consistent estimate and the parameter it estimates. Second-order asymptotics provides approximations to the difference between the error and its first-order approximation, and so on. We begin in Section 5.3.1 by studying approximations to moments and central moments of estimates. Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean. These "$\delta$ method" approximations based on Taylor's formula and elementary results about moments of means of i.i.d. variables are explained in terms of similar stochastic approximations to $h(\bar Y) - h(\mu)$ where $Y_1, \ldots, Y_n$ are i.i.d. as $Y$, $EY = \mu$, and $h$ is smooth. These stochastic approximations lead to Gaussian approximations to the laws of important statistics. The moment and in law approximations lead to the definition of variance stabilizing transformations for classical one-dimensional exponential families. Higher-order approximations to distributions (Edgeworth series) are discussed briefly. Finally, stochastic approximations in the case of vector statistics and parameters are developed, which lead to a result on the asymptotic normality of the MLE in multiparameter exponential families.

5.4 ASYMPTOTIC THEORY IN ONE DIMENSION

In this section we define and study asymptotic optimality for estimation, testing, and confidence bounds, under i.i.d. sampling, when we are dealing with one-dimensional smooth parametric models. Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families.

5.4.1 Estimation: The Multinomial Case

Following Fisher (1958),(1) we develop the theory first for the case that $X_1, \ldots, X_n$ are i.i.d. taking values $\{x_0, \ldots, x_k\}$ only so that $P$ is defined by $p \equiv (p_0, \ldots, p_k)$ where

$$p_j \equiv P[X_1 = x_j], \quad 0 \leq j \leq k \tag{5.4.1}$$

and $p \in \mathcal{S}$, the $(k+1)$-dimensional simplex (see Example 1.6.7). Thus, $N = (N_0, \ldots, N_k)$ where $N_j \equiv \sum_{i=1}^n 1(X_i = x_j)$ is sufficient. We consider one-dimensional parametric submodels of $\mathcal{S}$ defined by $\mathcal{P} = \{(p(x_0, \theta), \ldots, p(x_k, \theta)) : \theta \in \Theta\}$, $\Theta$ open $\subset R$ (e.g., see Example 2.1.4 and Problem 2.1.15). We focus first on estimation of $\theta$. Assume

A: $\theta \to p(x_j, \theta)$, $0 < p_j < 1$, is twice differentiable for $0 \leq j \leq k$.
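Note that A implies that

$$l(X_1, \theta) \equiv \log p(X_1, \theta) = \sum_{j=0}^k \log p(x_j, \theta)\,1(X_1 = x_j) \tag{5.4.2}$$

is twice differentiable and $\frac{\partial l}{\partial\theta}(X_1, \theta)$ is a well-defined, bounded random variable

$$\frac{\partial l}{\partial\theta}(X_1, \theta) = \sum_{j=0}^k \left(\frac{\partial p}{\partial\theta}(x_j, \theta)\right)\frac{1}{p(x_j, \theta)}\,1(X_1 = x_j). \tag{5.4.3}$$

Furthermore (Section 3.4.2),

$$E_\theta\frac{\partial l}{\partial\theta}(X_1, \theta) = 0 \tag{5.4.4}$$

and $\frac{\partial^2 l}{\partial\theta^2}(X_1, \theta)$ is similarly bounded and well defined with

$$\mathrm{Var}_\theta\left(\frac{\partial l}{\partial\theta}(X_1, \theta)\right) = -E_\theta\frac{\partial^2 l}{\partial\theta^2}(X_1, \theta) \equiv I(\theta). \tag{5.4.5}$$

As usual we call $I(\theta)$ the Fisher information.

Next suppose we are given a plug-in estimator $h\left(\frac{N}{n}\right)$ (see (2.1.11)) of $\theta$ where $h: \mathcal{S} \to R$ satisfies

$$h(p(\theta)) = \theta \text{ for all } \theta \in \Theta \tag{5.4.6}$$

where $p(\theta) = (p(x_0, \theta), \ldots, p(x_k, \theta))^T$. Many such $h$ exist if $k \geq 2$. Consider Example 2.1.4, for instance. Assume

H: $h$ is differentiable.

Then we have the following theorem.

Theorem 5.4.1. Under H, for all $\theta$,

$$\mathcal{L}_\theta\left(\sqrt{n}\left(h\left(\frac{N}{n}\right) - \theta\right)\right) \to N(0, \sigma^2(\theta, h)) \tag{5.4.7}$$

where $\sigma^2(\theta, h)$ is given by (5.4.11). Moreover, if A also holds,

$$\sigma^2(\theta, h) \geq I^{-1}(\theta) \tag{5.4.8}$$

with equality if and only if,

$$\frac{\partial h}{\partial p_j}(p(\theta)) = I^{-1}(\theta)\frac{\partial l}{\partial\theta}(x_j, \theta), \quad 0 \leq j \leq k. \tag{5.4.9}$$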
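Proof. Apply Theorem 5.3.2 noting that

$$\sqrt{n}\left(h\left(\frac{N}{n}\right) - h(p(\theta))\right) = \sqrt{n}\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\left(\frac{N_j}{n} - p(x_j, \theta)\right) + o_p(1).$$

Note that, using the definition of $N_j$,

$$\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\left(\frac{N_j}{n} - p(x_j, \theta)\right) = n^{-1}\sum_{i=1}^n\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,(1(X_i = x_j) - p(x_j, \theta)). \tag{5.4.10}$$

Thus, by (5.4.10), not only is $\sqrt{n}\left\{h\left(\frac{N}{n}\right) - h(p(\theta))\right\}$ asymptotically normal with mean 0, but also its asymptotic variance is

$$\sigma^2(\theta, h) = \sum_{j=0}^k \left(\frac{\partial h}{\partial p_j}(p(\theta))\right)^2 p(x_j, \theta) - \left[\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,p(x_j, \theta)\right]^2. \tag{5.4.11}$$

Note that by differentiating (5.4.6), we obtain

$$\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,\frac{\partial p}{\partial\theta}(x_j, \theta) = 1 \tag{5.4.12}$$

or equivalently, by noting $\frac{\partial p}{\partial\theta}(x_j, \theta) = \left[\frac{\partial}{\partial\theta}l(x_j, \theta)\right]p(x_j, \theta)$,

$$\mathrm{Cov}_\theta\left(\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,1(X_1 = x_j),\ \frac{\partial l}{\partial\theta}(X_1, \theta)\right) = 1. \tag{5.4.13}$$

By (5.4.13), using the correlation inequality (A.11.16) as in the proof of the information inequality (3.4.12), we obtain

$$1 \leq \sigma^2(\theta, h)\,\mathrm{Var}_\theta\frac{\partial l}{\partial\theta}(X_1, \theta) = \sigma^2(\theta, h)\,I(\theta) \tag{5.4.14}$$

with equality iff,

$$\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,(1(X_1 = x_j) - p(x_j, \theta)) = a(\theta)\frac{\partial l}{\partial\theta}(X_1, \theta) + b(\theta) \tag{5.4.15}$$

for some $a(\theta) \neq 0$ and some $b(\theta)$ with probability 1. Taking expectations we get $b(\theta) = 0$. Noting that the covariance of the right- and left-hand sides is $a(\theta)$, while their common variance is $a^2(\theta)I(\theta) = \sigma^2(\theta, h)$, we see that equality in (5.4.8) gives

$$a^2(\theta)I^2(\theta) = 1, \tag{5.4.16}$$

which implies (5.4.9). □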
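Theorem 5.4.1 can be illustrated numerically. The sketch below assumes the usual Hardy-Weinberg parametrization $p(\theta) = (\theta^2, 2\theta(1-\theta), (1-\theta)^2)$ and compares two plug-in estimators that both satisfy $h(p(\theta)) = \theta$: $h_1(p) = p_0 + p_1/2$ and $h_2(p) = \sqrt{p_0}$. The asymptotic variance is computed from (5.4.11) and the Fisher information as $\sum_j (\partial p_j/\partial\theta)^2/p_j$. The choice of model, the two estimators, and the function names are assumptions made for the example.

```python
def hardy_weinberg(theta):
    """Cell probabilities and their derivatives in theta (assumed parametrization)."""
    probs = [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]
    dprobs = [2 * theta, 2 - 4 * theta, -2 * (1 - theta)]
    return probs, dprobs


def fisher_information(probs, dprobs):
    """I(theta) = sum_j (dp_j/dtheta)^2 / p_j, the variance of the score."""
    return sum(dp ** 2 / p for p, dp in zip(probs, dprobs))


def asymptotic_variance(grad_h, probs):
    """sigma^2(theta, h): the variance of sum_j (dh/dp_j) 1(X = x_j)."""
    mean = sum(g * p for g, p in zip(grad_h, probs))
    return sum(g ** 2 * p for g, p in zip(grad_h, probs)) - mean ** 2


theta = 0.3  # illustrative parameter value
probs, dprobs = hardy_weinberg(theta)
bound = 1.0 / fisher_information(probs, dprobs)

grad_h1 = [1.0, 0.5, 0.0]                   # h1(p) = p0 + p1/2
grad_h2 = [1.0 / (2.0 * theta), 0.0, 0.0]   # h2(p) = sqrt(p0), gradient at p(theta)

print("information bound:", bound)
print("variance of h1   :", asymptotic_variance(grad_h1, probs))
print("variance of h2   :", asymptotic_variance(grad_h2, probs))
```

Under these assumptions $h_1$ attains the bound $\theta(1-\theta)/2$, while $h_2$ has the larger asymptotic variance $(1-\theta^2)/4$, so the inequality (5.4.8) is strict for $h_2$.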
We shall see in Section 5.4.3 that the information bound (5.4.8) is, if it exists and under regularity conditions, achieved by $\hat\theta = \hat h\left(\frac{N}{n}\right)$, the MLE of $\theta$ where $\hat h$ is defined implicitly by: $\hat h(p)$ is the value of $\theta$, which

(i) maximizes $\sum_{j=0}^k N_j \log p(x_j, \theta)$

and

(ii) solves $\sum_{j=0}^k N_j \frac{\partial l}{\partial\theta}(x_j, \theta) = 0$.

Example 5.4.1. One-Parameter Discrete Exponential Families. Suppose $p(x, \theta) = \exp\{\theta T(x) - A(\theta)\}h(x)$ where $h(x) = 1(x \in \{x_0, \ldots, x_k\})$, $\theta \in \Theta$, is a canonical one-parameter exponential family (supported on $\{x_0, \ldots, x_k\}$) and $\Theta$ is open. Then Theorem 5.3.5 applies to the MLE $\hat\theta$ and

$$\mathcal{L}_\theta(\sqrt{n}\,(\hat\theta - \theta)) \to N\left(0, \frac{1}{I(\theta)}\right) \tag{5.4.17}$$

with the asymptotic variance achieving the information bound $I^{-1}(\theta)$. Note that because $\bar T = n^{-1}\sum_{i=1}^n T(X_i) = \sum_{j=0}^k T(x_j)\frac{N_j}{n}$, then, by (2.3.3),

$$\hat\theta = [\dot A]^{-1}(\bar T), \tag{5.4.18}$$

and

$$h(p) = [\dot A]^{-1}\left(\sum_{j=0}^k T(x_j)p_j\right). \tag{5.4.19}$$

The binomial $(n, p)$ and Hardy-Weinberg models can both be put into this framework with canonical parameters such as $\theta = \log\left(\frac{p}{1-p}\right)$ in the first case. □

Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena. In the next two subsections we consider some more general situations.

5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates

We begin with an asymptotic normality theorem for minimum contrast estimates. As in Theorem 5.2.3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check.

Suppose i.i.d. $X_1, \ldots, X_n$ are tentatively modeled to be distributed according to $P_\theta$, $\theta \in \Theta$ open $\subset R$ and corresponding density/frequency functions $p(\cdot, \theta)$. Write $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. Let $\rho: \mathcal{X} \times \Theta \to R$ where

$$D(\theta, \theta_0) = E_{\theta_0}(\rho(X_1, \theta) - \rho(X_1, \theta_0))$$
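is uniquely minimized at $\theta_0$. Let $\bar\theta_n$ be the minimum contrast estimate

$$\bar\theta_n = \operatorname{argmin} \frac{1}{n}\sum_{i=1}^n \rho(X_i, \theta).$$

Suppose

A0: $\psi = \frac{\partial\rho}{\partial\theta}$ is well defined.

Then

$$\frac{1}{n}\sum_{i=1}^n \psi(X_i, \bar\theta_n) = 0. \tag{5.4.20}$$

In what follows we let $P$, rather than $P_\theta$, denote the distribution of $X_i$. This is because, as pointed out later in Remark 5.4.3, under regularity conditions the properties developed in this section are valid for $P \notin \{P_\theta : \theta \in \Theta\}$. We need only that $\theta(P)$ is a parameter as defined in Section 1.1.2. As we saw in Section 2.1, parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for. Suppose

A1: The parameter $\theta(P)$ given by the solution of

$$\int \psi(x, \theta)\,dP(x) = 0 \tag{5.4.21}$$

is well defined on $\mathcal{P}$. That is,

$$\int |\psi(x, \theta)|\,dP(x) < \infty, \quad \theta \in \Theta,\ P \in \mathcal{P}$$

and $\theta(P)$ is the unique solution of (5.4.21) and, hence, $\theta(P_\theta) = \theta$.

A2: $E_P\psi^2(X_1, \theta(P)) < \infty$ for all $P \in \mathcal{P}$.

A3: $\psi(\cdot, \theta)$ is differentiable, $\frac{\partial\psi}{\partial\theta}(X_1, \theta)$ has a finite expectation and

$$E_P\frac{\partial\psi}{\partial\theta}(X_1, \theta(P)) \neq 0.$$

A4: $\sup_t\left\{\left|\frac{1}{n}\sum_{i=1}^n\left(\frac{\partial\psi}{\partial\theta}(X_i, t) - \frac{\partial\psi}{\partial\theta}(X_i, \theta(P))\right)\right| : |t - \theta(P)| \leq \epsilon_n\right\} \xrightarrow{P} 0$ if $\epsilon_n \to 0$.

A5: $\bar\theta_n \xrightarrow{P} \theta(P)$. That is, $\bar\theta_n$ is consistent on $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$.

Theorem 5.4.2. Under A0-A5,

$$\bar\theta_n = \theta(P) + \frac{1}{n}\sum_{i=1}^n \tilde\psi(X_i, P) + o_p(n^{-1/2}) \tag{5.4.22}$$

where

$$\tilde\psi(x, P) = \psi(x, \theta(P))\Big/\left(-E_P\frac{\partial\psi}{\partial\theta}(X_1, \theta(P))\right). \tag{5.4.23}$$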
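Hence,

$$\mathcal{L}_P(\sqrt{n}\,(\bar\theta_n - \theta(P))) \to N(0, \sigma^2(\psi, P)) \tag{5.4.24}$$

where

$$\sigma^2(\psi, P) = \frac{E_P\psi^2(X_1, \theta(P))}{\left(E_P\frac{\partial\psi}{\partial\theta}(X_1, \theta(P))\right)^2}.$$

Proof. Claim (5.4.24) follows from the central limit theorem and Slutsky's theorem, applied to (5.4.22) because

$$\sqrt{n}\,(\bar\theta_n - \theta(P)) = n^{-1/2}\sum_{i=1}^n \tilde\psi(X_i, P) + o_p(1)$$

and

$$E_P\tilde\psi(X_1, P) = E_P\psi(X_1, \theta(P))\Big/\left(-E_P\frac{\partial\psi}{\partial\theta}(X_1, \theta(P))\right) = 0$$

while

$$E_P\tilde\psi^2(X_1, P) = \sigma^2(\psi, P) < \infty$$

by A1, A2, and A3. Next we show that (5.4.22) follows by a Taylor expansion of the equations (5.4.20) and (5.4.21). Let $\bar\theta_n = \theta(\hat P)$ where $\hat P$ denotes the empirical probability. By expanding $n^{-1}\sum_{i=1}^n \psi(X_i, \bar\theta_n)$ around $\theta(P)$, we obtain, using (5.4.20),

$$-\frac{1}{n}\sum_{i=1}^n \psi(X_i, \theta(P)) = \frac{1}{n}\sum_{i=1}^n \frac{\partial\psi}{\partial\theta}(X_i, \theta_n^*)(\bar\theta_n - \theta(P)) \tag{5.4.25}$$

where $|\theta_n^* - \theta(P)| \leq |\bar\theta_n - \theta(P)|$. Apply A5 and A4 to conclude that

$$\frac{1}{n}\sum_{i=1}^n \frac{\partial\psi}{\partial\theta}(X_i, \theta_n^*) = \frac{1}{n}\sum_{i=1}^n \frac{\partial\psi}{\partial\theta}(X_i, \theta(P)) + o_p(1) \tag{5.4.26}$$

and A3 and the WLLN to conclude that

$$\frac{1}{n}\sum_{i=1}^n \frac{\partial\psi}{\partial\theta}(X_i, \theta(P)) = E_P\frac{\partial\psi}{\partial\theta}(X_1, \theta(P)) + o_p(1). \tag{5.4.27}$$

Combining (5.4.25)-(5.4.27) we get,

$$(\bar\theta_n - \theta(P))\left(E_P\frac{\partial\psi}{\partial\theta}(X_1, \theta(P)) + o_p(1)\right) = -\frac{1}{n}\sum_{i=1}^n \psi(X_i, \theta(P)). \tag{5.4.28}$$

But by the central limit theorem and A1,

$$\frac{1}{n}\sum_{i=1}^n \psi(X_i, \theta(P)) = O_p(n^{-1/2}). \tag{5.4.29}$$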
Dividing by the second factor in (5.4.28) we finally obtain

$$\bar\theta_n - \theta(P) = \frac{1}{n}\sum_{i=1}^n \tilde\psi(X_i, P) + o_p\left(\frac{1}{n}\sum_{i=1}^n \psi(X_i, \theta(P))\right)$$

and (5.4.22) follows from the foregoing and (5.4.29). □

Remark 5.4.1. An additional assumption A6 gives a slightly different formula for $E_P\frac{\partial\psi}{\partial\theta}(X_1, \theta(P))$ if $P = P_\theta$.

A6: Suppose $P = P_\theta$ so that $\theta(P) = \theta$, and that the model $\mathcal{P}$ is regular and let $l(x, \theta) = \log p(x, \theta)$ where $p(\cdot, \theta)$ is as usual a density or frequency function. Suppose $l$ is differentiable and assume that

$$E_\theta\frac{\partial\psi}{\partial\theta}(X_1, \theta) = -E_\theta\left(\frac{\partial l}{\partial\theta}(X_1, \theta)\,\psi(X_1, \theta)\right) = -\mathrm{Cov}_\theta\left(\frac{\partial l}{\partial\theta}(X_1, \theta),\ \psi(X_1, \theta)\right). \tag{5.4.30}$$

Note that (5.4.30) is formally obtained by differentiating the equation (5.4.21), written as

$$\int \psi(x, \theta)\,p(x, \theta)\,d\mu(x) = 0 \tag{5.4.31}$$

for all $\theta$. If an unbiased estimate $\delta(X_1)$ of $\theta$ exists and we let $\psi(x, \theta) = \delta(x) - \theta$, it is easy to see that A6 is the same as (3.4.8). If further $X_1$ takes on a finite set of values, $\{x_0, \ldots, x_k\}$, and we define $h(p) = \sum_{j=0}^k \delta(x_j)p_j$, we see that A6 corresponds to (5.4.12). Identity (5.4.30) suggests that if $\mathcal{P}$ is regular the conclusion of Theorem 5.4.2 may hold even if $\psi$ is not differentiable provided that $-E_\theta\frac{\partial\psi}{\partial\theta}(X_1, \theta)$ is replaced by $\mathrm{Cov}_\theta\left(\psi(X_1, \theta), \frac{\partial l}{\partial\theta}(X_1, \theta)\right)$ and a suitable replacement for A3, A4 is found. This is in fact true; see Problem 5.4.1.

Remark 5.4.2. Solutions to (5.4.20) are called M-estimates. Our arguments apply to M-estimates. Nothing in the arguments requires that $\bar\theta_n$ be a minimum contrast as well as an M-estimate (i.e., that $\psi = \frac{\partial\rho}{\partial\theta}$ for some $\rho$).

Remark 5.4.3. Our arguments apply even if $X_1, \ldots, X_n$ are i.i.d. $P$ but $P \notin \mathcal{P} = \{P_\theta : \theta \in \Theta\}$. $\theta(P)$ in A1-A5 is then replaced by

(1) $\theta(P) = \operatorname{argmin} E_P\rho(X_1, \theta)$

or more generally,

(2) $\theta(P)$ solves $E_P\psi(X_1, \theta) = 0$.

Theorem 5.4.2 is valid with $\theta(P)$ as in (1) or (2). This extension will be pursued in Volume 2.

We conclude by stating some sufficient conditions, essentially due to Cramér (1946), for A4 and A6. Conditions A0, A1, A2, and A3 are readily checkable whereas we have given conditions for A5 in Section 5.2.
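The variance formula $\sigma^2(\psi, P) = E_P\psi^2/(E_P\,\partial\psi/\partial\theta)^2$ of Theorem 5.4.2 has a direct empirical counterpart in which both expectations are replaced by sample averages. The sketch below illustrates this for a Huber-type $\psi$, chosen here only as an example: it is differentiable except at two points, so A3 holds only in a weakened sense, and its sign is taken so that $\sum\psi$ is nonincreasing in $\theta$ (the variance formula does not depend on the sign of $\psi$). The function names, the cutoff $k$, and the data are hypothetical.

```python
def huber_psi(x, theta, k):
    """Huber-type score: x - theta clipped to the interval [-k, k]."""
    return max(-k, min(k, x - theta))


def huber_dpsi(x, theta, k):
    """Derivative of huber_psi in theta (defined away from |x - theta| = k)."""
    return -1.0 if abs(x - theta) < k else 0.0


def m_estimate(data, k, tol=1e-10):
    """Solve sum_i psi(X_i, theta) = 0 by bisection; the sum is nonincreasing in theta."""
    lo, hi = min(data), max(data)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if sum(huber_psi(x, mid, k) for x in data) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def sandwich_variance(data, theta, k):
    """Plug-in estimate of sigma^2(psi, P): mean(psi^2) / mean(dpsi/dtheta)^2."""
    n = len(data)
    mean_psi2 = sum(huber_psi(x, theta, k) ** 2 for x in data) / n
    mean_dpsi = sum(huber_dpsi(x, theta, k) for x in data) / n
    return mean_psi2 / mean_dpsi ** 2


data = [1.0, 2.0, 4.0, 7.0]  # toy data for illustration
theta_mean = m_estimate(data, k=100.0)   # cutoff never active: the sample mean
theta_huber = m_estimate(data, k=1.0)    # outlying points are clipped
print(theta_mean, sandwich_variance(data, theta_mean, k=100.0))
print(theta_huber, sandwich_variance(data, theta_huber, k=1.0))
```

When the cutoff is never active, $\psi(x, \theta) = x - \theta$, the M-estimate is the sample mean, and the plug-in variance reduces to the average squared deviation from the mean, as the formula predicts.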
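A4': (a) $\theta \to \frac{\partial\psi}{\partial\theta}(x, \theta)$ is a continuous function of $\theta$ for all $x$.

(b) There exists $\delta(\theta) > 0$ such that

$$\sup\left\{\left|\frac{\partial\psi}{\partial\theta}(X_1, \theta') - \frac{\partial\psi}{\partial\theta}(X_1, \theta)\right| : |\theta - \theta'| \leq \delta(\theta)\right\} \leq M(X_1, \theta),$$

where $E_\theta M(X_1, \theta) < \infty$.

A6': $\frac{\partial\psi}{\partial\theta}(x, \theta')$ is defined for all $x$, $|\theta' - \theta| \leq \delta(\theta)$ and $\int_{\theta-\delta}^{\theta+\delta}\int\left|\frac{\partial\psi}{\partial s}(x, s)\right|d\mu(x)\,ds < \infty$ for some $\delta = \delta(\theta) > 0$, where $\mu(x)$ is the dominating measure for $P(x)$ defined in (A.10.13). That is, "$dP(x) = p(x)\,d\mu(x)$."

Details of how A4' (with A0-A3) implies A4 and A6' implies A6 are given in the problems. We also indicate by example in the problems that some conditions are needed (Problem 5.4.4) but A4' and A6' are not necessary (Problem 5.4.1).

5.4.3 Asymptotic Normality and Efficiency of the MLE

The most important special case of (5.4.20) occurs when $\rho(x, \theta) = -l(x, \theta) \equiv -\log p(x, \theta)$ and $\psi(x, \theta) = \frac{\partial l}{\partial\theta}(x, \theta)$ obeys A0-A6. In this case $\bar\theta_n$ is the MLE $\hat\theta_n$ and we obtain an identity of Fisher's,

$$-E_\theta\frac{\partial^2 l}{\partial\theta^2}(X_1, \theta) = E_\theta\left(\frac{\partial l}{\partial\theta}(X_1, \theta)\right)^2 = \mathrm{Var}_\theta\left(\frac{\partial l}{\partial\theta}(X_1, \theta)\right) \equiv I(\theta), \tag{5.4.32}$$

where $I(\theta)$ is the Fisher information introduced in Section 3.4. We can now state the basic result on asymptotic normality and efficiency of the MLE.

Theorem 5.4.3. If A0-A6 apply to $\rho(x, \theta) = -l(x, \theta)$ and $P = P_\theta$, then the MLE $\hat\theta_n$ satisfies

$$\hat\theta_n = \theta + \frac{1}{n}\sum_{i=1}^n \frac{1}{I(\theta)}\frac{\partial l}{\partial\theta}(X_i, \theta) + o_p(n^{-1/2}) \tag{5.4.33}$$

so that

$$\mathcal{L}_\theta(\sqrt{n}\,(\hat\theta_n - \theta)) \to N\left(0, \frac{1}{I(\theta)}\right). \tag{5.4.34}$$

Furthermore, if $\bar\theta_n$ is a minimum contrast estimate whose corresponding $\rho$ and $\psi$ satisfy A0-A6, then

$$\sigma^2(\psi, P_\theta) \geq \frac{1}{I(\theta)} \tag{5.4.35}$$

with equality iff $\psi = a(\theta)\frac{\partial l}{\partial\theta}$ for some $a \neq 0$.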
. Hodges's Example.2.. The optimality part of Theorem 5. (5.4.. . Let Z ~ N(O. p.4. .nBI < nl/'J <I>(n l/ ' . .4.37) .4.. (5. claim (5.nB).1 once we identify 'IjJ(x.34) follow directly by Theorem 5. thus.39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo).nB) .<1>( _n '/4 .. Consider the following competitor to X: B n I 0 if IXI < n. B) = 0.38) Therefore.'(B) = I = 1(~I' B I' 0. . B) with J T(x) .l'1 Let X" . and Polen = OJ ..2. 5.r 332 Asymptotic Approximations Chapter 5 Proof Claims (5..4. 0 because nIl' . By (5.". Then X is the MLE of B and it is trivial to calculate [( B) _ 1. we know that all likelihood ratio tests for simple BI e e eo. .33) and (5.. and.3 generalizes Example 5.. The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(.1/4 X if!XI > n..4.d. If B = 0. . PolBn = Xl .4. {X 1. see Lehmann and Casella...4..4. 1 .• •. 1998.1/4 I (5. Then . PoliXI < n1/'1 . .B). . We next compule the limiting distribution of . 0 1 . B) is an MLR family in T(X). j 1 " Note that Theorem 5.n(Bn .4.35) is equivalent to (5.4.1).30) and (5. PoliXI < n 1/4 ] . 1.. I I Example 5. 442.ne .3 is not valid without some conditions on the esti· mates being considered. 1 . We discuss this further in Volume II. for some Bo E is known as superefficiency. N(B. for higherdimensional the phenomenon becomes more disturbing and has important practical consequences.A'(B)..4..Xn be i. I.'(0) = 0 < l(~)' The phenomenon (5.4. cross multiplication shows that (5. 1).4 Testing ! I i .i. 0 e en e.i . We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n. . However. 1. .36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!.35)..36) Because Eo 'lj. }. 00. if B I' 0.1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise. For this estimate superefficiency implies poor behavior of at values close to 0. .(X B).4.439) where . 
PIIZ + .. 1 I I.4. Therefore.. .4.
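versus simple $\theta_2$, $\theta_1 < \theta_2$, as well as the likelihood ratio test for $H$ versus $K$, are of the form "Reject $H$ for $T(X)$ large" with the critical value specified by making the probability of type I error $\alpha$ at $\theta_0$. If $p(\cdot, \theta)$ is a one-parameter exponential family in $\theta$ generated by $T(X)$, this test can also be interpreted as a test of $H: \lambda \leq \lambda_0$ versus $K: \lambda > \lambda_0$, where $\lambda = \dot A(\theta)$ because $\dot A$ is strictly increasing. The test is then precisely, "Reject $H$ for large values of the MLE $T(X)$ of $\lambda$." It seems natural in general to study the behavior of the test, "Reject $H$ if $\hat\theta_n \geq c(\alpha, \theta_0)$" where $P_{\theta_0}[\hat\theta_n \geq c(\alpha, \theta_0)] = \alpha$ and $\hat\theta_n$ is the MLE of $\theta$. We will use asymptotic theory to study the behavior of this test when we observe i.i.d. $X_1, \ldots, X_n$ distributed according to $P_\theta$, $\theta \in (a, b)$, $a < \theta_0 < b$, derive an optimality property, and then directly and through problems exhibit other tests with the same behavior.

Let $c_n(\alpha, \theta_0)$ denote the critical value of the test using the MLE $\hat\theta_n$ based on $n$ observations.

Theorem 5.4.4. Suppose the model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is such that the conditions of Theorem 5.4.2 apply to $\psi = \frac{\partial l}{\partial\theta}$ and $\hat\theta_n$, the MLE. That is,

$$\mathcal{L}_\theta(\sqrt{n}\,(\hat\theta_n - \theta)) \to N(0, I^{-1}(\theta)) \tag{5.4.40}$$

where $I(\theta) > 0$ for all $\theta$. Then

$$c_n(\alpha, \theta_0) = \theta_0 + z_{1-\alpha}/\sqrt{nI(\theta_0)} + o(n^{-1/2}) \tag{5.4.41}$$

where $z_{1-\alpha}$ is the $1 - \alpha$ quantile of the $N(0, 1)$ distribution. Suppose (A4') holds as well as (A6) and $I(\theta) < \infty$ for all $\theta$. Then

$$\text{If } \theta > \theta_0,\quad P_\theta[\hat\theta_n > c_n(\alpha, \theta_0)] \to 1. \tag{5.4.42}$$

$$\text{If } \theta < \theta_0,\quad P_\theta[\hat\theta_n > c_n(\alpha, \theta_0)] \to 0. \tag{5.4.43}$$

Property (5.4.42) is sometimes called consistency of the test against a fixed alternative.

Proof. The proof is straightforward:

$$P_{\theta_0}[\sqrt{nI(\theta_0)}\,(\hat\theta_n - \theta_0) \geq z] \to 1 - \Phi(z)$$

by (5.4.40). Thus,

$$P_{\theta_0}[\hat\theta_n \geq \theta_0 + z_{1-\alpha}/\sqrt{nI(\theta_0)}] = P_{\theta_0}[\sqrt{nI(\theta_0)}\,(\hat\theta_n - \theta_0) \geq z_{1-\alpha}] \to \alpha. \tag{5.4.44}$$

But Polya's theorem (A.14.22) guarantees that

$$\sup_z\left|P_{\theta_0}[\sqrt{nI(\theta_0)}\,(\hat\theta_n - \theta_0) \geq z] - (1 - \Phi(z))\right| \to 0, \tag{5.4.45}$$

which implies that $\sqrt{nI(\theta_0)}\,(c_n(\alpha, \theta_0) - \theta_0) - z_{1-\alpha} \to 0$, and (5.4.41) follows. On the other hand,

$$P_\theta[\hat\theta_n \geq c_n(\alpha, \theta_0)] = P_\theta[\sqrt{nI(\theta)}\,(\hat\theta_n - \theta) \geq \sqrt{nI(\theta)}\,(c_n(\alpha, \theta_0) - \theta)]. \tag{5.4.46}$$

By (5.4.41),

$$\sqrt{nI(\theta)}\,(c_n(\alpha, \theta_0) - \theta) = \sqrt{nI(\theta)}\,(\theta_0 - \theta + z_{1-\alpha}/\sqrt{nI(\theta_0)} + o(n^{-1/2})) = \sqrt{nI(\theta)}\,(\theta_0 - \theta) + O(1),$$

which tends to $-\infty$ if $\theta > \theta_0$ and to $\infty$ if $\theta < \theta_0$. Claims (5.4.42) and (5.4.43) follow. □

Theorem 5.4.4 tells us that the test under discussion is consistent and that for $n$ large the power function of the test rises steeply to $\alpha$ from the left at $\theta_0$ and continues rising steeply to 1 to the right of $\theta_0$. Optimality claims rest on a more refined analysis involving a reparametrization from $\theta$ to $\gamma \equiv \sqrt{n}\,(\theta - \theta_0)$.(3)

Theorem 5.4.5. Suppose the conditions of Theorem 5.4.2 and (5.4.40) hold uniformly for $\theta$ in a neighborhood of $\theta_0$. That is, assume

$$\sup\left\{\left|P_\theta[\sqrt{nI(\theta)}\,(\hat\theta_n - \theta) \geq z] - (1 - \Phi(z))\right| : |\theta - \theta_0| \leq \epsilon(\theta_0)\right\} \to 0, \tag{5.4.47}$$

for some $\epsilon(\theta_0) > 0$. Let $Q_\gamma \equiv P_\theta$, $\gamma = \sqrt{n}\,(\theta - \theta_0)$, then

$$Q_\gamma[\hat\theta_n \geq c_n(\alpha, \theta_0)] \to 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) \tag{5.4.48}$$

uniformly in $\gamma$. Furthermore, if $\varphi_n(X_1, \ldots, X_n)$ is any sequence of (possibly randomized) critical (test) functions such that

$$E_{\theta_0}\varphi_n(X_1, \ldots, X_n) \to \alpha, \tag{5.4.49}$$

then

$$\overline{\lim}_n\,E_{\theta_0 + \gamma/\sqrt{n}}\,\varphi_n(X_1, \ldots, X_n) \leq 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) \text{ if } \gamma > 0,$$
$$\underline{\lim}_n\,E_{\theta_0 + \gamma/\sqrt{n}}\,\varphi_n(X_1, \ldots, X_n) \geq 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) \text{ if } \gamma < 0. \tag{5.4.50}$$

Note that (5.4.48) and (5.4.50) can be interpreted as saying that among all tests that are asymptotically level $\alpha$ (obey (5.4.49)) the test based on rejecting for large values of $\hat\theta_n$ is asymptotically uniformly most powerful (obey (5.4.50)) and has asymptotically smallest probability of type I error for $\theta \leq \theta_0$. In fact, these statements can only be interpreted as valid in a small neighborhood of $\theta_0$ because $\gamma$ fixed means $\theta \to \theta_0$. On the other hand, if $\sqrt{n}\,(\theta - \theta_0)$ tends to zero, then by (5.4.50), the power of tests with asymptotic level $\alpha$ tends to $\alpha$. If $\sqrt{n}\,(\theta - \theta_0)$ tends to infinity, the power of the test based on $\hat\theta_n$ tends to 1 by (5.4.48). In either case, the test based on $\hat\theta_n$ is still asymptotically MP.

Proof. Write

$$P_\theta[\hat\theta_n \geq c_n(\alpha, \theta_0)] = P_\theta[\sqrt{nI(\theta)}\,(\hat\theta_n - \theta) \geq \sqrt{nI(\theta)}\,(c_n(\alpha, \theta_0) - \theta)]$$
$$= P_\theta[\sqrt{nI(\theta)}\,(\hat\theta_n - \theta) \geq \sqrt{nI(\theta)}\,(\theta_0 - \theta + z_{1-\alpha}/\sqrt{nI(\theta_0)} + o(n^{-1/2}))]. \tag{5.4.51}$$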
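By (5.4.41),
$$\sqrt{nI(\theta)}(c_n(\alpha,\theta_0) - \theta) = \sqrt{nI(\theta)}\left(\theta_0 - \theta + z_{1-\alpha}/\sqrt{nI(\theta_0)} + o(n^{-1/2})\right) = \sqrt{nI(\theta)}(\theta_0 - \theta) + O(1),$$
which tends to $-\infty$ if $\theta > \theta_0$ and to $\infty$ if $\theta < \theta_0$. Claims (5.4.42) and (5.4.43) follow. $\Box$

Theorem 5.4.4 tells us that the test under discussion is consistent and that for $n$ large the power function of the test rises steeply to $\alpha$ from the left at $\theta_0$ and continues rising steeply to 1 to the right of $\theta_0$. Optimality claims rest on a more refined analysis involving a reparametrization from $\theta$ to $\gamma \equiv \sqrt n(\theta - \theta_0)$.

Theorem 5.4.5. Suppose the conditions of Theorem 5.4.2 and (5.4.40) hold uniformly for $\theta$ in a neighborhood of $\theta_0$. That is, assume
$$\sup\left\{\left|P_\theta[\sqrt{nI(\theta)}(\hat\theta_n - \theta) \ge z] - (1 - \Phi(z))\right| : |\theta - \theta_0| \le \epsilon(\theta_0)\right\} \to 0, \tag{5.4.47}$$
for some $\epsilon(\theta_0) > 0$. Let $Q_\gamma \equiv P_\theta$, $\gamma = \sqrt n(\theta - \theta_0)$, then
$$Q_\gamma[\hat\theta_n \ge c_n(\alpha,\theta_0)] \to 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) \tag{5.4.48}$$
uniformly in $\gamma$. Furthermore, if $\varphi_n(X_1,\dots,X_n)$ is any sequence of (possibly randomized) critical (test) functions such that
$$E_{\theta_0}\varphi_n(X_1,\dots,X_n) \to \alpha, \tag{5.4.49}$$
then
$$\begin{aligned}
\limsup_n E_{\theta_0+\gamma/\sqrt n}\,\varphi_n(X_1,\dots,X_n) &\le 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) \text{ if } \gamma > 0,\\
\liminf_n E_{\theta_0+\gamma/\sqrt n}\,\varphi_n(X_1,\dots,X_n) &\ge 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) \text{ if } \gamma < 0.
\end{aligned} \tag{5.4.50}$$

Note that (5.4.48) and (5.4.50) can be interpreted as saying that among all tests that are asymptotically level $\alpha$ (obey (5.4.49)) the test based on rejecting for large values of $\hat\theta_n$ is asymptotically uniformly most powerful (obey (5.4.50)) and has asymptotically smallest probability of type I error for $\theta < \theta_0$. In fact, these statements can only be interpreted as valid in a small neighborhood of $\theta_0$ because $\gamma$ fixed means $\theta \to \theta_0$. On the other hand, if $\sqrt n(\theta - \theta_0)$ tends to zero, then by (5.4.50), the power of tests with asymptotic level $\alpha$ tends to $\alpha$. If $\sqrt n(\theta - \theta_0)$ tends to infinity, the power of the test based on $\hat\theta_n$ tends to 1 by (5.4.48). In either case, the test based on $\hat\theta_n$ is still asymptotically MP.

Proof. Write
$$\begin{aligned}
P_\theta[\hat\theta_n \ge c_n(\alpha,\theta_0)] &= P_\theta[\sqrt{nI(\theta)}(\hat\theta_n - \theta) \ge \sqrt{nI(\theta)}(c_n(\alpha,\theta_0) - \theta)]\\
&= P_\theta\left[\sqrt{nI(\theta)}(\hat\theta_n - \theta) \ge \sqrt{nI(\theta)}\left(\theta_0 - \theta + z_{1-\alpha}/\sqrt{nI(\theta_0)} + o(n^{-1/2})\right)\right].
\end{aligned} \tag{5.4.51}$$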
If $\gamma = \sqrt n(\theta - \theta_0)$ is fixed, $I(\theta) = I(\theta_0 + \gamma/\sqrt n) \to I(\theta_0)$ because our uniformity assumption implies that $\theta \to I(\theta)$ is continuous (Problem 5.4.7). Thus,
$$\begin{aligned}
Q_\gamma[\hat\theta_n \ge c_n(\alpha,\theta_0)] &= 1 - \Phi\left(z_{1-\alpha}(1 + o(1)) + \sqrt{n(I(\theta_0) + o(1))}\,(\theta_0 - \theta) + o(1)\right)\\
&= 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) + o(1)
\end{aligned} \tag{5.4.52}$$
and (5.4.48) follows.

To prove (5.4.50) note that by the Neyman-Pearson lemma, if $\gamma > 0$,
$$\begin{aligned}
E_{\theta_0+\gamma/\sqrt n}\,\varphi_n(X_1,\dots,X_n) \le{}& P_{\theta_0+\gamma/\sqrt n}\left[\sum_{i=1}^n \log\frac{p(X_i,\theta_0+\gamma/\sqrt n)}{p(X_i,\theta_0)} > d_n(\alpha,\theta_0)\right]\\
&+ \epsilon_n P_{\theta_0+\gamma/\sqrt n}\left[\sum_{i=1}^n \log\frac{p(X_i,\theta_0+\gamma/\sqrt n)}{p(X_i,\theta_0)} = d_n(\alpha,\theta_0)\right],
\end{aligned} \tag{5.4.53}$$
where $p(x,\theta)$ denotes the density of $X_i$ and $d_n$, $\epsilon_n$ are uniquely chosen so that the right-hand side of (5.4.53) is $\alpha$ if $\gamma$ is 0. Further Taylor expansion and probabilistic arguments of the type we have used show that the right-hand side of (5.4.53) tends to the right-hand side of (5.4.50) for all $\gamma$. The details are in Problem 5.4.5. $\Box$

The asymptotic results we have just established do not establish that the test that rejects for large values of $\hat\theta_n$ is necessarily good for all alternatives for any $n$. The test $1[\hat\theta_n \ge c_n(\alpha,\theta_0)]$ of Theorems 5.4.4 and 5.4.5 in the future will be referred to as a Wald test. There are two other types of test that have the same asymptotic behavior. These are the likelihood ratio test and the score or Rao test.

It is easy to see that the likelihood ratio test for testing $H: \theta \le \theta_0$ versus $K: \theta > \theta_0$ is of the form
"Reject if $\sum_{i=1}^n \log[p(X_i,\hat\theta_n)/p(X_i,\theta_0)]\,1(\hat\theta_n > \theta_0) \ge k_n(\theta_0,\alpha)$."
It may be shown (Problem 5.4.8) that, for $\alpha \le \frac12$, $k_n(\theta_0,\alpha) = \frac12 z_{1-\alpha}^2 + o(1)$ and that if $\delta_{Wn}(X_1,\dots,X_n)$ is the critical function of the Wald test and $\delta_{Ln}(X_1,\dots,X_n)$ is the critical function of the LR test then, for all $\gamma$,
$$P_{\theta_0+\gamma/\sqrt n}[\delta_{Ln}(X_1,\dots,X_n) = \delta_{Wn}(X_1,\dots,X_n)] \to 1. \tag{5.4.54}$$
Assertion (5.4.54) establishes that the test $\delta_{Ln}$ yields equality in (5.4.50) and, hence, is asymptotically most powerful as well.

Finally, note that the Neyman-Pearson LR test for $H: \theta = \theta_0$ versus $K: \theta_0 + \epsilon$, $\epsilon > 0$ rejects for large values of
$$\frac1\epsilon\left[\log p_n(X_1,\dots,X_n,\theta_0+\epsilon) - \log p_n(X_1,\dots,X_n,\theta_0)\right]$$
where $p_n(X_1,\dots,X_n,\theta)$ is the joint density of $X_1,\dots,X_n$. For $\epsilon$ small, $n$ fixed, this is approximately the same as rejecting for large values of $\frac{\partial}{\partial\theta_0}\log p_n(X_1,\dots,X_n,\theta_0)$.
The preceding argument doesn't depend on the fact that $X_1,\dots,X_n$ are i.i.d. with common density or frequency function $p(x,\theta)$, and the test that rejects $H$ for large values of $\frac{\partial}{\partial\theta_0}\log p_n(X_1,\dots,X_n,\theta_0)$ is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming

"Reject $H$ iff $\sum_{i=1}^n \frac{\partial}{\partial\theta_0}\log p(X_i,\theta_0) \ge r_n(\alpha,\theta_0)$." (5.4.55)
It is easy to see (Problem 5.4.15) that
$$r_n(\alpha,\theta_0) = z_{1-\alpha}\sqrt{nI(\theta_0)} + o(n^{1/2})$$
and that again if $\delta_{Rn}(X_1,\dots,X_n)$ is the critical function of the Rao test then
$$P_{\theta_0+\gamma/\sqrt n}[\delta_{Rn}(X_1,\dots,X_n) = \delta_{Wn}(X_1,\dots,X_n)] \to 1, \tag{5.4.56}$$
(Problem 5.4.8) and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, $I(\theta_0)$, which may require numerical integration, can be replaced by $-n^{-1}\frac{d^2}{d\theta^2}l_n(\hat\theta_n)$ (Problem 5.4.10).
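The agreement asserted in (5.4.56) can be illustrated numerically. The following is a minimal sketch, not taken from the text: it assumes an exponential model with rate $\lambda$ (so $\hat\lambda_n = 1/\bar X$ and $I(\lambda) = 1/\lambda^2$), uses only the leading terms of the critical values (dropping the $o(\cdot)$ remainders), and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def wald_rejects(xs, lam0, alpha):
    """Wald test with the leading-order critical value lam0 + z / sqrt(n I(lam0))."""
    n = len(xs)
    mle = n / sum(xs)                         # MLE of the exponential rate
    z = NormalDist().inv_cdf(1 - alpha)
    info = 1.0 / lam0 ** 2                    # Fisher information I(lam0)
    return mle > lam0 + z / math.sqrt(n * info)


def rao_rejects(xs, lam0, alpha):
    """Rao (score) test: reject when the score at lam0 exceeds z * sqrt(n I(lam0))."""
    n = len(xs)
    score = sum(1.0 / lam0 - x for x in xs)   # derivative of sum(log lam - lam * x) at lam0
    z = NormalDist().inv_cdf(1 - alpha)
    info = 1.0 / lam0 ** 2
    return score > z * math.sqrt(n * info)


def agreement_rate(n, lam0, gamma, alpha, reps, seed=0):
    """Fraction of simulated samples from the local alternative on which both tests agree."""
    rng = random.Random(seed)
    lam = lam0 + gamma / math.sqrt(n)         # local alternative theta0 + gamma / sqrt(n)
    agree = 0
    for _ in range(reps):
        xs = [rng.expovariate(lam) for _ in range(n)]
        agree += wald_rejects(xs, lam0, alpha) == rao_rejects(xs, lam0, alpha)
    return agree / reps


rate = agreement_rate(n=1000, lam0=1.0, gamma=1.0, alpha=0.05, reps=300)
print(f"fraction of samples on which Wald and Rao agree: {rate:.3f}")
```

In this model the two statistics differ by a term of order $n^{-1/2}$, so disagreements occur only when the statistic falls very close to $z_{1-\alpha}$; increasing `n` should push the agreement rate toward 1.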
5.4.5 Confidence Bounds

We define an asymptotic level $1-\alpha$ lower confidence bound (LCB) $\underline\theta_n$ by the requirement that
$$P_\theta[\underline\theta_n \le \theta] \to 1 - \alpha \tag{5.4.57}$$
for all $\theta$ and similarly define asymptotic level $1-\alpha$ UCBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:
(i) By using a natural pivot.

(ii) By inverting the testing regions derived in Section 5.4.4.
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (A0)-(A6), (A4'), and $I(\theta)$ finite for all $\theta$, it follows (Problem 5.4.9) that
$$\mathcal L_\theta\left(\sqrt{nI(\hat\theta_n)}\,(\hat\theta_n - \theta)\right) \to N(0,1) \tag{5.4.58}$$
for all $\theta$ and, hence, an asymptotic level $1-\alpha$ lower confidence bound is given by
$$\theta_n^* = \hat\theta_n - z_{1-\alpha}/\sqrt{nI(\hat\theta_n)}. \tag{5.4.59}$$
Turning to method (ii), inversion of $\delta_{Wn}$ gives formally
$$\theta_{n1}^* = \inf\{\theta : c_n(\alpha,\theta) \ge \hat\theta_n\} \tag{5.4.60}$$
or if we use the approximation $\tilde c_n(\alpha,\theta) = \theta + z_{1-\alpha}/\sqrt{nI(\theta)}$, (5.4.41),
$$\theta_{n2}^* = \inf\{\theta : \tilde c_n(\alpha,\theta) \ge \hat\theta_n\}. \tag{5.4.61}$$
In fact neither $\theta_{n1}^*$ nor $\theta_{n2}^*$ properly inverts the tests unless $c_n(\alpha,\theta)$ and $\tilde c_n(\alpha,\theta)$ are increasing in $\theta$. The three bounds are different as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, $\theta_{n1}^*$ is preferable because this bound is not only approximately but genuinely level $1-\alpha$. But computationally it is often hard to implement because $c_n(\alpha,\theta)$ needs, in general, to be computed by simulation for a grid of $\theta$ values. Typically, (5.4.59) or some equivalent alternatives (Problem 5.4.10) are preferred but can be quite inadequate (Problem 5.4.11). These bounds $\theta_n^*$, $\theta_{n1}^*$, $\theta_{n2}^*$ are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5.4.12 and 5.4.13).
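As an illustration of the pivot bound (5.4.59), the sketch below estimates its coverage by simulation. It assumes a Bernoulli($\theta$) model, for which $\hat\theta_n = \bar X$ and $I(\theta) = 1/(\theta(1-\theta))$; the model choice, the parameter values, and the function names are assumptions made for this example only.

```python
import math
import random
from statistics import NormalDist


def wald_lcb(xs, alpha):
    """Lower bound theta_hat - z_{1-alpha} / sqrt(n I(theta_hat)) for Bernoulli data."""
    n = len(xs)
    theta_hat = sum(xs) / n
    z = NormalDist().inv_cdf(1 - alpha)
    # For Bernoulli(theta), 1 / I(theta) = theta * (1 - theta).
    return theta_hat - z * math.sqrt(theta_hat * (1 - theta_hat) / n)


def coverage(theta, n, alpha, reps, seed=1):
    """Monte Carlo estimate of P_theta[LCB <= theta]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [1 if rng.random() < theta else 0 for _ in range(n)]
        hits += wald_lcb(xs, alpha) <= theta
    return hits / reps


cov = coverage(theta=0.3, n=400, alpha=0.05, reps=2000)
print(f"estimated coverage of the nominal 95% lower bound: {cov:.3f}")
```

The estimate should be close to, but not exactly, $1-\alpha$: the bound is only asymptotically level $1-\alpha$, which is the sense in which (5.4.59) "can be quite inadequate" for small $n$ or $\theta$ near the boundary.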
Summary. We have defined asymptotic optimality for estimates in one-parameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of $\theta$ in a one-dimensional subfamily of the multinomial distributions, showed that the MLE formally achieves this bound, and made the latter result sharp in the context of one-parameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M estimates, generalizations of the MLE, along the lines of Huber (1967). The asymptotic formulae we derived are applied to the MLE both under the model that led to it and under an arbitrary $P$. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980).
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as $n \to \infty$ in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4, i.i.d. observations from a regular model in which $\Theta$ is open $\subset R$ or $\Theta = \{\theta_1,\dots,\theta_k\}$ finite, and $\theta$ is identifiable. Most of the questions we address and answer are under the assumption that $\boldsymbol\theta = \theta$, an arbitrary specified value, or in frequentist terms, that $\theta$ is true.
Consistency

The first natural question is whether the Bayes posterior distribution as $n \to \infty$ concentrates all mass more and more tightly around $\theta$. Intuitively this means that the data that are coming from $P_\theta$ eventually wipe out any prior belief that parameter values not close to $\theta$ are likely. Formalizing this statement about the posterior distribution, $\Pi(\cdot \mid X_1,\dots,X_n)$, which is a function-valued statistic, is somewhat subtle in general. But for $\Theta = \{\theta_1,\dots,\theta_k\}$ it is straightforward. Let
$$\pi(\theta \mid X_1,\dots,X_n) \equiv P[\boldsymbol\theta = \theta \mid X_1,\dots,X_n]. \tag{5.5.1}$$
Then we say that $\Pi(\cdot \mid X_1,\dots,X_n)$ is consistent iff for all $\theta \in \Theta$,
$$P_\theta[|\pi(\theta \mid X_1,\dots,X_n) - 1| \ge \epsilon] \to 0 \tag{5.5.2}$$
for all $\epsilon > 0$. There is a slightly stronger definition: $\Pi(\cdot \mid X_1,\dots,X_n)$ is a.s. consistent iff for all $\theta \in \Theta$,
$$\pi(\theta \mid X_1,\dots,X_n) \to 1 \text{ a.s. } P_\theta. \tag{5.5.3}$$
General a.s. consistency is not hard to formulate:
$$\Pi(\cdot \mid X_1,\dots,X_n) \Rightarrow \delta_{\{\theta\}} \text{ a.s. } P_\theta \tag{5.5.4}$$
where $\Rightarrow$ denotes convergence in law and $\delta_{\{\theta\}}$ is point mass at $\theta$. There is a completely satisfactory result for $\Theta$ finite.
Theorem 5.5.1. Let $\pi_j \equiv P[\boldsymbol\theta = \theta_j]$, $j = 1,\dots,k$ denote the prior distribution of $\boldsymbol\theta$. Then $\Pi(\cdot \mid X_1,\dots,X_n)$ is consistent (a.s. consistent) iff $\pi_j > 0$ for $j = 1,\dots,k$.
Proof. Let $p(\cdot,\theta)$ denote the frequency or density function of $X$. The necessity of the condition is immediate because $\pi_j = 0$ for some $j$ implies that $\pi(\theta_j \mid X_1,\dots,X_n) = 0$ for all $X_1,\dots,X_n$ because, by (1.2.8),
$$\pi(\theta_j \mid X_1,\dots,X_n) = P[\boldsymbol\theta = \theta_j \mid X_1,\dots,X_n] = \frac{\pi_j \prod_{i=1}^n p(X_i,\theta_j)}{\sum_{a=1}^k \pi_a \prod_{i=1}^n p(X_i,\theta_a)}. \tag{5.5.5}$$
Intuitively, no amount of data can convince a Bayesian who has decided a priori that $\theta_j$ is impossible. On the other hand, suppose all $\pi_j$ are positive. If the true $\theta$ is $\theta_j$ or equivalently $\boldsymbol\theta = \theta_j$, then
$$\log\frac{\pi(\theta_a \mid X_1,\dots,X_n)}{\pi(\theta_j \mid X_1,\dots,X_n)} = n\left(\frac1n\log\frac{\pi_a}{\pi_j} + \frac1n\sum_{i=1}^n \log\frac{p(X_i,\theta_a)}{p(X_i,\theta_j)}\right).$$
By the weak (respectively strong) LLN, under $P_{\theta_j}$,
$$\frac1n\sum_{i=1}^n \log\frac{p(X_i,\theta_a)}{p(X_i,\theta_j)} \to E_{\theta_j}\left(\log\frac{p(X_1,\theta_a)}{p(X_1,\theta_j)}\right)$$
in probability (respectively a.s.). But $E_{\theta_j}\left(\log\frac{p(X_1,\theta_a)}{p(X_1,\theta_j)}\right) < 0$, by Shannon's inequality, if $\theta_a \ne \theta_j$. Therefore,
$$\log\frac{\pi(\theta_a \mid X_1,\dots,X_n)}{\pi(\theta_j \mid X_1,\dots,X_n)} \to -\infty,$$
in the appropriate sense, and the theorem follows. $\Box$
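Both directions of Theorem 5.5.1 can be seen in a small simulation. The sketch below is only an illustration under assumed settings: a Bernoulli model with $\Theta = \{0.2, 0.5, 0.8\}$, true value $0.5$, and two hypothetical priors, one with all $\pi_j > 0$ and one with $\pi_j = 0$ at the true value. The posterior is computed from (5.5.5) on the log scale to avoid underflow.

```python
import math
import random


def posterior(xs, thetas, prior):
    """Posterior (5.5.5) over a finite set of Bernoulli parameters, computed via logs."""
    s = sum(xs)
    n = len(xs)
    logw = []
    for th, pj in zip(thetas, prior):
        if pj == 0.0:
            logw.append(-math.inf)            # zero prior mass stays zero
        else:
            logw.append(math.log(pj) + s * math.log(th) + (n - s) * math.log(1 - th))
    m = max(logw)
    w = [math.exp(v - m) for v in logw]       # exp(-inf) evaluates to 0.0
    total = sum(w)
    return [v / total for v in w]


rng = random.Random(2)
thetas = [0.2, 0.5, 0.8]
true_theta = 0.5
xs = [1 if rng.random() < true_theta else 0 for _ in range(500)]

post_positive = posterior(xs, thetas, prior=[0.6, 0.1, 0.3])   # all prior masses positive
post_excluded = posterior(xs, thetas, prior=[0.5, 0.0, 0.5])   # true value ruled out a priori
print("positive prior:", [round(p, 6) for p in post_positive])
print("excluded prior:", [round(p, 6) for p in post_excluded])
```

With a positive prior the posterior mass at the true value approaches 1 even though its prior mass is small; with zero prior mass it remains exactly 0 for every sample, as in the necessity part of the proof.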
Remark 5.5.1. We have proved more than is stated. Namely, that for each $\theta \in \Theta$, $P_\theta[\boldsymbol\theta \ne \theta \mid X_1,\dots,X_n] \to 0$ exponentially.
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution

Under conditions A0-A6 for $\rho(x,\theta) = l(x,\theta) \equiv \log p(x,\theta)$, we showed in Section 5.4 that if $\hat\theta$ is the MLE,
$$\mathcal L_\theta(\sqrt n(\hat\theta - \theta)) \to N(0, I^{-1}(\theta)). \tag{5.5.6}$$
Consider $\mathcal L(\sqrt n(\boldsymbol\theta - \hat\theta) \mid X_1,\dots,X_n)$, the posterior probability distribution of $\sqrt n(\boldsymbol\theta - \hat\theta(X_1,\dots,X_n))$, where we emphasize that $\hat\theta$ depends only on the data and is a constant given $X_1,\dots,X_n$. For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in $P_\theta$ probability by convergence a.s. $P_\theta$. We also add,
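A7: For all $\theta$, and all $\delta > 0$ there exists $\epsilon(\delta,\theta) > 0$ such that
$$P_\theta\left[\sup\left\{\frac1n\sum_{i=1}^n [l(X_i,\theta') - l(X_i,\theta)] : |\theta' - \theta| \ge \delta\right\} \le -\epsilon(\delta,\theta)\right] \to 1.$$

A8: The prior distribution has a density $\pi(\cdot)$ on $\Theta$ such that $\pi(\cdot)$ is continuous and positive at all $\theta$.

Remarkably,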
Theorem 5.5.2 ("Bernstein/von Mises"). If conditions A0-A3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold, then
$$\mathcal L(\sqrt n(\boldsymbol\theta - \hat\theta) \mid X_1,\dots,X_n) \to N(0, I^{-1}(\theta)) \tag{5.5.7}$$
a.s. under $P_\theta$ for all $\theta$.

We can rewrite (5.5.7) more usefully as
$$\sup_x \left|P[\sqrt n(\boldsymbol\theta - \hat\theta) \le x \mid X_1,\dots,X_n] - \Phi(x\sqrt{I(\theta)})\right| \to 0 \tag{5.5.8}$$
for all $\theta$ a.s. $P_\theta$ and, of course, the statement holds for our usual and weaker convergence in $P_\theta$ probability also. From this restatement we obtain the important corollary.

Corollary 5.5.1. Under the conditions of Theorem 5.5.2,
$$\sup_x \left|P[\sqrt n(\boldsymbol\theta - \hat\theta) \le x \mid X_1,\dots,X_n] - \Phi(x\sqrt{I(\hat\theta)})\right| \to 0 \tag{5.5.9}$$
a.s. $P_\theta$ for all $\theta$.
Remarks

(1) Statements (5.5.4) and (5.5.7)-(5.5.9) are, in fact, frequentist statements about the asymptotic behavior of certain function-valued statistics.

(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in $P_\theta$ probability if A4 and A5 are used rather than their strong forms; see Problem 5.5.7.

(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and identifiability guarantees consistency of $\hat\theta$ in a regular model.
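Proof. We compute the posterior density of $\sqrt n(\boldsymbol\theta - \hat\theta)$ as
$$q_n(t) = c_n^{-1}\,\pi\left(\hat\theta + \frac{t}{\sqrt n}\right)\prod_{i=1}^n p\left(X_i, \hat\theta + \frac{t}{\sqrt n}\right) \tag{5.5.10}$$
where $c_n = c_n(X_1,\dots,X_n)$ is given by
$$c_n(X_1,\dots,X_n) = \int_{-\infty}^{\infty} \pi\left(\hat\theta + \frac{s}{\sqrt n}\right)\prod_{i=1}^n p\left(X_i, \hat\theta + \frac{s}{\sqrt n}\right) ds.$$
Divide top and bottom of (5.5.10) by $\prod_{i=1}^n p(X_i,\hat\theta)$ to obtain
$$q_n(t) = d_n^{-1}\,\pi\left(\hat\theta + \frac{t}{\sqrt n}\right)\exp\left\{\sum_{i=1}^n \left[l\left(X_i, \hat\theta + \frac{t}{\sqrt n}\right) - l(X_i,\hat\theta)\right]\right\} \tag{5.5.11}$$
where $l(x,\theta) = \log p(x,\theta)$ and
$$d_n = \int_{-\infty}^{\infty} \pi\left(\hat\theta + \frac{s}{\sqrt n}\right)\exp\left\{\sum_{i=1}^n \left[l\left(X_i, \hat\theta + \frac{s}{\sqrt n}\right) - l(X_i,\hat\theta)\right]\right\} ds.$$
We claim that
$$P_\theta\left[d_n q_n(t) \to \pi(\theta)\exp\left\{-\frac{t^2 I(\theta)}{2}\right\} \text{ for all } t\right] = 1 \tag{5.5.12}$$
for all $\theta$. To establish this note that

(a) $\sup\left\{\left|\pi\left(\hat\theta + \frac{t}{\sqrt n}\right) - \pi(\theta)\right| : |t| \le M\right\} \to 0$ a.s. for all $M$ because $\hat\theta$ is a.s. consistent and $\pi$ is continuous.

(b) Expanding,
$$\sum_{i=1}^n \left[l\left(X_i, \hat\theta + \frac{t}{\sqrt n}\right) - l(X_i,\hat\theta)\right] = \frac{t^2}{2n}\sum_{i=1}^n \frac{\partial^2 l}{\partial\theta^2}(X_i, \theta^*(t)) \tag{5.5.13}$$
where $|\hat\theta - \theta^*(t)| \le \frac{|t|}{\sqrt n}$. We use $\sum_{i=1}^n \frac{\partial l}{\partial\theta}(X_i,\hat\theta) = 0$ here. By A4(a.s.), A5(a.s.),
$$\sup\left\{\left|\frac1n\sum_{i=1}^n \frac{\partial^2 l}{\partial\theta^2}(X_i,\theta^*(t)) - \frac1n\sum_{i=1}^n \frac{\partial^2 l}{\partial\theta^2}(X_i,\theta)\right| : |t| \le M\right\} \to 0,$$
for all $M$, a.s. $P_\theta$. Using (5.5.13), the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),
$$P_\theta\left[d_n q_n(t) \to \pi(\theta)\exp\left\{E_\theta\frac{\partial^2 l}{\partial\theta^2}(X_1,\theta)\,\frac{t^2}{2}\right\} \text{ for all } t\right] = 1. \tag{5.5.14}$$
Using A6 we obtain (5.5.12).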
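Now consider
$$\begin{aligned}
d_n &= \int_{-\infty}^{\infty} \pi\left(\hat\theta + \frac{s}{\sqrt n}\right)\exp\left\{\sum_{i=1}^n \left[l\left(X_i,\hat\theta + \frac{s}{\sqrt n}\right) - l(X_i,\hat\theta)\right]\right\} ds\\
&= \int_{|s| \le \delta\sqrt n} d_n q_n(s)\,ds + \sqrt n\int \pi(t)\exp\left\{\sum_{i=1}^n (l(X_i,t) - l(X_i,\hat\theta))\right\} 1(|t - \hat\theta| > \delta)\,dt.
\end{aligned} \tag{5.5.15}$$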
By AS and A7,
$$P_\theta\left[\sup\left\{\exp\left\{\sum_{i=1}^n (l(X_i,t) - l(X_i,\hat\theta))\right\} : |t - \hat\theta| > \delta\right\} \le e^{-n\epsilon(\delta,\theta)}\right] \to 1 \tag{5.5.16}$$
for all $\delta$ so that the second term in (5.5.15) is bounded by $\sqrt n\,e^{-n\epsilon(\delta,\theta)} \to 0$ a.s. $P_\theta$ for all $\delta > 0$. Finally note that (Problem 5.5.4) by arguing as for (5.5.14), there exists $\delta(\theta) > 0$ such that
$$P_\theta\left[d_n q_n(t) \le 2\pi(\theta)\exp\left\{\frac14 E_\theta\left(\frac{\partial^2 l}{\partial\theta^2}(X_1,\theta)\right)t^2\right\} \text{ for all } |t| \le \delta(\theta)\sqrt n\right] \to 1. \tag{5.5.17}$$
By (5.5.15) and (5.5.16), for all $\delta > 0$,
$$P_\theta\left[d_n - \int_{|s| \le \delta\sqrt n} d_n q_n(s)\,ds \to 0\right] = 1. \tag{5.5.18}$$
Finally, apply the dominated convergence theorem, Theorem B.7.5, to $d_n q_n(s)\,1(|s| \le \delta(\theta)\sqrt n)$, using (5.5.14) and (5.5.17) to conclude that, a.s. $P_\theta$,
$$d_n \to \pi(\theta)\int_{-\infty}^{\infty} \exp\left\{-\frac{s^2 I(\theta)}{2}\right\} ds = \frac{\pi(\theta)\sqrt{2\pi}}{\sqrt{I(\theta)}}. \tag{5.5.19}$$
Hence, a.s. $P_\theta$,
$$q_n(t) \to \sqrt{I(\theta)}\,\varphi(t\sqrt{I(\theta)})$$
where $\varphi$ is the standard Gaussian density and the theorem follows from Scheffé's Theorem B.7.6 and Proposition B.7.2. $\Box$

Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a $N(\theta,\sigma^2)$ distribution with $\sigma^2$ known and we put a $N(\eta,\tau^2)$ prior on $\theta$. Then the posterior
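distribution of $\boldsymbol\theta$ is $N\left(w_{1n}\eta + w_{2n}\bar X, \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1}\right)$ where
$$w_{1n} = \frac{\sigma^2}{n\tau^2 + \sigma^2}, \quad w_{2n} = 1 - w_{1n}. \tag{5.5.20}$$
Evidently, as $n \to \infty$, $w_{1n} \to 0$, $\bar X \to \theta$, a.s., if $\boldsymbol\theta = \theta$, and $\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1} \to 0$. That is, the posterior distribution has mean approximately $\theta$ and variance approximately 0, for $n$ large, or equivalently the posterior is close to point mass at $\theta$ as we expect from Theorem 5.5.1. Because $\hat\theta = \bar X$, $\sqrt n(\boldsymbol\theta - \hat\theta)$ has posterior distribution
$$N\left(\sqrt n\,w_{1n}(\eta - \bar X),\; n\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1}\right).$$
Now, $\sqrt n\,w_{1n} = O(n^{-1/2}) = o(1)$ and $n\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1} \to \sigma^2 = I^{-1}(\theta)$ and we have directly established the conclusion of Theorem 5.5.2. $\Box$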
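The two limits used in Example 5.5.1 are easy to tabulate. The sketch below evaluates the closed-form posterior of (5.5.20) for increasing $n$; the numerical values of $\sigma^2$, $\eta$, $\tau^2$, and $\bar X$ are arbitrary choices for illustration, and the function name is hypothetical.

```python
import math


def normal_posterior(xbar, n, sigma2, eta, tau2):
    """Posterior mean and variance of theta for N(theta, sigma2) data and a N(eta, tau2) prior."""
    w1 = sigma2 / (n * tau2 + sigma2)         # weight on the prior mean, as in (5.5.20)
    w2 = 1.0 - w1
    mean = w1 * eta + w2 * xbar
    var = 1.0 / (n / sigma2 + 1.0 / tau2)
    return mean, var


sigma2, eta, tau2, xbar = 4.0, 10.0, 1.0, 0.3
for n in (10, 1000, 100000):
    mean, var = normal_posterior(xbar, n, sigma2, eta, tau2)
    shift = math.sqrt(n) * (mean - xbar)      # posterior mean of sqrt(n) * (theta - xbar)
    print(n, round(shift, 4), round(n * var, 4))
```

The second printed column should shrink toward 0 and the third should approach $\sigma^2 = I^{-1}(\theta)$, even with a prior mean placed far from $\bar X$.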
Example 5.5.2. Posterior Behavior in the Binomial-Beta Model. (Example 3.2.3 continued). If we observe $S_n$ with a binomial, $B(n,\theta)$, distribution, or equivalently we observe $X_1,\dots,X_n$ i.i.d. Bernoulli $(1,\theta)$ and put a beta, $\beta(r,s)$ prior on $\theta$, then, as in Example 3.2.3, $\boldsymbol\theta$ has posterior $\beta(S_n + r, n + s - S_n)$. We have shown in Problem 5.3.20 that if $U_{a,b}$ has a $\beta(a,b)$ distribution, then as $a \to \infty$, $b \to \infty$,
$$\left[\frac{(a+b)^3}{ab}\right]^{1/2}\left(U_{a,b} - \frac{a}{a+b}\right) \xrightarrow{\mathcal L} N(0,1). \tag{5.5.21}$$
If $0 < \theta < 1$ is true, $S_n/n \to \theta$ a.s. so that $S_n + r \to \infty$, $n + s - S_n \to \infty$ a.s. $P_\theta$. By identifying $a$ with $S_n + r$ and $b$ with $n + s - S_n$ we conclude after some algebra that because $\hat\theta = \bar X$,
$$\sqrt n(\boldsymbol\theta - \bar X) \xrightarrow{\mathcal L} N(0, \theta(1-\theta))$$
a.s. $P_\theta$, as claimed by Theorem 5.5.2. $\Box$
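The "some algebra" in Example 5.5.2 can be checked at the level of the first two posterior moments. The sketch below uses the exact mean and variance of the $\beta(a,b)$ distribution with $a = S_n + r$ and $b = n + s - S_n$; the prior parameters and the data ($\bar X$ fixed at $0.3$) are hypothetical, and matching two moments is of course weaker than the convergence in law that the example asserts.

```python
import math


def scaled_beta_posterior(s_n, n, r, s):
    """Mean and variance of sqrt(n) * (theta - xbar) under the beta(s_n + r, n + s - s_n) posterior."""
    a, b = s_n + r, n + s - s_n
    xbar = s_n / n
    post_mean = a / (a + b)
    post_var = a * b / ((a + b) ** 2 * (a + b + 1))
    return math.sqrt(n) * (post_mean - xbar), n * post_var


r, s = 2.0, 5.0                               # hypothetical beta prior parameters
for n in (20, 2000, 200000):
    s_n = round(0.3 * n)                      # hypothetical data with xbar = 0.3
    shift, var = scaled_beta_posterior(s_n, n, r, s)
    print(n, round(shift, 4), round(var, 4), round(0.3 * 0.7, 4))
```

As $n$ grows the centering term should vanish and the scaled variance should approach $\bar X(1 - \bar X)$, in line with the $N(0, \theta(1-\theta))$ limit.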
Bayesian optimality of optimal frequentist procedures and frequentist optimality of Bayesian procedures

Theorem 5.5.2 has two surprising consequences.

(a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.

(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions.

As an example of this phenomenon consider the following.
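Theorem 5.5.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let $\hat\theta$ be the MLE of $\theta$ and let $\theta^*$ be the median of the posterior distribution of $\boldsymbol\theta$. Then

(i)
$$\sqrt n(\theta^* - \hat\theta) \to 0 \tag{5.5.22}$$
a.s. $P_\theta$ for all $\theta$. Consequently,
$$\theta^* = \theta + \frac1n\sum_{i=1}^n I^{-1}(\theta)\frac{\partial l}{\partial\theta}(X_i,\theta) + o_{P_\theta}(n^{-1/2}) \tag{5.5.23}$$
and
$$\mathcal L_\theta(\sqrt n(\theta^* - \theta)) \to N(0, I^{-1}(\theta)). \tag{5.5.24}$$

(ii)
$$E(\sqrt n(|\boldsymbol\theta - \hat\theta| - |\boldsymbol\theta|) \mid X_1,\dots,X_n) = \min_d E(\sqrt n(|\boldsymbol\theta - d| - |\boldsymbol\theta|) \mid X_1,\dots,X_n) + o_P(1). \tag{5.5.25}$$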
Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions $l_n(\theta,d) = \sqrt n(|\theta - d| - |\theta|)$. But the Bayes estimates for $l_n$ and for $l(\theta,d) = |\theta - d|$ must agree whenever $E(|\boldsymbol\theta| \mid X_1,\dots,X_n) < \infty$. (Note that if $E(|\boldsymbol\theta| \mid X_1,\dots,X_n) = \infty$, then the posterior Bayes risk under $l$ is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.14.22)
$$\sup_x \left|P[\sqrt n(\boldsymbol\theta - \hat\theta) \le x \mid X_1,\dots,X_n] - \Phi(x\sqrt{I(\theta)})\right| \to 0 \text{ a.s. } P_\theta. \tag{5.5.26}$$
But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.11). Thus, any median of the posterior distribution of $\sqrt n(\boldsymbol\theta - \hat\theta)$ tends to 0, the median of $N(0, I^{-1}(\theta))$, a.s. $P_\theta$. But the median of the posterior of $\sqrt n(\boldsymbol\theta - \hat\theta)$ is $\sqrt n(\theta^* - \hat\theta)$, and (5.5.22) follows. To prove (5.5.24) note that
$$\left||\boldsymbol\theta - \hat\theta| - |\boldsymbol\theta - \theta^*|\right| \le |\hat\theta - \theta^*|$$
and, hence, that
$$E(\sqrt n(|\boldsymbol\theta - \hat\theta| - |\boldsymbol\theta - \theta^*|) \mid X_1,\dots,X_n) \le \sqrt n|\hat\theta - \theta^*| \to 0 \tag{5.5.27}$$
a.s. $P_\theta$, for all $\theta$. Because a.s. convergence $P_\theta$ for all $\theta$ implies a.s. convergence $P$ (B.?), claim (5.5.24) follows and, hence,
$$E(\sqrt n(|\boldsymbol\theta - \hat\theta| - |\boldsymbol\theta|) \mid X_1,\dots,X_n) = E(\sqrt n(|\boldsymbol\theta - \theta^*| - |\boldsymbol\theta|) \mid X_1,\dots,X_n) + o_P(1). \tag{5.5.28}$$
Because by Problem 1.4.7 and Proposition 3.2.1, $\theta^*$ is the Bayes estimate for $l_n(\theta,d)$, (5.5.25) and the theorem follows. $\Box$

Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
Bayes credible regions

There is another result illustrating that the frequentist inferential procedures based on $\hat\theta$ agree with Bayesian procedures to first order.

Theorem 5.5.4. Suppose the conditions of Theorem 5.5.2 are satisfied. Let
$$C_n(X_1,\dots,X_n) = \{\theta : \pi(\theta \mid X_1,\dots,X_n) \ge c_n\},$$
where $c_n$ is chosen so that $\pi(C_n \mid X_1,\dots,X_n) = 1 - \alpha$, be the Bayes credible region defined in Section 4.7. Let $I_n(\gamma)$ be the asymptotically level $1 - \gamma$ optimal interval based on $\hat\theta$, given by
$$I_n(\gamma) = [\hat\theta - d_n(\gamma), \hat\theta + d_n(\gamma)]$$
where $d_n(\gamma) = z\left(1 - \frac\gamma2\right)\Big/\sqrt{nI(\hat\theta)}$. Then, for every $\epsilon > 0$, $\theta$,
$$P_\theta[I_n(\alpha + \epsilon) \subset C_n(X_1,\dots,X_n) \subset I_n(\alpha - \epsilon)] \to 1. \tag{5.5.29}$$

The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of $\sqrt n(\boldsymbol\theta - \hat\theta)$ converges to the $N(0, I^{-1}(\theta))$ density uniformly over compact neighborhoods of $\theta$ for each fixed $\theta$, is sketched in Problem 5.5.6. The message of the theorem should be clear. Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis both in this case and in estimation reveals that any approximations to Bayes procedures on a scale finer than $n^{-1/2}$ do involve the prior. A particular choice, the Jeffreys prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher $n^{-1}$ order (see Schervisch, 1995).
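First-order agreement of the two kinds of interval can be seen in a case where both are available in closed form. The sketch below assumes the normal translation model of Example 5.5.1, where the posterior is normal and the highest posterior density region is therefore the symmetric interval about the posterior mean; the parameter values and function name are hypothetical.

```python
import math
from statistics import NormalDist


def intervals(xbar, n, sigma2, eta, tau2, alpha):
    """Level 1 - alpha frequentist interval and Bayes credible interval for a normal mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Frequentist interval based on the MLE xbar, using I(theta) = 1 / sigma2.
    half_f = z * math.sqrt(sigma2 / n)
    freq = (xbar - half_f, xbar + half_f)
    # The posterior is normal, so the credible region is symmetric about the posterior mean.
    w1 = sigma2 / (n * tau2 + sigma2)
    mean = w1 * eta + (1 - w1) * xbar
    half_b = z * math.sqrt(1.0 / (n / sigma2 + 1.0 / tau2))
    bayes = (mean - half_b, mean + half_b)
    return freq, bayes


def scaled_gap(n):
    """sqrt(n) times the largest endpoint discrepancy between the two intervals."""
    freq, bayes = intervals(xbar=0.3, n=n, sigma2=4.0, eta=2.0, tau2=1.0, alpha=0.05)
    gap = max(abs(freq[0] - bayes[0]), abs(freq[1] - bayes[1]))
    return math.sqrt(n) * gap


for n in (10, 1000, 100000):
    print(n, round(scaled_gap(n), 5))
```

Both intervals have length of order $n^{-1/2}$, so the relevant comparison is the endpoint discrepancy multiplied by $\sqrt n$; that this quantity tends to 0 is the closed-form counterpart of the sandwich in (5.5.29).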
Testing

Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of $\theta_0$ given $X_1,\dots,X_n$ if $H$ is false is of a different magnitude than the p-value for the same data. For more on this so-called Lindley paradox see Berger (1985) and Schervisch (1995). However, if instead of considering hypotheses specifying one point $\theta_0$ we consider indifference regions where $H$ specifies $[\theta_0, \theta_0 + \Delta)$ or $(\theta_0 - \Delta, \theta_0 + \Delta)$, then Bayes and frequentist testing procedures agree in the limit. See Problem 5.5.2.

Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a priori possible. Second, we established the so-called Bernstein-von Mises theorem actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the so-called Bernstein-von Mises theorem and frequentist confidence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose $X_1,\dots,X_n$ are i.i.d. as $X \sim F$, where $F$ has median $F^{-1}(\frac12)$ and a continuous case density.

(a) Show that, if $n = 2k + 1$,
$$E_F\,\mathrm{med}(X_1,\dots,X_n) = n\binom{2k}{k}\int_0^1 F^{-1}(t)\,t^k(1-t)^k\,dt,$$
$$E_F\,\mathrm{med}^2(X_1,\dots,X_n) = n\binom{2k}{k}\int_0^1 [F^{-1}(t)]^2\,t^k(1-t)^k\,dt.$$

(b) Suppose $F$ is uniform, $U(0,1)$. Find the MSE of the sample median for $n = 1$, 3, and 5.
2. Suppose $Z \sim N(\mu,1)$ and $V$ is independent of $Z$ with distribution $\chi_m^2$. Then $T \equiv Z\big/\left(\frac{V}{m}\right)^{1/2}$ is said to have a noncentral $t$ distribution with noncentrality $\mu$ and $m$ degrees of freedom. See Section 4.9.2.

(a) Show that
$$P[T \le t] = 2m\int_0^\infty \Phi(tw - \mu)\,f_m(mw^2)\,w\,dw$$
where $f_m(w)$ is the $\chi_m^2$ density, and $\Phi$ is the normal distribution function.

(b) If $X_1,\dots,X_n$ are i.i.d. $N(\mu,\sigma^2)$ show that $\sqrt n\,\bar X\big/\left(\frac{1}{n-1}\sum(X_i - \bar X)^2\right)^{1/2}$ has a noncentral $t$ distribution with noncentrality parameter $\sqrt n\,\mu/\sigma$ and $n - 1$ degrees of freedom.

(c) Show that $T^2$ in (a) has a noncentral $\mathcal F_{1,m}$ distribution with noncentrality parameter $\mu^2$. Deduce that the density of $T$ is
p(t)
= 2L
i=O
00
P[R = iJ . hi+,(f)[<p(t 1')1(t > 0)
+ <p(t + 1')1(t < 0)1
where $R$ is given in Problem B.3.12.
Hint: Condition on $|T|$.

3. Show that if $P[|X| \le 1] = 1$, then $\mathrm{Var}(X) \le 1$ with equality iff $X = \pm 1$ with probability $\frac12$.
Hint: $\mathrm{Var}(X) \le EX^2$.
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of $n$ and $\epsilon$ through $\sqrt n\,\epsilon$.

(a) Show that the ratio of the Hoeffding function $h(\sqrt n\,\epsilon)$ to the Chebychev function $c(\sqrt n\,\epsilon)$ tends to 0 as $\sqrt n\,\epsilon \to \infty$ so that $h(\cdot)$ is arbitrarily better than $c(\cdot)$ in the tails.

(b) Show that the normal approximation $2\Phi\left(\frac{\sqrt n\,\epsilon}{\sigma}\right) - 1$ gives lower results than $h$ in the tails if $P[|X| \le 1] = 1$ because, if $\sigma^2 \le 1$, $1 - \Phi(t) \sim \varphi(t)/t$ as $t \to \infty$.

Note: Hoeffding (1963) exhibits better bounds for known $\sigma^2$.
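5. Suppose $\lambda : R \to R$ has $\lambda(0) = 0$, is bounded, and has a bounded second derivative $\lambda''$. Show that if $X_1,\dots,X_n$ are i.i.d., $EX_1 = \mu$ and $\mathrm{Var}\,X_1 = \sigma^2 < \infty$, then
$$E\lambda(|\bar X - \mu|) = \lambda'(0)\frac{\sigma}{\sqrt n}\sqrt{\frac2\pi} + o\left(\frac{1}{\sqrt n}\right) \text{ as } n \to \infty.$$
Hint: $\sqrt n\,E(\lambda(|\bar X - \mu|) - \lambda(0)) = E\lambda'(0)\sqrt n\,|\bar X - \mu| + E\left(\frac{\sqrt n}{2}\lambda''(\tilde X - \mu)(\bar X - \mu)^2\right)$ where $|\tilde X - \mu| \le |\bar X - \mu|$. The last term is $\le \sup_x|\lambda''(x)|\,\sigma^2/(2\sqrt n)$ and the first tends to $\lambda'(0)\,\sigma\int_{-\infty}^{\infty}|z|\varphi(z)\,dz$ by Remark B.7.1(2).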
1. Using the notation of Theorern 5.2.1, show that
"
2. Let X" ... ,Xn be ij,d. N(I',,,2), Show that for all n
Sup p(•• u) [IX
u
>
1, all €
>
°
/,1 > ,] ~ 1.
Hint: Let (J
)
00.
•
3, Establish (5.2.5). Hint: Iiln  q(p)1
> € =} IPn  pi > w (€).
l
J
4. Let (Ui , V;), 1 < i < n, be i.i.d.  PEP.
(a) Let y(P)
= PIU, > 0, V, > OJ. Show that if P = N(O, 0,1,1, p), then
p
~ sin21l' (Y(P)  ~).
(eo)} < 00.Xn are i. n Lt= ".e)1 : e' E 5'(e. 6. Hint: K can be taken as [A. N (/J. . i (e))} > <.p(X. by the basic property of maximum contrast estimates. for each e eo.o)} =0 where S( 0) is the 0 ball about Therefore. inf{p(X. .2. Suppose Xl.sup{lp(X. eol > A} > 0 for some A < 00.14)(i) add (ii) suffice for consistency.lO)2)) ..14)(i).\} . where A is an arbitrary positive and finite constant. eo) : Ie  : Ie . e) is continuous. 5_0  lim Eo.Section 5. and < > 0 there is o{0) > 0 such that e.eol > .n 1 (XiJ. 5.e'l < .p(X.e')p(X.2. A].eol > . {p(X" 0) .(ii) holds. e E Rand (i) For some «eo) >0 Eo.2 rII.•. 05) where ao is known. Or of sphere centers such that K Now inf n {e: Ie . inf{p(X.\} c U s(ejJ(e j=l T J) {~ t n . Show that the maximum contrast estimate 8 is consistent. (a) Show that condition (5. Hint: From continuity of p. . e) . "..~ .8) fails even in this simplest case in which X ~ /J is clear. (c) Suppose p{P) is defined generally as Covp (U.p(Xi . Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent. Eo. e)1 (ii) Eo.lJ. (Ii) Show that condition (5. (Wald) Suppose e ~ p( X. t e. Hint: sup 1. By compactness there is a finite number 8 1 . e') .p(X. (1 (J. (i). VarpUVarpV > O}.2. then is a consistent estimate of p. eo)} : e E K n {I : 1 IB . . e') . Prove that (5. .d. eo) : e' E S(O. and the dominated convergence theorem.. V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00.i. sup{lp(X.l)2 _ + tTl = o:J. 7.6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution.
l m < Cm k=I n."d' k=l d < mm+1 L d ElY. Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K. < Mjn~ E (. Let X~ be i. . ..3. For r fixed apply the law of large numbers. .L..3.. (ii) If ti are ij. Extend the result of Problem 7 to the case () E RP.. i .xW) < I'lj· 3. en are constants. 8.X. . Problems for Section 5.X. Hint: See part (a) of the proof of Lemma 5. and if CI. for some constants M j .3) for j odd as follows: . P > 1. i ..) 2] . Then EIX . . 4. then by Jensen's inequality. Show that the log likelihood tends to 00 as a + 0 and the condition fails.3..I'lj < EIX .X~ are i.Xn but independent of them. II I .X'i j .i.[.. 1 + . li .348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi.. . 1/(J.d. = 1.d. . 9.1'... in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x.. and take the values ±1 with probability ~.k / 2 . Establish Theorem 5.11). with the same distribution as Xl.1.1. + id = m E II (Y. Hint: Taylor expand and note that if i .. 2. I 10.3. The condition of Problem 7(ii) can also fail. L:~ 1(Xi .. < < 17 < 1/<. and let X' = n1EX.2.)I' < MjE [L:~ 1 (Xi . . L:!Xi . . .3.)2]' < t Mjn~ E [.d. Compact sets J( can be taken of the form {II'I < A.O(BJll}. < > OJ. Establish (5. ~ . Establish (5. n.3 I.9) in the exponential model of Example 5.. . . . J (iii) Condition on IXi  X. (72).O'l n i=l p(X"Oo)}: B' E S(Bj. .i.3. Establish (5. • • (i) Suppose Xf.
1 and 'In = n2 . I < liLl}P(IXd < liLl)· 7.. ~ iLli < 2 i EIX.d. s~ ~ (n2 .l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck./LI = E(Xt}.\ > 0 and = 1 + JI«k+m) ZIa. then (sVaf)/(s~/a~) has an .ad)]m < m L aj j=1 m <mmILaj. ..d. PH(st! s~ < Ck.. )=1 5. j > 2. 1 < j < m. Establish 5. L~=1 i j = m then m a~l. j = 1.1)'2:7' .r'j".Xn1 bei. (a) Show that if F and G are N(iL"af) and N(iL2.\k for some . G. Show that ~ sup{ IE( X.Tn) + 1 .a as k t 00.m distribution with k = nl . ~ xl".d. LetX1"".c> .6 Problems and Complements 349 Suppose ad > 0. 0'1 = Var(Xd· Xl /LI Ck.i. ~ iLl' I IXli > liLl}P(lXd > 11. 2 = Var[ ( ) /0'1 ]2 . Show that if EIXlii < 00. Show that if m = . . Show that under the assumptions of part (c)..3..m 00. .. P(sll s~ < ~ 1~ a as k ~ 00. .Fk.) I : i" .andsupposetheX'sandY's are independent.. if a < EXr < 00. 1)'2:7' .(X. Let XI. 8. km K. (b) Show that when F and G are nonnal as in part (a).). X..pli ~ E{IX.m with K.28.aD."" X n be i. then EIX.I) Hint: By the iterated expectation theorem EIX. (d) Let Ck. under H : Var(Xtl ~ Var(Y. 6.(l'i .m) Then. R valued with EX 1 = O. respectively. . .a~d < [max(al"" .Section 5. theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~.1) +E{IX.i.. n} EIX.. . FandYi""'Yn2 bei. replaced by its method of moments estimate. .m be Ck. where s1 = (n. ~ iLli I IX.I.i. i.
6. from {t I. ~ ] . • . Show that under the assumptions of part (e). (UN.2" ( 11. ..N.i.I))T has the same asymptotic distribution as n~ [n.m with KI and K2 replaced by their method of moment estimates. (cj Show that if p = 0.+.00. to estimate Ii 1 < j < n. that is.XN} or {(uI. (I _ P')')..t . JLl = JL2 = 0.XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters.1'2)1"2)' such that PH(SI!S~ < Qk.l EY? .p) ~ N(O. (b) Use the delta method for the multivariate case and note (b opt  b)(U .iL) • . then jn(r . Without loss of generality. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3. jn((C .. .lIl) + 1 . (iJi .p') ~ N(O. ! .Ii > 0.I).350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g.p) ~ N(O. Et:i = 0. + I+p" ' ) I I I II.(I_A)a2).m) + 1 . 7 (1 . = = 1.2< 7'.p).. i = 1. 4p' (I . . Tin' which we have sampled at random ~ E~ 1 Xi. = ("t . .6.) =a 2 < 00 and Var(U) Hint: (a) X .m (depending on ~'l ~ Var[ (X I . . show that i' ~ .4.. Y) ~ N(I'I. iik. "1 = "2 = I.A») where 7 2 = Var(XIl. . (b) If (X.. n. .d.l EXiYi . suppose i j = j.xd. . Var(€.Il2. TN i._ J . p). and if EX. use a normal approximation to tind an approximate critical value qk.\ 00.. . Without loss of generality. = (1.".1.+xn n when T. if ~ then .. . (a) jn(X . if 0 < EX~ < 00 and 0 < Eyl8 < 00. • op(n')... Po.. Under the assumptions of part (c). err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5..1 that we use Til. then jn(r' . 1).XC) wHere XC = N~n Et n+l Xi.\ < 1. Consider as estimates e = ') (I X = x. . In particular. jn(r . . suppose in the context of Example 3... Hint: Use the centra] limit theorem and Slutsky's theorem.~) (X . . i = I. (iJl.1. if p i' 0. • 10." •.xd. ' . Show that jn(XRx) ~N(0. t N }. .a as k . .p. In survey sampling the modelbased approach postulates that the population {Xl.m (0 Let 9. 0 < . Instead assume that 0 < Var(Y?) < x.1 · /. 
In Example 5. < 00 (in the supennodel).3. . .4.I. ...i.. as N 2 t Show that. then P(sUs~ < qk. () E such that T i = ti where ti = (ui.a: as k + 00. .1 EX. Wnte (l 1pP . . (a) If 1'1 = 1'2 ~ 0.p')') and. N where the t:i are i. . be qk. . H En. i (b) Suppose Po is such that X I = bUi+t:i. there exists T I . n.3. "i.".1JT.x) ~ N(O.d.I' ill "rI' and "2 = Var[( X. ." .
Show that IE(Y. each with HardyWeinberg frequency function f given by .90.... The following approximation to the distribution of Sn (due to Wilson and Hilferty.i'b)(Yc . X n are independent.99.i'a)(Yb .n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO.25.12)..6) to explain why. Here x q denotes the qth quantile of the distribution. n = 5. X~. X.: . 15.3' Justify fonnally j1" variance (72... E(WV) < 00. then E(UVW) = O.3. X n is a sample from a population with mean third central moment j1. and Hint: Use (5. (a) Suppose that Ely.i'c) I < Mn'. Suppose X I I • • • l X n is a sample from a peA) distrihution. Let Sn have a X~ distribution. Normalizing Transformation for the Poisson Distribution.E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d.10.. . . (a) Show that if n is large. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x . X = XO.6 Problems and Complements 351 12. .. (a) Show that the only transformations h that make E[h(X) . EU 13. 16. Suppose XI.3. It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes. is known as Fisher's approximation. (b) Let Sn . Suppose X 1l .14 to explain the numerical results of Problem 5. 1931) is found to be excellent Use (5.3...13(c).3.14)... (a) Use this fact and Problem 5.s. = 0. (h) Use (a) to justify the approximation 17. Hint: If U is independent of (V.. . !) distribution.y'ri has approximately aN (0. This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl). W).Section 5. (h) Deduce fonnula (5.3 < 00. 14. .3.
Var(X) = 171. ... V)) '" . 1'2)(Y  + O( n '). if m/(m + n) . where I' = (a) Find an approximation to P[X < E(X. Y) .y) = a ayh(x.)? 18. and h'(t) > for all t. 20. Let X I. Y) ~ p!7IC72.n P . which are integers. (b) Find an approximation to P[JX' < t] in terms of 0 and t. Yn are < < . 19. (1'1. 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t. . 1'2) = h. .a) Hmt: Use Bm. Cov(X. Y) where (X" Yi). + (mX/nY)] where Xl.y). Show that ifm and n are both tending to oc in such a way that m/(m + n) > a.. Var(Y) = C7~. then I m a(l.n tends to zero at the rate I/(m + n)2. . ° . Let then have a beta distribution with parameters m and n. Yn) is a sample from a bivariate population with E(X) ~ 1'1. where I I h..n  m/(m + n») < x] 1 > 'li(x). .. 1'2)h2(1'1.y) = a axh(x.. Var Bmnm + n +Rmn = 'm+n ' ' where Rm. Y" . 0< a < I.5 that under the conditions of the previous problem. 1'2)pC7.y). (Xn . X n be the indicators of n binomial trials with probability of success B. Show directly using Problem B. h(l) = I. i 21.Xm . is given by h(t) = (2/1r) sin' (Yt). (a) (b) Var(h(X.. 1'2)]2C7n + O(n 2) i .1') + X 2 . (e) What is the approximate distribution of Vr'( X . 172 + [h 2(1'l.h(l'l. h 2 (x.1'2)]2 177 n +2h. Bm. (1'1. I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1..I'll + h2(1'1.• • • ~{[hl (1".. Variance Stabilizing Transfo111U1tion for the Binomial Distribution. E(Y) = 1'2. v'a(l. [vm +  n (Bm.n = (mX/nY)i1 independent standard eXJX>nentials. . 1'2)(X .2.a) E(Bmn ) = . • j b _ . 1'2) Hint: h(X.a tends to zero at the rate I/(m + n)2. Justify fonnally the following expressions for the moments of h(X. Show that the only variance stabilizing transformation h such that h(O) = 0.(x. .
= ~.. _.l) = o. Show that Eh(X) 3 h(J1. (a) When J1.J1.l = ~.2)/24n2 Hint: Therefore. Let Xl>"" X n be a sample Irom a population with and let T = X2 be an estimate of 112 .X2 .(1.h(J1.J1. < t) = p(Vi < (b) When J1. ° while n[h(X h(J1.J1.4 to give a directjustification of where R n / yin 0 as in n Recall Stirling's approximation: + + 00.2. xi. 24. (e) Fiud the limiting laws of y'n(X .6 Problems and Complements 353 22. find the asymptotic distribution of y'n(T. Suppose that Xl. and that h(1)(J.)2 and n(X . Let Sn "' X~.2V with V "' Give an approximation to the distribution of X (1 .) + ": + Rn where IRnl < M 1(J1. XI. xi 25.). find the asymptotic distrintion of nT nsing P(nT y'nX < Vi).J1.2) using the delta method. 1 X n be a sample from a population with mean J.l4 is finite. k > I. X n is a sample from a population and that h is a realvalued function 01 X whose derivatives of order k are denoted by h(k). ()'2 = Var(X) < 00. . Suppose IM41 (x)1 < M for all x and some constant M and suppose that J. n[X(I .X) in tenns of the distribution function when J..J1..)2. = E(X) "f 0. (It may be shown but is not required that [y'nR" I is bounded...)] !:. . . . Let Xl. J1.)1J1. (a) Show that y'n[h(X) . = 0.) 23..)] is asymptotically distributed (b) Use part (a) to show that when J1. Compare your answer to the answer in part (a).2V where V ~ 00.Section 5.l and variance (J2 < Suppose h has a second derivative h<2) continuous at IJ.4 + 3o.)] ~ as ~h(2)(J1.) + ~h(2)(J1.3 1/6n2 + M(J1. Usc Stirling's approximation and Problem 8.
3) and the intervals based on the pivot ID .9. . I i t I . and SD are as defined in Section 4. (d) Make a comparison of the asymptotic length of (4. (c) Show that if IInl is the length of the interval In. Suppose (Xl. Viillnl ~ 2vul + a~z(l ./". then limn PIt. .3) has asymptotic probability of coverage < 1 .\.3. .a. Suppose nl + 00 . Suppose that instead of using the onesample t intervals based on the differences Vi .o.\ < 1. 0 < ..\ul + (1 ~ or al . .3. . 'I :I ! t.9.1/ S D where D and S D are as in Section 4.354 26. .a..) (b) Deduce that if. .\)al + . . (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment.3) have correct asymptotic (c) Show that if a~ > al and . 2pawz)z(1  > 0 and In is given by (4.9.2 . b l . 1 X n1 and Y1 . cyi . Hint: (a). n2 + 00. 1 Xn.9. We want to obtain confidence intervals on J1.') < 00. . the intervals (4.\ > 1 . Let n _ 00 and (a) Show that P[T(t. = = az.9.JTi times the length of the onesample t interval based on the differences. /"' = E(YIl. .\)a'4)/(1 .4.4. (c) Apply Slutsky's theorem. . whereas the situation is reversed if the sample size inequalities and variance inequalities agree.\.3.(~':~fJ'). We want to study the behavior of the twosample pivot T(Li.Jli = ~. and a~ = Var(Y1 ). . 28. Hint: Use (5.3). al = Var(XIl. Assume that the observations have a common N(jtl. if n}.(j ' a ) N(O..d.9.9...) < (b) Deduce that if p tl ~ <P (t [1. E In] > 1 . 27.)/SD where D.33) and Theorem 5.3 independent samples with III = E(XIl. . a~.2a 4 ).4. Show that if Xl" . Let T = (D . XII are ij.9.}. (a) Show that P[T(t. Suppose Xl. " Yn as separate samples and use the twosample t intervals (4.\ probability of coverage. . .3). so that ntln ~ ...t.t. Eo) where Eo = diag(a'.. I . . p) distribution. Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X . N(Pl (}2). Y1 .I I (a) Show that T has asymptotically a standard Donnal distribution as and I I . Yn2 are as in Section 4.) of Example 4. What happens? 
Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution.t2. 29. n2 + 00.. < tJ ~ <P(t[(. .\a'4)I'). . that E(Xt) < 00 and E(Y. Yd.9. the interval (4.~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of .•.Xi we treat Xl.
then P(Vn < va<) t a as 11 t 00. (). 33. Show that as n t 00.a) is the critical value using the Welch approximation.3. (a) SupposeE(X 4 ) < 00.d... Plot your results. .. I< = Var[(X 1')/()'jZ. .9 and 4.z = Var(X).i..3 using the Welch test based on T rather than the twosample t test based on Sn.3 by showing that if Y 1. Show that k ~ ex:: as 111 ~ 00 and 112 t 00..£.8Z = (n . .4.1)1 L.3." . .£.Z). T and where X is unifonn.Section 5.d. X. j = I._I' Find approximations to P(Yn and P{Vn < XnI (I .9.I) + y'i«n . 30. Hint: See Problems B. U( 1.4. Y nERd are Li.3..p. . then lim infn Var(Tn ) > Var(T). and let Va ~ (n . Tn .1..a).X)" Then by Theorem B. where I. (a) Show that the MLEs of I'i and (). Carry out a Monte Carlo study such as the one that led to Figure 5.16.. (d) Find or write a computer program that carries out the Welch test. X n be i. .z has a X~I distribution when F is theN(fJl (72) distribution.I)z(a). where tk(1.z are Iii = k. < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X. Show that if 0 < EX 8 < 00. Let .1).xdf and Ixl is Euclidean distance. 32. then for all integers k: where C depends on d. vectors and EIY 1 1k < 00. . EIY. Let X ij (i = I. Vn = (n . Hint: If Ixll ~ L.I k and k only.a)) and evaluate the approximations when F is 7. then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31. k) be independent with Xi} ~ N(l'i. Let XI.~ t (Xi . (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l .I)sz/(). (). . . but Var(Tn ) + 00. It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist.1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" . (c) Let R be the method of moment estimaie of K. as X ~ F and let I' = E(X). x = (XI.6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4. Generalize Lemma 5.3. I is the Euclidean norm. has asymptotic level a.:~l IXjl..
n(X .0) !:. Ep. .d.f... . ..0) ~ N Hint: P( . I ! 1 i . (c) Give a consistent estimate of (]'2. ! Problems for Section 5. Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique. .0) and 7'(0) . and (iii) imply that O(P) defined (not uniquely) by Ep. [N(O))' . (c) Assume the conditions in (a) and (b).p(X.0).. Show that On is consistent for B(P) over P. I i .) Hint: Show that Ep1/J(X1  8) is nonincreasing in B. 1/4/'(0)). . Show that.n(On . random variables distributed according to PEP... aliI/ > O(P) is finite. Show that _ £ ( . ..Xn be i...0) = O.p(X. (e) Suppose lbat the d. (k . I . (a) Show that (i).p(X. . . _ .p(Xi .L~.O(P)) > 0 > Ep!/!(X.... .n for t E R. . . N(O. . O(P) is lbe unique solution of Ep. n 1 F' (x) exists.4 1. 1948). Let i !' X denote the sample median. .0). is continuous and lbat 1(0) = F'(O) exists. .p(X. ..1)cr'/k. Use the bounded convergence lbeorem applied to ..... 00 (iii) I!/!I(x) < M < for all x.)).p( 00) as 0 ~ 00.\(On)] !:.i. I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 . N(O) = Cov(. (ii). .. 1/). . \ . Show lbat if I(x) ~ (d) Assume part (c) and A6. 1 X n.p(X. then 8' !.356 Asymptotic Approximations Chapter 5 . That is the MLE jl' is not consistent (Neyman and Scott.0) !. Let Xl. Let On = B(P). . Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) .On) ..0))  7'(0)) 0. F(x) of X.(0) Varp!/!(X...p(x) (b) Suppose lbat for all PEP. I(X. where P is the empirical distribution of Xl.. then < t) = P(O < On) = P (. . 1) for every sequence {On} wilb On 1 n = 0 + t/. Set (. (b) Show that if k is fixed and p ~ 00... N(O. Assmne lbat '\'(0) < 0 exists and lbat .n7(0) ~[. .. (Use . • 1 = sgn(x).On) < 0).nCOn ... under the conditions of (c)..p(X. .
not only does asymptotic nonnality not hold but 8 converges B faster than at rate n.0) (see Section 3.X) =0. X n be i. Show that   ep(X. . 0» density.(x .. Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley.B)) ~ [(I/O). .~ for 0 > x and is undefined for 0 g~(X. " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 .0.' .(x. x E R.b)p(x.(x).15 and 0.    . 0 > 0.(x. and O WIth .  = a?/o.0..(x.2. then ep(X.exp(x/O)... the asymptotic 2 . 82 ) Pis N(I". Show that A6' implies A6. ( 2 ).(x.jii(Oj .'" ."' c .I / 2 .4.05. Show that assumption A4' of this section coupled with AGA3 implies assumption A4.20 and note that X is more efficient than X for these gross error cases. 1979) which you may assume.. U(O. 2. Hint: Apply A4 and the dominated convergence theorem B. Thus.7. X) as defined in (f). . 0 «< 0. 3. (a) Show that g~ (x.(1:e )n ~ 1.i.(x. Let XI. lJ :o(1J.a)p(x.d.(x) + <'P.O)dl"(x)) = J 1J. X) = 1r/2. (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2). If (] = I.O). This is compatible with I(B) = 00. 4. ~"'. evaluate the efficiency for € = .O) is defined with Pe probability I but < x.0)p(x.Xn ) is the MLE.y..9)dl"(x) = J te(1J. Find the efficiency ep(X.b)dl"(x) J 1J. 0) = . .c)'P. not O! Hint: Peln(B .10.O))dl"(x) ifforalloo < a < b< 00.5) where f.(x) = (I .B) < xl = 1. Show that if (g) Suppose Xl has the gross error density f.6 Problems and Complements 357 (0 For two estImates 0.O)p(x. oj).0) ~ N(O. then 'ce(n(B . denotes the N(O. 'T = 4.a)dl"(x). Conclude that (b) Show that if 0 = max(XI. Hint: teJ1J..5  and '1'.. J = 1.Section 5.O)p(x.
50).·_t1og vn n j = p( x"'o+.5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.. Show that the conclusions of PrQple~ 5.(X.14.in. I .i. I (c) Prove (5. 80 +.[ L:. 0 ~N ~ (.80 + p(X . 1(80 ).5 continue to hold if . Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+".) P"o (X . ~log (b) Show that n p(Xi . ) ! . Suppose A4'.g. ) n en] .00) .0'1 < En}'!' 0 .4. I .22).80 + p(Xi . 8)p(x. under F oo ' and conclude that . 8 ) 0 . 0 for any sequence {en} by using (b) and Polyii's theorem (A.O) 1(0) and Hint: () _ g. &l/&O so that E.i(X. Show that 0 ~ 1(0) is continuous.7.In ~ "( n & &8 10gp(Xi..80 p(X" 80) +. .in) 1 is replaced by the likelihood ratio statistic 7. .4) to g."('1(8) .i (x.O') .In) + ..4. (8) Show that in Theorem 5. ~ L.. I I j 6."log ..18 ..< ~log n p(X. Apply the dominated convergence theorem (B.Oo) p(Xi . .358 5. 8 ) i . i I £"+7. O.in) . . i.~ (X..p ~ 1(0) < 00. 0). I 1 . 8) is continuous and I if En sup {:.4.i'''i~l p(Xi. j. and A6 hold for.in) ~ . A2.
~ 1 L i=I n 8'[ ~ 8B' (Xi. for all f n2 nI n2 .59).4. hence. f is a is an asymptotic lower confidence bound for fJ. Let B be two asymptotic l.4. = X.4.4. (Ii) Deduce that g. (Ii) Compare the bounds in Ca) with the bound (5.4.Section 5.4. Let [~11. B' nj for j = 1. B).56). setting a lower confidence bound for binomial fJ.3).54) and (5.14. We say that B >0 nI Show that 8~I and. the bound (5.a.61) for X ~ 0 and L 12.4.4. at all e and (A4'). 9. (a) Show that. gives B Compare with the exact bound of Example 4.61). (a) Show that under assumptions (AO)(A6) for all Band (A4'). there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 .5 and 5. Hint. Hint: Use Problems 5. Compare Theorem 4.B lower confidence bounds.4.59).3. Consider Example 4.6.58). (Ii) Suppose the conditions of Theorem 5. which agrees with (4. Then (5. B' _n + op (n 1/') 13. (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ).:vell .2. (a) Establish (5.4. Hint: Use Problem 5. .2.6 Problems and Complements 359 8.4.7.57) can be strengthened to: For each B E e.4.4. Let B~ be as in (5.5. which is just (4.2.7). Establish (5. and give the behavior of (5.4. 11.4.Q" is asymptotically at least as good as B if.6 and Slutsky's theorem. all the 8~i are at least as good as any competitors.59) coincide.4. 10. if X = 0 or 1. ~. A.4.5 hold. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5.
" " I I .\ l .5.2. 0 given Xl..10) and (3. Consider the Bayes test when J1.2.i. 1 .)l/2 Hint: Use Examples 3. the evidence I . 2. (a) Show that the test that rejects H for large values of v'n(X .LJ. .d.•.riX) = T(l + nr 2)1/2rp (I+~~. That is...=2[1  <1>( v'nIXI)] has a I U(O. A exp . is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0.\ > o. 1).t = 0 versus K : J.d. Let X 1> .360 1 Asymptotic Approximations Chapter 5 14. 1 15.1. the pvalue fJ . LJ.LJ. the test statistic is a sum of i. phas a U(O. > ~..I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox").1. Show that (J/(J l'..\ < Ao. where . ! I • = '\0 versus K : . Show that (J l'.) has pvalue p = <1>(v'n(X . Hint: By (3.. X n i. By Problem 4. 00. I' has aN(O. Now use the central limit theorem. where Jl is known.4. . Jl > OJ . ! (a) Show that the posterior probability of (OJ is where m n (. each Xi has density ( 27rx3 A ) 1/2 {AX + J. .d..1. X n be i.. I .L and A. 1) distribution. T') distribution. variables with mean zero and variance 1(0).d. (c) Suppose that I' = 8 > O. .4.21J2 2x A} . (d) Find the Wald test for testing H : A = AO versus K : A < AO. (e) Find the Rao score test for testing H : .1 and 3. .. inverse Gaussian with parameters J.'" XI! are ij.5 ! 1.01 . is a given number. (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation. 1. = O. (b) Suppose that I' ).I). 1.. 1) distribution. Establish (5. J1.i. X > 0. Suppose that Xl. Consider the problem of testing H : I' E [0.55).] versus K . 7f(1' 7" 0) = 1 . N(I'. Consider testing H : j. =f:. N(Jt.11).6.. if H is false. Problems for Section 5. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : .\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO.i.4.)) and that when I' = LJ. That is.2.A 7" 0.
645 and p = 0. all .2 and the continuity of 7f( 0).Oo»)+log7f(O+ J.4. oyn.17). ~ E..046 . P...5.5.sup{ :.5..for M(.O') : 100'1 < o} 5.(Xi.Section 5.050 logdnqn(t) = ~ {I(O). Extablish (5.(Xi. for all M < 00.' " .042 . ~l when ynX = n 1'>=0. 1) prior.~~(:.052 100 . Suppose that in addition to the conditions of Theorem 5.i It Iexp {iI(O)'. i ~..1 is not in effect.) (d) Compute plimn~oo p/pfor fJ. Hint: 1 n [PI .14).029 . 4. Show that the posterior probability of H is where an = n/(n + I).. yn(anX   ~)/ va..05..5.1.0 3.5. 1) and p ~ L U(O. 0' (i» n L 802 i=l < In n ~sup {82 8021(X i ..)}.s. Hint: By (5. ~ L N(O.i Itlqn(t)dt < J:J').1 I'> = 1. (e) Show that when Jl ~ ~. 10 .2 it is equivalent to show that J 02 7f (0)dO < a. J:J'). By Theorem 5.2.5.5.S.I(Xi.O'(t)).6 Problems and Complements 361 (b) Suppose that J.058 20 ..17). Establish (5.:.13) and the SLLN. (Lindley's "paradox" of Problem 5.} dt < .2.s.I).034 . Fe. yn(E(9 [ X) . Hint: In view of Theorem 5.0'): iO' . (e) Verify the following table giving posterior probabilities of 1. L M M tqn(t)dt ~ 0 a. By (5.) sufficiently large. ifltl < ApplytheSLLN and 0 ous at 0 = O.8) ~ 0 a.l has a N(O. [0. Apply the argnment used for Theorem 5.( Xt.054 50 .01 < 0 } continuThen 00.
. A. 5. dn Finally.5.s. The sets en (c) {t . ~ II• '. I : ItI < M} O. . Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d. (1) This famous result appears in Laplace's work and was rediscovered by S. . Show tbat (5. .f.9) hold with a.) by A4 and A5.". von Misessee Stigler (1986) and Le Cam and Yang (1990). • .5 " I i . (O)<p(tI.29). qn (t) > c} are monotone increasing in c. Suppose that in Theorem 5.S.I(Xi}»} 7r(t)dt.3 ). all d and c(d) / in d..5. 7.I Notes for Section 5. . .16) noting that vlnen. convergence replaced by convergence in Po probability. (a) Show that sup{lqn (t) . .5. " .1. L . ! ' (2) Computed by Winston Cbow. . 'i roc J. I: .1 .4 (1) This result was first stated by R. Fn(x) is taken to be O. .. .Pn where Pn a or 1.s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) .7 NOTES Notes for Section 5. Notes for Section 5. A proof was given by Cramer (1946). .5. .8) and (5.5. J I . I i.5. .s.2 we replace the assumptions A4(a. ~ 0 a. For n large these do not correspond to distributions one typically faces. Bernstein and R. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5. i=l }()+O(fJ) I Apply (5. Finally.s. • Notes for Section 5. for all 6. (0» (b) Deduce (5.) and A5(a.) ~ 0 and I jtl7r(t)dt < co. Fisher (1925).1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 . 362 f Asymptotic Approximations Chapter 5 > O.!. jo I I . See Bhattacharya and Ranga Rao (1976) for further discussion. (I) If the rightband side is negative for some x.d) for some c(d).
5.8 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer-Verlag, 1985.
BHATTACHARYA, R. N. AND R. RANGA RAO, Normal Approximation and Asymptotic Expansions New York: Wiley, 1976.
BILLINGSLEY, P., Probability and Measure New York: Wiley, 1979.
BOX, G. E. P., "Non-normality and tests on variances," Biometrika, 40, 318-324 (1953).
CRAMER, H., Mathematical Methods of Statistics Princeton, NJ: Princeton University Press, 1946.
DAVID, F. N., Tables of the Correlation Coefficient, Cambridge University Press, 1938; reprinted in Biometrika Tables for Statisticians (1966), Vol. I, 3rd ed., H. O. Hartley and E. S. Pearson, Editors Cambridge: Cambridge University Press.
FERGUSON, T. S., A Course in Large Sample Theory New York: Chapman and Hall, 1996.
FISHER, R. A., "Theory of statistical estimation," Proc. Camb. Phil. Soc., 22, 700-725 (1925).
FISHER, R. A., Statistical Inference and Scientific Method, Vth Berkeley Symposium, 1958.
HAMMERSLEY, J. M. AND D. C. HANDSCOMB, Monte Carlo Methods London: Methuen & Co., 1964.
HOEFFDING, W., "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13-30 (1963).
HUBER, P. J., The Behavior of the Maximum Likelihood Estimator Under Non-Standard Conditions, Proc. Vth Berk. Symp. Math. Statist. Prob., Vol. 1 Berkeley, CA: University of California Press, 1967.
LE CAM, L. AND G. L. YANG, Asymptotics in Statistics, Some Basic Concepts New York: Springer, 1990.
LEHMANN, E. L., Elements of Large-Sample Theory New York: Springer-Verlag, 1999.
LEHMANN, E. L. AND G. CASELLA, Theory of Point Estimation New York: Springer-Verlag, 1998.
NEYMAN, J. AND E. L. SCOTT, "Consistent estimates based on partially consistent observations," Econometrica, 16, 1-32 (1948).
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.
RUDIN, W., Mathematical Analysis, 3rd ed. New York: McGraw Hill, 1987.
SCHERVISH, M., Theory of Statistics New York: Springer, 1995.
SERFLING, R. J., Approximation Theorems of Mathematical Statistics New York: J. Wiley & Sons, 1980.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.
WILSON, E. B. AND M. M. HILFERTY, "The distribution of chi square," Proc. Nat. Acad. Sci., U.S.A., 17, p. 684 (1931).
Chapter 6

INFERENCE IN THE MULTIPARAMETER CASE

6.1 INFERENCE FOR GAUSSIAN LINEAR MODELS

Most modern statistical questions involve large data sets, the modeling of whose stochastic structure involves complex models governed by several, often many, real parameters and frequently even more semi- or nonparametric models. In this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates, tests, and confidence regions in regular one-dimensional parametric models for $d$-dimensional models $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R^d$. We have presented several such models already, for instance, the multinomial (Examples 1.6.7, 2.3.3), multiple regression models (Examples 1.1.4, 1.4.3, 2.1.1) and more generally have studied the theory of multiparameter exponential families (Sections 1.6.2, 2.2, 2.3). However, with the exception of Theorems 5.2.2 and 5.3.5, in which we looked at asymptotic theory for the MLE in multiparameter exponential families, we have not considered asymptotic inference, testing, confidence regions, and prediction in such situations. We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible. We shall show how the exact behavior of likelihood procedures in this model corresponds to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular $d$-dimensional parametric models and shall illustrate our results with a number of important examples.

This chapter is a lead-in to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non- and semiparametric models. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for function-valued statistics, the properties of nonparametric MLEs, curve estimates, the bootstrap, and efficiency in semiparametric models. There is, however, an important aspect of practical situations that is not touched by the approximation, the fact that $d$, the number of parameters, and $n$, the number of observations, are often both large and commensurate or nearly so. The inequalities of Vapnik-Chervonenkis, Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II.
Notational Convention: In this chapter we will, when there is no ambiguity, let expressions such as $\theta$ refer to both column and row vectors.

6.1.1 The Classical Gaussian Linear Model

Many of the examples considered in the earlier chapters fit the framework in which the $i$th measurement $Y_i$ among $n$ independent observations has a distribution that depends on known constants $z_{i1}, \ldots, z_{ip}$. In the classical Gaussian (normal) linear model this dependence takes the form

$$Y_i = \sum_{j=1}^{p} z_{ij}\beta_j + \epsilon_i, \quad i = 1, \ldots, n \tag{6.1.1}$$

where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. In vector and matrix notation, we write

$$Y_i = z_i^T \beta + \epsilon_i, \quad i = 1, \ldots, n \tag{6.1.2}$$

and

$$Y = Z\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 J) \tag{6.1.3}$$

where $z_i = (z_{i1}, \ldots, z_{ip})^T$, $Z = (z_{ij})_{n \times p}$, and $J$ is the $n \times n$ identity matrix.

Here $Y_i$ is called the response variable, the $z_{ij}$ are called the design values, and $Z$ is called the design matrix.

In this section we will derive exact statistical procedures under the assumptions of the model (6.1.3). These are among the most commonly used statistical techniques. In Section 6.6 we will investigate the sensitivity of these procedures to the assumptions of the model. It turns out that these techniques are sensible and useful outside the narrow framework of model (6.1.3).

Here is Example 1.1.2(4) in this framework.

Example 6.1.1. The One-Sample Location Problem. We have $n$ independent measurements $Y_1, \ldots, Y_n$ from a population with mean $\beta_1 = E(Y)$. The model is

$$Y_i = \beta_1 + \epsilon_i, \quad i = 1, \ldots, n \tag{6.1.4}$$

where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. Here $p = 1$ and $Z_{n \times 1} = (1, \ldots, 1)^T$.

The regression framework of Examples 1.1.4 and 2.1.1 is also of the form (6.1.3):

Example 6.1.2. Regression. We consider experiments in which $n$ cases are sampled from a population, and for each case, say the $i$th case, we have a response $Y_i$ and a set of $p - 1$ covariate measurements denoted by $z_{i2}, \ldots, z_{ip}$. We are interested in relating the mean of the response to the covariate values. The normal linear regression model is

$$Y_i = \beta_1 + \sum_{j=2}^{p} z_{ij}\beta_j + \epsilon_i, \quad i = 1, \ldots, n \tag{6.1.5}$$
where $\beta_1$ is called the regression intercept and $\beta_2, \ldots, \beta_p$ are called the regression coefficients. If we set $z_{i1} = 1$, $i = 1, \ldots, n$, then the notation (6.1.2) and (6.1.3) applies.

We treat the covariate values $z_{ij}$ as fixed (nonrandom). In this case, (6.1.5) is called the fixed design normal linear regression model. The random design Gaussian linear regression model is given in Example 1.4.3. We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional distribution of $Y$ given a set of observed covariate values.

Example 6.1.3. The p-Sample Problem or One-Way Layout. In Example 1.1.3 and Section 4.9.3 we considered experiments involving the comparisons of two population means when we had available two independent samples, one from each population. Two-sample models apply when the design values represent a qualitative factor taking on only two values. Frequently, we are interested in qualitative factors taking on several values. If we are comparing pollution levels, we want to do so for a variety of locations; we often have more than two competing drugs to compare, and so on.

To fix ideas suppose we are interested in comparing the performance of $p \ge 2$ treatments on a population and that we administer only one treatment to each subject and a sample of $n_k$ subjects get treatment $k$, $1 \le k \le p$, $n_1 + \cdots + n_p = n$. If the control and treatment responses are independent and normally distributed with the same variance $\sigma^2$, we arrive at the one-way layout or $p$-sample model,

$$Y_{kl} = \beta_k + \epsilon_{kl}, \quad 1 \le l \le n_k, \ 1 \le k \le p \tag{6.1.6}$$

where $Y_{kl}$ is the response of the $l$th subject in the group obtaining the $k$th treatment, $\beta_k$ is the mean response to the $k$th treatment, and the $\epsilon_{kl}$ are independent $N(0, \sigma^2)$ random variables.

To see that this is a linear model we relabel the observations as $Y_1, \ldots, Y_n$, where $Y_1, \ldots, Y_{n_1}$ correspond to the group receiving the first treatment, $Y_{n_1+1}, \ldots, Y_{n_1+n_2}$ to that getting the second, and so on. Then for $1 \le j \le p$, if $n_0 = 0$, the design matrix has elements:

$$z_{ij} = \begin{cases} 1 & \text{if } \sum_{k=1}^{j-1} n_k + 1 \le i \le \sum_{k=1}^{j} n_k \\ 0 & \text{otherwise} \end{cases}$$

and

$$Z = \begin{pmatrix} I_1 & 0 & \cdots & 0 \\ 0 & I_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & I_p \end{pmatrix}$$

where $I_j$ is a column vector of $n_j$ ones and the $0$ in the "row" whose $j$th member is $I_j$ is a column vector of $n_j$ zeros. The model (6.1.6) is an example of what is often called analysis of variance models. Generally, this terminology is commonly used when the design values are qualitative.

The model (6.1.6) is often reparametrized by introducing $\alpha = p^{-1}\sum_{k=1}^{p}\beta_k$ and $\delta_k = \beta_k - \alpha$ because then $\delta_k$ represents the difference between the $k$th and average treatment
effects, $k = 1, \ldots, p$. In terms of the new parameter $\beta^* = (\alpha, \delta_1, \ldots, \delta_p)^T$, the linear model is

$$Y = Z^*\beta^* + \epsilon, \quad \epsilon \sim N(0, \sigma^2 J)$$

where $Z^*_{n \times (p+1)} = (1, Z)$ and $1_{n \times 1}$ is the vector with $n$ ones. Note that $Z^*$ is of rank $p$ and that $\beta^*$ is not identifiable for $\beta^* \in R^{p+1}$. However, $\beta^*$ is identifiable in the $p$-dimensional linear subspace $\{\beta^* \in R^{p+1} : \sum_{k=1}^{p}\delta_k = 0\}$ of $R^{p+1}$ obtained by adding the linear restriction $\sum_{k=1}^{p}\delta_k = 0$ forced by the definition of the $\delta_k$'s. This type of linear model with the number of columns $d$ of the design matrix larger than its rank $r$, and with the parameters identifiable only once $d - r$ additional linear restrictions have been specified, is common in analysis of variance models.

Even if $\beta$ is not a parameter (is unidentifiable), the vector of means $\mu = (\mu_1, \ldots, \mu_n)^T$ of $Y$ always is. It is given by

$$\mu = Z\beta = \sum_{j=1}^{p}\beta_j c_j$$

where the $c_j$ are the columns of the design matrix, $c_j = (z_{1j}, \ldots, z_{nj})^T$, $j = 1, \ldots, p$. The parameter set for $\beta$ is $R^p$ and the parameter set for $\mu$ is

$$\omega = \{\mu = Z\beta : \beta \in R^p\}.$$

Note that $\omega$ is the linear space spanned by the columns $c_j$, $j = 1, \ldots, p$, of the design matrix. Let $r$ denote the number of linearly independent $c_j$, $j = 1, \ldots, p$; then $r$ is the rank of $Z$ and $\omega$ has dimension $r$. It follows that the parametrization $(\beta, \sigma^2)$ is identifiable if and only if $r = p$ (Problem 6.1.17). We assume that $n \ge r$.

The Canonical Form of the Gaussian Linear Model

The linear model can be analyzed easily using some geometry. Because $\dim \omega = r$, there exists (e.g., by the Gram-Schmidt process) (see Section B.3.2) an orthonormal basis $v_1, \ldots, v_n$ for $R^n$ such that $v_1, \ldots, v_r$ span $\omega$. Recall that orthonormal means $v_i^T v_j = 0$ for $i \ne j$ and $v_i^T v_i = 1$. When $v_i^T v_j = 0$, we call $v_i$ and $v_j$ orthogonal. Note that any $t \in R^n$ can be written

$$t = \sum_{i=1}^{n}(v_i^T t)v_i \tag{6.1.7}$$

and that

$$t \in \omega \iff t = \sum_{i=1}^{r}(v_i^T t)v_i \iff v_i^T t = 0, \quad i = r+1, \ldots, n.$$

We now introduce the canonical variables and means

$$U_i = v_i^T Y, \quad \eta_i = E(U_i) = v_i^T \mu, \quad i = 1, \ldots, n.$$
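As a hedged illustration of the canonical form, the following Python sketch (standard library only) builds the one-way layout design matrix for some hypothetical group sizes, checks that $Z$ and $Z^* = (1, Z)$ both have rank $p$, and computes the canonical means $\eta_i = v_i^T\mu$ from a Gram-Schmidt basis whose first $r$ vectors span $\omega$. The helper names `gram_schmidt` and `one_way_design`, the group sizes, and the values of $\beta$ are assumptions made for this illustration; they do not come from the text.

```python
import math

def gram_schmidt(vectors, tol=1e-10):
    """Orthonormalize `vectors` in order (modified Gram-Schmidt), dropping
    any vector that is numerically a linear combination of earlier ones."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coef = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coef * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def one_way_design(group_sizes):
    """Return the columns c_1, ..., c_p of the one-way layout design matrix."""
    n = sum(group_sizes)
    cols, start = [], 0
    for nk in group_sizes:
        cols.append([1.0 if start <= i < start + nk else 0.0 for i in range(n)])
        start += nk
    return cols

group_sizes = [2, 3, 2]            # hypothetical n_1, n_2, n_3 (so p = 3, n = 7)
cols = one_way_design(group_sizes)
n = sum(group_sizes)

# Rank of Z and of Z* = (1, Z): the column of ones adds no new direction.
r = len(gram_schmidt(cols))
r_star = len(gram_schmidt([[1.0] * n] + cols))

# Orthonormal basis of R^n whose first r vectors span omega: orthonormalize
# the columns of Z first, then complete with the standard unit vectors.
unit = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
V = gram_schmidt(cols + unit)

beta = [1.0, 4.0, -2.0]            # hypothetical treatment means
mu = [sum(beta[j] * cols[j][i] for j in range(len(cols))) for i in range(n)]
eta = [sum(vi * mi for vi, mi in zip(v, mu)) for v in V]

print("rank Z =", r, " rank Z* =", r_star)
print("eta =", [round(e, 6) for e in eta])
```

Because $\mu \in \omega$, the last $n - r$ entries of `eta` vanish up to rounding error, which is the content of the equivalence following (6.1.7).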
Theorem 6.1.1. The $U_i$ are independent and $U_i \sim N(\eta_i, \sigma^2)$, $i = 1, \ldots, n$, where

$$\eta_i = 0, \quad i = r+1, \ldots, n,$$

while $(\eta_1, \ldots, \eta_r)^T$ varies freely over $R^r$.

Proof. Let $A_{n \times n}$ be the orthogonal matrix with rows $v_1^T, \ldots, v_n^T$. Then we can write $U = AY$, $\eta = A\mu$, and by Theorem B.3.2, $U_1, \ldots, U_n$ are independent normal with variance $\sigma^2$ and $E(U_i) = v_i^T\mu = 0$ for $i = r+1, \ldots, n$ because $\mu \in \omega$. $\Box$

Note that

$$Y = A^{-1}U. \tag{6.1.8}$$

So, observing $U$ and $Y$ is the same thing. $\mu$ and $\eta$ are equivalently related,

$$\mu = A^{-1}\eta, \tag{6.1.9}$$

whereas

$$\mathrm{Var}(Y) = \mathrm{Var}(U) = \sigma^2 J_{n \times n}. \tag{6.1.10}$$

It will be convenient to obtain our statistical procedures for the canonical variables $U$, which are sufficient for $(\mu, \sigma^2)$ using the parametrization $(\eta, \sigma^2)^T$, and then translate them to procedures for $\mu$, $\beta$, and $\sigma^2$ based on $Y$ using (6.1.8)-(6.1.10).

6.1.2 Estimation

We first consider the $\sigma^2$ known case, which is the guide to asymptotic inference in general. We start by considering the log likelihood $\ell(\eta, u)$ based on $U$

$$\ell(\eta, u) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(u_i - \eta_i)^2 - \frac{n}{2}\log(2\pi\sigma^2) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}u_i^2 + \frac{1}{\sigma^2}\sum_{i=1}^{r}\eta_i u_i - \frac{1}{2\sigma^2}\sum_{i=1}^{r}\eta_i^2 - \frac{n}{2}\log(2\pi\sigma^2). \tag{6.1.11}$$

Theorem 6.1.2. In the canonical form of the Gaussian linear model with $\sigma^2$ known

(i) $T = (U_1, \ldots, U_r)^T$ is sufficient for $\eta$.

(ii) $U_1, \ldots, U_r$ is the MLE of $\eta_1, \ldots, \eta_r$.

(iii) $U_i$ is the UMVU estimate of $\eta_i$, $i = 1, \ldots, r$.

(iv) If $c_1, \ldots, c_r$ are constants, then the MLE of $\alpha = \sum_{i=1}^{r}c_i\eta_i$ is $\hat{\alpha} = \sum_{i=1}^{r}c_i U_i$. $\hat{\alpha}$ is also UMVU for $\alpha$.

(v) The MLE of $\mu$ is $\hat{\mu} = \sum_{i=1}^{r}v_i U_i$ and $\hat{\mu}_i$ is UMVU for $\mu_i$, $i = 1, \ldots, n$. Moreover, $U_i = v_i^T\hat{\mu}$, making $\hat{\mu}$ and $U$ equivalent.
Proof. (i) By observation, (6.1.11) is an exponential family with sufficient statistic $T$.

(ii) $U_1, \ldots, U_r$ are the MLEs of $\eta_1, \ldots, \eta_r$ because, by observation, (6.1.11) is a function of $\eta_1, \ldots, \eta_r$ only through $\sum_{i=1}^{r}(u_i - \eta_i)^2$ and is maximized by setting $\eta_i = U_i$, which minimizes this sum. (We could also apply Theorem 2.3.1.)

(iii) By Theorem 3.4.4 and Example 3.4.6, $U_i$ is UMVU for $E(U_i) = \eta_i$, $i = 1, \ldots, r$.

(iv) By the invariance of the MLE (Section 2.2.2), the MLE of $q(\theta) = \sum_{i=1}^{r}c_i\eta_i$ is $q(\hat{\theta}) = \sum_{i=1}^{r}c_i U_i$. If all the $c$'s are zero, $\hat{\alpha}$ is UMVU. Assume that at least one $c$ is different from zero. By Problem 3.4.10, we can assume without loss of generality that $\sum_{i=1}^{r}c_i^2 = 1$. By Gram-Schmidt orthogonalization, there exists an orthonormal basis $v_1, \ldots, v_n$ of $R^n$ with $v_1 = c = (c_1, \ldots, c_r, 0, \ldots, 0)^T \in R^n$. Let $W_i = v_i^T U$, $\xi_i = v_i^T\eta$, $i = 1, \ldots, n$; then $W \sim N(\xi, \sigma^2 J)$ by Theorem B.3.2, where $J$ is the $n \times n$ identity matrix. The distribution of $W$ is an exponential family, $W_1 = \hat{\alpha}$ is sufficient for $\xi_1 = \alpha$, and is UMVU for its expectation $E(W_1) = \alpha$.

(v) Follows from (iv). $\Box$

Next we consider the case in which $\sigma^2$ is unknown and assume $n \ge r + 1$.

Theorem 6.1.3. In the canonical Gaussian linear model with $\sigma^2$ unknown,

(i) $T = \left(U_1, \ldots, U_r, \sum_{i=r+1}^{n}U_i^2\right)^T$ is sufficient for $(\eta_1, \ldots, \eta_r, \sigma^2)^T$.

(ii) The MLE of $\sigma^2$ is $n^{-1}\sum_{i=r+1}^{n}U_i^2$.

(iii) $s^2 = (n - r)^{-1}\sum_{i=r+1}^{n}U_i^2$ is an unbiased estimator of $\sigma^2$.

(iv) The conclusions of Theorem 6.1.2 (ii), ..., (v) are still valid.

Proof. By (6.1.11), $\left(U_1, \ldots, U_r, \sum_{i=1}^{n}U_i^2\right)^T$ is sufficient. But because $\sum_{i=1}^{n}U_i^2 = \sum_{i=1}^{r}U_i^2 + \sum_{i=r+1}^{n}U_i^2$, this statistic is equivalent to $T$ and (i) follows. To show (ii), recall that the maximum of (6.1.11) has $\eta_i = U_i$, $i = 1, \ldots, r$. That is, we need to maximize

$$-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=r+1}^{n}U_i^2$$

as a function of $\sigma^2$. The maximizer is easily seen to be $n^{-1}\sum_{i=r+1}^{n}U_i^2$ (Problem 6.1.1). (iii) is clear because $EU_i^2 = \sigma^2$, $i \ge r + 1$. To show (iv), apply Theorem 3.4.3 and Example 3.4.6 to the canonical exponential family obtained from (6.1.11) by setting $T_j = U_j$, $\theta_j = \eta_j/\sigma^2$, $j = 1, \ldots, r$, $T_{r+1} = \sum_{i=1}^{n}U_i^2$ and $\theta_{r+1} = -1/2\sigma^2$. $\Box$

Projections

We next express $\hat{\mu}$ in terms of $Y$, obtain the MLE $\hat{\beta}$ of $\beta$, and give a geometric interpretation of $\hat{\mu}$, $\hat{\beta}$, and $s^2$. To this end, define the norm $|t|$ of a vector $t \in R^n$ by $|t|^2 = \sum_{i=1}^{n}t_i^2$.
Definition 6.1.1. The projection $y_0 = \pi(y \mid \omega)$ of a point $y \in R^n$ on $\omega$ is the point

    $y_0 = \arg\min\{|y - t|^2 : t \in \omega\}$.

The maximum likelihood estimate $\hat\beta$ of $\beta$ maximizes

    $\log p(y, \beta, \sigma) = -\frac{1}{2\sigma^2} |y - Z\beta|^2 - \frac{n}{2} \log(2\pi\sigma^2)$

or, equivalently,

    $\hat\beta = \arg\min\{|y - Z\beta|^2 : \beta \in R^p\}$.

That is, the MLE of $\beta$ equals the least squares estimate (LSE) of $\beta$ defined in Example 2.1.1 and Section 2.2.1. We have

Theorem 6.1.4. In the Gaussian linear model

(i) $\hat\mu$ is the unique projection of $Y$ on $\omega$ and is given by

    $\hat\mu = Z\hat\beta$    (6.1.12)

(ii) $\hat\mu$ is orthogonal to $Y - \hat\mu$.

(iii) $s^2 = |Y - \hat\mu|^2/(n - r)$    (6.1.13)

(iv) If $p = r$, then $\beta$ is identifiable, $\beta = (Z^T Z)^{-1} Z^T \mu$, the MLE = LSE of $\beta$ is unique and given by

    $\hat\beta = (Z^T Z)^{-1} Z^T \hat\mu = (Z^T Z)^{-1} Z^T Y$    (6.1.14)

(v) $\hat\beta_j$ is the UMVU estimate of $\beta_j$, $j = 1, \dots, p$, and $\hat\mu_i$ is the UMVU estimate of $\mu_i$, $i = 1, \dots, n$.

Proof. (i) is clear because $Z\beta$, $\beta \in R^p$, spans $\omega$. (ii) and (iii) are also clear from Theorem 6.1.3 because $\hat\mu = \sum_{i=1}^r v_i U_i$ and $Y - \hat\mu = \sum_{j=r+1}^n v_j U_j$. To show (iv), note that $\mu = Z\beta$ and (6.1.12) imply $Z^T \mu = Z^T Z \beta$ and $Z^T \hat\mu = Z^T Z \hat\beta$ and, because $Z$ has full rank, $Z^T Z$ is nonsingular, and $\beta = (Z^T Z)^{-1} Z^T \mu$, $\hat\beta = (Z^T Z)^{-1} Z^T \hat\mu$. To show $\hat\beta = (Z^T Z)^{-1} Z^T Y$, note that the space $\omega^\perp$ of vectors $s$ orthogonal to $\omega$ can be written as

    $\omega^\perp = \{s \in R^n : s^T (Z\beta) = 0 \text{ for all } \beta \in R^p\}$.

It follows that $\beta^T (Z^T s) = 0$ for all $\beta \in R^p$, which implies $Z^T s = 0$ for all $s \in \omega^\perp$. Thus, $Z^T (Y - \hat\mu) = 0$ and the second equality in (6.1.14) follows.

$\hat\beta_j$ and $\hat\mu_i$ are UMVU because, by (6.1.9), any linear combination of $Y$'s is also a linear combination of $U$'s, and by Theorems 6.1.2(iv) and 6.1.3(iv), any linear combination of $U$'s is a UMVU estimate of its expectation. □
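The normal equations and the orthogonality of the residual to $\omega$ can be checked numerically. The following is a minimal sketch, assuming a small made-up data set with $p = r = 2$ (an intercept column and one covariate); the data values and variable names are illustrative only and do not come from the text.

```python
# Hypothetical toy data (not from the text): n = 5 cases, design rows (1, z_i).
z = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(y)

# Entries of Z^T Z and Z^T Y for the design matrix Z with rows (1, z_i).
s1, sz, szz = float(n), sum(z), sum(v * v for v in z)
sy, szy = sum(y), sum(a * b for a, b in zip(z, y))

# Solve the 2 x 2 normal equations (Z^T Z) beta = Z^T Y by Cramer's rule.
det = s1 * szz - sz * sz
beta1 = (szz * sy - sz * szy) / det
beta2 = (s1 * szy - sz * sy) / det

# Fitted values mu_hat = Z beta_hat and residuals Y - mu_hat.
mu_hat = [beta1 + beta2 * v for v in z]
resid = [a - b for a, b in zip(y, mu_hat)]

# Z^T (Y - mu_hat): both components should vanish up to rounding error.
zt_resid = (sum(resid), sum(v * e for v, e in zip(z, resid)))

# s^2 = |Y - mu_hat|^2 / (n - r) with r = 2.
s2 = sum(e * e for e in resid) / (n - 2)

print(beta1, beta2, zt_resid, s2)
```

Cramer's rule is used only because the system is 2 x 2; for larger $p$ a numerically stable solver (for example a QR decomposition) would be the usual choice.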
Note that in Example 2.1.1 we give an alternative derivation of (6.1.14) and the normal equations $(Z^T Z)\beta = Z^T Y$.

The estimate $\hat\mu = Z\hat\beta$ of $\mu$ is called the fitted value and $\hat\epsilon = y - \hat\mu$ is called the residual from this fit. The goodness of the fit is measured by the residual sum of squares (RSS) $|Y - \hat\mu|^2 = \sum_{i=1}^n \hat\epsilon_i^2$. Example 2.1.1 illustrates this terminology in the context of Example 6.1.2 with $p = 2$. There the points $\hat\mu_i = \hat\beta_1 + \hat\beta_2 z_i$, $i = 1, \dots, n$, lie on the regression line fitted to the data $\{(z_i, y_i), i = 1, \dots, n\}$; moreover, the residuals $\hat\epsilon_i = [y_i - (\hat\beta_1 + \hat\beta_2 z_i)]$ are the vertical distances from the points to the fitted line.

Suppose we are given a value of the covariate $z$ at which a value $Y$ following the linear model (6.1.3) is to be taken. By Theorem 1.4.1, the best MSPE predictor of $Y$ if $\beta$ is known as well as $z$ is $E(Y) = z^T \beta$ and its best (UMVU) estimate not knowing $\beta$ is $\hat Y = z^T \hat\beta$. Taking $z = z_i$, $1 \le i \le n$, we obtain $\hat\mu_i = z_i^T \hat\beta$, $1 \le i \le n$. In this method of "prediction" of $Y_i$, it is common to write $\hat Y_i$ for $\hat\mu_i$, the $i$th component of the fitted value $\hat\mu$. That is, $\hat Y = \hat\mu$. Note that by (6.1.12) and (6.1.14), when $p = r$,

    $\hat Y = HY$  where  $H = Z(Z^T Z)^{-1} Z^T$.

The matrix $H$ is the projection matrix mapping $R^n$ into $\omega$; see also Section B.10. In statistics it is also called the hat matrix because it "puts the hat on $Y$." As a projection matrix $H$ is necessarily symmetric and idempotent,

    $H^T = H$,  $H^2 = H$.    (6.1.15)

Next note that the residuals can be written as

    $\hat\epsilon = Y - \hat Y = (J - H)Y$.

The residuals are the projection of $Y$ on the orthocomplement of $\omega$. It follows from this and the formula for the variance of a linear transformation in Appendix B that if $J = J_{n \times n}$ is the identity matrix, then

    $\mathrm{Var}(\hat Y) = \sigma^2 H$,  $\mathrm{Var}(\hat\epsilon) = \sigma^2 (J - H)$.    (6.1.16)

We can now conclude the following.

Corollary 6.1.1. In the Gaussian linear model

(i) the fitted values $\hat Y = \hat\mu$ and the residual $\hat\epsilon$ are independent,

(ii) $\hat Y \sim N(\mu, \sigma^2 H)$,

(iii) $\hat\epsilon \sim N(0, \sigma^2 (J - H))$, and

(iv) if $p = r$, $\hat\beta \sim N(\beta, \sigma^2 (Z^T Z)^{-1})$.

Proof. $(\hat Y, \hat\epsilon)$ is a linear transformation of $U$ and, hence, joint Gaussian. The independence follows from the identification of $\hat\mu$ and $\hat\epsilon$ in terms of the $U_i$ in the theorem. $\mathrm{Var}(\hat\beta) = \sigma^2 (Z^T Z)^{-1}$ follows from the same variance formula. □

We now return to our examples.

Example 6.1.1. One Sample (continued). Here $\mu = \beta_1$ and $\hat\mu = \hat\beta_1 = \bar Y$. Moreover, the unbiased estimator $s^2$ of $\sigma^2$ is $\sum_{i=1}^n (Y_i - \bar Y)^2/(n - 1)$, which we have seen before in Chapters 1 and 3. □

Example 6.1.2. Regression (continued). If the design matrix $Z$ has rank $p$, then the MLE = LSE estimate is $\hat\beta = (Z^T Z)^{-1} Z^T Y$ as seen before in Example 2.1.1 and Section 2.2.1. We now see that the MLE of $\mu$ is $\hat\mu = Z\hat\beta$ and that $\hat\beta_j$ and $\hat\mu_i$ are UMVU for $\beta_j$ and $\mu_i$ respectively, $j = 1, \dots, p$, $i = 1, \dots, n$. The variances of $\hat\beta$, $\hat\mu = \hat Y$, and $\hat\epsilon = Y - \hat Y$ are given in Corollary 6.1.1. In the Gaussian case, $\hat\beta$, $\hat Y$, and $\hat\epsilon$ are normally distributed with $\hat Y$ and $\hat\epsilon$ independent. The error variance $\sigma^2 = \mathrm{Var}(\epsilon_1)$ can be unbiasedly estimated by $s^2 = (n - p)^{-1} |Y - \hat Y|^2$. □

Example 6.1.3. The One-Way Layout (continued). In this example the normal equations $(Z^T Z)\beta = Z^T Y$ become

    $n_k \beta_k = \sum_{l=1}^{n_k} Y_{kl}$,  $k = 1, \dots, p$.

At this point we introduce an important notational convention in statistics. If $\{c_{ijk\cdots}\}$ is a multiple-indexed sequence of numbers or variables, then replacement of a subscript by a dot indicates that we are considering the average over that subscript. Thus,

    $Y_{k\cdot} = \frac{1}{n_k} \sum_{l=1}^{n_k} Y_{kl}$,  $Y_{\cdot\cdot} = \frac{1}{n} \sum_{k=1}^p \sum_{l=1}^{n_k} Y_{kl}$,

where $n = n_1 + \cdots + n_p$, and we can write the least squares estimates as

    $\hat\beta_k = Y_{k\cdot}$,  $k = 1, \dots, p$.

By Theorem 6.1.3, in the Gaussian model, the UMVU estimate of the average effect of all the treatments, $\alpha = \beta_\cdot$, is

    $\hat\alpha = \frac{1}{p} \sum_{k=1}^p Y_{k\cdot}$  (not $Y_{\cdot\cdot}$ in general)

and the UMVU estimate of the incremental effect $\delta_k = \beta_k - \alpha$ of the $k$th treatment is

    $\hat\delta_k = Y_{k\cdot} - \hat\alpha$,  $k = 1, \dots, p$. □
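The defining properties of the hat matrix can be verified directly on a small design. This is a minimal sketch, assuming a made-up covariate vector and a design with rows $(1, z_i)$, so that $p = r = 2$; none of the numbers come from the text.

```python
# Hypothetical covariate values (not from the text); Z has rows (1, z_i), so p = r = 2.
z = [1.0, 2.0, 4.0, 7.0]
n, p = len(z), 2
Z = [[1.0, v] for v in z]

# Z^T Z and its inverse, written out for the 2 x 2 case.
a, b, d = float(n), sum(z), sum(v * v for v in z)
det = a * d - b * b
ztz_inv = [[d / det, -b / det], [-b / det, a / det]]

# H = Z (Z^T Z)^{-1} Z^T, computed entry by entry.
H = [[sum(Z[i][k] * ztz_inv[k][l] * Z[j][l] for k in range(p) for l in range(p))
      for j in range(n)] for i in range(n)]

# H H, used to check idempotence.
HH = [[sum(H[i][k] * H[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

max_asym = max(abs(H[i][j] - H[j][i]) for i in range(n) for j in range(n))
max_idem = max(abs(HH[i][j] - H[i][j]) for i in range(n) for j in range(n))

# The trace of a projection matrix equals the dimension of the space projected onto.
trace_H = sum(H[i][i] for i in range(n))

# With sigma^2 = 1, the diagonals of H and J - H are variances, so they lie in [0, 1].
diag_ok = all(-1e-12 <= H[i][i] <= 1 + 1e-12 for i in range(n))

print(max_asym, max_idem, trace_H, diag_ok)
```

The trace check reflects the fact that $H$ projects onto the $r$-dimensional space $\omega$, which is also why $n - r$ appears as the divisor in $s^2$.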
Remark 6.1.1. An alternative approach to the MLEs for the normal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors $\epsilon_1, \dots, \epsilon_n$ in (6.1.1) have the Laplace distribution with density

    $\frac{1}{2\sigma} \exp\{-|t|/\sigma\}$

and the estimates of $\beta$ and $\mu$ are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance $\sum_{i=1}^n |y_i - z_i^T \beta|$. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEs; see Stigler (1986). The LSEs are preferred because of ease of computation and their geometric properties. However, the LADEs are obtained fairly quickly by modern computing methods; see Koenker and D'Orey (1987) and Portnoy and Koenker (1997). For more on LADEs, see Problems 1.4.7 and 2.2.31.

6.1.3 Tests and Confidence Intervals

The most important hypothesis-testing questions in the context of a linear model correspond to restriction of the vector of means $\mu$ to a linear subspace of the space $\omega$, which together with $\sigma^2$ specifies the model. For instance, in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider, in the context of Example 6.1.2, a regression equation of the form

    mean response $= \beta_1 + \beta_2 z_{i2} + \beta_3 z_{i3}$,    (6.1.17)

where $z_{i2}$ is the dose level of the drug given the $i$th patient, $z_{i3}$ is the age of the $i$th patient, and the matrix $\|z_{ij}\|_{n \times 3}$ with $z_{i1} = 1$ has rank 3. Now we would test $H : \beta_2 = 0$ versus $K : \beta_2 \ne 0$. Thus, under $H$, $\{\mu : \mu_i = \beta_1 + \beta_3 z_{i3}, i = 1, \dots, n\}$ is a two-dimensional linear subspace of the full model's three-dimensional linear subspace of $R^n$ given by (6.1.17).

Next consider the $p$-sample model (Example 6.1.3) with $\beta_k$ representing the mean response for the $k$th population. The first inferential question is typically "Are the means equal or not?" Thus we test $H : \beta_1 = \cdots = \beta_p = \beta$ for some $\beta \in R$ versus $K$: "the $\beta$'s are not all equal." Now, under $H$, the mean vector is an element of the space $\{\mu : \mu_i = \beta \in R, i = 1, \dots, n\}$, which is a one-dimensional subspace of $R^n$, whereas for the full model $\mu$ is in a $p$-dimensional subspace of $R^n$.

In general, we let $\omega$ correspond to the full model with dimension $r$ and let $\omega_0$ be a $q$-dimensional linear subspace over which $\mu$ can range under the null hypothesis $H$; $0 \le q < r$.

We first consider the $\sigma^2$ known case and consider the likelihood ratio statistic

    $\lambda(y) = \frac{\sup\{p(y, \mu) : \mu \in \omega\}}{\sup\{p(y, \mu) : \mu \in \omega_0\}}$
for testing $H : \mu \in \omega_0$ versus $K : \mu \in \omega - \omega_0$. Because

    $p(y, \mu) = (2\pi\sigma^2)^{-n/2} \exp\{-\frac{1}{2\sigma^2} |y - \mu|^2\}$    (6.1.18)

then, by Theorem 6.1.4,

    $\lambda(Y) = \exp\{-\frac{1}{2\sigma^2} (|Y - \hat\mu|^2 - |Y - \hat\mu_0|^2)\}$

where $\hat\mu$ and $\hat\mu_0$ are the projections of $Y$ on $\omega$ and $\omega_0$, respectively.

But if we let $A_{n \times n}$ be an orthogonal matrix with rows $v_1^T, \dots, v_n^T$ such that $v_1, \dots, v_q$ span $\omega_0$ and $v_1, \dots, v_r$ span $\omega$ and set

    $U = AY$,  $\eta = A\mu$    (6.1.19)

then, by Theorem 6.1.2(v),

    $\lambda(Y) = \exp\{\frac{1}{2\sigma^2} \sum_{i=q+1}^r U_i^2\} = \exp\{\frac{1}{2\sigma^2} |\hat\mu - \hat\mu_0|^2\}$.    (6.1.20)

It follows that

    $2 \log \lambda(Y) = \sum_{i=q+1}^r (U_i/\sigma)^2$.

Note that $(U_i/\sigma)$ has a $N(\theta_i, 1)$ distribution with $\theta_i = \eta_i/\sigma$. In this case the distribution of $\sum_{i=q+1}^r (U_i/\sigma)^2$ is called a chi-square distribution with $r - q$ degrees of freedom and noncentrality parameter $\theta^2 = |\theta|^2 = \sum_{i=q+1}^r \theta_i^2$, where $\theta = (\theta_{q+1}, \dots, \theta_r)^T$ (see Problem B.3.12). We write $\chi^2_{r-q}(\theta^2)$ for this distribution. We have shown the following.

Proposition 6.1.1. In the Gaussian linear model with $\sigma^2$ known, $2 \log \lambda(Y)$ has a $\chi^2_{r-q}(\theta^2)$ distribution with

    $\theta^2 = \sigma^{-2} \sum_{i=q+1}^r \eta_i^2 = \sigma^{-2} |\mu - \mu_0|^2$    (6.1.21)

where $\mu_0$ is the projection of $\mu$ on $\omega_0$. In particular, when $H$ holds, $2 \log \lambda(Y) \sim \chi^2_{r-q}$.

Proof. We only need to establish the second equality in (6.1.21). Write $\eta = A\mu$ where $A$ is as defined in (6.1.19); then

    $\sum_{i=q+1}^r \eta_i^2 = |\mu - \mu_0|^2$. □
Next consider the case in which $\sigma^2$ is unknown. We know from Problem 6.1.1 that the MLEs of $\sigma^2$ for $\mu \in \omega$ and $\mu \in \omega_0$ are

    $\hat\sigma^2 = \frac{1}{n} |Y - \hat\mu|^2$  and  $\hat\sigma_0^2 = \frac{1}{n} |Y - \hat\mu_0|^2$,

respectively. Substituting $\hat\mu$, $\hat\mu_0$, $\hat\sigma^2$, and $\hat\sigma_0^2$ into the likelihood ratio statistic, we obtain

    $\lambda(y) = \frac{p(y, \hat\mu, \hat\sigma^2)}{p(y, \hat\mu_0, \hat\sigma_0^2)} = \left\{\frac{|y - \hat\mu_0|^2}{|y - \hat\mu|^2}\right\}^{n/2}$

where $p(y, \mu, \sigma^2)$ denotes the right-hand side of (6.1.18).

The resulting test is intuitive. It consists of rejecting $H$ when the fit, as measured by the residual sum of squares under the model specified by $H$, is poor compared to the fit under the general model.

For the purpose of finding critical values it is more convenient to work with a statistic equivalent to $\lambda(Y)$,

    $T = \frac{n - r}{r - q} \cdot \frac{|Y - \hat\mu_0|^2 - |Y - \hat\mu|^2}{|Y - \hat\mu|^2} = \frac{(r - q)^{-1} |\hat\mu - \hat\mu_0|^2}{(n - r)^{-1} |Y - \hat\mu|^2}$.    (6.1.22)

Because $T = (n - r)(r - q)^{-1} \{[\lambda(Y)]^{2/n} - 1\}$, $T$ is an increasing function of $\lambda(Y)$ and the two test statistics are equivalent. $T$ is called the F statistic for the general linear hypothesis.

We have seen in Proposition 6.1.1 that $\sigma^{-2} |\hat\mu - \hat\mu_0|^2$ has a $\chi^2_{r-q}(\theta^2)$ distribution with $\theta^2 = \sigma^{-2} |\mu - \mu_0|^2$. By the canonical representation (6.1.19), we can write $\sigma^{-2} |Y - \hat\mu|^2 = \sum_{i=r+1}^n (U_i/\sigma)^2$, which has a $\chi^2_{n-r}$ distribution and is independent of $\sigma^{-2} |\hat\mu - \hat\mu_0|^2 = \sum_{i=q+1}^r (U_i/\sigma)^2$. Thus, $T$ has the representation

    $T = \frac{(\text{noncentral } \chi^2_{r-q} \text{ variable})/\mathrm{df}}{(\text{central } \chi^2_{n-r} \text{ variable})/\mathrm{df}}$

with the numerator and denominator independent. The distribution of such a variable is called the noncentral F distribution with noncentrality parameter $\theta^2$ and $r - q$ and $n - r$ degrees of freedom (see Problem B.3.14). We write $\mathcal{F}_{k,m}(\theta^2)$ for this distribution where $k = r - q$ and $m = n - r$. We have shown the following.

Proposition 6.1.2. In the Gaussian linear model the F statistic defined by (6.1.22), which is equivalent to the likelihood ratio statistic for $H : \mu \in \omega_0$ versus $K : \mu \in \omega - \omega_0$, has the noncentral F distribution $\mathcal{F}_{r-q,n-r}(\theta^2)$ where $\theta^2 = \sigma^{-2} |\mu - \mu_0|^2$. In particular, when $H$ holds, $T$ has the (central) $\mathcal{F}_{r-q,n-r}$ distribution.

Remark 6.1.2. In Proposition 6.1.1 suppose the assumption "$\sigma^2$ is known" is replaced by "$\sigma^2$ is the same under $H$ and $K$ and estimated by the MLE $\hat\sigma^2$ for $\mu \in \omega$." In this case, it can be shown (Problem 6.1.5) that if we introduce the variance equal likelihood ratio statistic

    $\tilde\lambda(y) = \frac{\max\{p(y, \mu, \hat\sigma^2) : \mu \in \omega\}}{\max\{p(y, \mu, \hat\sigma^2) : \mu \in \omega_0\}}$    (6.1.23)

then $\tilde\lambda(Y)$ equals the likelihood ratio statistic for the $\sigma^2$ known case with $\sigma^2$ replaced by $\hat\sigma^2$. It follows that

    $2 \log \tilde\lambda(Y) = \frac{r - q}{(n - r)/n}\, T = \frac{\text{noncentral } \chi^2_{r-q}}{\text{central } \chi^2_{n-r}/n}$    (6.1.24)

where $T$ is the F statistic (6.1.22).

Remark 6.1.3. The canonical representation (6.1.19) made it possible to recognize the identity

    $|Y - \hat\mu_0|^2 = |Y - \hat\mu|^2 + |\hat\mu - \hat\mu_0|^2$,    (6.1.25)

which we exploited in the preceding derivations. This is the Pythagorean identity. See Figure 6.1.1 and Section B.10.

[Figure 6.1.1. The projections $\hat\mu$ and $\hat\mu_0$ of $Y$ on $\omega$ and $\omega_0$, and the Pythagorean identity.]

We next return to our examples.

Example 6.1.1. One Sample (continued). We test $H : \beta_1 = \mu_0$ versus $K : \beta_1 \ne \mu_0$. In this case $\omega_0 = \{\mu_0\}$, $q = 0$, $r = 1$, $|\hat\mu - \hat\mu_0|^2 = n(\bar Y - \mu_0)^2$, and

    $T = \frac{n(\bar Y - \mu_0)^2}{(n - 1)^{-1} \sum_{i=1}^n (Y_i - \bar Y)^2}$,

which we recognize as $t^2$, where $t$ is the one-sample Student $t$ statistic of Section 4.9.2. □
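The algebraic relations among the likelihood ratio statistic, the F statistic, and the Pythagorean identity can be checked on numbers. The sketch below assumes the one-sample setting ($r = 1$, $q = 0$) with a made-up sample and a made-up null value `mu0`; it only illustrates the identities and is not a full testing routine.

```python
# Hypothetical sample and null value (not from the text); one-sample case: r = 1, q = 0.
y = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
mu0 = 5.0
n, r, q = len(y), 1, 0

ybar = sum(y) / n

# Residual sums of squares under the full model and under H.
rss_full = sum((v - ybar) ** 2 for v in y)   # |Y - mu_hat|^2
rss_null = sum((v - mu0) ** 2 for v in y)    # |Y - mu_hat_0|^2

# |mu_hat - mu_hat_0|^2 for the one-sample model.
between = n * (ybar - mu0) ** 2

# Pythagorean identity: the gap should be zero up to rounding.
pythag_gap = rss_null - (rss_full + between)

# F statistic as a ratio of mean squares.
T = ((rss_null - rss_full) / (r - q)) / (rss_full / (n - r))

# Likelihood ratio statistic for unknown variance, and T recovered from it.
lam = (rss_null / rss_full) ** (n / 2)
T_from_lam = (n - r) / (r - q) * (lam ** (2 / n) - 1)

print(T, T_from_lam, pythag_gap)
```

Working with residual sums of squares rather than with the canonical coordinates $U_i$ avoids constructing the orthogonal matrix $A$ explicitly.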
Example 6.1.2. Regression (continued). We consider the possibility that a subset of $p - q$ covariates does not affect the mean response. Without loss of generality we ask whether the last $p - q$ covariates in multiple regression have an effect after fitting the first $q$. To formulate this question, we partition the design matrix $Z$ by writing it as $Z = (Z_1, Z_2)$ where $Z_1$ is $n \times q$ and $Z_2$ is $n \times (p - q)$, and we partition $\beta$ as $\beta^T = (\beta_1^T, \beta_2^T)$ where $\beta_2$ is a $(p - q) \times 1$ vector of main (e.g., treatment) effect coefficients and $\beta_1$ is a $q \times 1$ vector of "nuisance" (e.g., age, economic status) coefficients. Now the linear model can be written as

    $Y = Z_1 \beta_1 + Z_2 \beta_2 + \epsilon$.    (6.1.26)

We test $H : \beta_2 = 0$ versus $K : \beta_2 \ne 0$. In this case $\hat\beta = (Z^T Z)^{-1} Z^T Y$ and $\hat\beta_0 = (Z_1^T Z_1)^{-1} Z_1^T Y$ are the MLEs under the full model (6.1.26) and $H$, respectively. Using (6.1.22) we can write the F statistic version of the likelihood ratio test in the intuitive form

    $F = \frac{(RSS_H - RSS_F)/(df_H - df_F)}{RSS_F/df_F}$

where $RSS_F = |Y - \hat\mu|^2$ and $RSS_H = |Y - \hat\mu_0|^2$ are the residual sums of squares under the full model and $H$, respectively; and $df_F = n - p$ and $df_H = n - q$ are the corresponding degrees of freedom. The F test rejects $H$ if $F$ is large when compared to the $(1 - \alpha)$th quantile of the $\mathcal{F}_{p-q,n-p}$ distribution.

Under the alternative $F$ has a noncentral $\mathcal{F}_{p-q,n-p}(\theta^2)$ distribution with noncentrality parameter (Problem 6.1.7)

    $\theta^2 = \sigma^{-2} \beta_2^T \{Z_2^T Z_2 - Z_2^T Z_1 (Z_1^T Z_1)^{-1} Z_1^T Z_2\} \beta_2$.    (6.1.27)

In the special case that $Z_1^T Z_2 = 0$ so the variables in $Z_1$ are orthogonal to the variables in $Z_2$, $\theta^2$ simplifies to $\sigma^{-2} \beta_2^T (Z_2^T Z_2) \beta_2$, which only depends on the second set of variables and coefficients. However, in general $\theta^2$ depends on the sample correlations between the variables in $Z_1$ and those in $Z_2$. This issue is discussed further in Example 6.2.2. □

Example 6.1.3. The One-Way Layout (continued). Recall that the least squares estimates of $\beta_1, \dots, \beta_p$ are $Y_{1\cdot}, \dots, Y_{p\cdot}$. As we indicated earlier, we want to test $H : \beta_1 = \cdots = \beta_p$. Under $H$ all the observations have the same mean so that

    $\hat\mu_0 = (Y_{\cdot\cdot}, \dots, Y_{\cdot\cdot})^T$.

Thus,

    $|\hat\mu - \hat\mu_0|^2 = \sum_{k=1}^p \sum_{l=1}^{n_k} (Y_{k\cdot} - Y_{\cdot\cdot})^2 = \sum_{k=1}^p n_k (Y_{k\cdot} - Y_{\cdot\cdot})^2$.

Substituting in (6.1.22) we obtain the F statistic for the hypothesis $H$ in the one-way layout

    $T = \frac{n - p}{p - 1} \cdot \frac{\sum_{k=1}^p n_k (Y_{k\cdot} - Y_{\cdot\cdot})^2}{\sum_{k=1}^p \sum_{l=1}^{n_k} (Y_{kl} - Y_{k\cdot})^2}$.

When $H$ holds, $T$ has a $\mathcal{F}_{p-1,n-p}$ distribution. If the $\beta_i$ are not all equal, $T$ has a noncentral $\mathcal{F}_{p-1,n-p}$ distribution with noncentrality parameter

    $\theta^2 = \sigma^{-2} \sum_{k=1}^p n_k (\beta_k - \bar\beta)^2$,    (6.1.28)

where $\bar\beta = n^{-1} \sum_{i=1}^p n_i \beta_i$. To derive $\theta^2$, compute $\sigma^{-2} |\mu - \mu_0|^2$ for the vector $\mu = (\beta_1, \dots, \beta_1, \beta_2, \dots, \beta_2, \dots, \beta_p, \dots, \beta_p)^T$ and its projection $\mu_0 = (\bar\beta, \dots, \bar\beta)^T$.

There is an interesting way of looking at the pieces of information summarized by the F statistic. The sum of squares in the numerator,

    $SS_B = \sum_{k=1}^p n_k (Y_{k\cdot} - Y_{\cdot\cdot})^2$,

is a measure of variation between the $p$ samples $Y_{11}, \dots, Y_{1n_1}, \dots, Y_{p1}, \dots, Y_{pn_p}$. The sum of squares in the denominator,

    $SS_W = \sum_{k=1}^p \sum_{l=1}^{n_k} (Y_{kl} - Y_{k\cdot})^2$,

measures variation within the samples. If we define the total sum of squares as

    $SS_T = \sum_{k=1}^p \sum_{l=1}^{n_k} (Y_{kl} - Y_{\cdot\cdot})^2$,

which measures the variability of the pooled samples, then by the Pythagorean identity (6.1.25)

    $SS_T = SS_B + SS_W$.    (6.1.29)

Thus, we have a decomposition of the variability of the whole set of data, $SS_T$, the total sum of squares, into two constituent components, $SS_B$, the between groups (or treatment) sum of squares and $SS_W$, the within groups (or residual) sum of squares. $SS_T/\sigma^2$ is a (noncentral) $\chi^2$ variable with $(n - 1)$ degrees of freedom and noncentrality parameter $\theta^2$. Because $SS_B/\sigma^2$ and $SS_W/\sigma^2$ are independent $\chi^2$ variables with $(p - 1)$ and $(n - p)$ degrees of freedom, respectively, we see that the decomposition (6.1.29) can also be viewed stochastically, identifying $\theta^2$ and $(p - 1)$ degrees of freedom as "coming" from $SS_B/\sigma^2$ and the remaining $(n - p)$ of the $(n - 1)$ degrees of freedom of $SS_T/\sigma^2$ as "coming" from $SS_W/\sigma^2$.

This information as well as $SS_B/(p - 1)$ and $SS_W/(n - p)$, the unbiased estimates of $\theta^2$ and $\sigma^2$, and the F statistic, which is their ratio, are often summarized in what is known as an analysis of variance (ANOVA) table. See Tables 6.1.1 and 6.1.3.

TABLE 6.1.1. ANOVA table for the one-way layout

                    Sum of squares                                                        d.f.     Mean squares            F-value
    Between samples $SS_B = \sum_{k=1}^p n_k (Y_{k\cdot} - Y_{\cdot\cdot})^2$             $p - 1$  $MS_B = SS_B/(p - 1)$   $MS_B/MS_W$
    Within samples  $SS_W = \sum_{k=1}^p \sum_{l=1}^{n_k} (Y_{kl} - Y_{k\cdot})^2$        $n - p$  $MS_W = SS_W/(n - p)$
    Total           $SS_T = \sum_{k=1}^p \sum_{l=1}^{n_k} (Y_{kl} - Y_{\cdot\cdot})^2$    $n - 1$

As an illustration, consider the following data(1) giving blood cholesterol levels of men in three different socioeconomic groups labeled I, II, and III with I being the "high" end. We assume the one-way layout is valid. Note that this implies the possibly unrealistic assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions.

TABLE 6.1.2. Blood cholesterol levels

    I     403  311  269  336  259
    II    312  222  302  420  420  386  353  210  286  290
    III   403  244  353  235  319  260

We want to test whether there is a significant difference among the mean blood cholesterol of the three groups. Here $p = 3$, $n_1 = 5$, $n_2 = 10$, $n_3 = 6$, $n = 21$, and we compute

TABLE 6.1.3. ANOVA table for the cholesterol data

                      SS          d.f.    MS        F-value
    Between groups     1,202.5     2       601.2    0.126
    Within groups     85,750.5    18      4763.9
    Total             86,953.0    20
From $\mathcal{F}$ tables, we find that the p-value corresponding to the F-value 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups. □

Remark 6.1.4. Decompositions such as (6.1.29) of the response total sum of squares $SS_T$ into a variety of sums of squares measuring variability in the observations corresponding to variation of covariates are referred to as analysis of variance. They can be formulated in any linear model including regression models. See Scheffé (1959, pp. 42-45) and Weisberg (1985, p. 48). Originally such decompositions were used to motivate F statistics and to establish the distribution theory of the components via a device known as Cochran's theorem (Graybill, 1961, p. 86). Their principal use now is in the motivation of the convenient summaries of information we call ANOVA tables.
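The one-way ANOVA quantities for the cholesterol data can be recomputed directly from Table 6.1.2. The sketch below assumes the group assignment of the 21 values as read from that table; the closed-form tail probability used for the p-value is valid only because the numerator has 2 degrees of freedom, and the recomputed sums of squares may differ from the printed entries by about one unit.

```python
# Cholesterol data as read from Table 6.1.2 (group sizes 5, 10, 6).
groups = {
    "I": [403, 311, 269, 336, 259],
    "II": [312, 222, 302, 420, 420, 386, 353, 210, 286, 290],
    "III": [403, 244, 353, 235, 319, 260],
}

n = sum(len(g) for g in groups.values())
p = len(groups)
grand_mean = sum(sum(g) for g in groups.values()) / n

# Between, within, and total sums of squares.
ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
ss_w = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
ss_t = sum((v - grand_mean) ** 2 for g in groups.values() for v in g)

# Mean squares and the F statistic.
ms_b = ss_b / (p - 1)
ms_w = ss_w / (n - p)
F = ms_b / ms_w

# For numerator d.f. equal to 2, P(F_{2,m} > x) = (1 + 2x/m)^(-m/2).
m = n - p
p_value = (1.0 + 2.0 * F / m) ** (-m / 2.0)

print(round(ss_b, 1), round(ss_w, 1), round(ss_t, 1), round(F, 3), round(p_value, 2))
```

For a numerator with more than 2 degrees of freedom the p-value would need the incomplete beta function or a statistics library rather than this closed form.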
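Confidence Intervals and Regions

We next use our distributional results and the method of pivots to find confidence intervals for $\mu_i$, $1 \le i \le n$, $\beta_j$, $1 \le j \le p$, and in general, any linear combination

    $\psi = \psi(\mu) = \sum_{i=1}^n a_i \mu_i = a^T \mu$

of the $\mu$'s. If we set $\hat\psi = \sum_{i=1}^n a_i \hat\mu_i = a^T \hat\mu$ and

    $\sigma^2(\hat\psi) = \mathrm{Var}(\hat\psi) = a^T \mathrm{Var}(\hat\mu)\, a = \sigma^2 a^T H a$,

where $H$ is the hat matrix, then $(\hat\psi - \psi)/\sigma(\hat\psi)$ has a $N(0, 1)$ distribution. Moreover,

    $(n - r)s^2/\sigma^2 = |Y - \hat\mu|^2/\sigma^2 = \sum_{i=r+1}^n (U_i/\sigma)^2$

has a $\chi^2_{n-r}$ distribution and is independent of $\hat\psi$. Let

    $\hat\sigma(\hat\psi) = (s^2 a^T H a)^{1/2}$

be an estimate of the standard deviation $\sigma(\hat\psi)$ of $\hat\psi$. This estimated standard deviation is called the standard error of $\hat\psi$. By referring to the definition of the $t$ distribution, we find that the pivot

    $T(\psi) = \frac{(\hat\psi - \psi)/\sigma(\hat\psi)}{(s^2/\sigma^2)^{1/2}} = \frac{\hat\psi - \psi}{\hat\sigma(\hat\psi)}$

has a $\mathcal{T}_{n-r}$ distribution. Let $t_{n-r}(1 - \frac{1}{2}\alpha)$ denote the $1 - \frac{1}{2}\alpha$ quantile of the $\mathcal{T}_{n-r}$ distribution; then by solving $|T(\psi)| \le t_{n-r}(1 - \frac{1}{2}\alpha)$ for $\psi$, we find that

    $\psi = \hat\psi \pm t_{n-r}(1 - \frac{1}{2}\alpha)\, \hat\sigma(\hat\psi)$

is, in the Gaussian linear model, a $100(1 - \alpha)\%$ confidence interval for $\psi$.

Example 6.1.1. One Sample (continued). Consider $\psi = \mu$. We obtain the interval

    $\mu = \bar Y \pm t_{n-1}(1 - \frac{1}{2}\alpha)\, s/\sqrt{n}$,

which is the same as the interval of Example 4.4.1 and Section 4.9.2.

Example 6.1.2. Regression (continued). Assume that $p = r$. First consider $\psi = \beta_j$ for some specified regression coefficient $\beta_j$. The $100(1 - \alpha)\%$ confidence interval for $\beta_j$ is

    $\beta_j = \hat\beta_j \pm t_{n-p}(1 - \frac{1}{2}\alpha)\, s \{[(Z^T Z)^{-1}]_{jj}\}^{1/2}$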
where $[(Z^T Z)^{-1}]_{jj}$ is the $j$th diagonal element of $(Z^T Z)^{-1}$. Computer software computes $(Z^T Z)^{-1}$ and labels $s\{[(Z^T Z)^{-1}]_{jj}\}^{1/2}$ as the standard error of the (estimated) $j$th regression coefficient. Next consider $\psi = \mu_i$ = mean response for the $i$th case, $1 \le i \le n$. The level $(1 - \alpha)$ confidence interval is

    $\mu_i = \hat\mu_i \pm t_{n-p}(1 - \frac{1}{2}\alpha)\, s \sqrt{h_{ii}}$

where $h_{ii}$ is the $i$th diagonal element of the hat matrix $H$. Here $s\sqrt{h_{ii}}$ is called the standard error of the (estimated) mean of the $i$th case.

Next consider the special case in which $p = 2$ and

    $Y_i = \beta_1 + \beta_2 z_{i2} + \epsilon_i$,  $i = 1, \dots, n$.

If we use the identity

    $\sum_{i=1}^n (z_{i2} - z_{\cdot 2})(Y_i - \bar Y) = \sum_{i=1}^n (z_{i2} - z_{\cdot 2}) Y_i$,

we obtain from Example 2.2.2 that

    $\hat\beta_2 = \frac{\sum_{i=1}^n (z_{i2} - z_{\cdot 2}) Y_i}{\sum_{i=1}^n (z_{i2} - z_{\cdot 2})^2}$.    (6.1.30)

Because $\mathrm{Var}(Y_i) = \sigma^2$, we obtain

    $\mathrm{Var}(\hat\beta_2) = \sigma^2 \Big/ \sum_{i=1}^n (z_{i2} - z_{\cdot 2})^2$,

and the $100(1 - \alpha)\%$ confidence interval for $\beta_2$ has the form

    $\beta_2 = \hat\beta_2 \pm t_{n-p}(1 - \frac{1}{2}\alpha)\, s \Big/ \sqrt{\sum_{i=1}^n (z_{i2} - z_{\cdot 2})^2}$.

The confidence interval for $\beta_1$ is given in Problem 6.1.10.

Similarly, in the $p = 2$ case, it is straightforward (Problem 6.1.10) to compute

    $h_{ii} = \frac{1}{n} + \frac{(z_{i2} - z_{\cdot 2})^2}{\sum_{i=1}^n (z_{i2} - z_{\cdot 2})^2}$

and the confidence interval for the mean response $\mu_i$ of the $i$th case has a simple explicit form. □
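The $p = 2$ formulas for the slope, its standard error, and the leverages $h_{ii}$ translate directly into a few lines of code. The sketch below assumes a made-up data set with $n = 5$, and it takes the $t$ quantile as an input read from a table (3.182 for 3 degrees of freedom at the 0.975 level) rather than computing it; all names are illustrative.

```python
import math

# Hypothetical data (not from the text): model Y_i = beta1 + beta2 * z_i2 + eps_i.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, p = len(y), 2

zbar = sum(z) / n
ybar = sum(y) / n
szz = sum((v - zbar) ** 2 for v in z)

# Slope from the centered covariate; intercept from the line through (zbar, ybar).
beta2 = sum((v - zbar) * w for v, w in zip(z, y)) / szz
beta1 = ybar - beta2 * zbar

# s^2 = RSS / (n - p).
rss = sum((w - beta1 - beta2 * v) ** 2 for v, w in zip(z, y))
s = math.sqrt(rss / (n - p))

# Standard error of the slope, leverages h_ii, and standard errors of the fitted means.
se_beta2 = s / math.sqrt(szz)
h = [1.0 / n + (v - zbar) ** 2 / szz for v in z]
se_mean = [s * math.sqrt(hi) for hi in h]

# Assumed input: t quantile for n - p = 3 d.f. and alpha = 0.05, taken from a t table.
t_crit = 3.182
ci_beta2 = (beta2 - t_crit * se_beta2, beta2 + t_crit * se_beta2)
ci_mean = [(beta1 + beta2 * v - t_crit * se, beta1 + beta2 * v + t_crit * se)
           for v, se in zip(z, se_mean)]

print(beta2, se_beta2, ci_beta2)
```

The leverages sum to $p$, so cases far from the covariate mean take a larger share and get wider intervals for their mean response.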
1
,
i ,
.I
Example 6.1.3. One-Way Layout (continued). We consider $\psi = \beta_k$, $1 \le k \le p$. Because $\hat\beta_k = Y_{k\cdot} \sim N(\beta_k, \sigma^2/n_k)$, we find the $100(1 - \alpha)\%$ confidence interval

    $\beta_k = \hat\beta_k \pm t_{n-p}(1 - \frac{1}{2}\alpha)\, s/\sqrt{n_k}$

where $s^2 = SS_W/(n - p)$. The intervals for $\mu = \beta_\cdot$ and the incremental effect $\delta_k = \beta_k - \mu$ are given in Problem 6.1.11. □
Joint Confidence Regions

We have seen how to find confidence intervals for each individual $\beta_j$, $1 \le j \le p$. We next consider the problem of finding a confidence region $C$ in $R^p$ that covers the vector $\beta$ with prescribed probability $(1 - \alpha)$. This can be done by inverting the likelihood ratio test or equivalently the F test. That is, we let $C$ be the collection of $\beta_0$ that is accepted when the level $\alpha$ F test is used to test $H : \beta = \beta_0$. Under $H$, $\mu = \mu_0 = Z\beta_0$, and the numerator of the F statistic (6.1.22) is based on

    $|\hat\mu - \mu_0|^2 = |Z\hat\beta - Z\beta_0|^2 = (\hat\beta - \beta_0)^T (Z^T Z)(\hat\beta - \beta_0)$.

Thus, using (6.1.22), the simultaneous confidence region for $\beta$ is the ellipse

    $C = \left\{\beta_0 : \frac{(\hat\beta - \beta_0)^T (Z^T Z)(\hat\beta - \beta_0)}{r s^2} \le f_{r,n-r}(1 - \alpha)\right\}$    (6.1.31)

where $f_{r,n-r}(1 - \alpha)$ is the $1 - \alpha$ quantile of the $\mathcal{F}_{r,n-r}$ distribution.

Example 6.1.2. Regression (continued). We consider the case $p = r$ and as in (6.1.26) write $Z = (Z_1, Z_2)$ and $\beta^T = (\beta_1^T, \beta_2^T)$, where $\beta_2$ is a vector of main effect coefficients and $\beta_1$ is a vector of "nuisance" coefficients. Similarly, we partition $\hat\beta$ as $\hat\beta^T = (\hat\beta_1^T, \hat\beta_2^T)$ where $\hat\beta_1$ is $q \times 1$ and $\hat\beta_2$ is $(p - q) \times 1$. By Corollary 6.1.1, $\sigma^2 (Z^T Z)^{-1}$ is the variance-covariance matrix of $\hat\beta$. It follows that if we let $S$ denote the lower right $(p - q) \times (p - q)$ corner of $(Z^T Z)^{-1}$, then $\sigma^2 S$ is the variance-covariance matrix of $\hat\beta_2$. Thus, a joint $100(1 - \alpha)\%$ confidence region for $\beta_2$ is the $p - q$ dimensional ellipse

    $C = \left\{\beta_{02} : \frac{(\hat\beta_2 - \beta_{02})^T S^{-1} (\hat\beta_2 - \beta_{02})}{(p - q) s^2} \le f_{p-q,n-p}(1 - \alpha)\right\}$. □
Summary. We consider the classical Gaussian linear model in which the response $Y_i$ of the $i$th case in an experiment is expressed as a linear combination $\mu_i = \sum_{j=1}^p \beta_j z_{ij}$ of covariates plus an error $\epsilon_i$, where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients $\{\beta_j\}$, the response means $\{\mu_i\}$, and linear combinations of these.
6.2 ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS

In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for one-dimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
6.2.1 Estimating Equations

Our assumptions are as before save that everything is made a vector: $X_1, \dots, X_n$ are i.i.d. $P$ where $P \in \mathcal{Q}$, a model containing $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ such that

(i) $\Theta$ open $\subset R^p$.

(ii) Densities of $P_\theta$ are $p(\cdot, \theta)$, $\theta \in \Theta$.

The following result gives the general asymptotic behavior of the solution of estimating equations.

A0. $\Psi = (\psi_1, \dots, \psi_p)^T$ where $\psi_j = \partial\rho/\partial\theta_j$ is well defined and

    $\frac{1}{n} \sum_{i=1}^n \Psi(X_i, \hat\theta_n) = 0$.    (6.2.1)

A solution to (6.2.1) is called an estimating equation estimate or an M-estimate.

A1. The parameter $\theta(P)$ given by the solution of (the nonlinear system of $p$ equations in $p$ unknowns):

    $\int \Psi(x, \theta)\, dP(x) = 0$    (6.2.2)

is well defined on $\mathcal{Q}$ so that $\theta(P)$ is the unique solution of (6.2.2). Necessarily $\theta(P_\theta) = \theta$ because $\mathcal{Q} \supset \mathcal{P}$.

A2. $E_P |\Psi(X_1, \theta(P))|^2 < \infty$ where $|\cdot|$ is the Euclidean norm.

A3. $\psi_i(\cdot, \theta)$, $1 \le i \le p$, have first-order partials with respect to all coordinates and, using the notation of Section B.8,

    $E_P |D\Psi(X_1, \theta)| < \infty$

where

    $E_P D\Psi(X_1, \theta) = \left\| E_P \frac{\partial\psi_i}{\partial\theta_j}(X_1, \theta) \right\|_{p \times p}$

is nonsingular.

A4. $\sup\left\{\left|\frac{1}{n} \sum_{i=1}^n (D\Psi(X_i, t) - D\Psi(X_i, \theta(P)))\right| : |t - \theta(P)| \le \epsilon_n\right\} \xrightarrow{P} 0$ if $\epsilon_n \to 0$.

A5. $\hat\theta_n \xrightarrow{P} \theta(P)$ for all $P \in \mathcal{Q}$.
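Theorem 6.2.1. Under A0-A5 of this section

    $\hat\theta_n = \theta(P) + \frac{1}{n} \sum_{i=1}^n \tilde\Psi(X_i, \theta(P)) + o_p(n^{-1/2})$    (6.2.3)

where

    $\tilde\Psi(x, \theta(P)) = -[E_P D\Psi(X_1, \theta(P))]^{-1} \Psi(x, \theta(P))$.    (6.2.4)

Hence,

    $\mathcal{L}_P(\sqrt{n}(\hat\theta_n - \theta(P))) \to N_p(0, \Sigma(\Psi, P))$    (6.2.5)

where

    $\Sigma(\Psi, P) = J(\theta, P)\, E\Psi\Psi^T(X_1, \theta(P))\, J^T(\theta, P)$    (6.2.6)

and

    $J^{-1}(\theta, P) = -E_P D\Psi(X_1, \theta(P)) = \left\| -E_P \frac{\partial\psi_i}{\partial\theta_j}(X_1, \theta(P)) \right\|$.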
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,

    $-\frac{1}{n} \sum_{i=1}^n \Psi(X_i, \theta(P)) = \frac{1}{n} \sum_{i=1}^n D\Psi(X_i, \theta_n^*)(\hat\theta_n - \theta(P))$.    (6.2.7)

Note that the left-hand side of (6.2.7) is a $p \times 1$ vector, the right is the product of a $p \times p$ matrix and a $p \times 1$ vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular $p \times p$ matrices, when viewed as vectors, is an open subset of $R^{p^2}$, representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that with probability tending to 1, $\frac{1}{n} \sum_{i=1}^n D\Psi(X_i, \theta_n^*)$ is nonsingular.

Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of $\hat\theta_n$ is motivated by $\mathcal{P}$, the behavior in (6.2.3) is guaranteed for $P \in \mathcal{Q}$, which can include $P \notin \mathcal{P}$. In fact, typically $\mathcal{Q}$ is essentially the set of $P$'s for which $\theta(P)$ can be defined uniquely by (6.2.2).

We can again extend the assumptions of Section 5.4.2 to:

A6. If $l(\cdot, \theta)$ is differentiable,

    $E_\theta D\Psi(X_1, \theta) = -E_\theta \Psi(X_1, \theta)\, Dl(X_1, \theta) = -\mathrm{Cov}_\theta(\Psi(X_1, \theta), Dl(X_1, \theta))$    (6.2.8)

defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A6' extend to the multivariate case readily.

Note that consistency of $\hat\theta_n$ is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate $\theta_n^*$ will find a solution $\hat\theta_n$ of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
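The root-finding remark can be illustrated with Newton's method applied to a two-dimensional estimating equation. The sketch below assumes the particular choice $\Psi(x, \theta) = (x - \theta_1, (x - \theta_1)^2 - \theta_2)^T$, whose solution is the sample mean and the (divisor $n$) sample variance; this $\Psi$, the data, and the function names are illustrative assumptions, chosen so the answer can be checked in closed form.

```python
# Hypothetical sample (not from the text).
xs = [1.0, 2.0, 4.0, 7.0]


def psi_bar(theta, data):
    """Average of Psi(x, theta) = (x - t1, (x - t1)^2 - t2) over the sample."""
    t1, t2 = theta
    m = len(data)
    return [sum(x - t1 for x in data) / m,
            sum((x - t1) ** 2 - t2 for x in data) / m]


def dpsi_bar(theta, data):
    """Average Jacobian D Psi: rows index components of Psi, columns index theta."""
    t1, _ = theta
    m = len(data)
    return [[-1.0, 0.0],
            [-2.0 * sum(x - t1 for x in data) / m, -1.0]]


def newton(theta, data, steps=5):
    """Newton iteration theta <- theta - J^{-1} g for the 2 x 2 system g(theta) = 0."""
    for _ in range(steps):
        g = psi_bar(theta, data)
        J = dpsi_bar(theta, data)
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
        # Solve J * delta = -g by Cramer's rule.
        d1 = (-g[0] * J[1][1] + g[1] * J[0][1]) / det
        d2 = (-g[1] * J[0][0] + g[0] * J[1][0]) / det
        theta = [theta[0] + d1, theta[1] + d2]
    return theta


theta_hat = newton([0.0, 1.0], xs)
residual = psi_bar(theta_hat, xs)
print(theta_hat, residual)
```

For this $\Psi$ the iteration reaches the exact root after two steps from any starting point, because the first equation is linear in $\theta_1$; in general the starting value matters, which is why the text asks for a consistent initial estimate.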
6.2.2 Asymptotic Normality and Efficiency of the MLE

If we take $\rho(x, \theta) = -l(x, \theta) = -\log p(x, \theta)$, and $\Psi(x, \theta)$ obeys A0-A6, then (6.2.8) becomes

    $-\|E_\theta D^2 l(X_1, \theta)\| = E_\theta D^T l(X_1, \theta)\, Dl(X_1, \theta) = \mathrm{Var}_\theta Dl(X_1, \theta)$    (6.2.9)

where $\mathrm{Var}_\theta Dl(X_1, \theta)$ is the Fisher information matrix $I(\theta)$ introduced in Section 3.4. If $\rho : \Theta \to R$, $\Theta \subset R^d$, is a scalar function, the matrix $\left\|\frac{\partial^2 \rho}{\partial\theta_i \partial\theta_j}(\theta)\right\|$ is known as the Hessian or curvature matrix of the surface $\rho$. Thus, (6.2.9) states that the expected value of the Hessian of $l$ is the negative of the Fisher information.

We also can immediately state the generalization of Theorem 5.4.3.

Theorem 6.2.2. If A0-A6 hold for $\rho(x, \theta) = -\log p(x, \theta)$, then the MLE $\hat\theta_n$ satisfies

    $\hat\theta_n = \theta + \frac{1}{n} \sum_{i=1}^n I^{-1}(\theta)\, Dl(X_i, \theta) + o_p(n^{-1/2})$    (6.2.10)

so that

    $\mathcal{L}(\sqrt{n}(\hat\theta_n - \theta)) \to N(0, I^{-1}(\theta))$.    (6.2.11)

If $\bar\theta_n$ is a minimum contrast estimate with $\rho$ and $\psi$ satisfying A0-A6 and corresponding asymptotic variance matrix $\Sigma(\Psi, P_\theta)$, then

    $\Sigma(\Psi, P_\theta) \ge I^{-1}(\theta)$    (6.2.12)

in the sense of Theorem 3.4.4 with equality in (6.2.12) for $\theta = \theta_0$ iff, under $\theta_0$,

    $\bar\theta_n = \hat\theta_n + o_p(n^{-1/2})$.    (6.2.13)
Proof. The proofs of (6.2.10) and (6.2.11) parallel those of (5.4.33) and (5.4.34) exactly. The proof of (6.2.12) parallels that of Theorem 3.4.4. For completeness we give it. Note that by (6.2.6) and (6.2.8)

    $\Sigma(\Psi, P_\theta) = \mathrm{Cov}_\theta^{-1}(U, V)\, \mathrm{Var}_\theta(U)\, \mathrm{Cov}_\theta^{-1}(V, U)$    (6.2.14)

where $U = \Psi(X_1, \theta)$, $V = Dl(X_1, \theta)$. But by (B.10.8), for any $U$, $V$ with $\mathrm{Var}(U^T, V^T)^T$ nonsingular

    $\mathrm{Var}(V) \ge \mathrm{Cov}(U, V)\, \mathrm{Var}^{-1}(U)\, \mathrm{Cov}(V, U)$.    (6.2.15)

Taking inverses of both sides yields

    $I^{-1}(\theta) = \mathrm{Var}_\theta^{-1}(V) \le \Sigma(\Psi, \theta)$.    (6.2.16)

Equality holds in (6.2.15) by (B.10.2.3) iff for some $b = b(\theta)$

    $U = b + \mathrm{Cov}(U, V)\, \mathrm{Var}^{-1}(V)\, V$    (6.2.17)

with probability 1. This means in view of $E_\theta \Psi = E_\theta Dl = 0$ that

    $\Psi(X_1, \theta) = b(\theta)\, Dl(X_1, \theta)$.

In the case of identity in (6.2.16) we must have

    $-[E_\theta D\Psi(X_1, \theta)]^{-1} \Psi(X_1, \theta) = I^{-1}(\theta)\, Dl(X_1, \theta)$.    (6.2.18)

Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds. □
apxl, a T 8 n
~
We see that, by the theorem, the MLE Is efficient in the sense that for any has asymptotic bias o(n1/2) and asymptotic variance nlaT ]1(8)a, which is n.? larger than that of any competing minimum contrast estimate. Further any competitor 8 n such that aTO n has the same asymptotic behavior as a T 8 n for all a in fact agrees with On to ordern 1/2


A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6
on the asymptotic nonnality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example. Example 6.2.1. The Linear Model with Stochastic Covariates. Let Xi = (Zr, Yi)T. 1 < i < n, be ij.d. as X = (ZT, Y) T where Z is a p x 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:
(il
(6.2.19) where, is distributed as N(O, (),2) independent of Z and E(Z) Z, Y has aN (a + ZT [3, (),2) distrihution.
= O.
That is, given
(ii) The distribution Ho of Z is known with density h o and E(ZZT) is nonsingular.
The second assumption is unreasonable but easily dispensed with. It readily follows (Prohlem 6.2.6) that the MLE of [3 is given by (with probability 1) [3 = [Z(n}Z(n}]
T
~
lT
Zen} Y.
(6.2.20)
Here Zen) is the n x p matrix IIZij ~ Z.j II where z.j = ~ 1 Zij. We used subscripts (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Zen) = (Zl, .. , 1 Zn)T is referred to as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of a and ()'2 are
$$\hat\alpha = \bar Y - \sum_{j=1}^p Z_{\cdot j}\hat\beta_j, \qquad \hat\sigma^2 = \frac{1}{n}\,\bigl|Y - (\hat\alpha + Z_{(n)}\hat\beta)\bigr|^2. \tag{6.2.21}$$
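The formulas (6.2.20) and (6.2.21) can be sketched numerically as follows. This is only an illustration for $p = 2$ covariates, written with the standard library; the function name `fit_random_design` and the toy data are assumptions made for the example and do not come from the text.

```python
def fit_random_design(z, y):
    """Sketch of (6.2.20)-(6.2.21) for p = 2.

    z: list of (z1, z2) covariate rows, y: list of responses.
    Returns (alpha_hat, (beta1_hat, beta2_hat), sigma2_hat).
    """
    n = len(y)
    zbar = [sum(row[j] for row in z) / n for j in range(2)]
    ybar = sum(y) / n
    # Centered design matrix: rows of Z-tilde in (6.2.20).
    zc = [(row[0] - zbar[0], row[1] - zbar[1]) for row in z]
    # Entries of Z-tilde^T Z-tilde (2 x 2) and Z-tilde^T Y (2 x 1).
    a = sum(u * u for u, _ in zc)
    b = sum(u * v for u, v in zc)
    c = sum(v * v for _, v in zc)
    g1 = sum(u * yi for (u, _), yi in zip(zc, y))
    g2 = sum(v * yi for (_, v), yi in zip(zc, y))
    det = a * c - b * b  # nonzero unless the centered columns are collinear
    beta = ((c * g1 - b * g2) / det, (a * g2 - b * g1) / det)
    alpha = ybar - zbar[0] * beta[0] - zbar[1] * beta[1]
    resid = [yi - (alpha + row[0] * beta[0] + row[1] * beta[1])
             for row, yi in zip(z, y)]
    sigma2 = sum(e * e for e in resid) / n  # MLE divides by n, not n - p - 1
    return alpha, beta, sigma2


# Hypothetical noise-free data generated from y = 1 + 2*z1 - 3*z2.
z_rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y_vals = [1, 3, -2, 0, 2]
alpha_hat, beta_hat, sigma2_hat = fit_random_design(z_rows, y_vals)
```

Because the toy responses lie exactly on a plane, the sketch should recover the generating coefficients up to rounding and return a residual variance of essentially zero.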
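Note that although given $Z_1, \ldots, Z_n$, $\hat\beta$ is Gaussian, this is not true of the marginal distribution of $\hat\beta$. It is not hard to show that A0-A6 hold in this case because if $H_0$ has density $h_0$ and if $\theta$ denotes $(\alpha, \beta^T, \sigma^2)^T$, then

$$l(X, \theta) = -\frac{1}{2\sigma^2}[Y - (\alpha + Z^T\beta)]^2 - \frac{1}{2}(\log\sigma^2 + \log 2\pi) + \log h_0(Z)$$

$$Dl(X, \theta) = \left(\frac{\epsilon}{\sigma^2},\ \frac{Z^T\epsilon}{\sigma^2},\ \frac{1}{2\sigma^4}(\epsilon^2 - \sigma^2)\right)^T \tag{6.2.22}$$

and

$$I(\theta) = \begin{pmatrix} \sigma^{-2} & 0 & 0 \\ 0 & \sigma^{-2}E(ZZ^T) & 0 \\ 0 & 0 & \dfrac{1}{2\sigma^4} \end{pmatrix} \tag{6.2.23}$$

so that by Theorem 6.2.2

$$\mathcal{L}\bigl(\sqrt{n}(\hat\alpha - \alpha,\ \hat\beta - \beta,\ \hat\sigma^2 - \sigma^2)\bigr) \to N\bigl(0, \mathrm{diag}(\sigma^2,\ \sigma^2[E(ZZ^T)]^{-1},\ 2\sigma^4)\bigr). \tag{6.2.24}$$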
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of $H_0$ known plays no role in the limiting result for $\hat\alpha, \hat\beta, \hat\sigma^2$. Of course, these will only be the MLEs if $H_0$ depends only on parameters other than $(\alpha, \beta, \sigma^2)$. In this case we can estimate $E(ZZ^T)$ by $\frac{1}{n}\sum_{i=1}^n Z_iZ_i^T$ and give approximate confidence intervals for $\beta_j$, $j = 1, \ldots, p$.

An interesting feature of (6.2.23) is that because $I(\theta)$ is a block diagonal matrix so is $I^{-1}(\theta)$ and, consequently, $\hat\beta$ and $\hat\sigma^2$ are asymptotically independent. In the classical linear model of Section 6.1 where we perform inference conditionally given $Z_i = z_i$, $1 \le i \le n$, we have noted this is exactly true.

This is an example of the phenomenon of adaptation. If we knew $\sigma^2$, the MLE would still be $\hat\beta$ and its asymptotic variance optimal for this model. If we knew $\alpha$ and $\beta$, $\hat\sigma^2$ would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, $\hat\sigma^2$ would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model $\mathcal{P} = \{P_{(\theta, \eta)} : \theta \in \Theta, \eta \in \mathcal{E}\}$ we say we can estimate $\theta$ adaptively at $\eta_0$ if the asymptotic variance of the MLE $\hat\theta$ (or more generally, an efficient estimate of $\theta$) in the pair $(\hat\theta, \hat\eta)$ is the same as that of $\hat\theta(\eta_0)$, the efficient estimate for $\mathcal{P}_{\eta_0} = \{P_{(\theta, \eta_0)} : \theta \in \Theta\}$. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular consider estimating $\beta_1$ in the presence of $\alpha, (\beta_2, \ldots, \beta_p)$ with

(i) $\alpha, \beta_2, \ldots, \beta_p$ known.
(ii) $\beta$ arbitrary.

In case (i), we take, without loss of generality, $\alpha = \beta_2 = \cdots = \beta_p = 0$. Let $Z_i = (Z_{i1}, \ldots, Z_{ip})^T$; then the efficient estimate in case (i) is

$$\hat\beta_1^0 = \frac{\sum_{i=1}^n Z_{i1}Y_i}{\sum_{i=1}^n Z_{i1}^2} \tag{6.2.25}$$
with asymptotic variance $\sigma^2[EZ_1^2]^{-1}$. On the other hand, $\hat\beta_1$ is the first coordinate of $\hat\beta$ given by (6.2.20). Its asymptotic variance is the $(1,1)$ element of $\sigma^2[EZZ^T]^{-1}$, which is strictly bigger than $\sigma^2[EZ_1^2]^{-1}$ unless $[EZZ^T]^{-1}$ is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate $\beta_1$ adaptively if $\beta_2, \ldots, \beta_p$ are regarded as nuisance parameters. What is happening can be seen by a representation of $[\tilde Z_{(n)}^T\tilde Z_{(n)}]^{-1}\tilde Z_{(n)}^TY$ and $I^{11}(\theta)$ where $I^{-1}(\theta) \equiv \|I^{ij}(\theta)\|$. We claim that

$$\hat\beta_1 = \frac{\sum_{i=1}^n (Z_{i1} - \hat Z_i^{(1)})Y_i}{\sum_{i=1}^n (Z_{i1} - \hat Z_i^{(1)})^2} \tag{6.2.26}$$

where $\hat Z^{(1)}$ is the regression of $(Z_{11}, \ldots, Z_{n1})^T$ on the linear space spanned by $(Z_{1j}, \ldots, Z_{nj})^T$, $2 \le j \le p$. Similarly,

$$I^{11}(\theta) = \sigma^2 / E\bigl(Z_{11} - \Pi(Z_{11} \mid Z_{12}, \ldots, Z_{1p})\bigr)^2 \tag{6.2.27}$$

where $\Pi(Z_{11} \mid Z_{12}, \ldots, Z_{1p})$ is the projection of $Z_{11}$ on the linear span of $Z_{12}, \ldots, Z_{1p}$ (Problem 6.2.11). Thus, $\Pi(Z_{11} \mid Z_{12}, \ldots, Z_{1p}) = \sum_{j=2}^p a_j^*Z_{1j}$ where $(a_2^*, \ldots, a_p^*)$ minimizes $E(Z_{11} - \sum_{j=2}^p a_jZ_{1j})^2$ over $(a_2, \ldots, a_p) \in R^{p-1}$ (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing $\beta_2, \ldots, \beta_p$ when the variables $Z_2, \ldots, Z_p$ are in any way correlated with $Z_1$ and the price is measured by

$$\frac{[E(Z_{11} - \Pi(Z_{11} \mid Z_{12}, \ldots, Z_{1p}))^2]^{-1}}{[E(Z_{11}^2)]^{-1}} = \left(1 - \frac{E(\Pi(Z_{11} \mid Z_{12}, \ldots, Z_{1p}))^2}{E(Z_{11}^2)}\right)^{-1}. \tag{6.2.28}$$

In the extreme case of perfect collinearity the price is $\infty$ as it should be because $\beta_1$ then becomes unidentifiable. Thus, adaptation corresponds to the case where $(Z_2, \ldots, Z_p)$ have no value in predicting $Z_1$ linearly (see Section 1.4). Correspondingly, in the Gaussian linear model (6.1.3) conditional on the $Z_i$, $i = 1, \ldots, n$, $\hat\beta_1$ is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if $E(Z_{11} - \Pi(Z_{11} \mid Z_{12}, \ldots, Z_{1p}))^2 = 0$. $\Box$
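The price in (6.2.28) can be checked numerically in the simplest case $p = 2$. The sketch below is an illustration only: the function name `adaptation_price` and the second-moment values passed to it are assumptions, and it relies on the fact that for $p = 2$ the projection is $\Pi(Z_{11} \mid Z_{12}) = (E Z_{11}Z_{12} / E Z_{12}^2)\,Z_{12}$.

```python
def adaptation_price(m11, m12, m22):
    """Variance inflation for beta_1 when beta_2 is unknown (p = 2).

    m11, m12, m22 are the entries E(Z1^2), E(Z1 Z2), E(Z2^2) of E(Z Z^T).
    Returns the ratio computed two ways: from the (1,1) element of
    [E(Z Z^T)]^{-1}, and from the projection formula (6.2.28).
    """
    det = m11 * m22 - m12 * m12
    inv11 = m22 / det                    # (1,1) element of the inverse
    ratio_from_inverse = inv11 * m11     # divided by [E(Z1^2)]^{-1}
    proj_second_moment = m12 * m12 / m22  # E(Pi(Z1 | Z2))^2
    ratio_from_projection = 1.0 / (1.0 - proj_second_moment / m11)
    return ratio_from_inverse, ratio_from_projection


uncorrelated = adaptation_price(1.0, 0.0, 1.0)
correlated = adaptation_price(1.0, 0.5, 1.0)
```

With uncorrelated covariates both ratios equal 1 (adaptation is possible); with $E(Z_1Z_2) = 0.5$ and unit second moments both equal $4/3$, and the ratio grows without bound as the covariates approach collinearity.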

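Example 6.2.2. M Estimates Generated by Linear Models with General Error Structure. Suppose that the $\epsilon_i$ in (6.2.19) are i.i.d. but not necessarily Gaussian with density $\frac{1}{\sigma}f_0\left(\frac{x}{\sigma}\right)$, for instance,

$$f_0(x) = \frac{e^{-x}}{(1 + e^{-x})^2},$$

the logistic density. Such error densities have the often more realistic, heavier tails(1) than the Gaussian density. The estimates $\hat\beta_0, \hat\sigma_0$ now solve

$$\sum_{i=1}^n Z_{ij}\,\psi\left(\hat\sigma_0^{-1}\Bigl(Y_i - \sum_{k=1}^p Z_{ik}\hat\beta_{k0}\Bigr)\right) = 0, \quad 1 \le j \le p,$$

and

$$\sum_{i=1}^n \chi\left(\hat\sigma_0^{-1}\Bigl(Y_i - \sum_{k=1}^p Z_{ik}\hat\beta_{k0}\Bigr)\right) = 0,$$

where $\psi = -\dfrac{f_0'}{f_0}$, $\chi(y) = -\left(y\dfrac{f_0'}{f_0}(y) + 1\right)$, $\hat\beta_0 \equiv (\hat\beta_{10}, \ldots, \hat\beta_{p0})^T$. The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if

(i) $\log f_0$ is strictly concave, i.e., $f_0'/f_0$ is strictly decreasing.
(ii) $(\log f_0)''$ exists and is bounded.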
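Then, if further $f_0$ is symmetric about 0,

$$I(\theta) = \sigma^{-2}I(\beta^T, 1) = \sigma^{-2}\begin{pmatrix} c_1E(ZZ^T) & 0 \\ 0 & c_2 \end{pmatrix} \tag{6.2.29}$$

where $c_1 = \int \left(\dfrac{f_0'(x)}{f_0(x)}\right)^2 f_0(x)\,dx$, $c_2 = \int \left(x\dfrac{f_0'(x)}{f_0(x)} + 1\right)^2 f_0(x)\,dx$.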
Thus, $\hat\beta_0, \hat\sigma_0$ are optimal estimates of $\beta$ and $\sigma$ in the sense of Theorem 6.2.2 if $f_0$ is true.

Now suppose $f_0$ generating the estimates $\hat\beta_0$ and $\hat\sigma_0^2$ is symmetric and satisfies (i) and (ii) but the true error distribution has density $f$ possibly different from $f_0$. Under suitable conditions we can apply Theorem 6.2.1 with

$$\Psi(Z, Y, \beta, \sigma) = (\psi_1, \ldots, \psi_p, \psi_{p+1})^T(Z, Y, \beta, \sigma)$$

where

$$\psi_j(z, y, \beta, \sigma) = \frac{z_j}{\sigma}\,\psi\left(\frac{y - \sum_{k=1}^p z_k\beta_k}{\sigma}\right), \quad 1 \le j \le p,$$

$$\psi_{p+1}(z, y, \beta, \sigma) = \frac{1}{\sigma}\,\chi\left(\frac{y - \sum_{k=1}^p z_k\beta_k}{\sigma}\right) \tag{6.2.30}$$

to conclude that

$$\mathcal{L}\bigl(\sqrt{n}(\hat\beta_0 - \beta_0)\bigr) \to N(0, \Sigma(\Psi, P)), \qquad \mathcal{L}\bigl(\sqrt{n}(\hat\sigma_0 - \sigma_0)\bigr) \to N(0, \sigma^2(P))$$

where $\beta_0, \sigma_0$ solve

$$\int \Psi(z, y, \beta_0, \sigma_0)\,dP = 0 \tag{6.2.31}$$

and $\Sigma(\Psi, P)$ is as in (6.2.6). What is the relation between $\beta_0, \sigma_0$ and $\beta, \sigma$ given in the Gaussian model (6.2.19)? If $f_0$ is symmetric about 0 and the solution of (6.2.31) is unique, then $\beta_0 = \beta$. But $\sigma_0 = c(f_0)\sigma$ for some $c(f_0)$ typically different from one. Thus, $\hat\beta_0$ can be used for estimating $\beta$ although if the true distribution of the $\epsilon_i$ is $N(0, \sigma^2)$ it should perform less well than $\hat\beta$. On the other hand, $\hat\sigma_0$ is an estimate of $\sigma$ only if normalized by a constant depending on $f_0$. (See Problem 6.2.5.) These are issues of robustness, that is, to have a bounded sensitivity curve (Section 3.5, Problem 3.5.8), we may well wish to use a nonlinear bounded $\Psi = (\psi_1, \ldots, \psi_p)^T$ to estimate $\beta$ even though it is suboptimal when $\epsilon \sim N(0, \sigma^2)$, and to use a suitably normalized version of $\hat\sigma_0$ for the same purpose. One effective choice of $\psi_j$ is the Huber function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. $\Box$
Testing and Confidence Bounds

There are three principal approaches to testing hypotheses in multiparameter models: the likelihood ratio principle, Wald tests (a generalization of pivots), and Rao's tests. All of these will be developed in Section 6.3. The three approaches coincide asymptotically but differ substantially, in performance and computationally, for fixed $n$. Confidence regions that parallel the tests will also be developed in Section 6.3.

Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as $H : \theta_1 \le 0$ versus $K : \theta_1 > 0$ where $\theta_2, \ldots, \theta_p$ vary freely.

6.2.3 The Posterior Distribution in the Multiparameter Case

The asymptotic theory of the posterior distribution parallels that in the one-dimensional case exactly. We simply make $\theta$ a vector, and interpret $|\cdot|$ as the Euclidean norm in conditions A7 and A8. Using multivariate expansions as in B.8 we obtain

Theorem 6.2.3. If the multivariate versions of A0-A3, A4(a.s.), A5(a.s.) and A6-A8 hold then, if $\hat\theta$ denotes the MLE,

$$\mathcal{L}\bigl(\sqrt{n}(\theta - \hat\theta) \mid X_1, \ldots, X_n\bigr) \to N(0, I^{-1}(\theta)) \tag{6.2.32}$$

a.s. under $P_\theta$ for all $\theta$.

The consequences of Theorem 6.2.3 are the same as those of Theorem 5.5.2, the equivalence of Bayesian and frequentist optimality asymptotically. Again the two approaches differ at the second order when the prior begins to make a difference. See Schervish (1995) for some of the relevant calculations.

A new major issue that arises is computation. Although it is easy to write down the posterior density of $\theta$, $\pi(\theta)\prod_{i=1}^n p(X_i, \theta)$, up to the proportionality constant $\int_\Theta \pi(t)\prod_{i=1}^n p(X_i, t)\,dt$, the latter can pose a formidable problem if $p > 2$, say. The problem arises also when, as is usually the case, we are interested in the posterior distribution of some of the parameters, say $(\theta_1, \theta_2)$, because we then need to integrate out $(\theta_3, \ldots, \theta_p)$. The asymptotic theory we have developed permits approximation to these constants by the procedure used in deriving (5.5.19) (Laplace's method). We have implicitly done this in the calculations leading up to (5.5.19). This approach is refined in Kass, Kadane, and Tierney (1989). However, typically there is an attempt at "exact" calculation. A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems. These methods are beyond the scope of this volume but will be discussed briefly in Volume II.

Summary. We defined minimum contrast (MC) and M estimates in the case of $p$-dimensional parameters and established their convergence in law to a normal distribution. When the estimating equations defining the M estimates coincide with the likelihood equations, this result gives the asymptotic distribution of the MLE. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MC or M estimate if we know the correct model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ and use the MLE for this model. We use an example to introduce the concept of adaptation in which an estimate $\hat\theta$ is called adaptive for a model $\{P_{\theta, \eta} : \theta \in \Theta, \eta \in \mathcal{E}\}$ if the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$ has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of $\theta$ in the family of models $\{P_{\theta, \eta_0} : \theta \in \Theta\}$, $\eta_0$ specified. In linear regression, adaptive estimation of $\beta_1$ is possible iff $Z_1$ is uncorrelated with every linear function of $Z_2, \ldots, Z_p$. Another example deals with M estimates based on estimating equations generated by linear models with non-Gaussian error distribution. Finally we show that in the Bayesian framework where given $\theta$, $X_1, \ldots, X_n$ are i.i.d. $P_\theta$, if $\hat\theta$ denotes the MLE for $P_\theta$, then the posterior distribution of $\sqrt{n}(\theta - \hat\theta)$ converges a.s. under $P_\theta$ to the $N(0, I^{-1}(\theta))$ distribution.

6.3 LARGE SAMPLE TESTS AND CONFIDENCE REGIONS

In Section 6.1 we developed exact tests and confidence regions that are appropriate in regression and analysis of variance (ANOVA) situations when the responses are normally distributed. We shall show (see Section 6.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. However, we need methods for situations in which, as in the linear model, covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to be appropriate approximations. In these cases exact methods are typically not available, and we turn to asymptotic approximations to construct tests, confidence regions, and other methods of inference. We present three procedures that are used frequently: likelihood ratio, Wald and Rao large sample tests, and confidence procedures. These were treated for $\theta$ real in Section 5.4.4. In this section we will use the results of Section 6.2 to extend some of the results of Section 5.4.4 to vector-valued parameters.

6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic

In Sections 4.9 and 6.1 we considered the likelihood ratio test statistic,

$$\lambda(x) = \frac{\sup\{p(x, \theta) : \theta \in \Theta\}}{\sup\{p(x, \theta) : \theta \in \Theta_0\}}$$

for testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta_1$, $\Theta_1 = \Theta - \Theta_0$, and showed that in several statistical models involving normal distributions, $\lambda(x)$ simplified and produced intuitive tests whose critical values can be obtained from the Student $t$ and $\mathcal{F}$ distributions.

However, in many experimental situations in which the likelihood ratio test can be used to address important questions, the exact critical value is not available analytically. In such
cases we can turn to an approximation to the distribution of $\lambda(X)$ based on asymptotic theory, which is usually referred to as Wilks's theorem or approximation. Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations. Here is an example in which Wilks's approximation to $\mathcal{L}(\lambda(X))$ is useful:

Example 6.3.1. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. as $X$ where $X$ has the gamma, $\Gamma(\alpha, \beta)$, distribution with density

$$p(x, \theta) = \beta^\alpha x^{\alpha - 1}\exp\{-\beta x\}/\Gamma(\alpha); \quad x > 0;\ \alpha > 0,\ \beta > 0.$$

In Example 2.3.2 we showed that the MLE, $\hat\theta = (\hat\alpha, \hat\beta)$, exists and in Example 2.4.2 we showed how to find $\hat\theta$ as a nonexplicit solution of likelihood equations. Thus, the numerator of $\lambda(x)$ is available as $p(x, \hat\theta) = \prod_{i=1}^n p(x_i, \hat\theta)$. Suppose we want to test $H : \alpha = 1$ (exponential distribution) versus $K : \alpha \ne 1$. The MLE of $\beta$ under $H$ is readily seen from (2.3.5) to be $\hat\beta_0 = 1/\bar x$ and $p(x, 1, \hat\beta_0)$ is the denominator of the likelihood ratio statistic.

It remains to find the critical value. This is not available analytically. $\Box$

The approximation we shall give is based on the result "$2\log\lambda(X) \xrightarrow{\mathcal{L}} \chi^2_d$" for degrees of freedom $d$ to be specified later. We next give an example that can be viewed as the limiting situation for which the approximation is exact:

Example 6.3.2. The Gaussian Linear Model with Known Variance. Let $Y_1, \ldots, Y_n$ be independent with $Y_i \sim N(\mu_i, \sigma_0^2)$ where $\sigma_0$ is known. As in Section 6.1.3 we test whether $\mu = (\mu_1, \ldots, \mu_n)^T$ is a member of a $q$-dimensional linear subspace of $R^n$, $\omega_0$, versus the alternative that $\mu \in \omega - \omega_0$ where $\omega$ is an $r$-dimensional linear subspace of $R^n$ and $\omega \supset \omega_0$; and we transform to canonical form by setting

$$U = AY$$

where $A_{n \times n}$ is an orthogonal matrix with rows $v_1^T, \ldots, v_n^T$, such that $v_1, \ldots, v_q$ span $\omega_0$ and $v_1, \ldots, v_r$ span $\omega$.

Set $\theta_i = \eta_i/\sigma_0$, $i = 1, \ldots, r$ and $X_i = U_i/\sigma_0$, $i = 1, \ldots, n$. Then $X_i \sim N(\theta_i, 1)$, $i = 1, \ldots, r$ and $X_i \sim N(0, 1)$, $i = r + 1, \ldots, n$. Moreover, the hypothesis $H$ is equivalent to $H : \theta_{q+1} = \cdots = \theta_r = 0$. Using Section 6.1.3, we conclude that under $H$,

$$2\log\lambda(Y) = \sum_{i=q+1}^r X_i^2 \sim \chi^2_{r-q}.$$

Wilks's theorem states that, under regularity conditions, when testing whether a parameter vector is restricted to an open subset of $R^q$ or $R^r$, $q < r$, the $\chi^2_{r-q}$ distribution is an approximation to $\mathcal{L}(2\log\lambda(Y))$. In this $\sigma^2$ known example, Wilks's approximation is exact. $\Box$

We illustrate the remarkable fact that $\chi^2_{r-q}$ holds as an approximation to the null distribution of $2\log\lambda$ quite generally when the hypothesis is a nice $q$-dimensional submanifold of an $r$-dimensional parameter space with the following.
Example 6.3.3. The Gaussian Linear Model with Unknown Variance. If $Y_i$ are as in Example 6.3.2 but $\sigma^2$ is unknown then $\theta = (\mu, \sigma^2)$ ranges over an $(r+1)$-dimensional manifold whereas under $H$, $\theta$ ranges over a $(q+1)$-dimensional manifold. In Section 6.1.3, we derived

$$2\log\lambda(Y) = n\log\left(1 + \frac{\sum_{i=q+1}^r X_i^2}{\sum_{i=r+1}^n X_i^2}\right).$$

Apply Example 5.3.7 to $V_n = \sum_{i=q+1}^r X_i^2 \big/ n^{-1}\sum_{i=r+1}^n X_i^2$ and conclude that $V_n \xrightarrow{\mathcal{L}} \chi^2_{r-q}$. Finally apply Lemma 5.3.2 with $g(t) = \log(1 + t)$, $a_n = n$, $c = 0$ and conclude that $2\log\lambda(Y) \xrightarrow{\mathcal{L}} \chi^2_{r-q}$ also in the $\sigma^2$ unknown case. Note that for $\lambda(Y)$ defined in Remark 6.1.2, $2\log\lambda(Y) = V_n \xrightarrow{\mathcal{L}} \chi^2_{r-q}$ as well. $\Box$

Consider the general i.i.d. case with $X_1, \ldots, X_n$ a sample from $p(x, \theta)$, where $x \in \mathcal{X} \subset R^s$, and $\theta \in \Theta \subset R^r$. Write the log likelihood as

$$l_n(\theta) = \sum_{i=1}^n \log p(X_i, \theta).$$

We first consider the simple hypothesis $H : \theta = \theta_0$.

Theorem 6.3.1. Suppose the assumptions of Theorem 6.2.2 are satisfied. Then, under $H : \theta = \theta_0$,

$$2\log\lambda(X) = 2[l_n(\hat\theta_n) - l_n(\theta_0)] \xrightarrow{\mathcal{L}} \chi^2_r.$$

Proof. Because $\hat\theta_n$ solves the likelihood equation $D_\theta l_n(\theta) = 0$, where $D_\theta$ is the derivative with respect to $\theta$, an expansion of $l_n(\theta)$ about $\hat\theta_n$ evaluated at $\theta = \theta_0$ gives

$$2[l_n(\hat\theta_n) - l_n(\theta_0)] = n(\hat\theta_n - \theta_0)^T I_n(\theta_n^*)(\hat\theta_n - \theta_0) \tag{6.3.1}$$

for some $\theta_n^*$ with $|\theta_n^* - \hat\theta_n| \le |\hat\theta_n - \theta_0|$. Here

$$I_n(\theta) = \left\| -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta_k\,\partial\theta_j}\log p(X_i, \theta) \right\|_{r \times r}.$$

By Theorem 6.2.2, $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{\mathcal{L}} N(0, I^{-1}(\theta_0))$, where $I_{r \times r}(\theta)$ is the Fisher information matrix. Because

$$|\theta_n^* - \theta_0| \le |\theta_n^* - \hat\theta_n| + |\hat\theta_n - \theta_0| \le 2|\hat\theta_n - \theta_0|,$$

we can conclude, arguing from A3 and A4, that $I_n(\theta_n^*) \xrightarrow{P} EI_n(\theta_0) = I(\theta_0)$. Hence,

$$2[l_n(\hat\theta_n) - l_n(\theta_0)] \xrightarrow{\mathcal{L}} V^T I(\theta_0)V, \quad V \sim N(0, I^{-1}(\theta_0)). \tag{6.3.2}$$

The result follows because, by Corollary B.6.2, $V^T I(\theta_0)V \sim \chi^2_r$. $\Box$
As a consequence of the theorem, the test that rejects $H : \theta = \theta_0$ when

$$2\log\lambda(X) \ge x_r(1 - \alpha),$$

where $x_r(1 - \alpha)$ is the $1 - \alpha$ quantile of the $\chi^2_r$ distribution, has approximately level $\alpha$, and

$$\{\theta_0 : 2[l_n(\hat\theta_n) - l_n(\theta_0)] \le x_r(1 - \alpha)\} \tag{6.3.3}$$

is a confidence region for $\theta$ with approximate coverage probability $1 - \alpha$.

Next we turn to the more general hypothesis $H : \theta \in \Theta_0$, where $\Theta$ is open and $\Theta_0$ is the set of $\theta \in \Theta$ with $\theta_j = \theta_{0,j}$, $j = q + 1, \ldots, r$, and $\{\theta_{0,j}\}$ are specified values. Examples 6.3.1 and 6.3.2 illustrate such $\Theta_0$. We set $d = r - q$, $\theta^T = (\theta^{(1)}, \theta^{(2)})$, $\theta^{(1)} = (\theta_1, \ldots, \theta_q)^T$, $\theta^{(2)} = (\theta_{q+1}, \ldots, \theta_r)^T$, $\theta_0^{(2)} = (\theta_{0,q+1}, \ldots, \theta_{0,r})^T$.

Theorem 6.3.2. Suppose that the assumptions of Theorem 6.2.2 hold for $p(x, \theta)$, $\theta \in \Theta$. Let $\mathcal{P}_0$ be the model $\{P_\theta : \theta \in \Theta_0\}$ with corresponding parametrization $\theta^{(1)} = (\theta_1, \ldots, \theta_q)$. Suppose that $\hat\theta_0^{(1)}$ is the MLE of $\theta^{(1)}$ under $H$ and that $\hat\theta_0^{(1)}$ satisfies A6 for $\mathcal{P}_0$. Let $\hat\theta_{0,n}^T = (\hat\theta_0^{(1)}, \theta_0^{(2)})$. Then under $H : \theta \in \Theta_0$,

$$2\log\lambda(X) \equiv 2[l_n(\hat\theta_n) - l_n(\hat\theta_{0,n})] \xrightarrow{\mathcal{L}} \chi^2_{r-q}.$$

Proof. Let $\theta_0 \in \Theta_0$ and write

$$2\log\lambda(X) = 2[l_n(\hat\theta_n) - l_n(\theta_0)] - 2[l_n(\hat\theta_{0,n}) - l_n(\theta_0)]. \tag{6.3.4}$$

It is easy to see that A0-A6 for $\mathcal{P}$ imply A0-A5 for $\mathcal{P}_0$. By (6.2.10) and (6.3.1) applied to $\hat\theta_n$ and the corresponding argument applied to $\hat\theta_0^{(1)}$, $\hat\theta_{0,n}$ and (6.3.4),

$$2\log\lambda(X) = S^T(\theta_0)I^{-1}(\theta_0)S(\theta_0) - S_1^T(\theta_0)I_0^{-1}(\theta_0)S_1(\theta_0) + o_p(1) \tag{6.3.5}$$

where

$$S(\theta_0) = n^{-1/2}\sum_{i=1}^n Dl(X_i, \theta_0)$$

and $S = (S_1, S_2)^T$ where $S_1$ is the first $q$ coordinates of $S$. Furthermore,

$$I_0(\theta_0) = \mathrm{Var}_{\theta_0}S_1(\theta_0).$$

Make a change of parameter, for given true $\theta_0$ in $\Theta_0$,

$$\eta = M(\theta - \theta_0)$$

where, dropping the dependence on $\theta_0$,

$$M = PI^{1/2} \tag{6.3.6}$$

and $P$ is an orthogonal matrix such that, if $\Delta_0 \equiv \{\theta - \theta_0 : \theta \in \Theta_0\}$,

$$M\Delta_0 = \{\eta : \eta_{q+1} = \cdots = \eta_r = 0,\ \eta \in M\Theta\}.$$

Such $P$ exists by the argument given in Example 6.2.1 because $I^{1/2}\Delta_0$ is the intersection of a $q$-dimensional linear subspace of $R^r$ with $I^{1/2}\{\theta - \theta_0 : \theta \in \Theta\}$. Now write $D_\theta$ for differentiation with respect to $\theta$ and $D_\eta$ for differentiation with respect to $\eta$. Note that, by definition, $\lambda$ is invariant under reparametrization

$$\lambda(X) = \gamma(X) \tag{6.3.7}$$

where

$$\gamma(X) = \frac{\sup_\eta\{p(x, \theta_0 + M^{-1}\eta)\}}{\sup\{p(x, \theta_0 + M^{-1}\eta) : \theta_0 + M^{-1}\eta \in \Theta_0\}}$$

and from (B.8.13)

$$D_\eta l(x, \theta_0 + M^{-1}\eta) = [M^{-1}]^T D_\theta l(x, \theta). \tag{6.3.8}$$

We deduce from (6.3.6) and (6.3.8) that if

$$T(\eta) \equiv n^{-1/2}\sum_{i=1}^n D_\eta l(X_i, \theta_0 + M^{-1}\eta),$$

then

$$\mathrm{Var}\,T(0) = PI^{-1/2}\,I\,I^{-1/2}P^T = J. \tag{6.3.9}$$

Moreover, because in terms of $\eta$, $H$ is $\{\eta \in M\Theta : \eta_{q+1} = \cdots = \eta_r = 0\}$, then by applying (6.3.5) to $\gamma(X)$ we obtain

$$2\log\gamma(X) = T^T(0)T(0) - T_1^T(0)T_1(0) + o_p(1) = \sum_{i=1}^r T_i^2(0) - \sum_{i=1}^q T_i^2(0) + o_p(1) = \sum_{i=q+1}^r T_i^2(0) + o_p(1), \tag{6.3.10}$$

which has a limiting $\chi^2_{r-q}$ distribution by Slutsky's theorem because $T(0)$ has a limiting $N_r(0, J)$ distribution by (6.3.9). The result follows from (6.3.7). $\Box$

Note that this argument is simply an asymptotic version of the one given in Example 6.3.2.

Thus, under the conditions of Theorem 6.3.2, rejecting if $2\log\lambda(X) \ge x_{r-q}(1 - \alpha)$ is an asymptotically level $\alpha$ test of $H : \theta \in \Theta_0$. Of equal importance is that we obtain an asymptotic confidence region for $(\theta_{q+1}, \ldots, \theta_r)$, a piece of $\theta$, with $\theta_1, \ldots, \theta_q$ acting as nuisance parameters. This asymptotic level $1 - \alpha$ confidence region is

$$\{(\theta_{q+1}, \ldots, \theta_r) : 2[l_n(\hat\theta_n) - l_n(\hat\theta_{0,1}, \ldots, \hat\theta_{0,q}, \theta_{q+1}, \ldots, \theta_r)] \le x_{r-q}(1 - \alpha)\} \tag{6.3.11}$$
where $\hat\theta_{0,1}, \ldots, \hat\theta_{0,q}$ are the MLEs, themselves depending on $\theta_{q+1}, \ldots, \theta_r$, of $\theta_1, \ldots, \theta_q$ assuming that $\theta_{q+1}, \ldots, \theta_r$ are known.

More complicated linear hypotheses such as $H : \theta - \theta_0 \in \omega_0$ where $\omega_0$ is a linear space of dimension $q$ are also covered. We only need note that if $\omega_0$ is a linear space spanned by an orthogonal basis $v_1, \ldots, v_q$ and $v_{q+1}, \ldots, v_r$ are orthogonal to $\omega_0$ and $v_1, \ldots, v_r$ span $R^r$ then

$$\omega_0 = \{\theta : \theta^T v_j = 0,\ q + 1 \le j \le r\}. \tag{6.3.12}$$

The extension of Theorem 6.3.2 to this situation is easy and given in Problem 6.3.2.

The formulation of Theorem 6.3.2 is still inadequate for most applications. It can be extended as follows. Suppose $H$ is specified by: There exist $d$ functions, $g_j : \Theta \to R$, $q + 1 \le j \le r$, written as a vector $g$, such that $Dg(\theta)$ exists and is of rank $r - q$ at all $\theta \in \Theta$. Define $H : \theta \in \Theta_0$ with

$$\Theta_0 = \{\theta \in \Theta : g(\theta) = 0\}. \tag{6.3.13}$$

Evidently, Theorem 6.3.2 falls under this schema with $g_j(\theta) = \theta_j - \theta_{0,j}$, $q + 1 \le j \le r$. Examples such as testing for independence in contingency tables, which require the following general theorem, will appear in the next section.

Theorem 6.3.3. Suppose the assumptions of Theorem 6.3.2 and the previous conditions on $g$ hold. Suppose the MLE $\hat\theta_{0,n}$ under $H$ is consistent for all $\theta \in \Theta_0$. Then, if $\lambda(X)$ is the likelihood ratio statistic for $H : \theta \in \Theta_0$ given in (6.3.13), $2\log\lambda(X) \xrightarrow{\mathcal{L}} \chi^2_{r-q}$ under $H$.

The proof is sketched in Problems 6.3.2 and 6.3.3. The essential idea is that, if $\theta_0$ is true, $\lambda(X)$ behaves asymptotically like a test for $H : \theta \in \Theta_{00}$ where

$$\Theta_{00} = \{\theta \in \Theta : Dg(\theta_0)(\theta - \theta_0) = 0\}, \tag{6.3.14}$$

a hypothesis of the form (6.3.13).

Wilks's theorem depends critically on the fact that not only is $\Theta$ open but also that, if $\Theta_0$ is given as in (6.3.13), the set $\{(\theta_1, \ldots, \theta_q)^T : \theta \in \Theta\}$ is open in $R^q$. We need both properties because we need to analyze both the numerator and denominator of $\lambda(X)$. As an example of what can go wrong, let $(X_{i1}, X_{i2})$ be i.i.d. $N((\theta_1, \theta_2)^T, J)$, where $J$ is the $2 \times 2$ identity matrix and $\Theta_0 = \{\theta : \theta_1 + \theta_2 \le 1\}$. If $\theta_1 + \theta_2 = 1$,

$$\hat\theta_0 = \left(\frac{1}{2} + \frac{\bar X_{\cdot 1} - \bar X_{\cdot 2}}{2},\ \frac{1}{2} - \frac{\bar X_{\cdot 1} - \bar X_{\cdot 2}}{2}\right)$$

and $2\log\lambda(X) \xrightarrow{\mathcal{L}} \chi^2_1$, but if $\theta_1 + \theta_2 < 1$ clearly $2\log\lambda(X) = o_p(1)$. Here the dimension of $\Theta_0$ and $\Theta$ is the same but the boundary of $\Theta_0$ has lower dimension. More sophisticated examples are given in Problems 6.3.5 and 6.3.6.
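Example 6.3.1 left the critical value of the gamma likelihood ratio test open; with $r = 2$ and $q = 1$, Wilks's approximation suggests referring $2\log\lambda$ to $\chi^2_1$. The following sketch shows one way this might be carried out. It is an illustration under stated assumptions rather than the book's procedure: the function names, the search interval for $\alpha$, the use of a ternary search on the profile likelihood (which is unimodal in $\alpha$ when the observations are not all equal), and the toy data are all choices made here.

```python
import math


def gamma_profile_loglik(alpha, x):
    """Gamma log likelihood at beta = alpha / mean(x), the maximizer for fixed alpha."""
    n = len(x)
    xbar = sum(x) / n
    mean_log = sum(math.log(v) for v in x) / n
    beta = alpha / xbar
    return n * (alpha * math.log(beta) - math.lgamma(alpha)
                + (alpha - 1.0) * mean_log - beta * xbar)


def wilks_test_alpha_equals_one(x, lo=1e-3, hi=1e3, iters=200):
    """Return (alpha_hat, 2 log lambda, approximate chi-square(1) p-value)."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):  # ternary search on log(alpha)
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if gamma_profile_loglik(math.exp(m1), x) < gamma_profile_loglik(math.exp(m2), x):
            a = m1
        else:
            b = m2
    alpha_hat = math.exp((a + b) / 2.0)
    # Numerator uses the unrestricted MLE; denominator uses alpha = 1, beta = 1/xbar.
    stat = 2.0 * (gamma_profile_loglik(alpha_hat, x) - gamma_profile_loglik(1.0, x))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    p_value = math.erfc(math.sqrt(stat / 2.0))  # P(chi-square(1) > stat)
    return alpha_hat, stat, p_value


# Hypothetical sample that is far too regular to be exponential.
sample = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98]
alpha_hat, lr_stat, p_value = wilks_test_alpha_equals_one(sample)
```

The $\chi^2_1$ tail probability is computed from the identity $P(\chi^2_1 > s) = \mathrm{erfc}(\sqrt{s/2})$, so no statistical library is needed; for this nearly constant sample the statistic should far exceed the 0.95 quantile of $\chi^2_1$, which is about 3.84.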
6.3.2 Wald's and Rao's Large Sample Tests

The Wald Test

Suppose that the assumptions of Theorem 6.2.2 hold. Then

$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{\mathcal{L}} N(0, I^{-1}(\theta)) \text{ as } n \to \infty. \tag{6.3.15}$$

Because $I(\theta)$ is continuous in $\theta$ (Problem 6.3.10), it follows from Proposition B.7.1(a) that

$$I(\hat\theta_n) \xrightarrow{P} I(\theta) \text{ as } n \to \infty. \tag{6.3.16}$$

By Slutsky's theorem B.7.2, (6.3.15) and (6.3.16),

$$n(\hat\theta_n - \theta)^T I(\hat\theta_n)(\hat\theta_n - \theta) \xrightarrow{\mathcal{L}} V^T I(\theta)V, \quad V \sim N_r(0, I^{-1}(\theta))$$

where, according to Corollary B.6.2, $V^T I(\theta)V \sim \chi^2_r$. It follows that the Wald test that rejects $H : \theta = \theta_0$ in favor of $K : \theta \ne \theta_0$ when

$$W_n(\theta_0) = n(\hat\theta_n - \theta_0)^T I(\theta_0)(\hat\theta_n - \theta_0) \ge x_r(1 - \alpha)$$

has asymptotic level $\alpha$. More generally $I(\theta_0)$ can be replaced by any consistent estimate of $I(\theta_0)$, in particular $-\frac{1}{n}D^2l_n(\theta_0)$ or $I(\hat\theta_n)$ or $-\frac{1}{n}D^2l_n(\hat\theta_n)$. The last Hessian choice is favored because it is usually computed automatically with the MLE. It and $I(\hat\theta_n)$ also have the advantage that the confidence region one generates, $\{\theta : W_n(\theta) \le x_r(1 - \alpha)\}$, is an ellipsoid in $R^r$ easily interpretable and computable; see (6.1.31).

For the more general hypothesis $H : \theta \in \Theta_0$ we write the MLE for $\theta \in \Theta$ as $\hat\theta_n = (\hat\theta_n^{(1)}, \hat\theta_n^{(2)})$ where $\hat\theta_n^{(1)} = (\hat\theta_1, \ldots, \hat\theta_q)$ and $\hat\theta_n^{(2)} = (\hat\theta_{q+1}, \ldots, \hat\theta_r)$ and define the Wald statistic as

$$W_n(\theta_0^{(2)}) = n(\hat\theta_n^{(2)} - \theta_0^{(2)})^T[I^{22}(\hat\theta_n)]^{-1}(\hat\theta_n^{(2)} - \theta_0^{(2)}) \tag{6.3.17}$$

where $I^{22}(\theta)$ is the lower diagonal block of $I^{-1}(\theta)$ written as

$$I^{-1}(\theta) = \begin{pmatrix} I^{11}(\theta) & I^{12}(\theta) \\ I^{21}(\theta) & I^{22}(\theta) \end{pmatrix}$$

with diagonal blocks of dimension $q \times q$ and $d \times d$, respectively. More generally, $I^{22}(\hat\theta_n)$ is replaceable by any consistent estimate of $I^{22}(\theta)$, for instance, the lower diagonal block of the inverse of $-\frac{1}{n}D^2l_n(\hat\theta_n)$, the Hessian (Problem 6.3.9).

Theorem 6.3.4. Under the conditions of Theorem 6.2.2, if $H$ is true,

$$W_n(\theta_0^{(2)}) \xrightarrow{\mathcal{L}} \chi^2_{r-q}. \tag{6.3.18}$$

Proof. $I(\theta)$ continuous implies that $I^{-1}(\theta)$ is continuous and, hence, $I^{22}$ is continuous. But by Theorem 6.2.2, $\sqrt{n}(\hat\theta_n^{(2)} - \theta_0^{(2)}) \xrightarrow{\mathcal{L}} N_d(0, I^{22}(\theta_0))$ if $\theta_0 \in \Theta_0$ holds. Slutsky's theorem completes the proof. $\Box$
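The Wald statistic (6.3.17) can be sketched for a familiar special case. The example below is an illustration under stated assumptions, not taken from the text: it considers $X_i$ i.i.d. $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$, tests $H : \sigma^2 = \sigma_0^2$ with $\mu$ as the nuisance parameter ($q = 1$, $d = 1$), and uses the fact that for this model $I^{-1}(\theta) = \mathrm{diag}(\sigma^2, 2\sigma^4)$, so that $I^{22}(\theta) = 2\sigma^4$. The function name and data are hypothetical.

```python
import math


def wald_variance_test(x, sigma0_sq):
    """Wald test of H: sigma^2 = sigma0_sq in the N(mu, sigma^2) model.

    Returns (W_n, approximate chi-square(1) p-value).
    """
    n = len(x)
    mean = sum(x) / n
    sigma_hat_sq = sum((v - mean) ** 2 for v in x) / n  # MLE of sigma^2
    i22 = 2.0 * sigma_hat_sq ** 2  # lower diagonal block of I^{-1} at the MLE
    w = n * (sigma_hat_sq - sigma0_sq) ** 2 / i22
    p_value = math.erfc(math.sqrt(w / 2.0))  # P(chi-square(1) > w)
    return w, p_value


# Hypothetical data with sample mean 0 and MLE sigma_hat^2 = 1.
data = [-1.0, 1.0, -1.0, 1.0]
w_stat, w_pvalue = wald_variance_test(data, sigma0_sq=2.0)
w_null, p_null = wald_variance_test(data, sigma0_sq=1.0)
```

For these four points the statistic for $\sigma_0^2 = 2$ works out to $4(1 - 2)^2/2 = 2$, and it is exactly 0 when $\sigma_0^2$ equals the MLE. With $n = 4$ the chi-square approximation is of course not to be trusted; the data are only meant to make the arithmetic easy to follow.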
The Wald test, which rejects iff $W_n(\theta_0^{(2)}) \ge x_{r-q}(1 - \alpha)$, is, therefore, asymptotically level $\alpha$. What is not as evident is that, under $H$,

$$W_n(\theta_0^{(2)}) = 2\log\lambda(X) + o_p(1) \tag{6.3.19}$$

where $\lambda(X)$ is the LR statistic for $H : \theta \in \Theta_0$. The argument is sketched in Problem 6.3.9. Thus, the two tests are equivalent asymptotically.

The Wald test leads to the Wald confidence regions for $(\theta_{q+1}, \ldots, \theta_r)^T$ given by $\{\theta^{(2)} : W_n(\theta^{(2)}) \le x_{r-q}(1 - \alpha)\}$. These regions are ellipsoids in $R^d$. Although, as (6.3.19) indicates, the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large $n$, in practice they can be very different.

The Rao Score Test

For the simple hypothesis $H : \theta = \theta_0$, Rao's score test is based on the observation that, by the central limit theorem,

$$\sqrt{n}\,\psi_n(\theta_0) \xrightarrow{\mathcal{L}} N(0, I(\theta_0)) \tag{6.3.20}$$

where $\psi_n(\theta_0) = n^{-1}Dl_n(\theta_0)$ is the likelihood score vector. It follows from this and Corollary B.6.2 that under $H$, as $n \to \infty$,

$$R_n(\theta_0) = n\,\psi_n^T(\theta_0)I^{-1}(\theta_0)\psi_n(\theta_0) \xrightarrow{\mathcal{L}} \chi^2_r.$$

The test that rejects $H$ when $R_n(\theta_0) \ge x_r(1 - \alpha)$ is called the Rao score test. This test has the advantage that it can be carried out without computing the MLE, and the convergence $R_n(\theta_0) \xrightarrow{\mathcal{L}} \chi^2_r$ requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests.

The extension of the Rao test to $H : \theta \in \Theta_0$ runs as follows. Let

$$\Psi_n(\theta) = n^{-1}D_2l_n(\theta)$$

where $D_1l_n$ represents the $q \times 1$ gradient with respect to the first $q$ coordinates and $D_2l_n$ the $d \times 1$ gradient with respect to the last $d$. The Rao test is based on the statistic

$$R_n(\theta_0^{(2)}) \equiv n\,\Psi_n^T(\hat\theta_{0,n})\hat\Sigma^{-1}\Psi_n(\hat\theta_{0,n})$$

where $\hat\Sigma$ is a consistent estimate of $\Sigma(\theta_0)$, the asymptotic variance of $\sqrt{n}\,\Psi_n(\hat\theta_{0,n})$ under $H$. It can be shown that (Problem 6.3.8)

$$\Sigma(\theta_0) = I_{22}(\theta_0) - I_{21}(\theta_0)I_{11}^{-1}(\theta_0)I_{12}(\theta_0) \tag{6.3.21}$$

where $I_{11}$ is the upper left $q \times q$ block of the $r \times r$ information matrix $I(\theta_0)$, $I_{12}$ is the upper right $q \times d$ block, and so on. Furthermore (Problem 6.3.9), under A0-A6 and consistency of $\hat\theta_{0,n}$ under $H$, a consistent estimate of $\Sigma(\theta_0)$ is

$$n^{-1}\Bigl[-D_2^2l_n(\hat\theta_{0,n}) + D_{21}l_n(\hat\theta_{0,n})[D_1^2l_n(\hat\theta_{0,n})]^{-1}D_{12}l_n(\hat\theta_{0,n})\Bigr] \tag{6.3.22}$$
where $D_2^2$ is the $d \times d$ matrix of second partials of $l_n$ with respect to $\theta^{(2)}$, $D_{21}$ the $d \times q$ matrix of mixed second partials with respect to $\theta^{(2)}$, $\theta^{(1)}$, and so on.

Theorem 6.3.5. Under $H : \theta \in \Theta_0$ and the conditions A0-A5 of Theorem 6.2.2 but with A6 required only for $\mathcal{P}_0$,

$$R_n(\theta_0^{(2)}) \xrightarrow{\mathcal{L}} \chi^2_d.$$

The Rao large sample critical and confidence regions are $\{R_n(\theta_0^{(2)}) \ge x_d(1 - \alpha)\}$ and $\{\theta^{(2)} : R_n(\theta^{(2)}) < x_d(1 - \alpha)\}$.

The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under $H$. On the other hand, it shares the disadvantage of the Wald test that matrices need to be computed and inverted.

Power Behavior of the LR, Rao, and Wald Tests

It is possible as in the one-dimensional case to derive the asymptotic power for these tests for alternatives of the form $\theta_n = \theta_0 + \frac{\Delta}{\sqrt{n}}$ where $\theta_0 \in \Theta_0$. The analysis for $\Theta_0 = \{\theta_0\}$ is relatively easy. For instance, for the Wald test

$$n(\hat\theta_n - \theta_0)^T I(\hat\theta_n)(\hat\theta_n - \theta_0) = (\sqrt{n}(\hat\theta_n - \theta_n) + \Delta)^T I(\hat\theta_n)(\sqrt{n}(\hat\theta_n - \theta_n) + \Delta) \xrightarrow{\mathcal{L}} \chi^2_r(\Delta^T I(\theta_0)\Delta)$$

where $\chi^2_m(\gamma^2)$ is the noncentral chi square distribution with $m$ degrees of freedom and noncentrality parameter $\gamma^2$.

It may be shown that the equivalence (6.3.19) holds under $\theta_n$ and that the power behavior is unaffected and applies to all three tests.

Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score tests; see Rao (1973) for more on this.

Summary. We considered the problem of testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta - \Theta_0$ where $\Theta$ is an open subset of $R^r$ and $\Theta_0$ is the collection of $\theta \in \Theta$ with the last $r - q$ coordinates $\theta^{(2)}$ specified. We established Wilks's theorem, which states that if $\lambda(X)$ is the LR statistic, then, under regularity conditions, $2\log\lambda(X)$ has an asymptotic $\chi^2_{r-q}$ distribution under $H$. We also considered a quadratic form, called the Wald statistic, which measures the distance between the hypothesized value of $\theta^{(2)}$ and its MLE, and showed that this quadratic form has limiting distribution $\chi^2_{r-q}$. Finally, we introduced the Rao score test, which is based on a quadratic form in the gradient of the log likelihood. The asymptotic distribution of this quadratic form is also $\chi^2_{r-q}$.

6.4 LARGE SAMPLE METHODS FOR DISCRETE DATA

In this section we give a number of important applications of the general methods we have developed to inference for discrete data. In particular we shall discuss problems of
goodness-of-fit and special cases of log linear and generalized linear models (GLM), treated in more detail in Section 6.5.

6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's $\chi^2$ Test

As in Examples 1.6.7, 2.2.8, and 2.3.3, consider i.i.d. trials in which $X_i = j$ if the $i$th trial produces a result in the $j$th category, $j = 1, \ldots, k$. Let $\theta_j = P(X_i = j)$ be the probability of the $j$th category. Because $\theta_k = 1 - \sum_{j=1}^{k-1}\theta_j$, we consider the parameter $\theta = (\theta_1, \ldots, \theta_{k-1})^T$ and test the hypothesis $H : \theta_j = \theta_{0j}$ for specified $\theta_{0j}$, $j = 1, \ldots, k - 1$. Thus, we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution, or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory. In Example 2.2.8 we found the MLE $\hat\theta_j = N_j/n$, where $N_j = \sum_{i=1}^n 1\{X_i = j\}$. It follows that the large sample LR rejection region is

$$2\log\lambda(X) = 2\sum_{j=1}^k N_j\log(N_j/n\theta_{0j}) \ge x_{k-1}(1 - \alpha).$$

To find the Wald test, we need the information matrix $I = \|I_{ij}\|$. For $i, j = 1, \ldots, k - 1$, we find using (2.2.33) and (3.4.32) that

$$I_{ij} = \begin{cases} \dfrac{1}{\theta_j} + \dfrac{1}{\theta_k} & \text{if } i = j, \\[2mm] \dfrac{1}{\theta_k} & \text{if } i \ne j. \end{cases}$$

Thus, with $\theta_{0k} = 1 - \sum_{j=1}^{k-1}\theta_{0j}$, the Wald statistic is

$$W_n(\theta_0) = n\sum_{j=1}^{k-1}[\hat\theta_j - \theta_{0j}]^2/\theta_{0j} + n\sum_{j=1}^{k-1}\sum_{i=1}^{k-1}(\hat\theta_i - \theta_{0i})(\hat\theta_j - \theta_{0j})/\theta_{0k}.$$

The second term on the right is

$$\frac{n}{\theta_{0k}}\left[\sum_{j=1}^{k-1}(\hat\theta_j - \theta_{0j})\right]^2 = n(\hat\theta_k - \theta_{0k})^2/\theta_{0k}.$$

Thus,

$$W_n(\theta_0) = \sum_{j=1}^k (N_j - n\theta_{0j})^2/n\theta_{0j}.$$
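The term on the right is called Pearson's chi-square ($\chi^2$) statistic and is the statistic that is typically used for this multinomial testing problem. It is easily remembered as

$$\chi^2 = \mathrm{SUM}\,\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \tag{6.4.1}$$

where the sum is over categories and "expected" refers to the expected frequency $E_H(N_j)$. The general form (6.4.1) of Pearson's $\chi^2$ will reappear in other multinomial applications in this section.

To derive the Rao test, note that from Example 2.2.8,

$$\psi_n(\theta) = n^{-1}\left(\frac{\partial}{\partial\theta_1}l_n(\theta), \ldots, \frac{\partial}{\partial\theta_{k-1}}l_n(\theta)\right)^T$$

with

$$\frac{\partial}{\partial\theta_j}l_n(\theta) = n\left(\frac{\hat\theta_j}{\theta_j} - \frac{\hat\theta_k}{\theta_k}\right), \quad j = 1, \ldots, k - 1.$$

To find $I^{-1}$, we could invert $I$ or note that by (6.2.11), $I^{-1}(\theta) = \Sigma = n^{-1}\mathrm{Var}(N)$, where $N = (N_1, \ldots, N_{k-1})^T$ and, by A.13.15, $\Sigma = \|\sigma_{ij}\|_{(k-1) \times (k-1)}$ with

$$\sigma_{ii} = \theta_i(1 - \theta_i), \qquad \sigma_{ij} = -\theta_i\theta_j,\ i \ne j.$$

Thus, the Rao statistic is

$$R_n(\theta_0) = n\left[\sum_{j=1}^{k-1}\left(\frac{\hat\theta_j}{\theta_{0j}} - \frac{\hat\theta_k}{\theta_{0k}}\right)^2\theta_{0j} - \sum_{j=1}^{k-1}\sum_{i=1}^{k-1}\left(\frac{\hat\theta_i}{\theta_{0i}} - \frac{\hat\theta_k}{\theta_{0k}}\right)\left(\frac{\hat\theta_j}{\theta_{0j}} - \frac{\hat\theta_k}{\theta_{0k}}\right)\theta_{0i}\theta_{0j}\right]. \tag{6.4.2}$$

The second term on the right is

$$-n\left[\sum_{j=1}^{k-1}\left(\frac{\hat\theta_j}{\theta_{0j}} - \frac{\hat\theta_k}{\theta_{0k}}\right)\theta_{0j}\right]^2 = -n\left[\frac{\hat\theta_k}{\theta_{0k}} - 1\right]^2.$$

To simplify the first term on the right of (6.4.2), we write

$$\frac{\hat\theta_j}{\theta_{0j}} - \frac{\hat\theta_k}{\theta_{0k}} = \frac{[\theta_{0k}(\hat\theta_j - \theta_{0j})] - [\theta_{0j}(\hat\theta_k - \theta_{0k})]}{\theta_{0j}\theta_{0k}},$$

and expand the square keeping the square brackets intact. Then, because

$$\sum_{j=1}^{k-1}(\hat\theta_j - \theta_{0j}) = -(\hat\theta_k - \theta_{0k}),$$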
the first term on the right of (6.4.2) becomes

\[ n\left\{\sum_{j=1}^{k-1}\frac{(\hat\theta_j - \theta_{0j})^2}{\theta_{0j}} + \frac{2}{\theta_{0k}}(\hat\theta_k - \theta_{0k})^2 + \frac{1 - \theta_{0k}}{\theta_{0k}^2}(\hat\theta_k - \theta_{0k})^2\right\} = n\left\{\sum_{j=1}^{k}\frac{(\hat\theta_j - \theta_{0j})^2}{\theta_{0j}} + \frac{1}{\theta_{0k}^2}(\hat\theta_k - \theta_{0k})^2\right\}. \]

It follows that the Rao statistic equals Pearson's $\chi^2$.

Example 6.4.1. Testing a Genetic Theory. In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. Possible types of progeny were: (1) round yellow; (2) wrinkled yellow; (3) round green; and (4) wrinkled green. If we assume the seeds are produced independently, we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1, 2, 3, 4 as above and associated probabilities of occurrence $\theta_1, \theta_2, \theta_3, \theta_4$. Mendel's theory predicted that $\theta_1 = 9/16$, $\theta_2 = \theta_3 = 3/16$, $\theta_4 = 1/16$, and we want to test whether the distribution of types in the $n = 556$ trials he performed (seeds he observed) is consistent with his theory. Mendel observed $n_1 = 315$, $n_2 = 101$, $n_3 = 108$, $n_4 = 32$. Then, $n\theta_{10} = 312.75$, $n\theta_{20} = n\theta_{30} = 104.25$, $n\theta_{40} = 34.75$, $k = 4$, and

\[ \chi^2 = \frac{(2.25)^2}{312.75} + \frac{(3.25)^2}{104.25} + \frac{(3.75)^2}{104.25} + \frac{(2.75)^2}{34.75} = 0.47, \]

which has a p-value of 0.9 when referred to a $\chi^2_3$ table. There is insufficient evidence to reject Mendel's hypothesis. For comparison $2\log\lambda = 0.48$ in this case. However, this value may be too small! See Note 1.

6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables

Suppose $N = (N_1, \dots, N_k)^T$ has a multinomial, $\mathcal{M}(n, \theta)$, distribution. We will investigate how to test $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$, where $\Theta_0$ is a composite "smooth" subset of the $(k-1)$-dimensional parameter space

\[ \Theta = \left\{\theta : \theta_i \ge 0,\ 1 \le i \le k,\ \sum_{i=1}^{k}\theta_i = 1\right\}. \]

For example, in the Hardy-Weinberg model (Example 2.1.4),

\[ \Theta_0 = \{(\eta^2,\ 2\eta(1-\eta),\ (1-\eta)^2) : 0 \le \eta \le 1\}, \]

which is a one-dimensional curve in the two-dimensional parameter space $\Theta$. Here testing the adequacy of the Hardy-Weinberg model means testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta_1$ where $\Theta_1 = \Theta - \Theta_0$. Other examples, which will be pursued further later in this section, involve restrictions on the $\theta_i$ obtained by specifying independence assumptions on classifications of cases into different categories.

We suppose that we can describe $\Theta_0$ parametrically as

\[ \Theta_0 = \{(\theta_1(\eta), \dots, \theta_k(\eta))^T : \eta \in \mathcal{E}\}, \]

where $\eta = (\eta_1, \dots, \eta_q)^T$, $\mathcal{E}$ is a subset of $q$-dimensional space, and the map $\eta \to (\theta_1(\eta), \dots, \theta_k(\eta))^T$ takes $\mathcal{E}$ into $\Theta_0$. To avoid trivialities we assume $q < k - 1$.

Consider the likelihood ratio test for $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$. Let $p(n_1, \dots, n_k, \theta)$ denote the frequency function of $N$. Maximizing $p(n_1, \dots, n_k, \theta)$ for $\theta \in \Theta_0$ is the same as maximizing $p(n_1, \dots, n_k, \theta(\eta))$ for $\eta \in \mathcal{E}$. If a maximizing value, $\hat\eta = (\hat\eta_1, \dots, \hat\eta_q)$, exists, the log likelihood ratio is given by

\[ \log\lambda(n_1, \dots, n_k) = \sum_{i=1}^{k} n_i[\log(n_i/n) - \log\theta_i(\hat\eta)]. \]

If $\eta \to \theta(\eta)$ is differentiable in each coordinate, $\mathcal{E}$ is open, and $\hat\eta$ exists, then it must solve the likelihood equation for the model, $\{p(\cdot, \theta(\eta)) : \eta \in \mathcal{E}\}$. That is, $\hat\eta$ satisfies

\[ \frac{\partial}{\partial\eta_j}\log p(n_1, \dots, n_k, \theta(\eta)) = 0, \quad 1 \le j \le q, \]

or

\[ \sum_{i=1}^{k}\frac{n_i}{\theta_i(\eta)}\frac{\partial}{\partial\eta_j}\theta_i(\eta) = 0, \quad 1 \le j \le q. \tag{6.4.3} \]

If $\mathcal{E}$ is not open, sometimes the closure of $\mathcal{E}$ will contain a solution of (6.4.3).

To apply the results of Section 6.3 and conclude that, under $H$, $2\log\lambda$ approximately has a $\chi^2_{r-q}$ distribution for large $n$, we define $\theta_j' = g_j(\theta)$, $j = 1, \dots, r$, with $r = k - 1$, where $g_j$ is chosen so that $H$ becomes equivalent to "$(\theta_1', \dots, \theta_q')^T$ ranges over an open subset of $R^q$ and $\theta_j' = \theta_{0j}$, $j = q+1, \dots, r$ for specified $\theta_{0j}$." For instance, to test the Hardy-Weinberg model we set $\theta_1' = \theta_1$, $\theta_2' = \theta_2 - 2\sqrt{\theta_1}(1 - \sqrt{\theta_1})$ and test $H : \theta_2' = 0$. Then we can conclude from Theorem 6.3.3 that $2\log\lambda$ approximately has a $\chi^2_1$ distribution under $H$.

The Rao statistic is also invariant under reparametrization and, thus, approximately $\chi^2_{r-q}$. Moreover, we obtain the Rao statistic for the composite multinomial hypothesis by replacing $\theta_{0j}$ in (6.4.2) by $\theta_j(\hat\eta)$. The algebra showing $R_n(\theta_0) = \chi^2$ in Section 6.4.1 now leads to the Rao statistic

\[ R_n(\theta(\hat\eta)) = \sum_{j=1}^{k}\frac{[N_j - n\theta_j(\hat\eta)]^2}{n\theta_j(\hat\eta)} = \chi^2 \]

where the right-hand side is Pearson's $\chi^2$ as defined in general by (6.4.1).

The Wald statistic is only asymptotically invariant under reparametrization. However, the Wald statistic based on the parametrization $\theta(\eta)$ obtained by replacing $\theta_{0j}$ by $\theta_j(\hat\eta)$ is, by the algebra of Section 6.4.1, also equal to Pearson's $\chi^2$.
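The simple-hypothesis statistics of Section 6.4.1 are easy to check numerically. The following Python sketch is an illustration added here, not part of the original text; the function names `pearson_chi2` and `lr_statistic` are hypothetical. It evaluates Pearson's $\chi^2$ in the form (6.4.1) and the LR statistic $2\log\lambda$ for Mendel's counts from Example 6.4.1, assuming all hypothesized probabilities are positive.

```python
import math

def pearson_chi2(observed, probs):
    """Pearson's chi-square: sum of (Observed - Expected)^2 / Expected."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

def lr_statistic(observed, probs):
    """2 log lambda = 2 * sum N_j log(N_j / (n theta_0j)); empty cells contribute 0."""
    n = sum(observed)
    return 2.0 * sum(o * math.log(o / (n * p))
                     for o, p in zip(observed, probs) if o > 0)

# Mendel's data from Example 6.4.1 (n = 556 seeds, k = 4 categories).
mendel_counts = [315, 101, 108, 32]
mendel_probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]

chi2 = pearson_chi2(mendel_counts, mendel_probs)
lr = lr_statistic(mendel_counts, mendel_probs)
print(round(chi2, 2), round(lr, 2))  # prints 0.47 0.48, matching the example
```

Both values are far below the upper 5% point of the $\chi^2_3$ distribution (about 7.81), in line with the conclusion of the example. For a composite hypothesis the same function can be reused by passing the fitted probabilities $\theta_j(\hat\eta)$ in place of the $\theta_{0j}$, with the degrees of freedom reduced by $q$.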
Example 6.4.4. Hardy-Weinberg. We found in Example 2.2.6 that $\hat\eta = (2n_1 + n_2)/2n$. Thus, $H$ is rejected if $\chi^2 \ge x_1(1-\alpha)$ with

\[ \theta(\hat\eta) = \left(\left(\frac{2n_1 + n_2}{2n}\right)^2,\ \frac{(2n_1 + n_2)(2n_3 + n_2)}{2n^2},\ \left(\frac{2n_3 + n_2}{2n}\right)^2\right)^T. \]

Example 6.4.5. The Fisher Linkage Model. A self-crossing of maize heterozygous on two characteristics (starchy versus sugary; green base leaf versus white base leaf) leads to four possible offspring types: (1) sugary-white; (2) sugary-green; (3) starchy-white; (4) starchy-green. If $N_i$ is the number of offspring of type $i$ among a total of $n$ offspring, then $(N_1, \dots, N_4)$ has a $\mathcal{M}(n, \theta_1, \dots, \theta_4)$ distribution. A linkage model (Fisher, 1958, p. 301) specifies that

\[ \theta_1 = \tfrac14(2 + \eta), \quad \theta_2 = \theta_3 = \tfrac14(1 - \eta), \quad \theta_4 = \tfrac14\eta \]

where $\eta$ is an unknown number between 0 and 1. To test the validity of the linkage model we would take

\[ \Theta_0 = \left\{\left(\tfrac14(2 + \eta),\ \tfrac14(1 - \eta),\ \tfrac14(1 - \eta),\ \tfrac14\eta\right) : 0 \le \eta \le 1\right\}, \]

a "one-dimensional curve" of the three-dimensional parameter space $\Theta$.

The likelihood equation (6.4.3) becomes

\[ \frac{n_1}{2 + \hat\eta} - \frac{n_2 + n_3}{1 - \hat\eta} + \frac{n_4}{\hat\eta} = 0, \tag{6.4.4} \]

which reduces to a quadratic equation in $\hat\eta$. The only root of this equation in $[0, 1]$ is the desired estimate (see Problem 6.4.1). Because $q = 1$, $k = 4$, we obtain critical values from the $\chi^2_2$ tables.

Testing Independence of Classifications in Contingency Tables

Many important characteristics have only two categories. An individual either is or is not inoculated against a disease; is or is not a smoker; is male or female; and so on. We often want to know whether such characteristics are linked or are independent. For instance, do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic $A$ and $\bar A$ and of the second $B$ and $\bar B$. Then a randomly selected individual from the population can be one of four types $AB$, $A\bar B$, $\bar AB$, $\bar A\bar B$. Denote the probabilities of these types by $\theta_{11}, \theta_{12}, \theta_{21}, \theta_{22}$, respectively. Independent classification then means that the events [being an $A$] and [being a $B$] are independent or, in terms of the $\theta_{ij}$,

\[ \theta_{ij} = (\theta_{i1} + \theta_{i2})(\theta_{1j} + \theta_{2j}). \]

To study the relation between the two characteristics we take a random sample of size $n$ from the population. The results are assembled in what is called a $2 \times 2$ contingency table such as the one shown.

              B         B-bar
    A         N_11      N_12      R_1
    A-bar     N_21      N_22      R_2
              C_1       C_2       n

The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column. Thus, for example, $N_{12}$ is the number of sampled individuals who fall in category $A$ of the first characteristic and category $\bar B$ of the second characteristic. Then, if $N = (N_{11}, N_{12}, N_{21}, N_{22})^T$, we have $N \sim \mathcal{M}(n, \theta_{11}, \theta_{12}, \theta_{21}, \theta_{22})$. We test the hypothesis $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$, where $\Theta_0$ is a two-dimensional subset of $\Theta$ given by

\[ \Theta_0 = \{(\eta_1\eta_2,\ \eta_1(1 - \eta_2),\ \eta_2(1 - \eta_1),\ (1 - \eta_1)(1 - \eta_2)) : 0 \le \eta_1 \le 1,\ 0 \le \eta_2 \le 1\}. \]

Here we have relabeled $\theta_{11} + \theta_{12}$, $\theta_{11} + \theta_{21}$ as $\eta_1$, $\eta_2$ to indicate that these are parameters, which vary freely.

For $\theta \in \Theta_0$, the likelihood equations (6.4.3) become

\[ \frac{n_{11} + n_{12}}{\hat\eta_1} = \frac{n_{21} + n_{22}}{1 - \hat\eta_1}, \qquad \frac{n_{11} + n_{21}}{\hat\eta_2} = \frac{n_{12} + n_{22}}{1 - \hat\eta_2}, \tag{6.4.5} \]

whose solutions are

\[ \hat\eta_1 = (n_{11} + n_{12})/n, \qquad \hat\eta_2 = (n_{11} + n_{21})/n, \tag{6.4.6} \]

the proportions of individuals of type $A$ and type $B$, respectively. These solutions are the maximum likelihood estimates. Pearson's statistic is then easily seen to be

\[ \chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(N_{ij} - R_iC_j/n)^2}{R_iC_j/n}, \tag{6.4.7} \]

where $R_i = N_{i1} + N_{i2}$ is the $i$th row sum and $C_j = N_{1j} + N_{2j}$ is the $j$th column sum.

By our theory, if $H$ is true, because $k = 4$, $q = 2$, $\chi^2$ has approximately a $\chi^2_1$ distribution. This suggests that $\chi^2$ may be written as the square of a single (approximately) standard normal variable. In fact (Problem 6.4.2), the $(N_{ij} - R_iC_j/n)$ are all the same in absolute value and

\[ \chi^2 = Z^2 \]

where

\[ Z = \left(N_{11} - \frac{R_1C_1}{n}\right)\left[\sum_{i=1}^{2}\sum_{j=1}^{2}\left(\frac{R_iC_j}{n}\right)^{-1}\right]^{1/2}. \]
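The identity $\chi^2 = Z^2$ for the $2 \times 2$ table can be verified on any set of counts. The sketch below is an added illustration, not from the original text: the function name `two_by_two` and the counts are hypothetical, and it assumes all row and column sums are positive so that the expected frequencies $R_iC_j/n$ are nonzero.

```python
import math

def two_by_two(n11, n12, n21, n22):
    """Return (chi2, z) for a 2 x 2 table, following (6.4.7) and the definition of Z."""
    counts = ((n11, n12), (n21, n22))
    n = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)   # R_1, R_2
    cols = (n11 + n21, n12 + n22)   # C_1, C_2
    cells = [(i, j) for i in range(2) for j in range(2)]
    # Pearson's statistic: sum over cells of (N_ij - R_i C_j / n)^2 / (R_i C_j / n).
    chi2 = sum((counts[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i, j in cells)
    # Z = (N_11 - R_1 C_1 / n) * [sum over cells of (R_i C_j / n)^(-1)]^(1/2).
    scale = math.sqrt(sum(n / (rows[i] * cols[j]) for i, j in cells))
    z = (n11 - rows[0] * cols[0] / n) * scale
    return chi2, z

# Hypothetical counts, chosen only for illustration.
chi2, z = two_by_two(30, 20, 15, 35)
print(chi2, z, z ** 2)
```

The sign of `z` carries the direction of the departure from independence, which the squared statistic `chi2` discards.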
An important alternative form for $Z$ is given by

\[ Z = \sqrt{n}\,[\hat P(A \mid B) - \hat P(A \mid \bar B)]\left[\frac{\hat P(B)\hat P(\bar B)}{\hat P(A)\hat P(\bar A)}\right]^{1/2} \tag{6.4.8} \]

where $\hat P$ is the empirical distribution and where we use $A, B, \bar A, \bar B$ to denote the event that a randomly selected individual has characteristic $A, B, \bar A, \bar B$. Thus, if $\chi^2$ measures deviations from independence, $Z$ indicates what directions these deviations take. Positive values of $Z$ indicate that $A$ and $B$ are positively associated (i.e., that $A$ is more likely to occur in the presence of $B$ than it would in the presence of $\bar B$). It may be shown (Problem 6.4.3) that if $A$ and $B$ are independent, that is, $P(A \mid B) = P(A \mid \bar B)$, then $Z$ is approximately distributed as $N(0, 1)$. Therefore, it is reasonable to use the test that rejects if, and only if,

\[ Z \ge z(1 - \alpha) \]

as a level $\alpha$ one-sided test of $H : P(A \mid B) = P(A \mid \bar B)$ (or $P(A \mid B) \le P(A \mid \bar B)$) versus $K : P(A \mid B) > P(A \mid \bar B)$. The $\chi^2$ test is equivalent to rejecting (two-sidedly) if, and only if,

\[ |Z| \ge z\left(1 - \frac{\alpha}{2}\right). \]

Next we consider contingency tables for two nonnumerical characteristics having $a$ and $b$ states, respectively, $a, b \ge 2$ (e.g., eye color, hair color). If we take a sample of size $n$ from a population and classify them according to each characteristic we obtain a vector $N_{ij}$, $i = 1, \dots, a$, $j = 1, \dots, b$, where $N_{ij}$ is the number of individuals of type $i$ for characteristic 1 and $j$ for characteristic 2. If $\theta_{ij} = P$[A randomly selected individual is of type $i$ for 1 and $j$ for 2], then

\[ \{N_{ij} : 1 \le i \le a,\ 1 \le j \le b\} \sim \mathcal{M}(n, \theta_{ij} : 1 \le i \le a,\ 1 \le j \le b). \]

The hypothesis that the characteristics are assigned independently becomes $H : \theta_{ij} = \eta_{i1}\eta_{j2}$ for $1 \le i \le a$, $1 \le j \le b$, where the $\eta_{i1}$, $\eta_{j2}$ are nonnegative and $\sum_{i=1}^{a}\eta_{i1} = \sum_{j=1}^{b}\eta_{j2} = 1$.

The $N_{ij}$ can be arranged in an $a \times b$ contingency table,

              1         2         ...       b
    1         N_11      N_12      ...       N_1b      R_1
    .         .         .                   .         .
    a         N_a1      .         ...       N_ab      R_a
              C_1       C_2       ...       C_b       n

with row and column sums as indicated. Maximum likelihood and dimensionality calculations similar to those for the $2 \times 2$ table show that Pearson's $\chi^2$ for the hypothesis of independence is given by

\[ \chi^2 = \sum_{i=1}^{a}\sum_{j=1}^{b}\frac{(N_{ij} - R_iC_j/n)^2}{R_iC_j/n}, \tag{6.4.9} \]

which has approximately a $\chi^2_{(a-1)(b-1)}$ distribution under $H$. The argument is left to the problems as are some numerical applications.

6.4.3 Logistic Regression for Binary Responses

In Section 6.1 we considered linear models that are appropriate for analyzing continuous responses $\{Y_i\}$ that are, perhaps after a transformation, approximately normally distributed and whose means are modeled as $\mu_i = \sum_{j=1}^{p} z_{ij}\beta_j = z_i^T\beta$ for known constants $\{z_{ij}\}$ and unknown parameters $\beta_1, \dots, \beta_p$. In this section we will consider Bernoulli responses $Y$ that can only take on the values 0 and 1. Examples are (1) medical trials where at the end of the trial the patient has either recovered ($Y = 1$) or has not recovered ($Y = 0$), (2) election polls where a voter either supports a proposition ($Y = 1$) or does not ($Y = 0$), or (3) market research where a potential customer either desires a new product ($Y = 1$) or does not ($Y = 0$). As is typical, we call $Y = 1$ a "success" and $Y = 0$ a "failure."

We assume that the distribution of the response $Y$ depends on the known covariate vector $z^T$. In this section we assume that the data are grouped or replicated so that for each fixed $i$, we observe the number of successes $X_i = \sum_{j=1}^{m_i} Y_{ij}$ where $Y_{ij}$ is the response on the $j$th of the $m_i$ trials in block $i$, $1 \le i \le k$. Thus, we observe independent $X_1, \dots, X_k$ with $X_i$ binomial, $\mathcal{B}(m_i, \pi_i)$, where $\pi_i = \pi(z_i)$ is the probability of success for a case with covariate vector $z_i$. Next we choose a parametric model for $\pi(z)$ that will generate useful procedures for analyzing experiments with binary responses. Because $\pi(z)$ varies between 0 and 1, a simple linear representation $z^T\beta$ for $\pi(\cdot)$ over the whole range of $z$ is impossible. Instead we turn to the logistic transform $g(\pi)$, usually called the logit, which we introduced in Example 1.6.8 as the canonical parameter

\[ \eta = g(\pi) = \log[\pi/(1 - \pi)]. \tag{6.4.10} \]

Other transforms, such as the probit $g_1(\pi) = \Phi^{-1}(\pi)$, where $\Phi$ is the $N(0, 1)$ d.f., and the log-log transform $g_2(\pi) = \log[-\log(1 - \pi)]$ are also used in practice.

The log likelihood of $\pi = (\pi_1, \dots, \pi_k)^T$ based on $X = (X_1, \dots, X_k)^T$ is

\[ \sum_{i=1}^{k}\left[X_i\log\left(\frac{\pi_i}{1 - \pi_i}\right) + m_i\log(1 - \pi_i)\right] + \sum_{i=1}^{k}\log\binom{m_i}{X_i}. \tag{6.4.11} \]

When we use the logit transform $g(\pi)$, we obtain what is called the logistic linear regression model where

\[ \eta_i = \log[\pi_i(1 - \pi_i)^{-1}] = z_i^T\beta. \]
The special case $p = 2$, $z_i = (1, z_i)^T$ is the logistic regression model of Problem 2.3.1.

The log likelihood $l(\pi(\beta)) \equiv l_N(\beta)$ of $\beta = (\beta_1, \dots, \beta_p)^T$ is, if $N = \sum_{i=1}^{k} m_i$,

\[ l_N(\beta) = \sum_{j=1}^{p}\beta_jT_j - \sum_{i=1}^{k} m_i\log(1 + \exp\{z_i^T\beta\}) + \sum_{i=1}^{k}\log\binom{m_i}{X_i} \tag{6.4.12} \]

where $T_j = \sum_{i=1}^{k} z_{ij}X_i$ and we make the dependence on $N$ explicit. Note that $l_N(\beta)$ is the log likelihood of a $p$-parameter canonical exponential model with parameter vector $\beta$ and sufficient statistic $T = (T_1, \dots, T_p)^T$. It follows that the MLE of $\beta$ solves $E_\beta(T_j) = T_j$, $j = 1, \dots, p$, or $E_\beta(Z^TX) = Z^TX$, where $Z = \|z_{ij}\|_{k \times p}$ is the design matrix. Thus, Theorem 2.3.1 applies and we can conclude that if $0 < X_i < m_i$ and $Z$ has rank $p$, the solution to this equation exists and gives the unique MLE $\hat\beta$ of $\beta$. The condition is sufficient but not necessary for existence; see Problem 2.3.1. We let $\mu_i = E(X_i) = m_i\pi_i$. Then $E(T_j) = \sum_{i=1}^{k} z_{ij}\mu_i$, and the likelihood equations are just

\[ Z^T(X - \mu) = 0. \tag{6.4.13} \]

By Theorem 2.3.1 and by Proposition 3.4.4 the Fisher information matrix is

\[ I(\beta) = Z^TWZ \tag{6.4.14} \]

where $W = \mathrm{diag}\{m_i\pi_i(1 - \pi_i)\}_{k \times k}$. The coordinate ascent iterative procedure of Section 2.4.2 can be used to compute the MLE of $\beta$.

Alternatively, with a good initial value the Newton-Raphson algorithm can be employed. Although, unlike coordinate ascent, Newton-Raphson need not converge, we can guarantee convergence with probability tending to 1 as $N \to \infty$ as follows. As the initial estimate use

\[ \hat\beta_0 = (Z^TZ)^{-1}Z^TV \tag{6.4.15} \]

where $V = (V_1, \dots, V_k)^T$ with

\[ V_i = \log\left(\frac{X_i + \frac12}{m_i - X_i + \frac12}\right), \tag{6.4.16} \]

the empirical logistic transform. Because $\beta = (Z^TZ)^{-1}Z^T\eta$ and $\eta_i = \log[\pi_i(1 - \pi_i)^{-1}]$, $\hat\beta_0$ is a plug-in estimate of $\beta$ where $\pi_i$ and $(1 - \pi_i)$ in $\eta_i$ have been replaced by

\[ \pi_i^* = \frac{X_i}{m_i} + \frac{1}{2m_i}, \qquad (1 - \pi_i)^* = 1 - \frac{X_i}{m_i} + \frac{1}{2m_i}. \]

Here the adjustment $1/2m_i$ is used to avoid $\log 0$ and $\log\infty$. Similarly, in (6.4.14), $W$ is estimated using

\[ \hat W_0 = \mathrm{diag}\{m_i\pi_i^*(1 - \pi_i)^*\}. \]

Using the $\delta$-method, it follows by Theorem 5.3.3 that if $m_i \to \infty$ and $\pi_i > 0$ for $1 \le i \le k$,

\[ m_i^{1/2}\left(V_i - \log\left(\frac{\pi_i}{1 - \pi_i}\right)\right) \xrightarrow{\ \mathcal{L}\ } N\left(0, [\pi_i(1 - \pi_i)]^{-1}\right). \]
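The Newton-Raphson iteration started from the empirical logistic transform can be sketched in a few lines. The code below is an added illustration under stated assumptions, not the authors' implementation: the function name `fit_grouped_logistic`, the stopping rule, and the dose-response data are hypothetical, and NumPy is assumed to be available. Each step solves $Z^TWZ\,\Delta = Z^T(X - \mu)$, using the score from (6.4.13) and the information from (6.4.14), with the start (6.4.15) and (6.4.16).

```python
import numpy as np

def fit_grouped_logistic(Z, x, m, n_iter=25, tol=1e-10):
    """Newton-Raphson for grouped binomial data X_i ~ B(m_i, pi_i), logit(pi_i) = z_i^T beta."""
    Z = np.asarray(Z, dtype=float)
    x = np.asarray(x, dtype=float)
    m = np.asarray(m, dtype=float)
    # Empirical logistic transform V_i, eq. (6.4.16); the 1/2 avoids log 0.
    v = np.log((x + 0.5) / (m - x + 0.5))
    # Initial estimate (Z^T Z)^{-1} Z^T V, eq. (6.4.15).
    beta = np.linalg.solve(Z.T @ Z, Z.T @ v)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(Z @ beta)))   # inverse logit
        mu = m * pi                               # mu_i = m_i pi_i
        w = m * pi * (1.0 - pi)                   # diagonal of W
        score = Z.T @ (x - mu)                    # left side of (6.4.13)
        info = Z.T @ (w[:, None] * Z)             # Z^T W Z, eq. (6.4.14)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical dose-response data: k = 5 dose levels, m_i = 20 trials each.
dose = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Z = np.column_stack([np.ones_like(dose), dose])   # intercept and slope, p = 2
m = np.full(5, 20.0)
x = np.array([2.0, 5.0, 10.0, 15.0, 18.0])

beta_hat = fit_grouped_logistic(Z, x, m)
mu_hat = m / (1.0 + np.exp(-(Z @ beta_hat)))
print(beta_hat, Z.T @ (x - mu_hat))   # the second vector should be numerically zero
```

The sketch takes plain Newton steps with no step-halving or other safeguard, so it relies on the starting value being good, as discussed above; it also assumes $Z$ has rank $p$ so that both linear systems are solvable.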
4.4. we set n = Rk and consider TJ E n.4.13.4. then we can form the LR statistic for H : TJ E Wo versus K: TJ E w .. We want to contrast w to the case where there are no restrictions on TJ. The samples are collected indeB(1ri' mi).1. ji). As in Section 6. Testing In analogy with Section 6. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of mi patients and recording the number Xi of patients that recover. it foHows (Problem 6. 2 log . We obtain k independent samples.11) and (6. i 1.)l {LOt (6. In this case the likelihood is a product of independent binomial densities. Theorem 5. D(X. and the MLEs of 1ri and {Li are Xdmi and Xi. fl) has asymptotically a X%r.~.. By the multivariate delta method. Example 6. k. . by Problem 6. suppose we want to compare k different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. Thus.18) {Lo~ where jio is the MLE of JL under H and fl~i = mi .8 that the inverse of the logit transform 9 is the logistic distribution function Thus..4.k."" Xk independent with Xi example. f"V .3.. the MLE of 1ri is 1ri = 9..14) that {3o is consistent.17) where X: = mi Xi and M~ mi Mi.4. .6. . distribution for TJ E w as mi + 00.4. {3 E RP} and let r be the dimension of w..4. For a second pendently and we observe Xl. i = 1. . The LR statistic 2 log .) +X. recall from Example 1. ji) 2 I)Xi 10g(Xdfli) + XIlog(XI!flDJ i=1 (6.. D(X... . Here is a special case.IOg(E.12) k zT D(X.\ has an asymptotic X. If Wo is a qdimensional linear subspace of w with q < r. i = 1.410 Inference in the Mllltin::lr. ji) measures the distance between the fit ji based on the model wand the data X. one from each location.q distribution as mi + 00.Wo 210g.w is denoted by D(y.1.13.\ for testing H : TJ E w versus K : TJ n . . The Binomial OneWay Layout. that is. In the present case. we let w = {TJ : 'f]i = {3. 
and for the ith location count the number Xi among mi that has the given attribute.\=2t [Xi log i=1 (Ei.:IITlPt'pr Case 6 Because Z has rankp. k < oosee Problem 6. To get expressions for the MLEs of 7r and JL.1 linear subhypotheses are important.4. where ji is the MLE of JL for TJ E w.flm. from (6.1 (L:~=l Xij(3j).
the Pearson statistic is shown to have a simple intuitive form. . in the case of a Gaussian response. The Pearson statistic 2 _ X  L k1 i=l (X "'(1) m'7r . Finally.Li of a response is expressed as a function of a linear combination ~i =.Li = ~i. (3 = L Zij{3j j=l p of covariate values. the Wald and Rao statistics take a form called "Pearson's X2. 7r E (0.7r ~ i "')2 mi 7r is a Wald statistic and the X2 test is equivalent asymptotically to the LR test (Problem 6. In the special case of testing independence of multinomial frequencies representing classifications in a twoway contingency table.4. . 6.3. we give explicitly the MLEs and X2 test. Summary. versus the alternative that the 7r'S are not all equal.3 we considered experiments in which the mean J.4. In particular.Li = EYij. It follows from Theorem 2. if J. and (3 = 10g[7r1(1 .13). we find that the MLE of 7r under His 7r = TIN. J. we considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates.4. We derive the likelihood equations.7r)].1 and 6.5 GENERALIZED LINEAR MODELS Yi In Sections 6. Thus.mk7r)T. Using Z as given in the oneway layout in Example 6. Under H the log likelihood in canonical exponential form is {3T . and as in that section.4.1). When the hypothesis is that the multinomial parameter is in a qdimensional subset of the k .1 that if 0 < T < N the MLE exists and is the solution of (6. The LR statistic is given by (6.. z." which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H.4. We found that for testing the hypothesis that a multinomial parameter equals a specified value. an important hypothesis is that the popUlations are homogenous.5 Generalized Linear Models 411 This model corresponds to the oneway layout of Section 6.3.15).3 to find tests for important statistical problems involving discrete data. 
the Rao statistic is again of the Pearson X2 form. In the special case of testing equality of k binomial parameters.L (ml7r.1.Li 7ri = gl(~i)' where gl(y) is the logistic distribution . and give the LR test.N log(1+ exp{{3}) + ~ log ( k ~: ) where T = 2::=1 Xi..1. N 2::=1 mi.Idimensional parameter space e. In Section 6. then J. we test H : 7rl = 7r2 7rk = 7r. We used the large sample testing results of Section 6. discuss algorithms for computing MLEs.Section 6.18) with JiOi mi7r. where J.3.
. Yn)T is n x 1 and ZJxn = (ZI. . 1989) synthesized a number of previous generalizations of the linear model.5. We assume that there is a onetoone transform g(fL) of fL. in which case 9 is also called the link function. As we know from Corollary 1. called the link junction. These Ii are independent Gaussian with known variance 1 and means JLi The model is GLM with canonical g(fL) fL. The generalized linear model with dispersion depending only on the mean The data consist of an observation (Z. Y) where Y = (Yb . McCullagh and NeIder (1983. Typically. More generally. = L:~=1 Zij{3. See Haberman (1974).412 Inference in the 1\III1IITin'~r:>rn"'T. Typically. Special cases are: (i) The linear model with known variance.. . ••• .. which is p x 1.1) where 11 is not in E. g(fL) is of the form (g(JL1) .1..r Case 6 function. fL determines 11 and thereby Var(Y) A( 11).. in which case JLi A~ (TJi). the mean fL of Y is related to 11 via fL = A(11). the natural parameter space of the nparameter canonical exponential family (6. (ii) Log linear models. 11) given by (6.5.."" zn) with Zi = (Zib"" Zip)T nonrandom and Y has density p(y. A (11) = Ao (TJi) for some A o. Znj)T is the jth column vector of Z. Note that if A is oneone. most importantly the log linear model developed by Goodman and Haberman. . but in a subset of E obtained by restricting TJi to be of the form where h is a known function. j=1 p In this case. the identity.. g = or 11 L J'j Z(j) = Z{3. g(JLn)) T. .1). that is. such that g(fL) = L J'jZ(j) j=1 p Z{3 where Z(j) (Zlj.6. the GLM is the canonical subfamily of the original exponential family generated by ZTy. Canonical links The most important case corresponds to the link being canonical.
5. 0 < (}i < 1 withcanonicallinkg(B) = 10g[B(1B)]. 1 ::.4. This isjust the logistic linear model of Section 6.1. j ::. In this case.3. f3j are free (unidentifiable parameters). if maximum likelihood estimates they necessarily uniquely satisfy the equation f3 exist.B 1. that procedure is just (6. in general.8) is orthogonal to the column space of Z. say. . the . b.1. "Ij = 10gBj. classification i on characteristic 1 and j on characteristic 2. by Theorem 2.. 1 ::.5. j ::. b.4.5.8) (6. .3. and the log linear model corresponding to log Bij = f3i + f3j.5.7. The log linear label is also attached to models obtained by taking the Yi independent Bernoulli ((}i). . 1f.3) where In this situation and more generally even for noncanonical links.. See Haberman (1974) for a further discussion.. i ::. The models we obtain are called log linearsee Haberman (1974) for an extensive treatment.2) (or ascertain that no solution exists). B+ j = 2:~=1 Bij . .Yp)T is M(n.Bp). the link is canonical. Algorithms If the link is canonical. The coordinate ascent algorithm can be used to solve (6 .5. But. j ::./1(.t1. that Y = IIYij 111:Si:Sa. . Suppose.2) It's interesting to note that (6. With a good starting point f3 0 one can achieve faster convergence with the NewtonRaphson algorithm of Section 2.6..tp) T . a. j::.8) is not a member of that space. 1 ::. p are canonical parameters. Bj > 0..Section 6.. Then. 1 ::. 2:~=1 B = 1. so that Yij is the indicator of. p. where f3i.80 ~ f3 0 . Then () = IIBij II. for example.1::. . If we take g(/1) = (log J. NewtonRaphson coincides with Fisher's method of scoring described in Problem 6.5 Generalized linear Models 413 Suppose (Y1. ZTy or = ZT E13 y = ZT A(Z. log J.2) can be interpreted geometrically in somewhat the same way as in the Gausian linear modelthe "residual" vector Y . /1(. as seen in Example 1. is that of independence Bij = Bi+B+j where Bi+ = 2:~=1 Bij .
5.1. Let ~m+l == 13m+1 . If Wo C WI we can write (6.D(Y . the deviance of Y to iLl and tl(iL o. as n ~ 00.. D generally except in the Gaussian case. then with probability tending to 1 the algorithm converges to the MLE if it exists. For the Gaussian linear model with known variance 0'5 (Problem 6. As in the linear model we can define the biggest possible GLM M of the form (6.1) for which p = n.4) That is. iLl) == D(Y.5. (6.5. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6. This quantity. ••• .2. 210gA = 2[l(Y.5. Unfortunately tl =f. the correction Am+ l is given by the weighted least squares formula (2.2. This name stems from the following interpretation.414 Inference in the Multiparameter Case Chapter 6 true value of {3. 1]) > o}).4). In that case the MLE of J1 is iL M = (Y1 . the algorithm is also called iterated weighted least squares. iLo) == inf{D(Y. the variance covariance matrix is W m and the regression is on the columns ofW mZProblem 6. iLl)' each of which can be thought of as a squared distance between their arguments. iLo) . Write 1](') for AI.D(Y.1](Y)) l(Y .5.5) is always ~ O.1](lLa)] for the hypothesis that IL = lLa within M as a "measure" of (squared) distance between Y and lLo. Testing in GLM Testing hypotheses in GLM is done via the LR statistic. .20) when the data are the residuals from the fit at stage m.6) a decomposition of the deviance between Y and iLo as the sum of two nonnegative components. called the deviance between Y and lLo. iLl) where iLl is the MLE under WI. We can think of the test statistic . iLo) . In this context. IL) : IL E wo} where iLo is the MLE of IL in Woo The LR statistic for H : IL E Wo versus K : IL E WI . The LR statistic for H : IL E Wo is just D(Y.Wo with WI ~ Wo is D(Y. Yn ) T (assume that Y is in the interior of the convex support of {y : p(y.5.13m' which satisfies the equation (6.
10) is asymptotically X~ under H.Section 6 . Similar conclusions r '''''''. which we temporarily assume known. (3))Z.3.1 . For instance. .5.. .. which.2... Ao(Z.8) This is not unconditionally an exponential family in view of the Ao (z. . Asymptotic theory for estimates and tests If (ZI' Y 1 ).. . . f3p ). fil) is thought of as being asymptotically X.3). . ••• . can be estimated by f = :EA(Z{3) where :E is the sample variance matrix of the covariates. f3d+l. the theory of Sections 6.2. and (6. However.. in order to obtain approximate confidence procedures. Zn in the sample (ZI' Y 1 )..2. and can conclude that the statistic of (6. we obtain If we wish to test hypotheses such as H : 131 = . if we assume the covariates in logistic regression with canonical link to be stochastic.3 hold (Problem 6. (3) term.9) What is ]1 ((3)? The efficient score function a~i logp(z .. .. (Zn' Y n ) has density (6.q. 1. then ll(fio. . if we take Zi as having marginal density qo.5. Y n ) from the family with density (6.5. so that the MLE {3 is unique.2 and 6. y . . is consistent with probability 1.7) More details are discussed in what follows. (Zn..0. This can be made precise for stochastic GLMs obtained by conditioning on ZI.5.5.5 Generalized Linear Models 415 Formally if Wo is a GLM of dimension p and WI of dimension q with canonical links. (Zn' Y n ) can be viewed as a sample from a population and the link is canonical. Thus. .. and 6. d < p.3. (3) is (Yi and so. then (ZI' Y1 ). . there are easy conditions under which conditions of Theorems 6.3 applies straightforwardly in view of the general smoothness properties of canonical exponential families. = f3d = 0. asymptotically exists. 6. we can calculate i=l where {3 H is the (p xl) MLE for the GLM with {3~xl = (0.
3.13) (6. then A(11)/c(r) = log! exp{c1 (r)11T y}h(y. For further discussion of this generalization see McCullagp. noncanonicallinks can cause numerical problems because the models are now curved rather than canonical exponential families.416 Inference in the Multiparameter Case Chapter 6 follow for the Wald and Rao statistics. when it can. Cox (1970) considers the variance stabilizing transformation which makes asymptotic analysis equivalent to that in the standard Gaussian linear model.5. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6.4.10) is of product form A( 11) [1/ c( r)] whereas the righthand side cannot always be put in this form. For instance. then it is easy to see that E(Y) = A(11) Var(Y) = c(r)A(11) (6. r) = exp{c.1 ::.1) depend on an additional scalar parameter r. The generalized linear model The GLMs considered so far force the variance of the response to be a function of its mean.5. (J2) and gamma (p. 11. for c( r) > 0. Because (6. These conclusions remain valid for the usual situation in which the Zi are not random but their proof depends on asymptotic theory for independent nonidentically distributed variables. if in the binary data regression model of Section 6.5.9 are rather similar. and NeIder (1983. A) families.5. .A(11))}h(y.r)dy. p(y. Existence of MLEs and convergence of algorithm questions all become more difficult and so canonical links tend to be preferred.11) Jp(y.1 (r)(11T y . Note that these tests can be carried out without knowing the density qo of Z1. r)dy = 1. 1989). (6. which we postpone to Volume II. r).5.12) The lefthand side of (6. 11. General link functions Links other than the canonical one can be of interest. Important special cases are the N (JL. As he points out. JL ::. .5. From the point of analysis for fixed n. 
the results of analyses with these various transformations over the range .14) so that the variance can be written as the product of a function of the mean and a general dispersion parameter. However. we take g(JL) = <1>1 (JL) so that 7ri = <1> (zT!3) we obtain the socalled probit model. It is customary to write the model as.
31) with fa symmetric about O." Of course.2. ____ . That is.8 but otherwise also postponed to Volume II..1) where E is independent of Z but ao(Z) is not constant and ao is assumed known. in fact. which is symmetric about 0. of a linear predictor of the form Ef3j Z(j).6. the resulting MLEs for (3 optimal under fa continue to estimate (3 as defined by (6.6 Robustness Pr.2. Ii 6. roughly speaking. we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian. We considered generalized linear models defined as a canonical exponential model where the mean vector of the vector Y of responses can be written as a function. still a consistent asymptotically normal estimate of (3 and. and Models 417 Summary.Section 6. under further mild conditions on P. Y) rv P given by (6.2. in fact. EpZTZ nonsingular. Another even more important set of questions having to do with selection between nested models of different dimension are touched on in Problem 6. We discussed algorithms for computing MLEs of In the random design case. For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II. the LSE is not the best estimate of (3. In Example 6. These issues.6. then the LSE of (3 is. and tests.i. is "act as ifthe model were the one given in Example 6. the GaussMarkov theorem.2. We found that if we assume the error distribution fa.19) even if the true errors are N(O. We considered the canonical link function that corresponds to the model in which the canonical exponential model parameter equals the linear predictor. There is another semiparametric model P 2 = {P : (ZT.5).i.. Y) has a joint distribution and if we are interested in estimating the best linear predictor /lL(Z) of Y given Z. if P3 is the nonparametric model where we assume only that (Z.2. E p y2 < oo}.2. confidence procedures.19) with Ei i. have a fixed covariate exact counterpart. 
for estimating /lL(Z) in a submodel of P 3 with /3 (6.d. where the Z(j) are observable covariate vectors and (3 is a vector of regression coefficients. a 2 ) or.. /3.2. so is any estimate solving the equations based on (6. called the link function. the distributional and implicit structural assumptions of parametric models are often suspect. the right thing to do if one assumes the (Zi' Yi) are i. had any distribution symmetric around 0 (Problem 6."n".. we use the asymptotic results of the previous sections to develop large sample estimation results. are discussed below.2.2. PI {P : (zT.d. Furthermore.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS As we most recently indicated in Example 6. if we consider the semiparametric model.J . ". y)T rv P that satisfies Ep(Y I Z) = ZT(3.1. with density f for some f symmetric about O}.r+i". whose further discussion we postpone to Volume II.
Robustness in Estimation

We drop the assumption that the errors $\epsilon_1, \ldots, \epsilon_n$ in the linear model (6.1.3) are normal. Instead we assume the Gauss-Markov linear model where

$$Y = Z\beta + \epsilon, \quad E(\epsilon) = 0, \quad \mathrm{Var}(\epsilon) = \sigma^2 I \tag{6.6.2}$$

where $Z$ is an $n \times p$ matrix of constants, $\beta$ is $p \times 1$, $Y$ and $\epsilon$ are $n \times 1$, and $I$ is the $n \times n$ identity matrix. The optimality of the estimates $\hat\mu_i$, $\hat\beta_i$ of Section 6.1.1 and the LSE in general still holds when they are compared to other linear estimates.

Theorem 6.6.1. Suppose the Gauss-Markov linear model (6.6.2) holds. Then, for any parameter of the form $\alpha = \sum_{i=1}^n a_i \mu_i$ for some constants $a_1, \ldots, a_n$, the estimate $\hat\alpha = \sum_{i=1}^n a_i \hat\mu_i$ has uniformly minimum variance among all unbiased estimates linear in $Y_1, \ldots, Y_n$.

Proof. Because $E(\epsilon_i) = 0$, $E(\hat\alpha) = \sum_{i=1}^n a_i \mu_i = \alpha$, and $\hat\alpha$ is unbiased. Moreover, $\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(\epsilon_i, \epsilon_j)$, and by (B.5.6),

$$\mathrm{Var}_{GM}(\hat\alpha) = \sum_{i=1}^n a_i^2\, \mathrm{Var}(\epsilon_i) + 2 \sum_{i<j} a_i a_j\, \mathrm{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \sum_{i=1}^n a_i^2$$

where $\mathrm{Var}_{GM}$ refers to the variance computed under the Gauss-Markov assumptions. Let $\tilde\alpha$ stand for any estimate linear in $Y_1, \ldots, Y_n$. The preceding computation shows that $\mathrm{Var}_{GM}(\tilde\alpha) = \mathrm{Var}_G(\tilde\alpha)$, where $\mathrm{Var}_G(\tilde\alpha)$ stands for the variance computed under the Gaussian assumption that $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. By Theorems 6.1.2(iv) and 6.1.3(iv), $\mathrm{Var}_G(\hat\alpha) \le \mathrm{Var}_G(\tilde\alpha)$ for all unbiased $\tilde\alpha$. Because $\mathrm{Var}_G = \mathrm{Var}_{GM}$ for all linear estimators, the result follows. $\Box$

Note that the preceding result and proof are similar to Theorem 1.4.4, where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case. In fact, in Example 6.1.2, our current $\hat\mu$ coincides with the empirical plug-in estimate of the optimal linear predictor (1.4.14). See Problem 6.1.3.

Many of the properties stated for the Gaussian case carry over to the Gauss-Markov case:

Proposition 6.6.1. If we replace the Gaussian assumptions on the errors $\epsilon_1, \ldots, \epsilon_n$ with the Gauss-Markov assumptions, the conclusions (1) and (2) of Theorem 6.1.4 are still valid; moreover, $\hat\beta$ and $\hat\mu$ are still unbiased and $\mathrm{Var}(\hat\mu) = \sigma^2 H$, $\mathrm{Var}(\hat\epsilon) = \sigma^2 (I - H)$, and if $p = r$, $\mathrm{Var}(\hat\beta) = \sigma^2 (Z^T Z)^{-1}$.

Example 6.6.1. One Sample (continued). In Example 6.1.1, $\mu = \beta_1$, and $\hat\mu = \hat\beta_1 = \bar Y$. The Gauss-Markov theorem shows that $\bar Y$, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with $E Y_i^2 < \infty$.
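As a small numerical companion to the one-sample statement, the sketch below uses the fact that a linear estimate $\sum_i c_i Y_i$ of a common mean is unbiased exactly when $\sum_i c_i = 1$, and that for uncorrelated $Y_i$ with common variance $\sigma^2$ its variance is $\sigma^2 \sum_i c_i^2$, whatever the error distribution. The function names, the perturbation scheme, and the choices $n = 10$, $\sigma^2 = 1$ are illustrative assumptions, not part of the text.

```python
import random


def linear_estimate_variance(weights, sigma2=1.0):
    """Var(sum_i c_i Y_i) for uncorrelated Y_i with common variance sigma2."""
    return sigma2 * sum(c * c for c in weights)


def perturbed_weights(n, rng, scale=0.05):
    """Hypothetical competitor: weights 1/n + d_i with sum(d_i) = 0, so they sum to 1."""
    u = [rng.gauss(0.0, scale) for _ in range(n)]
    ubar = sum(u) / n
    return [1.0 / n + (ui - ubar) for ui in u]


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 10

# Variance of the sample mean (all weights equal to 1/n).
var_mean = linear_estimate_variance([1.0 / n] * n)

# Variances of many other linear unbiased estimates of the mean.
competitors = [perturbed_weights(n, rng) for _ in range(1000)]
min_competitor = min(linear_estimate_variance(c) for c in competitors)

print(var_mean, min_competitor)
```

No competitor in the run can fall below the sample mean, since $\sum_i c_i^2 = 1/n + \sum_i d_i^2$ when the perturbations $d_i$ sum to zero. Only first and second moments enter the computation, which is the point of the Gauss-Markov argument.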
However, for $n$ large, $\bar Y$ has a larger variance (Problem 6.6.5) than the nonlinear estimate $\hat Y$ = sample median when the density of $Y$ is the Laplace density

$$\frac{\lambda}{2} \exp\{-\lambda |y - \mu|\}, \quad y \in R, \ \mu \in R, \ \lambda > 0.$$

For this density and all symmetric densities, $\hat Y$ is unbiased (Problem 3.4.12).

Remark 6.6.1. As seen in Example 6.6.1, a major weakness of the Gauss-Markov theorem is that it only applies to linear estimates. Another weakness is that it only applies to the homoscedastic case where $\mathrm{Var}(Y_i)$ is the same for all $i$. Suppose we have a heteroscedastic version of the linear model where $E(\epsilon) = 0$, $\mathrm{Var}(\epsilon_i) = \sigma_i^2$, and $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$, so that $\mathrm{Var}(\epsilon_i)$ depends on $i$. If the $\sigma_i^2$ are known, we can use the Gauss-Markov theorem to conclude that the weighted least squares estimates of Section 2.2 are UMVU among linear estimates. However, when the $\sigma_i^2$ are unknown, our estimates are not optimal even in the class of linear estimates.

Robustness of Tests

In Section 5.3 we investigated the robustness of the significance levels of $t$ tests for the one- and two-sample problems using asymptotic and Monte Carlo methods. Now we will use asymptotic methods to investigate the robustness of levels more generally. The $\delta$-method implies that if the observations $X_i$ come from a distribution $P$, which does not belong to the model, the asymptotic behavior of the LR, Wald, and Rao tests depends critically on the asymptotic behavior of the underlying MLEs $\hat\theta$ and $\hat\theta_0$.

From the theory developed in Section 6.2 we know that if $\Psi(\cdot, \theta) = Dl(\cdot, \theta)$ and $P$ is true we expect that $\hat\theta_n \stackrel{P}{\to} \theta(P)$, which is the unique solution of

$$\int \Psi(x, \theta)\, dP(x) = 0 \tag{6.6.3}$$

and

$$\sqrt{n}\,(\hat\theta - \theta(P)) \stackrel{\mathcal L}{\to} N(0, \Sigma(\Psi, P)) \tag{6.6.4}$$

with $\Sigma$ given by (6.2.6). Thus, for instance, consider $H : \theta = \theta_0$ and the Wald test statistic

$$T_W = n (\hat\theta - \theta_0)^T I(\theta_0) (\hat\theta - \theta_0).$$

If $\theta(P) \neq \theta_0$, evidently we have $T_W \stackrel{P}{\to} \infty$. But if $\theta(P) = \theta_0$, then $T_W \stackrel{\mathcal L}{\to} V^T I(\theta_0) V$ where $V \sim N(0, \Sigma(\Psi, P))$. Because $\Sigma(\Psi, P) \neq I^{-1}(\theta_0)$ in general, the asymptotic distribution of $T_W$ is not $\chi^2_r$. This observation holds for the LR and Rao tests as well, but see Problem 6.6.4. There is an important special case where all is well: the linear model we have discussed in Example 6.2.2.

Example 6.6.2. The Linear Model with Stochastic Covariates. Suppose $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ are i.i.d. as $(Z, Y)$ where $Z$ is a $(p \times 1)$ vector of random covariates and we model the relationship between $Z$ and $Y$ as

$$Y = Z^T \beta + \epsilon. \tag{6.6.5}$$
Here the distribution $P$ of $(Z, Y)$ is such that $\epsilon$ and $Z$ are independent, $E_P \epsilon = 0$, $E_P \epsilon^2 < \infty$, and we consider, say, $H : \beta = 0$. Then, when $\epsilon$ is $N(0, \sigma^2)$ (see Examples 6.2.1 and 6.2.2), the LR, Wald, and Rao tests all are equivalent to the $F$ test: Reject if

$$T_n \equiv \frac{\hat\beta^T Z_{(n)}^T Z_{(n)} \hat\beta}{s^2} \ge p\, f_{p, n-p}(1 - \alpha) \tag{6.6.6}$$

where $\hat\beta$ is the LSE, $f_{p,n-p}(1-\alpha)$ is the $1 - \alpha$ quantile of the $\mathcal F_{p,n-p}$ distribution, $Z_{(n)} = (Z_1, \ldots, Z_n)^T$, and

$$s^2 = \frac{1}{n - p} \sum_{i=1}^n (Y_i - \hat Y_i)^2 = \frac{1}{n - p}\, |Y - Z_{(n)} \hat\beta|^2.$$

Now, even if $\epsilon$ is not Gaussian, it is still true that if $\Psi$ is given by (6.2.30) then $\beta(P)$ specified by (6.6.3) equals $\beta$ in (6.6.5) and $\sigma^2(P) = \mathrm{Var}_P(\epsilon)$. Thus, it is still true by Theorem 6.2.1 that

$$\sqrt{n}\,(\hat\beta - \beta) \stackrel{\mathcal L}{\to} N\big(0, [E(ZZ^T)]^{-1} \sigma^2(P)\big) \tag{6.6.7}$$

and

$$s^2 \stackrel{P}{\to} \sigma^2(P). \tag{6.6.8}$$

Moreover, by the law of large numbers,

$$n^{-1} Z_{(n)}^T Z_{(n)} = \frac{1}{n} \sum_{i=1}^n Z_i Z_i^T \stackrel{P}{\to} E(ZZ^T) \tag{6.6.9}$$

so that the confidence procedures based on approximating $\mathrm{Var}(\hat\beta)$ by $[n^{-1} Z_{(n)}^T Z_{(n)}]^{-1} s^2 / n$ are still asymptotically of correct level. It follows by Slutsky's theorem that, under $H : \beta = 0$,

$$T_n \stackrel{\mathcal L}{\to} W^T [\mathrm{Var}(W)]^{-1} W$$

where $W_{p \times 1}$ is Gaussian with mean 0 and, hence, the limiting distribution of $T_n$ is $\chi^2_p$. Because $p\, f_{p, n-p}(1 - \alpha) \to x_p(1 - \alpha)$ by Example 5.3.7, the test (6.6.6) has asymptotic level $\alpha$ even if the errors are not Gaussian.

This kind of robustness holds for $H : \beta_{q+1} = \beta_{0,q+1}, \ldots, \beta_p = \beta_{0,p}$ or more generally $\beta \in \mathcal L_0 + \beta_0$, a $q$-dimensional affine subspace of $R^p$. It is intimately linked to the fact that even though the parametric model on which the test was based is false,

(i) the set of $P$ satisfying the hypothesis remains the same and

(ii) the (asymptotic) variance of $\hat\beta$ is the same as under the Gaussian model.

If the first of these conditions fails, then it is not clear what $H : \theta = \theta_0$ means anymore. If it holds but the second fails, the theory goes wrong. We illustrate with two final examples.

Example 6.6.3. The Two-Sample Scale Problem. Suppose our model is that $X = (X^{(1)}, X^{(2)})$ where $X^{(1)}$, $X^{(2)}$ are independent $\mathcal E(1/\theta_1)$, $\mathcal E(1/\theta_2)$, respectively, the lifetimes of paired pieces of equipment. If we take $H : \theta_1 = \theta_2$, the standard Wald test (6.3.17) is

$$\text{``Reject iff } n\, \frac{(\hat\theta_1 - \hat\theta_2)^2}{\hat\sigma^2} \ge x_1(1 - \alpha)\text{''} \tag{6.6.10}$$

where

$$\hat\theta_j = \frac{1}{n} \sum_{i=1}^n X_i^{(j)} \tag{6.6.11}$$

and

$$\hat\sigma = \frac{1}{\sqrt{2}\, n} \sum_{i=1}^n \big(X_i^{(1)} + X_i^{(2)}\big). \tag{6.6.12}$$

The conditions of Theorem 6.3.2 clearly hold. However, suppose now that $X^{(1)}/\theta_1$ and $X^{(2)}/\theta_2$ are identically distributed but not exponential. Then $H$ is still meaningful: under $H$, $X^{(1)}$ and $X^{(2)}$ are identically distributed. But the test (6.6.10) does not have asymptotic level $\alpha$ in general. To see this, note that, under $H$,

$$\sqrt{n}\,(\hat\theta_1 - \hat\theta_2) \stackrel{\mathcal L}{\to} N\big(0, 2\, \mathrm{Var}_P X^{(1)}\big) \tag{6.6.13}$$

but

$$\hat\sigma \stackrel{P}{\to} \sqrt{2}\, E_P(X^{(1)}) \neq \sqrt{2\, \mathrm{Var}_P X^{(1)}}$$

in general. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general. Simply replace $\hat\sigma^2$ by

$$\tilde\sigma^2 = \frac{1}{n} \sum_{i=1}^n \left\{ \left( X_i^{(1)} - \frac{\bar X^{(1)} + \bar X^{(2)}}{2} \right)^2 + \left( X_i^{(2)} - \frac{\bar X^{(1)} + \bar X^{(2)}}{2} \right)^2 \right\}. \tag{6.6.14}$$

$\Box$

Example 6.6.4. The Linear Model with Stochastic Covariates with $\epsilon$ and $Z$ Dependent. Suppose $E(\epsilon \mid Z) = 0$ so that the parameters $\beta$ of (6.6.5) are still meaningful but $\epsilon$ and $Z$ are dependent. For simplicity, let the distribution of $\epsilon$ given $Z = z$ be that of $\sigma(z)\epsilon'$ where $\epsilon'$ is independent of $Z$. That is, we assume variances are heteroscedastic. Suppose without loss of generality that $\mathrm{Var}(\epsilon') = 1$. Then (6.6.7) fails and in fact by Theorem 6.2.1

$$\sqrt{n}\,(\hat\beta - \beta) \stackrel{\mathcal L}{\to} N(0, Q) \tag{6.6.15}$$

where $Q = [E(ZZ^T)]^{-1}\, E(\sigma^2(Z) ZZ^T)\, [E(ZZ^T)]^{-1}$.
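To see how far the level can drift, here is a minimal sketch for a scalar covariate ($p = 1$), assuming a hypothetical discrete distribution for $Z$ and a hypothetical scale function $\sigma(z) = z$. It compares the sandwich-form variance $Q = E(\sigma^2(Z)Z^2)/[E(Z^2)]^2$ with the value $E(\sigma^2(Z))/E(Z^2)$ that the statistic $T_n$ of (6.6.6) implicitly uses, and converts the ratio into the asymptotic rejection probability of the nominal level $\alpha$ test of $H : \beta = 0$. All names and numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sandwich_q(values, probs, sigma):
    """Q = E(sigma(Z)^2 Z^2) / (E Z^2)^2 for a scalar discrete covariate Z."""
    ez2 = sum(p * z * z for z, p in zip(values, probs))
    middle = sum(p * (sigma(z) * z) ** 2 for z, p in zip(values, probs))
    return middle / ez2 ** 2


def naive_variance(values, probs, sigma):
    """Limit of s^2 / (n^{-1} sum Z_i^2): E(sigma(Z)^2) / E(Z^2)."""
    ez2 = sum(p * z * z for z, p in zip(values, probs))
    esig2 = sum(p * sigma(z) ** 2 for z, p in zip(values, probs))
    return esig2 / ez2


def asymptotic_level(naive, q, alpha=0.05):
    """Limiting rejection probability when T_n is compared with the chi-square(1) cutoff.

    Under H, T_n converges in law to (q / naive) * chi-square(1).
    """
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.erfc(crit * math.sqrt(naive / q) / math.sqrt(2.0))


def sigma_het(z):
    """Hypothetical scale function: error spread grows with the covariate."""
    return z


values = [1.0, 2.0, 3.0]
probs = [1.0 / 3.0] * 3

q_het = sandwich_q(values, probs, sigma_het)
naive_het = naive_variance(values, probs, sigma_het)
level_het = asymptotic_level(naive_het, q_het)

print(q_het, naive_het, level_het)
```

With a constant scale function the two variances agree and the limiting level is the nominal one. In this illustration the heteroscedastic choice makes $Q$ larger than the value the test assumes, so the test rejects too often.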
In general, the test (6.6.6) then does not have the correct level (Problem 6.6.6). Specialize to the case $Z = (1, I_1 - \lambda_1, \ldots, I_d - \lambda_d)^T$ where $(I_1, \ldots, I_d)$ has a multinomial $(\lambda_1, \ldots, \lambda_d)$ distribution, $0 < \lambda_j < 1$, $1 \le j \le d$. This is the stochastic version of the $d$-sample model of Example 6.1.3. It is easy to see that our tests of $H : \beta_2 = \cdots = \beta_d = 0$ fail to have correct asymptotic levels in general unless $\sigma^2(Z)$ is constant or $\lambda_1 = \cdots = \lambda_d = 1/d$ (Problem 6.6.6). A solution is to replace $Z_{(n)}^T Z_{(n)} / s^2$ in (6.6.6) by $n \hat Q^{-1}$, where $\hat Q$ is a consistent estimate of $Q$. For $d = 2$ above this is just the asymptotic solution of the Behrens-Fisher problem discussed in Section 4.9.4, the two-sample problem with unequal sample sizes and variances. $\Box$

To summarize: If hypotheses remain meaningful when the model is false then, in general, LR, Wald, and Rao tests need to be modified to continue to be valid asymptotically. The simplest solution, at least for Wald tests, is to use as an estimate of $[\mathrm{Var}\, \sqrt{n}\, \hat\theta]^{-1}$ not $I(\hat\theta)$ or $-\frac{1}{n} D^2 l_n(\hat\theta)$ but the so-called "sandwich estimate" (Huber, 1967),

$$\left[ \frac{1}{n} D^2 l_n(\hat\theta) \right]^T \left[ \frac{1}{n} \sum_{i=1}^n [Dl_1][Dl_1]^T (X_i, \hat\theta) \right]^{-1} \left[ \frac{1}{n} D^2 l_n(\hat\theta) \right]. \tag{6.6.16}$$

Summary. We considered the behavior of estimates and tests when the model that generated them does not hold. The Gauss-Markov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i.i.d. $N(0, \sigma^2)$ assumption on the errors is replaced by the assumption that the errors have mean zero, identical variances, and are uncorrelated, provided we restrict the class of estimates to linear functions of $Y_1, \ldots, Y_n$. In the linear model with a random design matrix, we showed that the MLEs and tests generated by the model where the errors are i.i.d. $N(0, \sigma^2)$ are still reasonable when the true error distribution is not Gaussian. In particular, the confidence procedures derived from the normal model of Section 6.1 are still approximately valid, as are the LR, Wald, and Rao tests. We also demonstrated that when either the hypothesis $H$ or the variance of the MLE is not preserved when going to the wider model, then the MLE and LR procedures for a specific model will fail asymptotically in the wider model. In this case, the methods need adjustment, and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.

6.7 PROBLEMS AND COMPLEMENTS

Problems for Section 6.1

1. Show that in the canonical exponential model (6.1.11) with both $\eta$ and $\sigma^2$ unknown, (i) the MLE does not exist if $n = r$, (ii) if $n \ge r + 1$, then the MLE $(\hat\eta_1, \ldots, \hat\eta_r, \hat\sigma^2)^T$ of $(\eta_1, \ldots, \eta_r, \sigma^2)^T$ is $(U_1, \ldots, U_r, n^{-1} \sum_{i=r+1}^n U_i^2)^T$. In particular, $\hat\sigma^2 = n^{-1} |Y - \hat\mu|^2$.

2. For the canonical linear Gaussian model with $\sigma^2$ unknown, use Theorem 3.4.3 to compute the information lower bound on the variance of an unbiased estimator of $\sigma^2$. Compare this bound to the variance of $s^2$.
and (3 and {Lz are as defined in (1.4. . is (c) Show that Y and Bare unbiased. . 4. Consider the model (see Example 1. . 0. Find the MLE B (). J =~ i=O L)) c i (1.1. 1 n f1.7 Problems and Complements 423 Hint: By A.Li = {Ly+(zi {Lz)f3. Var(Uf) = 20. Suppose that Yi satisfies the following model Yi ei = () + fi.1.2 . and the 1. . .d. Let Yi denote the response of a subject at time i..l+c (_c)j+l)/~ (1.2 known case with 0. (e) Show that Var(B) < Var(Y) unless c = O..2 with p = r coincides with the empirical plugin estimate of JLL = ({LL1.2 ). Here the empirical plugin estimate is based on U. {Ly = /31. Show that in the regression example with p = r = 2. fn where ei = ceil + fi.. i = 1.29).13.n ~ where fi can be written as fi = ceil + ei for given constant c satisfying 0 ~ c are independent identically distributed with mean zero and variance 0.a)% confidence interval for /31 in the Gaussian linear model is Z2)2 ] .d. (Zi.1] is a known constant and are i. i eo = 0 (the €i are called moving average errors. i = 1. 8.22. 0. see Problem 2. of 7.2 replaced by 0: 2 • 6. . ..28) for the noncentrality parameter ()2 in the regression example.. 9. Yn ) where Z. Derive the formula (6.14). . zi = (Zi21"" Zip)T. Yl). = (Zi2. Let 1.i. 5.1. 1 {LLn)T. t'V (b) Show that if ei N(O. (d) Show that Var(O) ~ Var(Y).. then B the MLE of (). where a. ...1. where IJ.5) Yi () + ei.. n. fO 0. .2 ).. N(O... .Section 6. i = 1."" (Z~.29) for the noncentrality parameter 82 in the oneway layout. Show that >:(Y) defined in Remark 6. Derive the formula (6. Show that Ii of Example 6. c E [0.2. i = 1. n.. the 100(1 .1. .(_C)i)2 l+c L i=l (a) Show that Bis the weighted least squares estimate of ().'''' Zip)T.n.4 • 3. .2 coincides with the likelihood ratio statistic A(Y) for the 0..
. (n . n2 = . Show that for the estimates (a) Var(a) = +==_=:. Yn ) such that P[t. Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6.I).p)S2 /x n . then the hat matrix H = (h ij ) is given by 1 n 11.. We want to predict the value of a 14. = 2(Y .2)2 a and 8k in the oneway layout Var(8k ) . Yn ). . (b) Find a level (1 a) prediction interval for Y (i. (b) Find confidence intervals for 'lj. then Var(8k ) is minimized by choosing ni = n/2. then Var( a) is minimized by choosing ni = n / p.a).2. (c) If n is fixed and divisible by 2(p . Yn . where n is even. Yn be a sample from a population with mean f.{3i are given by = ~(f.• 424 10.1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::..1 and variance a 2 ..:1 Yi + (3/ 2n) L~= ~ n+ 1 }i.. (d) Give the 100(1 .1. T2 1 (1/ 2n) L.a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z. ~(~ + t33) t31' r = 2. 'lj. 15. = np n/2(p 1). Often a treatment that is beneficial in small doses is harmful in large doses.. (a) Show that level (1 .. The following model is useful in such situations. a 2 ::. ••. (a) Find a level (1 .a)% confidence intervals for a and 6k • 12.e . ::.• statistics HYl. Let Y 1 . Consider the three estimates T1 = and T3 Y.. (Zi2 Z. Note that Y is independent of Yl. Y ::.":=::. In the oneway layout.2).Z. .~ L~=1 nk = (p~:)2 + Lk#i .k)' C (b) If n is fixed and divisible by p.p)S2 /x n ...p (~a) (n . ~ 1 . Consider a covariate x. .. ...p (1 .Z. . 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13. which is the amount . Assume the linear regression model with p future observation Y to be taken at the pont z. l(Yh .a) confidence intervals for linear functions of the form {3j .2)(Zj2 .2) L(Zi2 .
0289. xC' = 0 => x = 0 for any rvector x. ( 2 ) logp(x.. But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O. P. 17. i = 1. 8). a 2 > O}. nonsingular. 653).r2 )l 1 2 7 > O. . hence. Assume the model = (31 + (32 X i + (33 log Xi + Ei. 1 2 where <t'CJL. compute confidence intervals for (31. r :::. 16. p(x.Section 6. Show that if C is an n x r matrix of rank r. and Q = P.or dose of a treatment..2. . a 2 <7 2 .I::llogxi' You may use X = 0. Problems for Section 6. . p.0'2)(x ) + E<t'(JL. fi2. 8) = 2. . Check AO.7 Probtems and Lornol.1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL.0'2) is the N(Il'.A6 when 8 = ({L. 1952. Y (yield) 3. (b) Show that the MLE of ({L.• . and a response variable Y.2 1. (b) Plot (Xi. En Xi. Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1. : {L E R. Yi) and (Xi. •. Let 8. then the r x r matrix C'C is of rank r and.r2) (x). (a) For the following data (from Hald. ( 2 )T is identifiable if and only if r = p. ( 2 ).77 167.. . fli) where fi1. 7 2 ) does not exist so that A6 doesn't hold. (33.5 Zi2 * + (33zi2 where Zil = Xi  X. ( 2 ). and p be as in Problem 6. (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL. which is yield or production. In the Gaussian linear model show that the parametrization ({3. (32. n are independent N(O. ( 2 ) density. n. X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi ..emEmts 425 . Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133. P = {N({L.95 Yi = e131 e132xi Xf3. E :::. fi3 and level 0. Hint: Because C' is of rank r.1) + 2<t'CJL.
1) EZ . Hint: (a).1/ 2 ). that (d) (fj  (3)TZ'f..2. (fj multivariate nannal distribution with mean 0 and variance. (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4.24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then.20). 3. H abstract. .)Z(n)(fj . T 2 ) based on the first two I (d) Deduce from Problem 6..2 hold if (i) and (ii) hold._p distribution. i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric. .10 that the estimate Raphson estimates from On is efficient.". (6. show that the assumptions of Theorem 6.426 Inference in the Multiparameter Case Chapter 6 I . '".1.'. t) is then called a conditional MLE.2.  I .IY .  (3)T)T has a (". (c) Apply Slutsky's theorem to conclude that and. In Example 6.1 show thatMLEs of (3. show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic.I3). I . .2. T(X) = Zen). iT 2 ) are conditional MLEs. Establish (6. Hint: !x(x) = !YIZ(Y)!z(z). then ({3. Y).2.21). In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H. (e) Show that 0:2 is unconditionally independent of (ji. b .2. and . In Example 6. X .. (P("H) : e E e. given Z(n). (I) Combine (aHe) to establish (6.24). I I . 5. Z i ~O. e Euclidean. show that ([EZZ T jl )(1.2 are as given in (6. In Example 6.2. Show that if we identify X = (z(n).Zen) (31 2 e ~ 7.1. 6. Fill in the details of the proof of Theorem 6. /1. and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ). H E H}.fii(ji .2.. The MLE of based on (X. ell of (} ~ = (Ji.i > 1..2.2..Pe"H). (e) Construct a method of moment estimate moments which are ~ consistent. .2. hence. ..(b) The MLE minimizes .2.(3) = op(n.2. n[z(~)~en)jl)' X. (a) In Example 6.2. 8.
Oo) i=cl (1 . ao (}2 = (b) Write 8 1 uniquely solves 1 8. (logistic).p(Xi'O~) n.0 0 1 < <} ~ 0. <). (iii) Dg(0 0 ) is nonsingular.3).L'l'(Xi'O~) n. that is. . and f(x) = e..•. E R. .p(Xi'O~) + op(1) n i=l n ) (O~ .(XiI') _O. a > 0 are unknown and ( has known density f > 0 such that if p(x) log f(x) then p" > 0 and. R d are such that: (i) sup{jDgn(O) .LD.2.x (1 + C X ) .LD.7 Problems and Complements 427 9. (a) Show that if solves (Y ao is assumed known a unique MLE for L..'2:7 1 'l'(Xi. (b) Show that under AGA4 there exists E > 0 such that with probability tending to 1.Section 6.. Hint: n .l + aCi where fl.' . p is strictly convex.l real be independent identically distributed Y.. (iv) Dg(O) is continuous at 0 0 . Examples are f Gaussian. Him: You may use a uniform version of the inverse function theorem: If gn : Rd j. 10 .L 'l'(Xi'O~) n i=l 1 1 =n n L 'l'(Xi.O) has auniqueOin S(Oo. (ii) gn(Oo) ~ g(Oo). Let Y1 •. a' . ] t=l 1 n . the <ball about 0 0 . Show that if BB a unique MLE for B1 exists and 10.l exists and uniquely ~ .1' _ On = O~  [ . Suppose AGA4 hold and 8~ is vn consistent. = J.2. 8~ = 1 80 + Op(n. Y. .0 0 ). hence.Dg(O)j . (al Let ii n be the first iterate of the NewtonRapbson algorithm for solving (6.1/ 2 ). p i=l J.1) starting at 0. t=l 1 tt Show that On satisfies (6.
Zip)) + L j=2 p "YjZij + €i J . Zi. .12) and the assumptions of Theorem (6. = versus K : O oF 0.l hold for p(x.2.2. log Ai = ElI + (hZi. . 'I) = p(. Y. . iteration of the NewtonRaphson algorithm starting at 0.26) and (6. that Theorem 6. Suppose that Wo is given by (6. .3. < Zn 1 for given covariate values ZI. • (6.O E e. •.. there exists a j and their image contains a ball S(g(Oo).~_.. . .3.' " 13. 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}.(Z"  ~(') z. a 2 ). Similarly compute the infonnation matrix when the model is written as Y. .6). q(. i ./3)' = L i=1 1 CjZiU) n p .(j) l. Wald. i2. I '" LJ i=l ~(1) Y. . Show that if 3 0 = {'I E 3 : '1j = 0.3..3). Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n . II. P(Ai).1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1.. Inference in the Multiparameter Case Chapter 6 > O. 1 Zw Find the asymptotic likelihood ratio. Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj.27).Zi )13. 1 Yn are independent Poisson variables with Yi ... ~. N(O.. and ~ . 0 < ZI < .3.. t• where {31..O). . then it converges to that solution. 'I) : 'I E 3 0 } and. > 0 such that gil are 1 . converges to the unique root On described in (b) and that On satisfies 1 1 • j. Hint: Write n J • • • i I • DY.. and Rao tests for testing H : Ih. LJ j=2 Differentiate with respect to {31. Establish (6.12). i=l • Z. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X. {Vj} are orthonormal.(Z" .3 i 1.428 then. for "II sufficiently large.. Problems for Section 6. Thus. minimizing I:~ 1 (Ii 2 i.ip range freely and Ei are i. Suppose responses YI ...II(Zil I Zi2.2. 2 ° 2. = 131 (Zil .2 holds for eo as given in (6. )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3. P I . .d.i. . hence.. f. CjIJtlZ. •. Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution.
d.3. .n) satisfy the conditions of Problem 6. IJ. .3 is valid. independent. X n Li.(X1 ..(X" . Oil and p(X"Oo) = E. > 0 or IJz > O.Xn ) be the likelihood ratio statistic. 1)..aq are orthogonal to the linear span of ~(1J0) .) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('..) where k(Bo. aTB. OJ}.gr. (b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ .o log (X 0) PI.. =  p(·. S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo).'" .) 1(Xi . q + 1 <j < r S(lJo) .'1 8 0 = and. Oil + Jria(00 .. q + 1 <j< rl. Testing Simple versus Simple..3.7 Problems and C:com"p"":c'm"'.. Consider testing H : (j = 00 versus K : B = B1 . 0 ») is nK(Oo. .3. even if ~ b = 0.n:c"' 4c:2=9 3. Xi.n 7J(Bo. {IJ E S(lJo): ~j(lJ) = 0.2..'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio. 210g. . q('. Vi). (ii) E" (a) Let ). . .(I(X i . Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo. . Deduce that Theorem 6. Let (Xi. 1 < i < n.. with Xi and Y..3 hold.Section 6. 'I) S(lJo)} then. B). be ii.. j = 1. = ~ 4. . \arO whereal. (Adjoin to 9q+l. Suppose that Bo E (:)0 and the conditions of Theorem 6.'1) and Tin '1(lJ n) and. Assume that Pel of Pea' and that for some b > 0.l. Suppose 8)" > 0. Show that under H.2. < 00.d. p .. . Consider testing H : 8 1 = 82 = 0 versus K : 0. 1 5.. Let e = {B o . respectively. with density p(. There exists an open ball about 1J0. hence. pXd . N(IJ" 1).IJ) where q(.X.) O. (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0).. N(Oz. IJ.
Hint: ~ (i) Show that liOn) can be replaced by 1(0). > OJ..) under the model and show that (a) If 0.d. 2 log >'( Xi. which is a mixture of point mass at O.li : 1 <i< n) has a null distribution. 1 < i < n.)) ~ N(O.) if 0. for instance. and ~ where sin. = a + BB where . . mixture of point mass at 0.3. Then xi t. be i. = 0.=0 where U ~ N(O. (d) Relate the result of (b) to the result of Problem 4(a).6.~. Show that 2log.2 for 210g. Yi : l<i<n).19) holds.3. Po . ii. we test the efficacy of a treatment on the basis of two correlated responses per individual.) have an N. . ' · H In!: Consl'defa1O .0.. = (T20 = 1 andZ 1= X 1.0"10' O"~o. 1) with probability ~ and U with the same distribution. In the model of Problem 5(a) compute the MLE (0. Po) distribution and (Xi.\(X). < cIJ"O. (iii) Reparametrize as in Theorem 6.0.6.\( Xi.. =0. li). are as above with the same hypothesis but = {(£It.i. Exhibit the null distribution of 2 log . • i .1.\(Xi. Let Bi l 82 > 0 and H be as above. /:1.  0. > 0. (b) If 0.. 0. ±' < i < n) is distributed as a respectively. (h) : 0 < 0.  OJ.0. j ~ ~ 6. 7. (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular.1... ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0.0). XI and X~ but with probabilities ~ . > 0. O > 0. 1988). (b) Suppose Xil Yi e 4 (e) Let (X" Y. < . I . Hint: By sufficiency reduce to n = 1.2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6. 2 e( y'n(ii. Z 2 = poX1Y v' .3. Yi : 1 and X~ with probabilities ~. Show that (6. Sucb restrictions are natural if. 0.0.. (0. = v'1~c2' 0 <. Wright. Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson. and Dykstra. under H.430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n.
21).4 ~ 1(11) is continuous.3.4. 9.22) is a consistent estimate of ~l(lIo).7 Problems and Complements ~(1 ) 431 8.1(0 0 ).8) for Z.Section 6.0 < PCB) < 1.3. A611 Problems for Section 6.2 to lin . hence.3.10. Exhibit the two solutions of (6. . Hint: Write and apply Theorem 6.P(A))P(B)(1 .4.4. 0 < P( A) < 1. Under conditions AOA6 for (a) and AOA6 with A6 for (a) ~(1 ) i!~1) for (b) establish that [~D2ln(en)]1 is a consistent estimate of 1.8) by Z ~ . A3.4) explicitly and find the one that corresponds to the maximizer of the likelihood.4.8) (e) Derive the alternative form (6. 3.nf. Hint: Argue as in Problem 5. is of the fonn (b) Deduce that X' ~ Z' where Z is given by (6. 10. 2.P(B))· (b) Show that the sample correlation coefficient r studied in Example 5. (b) (6.P(A)P(B) JP(A)(1 . 1. (e) Conclude that if A and B are independent. In the 2 x 2 contingency table model let Xi = 1 or 0 according as the ith individual sampled is an A or A and Yi = 1 or 0 according as the ith individual sampled is a Born. Show that under AOA5 and A6 for 8 11 where ~(lIo) is given by (6.6 is related to Z of (6.2. Show that under A2.. (a) Show that the correlation of Xl and YI is p = peA n B) . (a) Show that for any 2 x 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero. then Z has a limitingN(011) distribution.3.
. Let R i = Nil + N i2 • Ci = Nii + N 2i · Show that given R 1 = TI. . a .1 II .. use this result to construct a test of H similar to the test with probability of type I error independent of TJil' TJj2? 1 .C. ( rl..~~. R 2 = T2 = n .54). 6. Fisher's Exact Test From the result of Problem 6. .. S. (a) Show that then P[N'j niji i = 1. .D. TJj2 are given by TJil ~ = . . ( where ( .)).6 that H is true.. TJj2 ~ = Cj n where R.. CI.4.~l. 8 21 . . 432 Inference in the Multiparameter Case Chapter 6 i 4.TI. TJj2.. n) can be then the test that rejects (conditionally on R I = TI' C 1 = GI) if N ll > j(a) is exact level o. (a) Let (NIl.. j.1 ... .. B!c1b!... .ra) nab ... N 12 • N 21 . . Cj = Cj] : ( nll. i = 1. I (c) Show that under independence the conditional distribution of N ii given R. Suppose in Problem 6.b1only. 8l! / (8 1l + 812 )). are the multinomial coefficients. . N 22 ) rv M (u. (a) Show that the maximum likelihood estimates of TJil. (b) Deduce that Pearson's X2 is given by (6. Ti) (the hypergeometric distribution). nal ) n12.4.~~. (22 ) as in the contingency table.. Let N ij be the entries of an let 1Jil = E~=l (}ij.4 deduce that jf j(o:) (depending on chosen so that Tl. TJj2 = 2::: a x b contingency table with associated probabilities Bij and 1 Oij. 8(r2' 82 I/ (8" + 8. 1 : X 2 (b) How would you.. It may be shown (see Volume IT) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5.. j = 1. This is known as Fisher's exact test. 7.. n R. . 811 \ 12 .1. (b) Sbow that 812 /(8l! ° + 812 ) ~ 821 /(8 21 + 822 ) iff R 1 and C 1 are independent.9) and has approximately a X1al)(bl) distribution under H. . . i = 1. n. C i = Ci. n a2 ) . nab) A ) = B. Hint: (a) Consider the likelihood as a function of TJil.~.u. = Lj N'j. N ll and N 21 are independent 8( r. in principle.4. j = 1.2 is 1t(Ci. Consider the hypothesis H : Oij = TJil TJj2 for all i. =T 1. C j = L' N'j. b I Ri ( = Ti.2. .
) Show that (i) and (ii) imply (iii).05 level (a) using the X2 test with approximate critical value? (b) using Fisher's exact test of Problem 6. n.ziNi > k.1""".. where Pp~ [2:f .14).BINDEPENDENTGNENC) p(AnB I C) ~ peA I C)P(B I C) (A.ZiNi > k] = a. and that under H. consider the assertions. C2). C are three events. (a) If A.4. classified by sex and admission status. 10.(rl + cd or N 22 < ql + n .5? Admit Deny Men Women 1 19 . Give pvalues for the three cases. ~i = {3. 11. if A and C are independent or B and C are independent..4. The following table gives the number of applicants to the graduate program of a small department of the University of California.93'1 215 103 69 172 Deny 225 162 n=387 (d) Relate your results to the phenomenon discussed in (a). N 22 is conditionally distributed 1t(r2.Section 6.. = 0 in the logistic model.(rl + cI).12 I . but (iii) does not. . and that we wish to test H : Ih < f3E versus K : Ih > f3E. Establish (6. Show that. 2:f . B INDEPENDENT) (iii) PeA (C is the complement of C. if and only if.BINDEPENDENTGIVENC) n B) = P(A)P(B) (A. 9. there is a UMP level a test. (b). (h) Construct an experiment and three events for which (i) and (ii) hold. for suitable a. which rejects. (e) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. Test in both cases whether the events [being a man] and [being admitted] are independent.0. Admit Men Women 1 235 1~35' 38 7 273 42 n = 315 Deny Admit 270 45 Men Women I 122 1'. and petfonn the same test on the resulting table. Then combine the two tables into one. 5 Hint: (b) It is easier to work with N 22 • Argue that the Fisher test is equivalent to rejecting H if N 22 > q2 + n .7 Problems and Complements 433 8. + IhZi. Suppose that we know that {3. B. (i) p(AnB (ii) I C) = PeA I C)P(B I C) (A. Would you accept Or reject the hypothesis of independence at the 0. Zi not all equal.
Fisher's Method ofScoring The following algorithm for solving likelihood equations was proosed by Fishersee Rao (1973). k and a 2 is unknown. I < i < k are independent.3. = rn I < /3g and show that it agrees with the test of (b) Suppose that 131 is unknown.3.4.5. Nk) ~ M(n. if the design matrix has rank p. .3) l:::~ I (Xi . E {z(lJ.4) is as claimed formula (2. Given an initial value ()o define iterates  Om+l . X k be independent Xi '" N (Oi.. Suppose that (Z" Yj). .. Compute the Rao test statistic for H : (32 case. Show that the likelihO<Xt ratio test of H : O} = 010 . .20) for the regression described after (6. 1 .)llmi~i(1 'lri).5 construct an exact test (level independent of (31). 3.11 are obtained as realization of i. which is valid for the i.4. . . for example. a 2 ) where either a 2 = (known) and 01 . .18) tends 2 to Xrq' Hint: (Xi . ~ 8 m + [1(8 m )Dl(8 m ). · .i. . with (Xi I Z.. . Zi and so that (Zi.5 1. Suppose the ::i in Problem 6.OkO) under H.. a5 j J j 1 J I ..".Oio)("Cooked data"). 13...OiO)2 > k 2 or < k}. . (Zn.8) and..Ld. i Problems for Section 6. Tn. Use this to imitate the argument of Theorem 6.4..i.) ~ B(m..z(kJl) ~I 1 .. Y n ) have density as in (6. Show that.4. " lOk vary freely. This is an approximation (for large k.l 1 . but under K may be either multinomial with 0 #. • I 1 • • (a)P[Z. Show that if Wo C WI are nested logistic regression models of dimension q < r < k and mI... 434 • Inference in the Multiparameter Case Chapter 6 f • 12. I.. but Var. Verify that (6. I). (a) Compute the Rao test for H : (32 Problem 6.(Ni ) < nOiO(1 . .1 14. i 16.'Ir(. 15.11. . .. < f3g in this (c) By conditioning on L~ 1 Xi and using the approach of Problem 6. . case. 2.2.. . .. 010.. a under H..4).00 or have Eo(Nd : . or Oi = Bio (known) i = 1.: nOiO.15) is consistent. Xi) are i. In the binomial oneway layout show that the LR test is asymptotically equivalent to Pearson's  X2 test in the sense that 2log'\  X2 . 
mk ~ 00 and H : fJ E Wo is true then the law of the statistic of (6.f. n) and simplification of a model under which (N1. . in the logistic regression model...4. .5.5. then 130 as defined by (6. Let Xl. .d. 1 i Show that for GLM this method coincides with the NewtonRaphson method of Section 2.... a 2 = 0"5 is of the form: Reject if (1/0".4. .LJ2 Z i )).d. " Ok = 8kO. . asymptotically N(O.
.b(Oi)} p(y.. b(B). Show that when 9 is the canonical link. (e) Suppose that Y. Y) and that given Z = z.p.ztkl} is RP (c) P[ZI ~ z(jl] > 0 for all j. b(9). .Section 6.) .1 = 0 V fJ~ . J JrJ 4.= 1 . h(y. ball about "k=l A" zlil in RP . J .) and v(. . b(9).).. .(3). as (Z. the deviance is 5. ~ .. <. 1'0) = jy 1'01 2 /<7.Oi)~h(y. ~ N("" <7. .O(z)) where O(z) solves 1/(0) = gI(zT {3).5. h(y. Show that. Assume that there exist functions h(y.. and v(. d".. j = 1. give the asymptotic distribution of y'n({3 .. T). Y follow the model p(y. (y. the resuit of (c) coincides with (6. . T. g(p. (d) Gaussian GLM. and v(. P(I"). then the convex support of the conditional distribution of = 1 Aj Yj zU) given Z j = Z (j) . . Find the canonical link function and show that when g is the canonical link.5. Show that for the Gaussian linear model with known variance D(y. Wi = w(". C(T).i. 00 d" ~ a(3j (b) Show that the Fisher information is Z].). g("i) = zT {3. (Zn. Show that the conditions AQA6 hold for P = P{3o E P (where qo is assumed known).T)exp { C(T) where T is known.. and b' and 9 are monotone.y . Yn ) are i. ..F.. 05.. k. distribution... Give 0. Let YI..(. (c) Suppose (Z" Y I ).) and C(T) such that the model for Yi can be written as O. T.WZ v where Zv = Ilz'jll is the design matrix and W = diag( WI.7 Problems and Complements 435 (b) The linear span of {ZII). .5. contains an open L. under appropriate conditions. Wn). T). C(T).d. Set ~ = g(.) ~ 1/11("i)( d~i/ d".9).. your result coincides with (6.)z'j dC. . In the random design case. ..T). Give 0.. (a) Show that the likelihood equations are ~ i=I L. 9 = (b')l. .. Hint: By the chain rule a l(y a(3j' 0) = i3l dO d" a~ .5). . Suppose Y.. Yn be independent responses and suppose the distribution of Yi depends on a covariate vector Zi.9). L...) = ~ Var(Y)/c(T) b"(O). Hint: Show that if the convex support of the conditional distribution of YI given ZI = zU) contains an open interval about p'j for j = 1. has the Poisson.. k.
I 4.1. ! I I ! I .p under the sole assumption that E€ = 0. • I 3. replacing &2 by 172 in (6. . 0 < Vae f < 00. then the Rao test does not in general have the correct asymptotic level. f}o) is used. W n and R n are computed under the assumption of the Gaussian linear model with a 2 known. let 1r = s(z.3) then f}(P) = f}o.10) creates a valid level u test.i. Consider the Rao test for H : f} = f}o for the model "P = {P/I : /I E e} and ADA6 hold. 2. then O' 2(p) > 0'2 = Varp(X.3. W n .6.3 and. Suppose that the ttue P does not belong to"P but if f}(P) is defined by (6. n . Show that the standard Wald test forthe problem of Example 6.. set) = 1. t E R. Wn Wn + op(l) + op(I). then the sample median X satisfies ~ f at I I Vri(X where O' (p) 2 yep») ~ N(0. /10) is estimated by 1(80 ). Suppose ADA6 are valid. Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation.6.. I "j 5. then it is. . Show that. P. then v( P) .1) ~ .6. (a) Show that if f is symmetric about 1'. and Rao tests are still asymptotically equivalent in the sense that if 2 log An.6.Xn are i. (6.7.15) by verifying the condition of Theorem 6. .. if VarpDl(X. Establish (6.1.2 and the hypothesis (3q+l = (3o.d.6. I: . By Problem 5. but that if the estimate ~ L:~ dDIllDljT (Xi. Note: 2 log An. . .). Consider the linear model of Example 6.3 is as given in (6.(3p ~ (3o. Wald. Suppose Xl. O' 2(p)/0'2 = 2/1r. in fact. hence.3. .1 under this model and verifying the fonnula given.2.14) is a consistent estimate of2 VarpX(l) in Example 6.O' 2 (p») = 1/4f(v(p». (3) where s(t) is the continuous distribution function of a random variable symmetric about 0. i 6.2. . . " .Q+1>'" . if P has a positive density v(P).6. Show that the LR.10). 0'2).6 1. then under H. Show that 0: 2 given in (6. 7. j. but if f~(x) = ~ exp Ix 1'1. the unique median of p.4.6. 
and R n are the corresponding test statistics. Apply Theorem 6. = 1" (b) Show that if f is N(I'. I .j 436 Inference in the Multiparameter Case Chapter 6 i Problems for Section 6.6. l 1 . that is. then 0'2(P) < 0'2. In the hinary data regression model of Section 6.s(t). the infonnation bound and asymptotic variance of Vri(X 1').
~ ~ (d) Show that (a).. Zi are ddimensional vectors for covariate (factor) values.. are indepeqdent of}] 1 ' · ' 1 Yn and ~* is distributed as }'i.. ...i . .. Gaussian with mean zero J1i = Z T {3 and variance 0'2. that Zj is bounded with probability 1 and let ih(X ln )).9.' i=l . the model with 13d+l = . Show that if the correct model has Jri given by s as above and {3 = {3o.9.. 1 V.. L:~ 1 (P.lp)2 Here Yi" l ' •• 1 Y. . i = 1.2 is known. Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d ..(b) continue to hold if we assume the GaussMarkov model. i = 1.jP)2 where ILl ) = z.d. .. . = (3p = 0 by the (average) expected prediction error ~ ~ n EPE(p) = n. _.Zn and evaluate the performance of Yjep) . hence.1) Yi = J1i + ti.. be the MLE for the logit model. where ti are i. Zi..lpI)2 be the residual sum of squares.1. 8. (b) Suppose that Zi are realizations of U. (a) Show that EPE(p) ~ . /3P+I = '" = /3d ~ O. for instance). . Suppose that ..1. Model selection consists in selecting p to minimize EPE(p) and then using Y(P) as a predictor (Mallows.2 is an unbiased estimate of EPE(P).2 (b) Show that (1 + ~D + .7 Problems and Complements 437 (a) Show that ~ Jr can be written in this form for both the probit and logit models.{3lp) p and {3(P) ~ (/31.Section 6..(p) the corresponding fitted value.2. A natural goal to entertain is to obtain new values Yi". . Hint: Apply Theorem 6. then 13L is not a consistent estimate of f3 0 unless s(t) is the logistic distribution. where Xln) ~ {(Y.. Let f3(p) be the LSE under this assumption and Y. 1 n. (Model Selection) Consider the classical Gaussian linear model (6. . But if 13L is defined as the solution of EZ1s(Zr {30) = Q(/3) where Q({3) = E(Zr A(Zr/3) is p x 1. i . then ~ Vri(rJL QI ({3) Var(ZI (Y1   {3 Ll has a limiting normal distribution with mean 0 and variance A(Zr {3)) )[QI ({3)J where Q({3) = E(Zr A(Zr{30)ZI) is p x p and necessarily nonsingular.p don't matter. 1973. 
O)T and deduce that (c) EPE(p) = RSS(p) + ~... y~p) and. Zi) : 1 <i< n}. f3p. 1 n.p. 0. at Zll'" .I EL(Y. .. .d. Let RSS(p) = 2JY.
. 1 I . which are not multinomial.4. 1969. Note for Section 6. and (ii) 'T}i = /31 Zil + f32zi2. R.4 (1) R.  J . Y.16).1 (1) From the L A. Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good. EPE(p) n R SS( p) = ..". DIXON. . . 1 "'( i=1 . (b) The result depends only on the mean and covariance structure of the i=l"". The Analysis of Binary Data London: Methuen. . Heart Study after Dixon and Massey (1969). if we consider alternatives to H.i2} such that the EPE in case (i) is smaller than in case (d) and vice versa. ! . New York: McGrawHill. but it is reasonable. + Evaluate EPE for (i) ~i ~ "IZ.L . AND F. ~(p) n .8 NOTES Note for Section 6. ti.9 REFERENCES Cox.9 for a discussion of densities with heavy tails. )I'i). t12 and {Z/I.. Note for Section 6. D. For instance. 3rd 00. 1 n n L (I'. W. Give values of /31.n.6. 1970.) . i 6.' . Hint: (a) Note that ".5. i=1 Derive the result for the canonical model. 6.438 (e) Suppose p ~ Inference in the Multiparameter Case Chapter 6 2 and 11(Z) ~ Ii.~I'. MASSEY. The moral of the story is that the practicing statisticians should be on their guard! For more on this theme see Section 6. . this makes no sense for the model we discussed in this section.. A. Y/" j1~. To guard against such situations he argued that the test should be used in a twotailed fashion and that we should reject H both for large and for small values of X2 • Of course. Introduction to Statistical Analysis. we might envision the possibility that an overzealous assistant of Mendel "cooked" the data. Use 0'2 = 1 and n = 10. .z. LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of X2 as well as large ones (Problem 6.2 (1) See Problem 3.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.
GRAYBILL, F., An Introduction to Linear Statistical Models, Vol. I New York: McGraw-Hill, 1961.
HABERMAN, S., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
HALD, A., Statistical Theory with Engineering Applications New York: Wiley, 1952.
HUBER, P. J., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob. 1, Univ. of California Press, 221-233 (1967).
KASS, R., J. KADANE AND L. TIERNEY, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).
KOENKER, R. AND V. D'OREY, "Computing regression quantiles," J. Roy. Statist. Soc. Ser. C, 36, 383-393 (1987).
LAPLACE, P.-S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (Reprinted in Oeuvres Complètes, 11, 475-558. Gauthier-Villars, Paris) (1789).
MALLOWS, C., "Some comments on $C_p$," Technometrics, 15, 661-675 (1973).
McCULLAGH, P. AND J. A. NELDER, Generalized Linear Models London: Chapman and Hall, New York, 1983; second edition, 1989.
PORTNOY, S. AND R. KOENKER, "The Gaussian Hare and the Laplacian Tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.
ROBERTSON, T., F. T. WRIGHT, AND R. L. DYKSTRA, Order Restricted Statistical Inference New York: Wiley, 1988.
SCHEFFÉ, H., The Analysis of Variance New York: Wiley, 1959.
SCHERVISH, M., Theory of Statistics New York: Springer, 1995.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.
WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need. The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics. Therefore, we include some proofs as well in these sections.

In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply.

The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that to be called an experiment such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not, in addition, require that every repetition yield the same outcome, although we do not exclude this case.
What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize. is tossed can land heads or tails. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book.
This long-term relative frequency $n_A/n$, where $n_A$ is the number of times the possible outcome $A$ occurs in $n$ repetitions, is to many statisticians, including the authors, the operational interpretation of the mathematical concept of probability. In this sense, almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment."

Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, whether it is conceptually repeatable or not. For a discussion of this approach and further references the reader may wish to consult Savage (1954), Savage (1962), Raiffa and Schlaiffer (1961), Lindley (1965), de Groot (1970), and Berger (1985).

We now turn to the mathematical abstraction of a random experiment, the probability model.

In this section and throughout the book, we presume the reader to be familiar with elementary set theory and its notation at the level of Chapter 1 of Feller (1968) or Chapter 1 of Parzen (1960). We shall use the symbols $\cup$, $\cap$, ${}^c$, $-$, $\subset$ for union, intersection, complementation, set theoretic difference, and inclusion as is usual in elementary set theory.

A random experiment is described mathematically in terms of the following quantities.

A.1.1 The sample space is the set of all possible outcomes of a random experiment. We denote it by $\Omega$. Its complement, the null set or impossible event, is denoted by $\emptyset$.

A.1.2 A sample point is any member of $\Omega$ and is typically denoted by $\omega$.

A.1.3 Subsets of $\Omega$ are called events. We denote events by $A$, $B$, and so on or by a description of their members, as we shall see subsequently. The relation between the experiment and the model is given by the correspondence "$A$ occurs if and only if the actual outcome of the experiment is a member of $A$." The set operations we have mentioned have interpretations also. For example, the relation $A \subset B$ between sets considered as events means that the occurrence of $A$ implies the occurrence of $B$. If $\omega \in \Omega$, $\{\omega\}$ is called an elementary event. If $A$ contains more than one point, it is called a composite event.

A.1.4 We will let $\mathcal{A}$ denote a class of subsets of $\Omega$ to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability $P$ to every subset of $\Omega$. However, $\mathcal{A}$ is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complementation (cf. Chung, 1974; Grimmett and Stirzaker, 1992; and Loève, 1977). A probability distribution or measure is a nonnegative function $P$ on $\mathcal{A}$ having the following properties:

(i) $P(\Omega) = 1$.

(ii) If $A_1, A_2, \dots$ are pairwise disjoint sets in $\mathcal{A}$, then
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$

Recall that $\bigcup_{i=1}^{\infty} A_i$ is just the collection of points that are in any one of the sets $A_i$ and that two sets are disjoint if they have no points in common.
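The stabilization of relative frequencies described at the start of this section can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: the coin, its heads probability `p_heads`, the seed, and the function name `relative_frequencies` are hypothetical choices, not part of the text.

```python
import random

def relative_frequencies(n_trials, p_heads=0.5, seed=0):
    """Simulate n_trials tosses of a (hypothetical) coin and return the
    running relative frequency n_A / n of the event A = 'heads'."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    heads = 0
    freqs = []
    for n in range(1, n_trials + 1):
        if rng.random() < p_heads:
            heads += 1
        freqs.append(heads / n)    # n_A / n after n repetitions
    return freqs

freqs = relative_frequencies(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    # The printed frequencies fluctuate for small n and settle down as n grows.
    print(n, round(freqs[n - 1], 4))
```

Relative frequencies are nonnegative, equal 1 for the sure event, and add over disjoint events, which is one way to motivate properties (i) and (ii) of A.1.4.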
~ A.2 PeN) 1 .. and Stone (1971) Sections 1. Sections 13. and Stone (1992) Section 1. (n~~l Ai) > 1.2. A. References Gnedenko (1967) Chapter I. Port. A.3 DISCRETE PROBABILITY MODELS A. A. A.2 Parzen (1960) Chapter 1.2.2) n . and P together describe a random experiment mathematically. c A..S The three objects n.P(A). by axiom (ii) of (A. (1967) Chapter l.Section A. Sections 45 Pitman (1993) Section 1.68 Grimmett and Stirzaker (1992) Sections 1. Port.l1. when we refer to events we shall automatically exclude those that are not members of A..2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS The following are consequences of the definition of P.S P A.2. Section 8 Grimmett and Stirzaker (1992) Section 1.2. References Gnedenko.3 If A C B.3 Hoel.t A probability model is called discrete if is finite or countably infinite and every subset of f! is assigned a probability.A) = PCB) .. P(B) A. P(0) ~ O. We shall refer to the triple (0. For convenience. (U.. P) either as a probability model or identify the model with what it represents as a (random) experiment.2 and 1.2.. J. then P (U::' A.3.2.3 Panen (1960) Chapter I. we have for any event A.3. we can write f! = {WI. In this case. W2. A.3 A.3 A.1.40 < P(A) < 1.L~~l peA. } and A is the collection of subsets of n. > PeA).2. (A. ..' 1 An) < L.. .7 P 1 An) = limn~= P(A n ). That is. c .2 Elementary Properties of Probability Models 443 A.1 If A c B...3 Hoe!.4).PeA).). Sections 15 Pitman (1993) Sections 1. 1.6 If A . C An .) (Bonferroni's inequality). then PCB .l.' 1 peA.
An important special case arises when $\Omega$ has a finite number of elements, say $N$, all of which are equally likely. Then $P(\{\omega\}) = 1/N$ for every $\omega \in \Omega$, and
$$P(A) = \frac{\text{Number of elements in } A}{N}. \tag{A.3.3}$$

A.3.4 Suppose that $\omega_1, \dots, \omega_N$ are the members of some population (humans, guinea pigs, flowers, machines, etc.). Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another, selecting at random, is an experiment leading to the model of (A.3.3). Such selection can be carried out if $N$ is small by putting the "names" of the $\omega_i$ in a hopper, shaking well, and drawing. For large $N$, a random number table or computer can be used.

References
Gnedenko (1967) Chapter 1, Sections 4-5
Parzen (1960) Chapter 1, Sections 6-7
Pitman (1993) Section 1.1

A.4 CONDITIONAL PROBABILITY AND INDEPENDENCE

Given an event $B$ such that $P(B) > 0$ and any other event $A$, we define the conditional probability of $A$ given $B$, which we write $P(A \mid B)$, by
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \tag{A.4.1}$$
If $P(A)$ corresponds to the frequency with which $A$ occurs in a large number of repetitions of the experiment, then $P(A \mid B)$ corresponds to the frequency of occurrence of $A$ relative to the class of trials in which $B$ does occur. From a heuristic point of view $P(A \mid B)$ is the chance we would assign to the event $A$ if we were told that $B$ has occurred.

If $A_1, A_2, \dots$ are (pairwise) disjoint events and $P(B) > 0$, then
$$P\Big(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\Big) = \sum_{i=1}^{\infty} P(A_i \mid B). \tag{A.4.2}$$
In fact, for fixed $B$ as before, the function $P(\cdot \mid B)$ is a probability measure on $(\Omega, \mathcal{A})$ which is referred to as the conditional probability measure given $B$.

Transposition of the denominator in (A.4.1) gives the multiplication rule,
$$P(A \cap B) = P(B)P(A \mid B). \tag{A.4.3}$$

If $B_1, B_2, \dots, B_n$ are (pairwise) disjoint events of positive probability whose union is $\Omega$, the identity $A = \bigcup_{j=1}^{n} (A \cap B_j)$, (A.1.4)(ii) and (A.4.3) yield
$$P(A) = \sum_{j=1}^{n} P(A \mid B_j)P(B_j). \tag{A.4.4}$$
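The definition (A.4.1) and the decomposition (A.4.4) can be checked by direct counting in an equally likely model. This is a minimal sketch: the two-dice experiment, the event $A$ = "the sum is 8", and the helper names `prob` and `cond_prob` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical equally likely model: two rolls of a fair die (36 sample points).
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(A) = (number of elements in A) / N for an event given as a predicate."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0, as in (A.4.1)."""
    pb = prob(b)
    if pb == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(lambda w: a(w) and b(w)) / pb

A = lambda w: w[0] + w[1] == 8
# B_1, ..., B_6 partition the sample space by the value of the first roll.
partition = [lambda w, j=j: w[0] == j for j in range(1, 7)]

# Right-hand side of (A.4.4): sum over j of P(A | B_j) P(B_j).
total = sum(cond_prob(A, Bj) * prob(Bj) for Bj in partition)
print(prob(A), total)
```

Both printed values agree, since conditioning on the first roll and averaging recovers the unconditional probability.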
. . Chapter 3.7) whenever P(B I n .. P(il..} such thatj ct {i" . (AA." ..lO) for any subset {iI.4. B ll is written P(A B I •.)P(B...1 ).. and (A. The events All .A. the relation (AA.. relation (A.) j=l k (AA. P(B 1 n·· n B n ) ~ P(B 1 )P(B2 I BJlP(B3 I ill.) A) ~ ""_ PIA I B )p(BT L.S) The conditional probability of A given B I defined by ..i k } of the integers {l. . .I J (AA. Sections IA Pittnan (1993) Section lA . . P(B" I Bl.i. BnJl (AA.8) may be written P(A I B) ~ P(A) (AA. we can combine (A.. A and B are independent if knowledge of B does not affect the probability of A. ...lO) is equivalent to requiring that P(A J I A.. An are said to be independent if P(A i . . I. B 2 ) .4. .. Sections 9 Grimmett and Stirzaker (1992) Section IA Hoel.) = II P(A. n Bnd > O. Section 4.···.• ... Two events A and B are said to be independent if P(A n B) If P( B) ~ P(A)P(B).En such that P(B I n ....3)...8) > 0. References Gnedenko (1967) Chapter I...9) In other words.S parzen (1960) Chapter 2.4. Ifall theP(A i ) are positive. I PeA I B. ..Section AA Conditional Probability and Independence 445 If P( A) is positive.. (AA. Port...) ~ P(A J ) (AA.4) and obtain Bayes rule . . n B n ) > O. . Simple algebra leads to the multiplication rule.n}. n··· nA i . B I .I_1 .. ..}. and Stone (1971) Sections lA.i. B n ) and for any events A.II) for any j and {i" . . .
A.5 COMPOUND EXPERIMENTS

There is an intuitive notion of independent experiments. For example, if we toss a coin twice, the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. On the other hand, it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip, and if we do not replace the first chip drawn before the second draw, then the probability of a given chip in the second draw will depend on the outcome of the first draw. To be able to talk about independence and dependence of experiments, we introduce the notion of a compound experiment. Informally, a compound experiment is one made up of two or more component experiments. There are certain natural ways of defining sigma fields and probabilities for these experiments. These will be discussed in this section. The reader not interested in the formalities may skip to Section A.6 where examples of compound experiments are given.

A.5.1 Recall that if $A_1, \dots, A_n$ are events, the Cartesian product $A_1 \times \cdots \times A_n$ of $A_1, \dots, A_n$ is by definition $\{(\omega_1, \dots, \omega_n) : \omega_i \in A_i,\ 1 \le i \le n\}$. If we are given $n$ experiments (probability models) $\mathcal{E}_1, \dots, \mathcal{E}_n$ with respective sample spaces $\Omega_1, \dots, \Omega_n$, then the sample space $\Omega$ of the $n$ stage compound experiment is by definition $\Omega_1 \times \cdots \times \Omega_n$. The ($n$ stage) compound experiment consists in performing component experiments $\mathcal{E}_1, \dots, \mathcal{E}_n$ and recording all $n$ outcomes. The interpretation of the sample space $\Omega$ is that $(\omega_1, \dots, \omega_n)$ is a sample point in $\Omega$ if and only if $\omega_1$ is the outcome of $\mathcal{E}_1$, $\omega_2$ is the outcome of $\mathcal{E}_2$ and so on. To say that $\mathcal{E}_i$ has had outcome $\omega_i^0 \in \Omega_i$ corresponds to the occurrence of the compound event (in $\Omega$) given by
$$\Omega_1 \times \cdots \times \Omega_{i-1} \times \{\omega_i^0\} \times \Omega_{i+1} \times \cdots \times \Omega_n = \{(\omega_1, \dots, \omega_n) \in \Omega : \omega_i = \omega_i^0\}.$$
More generally, if $A_i \in \mathcal{A}_i$, the sigma field corresponding to $\mathcal{E}_i$, then $A_i$ corresponds to $\Omega_1 \times \cdots \times \Omega_{i-1} \times A_i \times \Omega_{i+1} \times \cdots \times \Omega_n$ in the compound experiment. If we want to make the $\mathcal{E}_i$ independent, then intuitively we should have all classes of events $A_1, \dots, A_n$ with $A_i \in \mathcal{A}_i$, independent. This makes sense in the compound experiment. If $P$ is the probability measure defined on the sigma field $\mathcal{A}$ of the compound experiment, that is, the subsets $A$ of $\Omega$ to which we can assign probability(1), we should have
$$P([A_1 \times \Omega_2 \times \cdots \times \Omega_n] \cap [\Omega_1 \times A_2 \times \cdots \times \Omega_n] \cap \cdots \cap [\Omega_1 \times \cdots \times \Omega_{n-1} \times A_n])$$
$$= P(A_1 \times \cdots \times A_n)$$
$$= P(A_1 \times \Omega_2 \times \cdots \times \Omega_n)\, P(\Omega_1 \times A_2 \times \cdots \times \Omega_n) \cdots P(\Omega_1 \times \cdots \times \Omega_{n-1} \times A_n). \tag{A.5.2}$$
If we are given probabilities $P_1$ on $(\Omega_1, \mathcal{A}_1)$, $P_2$ on $(\Omega_2, \mathcal{A}_2)$, ..., $P_n$ on $(\Omega_n, \mathcal{A}_n)$, then (A.5.2) defines $P$ for $A_1 \times \cdots \times A_n$ by
$$P(A_1 \times \cdots \times A_n) = P_1(A_1) \cdots P_n(A_n). \tag{A.5.3}$$
It may be shown (Billingsley, 1995; Chung, 1974; Loève, 1977) that if $P$ is defined by (A.5.3) for events $A_1 \times \cdots \times A_n$, it can be uniquely extended to the sigma field $\mathcal{A}$ specified in note (1) at the end of this appendix. We shall speak of independent experiments $\mathcal{E}_1, \dots, \mathcal{E}_n$ if the $n$ stage compound experiment has its probability structure specified by (A.5.3). In the discrete case (A.5.3) holds provided that
$$P(\{(\omega_1, \dots, \omega_n)\}) = P_1(\{\omega_1\}) \cdots P_n(\{\omega_n\}) \quad \text{for all } \omega_i \in \Omega_i,\ 1 \le i \le n. \tag{A.5.4}$$
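In the discrete case the product construction (A.5.4) is easy to carry out explicitly. The sketch below builds the compound experiment for four independent repetitions of a two-outcome experiment and checks the rectangle formula (A.5.3); the success probability 1/3, the labels "S" and "F", and the function name `product_measure` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def product_measure(components):
    """Point masses of the compound experiment under independence:
    P({(w_1, ..., w_n)}) = P_1({w_1}) ... P_n({w_n}), as in (A.5.4).
    Each component is a dict mapping sample points to probabilities."""
    compound = {}
    for point in product(*[list(c) for c in components]):
        mass = Fraction(1)
        for component, w in zip(components, point):
            mass *= component[w]
        compound[point] = mass
    return compound

p = Fraction(1, 3)                 # hypothetical success probability
trial = {"S": p, "F": 1 - p}       # one component experiment
n = 4
P = product_measure([trial] * n)

assert sum(P.values()) == 1

# Rectangle A_1 x Omega_2 x A_3 x Omega_4 with A_1 = {S}, A_3 = {F}:
# by (A.5.3) its probability is P_1(A_1) * 1 * P_3(A_3) * 1.
rect = sum(m for pt, m in P.items() if pt[0] == "S" and pt[2] == "F")
assert rect == p * (1 - p)

# Summing point masses over the points with exactly k S's gives the
# usual binomial probabilities.
for k in range(n + 1):
    pk = sum(m for pt, m in P.items() if pt.count("S") == k)
    assert pk == comb(n, k) * p**k * (1 - p)**(n - k)

print(len(P), rect)
```

The same function accepts components with different sample spaces, so it also covers compound experiments whose stages are not identical.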
Specifying $P$ when the $\mathcal{E}_i$ are dependent is more complicated. In the discrete case we know $P$ once we have specified $P(\{(\omega_1, \dots, \omega_n)\})$ for each $(\omega_1, \dots, \omega_n)$ with $\omega_i \in \Omega_i$, $i = 1, \dots, n$. By the multiplication rule (A.4.7) we have, in the discrete case, the following.

A.5.5 $P(\{(\omega_1, \dots, \omega_n)\}) = P(\mathcal{E}_1 \text{ has outcome } \omega_1)\, P(\mathcal{E}_2 \text{ has outcome } \omega_2 \mid \mathcal{E}_1 \text{ has outcome } \omega_1) \cdots P(\mathcal{E}_n \text{ has outcome } \omega_n \mid \mathcal{E}_1 \text{ has outcome } \omega_1, \dots, \mathcal{E}_{n-1} \text{ has outcome } \omega_{n-1})$.

The probability structure is determined by these conditional probabilities and conversely.

References
Grimmett and Stirzaker (1992) Sections 1.5, 1.6
Hoel, Port, and Stone (1971) Section 1.5
Parzen (1960) Chapter 3

A.6 BERNOULLI AND MULTINOMIAL TRIALS, SAMPLING WITH AND WITHOUT REPLACEMENT

A.6.1 Suppose that we have an experiment with only two possible outcomes, which we shall denote by $S$ (success) and $F$ (failure). If we assign $P(\{S\}) = p$, we shall refer to such an experiment as a Bernoulli trial with probability of success $p$. The simplest example of such a Bernoulli trial is tossing a coin with probability $p$ of landing heads (success). Other examples will appear naturally in what follows. If we repeat such an experiment $n$ times independently, we say we have performed $n$ Bernoulli trials with success probability $p$. If $\Omega$ is the sample space of the compound experiment, any point $\omega \in \Omega$ is an $n$-dimensional vector of $S$'s and $F$'s and,
$$P(\{\omega\}) = p^{k(\omega)}(1 - p)^{n - k(\omega)} \tag{A.6.2}$$
where $k(\omega)$ is the number of $S$'s appearing in $\omega$. If $A_k$ is the event [exactly $k$ $S$'s occur], then
$$P(A_k) = \binom{n}{k} p^k (1 - p)^{n - k}, \tag{A.6.3}$$
$k = 0, 1, \dots, n$, where
$$\binom{n}{k} = \frac{n!}{k!(n - k)!}.$$
The formula (A.6.3) is known as the binomial probability.

A.6.4 More generally, if an experiment has $q$ possible outcomes $\omega_1, \dots, \omega_q$ and $P(\{\omega_i\}) = p_i$, we refer to such an experiment as a multinomial trial with probabilities $p_1, \dots, p_q$. If the experiment is performed $n$ times independently, the compound experiment is called $n$ multinomial trials with probabilities $p_1, \dots, p_q$. If $\Omega$ is the sample space of this experiment and $\omega \in \Omega$, then
$$P(\{\omega\}) = p_1^{k_1(\omega)} \cdots p_q^{k_q(\omega)} \tag{A.6.5}$$
where $k_i(\omega)$ = number of times $\omega_i$ appears in the sequence $\omega$. If $A_{k_1, \dots, k_q}$ is the event (exactly $k_1$ $\omega_1$'s are observed, exactly $k_2$ $\omega_2$'s are observed, ..., exactly $k_q$ $\omega_q$'s are observed), then
$$P(A_{k_1, \dots, k_q}) = \frac{n!}{k_1! \cdots k_q!}\, p_1^{k_1} \cdots p_q^{k_q} \tag{A.6.6}$$
where the $k_i$ are natural numbers adding up to $n$.

A.6.7 If we perform an experiment given by $(\Omega, \mathcal{A}, P)$ independently $n$ times, we shall sometimes refer to the outcome of the compound experiment as a sample of size $n$ from the population given by $(\Omega, \mathcal{A}, P)$. When $\Omega$ is finite the term, with replacement is added to distinguish this situation from that described in (A.6.8) as follows.

A.6.8 If we have a finite population of cases $\Omega = \{\omega_1, \dots, \omega_N\}$ and we select cases $\omega_i$ successively at random $n$ times without replacement, the component experiments are not independent and, for any outcome $a = (\omega_{i_1}, \dots, \omega_{i_n})$ of the compound experiment,
$$P(\{a\}) = \frac{1}{(N)_n} \tag{A.6.9}$$
where
$$(N)_n = \frac{N!}{(N - n)!}.$$
If the case drawn is replaced before the next drawing, we are sampling with replacement, and the component experiments are independent and $P(\{a\}) = 1/N^n$. If $Np$ of the members of $\Omega$ have a "special" characteristic $S$ and $N(1 - p)$ have the opposite characteristic $F$ and $A_k$ = (exactly $k$ "special" individuals are obtained in the sample), then
$$P(A_k) = \binom{n}{k} \frac{(Np)_k\,(N(1 - p))_{n - k}}{(N)_n} = \frac{\binom{Np}{k}\binom{N(1 - p)}{n - k}}{\binom{N}{n}} \tag{A.6.10}$$
for $\max(0, n - N(1 - p)) \le k \le \min(n, Np)$, and $P(A_k) = 0$ otherwise. The formula (A.6.10) is known as the hypergeometric probability.

References
Gnedenko (1967) Chapter 2, Section 11
Hoel, Port, and Stone (1971) Section 2.4
Parzen (1960) Chapter 3, Sections 1-4
Pitman (1993) Section 2.1

A.7 PROBABILITIES ON EUCLIDEAN SPACE

Random experiments whose outcomes are real numbers play a central role in theory and practice. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space.
We shall use the notation $R^k$ for $k$-dimensional Euclidean space and denote members of $R^k$ by symbols such as $\mathbf{x}$ or $(x_1, \dots, x_k)'$, where $(\ )'$ denotes transpose.

A.7.1 If $(a_1, b_1), \dots, (a_k, b_k)$ are $k$ open intervals, we shall call the set $(a_1, b_1) \times \cdots \times (a_k, b_k) = \{(x_1, \dots, x_k) : a_i < x_i < b_i,\ 1 \le i \le k\}$ an open $k$ rectangle.

A.7.2 The Borel field in $R^k$, which we denote by $\mathcal{B}^k$, is defined to be the smallest sigma field having all open $k$ rectangles as members. Any subset of $R^k$ we might conceivably be interested in turns out to be a member of $\mathcal{B}^k$. We will write $R$ for $R^1$ and $\mathcal{B}$ for $\mathcal{B}^1$.

A.7.3 A discrete (probability) distribution on $R^k$ is a probability measure $P$ such that $\sum_{i=1}^{\infty} P(\{\mathbf{x}_i\}) = 1$ for some sequence of points $\{\mathbf{x}_i\}$ in $R^k$. That is, only an $\mathbf{x}_i$ can occur as an outcome of the experiment. This definition is consistent with (A.3.1) because the study of this model and that of the model that has $\Omega = \{\mathbf{x}_1, \dots, \mathbf{x}_n, \dots\}$ are equivalent.

The frequency function $p$ of a discrete distribution is defined on $R^k$ by
$$p(\mathbf{x}) = P(\{\mathbf{x}\}). \tag{A.7.4}$$
Conversely, any nonnegative function $p$ on $R^k$ vanishing except on a sequence $\{\mathbf{x}_1, \dots, \mathbf{x}_n, \dots\}$ of vectors and that satisfies $\sum_{i=1}^{\infty} p(\mathbf{x}_i) = 1$ defines a unique discrete probability distribution by the relation
$$P(A) = \sum_{\mathbf{x}_i \in A} p(\mathbf{x}_i). \tag{A.7.5}$$

A.7.6 A nonnegative function $p$ on $R^k$, which is integrable and which has
$$\int_{R^k} p(\mathbf{x})\,d\mathbf{x} = 1,$$
where $d\mathbf{x}$ denotes $dx_1 \cdots dx_k$, is called a density function. Integrals should be interpreted in the sense of Lebesgue. However, for practical purposes, Riemann integrals are adequate.

A.7.7 A continuous probability distribution on $R^k$ is a probability $P$ that is defined by the relation
$$P(A) = \int_A p(\mathbf{x})\,d\mathbf{x} \tag{A.7.8}$$
for some density function $p$ and all events $A$. $P$ defined by (A.7.8) are usually called absolutely continuous. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely. It may be shown that a function $P$ so defined satisfies (A.1.4). Recall that the integral on the right of (A.7.8) is by definition
$$\int_{R^k} 1_A(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$
where $1_A(\mathbf{x}) = 1$ if $\mathbf{x} \in A$, and 0 otherwise. Geometrically, $P(A)$ is the volume of the "cylinder" with base $A$ and height $p(\mathbf{x})$ at $\mathbf{x}$. An important special case of (A.7.8) is given by
$$P((a_1, b_1) \times \cdots \times (a_k, b_k)) = \int_{a_k}^{b_k} \cdots \int_{a_1}^{b_1} p(\mathbf{x})\,dx_1 \cdots dx_k. \tag{A.7.9}$$
It turns out that a continuous probability distribution determines the density that generates it "uniquely."(1)

Although in a continuous model $P(\{\mathbf{x}\}) = 0$ for every $\mathbf{x}$, the density function has an operational interpretation close to that of the frequency function. For instance, if $p$ is a continuous density on $R$, $x_0$ and $x_1$ are in $R$, and $h$ is close to 0, then by the mean value theorem
$$P([x_0 - h, x_0 + h]) \approx 2h\,p(x_0) \quad \text{and} \quad \frac{P([x_0 - h, x_0 + h])}{P([x_1 - h, x_1 + h])} \approx \frac{p(x_0)}{p(x_1)}. \tag{A.7.10}$$
The ratio $p(x_0)/p(x_1)$ can, thus, be thought of as measuring approximately how much more or less likely we are to obtain an outcome in a neighborhood of $x_0$ than one in a neighborhood of $x_1$.

A.7.11 The distribution function (d.f.) $F$ is defined by
$$F(x_1, \dots, x_k) = P((-\infty, x_1] \times \cdots \times (-\infty, x_k]). \tag{A.7.12}$$
The d.f. defines $P$ in the sense that if $P$ and $Q$ are two probabilities with the same d.f., then $P = Q$. When $k = 1$, $F$ is a function of a real variable characterized by the following properties:
$$0 \le F \le 1 \tag{A.7.13}$$
$$x \le y \Rightarrow F(x) \le F(y) \quad \text{(Monotone)} \tag{A.7.14}$$
$$x_n \downarrow x \Rightarrow F(x_n) \to F(x) \quad \text{(Continuous from the right)} \tag{A.7.15}$$
$$\lim_{x \to \infty} F(x) = 1, \qquad \lim_{x \to -\infty} F(x) = 0. \tag{A.7.16}$$
It may be shown that any function $F$ satisfying (A.7.13)-(A.7.16) defines a unique $P$ on the real line. We always have
$$F(x) - F(x - 0)(2) = P(\{x\}). \tag{A.7.17}$$
Thus, $F$ is continuous at $x$ if and only if $P(\{x\}) = 0$.

References
Gnedenko (1967) Chapter 4, Sections 21, 22
Hoel, Port, and Stone (1971) Sections 3.1, 3.2, 5.1, 5.2
Parzen (1960) Chapter 4, Sections 1-4, 7
Pitman (1993) Sections 3.4, 4.1 and 4.5
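The local interpretation of a density in (A.7.10) and the d.f. properties (A.7.13)-(A.7.16) can be checked numerically for a specific distribution. The sketch below assumes the standard exponential density $p(x) = e^{-x}$, $x \ge 0$, with d.f. $F(x) = 1 - e^{-x}$; this choice and the helper names are illustrative assumptions, not taken from the text.

```python
import math

def density(x):
    """Standard exponential density (an assumed example)."""
    return math.exp(-x) if x >= 0 else 0.0

def cdf(x):
    """The corresponding distribution function F(x) = P((-inf, x])."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def interval_prob(a, b):
    """P([a, b]) = F(b) - F(a); endpoints carry no mass in a continuous model."""
    return cdf(b) - cdf(a)

h = 1e-3
x0, x1 = 0.5, 2.0

exact = interval_prob(x0 - h, x0 + h)
approx = 2 * h * density(x0)                     # first part of (A.7.10)
ratio_exact = exact / interval_prob(x1 - h, x1 + h)
ratio_density = density(x0) / density(x1)        # second part of (A.7.10)
print(exact, approx)
print(ratio_exact, ratio_density)

# Spot checks of (A.7.13)-(A.7.16) on a grid of points.
grid = [i / 10 for i in range(-50, 201)]
values = [cdf(x) for x in grid]
assert all(0.0 <= v <= 1.0 for v in values)                 # (A.7.13)
assert all(u <= v for u, v in zip(values, values[1:]))      # (A.7.14)
assert cdf(-10.0) == 0.0 and abs(cdf(50.0) - 1.0) < 1e-12   # (A.7.16), approximately
```

Shrinking `h` makes the approximation in (A.7.10) tighter, up to the limits of floating-point subtraction in `interval_prob`.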
A.8 RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS

Although sample spaces can be very diverse, the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. For example, we measure the weight of pigs drawn at random from a population, the time to breakdown and length of repair time for a randomly chosen machine, the yield per acre of a field of wheat in a given year, the concentration of a certain pollutant in the atmosphere, and so on. In the probability model, these quantities will correspond to random variables and vectors.

A.8.1 A random variable $X$ is a function from $\Omega$ to $R$ such that the set $\{\omega : X(\omega) \in B\} = X^{-1}(B)$ is in $\mathcal{A}$ for every $B \in \mathcal{B}$.(1)

A.8.2 A random vector $\mathbf{X} = (X_1, \dots, X_k)^T$ is a $k$-tuple of random variables, or equivalently a function from $\Omega$ to $R^k$ such that the set $\{\omega : \mathbf{X}(\omega) \in B\} = \mathbf{X}^{-1}(B)$ is in $\mathcal{A}$ for every $B \in \mathcal{B}^k$.(1) For $k = 1$ random vectors are just random variables. The event $\mathbf{X}^{-1}(B)$ will usually be written $[\mathbf{X} \in B]$ and $P([\mathbf{X} \in B])$ will be written $P[\mathbf{X} \in B]$.

The probability distribution of a random vector $\mathbf{X}$ is, by definition, the probability measure $P_{\mathbf{X}}$ in the model $(R^k, \mathcal{B}^k, P_{\mathbf{X}})$ given by
$$P_{\mathbf{X}}(B) = P[\mathbf{X} \in B]. \tag{A.8.3}$$

A.8.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete. Similarly, we will refer to the frequency function, density, d.f., and so on of a random vector when we are, in fact, referring to those features of its probability distribution. The subscript $X$ or $\mathbf{X}$ will be used for densities, d.f.'s, and so on to indicate which vector or variable they correspond to unless the reference is clear from the context in which case they will be omitted.

The probability of any event that is expressible purely in terms of $\mathbf{X}$ can be calculated if we know only the probability distribution of $\mathbf{X}$. In the discrete case this means we need only know the frequency function and in the continuous case the density. Thus, from (A.7.5) and (A.7.8)
$$P[\mathbf{X} \in A] = \begin{cases} \sum_{\mathbf{x} \in A} p(\mathbf{x}), & \text{if } \mathbf{X} \text{ is discrete} \\ \int_A p(\mathbf{x})\,d\mathbf{x}, & \text{if } \mathbf{X} \text{ is continuous.} \end{cases} \tag{A.8.5}$$
When we are interested in particular random variables or vectors, we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined.

The study of real- or vector-valued functions of a random vector $\mathbf{X}$ is central in the theory of probability and of statistics. Here is the formal definition of such transformations. Let $\mathbf{g}$ be any function from $R^k$ to $R^m$, $k, m \ge 1$, such that(2) $\mathbf{g}^{-1}(B) = \{\mathbf{y} \in R^k : \mathbf{g}(\mathbf{y}) \in B\} \in \mathcal{B}^k$ for every $B \in \mathcal{B}^m$. Then the random transformation $\mathbf{g}(\mathbf{X})$ is defined by
$$\mathbf{g}(\mathbf{X})(\omega) = \mathbf{g}(\mathbf{X}(\omega)). \tag{A.8.6}$$
An example of a transformation often used in statistics is $\mathbf{g} = (g_1, g_2)'$ with $g_1(\mathbf{X}) = k^{-1} \sum_{i=1}^{k} X_i = \bar{X}$ and $g_2(\mathbf{X}) = k^{-1} \sum_{i=1}^{k} (X_i - \bar{X})^2$. Another common example is $\mathbf{g}(\mathbf{X}) = (\min\{X_i\}, \max\{X_i\})'$.

The probability distribution of $\mathbf{g}(\mathbf{X})$ is completely determined by that of $\mathbf{X}$ through
$$P[\mathbf{g}(\mathbf{X}) \in B] = P[\mathbf{X} \in \mathbf{g}^{-1}(B)]. \tag{A.8.7}$$
If $\mathbf{X}$ is discrete with frequency function $p_{\mathbf{X}}$, then $\mathbf{g}(\mathbf{X})$ is discrete and has frequency function
$$p_{\mathbf{g}(\mathbf{X})}(\mathbf{t}) = \sum_{\{\mathbf{x} : \mathbf{g}(\mathbf{x}) = \mathbf{t}\}} p_{\mathbf{X}}(\mathbf{x}). \tag{A.8.8}$$

Suppose that $X$ is continuous with density $p_X$ and $g$ is real-valued and one-to-one(3) on an open set $S$ such that $P[X \in S] = 1$. Furthermore, assume that the derivative $g'$ of $g$ exists and does not vanish on $S$. Then $g(X)$ is continuous with density given by
$$p_{g(X)}(t) = \frac{p_X(g^{-1}(t))}{|g'(g^{-1}(t))|} \tag{A.8.9}$$
for $t \in g(S)$, and 0 otherwise. This is called the change of variable formula. If $g(X) = \sigma X + \mu$, $\sigma \ne 0$, and $X$ is continuous, then
$$p_{g(X)}(t) = \frac{1}{|\sigma|}\, p_X\Big(\frac{t - \mu}{\sigma}\Big). \tag{A.8.10}$$

From (A.8.8) it follows that if $(X, Y)^T$ is a discrete random vector with frequency function $p_{(X,Y)}$, then the frequency function of $X$, known as the marginal frequency function, is given by(4)
$$p_X(x) = \sum_{y} p_{(X,Y)}(x, y). \tag{A.8.11}$$
Similarly, if $(X, Y)^T$ is continuous with density $p_{(X,Y)}$, it may be shown (as a consequence of (A.8.7) and (A.7.8)) that $X$ has a marginal density function given by
$$p_X(x) = \int_{-\infty}^{\infty} p_{(X,Y)}(x, y)\,dy.(5) \tag{A.8.12}$$
These notions generalize to the case $\mathbf{Z} = (\mathbf{X}, \mathbf{Y})$, a random vector obtained by putting two random vectors together. The (marginal) frequency or density of $\mathbf{X}$ is found as in (A.8.11) and (A.8.12) by summing or integrating out over $\mathbf{y}$ in $p_{(\mathbf{X},\mathbf{Y})}(\mathbf{x}, \mathbf{y})$.

Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa.
In practice, all random variables are discrete because there is no