Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Bickel, Peter J.
  Mathematical statistics: basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
    p. cm.
  Includes bibliographical references and index.
  ISBN 0-13-850363-X (v. 1)
  1. Mathematical statistics. I. Doksum, Kjell A. II. Title.

QA276.B47 2001
519.5 dc21    00-031377
Acquisition Editor: Kathleen Boothby Sestak Editor in Chief: Sally Yagan Assistant Vice President of Production and Manufacturing: David W. Riccardi Executive Managing Editor: Kathleen Schiaparelli Senior Managing Editor: Linda Mihatov Behrens Production Editor: Bob Walters Manufacturing Buyer: Alan Fischer Manufacturing Manager: Trudy Pisciotti Marketing Manager: Angela Battle Marketing Assistant: Vince Jansen Director of Marketing: John Tweeddale Editorial Assistant: Joanne Wendelken Art Director: Jayne Conte Cover Design: Jayne Conte
© 2001, 1977 by Prentice-Hall, Inc., Upper Saddle River, New Jersey 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher. Printed in the United States of America. 10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
." i
~I
"I
~ !
,'
,
I,
,.~~
_.

..
CONTENTS
PREFACE TO THE SECOND EDITION: VOLUME I   xiii

PREFACE TO THE FIRST EDITION   xvii

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA   1
  1.1 Data, Models, Parameters, and Statistics   1
    1.1.1 Data and Models   1
    1.1.2 Parametrizations and Parameters   6
    1.1.3 Statistics as Functions on the Sample Space   8
    1.1.4 Examples, Regression Models   9
  1.2 Bayesian Models   12
  1.3 The Decision Theoretic Framework   16
    1.3.1 Components of the Decision Theory Framework   17
    1.3.2 Comparison of Decision Procedures   24
    1.3.3 Bayes and Minimax Criteria   26
  1.4 Prediction   32
  1.5 Sufficiency   41
  1.6 Exponential Families   49
    1.6.1 The One-Parameter Case   49
    1.6.2 The Multiparameter Case   53
    1.6.3 Building Exponential Families   56
    1.6.4 Properties of Exponential Families   58
    1.6.5 Conjugate Families of Prior Distributions   62
  1.7 Problems and Complements   66
  1.8 Notes   95
  1.9 References   96

2 METHODS OF ESTIMATION   99
  2.1 Basic Heuristics of Estimation   99
    2.1.1 Minimum Contrast Estimates; Estimating Equations   99
    2.1.2 The Plug-In and Extension Principles   102
  2.2 Minimum Contrast Estimates and Estimating Equations   107
    2.2.1 Least Squares and Weighted Least Squares   107
    2.2.2 Maximum Likelihood   114
  2.3 Maximum Likelihood in Multiparameter Exponential Families   121
  *2.4 Algorithmic Issues   127
    2.4.1 The Method of Bisection   127
    2.4.2 Coordinate Ascent   129
    2.4.3 The Newton-Raphson Algorithm   132
    2.4.4 The EM (Expectation/Maximization) Algorithm   133
  2.5 Problems and Complements   138
  2.6 Notes   158
  2.7 References   159

3 MEASURES OF PERFORMANCE   161
  3.1 Introduction   161
  3.2 Bayes Procedures   161
  3.3 Minimax Procedures   170
  *3.4 Unbiased Estimation and Risk Inequalities   176
    3.4.1 Unbiased Estimation, Survey Sampling   176
    3.4.2 The Information Inequality   179
  *3.5 Nondecision Theoretic Criteria   188
    3.5.1 Computation   188
    3.5.2 Interpretability   189
    3.5.3 Robustness   190
  3.6 Problems and Complements   197
  3.7 Notes   210
  3.8 References   211

4 TESTING AND CONFIDENCE REGIONS   213
  4.1 Introduction   213
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma   223
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models   227
  4.4 Confidence Bounds, Intervals, and Regions   233
  4.5 The Duality Between Confidence Regions and Tests   241
  *4.6 Uniformly Most Accurate Confidence Bounds   248
  *4.7 Frequentist and Bayesian Formulations   251
  4.8 Prediction Intervals   252
  4.9 Likelihood Ratio Procedures   255
    4.9.1 Introduction   255
    4.9.2 Tests for the Mean of a Normal Distribution: Matched Pair Experiments   257
    4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations   261
    4.9.4 The Two-Sample Problem with Unequal Variances   264
    4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions   266
  4.10 Problems and Complements   269
  4.11 Notes   295
  4.12 References   295

5 ASYMPTOTIC APPROXIMATIONS   297
  5.1 Introduction: The Meaning and Uses of Asymptotics   297
  5.2 Consistency   301
    5.2.1 Plug-In Estimates and MLEs in Exponential Family Models   301
    5.2.2 Consistency of Minimum Contrast Estimates   304
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications   306
    5.3.1 The Delta Method for Moments   306
    5.3.2 The Delta Method for In Law Approximations   311
    5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families   322
  5.4 Asymptotic Theory in One Dimension   324
    5.4.1 Estimation: The Multinomial Case   324
    *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates   327
    *5.4.3 Asymptotic Normality and Efficiency of the MLE   331
    *5.4.4 Testing   332
    *5.4.5 Confidence Bounds   336
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution   337
  5.6 Problems and Complements   345
  5.7 Notes   362
  5.8 References   363

6 INFERENCE IN THE MULTIPARAMETER CASE   365
  6.1 Inference for Gaussian Linear Models   365
    6.1.1 The Classical Gaussian Linear Model   366
    6.1.2 Estimation   369
    6.1.3 Tests and Confidence Intervals   374
  *6.2 Asymptotic Estimation Theory in p Dimensions   383
    6.2.1 Estimating Equations   384
    6.2.2 Asymptotic Normality and Efficiency of the MLE   386
    6.2.3 The Posterior Distribution in the Multiparameter Case   391
  *6.3 Large Sample Tests and Confidence Regions   392
    6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic   392
    6.3.2 Wald's and Rao's Large Sample Tests   398
  *6.4 Large Sample Methods for Discrete Data   400
    6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's χ² Test   401
    6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables   403
    6.4.3 Logistic Regression for Binary Responses   408
  *6.5 Generalized Linear Models   411
  *6.6 Robustness Properties and Semiparametric Models   417
  6.7 Problems and Complements   422
  6.8 Notes   438
  6.9 References   438

A A REVIEW OF BASIC PROBABILITY THEORY   441
  A.1 The Basic Model   441
  A.2 Elementary Properties of Probability Models   443
  A.3 Discrete Probability Models   443
  A.4 Conditional Probability and Independence   444
  A.5 Compound Experiments   446
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement   447
  A.7 Probabilities on Euclidean Space   448
  A.8 Random Variables and Vectors: Transformations   451
  A.9 Independence of Random Variables and Vectors   453
  A.10 The Expectation of a Random Variable   454
  A.11 Moments   456
  A.12 Moment and Cumulant Generating Functions   459
  A.13 Some Classical Discrete and Continuous Distributions   460
  A.14 Modes of Convergence of Random Variables and Limit Theorems   466
  A.15 Further Limit Theorems and Inequalities   468
  A.16 Poisson Process   472
  A.17 Notes   474
  A.18 References   475

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS   477
  B.1 Conditioning by a Random Variable or Vector   477
    B.1.1 The Discrete Case   477
    B.1.2 Conditional Expectation for Discrete Variables   479
    B.1.3 Properties of Conditional Expected Values   480
    B.1.4 Continuous Variables   482
    B.1.5 Comments on the General Case   484
  B.2 Distribution Theory for Transformations of Random Vectors   485
    B.2.1 The Basic Framework   485
    B.2.2 The Gamma and Beta Distributions   488
  B.3 Distribution Theory for Samples from a Normal Population   491
    B.3.1 The χ², F, and t Distributions   491
    B.3.2 Orthogonal Transformations   494
  B.4 The Bivariate Normal Distribution   497
  B.5 Moments of Random Vectors and Matrices   502
    B.5.1 Basic Properties of Expectations   502
    B.5.2 Properties of Variance   503
  B.6 The Multivariate Normal Distribution   506
    B.6.1 Definition and Density   506
    B.6.2 Basic Properties. Conditional Distributions   508
  B.7 Convergence for Random Vectors: O_P and o_P Notation   511
  B.8 Multivariate Calculus   516
  B.9 Convexity and Inequalities   518
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory   519
    B.10.1 Symmetric Matrices   519
    B.10.2 Order on Symmetric Matrices   520
    B.10.3 Elementary Hilbert Space Theory   521
  B.11 Problems and Complements   524
  B.12 Notes   538
  B.13 References   539

C TABLES   541
  Table I The Standard Normal Distribution   542
  Table I' Auxiliary Table of the Standard Normal Distribution   543
  Table II t Distribution Critical Values   544
  Table III χ² Distribution Critical Values   545
  Table IV F Distribution Critical Values   546

INDEX   547
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared, statistics has changed enormously under the impact of several forces:

(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.

(2) The generation of enormous amounts of data: terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.

(3) The possibility of implementing computations of a magnitude that would have once been unthinkable.

The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data. As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:

(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with an "infinite" number of parameters.

(2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.

(3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo.
some methods run quickly in real time. From the beginning we stress functionvalued parameters. reflecting what we now teach our graduate students. such as the empirical distribution function. Appendix B is as selfcontained as possible with proofs of mOst statements.and semipararnetric models. Volume [ covers the malerial of Chapters 16 and Chapter 10 of the first edition with pieces of Chapters 710 and includes Appendix A on basic probability theory. (6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories. we do not require measure theory but assume from the start that our models are what we call "regular. which includes more advanced topics from probability theory such as the multivariate Gaussian distribution. As a consequence our second edition. such as the density." That is. each to be only a little shorter than the first edition. we assume either a discrete probability whose support does not depend On the parameter set. covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing concluSIons. Others do not and some though theoretically attractive cannot be implemented in a human lifetime. The reason for these additions are the changes in subject matter necessitated by the current areas of importance in the field. instead of beginning with parametrized models we include from the start non. our focus and order of presentation have changed. Our one long book has grown to two volumes. Specifically. of course. is much changed from the first. but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on Rd as well as an elementary introduction to Hilbert space theory. problems. and probability inequalities as well as more advanced topics in matrix theory and analysis. Volume I. These will not be dealt with in OUr work. Despite advances in computing speed. and functionvalued statistics. In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. which we present in 2000. and references to the literature for proofs of the deepest results such as the spectral theorem. However.xiv Preface to the Second Edition: Volume I (4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious. been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. However. We . Hilbert space theory is not needed. Chapter 1 now has become part of a larger Appendix B. then go to parameters and parametric models stressing the role of identifiability. or the absolutely continuous case with a density. (5) The study of the interplay between numerical and statistical considerations. weak convergence in Euclidean spaces. As in the first edition. There have.
the Wald and Rao statistics and associated confidence regions. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the mOre advanced topics of Volume II. Other novel features of this chapter include a detailed analysis including proofs of convergence of a standard but slow algorithm for computing MLEs in muitiparameter exponential families and ail introduction to the EM algorithm. if somewhat augmented. Wilks theorem on the asymptotic distribution of the likelihood ratio test. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs).Preface to the Second Edition: Volume I xv also. Chapter 5 of the new edition is devoted to asymptotic approximations. there is clearly much that could be omitted at a first reading that we also star. One of the main ingredients of most modem algorithms for inference. inference in the general linear model. Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions. and some parailels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. There are clear dependencies between starred . Nevertheless. Although we believe the material of Chapters 5 and 6 has now become fundamental. Included are asymptotic normality of maximum likelihood estimates. which parallels Chapter 2 of the first edition. including a complete study of MLEs in canonical kparameter exponential families. are an extended discussion of prediction and an expanded introduction to kparameter exponential families. Save for these changes of emphasis the other major new elements of Chapter 1. There is more material on Bayesian models and analysis. include examples that are important in applications. Finaliy. The conventions established on footnotes and notation in the first edition remain. including some optimality theory for estimation as well and elementary robustness considerations. Robustness from an asymptotic theory point of view appears also. Also new is a section relating Bayesian and frequentist inference via the Bemsteinvon Mises theorem. from the start. such as regression experiments. Generalized linear models are introduced as examples. Chapter 6 is devoted to inference in multivariate (multiparameter) models. These objects that are the building blocks of most modem models require concepts involving moments of random vectors and convexity that are given in Appendix B. Chapter 2 of this edition parallels Chapter 3 of the first artd deals with estimation. Chapters 14 develop the basic principles and examples of statistics. Almost all the previous ones have been kept with an approximately equal number of new ones addedto correspond to our new topics and point of view. As in the first edition problems playa critical role by elucidating and often substantially expanding the text. we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis.
• with the field.5 I. Michael Jordan. and the classical nonparametric k sample and independence problems will be included.4. are weak. particnlarly Jianging Fan. Yoram Gat for proofreading that found not only typos but serious errors.3 ~ 6. greatly extending the material in Chapter 8 of the first edition. Jianhna Hnang. Bickel bickel@stat.XVI • Pref3ce to the Second Edition: Volume I sections that follow. encouragement. and Carl Spruill and the many students who were guinea pigs in the basic theory course at Berkeley. A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo. Semiparametric estimation and testing will be considered more generally. We also expect to discuss classification and model selection using the elementary theory of empirical processes. 5. Michael Ostland and Simon Cawley for producing the graphs. Nancy Kramer Bickel and Joan H. The basic asymptotic tools that will be developed or presented.4 ~ 6. and active participation in an enterprise that at times seemed endless. We also thank Faye Yeager for typing. j j Peter J. taken on a new life. other transformation models. The topic presently in Chapter 8. will be studied in the context of nonparametric function estimation.2 ~ 5.4. Examples of application such as the Cox model in survival analysis.berkeley. For the first volume of the second edition we would like to add thanks to new colleagnes.3 ~ 6.2 ~ 6. Ying Qing Chen. and Prentice Hall for generous production support. Fujimura. and our families for support. 6. elementary empirical process theory. in part in appendices. convergence for random processes. density estimation. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance. With the tools and concepts developed in this second volume students will be ready for advanced research in modem statistics.edn .6 Volume II is expected to be forthcoming in 2003. and the functional delta method. Last and most important we would like to thank our wives.berkeley. in part in the text and. appeared gratifyingly ended in 1976 but has.edn Kjell Doksnm doksnm@stat.
The work of Rao. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior. (4) Show how the ideas aod results apply in a variety of important subfields such as Gaussian linear mOdels. multinomial models. we feel that none has quite the mix of coverage and depth desirable at this level. the treatment is abridged with few proofs and no examples or problems. and the GaussMarkoff theorem. Our appendix does give all the probability that is needed. Hoel. 2nd ed. the physical sciences. we need probability theory and expect readers to have had a course at the level of. which go from modeling through estimation and testing to linear models. and nonparametric models. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated. (2) Give careful proofs of the major "elementary" results such as the NeymanPearson lemma. we select topics from xvii . Introduction to Mathematical Statistics. Port. the Lehmann5cheffe theorem. Be cause the book is an introduction to statistics. PREFACE TO THE FIRST EDITION This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. Although there are several good books available for tbis purpose. and the structure of both Bayes and admissible solutions in decision theory. and engineering that we have taught we cover the core Chapters 2 to 7.. statistics. for instance. Our book contains more material than can be covered in tw~ qp.arters. However. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig. 3rd 00.. the information inequality. In the twoquarter courses for graduate students in mathematics. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Finally. Linear Statistical Inference and Its Applications. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). and Stone's Introduction to Probability Theory. We feel such an introduction should at least do the following: (1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice. (3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates. covers most of the material we do and much more but at a more abstract level employing measure theory.
Chen. distribution functions. Gupta.xviii Preface to the First Edition Chapter 8 on discrete data and Chapter 9 on nonpararnetric models. Bickel Kjell Doksum Berkeley /976 : . or it may be included at the end of an introductory probability course that precedes the statistics course. G. and stimUlating lectures of Joe Hodges and Chuck Bell. J. Chapter 1 covers probability theory rather than statistics. Gray. (i) Various notational conventions and abbreviations are used in the text. final draft) through which this book passed. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers. Without Winston Chow's lovely plots Section 9. . densities. R. It may be integrated with the material of Chapters 27 as the course proceeds rather than being given at the start. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. caught mOre mistakes than both authors together. Scholz. Quang. The foundation of oUr statistical knowledge was obtained in the lucid. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text. We would also like tn thank tlle colleagues and friends who Inspired and helped us to enter the field of statistics.5 was discovered by F. respectively. X. i .2. These comments are ordered by the section to which they pertain. W. students. enthusiastic. Minassian who sent us an exhaustive and helpful listing. E. Peler J. and A Samulon. 2 for the second. Among many others who helped in the same way we would like to mention C. and so On. and moments is established in the appendix. Chou. preliminary edition. Pyke's careful reading of a nexttofinal version caught a number of infelicities of style and content Many careless mistakes and typographical errors in an earlier version were caught by D. I for the first. Cannichael. reservations. Later we were both very much influenced by Erich Lehmann whose ideas are strongly rellected in this hook. They need to be read only as the reader's curiosity is piqued. P. Lehmann's wise advice has played a decisive role at many points. A serious error in Problem 2. C. and friends who helped us during the various stageS (notes.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day. I . A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text. Drew. (iii) Basic notation for probabilistic objects such as random variables and vectors. S. L. in proofreading the final version. The comments contain digressions. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. We would like to acknowledge our indebtedness to colleagues. Conventions: (i) In order to minimize the number of footnotes we have added a section of comments at the end of each Chapter preceding the problem section. U. A special feature of the book is its many problems. and additional references.
Mathematical Statistics Basic Ideas and Selected Topics Volume I Second Edition .
Chapter 1

STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA

1.1 DATA, MODELS, PARAMETERS, AND STATISTICS

1.1.1 Data and Models

Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor. Data can consist of:

(1) Vectors of scalars, measurements, and/or characters, for example, a single time series of measurements.

(2) Matrices of scalars and/or characters, for example, digitized pictures or more routinely measurements of covariates and response on a set of n individuals; see Example 1.1.4 and Sections 2.2.1 and 6.1.

(3) Arrays of scalars and/or characters as in contingency tables (see Chapter 6) or more generally multifactor multiresponse data on a number of individuals.

(4) All of the above and more, in particular, functions as in signal processing, trees as in evolutionary phylogenies, and so on.

The goals of science and society, which statisticians share, are to draw useful information from data using everything that we know. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically.

A detailed discussion of the appropriateness of the models we shall discuss in particular situations is beyond the scope of this book, but we will introduce general model diagnostic tools in Volume 2, Chapter 1. Moreover, we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. A generic source of trouble often called gross errors is discussed in greater detail in the section on robustness (Section 3.5.3). In any case all our models are generic and, as usual, "The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. Subject matter specialists usually have to be principal guides in model formulation.
A priori, in the words of George Box (1979), "Models of course, are never true but fortunately it is only necessary that they be useful." In this book we will study how, starting with tentative models:

(1) We can conceptualize the data structure and our goals more precisely. We begin this in the simple examples that follow and continue in Sections 1.2-1.5 and throughout the book.

(2) We can derive methods of extracting useful information from data and, in particular, give methods that assess the generalizability of experimental results. For instance, if we observe an effect in our data, to what extent can we expect the same effect more generally? Estimation, testing, confidence regions, and more general procedures will be discussed in Chapters 2-4.

(3) We can assess the effectiveness of the methods we propose. We begin this discussion with decision theory in Section 1.3 and continue with optimality principles in Chapters 3 and 4.

(4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. Goodness of fit tests, robustness, and diagnostics are discussed in Volume 2.

(5) We can be guided to alternative or more general descriptions that might fit better. Hierarchies of models are discussed throughout.

Here are some examples:

(a) We are faced with a population of N elements, for instance, a shipment of manufactured items. An unknown number Nθ of these elements are defective. It is too expensive to examine all of the items. So to get information about θ, a sample of n is drawn without replacement and inspected. The data gathered are the number of defectives found in the sample.

(b) We want to study how a physical or economic feature, for example, height or income, is distributed in a large population. An exhaustive census is impossible, so the study is based on measurements on a sample of n individuals drawn at random from the population. The population is so large that, for modeling purposes, we approximate the actual process of sampling without replacement by sampling with replacement.

(c) An experimenter makes n independent determinations of the value of a physical constant μ. His or her measurements are subject to random fluctuations (error) and the data can be thought of as μ plus some random errors.

(d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee, reducing pollution, treating a disease, producing energy, learning a maze, and so on. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. For instance, we can assign two drugs, A to m, and B to n, randomly selected patients and then measure temperature and blood pressure, have the patients rated qualitatively for improvement by physicians, and so on. In this manner, we obtain one or more quantitative or qualitative measures of efficacy from each experiment.
Random variability here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.

We shall use these examples to arrive at our formulation of statistical models and to indicate some of the difficulties of constructing such models. First consider situation (a), which we refer to as:

Example 1.1.1. Sampling Inspection. The mathematical model suggested by the description is well defined. A random experiment has been performed. The sample space consists of the numbers 0, 1, ..., n corresponding to the number of defective items found. On this space we can define a random variable X given by X(k) = k, k = 0, 1, ..., n. If Nθ is the number of defective items in the population sampled, then by (A.13.6) X has an hypergeometric, H(Nθ, N, n), distribution, that is,

    P[X = k] = (Nθ choose k)(N(1 − θ) choose n − k) / (N choose n), if max(n − N(1 − θ), 0) ≤ k ≤ min(Nθ, n).    (1.1.1)

The main difference that our model exhibits from the usual probability model is that Nθ is unknown and, in principle, can take on any value between 0 and N. Thus, although the sample space is well defined, we cannot specify the probability structure completely but rather only give a family {H(Nθ, N, n)} of probability distributions for X, any one of which could have generated the data actually observed. □
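The family {H(Nθ, N, n)} of Example 1.1.1 is easy to explore numerically. The sketch below is not from the text; it assumes an illustrative shipment of N = 100 items, a sample of n = 10, and a hypothetical value Nθ = 5 for the unknown number of defectives, and it uses scipy's hypergeometric distribution to tabulate the resulting law of X.

```python
from scipy.stats import hypergeom

# Illustrative values only: shipment size N, sample size n, and a hypothetical
# number of defectives N*theta in the shipment.
N, n, N_theta = 100, 10, 5

# scipy's argument order is (M, n, N) = (population size, number of defectives
# in the population, sample size drawn).
X = hypergeom(N, N_theta, n)

for k in range(min(n, N_theta) + 1):
    print(f"P(X = {k}) = {X.pmf(k):.4f}")
```

Repeating the tabulation for several trial values of Nθ gives a concrete picture of the whole family of candidate distributions, any one of which could have produced the observed count.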
•. will have none of this. We let the y's denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo. and Performance Criteria Chapter 1 (2) The distribution of the error at one determination is the same as that at another. E}. . heights of individuals or log incomes. The Gaussian distribution. The classical def~ult model is: (4) The common distribution of the errors is N(o. be the responses of m subjects having a given disease given drug A and n other similarly diseased subjects given drug B.. ... . tbe set of F's we postulate. It is important to remember that these are assumptions at best only approximately valid. if we let G be the distribution function of f 1 and F that of Xl.t + ~. £n are identically distributed. Thus. TwoSample Models. the Xi are a sample from a N(J. Natural initial assumptions here are: (1) The x's and y's are realizations of Xl. A placebo is a substance such as water tpat is expected to have no effect on the disease and is used to correct for the welldocumented placebo effect.3. or by {(I"." Now consider situation (d). Then if F is the N(I". that is. Equivalently Xl. 1 Yn a sample from G. G) pairs.. 1 X Tn . Example 1. if drug A is a standard or placebo. we have specified the Gaussian two sample model with equal variances.4 Statistical Models. so that the model is specified by the set of possible (F. Commonly considered 9's are all distributions with center of symmetry 0. for instance.2) distribution and G is the N(J.t and cr. response y = X + ~ would be obtained where ~ does not depend on x. This implies that if F is the distribution of a control. whatever be J.1.~). 0 ! I = . we refer to the x's as control observations..3) and the model is alternatively specified by F.. and Y1.. .. YI. cr 2 ) distribution.Yn. To specify this set more closely the critical constant treatment effect assumption is often made. (72) population or equivalently F = {tP (':Ji:) : Jl E R. G E Q} where 9 is the set of all allowable error distributions that we postulate. where 0'2 is unknown.t. We call this the shift model with parameter ~.Xn are a random sample and. (2) Suppose that if treatment A had been administered to a subject response X would have been obtained. Heights are always nonnegative. then F(x) = G(x  1") (1. .. . (3) The control responses are normally distributed. 0'2). . We call the y's treatment observations. Goals. or alternatively all distributions with expectation O.. By convention... (3) The distribution of f is independent of J1. G) : J1 E R. Often the final simplification is made. then G(·) F(' . 0 This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations. respectively. .1. That is.. All actual measurements are discrete rather than continuous. 1 X Tn a sample from F. cr > O} where tP is the standard normal distribution. '. patients improve even if they only think they are being treated. There are absolute bounds on most quantitiesIOO ft high men are impossible. Let Xl. . Then if treatment B had been administered to the same subject instead of treatment A.
However. our analyses. The advantage of piling on assumptions such as (I )(4) of Example 1. The number of defectives in the first example clearly has a hypergeometric distribution.1. in comparative experiments such as those of Example 1. All the severely ill patients might. though correct for the model written down. Since it is only X that we observe. we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease. equally trained observers with no knowledge of each other's findings. Experiments in medicine and the social sciences often pose particular difficulties. A review of necessary concepts and notation from probability theory are given in the appendices. When w is the outcome of the experiment. It is often convenient to identify the random vector X with its realization. if (1)(4) hold. we know how to combine our measurements to estimate JL in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4.5. and Statistics 5 How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. For instance.Xn ). Parameters. the methods needed for its analysis are much the same as those appropriate for the situation of Example 1.1 Data. but not others. For instance.1. we can be reasonably secure about some aspects. In Example 1. In others. We are given a random experiment with sample space O. On this sample space we have defined a random vector X = (Xl. In this situation (and generally) it is important to randomize. For instance. P is the . we now define the elements of a statistical model. for instance. Models. in Example 1. Fortunately.3 when F. in Example 1. Statistical methods for models of this kind are given in Volume 2. there is tremendous variation in the degree of knowledge and control we have concerning experiments.3 and 6. we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated. G are assumed arbitrary.4.2 is that.1. we can ensure independence and identical distribution of the observations by using different. if they are true. That is. may be quite irrelevant to the experiment that was actually performed. The danger is that. As our examples suggest.'" .1.2. P is referred to as the model. Using our first three examples for illustrative purposes. the data X(w).6. This will be done in Sections 3. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise. In some applications we often have a tested theoretical model and the danger is small. have been assigned to B. we observe X and the family P is that of all bypergeometric distributions with sample size n and population size N. if they are false. ~(w) is referred to as the observations or data.1.1). This distribution is assumed to be a member of a family P of probability distributions on Rn.1.2. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter.1.Section 1. we need only consider its probability distribution. 
the number of α particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed.
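The surrounding passage stresses that, in comparative experiments such as Example 1.1.3, the assignment of the m + n available patients to drugs A and B should be randomized, for example with a random number table, so that the m patients given drug A form a sample without replacement from the available patients. A minimal sketch of such a random assignment, with made-up group sizes and hypothetical patient labels, is:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10, 10                          # made-up numbers of patients per drug
patients = np.arange(m + n)            # hypothetical patient labels 0, ..., m+n-1

# A random permutation plays the role of the random number table: the first m
# positions are a sample of size m drawn without replacement from the m + n patients.
order = rng.permutation(patients)
drug_A, drug_B = order[:m], order[m:]

print("assigned to drug A:", sorted(drug_A))
print("assigned to drug B:", sorted(drug_B))
```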
that is. 02 ). we have = R+ x R+ .. 1 • • • .G) : I' E R. Of even greater concern is the possibility that the parametrization is not onetoone.2 with assumptions (1)(3) are called semiparametric.. Pe the H(NB. lh i O => POl f. 1) errors. we only wish to make assumptions (1}(3) with t:. G with density 9 such that xg(x )dx = O} and p(".e. that is. by (tt. G) : I' E R. tt is the unknown constant being measured. Yn .1. Goals. knowledge of the true Pe. . 1) errors lead parametrization is unidentifiable because. we know we are measuring a positive quantity in this model.1 we can use the number of defectives in the population..2)). I 1.X l . if () = (Il. Now the and N(O. for eXample.3 with only (I) holding and F. N." that is.. e j . . . tt = to the same distribution of the observations a~ tt = 1 and N (~ 1. X rn are independent of each other and Yl . If. j. For instance. . When we can take to be a nice subset of Euclidean space and the maps 0 + Po are smooth. However.Xrn are identically distributed as are Y1 . in Example 1.. Thus. . having expectation 0. We may take any onetoone function of 0 as a new parameter. Then the map sending B = (I'. if) (X'..G taken to be arbitrary are called nonparametric. under assumptions (1)(4). on the other hand. () t Po from a space of labels. Finally.1. or equivalently write P = {Pe : BEe}. 02 and yet Pel = Pe 2 • Such parametrizations are called unidentifiable. In Example 1.t2 + k e e e J . and Performance Criteria Chapter 1 family of all distributions according to which Xl. The only truly nonparametric but useless model for X E R n is to assume that its (joint) distribution can be anything. . .l. the parameter space e.. moreover. For instance.1.1. in senses to be made precise later. 0. to P. X n ) remains the same but = {(I'. . ... we can take = {(I'. The critical problem with such parametrizations is that ev~n with "infinite amounts of data. we may parametrize the model by the first and second moments of the normal distribution of the observations (i. Note that there are many ways of choosing a parametrization in these and all other problems..2) suppose that we pennit G to be arbitrary. in Example = {O. models such as that of Example 1.2 ) distribution. G) into the distribution of (Xl. the first parametrization we arrive at is not necessarily the one leading to the simplest analysis. What parametrization we choose is usually suggested by the phenomenon we are modeling. as a parameter and in Example 1. that is. . . parts of fJ remain unknowable. Thus. NO. Ghas(arbitrary)densityg}. models P are called parametric.. as we shall see later. we will need to ensure that our parametrizations are identifiable.JL) where cp is the standard normal density..1. Pe 2 • 2 e ° . If. Yn . still in this example.2.1. n) distribution.I} and 1.G) has density n~ 1 g(Xi 1'). a map.2 with assumptions (1)(4) we have = R x R+ and. . () is the fraction of defectives. Pe the distribution on Rfl with density implicitly taken n~ 1 .1 we take () to be the fraction of defectives in the shipment. such that we can have (}1 f.Xn are independent and identically distributed with a common N (IL..1. .1.2 Parametrizations and Parameters i e To describe P we USe a parametrization. Models such as that of Example 1.3 that Xl.6 Statistical Models.. It's important to note that even nonparametric models make substantial assumptionsin Example 1. in (l. .
G) parametrization of Example 1.3 with assumptions (1H2) we are interested in~. Then P is parametrized by 8 = (fl1~' 0'2).. Similarly.1. a function q : + N can be identified with a parameter v( P) iff Po. In addition to the parameters of interest. if the errors are normally distributed with unknown variance 0'2. from to its range P iff the latter map is 11. we can start by making the difference of the means. that is. Models. instead of postulating a constant treatment effect ~.2 is now well defined and identifiable by (1. (J is a parameter if and only if the parametrization is identifiable. which can be thought of as the difference in the means of the two populations of responses. Here are two points to note: (1) A parameter can have many representations. our interest in studying a population of incomes may precisely be in the mean income.2 with assumptions (1)(4) the parameter of interest fl. implies q(BJl = q(B2 ) and then v(Po) q(B).1. For instance. As we have seen this parametrization is unidentifiable and neither f1 nor ~ arc parameters in the sense we've defined.. For instance. the fraction of defectives () can be thought of as the mean of X/no In Example 1. which correspond to other unknown features of the distribution of X. we can define (J : P + as the inverse of the map 8 + Pe. say with replacement. For instance. in Example 1. Formally. and observe Xl. Parameters. v. the focus of the study. make 8 + Po into a parametrization of P.M(P) can be characterized as the mean of P. Implicit in this description is the assumption that () is a parameter in the sense we have just defined. that is.1. a map from some e to P. ~ Po. The (fl.3) and 9 = (G : xdG(x) = J O}. in Example 1.1. in Example 1. When we sample. it is natural to write e e e .1.1. . from P to another space N. in Example 1. if POl = Pe'). ) evidently is and so is I' + C>. where 0'2 is the variance of €.Section 1.flx. More generally. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter (). or the median of P. implies 8 1 = 82 . is that of a parameter. . thus. X n independent with common distribution. E(€i) = O.1.1.1.2 again in which we assume the error € to be Gaussian but with arbitrary mean~.2. formally a map. and Statistics 7 Dual to the notion of a parametrization. or the midpoint of the interquantile range of P. Sometimes the choice of P starts by the consideration of a particular parameter. as long as P is the set of all Gaussian distributions. Then "is identifiable whenever flx and flY exist. consider Example 1..2 where fl denotes the mean income and." = f1Y . For instance.1 Data. For instance. . But given a parametrization (J + Pe. (2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable).3. then 0'2 is a nuisance parameter. But = Var(X. or more generally as the center of symmetry of P. A parameter is a feature v(P) of the distribution of X. there are also usually nuisance parameters. which indexes the family P.
T(x) = x/no In Example 1. and Performance Criteria Chapter 1 1. Mandel. .. For instance.. this difference depends on the patient in a complex manner (the effect of each drug is complex).1. in Example 1.X)" L i=l I n How we use statistics in estimation and other decision procedures is the subject of the next section.• 1 X n ) = X ~ L~ I Xi. we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure. which evaluated at ~ X and 8 2 are called the sample mean and sample variance. for example. consider situation (d) listed at the beginning of this section. Deciding which statistics are important is closely connected to deciding which parameters are important and. x E Ris F(X " ~ I . .2 a cOmmon estimate of J1. a statistic we shall study extensively in Chapter 2 is the function valued statistic F. is used to decide what estimate of the measure of difference should be employed (ct. can be related to model formulation as we saw earlier. T( x) is what we can compute if we observe X = x. For instance.1. Xn)(x) = n L I(X n i < x) i=l where (X" . Informally. a common estimate of 0.8 Statistical Models. Goals. In this volume we assume that the model has . usually a Euclidean space. then our attention naturally focuses on estimating this constant If. the fraction defective in the sample. It estimates the function valued parameter F defined by its evaluation at x E R.2 is the statistic I i i = i i 8 2 = n1 "(Xi . These issues will be discussed further in Volume 2. is the statistic T(X 11' .. X n ) are a sample from a probability P on R and I(A) is the indicator of the event A. F(P)(x) = PIX. Formally. Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. statistics. < xJ. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient.3 Statistics as Functions on the Sample Space I j • . but the true values of parameters are secrets of nature. we can draw guidelines from our numbers and cautiously proceed.1. however. The link for us are things we can compute. Databased model selection can make it difficult to ascenain or even assign a meaning to the accuracy of estimates or the probability of reaChing correct conclusions.1. Our aim is to use the data inductively.. hence. For future reference we note that a statistic just as a parameter need not be real or Euclidean valued. 1964). Thus. Next this model. Nevertheless. which now depends on the data. called the empirical distribution function. This statistic takes values in the set of all distribution functions on R. Models and parametrizations are creations of the statistician. to narrow down in useful ways our ideas of what the "true" P is. a statistic T is a map from the sample space X to some space of values T.
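The statistics discussed in this subsection, the sample mean, the sample variance, and the function-valued empirical distribution function, are all simple functions of the observed sample. A minimal numerical illustration, on a simulated N(0, 1) sample whose size and evaluation point are made up purely for illustration, is:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=25)          # a made-up sample X_1, ..., X_n with n = 25

xbar = x.mean()                  # the sample mean
s2 = np.var(x)                   # the sample variance with divisor n (some texts divide by n - 1)

def F_hat(t):
    """Empirical distribution function: the fraction of observations <= t."""
    return np.mean(x <= t)

print("sample mean    :", round(xbar, 3))
print("sample variance:", round(s2, 3))
print("F_hat(0)       :", F_hat(0.0))   # estimates F(0) = 0.5 under the N(0, 1) model
```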
age. (B. Y. Xl). This is obviously overkill but suppose that. 1990). ). However. The distribution of the response Vi for the ith subject or case in the study is postulated to depend on certain characteristics Zi of the ith subject. Regular models. ) that is independent of 0 such that 1 P(Xi' 0) = 1 for all O. weight. Regression Models. the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other. drugs A and B are given at several . In most studies we are interested in studying relations between responses and several other variables not just treatment or control as in Example 1. For instance. Models.. We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more infonnation. and Statistics 9 been selected prior to the current experiment. in Example 1. 1. 0).1 Data. See A. When dependence on 8 has to be observed. . 0). Regression Models We end this section with two further important examples indicating the wide scope of the notions we have introduced.1. For instance... Thus.1. They are not covered under our general model and will not be treated in this book. X m ). 0). and there exists a set {XI. after a while. patients may be considered one at a time. .X" . 0). and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients.. This is the Stage for the following. In the discrete case we will use both the tennsfrequency jimction and density for p(x. . (zn. Zi is a d dimensional vector that gives characteristics such as sex.3. for example. Yn ). we shall denote the distribution corresponding to any particular parameter value 8 by Po. Moreover. Problems such as these lie in the fields of sequential analysis and experimental deSign. It will be convenient to assume(1) from now on that in any parametric model we consider either: (I) All of the P. assign the drug that seems to be working better to a higher proportion of patients. We observe (ZI' Y. Parameters. sequentially. the number of patients in the study (the sample size) is random.). in situation (d) again. Thus.4.1. are continuous with densities p(x.1. density and frequency functions by p(.lO. (A. 'L':' Such models will be called regular parametric models. This selection is based on experience with previous similar experiments (cf. 8). Lehmann. (B. The experimenter may. these and other subscripts and arguments will be omitted where no confusion can arise. height.Section 1. Distribution functions will be denoted by F(·. Notation. (2) All of the P e are discrete with frequency functions p(x. and so On of the ith subject in a study. Expectations calculated under the assumption that X rv P e will be written Eo. in the study. assign the drugs alternatively to every other patient in the beginning and then... Y n ) where Y11 ••• .. There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. Yn are independent. . Example 1.4 Examples.3 we could take z to be the treatment label and write onr observations as (A.
By varying the assumptions we obtain parametric models as with (I). Example 1.L(z) is an unknown function from R d to R that we are interested in. . Goals. then the model is zT (a) P(YI... Here J. [n general. Treatment Dose Level) for patient i.3(3) is a special case of this model.treat the Zi as a label of the completely unknown distributions of Yi.z) = L.10 Statistical Models.~II3Jzj = zT (3 so that (b) becomes (b') This is the linear model.1. .1.. d = 2 and can denote the pair (Treatment Label. then we can write. See Problem 1. Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems.Z~)T and Jisthen x n identity. n. and Performance Criteria Chapter 1 dose levels. (c) whereZ nxa = (zf.. Often the following final assumption is made: (4) The distributiou F of (I) is N(O. i=l n If we let J. . .3 with the Gaussian twosample model I'(A) ~ 1'.. I3d) T of unknowns. .E(Yi ). i = 1.8. Clearly.(z) only. We usually ueed to postulate more. So is Example 1.1.1. The most common choice of 9 is the linear fOlltl.2 with assumptions (1)(4)..1. . in Example 1. (b) where Ei = Yi .. (3) g((3. On the basis of subject matter knowledge and/or convenience it is usually postulated that (2) 1'(z) = 9((3. Then. the effect of z on Y is through f.. semiparametric as with (I) and (2) with F arbitrary. A eommou (but often violated assumption) is (I) The ti are identically distributed with distribution F. .2 uuknown. which we can write in vector matrix form.. 0 . .L(z) denote the expected value of a response with given covariate vector z. In the two sample models this is implied by the constant treatment effect assumption. I'(B) = I' + fl. In fact by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. (3) and (4) above. That is. . If we let f(Yi I zd denote the density of Yi for a subject with covariate vector Zi.Yn) ~ II f(Yi I Zi).. For instance. Zi is a nonrandom vector of values called a covariate vector or a vector of explanatory variables whereas Yi is random and referred to as the response variable or dependent variable in the sense that its distribution depends on Zi. Then we have the classical Gaussian linear model. and nonparametric if we drop (1) and simply . z) where 9 is known except for a vector (3 = ((31.2) with ..
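The Gaussian linear model of Example 1.1.4 can be read concretely as a recipe for generating responses. The sketch below, which is not from the text, simulates from the model Y = Zβ + ε under assumptions (1)-(4), using an assumed design matrix (an intercept column plus one covariate such as a dose level), a hypothetical parameter vector β, and i.i.d. N(0, σ²) errors.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 40, 2.0
beta = np.array([1.0, 0.5])                    # hypothetical parameter vector (beta_1, beta_2)

# Design matrix Z: an intercept column plus one covariate (e.g., a dose level).
Z = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])

eps = rng.normal(scale=sigma, size=n)          # assumption (4): i.i.d. N(0, sigma^2) errors
Y = Z @ beta + eps                             # responses from the linear model Y = Z beta + eps

print("first five simulated responses:", np.round(Y[:5], 2))
```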
.. However. Models..t. i = l. we give an example in which the responses are dependent.t = E(Xi ) be the average time for an infinite series of records. .. Xi ~ 1. . p(e n I end f(edf(c2 . say. we have p(edp(C21 e1)p(e31 e"e2)". and the associated probability theory models and inference for dependent data are beyond the scope of this book.. Parameters.n. . Xi ...(3)I'). X n is P(X1.1 Data... . Measurement Model with Autoregressive Errors. the model for X I. . The default assumption.cn _d p(edp(e2 I edp(e3 I e2) . It is plausible that ei depends on Cil because long waves tend to be followed by long waves.(3) + (3X'l + 'i. Of course.. ergodicity. . X n be the n determinations of a physical constant J. . is that N(O. A second example is consecutive measurements Xi of a constant It made by the same observer who seeks to compensate for apparent errors.(1 .Il.. Let XI. Then we have what is called the AR(1) Gaussian model J is the We include this example to illustrate that we need not be limited by independence. model (a) assumes much more but it may be a reasonable first approximation in these situations. .n ei = {3ei1 + €i. Let j. .. i ~ 2. i = 1.1. . the conceptual issues of stationarity. n. save for a brief discussion in Volume 2. . 0 .. x n ). In fact we can write f. To find the density p(Xt.. Xl = I' + '1.p(e" I e1.. the elapsed times X I.(3cd··· f(e n Because €i =  (3e n _1). we start by finding the density of CI. eo = a Here the errors el. . ' . C n where €i are independent identically distributed with density are dependent as are the X's. 0'2) density. X n spent above a fixed high level for a series of n consecutive wave records at a point on the seashore. at best an approximation for the wave example. . .. Example 1. .t+ei. .. . .xn ) ~ f(x1 I') II f(xj .. and Statistics 11 Finally.. ... An example would be..{3Xj1 j=2 n (1.Section 1. Consider the model where Xi and assume = J..5.. . en' Using conditional probability theory and ei = {3ei1 + €i. .
the most important of which is the workhorse of statistics. That is.2) This is an example of a Bayesian model. it is possible that. n). a . the regression model.2. Models are approximations to the mechanisms generating the observations. ([. we can construct a freq uency distribution {1l"O. and indeed necessary.1) Our model is then specified by the joint distribution of the observed number X of defectives in the sample and the random variable 9.. . Now it is reasonable to suppose that the value of (J in the present shipment is the realization of a random variable 9 with distribution given by P[O = N] I = IT" i = 0. Thus. PIX = k. vector observations X with unknown probability distributions P ranging over models P. How useful a particular model is is a complex mix of how good the approximation is and how much insight it gives into drawing inferences.1. . 1. Goals. There is a substantial number of statisticians who feel that it is always reasonable.I 12 Statistical Models. N. we have had many shipments of size N that have subsequently been distributed. N. They are useful in understanding how the outcomes can he used to draw inferences that go beyond the particular experiment. in the inspection Example 1. i = 0. The notions of parametrization and identifiability are introduced. We know that.1. to think of the true value of the parameter (J as being the realization of a random variable (} with a known distribution. N. 1l"i is the frequency of shipments with i defective items. There are situations in which most statisticians would agree that more can be said For instance. X has the hypergeometric distribution 'H( i. This distribution does not always corresp:md to an experiment that is physically realizable but rather is thought of as measure of the beliefs of the experimenter concerning the true value of (J before he or she takes any data. in the past. In this section we introduced the first basic notions and formalism of mathematical statistics..2 BAYESIAN MODELS Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data.. The general definition of parameters and statistics is given and the connection between parameters and pararnetrizations elucidated. 0 = N I I ([. ."" 1l"N} for the proportion (J of defectives in past shipments. and Performance Criteria Chapter 1 Summary. This is done in the context of a number of classical examples. If the customers have provided accurate records of the number of defective items that they have found.2. given 9 = i/N. We view statistical models as useful tools for learning from the outcomes of experiments and studies. ..
For a concrete illustration. I(O.3) Because we now think of p(x. An interesting discussion of a variety of points of view on these questions may be found in Savage et al.O). let us turn again to Example 1. which is called the posterior distribution of 8. 1. Suppose that we have a regular parametric model {Pe : (J E 8}. the information or belief about the true value of the parameter is described by the prior distribution. There is an even greater range of viewpoints in the statistical community from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of (J has an objective interpretation in tenus of frequencies. we can obtain important and useful results and insights. then by (B. The joint distribution of (8. ?T. This would lead to the prior distribution e. Before the experiment is performed. After the value x has been obtained for X. X) is appropriately continuous or discrete with density Or frequency function.2. However. Raiffa and Schlaiffer (1961).2. and Berger (1985). (J) as a conditional density or frequency function given 8 = we will denote it by p(x I 0) for the remainder of this section. with density Or frequency function 7r. The function 7r represents our belief or information about the parameter (J before the experiment and is called the prior density or frequency function.3). We shall return to the Bayesian framework repeatedly in our discussion. In the "mixed" cases such as (J continuous X discrete. the resulting statistical inference becomes subjective. We now think of Pe as the conditional distribution of X given (J = (J. = ( I~O ) (0. . Before sampling any items the chance that a given shipment contains .2. by giving (J a distribution purely as a theoretical tool to which no subjective significance is attached. whose range is contained in 8. (1.l.1. (1) Our own point of view is that SUbjective elements including the views of subject matter experts arc an essential element in all model building. Eqnation (1. Lindley (1965). the joint distribution is neither continuous nor discrete.2. the information about (J is described by the posterior distribution. . suppose that N = 100 and that from past experience we believe that each item has probability . To get a Bayesian model we introduce a random vector (J. However. The most important feature of a Bayesian model is the conditional distribution of 8 given X = x. The theory of this school is expounded by L.2 Bayesian Models 13 Thus.4) for i = 0.2) is an example of (1.1)'(0. insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). given (J = (J. x) = ?T(O)p(x.1. De Groot (1969).9)'00'. (1962).1 of being defective independently of the other members of the shipment. X) is that of the outcome of a random experiment in which we first select (J = (J according to 7r and then.. 100. Savage (1954). (8. select X according to Pe. For instance. (1. If both X and (J are continuous or both are discrete..Section 1.3). 1. In this section we shall define and discuss the basic clements of Bayesian models.
20 or more bad items is, by the normal approximation with continuity correction,

    P[100θ ≥ 20] = P[(100θ − 10)/√(100(0.1)(0.9)) ≥ (19.5 − 10)/√(100(0.1)(0.9))]
                 ≈ 1 − Φ(3.17) ≈ 0.001.    (1.2.5)

Now suppose that a sample of 19 has been drawn in which 10 defective items are found. This leads to

    P[100θ ≥ 20 | X = 10] ≈ 0.30.    (1.2.6)

To calculate the posterior probability given in (1.2.6) we argue loosely as follows: If before the drawing each item was defective with probability .1 and good with probability .9 independently of the other items, this will continue to be the case for the items left in the lot after the 19 sample items have been drawn. Therefore, 100θ − X, the number of defectives left after the drawing, is independent of X and has a B(81, .1) distribution. Thus,

    P[100θ ≥ 20 | X = 10] = P[100θ − X ≥ 10 | X = 10]
                          = P[(100θ − X − 8.1)/√(81(0.1)(0.9)) ≥ (9.5 − 8.1)/√(81(0.1)(0.9))]
                          ≈ 1 − Φ(0.52) ≈ 0.30.    (1.2.7)

In general, to calculate the posterior, some variant of Bayes' rule can be used:

(i) The posterior distribution is discrete or continuous according as the prior distribution is discrete or continuous.

(ii) If we denote the corresponding (posterior) frequency function or density by π(θ | x), then

    π(θ | x) = π(θ)p(x | θ) / Σ_t π(t)p(x | t)    if θ is discrete,
    π(θ | x) = π(θ)p(x | θ) / ∫ π(t)p(x | t) dt    if θ is continuous.    (1.2.8)

In the cases where θ and X are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of (θ, X) given by (1.2.3). Here is an example.

Example 1.2.1. Bernoulli Trials. Suppose that X_1, ..., X_n are indicators of n Bernoulli trials with probability of success θ, where 0 < θ < 1. If we assume that θ has a priori distribution with density π, we obtain by (1.2.8) as posterior density of θ,

    π(θ | x_1, ..., x_n) = π(θ)θ^k(1 − θ)^(n−k) / ∫_0^1 π(t)t^k(1 − t)^(n−k) dt    (1.2.9)
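As a numerical check on (1.2.5)-(1.2.7): before sampling 100θ has a B(100, .1) distribution, while given X = 10 the number of defectives remaining, 100θ − X, has a B(81, .1) distribution. The short Python sketch below (purely illustrative) computes the exact binomial probabilities that the two normal approximations stand in for.

    from scipy.stats import binom

    # Before sampling: 100*theta ~ B(100, 0.1)
    prior_prob = 1 - binom.cdf(19, 100, 0.1)
    print("P[100*theta >= 20]          =", round(prior_prob, 4))   # text's approximation: 0.001

    # After observing X = 10: 100*theta - X ~ B(81, 0.1), so we need P[B(81, 0.1) >= 10]
    post_prob = 1 - binom.cdf(9, 81, 0.1)
    print("P[100*theta >= 20 | X = 10] =", round(post_prob, 4))    # text's approximation: 0.30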
say 0.16. and the posterior distribution of B givenl:X i =kis{3(k+r. r. We may want to assume that B has a density with maximum value at o such as that drawn with a dotted line in Figure B.n. we need a class of distributions that concentrate on the interval (0.2.2. Summary. 'i = 1. we might take B to be uniformly distributed on (0.·) is the beta function. To choose a prior 11'.6. . If we were interested in some proportion about which we have no information or belief. we only observe We can thus write 1[(8 I k) fo". One such class is the twoparameter beta family.2.9).1). L~ 1 Xi' We also obtain the same posterior density if B has prior density 1r and 1 Xi. the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions though by no means all. x n ). Another bigger conjugate family is that of finite mixtures of beta distributionssee Problem 1. = 1. which depends on k. As Figure B. introduce the notions of prior and posterior distributions and give Bayes rule.2. Suppose. n . If n is small compared to the size of the city.ll) in (1.. . so that the mean is r/(r + s) = 0.nk+s). Xi = 0 or 1. To get infonnation we take a sample of n individuals from the city. We present an elementary discussion of Bayesian models. k = I Xi· Note that the posterior density depends on the data only through the total number of successes. which corresponds to using the beta distribution with r = .2. We also by example introduce the notion of a conjugate family of distributions. s) density (B.2.2. For instance. 1).<. . 0 A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family. Such families are called conjugaU? Evidently the beta family is conjugate to the binomial.9) we obtain 1 L: L7 (}k+rl (1 _ 8)nk+s1 c (1.Section 1. We return to conjugate families in Section 1.05. (A.05 and its variance is very small.. Or else we may think that 1[(8) concentrates its mass near a small number.s) distribution.2. . nonVshaped bimodal distributions are not pennitted.(8 I X" . we are interested in the proportion () of "geniuses" (IQ > 160) in a particular city.11» be B(k + T.13) leads us to assume that the number X of geniuses observed has approximately a 8(n. 8) distribution given B = () (Problem 1. where k ~ l:~ I Xj.. for instance. This class of distributions has the remarkable property that the resulting posterior distributions arc again beta distributions.2 Bayesian Models 15 for 0 < () < 1. and s only. which has a B( n.15. must (see (B. Specifically.k + s) where B(·.10) The proportionality constant c. upon substituting the (3(r.2.2. Now we may either have some information about the proportion of geniuses in similar cities of the country or we may merely have prejudices that we are willing to express in the fonn of a prior distribution on B.2 indicates. The result might be a density such as the one marked with a solid line in Figure B.8) distribution..2. Then we can choose r and s in the (3(r.
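To make the beta-binomial conjugacy concrete, the sketch below picks r and s so that the β(r, s) prior has mean r/(r + s) = 0.05, then forms the β(k + r, n − k + s) posterior after observing k successes in n trials. The particular values of r, s, n, and k are invented for illustration; they are not the ones discussed in the text.

    from scipy.stats import beta

    r, s = 1.0, 19.0                  # prior mean r/(r+s) = 0.05 (one of many possible choices)
    n, k = 400, 12                    # hypothetical sample: k "geniuses" among n sampled

    prior = beta(r, s)
    posterior = beta(k + r, n - k + s)   # conjugacy: the posterior is again a beta distribution

    print("prior mean     :", round(prior.mean(), 4))
    print("posterior mean :", round(posterior.mean(), 4))
    print("posterior 95% interval:", posterior.ppf([0.025, 0.975]).round(4))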
These are estimation problems. as it's usually called.. We may wish to produce "best guesses" of the values of important parameters.2.lo as special. a reasonable prediction rule for an unseen Y (response of a new patient) is the function {t(z). There are many possible choices of estimates. As the second example suggests.L in Example 1. Unfortunately {t(z) is unknown. there are k! possible rankings or actions. sex. at a first cut. Intuitively. A consumer organization preparing (say) a report on air conditioners tests samples of several brands.l < J. 0 In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function.2. if we have observations (Zil Yi). Po or Pg is special and the general testing problem is really one of discriminating between Po and Po.3. state which is supported by the data: "specialness" or. P's that correspond to no treatment effect (i. and Performance Criteria Chapter 1 1. Note that we really want to estimate the function Ji(')~ our results will guide the selection of doses of drug for future patients. shipments. there are many problems of this type in which it's unclear which oftwo disjoint sets of P's. Goals. Prediction. 1 < i < n." In testing problems we. the expected value of Y given z. say. for instance.3. Zi) and then plug our estimate of (3 into g. of g((3. the receiver wants to discriminate and may be able to attach monetary Costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment. if there are k different brands.16 Statistical Models. whereas the receiver does not wish to keep "bad. (age. For instance. i we can try to estimate the function 1'0.e. "hypothesis" or "nonspecialness" (or alternative). if we believe I'(z) = g((3. For instance. A very important class of situations arises when. Thus.3. with (J < (Jo. we have a vector z. such as. and as we shall see fonnally later. For instance. the information we want to draw from data can be put in various forms depending on the purposes of our analysis.1. Thus.4.l > J. We may have other goals as illustrated by the next two examples. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not pennitted). However. i. one of which will be announced as more consistent with the data than others. say a 50yearold male patient's response to the level of a drug. 0 Example 1. Example 1.1. Making detenninations of "specialness" corresponds to testing significance. Given a statistical model.1.l < Jio or those corresponding to J. in Example 1.1 do we use the observed fraction of defectives .1 or the physical constant J.lo is the critical matter density in the universe so that J.1 contractual agreement between shipper and receiver may penalize the return of "good" shipments. z) we can estimate (3 from our observations Y. then depending on one's philosophy one could take either P's corresponding to J. In Example 1. drug dose) T that can be used for prediction of a variable of interest Y. In other situations certain P are "special" and we may primarily wish to know whether the data support "specialness" or not." (} > Bo.l > JkO correspond to an eternal alternation of Big Bangs and expansions.1. as in Example 1. the fraction defective B in Example 1. Ranking.lo means the universe is expanding forever and J.1.3 THE DECISION THEORETIC FRAMEWORK I . 
placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to pennit the marketing of drugs that do no good. say. in Example 1.1.1. If J.
That is. 3). on what criteria of performance we use. By convention. in Example 1. Here only two actions are contemplated: accepting or rejecting the "specialness" of P (or in more usual language the hypothesis H : P E Po in which we identify Po with the set of "special" P's).1.1.3 The Decision Theoretic Framework 17 X/n as our estimate or ignore the data and use hislOrical infonnation on past shipments. or combine them in some way? In Example 1. even if. 2.1.3 even with the simplest Gaussian model it is intuitively clear and will be made precise later that. is large. we would want a posteriori estimates of perfomlance. we begin with a statistical model with an observation vector X whose distribution P ranges over a set P. once a study is carried out we would probably want not only to estimate ~ but also know how reliable our estimate is. 1. 1. ~ 1 • • • l I} in Example 1.3. in testing whether we are right or wrong. it is natural to take A = R though smaller spaces may serve equally well. for instance. Estimation.1 Components of the Decision Theory Framework As in Section 1. P = {P. These examples motivate the decision theoretic framework: We need to (I) clarify the objectives of a study. 2). Thus. (4) provide guidance in the choice of procedures for analyzing outcomes of experiments. I)}. Thus. .. or p. whatever our choice of procedure we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we're doing. Here are action spaces for our examples. On the other hand. 2). k}}. Here quite naturally A = {Permntations (i I . (3. accuracy.6. We usnally take P to be parametrized. Testing. In designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter. X = ~ L~l Xi. taking action 1 would mean deciding that D. 1. Intuitively. 2. (2.1. ik) of {I. (2) point to what the different possible actions are. and reliability of statistical procedures.2 lO estimate J1 do we use the mean of the measurements. we need a priori estimates of how well even the best procedure can do. A = {a. if we have three air conditioners. Ranking. # 0.1. (3) provide assessments of risk. in Example 1. (2. In any case. I} with 1 corresponding to rejection of H. Thus.. in estimation we care how far off we are. For instance. there are 3! = 6 possible rankings. : 0 E 8). . . most significantly.2. 3. (1. If we are estimating a real parameter such as the fraction () of defectives. or the median. A = {O. Action space. in ranking what mistakes we've made. The answer will depend on the model and. a large a 2 will force a large m l n to give us a good chance of correctly deciding that the treatment effect is there. in Example 1. . 3).3. 3. .1.Section 1.. (3. A new component is an action space A of actions or decisions or claims that we can contemplate making. A = {( 1.1. in Example 1.1. and so on. 1)..1. defined as any value such that half the Xi are at least as large and half no bigger? The same type of question arises in all examples.
a) = 1 otherwise. Q is the empirical distribution of the Zj in . the expected squared error if a is used. a) = min {(v( P) .1). [v(P) . they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient. a) 1(0. sometimes can genuinely be quantified in economic terms. A = {a . For instance.V. as the name suggests. Closely related to the latter is what we shall call confidence interval loss. asymmetric loss functions can also be of importance. Although estimation loss functions are typically symmetric in v and a. is the nonnegative loss incurred by the statistician if he or she takes action a and the true "state of Nature. Goals. Far more important than the choice of action space is the choice of loss function defined as a function I : P X A ~ R+.: la. say.3.3. l(P.a)2. If v ~ (V1.qd(lJ)) and a ~ (a""" ad) are vectors. although loss functions. a) 1(0. a) = (v(P) . examples of loss functions are 1(0. Quadratic Loss: I(P. in the prediction example 1. as we shall see (Section 5.. a is a function from Z to R} with a(z) representing the prediction we would make if the new unobserved Y had covariate value z.)2 = Vj [ squared Euclidean distance/d absolute distance/d 1.Vd) = (q1(0). which penalizes only overestimation and by the same amount arises naturally with lower confidence bounds as discussed in Example 1. The interpretation of I(P. then a(B. ." respectively. . a). = max{laj  vjl.a)2 (or I(O. a) if P is parametrized. and z = (Treatment. I(P. and Performance Criteria Chapter 1 Prediction. and truncated quadraticloss: I(P. r I I(P. less computationally convenient but perhaps more realistically penalize large errors less are Absolute Value Loss: l(P. ~ 2. "does not respond" and "responds." that is. Sex)T. . r " I' 2:)a.al < d. If.a) ~ (q(O) . M) would be our prediction of response or no response for a male given treatment B. In estimating a real valued parameter v(P) or q(6') if P is parametrized the most commonly used loss function is. a) = J (I'(z) . say. Here A is much larger. if Y = 0 or 1 corresponds to. Other choices that are.ai. If we use a(·) as a predictor and the new z has marginal distribution Q then it is natural to consider...3. . the probability distribution producing the data.. 1'() is the parameter of interest. and z E Z.. d'}. a) = l(v < a). 1 d} = supremum distance.2. l( P.. Loss function.18 Statistical Models. . As we shall see.a(z))2dQ(z). [f Y is real. For instance.i = We can also consider function valued parameters. is P. Evidently Y could itself range over an arbitrary space Y and then R would be replaced by Y in the definition of a(·).a)2). a) I d . Estimation. a) = [v( P) . ' . or 1(0. This loss expresses the notion that all errors within the limits ±d are tolerable and outside these limits equally intolerable. For instance. a) = 0.
We ask whether the parameter () is in the subset 6 0 or subset 8 1 of e.a) = I n n. a) ~ 0 if 8 E e a (The decision is correct) l(0.Section 1. a(zn) jT and the vector parameter (I'( z. . ]=1 which is just n. Then the appropriate loss function is 1.. In Example 1. Y n).1 times the squared Euclidean distance between the prediction vector (a(z..). The data is a point X = x in the outcome or sample space X. Of course. Testing.3 with X and Y distributed as N(I' + ~. that is.3. then a reasonable rule is to decide Ll = 0 if our estimate x . where {So. a Estimation. We next give a representation of the process Whereby the statistician uses the data to arrive at a decision.3. Otherwise. other economic loss functions may be appropriate.9. For the problem of estimating the constant IJ. (1. respectively.1 loss function can be written as ed.1 suppose returning a shipment with °< 1(8. a) = 1 otherwise (The decision is wrong). Here we mean close to zero relative to the variability in the experiment.). is a partition of (or equivalently if P E Po or P E P.1.3 The Decision Theoretic Framework 19 the training set (Z 1.. in Example 00 defectives results in a penalty of s dollars whereas every defective item sold results in an r dollar replacement cost.. this leads to the commonly considered I(P. the statistician takes action o(x). For instance. and to decide Ll '# a if our estimate is not close to zero. This 0 . Y). L(I'(Zj) _0(Z. . We define a decision rule or procedure to be any function from the sample space taking its values in A.. .. .I loss: /(8.. ...2) and N(I'. .1'(zn))T Testing. In Section 4.). in the measurement model. y) = o if Ix::: 111 (J <c (1. relative to the standard deviation a. (Zn.y is close to zero. Using 0 means that if X = x is observed. The decision rule can now be written o(x. 0. if we are asking whether the treatment effect p'!1'ameter 6.))2. e ea.2).. If we take action a when the parameter is in we have made the correct decision and the loss is zero.. I) 1(8. the decision is wrong and the loss is taken to equal one. we implicitly discussed two estimates or decision rules: 61 (x) = sample mean x and 02(X) = X = sample median.0) sif8<80 Oif8 > 80 rN8. I) /(8.2) I if Ix ~ 111 (J >c . is a or not.3 we will show how to obtain an estimate (j of a from the data.1.1) Decision procedures.
" Var(X.. we typically want procedures to have good properties not at just one particular x. If we use quadratic loss. If we expand the square of the righthand side keeping the brackets intact and take the expected value..3.. and assume quadratic loss. . then the loss is l(P.) = 2L n ~ n . R('ld) is our a priori measure of the performance of d. Suppose v _ v(P) is the real parameter we wish to estimate and fJ(X) is our estimator (our decision rule).3. Estimation of IJ. fJ(x)). (Continued).3.E(fi)] = O. e Estimation. A useful result is Bias(fi) Proposition 1. Thus.i.. That is. we turn to the average or mean loss over the sample space. 1 is the loss function. withN(O. our risk function is called the mean squared error (MSE) of. We do not know the value of the loss because P is unknown. . Goals. the risk or riskfunction: The risk function. The MSE depends on the variance of fJ and on what is called the bias ofv where = E{fl) . X n are i. 6(X)] as the measure of the perfonnance of the decision rule o(x).1. and X = x is the outcome of the experiment. fi) = Ep(fi(X) .6) = Ep[I(P. Example 1.) 0 We next illustrate the computation and the a priori and a posteriori use of the risk function. How do we choose c? We need the next concept of the decision theoretic framework. and is given by v= MSE{fl) = R(P. lfwe use the mean X as our estimate of IJ. () is the true value of the parameter. so is the other and the result is trivially true. MSE(fi) I. The other two terms are (Bias fi)' and Var(v).3) where for simplicity dependence on P is suppressed in MSE. the cross term will he zero because E[fi . R maps P or to R+. i=l .v can be thought of as the "longrun average error" of v. Suppose Xl. Moreover. If d is the procedure used.. for each 8.v) = [V  + [E(v) . 6 (x)) as a random variable and introduce the riskfunction R(P.d. we regard I(P. then Bias(X) 1 n Var(X) .v(P))' (1. Proof. but for a range of plausible x·s.3. We illustrate computation of R and its a priori use in some examples. measurements of IJ. (fi .. (If one side is infinite. Thus. (]"2) errors.1'].20 Statistical Models. Write the error as 1 = (Bias fi)' + E(fi)] Var(v). and Performance Criteria Chapter 1 where c is a positive constant called the critical value.
00 (1. write for a median of {aI..i. an}). In fact. The choice of the weights 0.. Then R(I". . If.4. Ii ~ (0.S.. a'. R( P.3 The Decision Theoretic Framework 21 and. in general. If we have no data for area A. a natural guess for fl would be flo.6).1"1 ~ Eill 2 ) where £. 1 X n from area A.a 2 then by (A 13. areN(O.1". (. We shall derive them in Section 1. .. if X rnedian(X 1 . with mean 0 and variance a 2 (P). census. or approximated asymptotically. the £.3. analytic. If we want to be guaranteed MSE(X) < £2 we can do it by taking at least no = <:TolE measurements. for instance by 8 2 = ~ L~ l(X i .fii/a)[ ~ a . Let flo denote the mean of a certain measurement included in the U. Suppose that the precision of the measuring instrument cr2 is known and equal to crJ or where realistically it is known to be < crJ. X) = EIX . then for quadratic loss.3.3. whereas if we have a random sample of measurements X" X 2. but for absolute value loss only approximate. or numerical and/or Monte Carlo computation.. as we assumed. If we have no idea of the value of 0.X) =.3.1 is useful for evaluating the performance of competing estimators.fii V. say.8 can only be made on the basis of additional knowledge about demography or the economy.fii 1 af2 00 Itl<p(t)dt = . 1) and R(I". Example 1. itself subject to random error. . The a posteriori estimate of risk (j2/n is.1")' ~ E(~) can only be evaluated numerically (see Problem 1.4) which doesn't depend on J.X)2. an estimate we can justify later.6 through a .5) This harder calculation already suggests why quadratic loss is really favored.Xn ) (and we. as we discussed in Example 1.2 . of course.4) can be used for an a priori estimate of the risk of X. Suppose that instead of quadratic loss we used the more natural(l) absolute value loss. by Proposition 1. planning is not possible but having taken n measurements we can then estimate 0.8)X. age or income. .1). or na 21(n . computational difficulties arise even with quadratic loss as soon as we think of estimates other than X.d. is possible.3. = X. E(i . we may wantto combine tLo and X = n 1 L~ 1 Xi into an estimator. ( N(o.23).. .a 2 .2. n (1. .t. For instance.Section 1.1.2. Then (1. for instance. If we only assume.. Next suppose we are interested in the mean fl of the same measurement for a certain area of the United States. 0 ~ = a We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1.3. that the E:i are i.2 and 0.X) = ".2)1"0 + (O. X) = a' (P) / n still.3.1 2 MSE(X) = R(I".
formal Bayesian analysis using a normal prior to illustrate a way of bringing in additional knowledge. Here we compare the performances of μ̃ and X̄ as estimators of μ using MSE. We easily find

    Bias(μ̃) = E(μ̃) − μ = 0.2μ_0 + 0.8μ − μ = 0.2(μ_0 − μ)
    Var(μ̃) = (0.8)²Var(X̄) = (.64)σ²/n
    MSE(μ̃) = (.04)(μ_0 − μ)² + (.64)σ²/n.

If μ is close to μ_0, then the risk R(μ, μ̃) of μ̃ is smaller than the risk R(μ, X̄) = σ²/n of X̄, with the minimum relative risk inf{MSE(μ̃)/MSE(X̄); μ ∈ R} being 0.64 when μ = μ_0. Figure 1.3.1 gives the graphs of MSE(μ̃) and MSE(X̄) as functions of μ. The two MSE curves cross at μ = μ_0 ± 3σ/√n. Because we do not know the value of μ, neither estimator can be proclaimed as being better than the other. However, if we use as our criteria the maximum (over μ) of the MSE (called the minimax criteria), then X̄ is optimal (Example 3.3.4). □

[Figure 1.3.1. The mean squared errors of X̄ and μ̃ as functions of μ.]

Testing. The test rule (1.3.2) for deciding between Δ = 0 and Δ ≠ 0 can only take on the two values 0 and 1; thus, the risk is

    R(Δ, δ) = l(Δ, 0)P[δ(X, Y) = 0] + l(Δ, 1)P[δ(X, Y) = 1],

which in the case of 0−1 loss is

    R(Δ, δ) = P[δ(X, Y) = 1]  if Δ = 0
            = P[δ(X, Y) = 0]  if Δ ≠ 0.

In the general case X and Θ denote the outcome and parameter space, respectively, and we are to decide whether θ ∈ Θ_0 or θ ∈ Θ_1, where Θ = Θ_0 ∪ Θ_1, Θ_0 ∩ Θ_1 = ∅. A test
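The MSE formulas for μ̃ and X̄ are easy to check numerically. The sketch below (with illustrative values of μ_0, σ, and n) evaluates MSE(μ̃) = .04(μ_0 − μ)² + .64σ²/n and MSE(X̄) = σ²/n over a grid of μ values and reports the crossing points μ_0 ± 3σ/√n.

    import numpy as np

    mu0, sigma, n = 0.0, 1.0, 25            # hypothetical census mean, sd, and sample size
    mse_xbar = sigma**2 / n                 # MSE of the sample mean (does not depend on mu)

    for mu in np.linspace(mu0 - 1.0, mu0 + 1.0, 9):
        bias = 0.2 * (mu0 - mu)                        # bias of mu_tilde = .2 mu0 + .8 Xbar
        mse_tilde = bias**2 + 0.64 * sigma**2 / n      # bias^2 + variance
        better = "mu_tilde" if mse_tilde < mse_xbar else "Xbar"
        print(f"mu = {mu:5.2f}  MSE(mu_tilde) = {mse_tilde:.4f}  MSE(Xbar) = {mse_xbar:.4f}  better: {better}")

    print("curves cross at mu0 +/- 3*sigma/sqrt(n) =",
          (mu0 - 3*sigma/np.sqrt(n), mu0 + 3*sigma/np.sqrt(n)))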
say . we call the error committed a Type I error. the risk of <5(X) is e R(8. we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding ~ =1= 0 when ~ = 0).3. on the probability of Type I error.IX > k] + rN8P. 6(X) = IIX E C]. Thus.3. For instance. a > v(P) I. and then trying to minimize the probability of a Type II error. confidence bounds and intervals (and more generally regions).1. I(P.Section 1.o) 0. that is.6) P(6(X) = 0) if 8 E 8 1 Probability of Type II error. whereas if <5(X) = 0 and we decide () E 80 when in fact 8 E 8 b we call the error a Type II error.6) E(6(X)) ~ P(6(X) ~ I) if8 E 8 0 (1.1 lead to (Prohlem 1. For instance.[X < k[. Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter v." in Example 1.8) for all possible distrihutions P of X. For instance. usually .7) Confidence Bounds and Intervals Decision theory enables us to think clearly about an important hybrid of testing and estimation. a < v(P) . Such a v is called a (1 .6) Probability of Type [error R(8. This corresponds to an a priori bound on the risk of a on v(X) viewed as a decision procedure with action space R and loss function. R(8. 8 < 80 rN8P.3.01 or less.05 or .3. where 1 denotes the indicator function. [n the NeymanPearson framework of statistical hypothesis testing. (1. and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding Do = 0 when Do ¥ 0). Finding good test functions corresponds to finding critical regions with smaIl probabilities of error. the loss function (1. the focus is on first providing a small bound. This is not the only approach to testing.8) sP. it is natural to seek v(X) such that P[v(X) > vi > 1. "Reject the shipment if and only if X > k.a (1.3 The Decision Theoretic Framework 23 function is a decision rule <5(X) that equals 1 On a set C C X called the critical region and equals 0 on the complement of C. an accounting finn examining accounts receivable for a finn on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed.18).1) and tests 15k of the fonn.[X < k]. Here a is small. in the treatments A and B example. If <5(X) = 1 and we decide 8 E 8 1 when in fact E 80.05.3. 8 > 80 . If (say) X represents the amount owed in the sample and v is the unknown total amount owed.a) upper coofidence houod on v.
or wait and see. a) (Drill) aj (Sell) a2 (Partial rights) a3 . A has three points. aJ.1. an asymmetric estimation type loss function. The decision theoretic framework accommodates by adding a component reflecting this. though upper bounding is the primary goal. Suppose the following loss function is decided on TABLE 1. though. and Performance Criteria Chapter 1 1 .3. We shall go into this further in Chapter 4. c ~1 .2 . we could leave the component in. we could operate.a ·1 for all PEP. sell the location. For instance. as we shall see in Chapter 4.3. What is missing is the fact that.24 Statistical Models. Suppose we have two possible states of nature.: Comparison of Decision Procedures In this section we introduce a variety of concepts used in the comparison of decision procedures.a)  a~v(P) . in fact it is important to get close to the truthknowing that at most 00 doJlars are owed is of no use.a>v(P) . and the risk of all possible decision procedures can be computed and plotted.3. In the context of the foregoing examples. replace it. a patient either has a certain disease or does not. The same issue arises when we are interested in a confidence interval ~(X) l v(X) 1for v defined by the requirement that • P[v(X) < v(P) < Ii(X)] > 1 . we could drill for oil. it is customary to first fix a in (1. We next tum to the final topic of this section. We shall illustrate some of the relationships between these ideas using the following simple example in which has two members. I I (Oil) (No oil) Bj B2 0 12 10 I 5 6 \ . We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model. Goals.3.8) and then see what one can do to control (say) R( P.. a2.1 nature makes it resemble a testing loss function and. 1. are available. Suppose that three possible actions.5. administer drugs. . and so on. where x+ = xl(x > 0). thai this fonnulation is inadequate because by taking D 00 we can achieve risk O. The loss function I(B. The 0. It is clear. a component in a piece of equipment either works or does not work.a<v(P). or sell partial rights. general criteria for selecting "optimal" procedures. or repair it. Ii) = E(Ii(X) v(P))+. e Example 1. a 2 certain location either contains oil or does not. rather than this Lagrangian form. For instance = I(P. . which we represent by Bl and B . for some constant c > O. Typically. and a3. the connection is close.
3 0.6 0. . a. 5 a. 8 a3 a2 9 a3 a3 Here.5 3 7 1. the loss is zero.4. Next.6. a3 4 a2 a. Possible decision rules 5i (x) ." 02 corresponds to ''Take action Ul> if X = 0.2 (Oil) (No oil) e.).5.6 4 5 1 " 3 5. TABLE 1. 0 . 3 a.6 3 3. take action U2. whereas if there is no oil.3) For instance.3 and formation 1 with frequency 0.7) = 7 12(0.4 = 1.3.3. 6 a. and so on. a. a..4.6 and 0. X may represent a certain geological formation. an experiment is conducted to obtain information about B resulting in the random variable X with possible values coded as 0.3. R(B k .. ..a2)P[5(X) ~ a2] + I(B. 2 a.5.7. i Rock formation x o I 0. a3 7 a3 a. the loss is 12.a3)P[5(X) = a3]' 0(0.4) ~ 7. I x=O x=1 a. (R(O" 5).4 8 8.) 0 12 2 7 7.5 4. B) given by the following table TABLE 1.)P[5(X) = ad +1(B. whereas if there is no oil and we drill.0 9 5 6 It remains to pick out the rules that are "good" or "best.Section 1.5 9. The frequency function p(x.4 10 6 6." Criteria for doing this will be introduced in the next subsection. R(025. e.)) 1 1 R(B . R(B" 52) R(B2 . We list all possible decision rules in the following table.2 for i = 1. it is known that formation 0 occurs with frequency 0. if there is oil and we drill. a..3.). The risk of 5 at B is R(B. B2 Thus.52 ) + 10(0. Risk points (R(B 5. 5)) and if k = 2 we can plot the set of all such points obtained by varying 5. formations 0 and 1 occur with frequencies 0. if X = 1. The risk points (R(B" 5. e TABLE 1. R(B2. ..) R(B. 01 represents "Take action Ul regardless of the value of X.5) E[I(B.)) are given in Table 1.5(X))] = I(B.3 The Decision Theoretic Framework 25 Thus.6) + 1(0.3. we can represent the whole risk function of a procedure 0 by a point in kdimensional Euclidean space." and so on.3.4 and graphed in Figure 1. 5. and when there is oil.2. If is finite and has k members. and frequency function p(x.7 0.). 1.5 8. 1 9. .
[Figure 1.3.2. The risk points (R(θ_1, δ_i), R(θ_2, δ_i)), i = 1, ..., 9.]

We say that a procedure δ improves a procedure δ' if, and only if, R(θ, δ) ≤ R(θ, δ') for all θ with strict inequality for some θ. It is easy to see that there is typically no rule δ that improves all others. For instance, in estimating θ ∈ R when X ~ N(θ, σ_0²), if we ignore the data and use the estimate θ̂ = 0, we obtain MSE(θ̂) = θ². The absurd rule "δ*(X) = 0" cannot be improved on at the value θ = 0 because E_0(δ(X))² = 0 if and only if δ(X) = 0. Usually, if δ and δ' are two rules, neither improves the other. Consider, for instance, δ_1 and δ_6 in our example: here R(θ_1, δ_1) < R(θ_1, δ_6) but R(θ_2, δ_1) > R(θ_2, δ_6). The problem of selecting good decision procedures has been attacked in a variety of ways.

(1) Narrow classes of procedures have been proposed using criteria such as considerations of symmetry, unbiasedness (for estimates and tests), or level of significance (for tests). Researchers have then sought procedures that improve all others within the class. Extensions of unbiasedness ideas may be found in Lehmann (1997, Section 1.5). Symmetry (or invariance) restrictions are discussed in Ferguson (1967). We shall pursue this approach further in Chapter 3.

(2) A second major approach has been to compare risk functions by global crite-
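The nine risk points plotted in Figure 1.3.2 are determined by the loss table and the frequency function p(x, θ) of the rock formations via R(θ, δ) = Σ_x l(θ, δ(x))p(x | θ). The Python sketch below simply re-does that bookkeeping; the rules are indexed in the order of Table 1.3.2 as we read it (δ_1 takes a_1 whatever x is, ..., δ_9 takes a_3 whatever x is).

    from itertools import product

    # Losses l(theta, a): a1 = drill, a2 = sell, a3 = partial rights
    loss = {"oil":    {"a1": 0,  "a2": 10, "a3": 5},
            "no oil": {"a1": 12, "a2": 1,  "a3": 6}}

    # p(x | theta) for the rock formations x = 0, 1
    p = {"oil":    {0: 0.3, 1: 0.7},
         "no oil": {0: 0.6, 1: 0.4}}

    # The nine nonrandomized rules delta: {0, 1} -> {a1, a2, a3}
    rules = list(product(["a1", "a2", "a3"], repeat=2))

    for i, (a_if_0, a_if_1) in enumerate(rules, start=1):
        r = {th: loss[th][a_if_0] * p[th][0] + loss[th][a_if_1] * p[th][1]
             for th in ("oil", "no oil")}
        print(f"delta_{i}:  R(theta1, delta) = {r['oil']:4.1f}   R(theta2, delta) = {r['no oil']:4.1f}")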
4 8 4. given by reo) = E[R(O. . = 0. O(X)) I () = ()].8.48 7. If there is a rule <5*.5 7 7.20) in Appendix B.. which attains the minimum Bayes risk l that is.Section 1.5 9 5. such that • see that rule 05 is the unique Bayes rule then it is called a Bayes rule.I.92 5. if () is discrete with frequency function 'Tr(e). and reo) = J R(O. TABLE 1. we need not stop at this point.6 12 2 7. Bayes: The Bayesian point of view leads to a natural global criterion. The method of computing Bayes procedures by listing all available <5 and their Bayes risk is impracticable in general. r(o") = minr(o) reo) = EoR(O. . 0)11(0). it has smaller Bayes risk.3.6 4 4.3.o)] = E[I(O.02 8. Note that the Bayes approach leads us to compare procedures on the basis of.10) r( 0) ~ 0. the only reasonable computational method.3. Recall that in the Bayesian model () is the realization of a random variable or vector () and that Pe is the conditional distribution of X given 0 ~ O. 0) isjust E[I(O.6 3 8. We shall discuss the Bayes and minimax criteria. to Section 3. If we adopt the Bayesian point of view. From Table 1. r(o.o)1I(O)dO.. + 0. (1.9) The second preceding identity is a consequence of the double expectation theorem (B.4 5 2.38 9. To illustrate.9).0).5 gives r( 0. This quantity which we shall call the Bayes risk of <5 and denote r( <5) is then. (1.. suppose that in the oil drilling example an expert thinks the chance of finding oil is .3.) maxi R(0" Oil.3. but can proceed to calculate what we expect to lose on the average as () varies. the expected loss. Then we treat the parameter as a random variable () with possible values ()1.3.). 11(0.3 The Decision Theoretic Framework 27 ria rather than on a pointwise basis.5. therefore.0) Table 1. ()2 and frequency function 1I(OIl The Bayes risk of <5 is.3.7 6.8 6 In the Bayesian framework 0 is preferable to <5' if. Bayes and maximum risks oftbe procedures of Table 1. .8 10 6 3.) ~ 0. and only if. if we use <5 and () = ().8R(0.2R(0.2.5 we for our prior.2. We postpone the consideration of posterior analysis. . .o(x))].9 8. In this framework R(O.3.2. R(0" Oi)) I 9. r( 09) specified by (1.
we prefer 0" to 0"'.(2) We briefly indicate "the game of decision theory. . 6). who picks a decision procedure point 8 E J from V. For simplicity we shall discuss only randomized . It is then natural to compare procedures using the simple average ~ [R( fh 10) + R( fh. Nature's intentions and degree of foreknowledge are not that clear and most statisticiaqs find the minimax principle too conservative to employ as a general rule. suppose that.J') . This is. Our expected risk would be. It aims to give maximum protection against the worst that can happen. . if V is the class of all decision procedures (nonrandomized).5. J) A procedure 0*. sup R(O. if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used. and Performance Criteria Chapter 1 if (J is continuous with density rr(8).J). For instance. This criterion of optimality is very conservative. For instance. = infsupR(O. Such comparisons make sense even if we do not interpret 1r as a prior density or frequency. which has . I Randomized decision rules: In general. R(02. a weight function for averaging the values of the function R( B. in Example 1.4. R(O. J). in Example 1. < sup R(O. Player II then pays Player I." Nature (Player I) picks a independently of the statistician (Player II). The criterion comes from the general theory of twoperson zero sum games of von Neumann. J'). Nevertheless l in many cases the principle can lead to very reasonable procedures. a randomized decision procedure can be thought of as a random experiment whose outcomes are members of V. 2 The maximum risk 4. we toss a fair coin and use 04 if the coin lands heads and 06 otherwise. To illustrate computation of the minimax rule we tum to Table 1.20 if 0 = O .4. 8)]. if and only if. The principle would be compelling. Of course. J)) we see that J 4 is minimax with a maximum risk of 5. the sel of all decision procedures. .75 if 0 = 0. which makes the risk as large as possible. The maximum risk of 0* is the upper pure value of the game.5 we might feel that both values ofthe risk were equally important.75 is strictly less than that of 04. e 4. supR(O.28 Statistical Models.3. From the listing ofmax(R(O" J).3. Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to emplOy. Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk. But this is just Bayes comparison where 1f places equal probability on f}r and ()2.3. Goals. but only ac. 4. Nature's choosing a 8. is called minimax (minimizes the maximum risk).
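The Bayes and minimax comparisons above are easy to reproduce. The sketch below starts from the nine risk points (recomputed from the loss table and p(x, θ), so they should agree with Table 1.3.4), forms the Bayes risks for the prior π(θ_1) = 0.2, π(θ_2) = 0.8, picks out the nonrandomized minimax rule, and then finds the mixing weight λ for the δ_4/δ_6 randomization that equalizes the two risks (a value derived a little later in the text); the fair-coin mixture mentioned above is included for comparison.

    # Risk points (R(theta1, delta_i), R(theta2, delta_i)), recomputed from the tables
    risks = {1: (0.0, 12.0), 2: (7.0, 7.6), 3: (3.5, 9.6),
             4: (3.0, 5.4),  5: (10.0, 1.0), 6: (6.5, 3.0),
             7: (1.5, 8.4),  8: (8.5, 4.0),  9: (5.0, 6.0)}
    prior = (0.2, 0.8)                               # expert's prior on (oil, no oil)

    bayes = {i: prior[0]*r1 + prior[1]*r2 for i, (r1, r2) in risks.items()}
    maxrisk = {i: max(r) for i, r in risks.items()}
    i_bayes = min(bayes, key=bayes.get)
    i_minimax = min(maxrisk, key=maxrisk.get)
    print("Bayes rule: delta_%d with Bayes risk %.2f" % (i_bayes, bayes[i_bayes]))
    print("nonrandomized minimax rule: delta_%d with maximum risk %.1f" % (i_minimax, maxrisk[i_minimax]))

    # Fair-coin mixture of delta_4 and delta_6, and the risk-equalizing mixture
    r4, r6 = risks[4], risks[6]
    print("coin-flip mixture risks:", ((r4[0]+r6[0])/2, (r4[1]+r6[1])/2))
    lam = (r6[1] - r6[0]) / ((r4[0] - r6[0]) - (r4[1] - r6[1]))
    print("risk-equalizing lambda = %.3f, common risk = %.2f"
          % (lam, lam*r4[0] + (1-lam)*r6[0]))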
3.. when ~ = 0..5. Jj .3.3.10). this point is (10.5. we represent the risk of any procedure J by the vector (R( 01 . ~Ai=1}.3.Ji )] i=l A randomized Bayes procedure d.Section 1.3. (1.2.3).3). We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1.\. (2) The tangent is the line connecting two unonrandomized" risk points Ji . 1 q.). = ~A.J): J E V'} where V* is the set of all procedures.3. J). we then define q R(O. S= {(rI. If . including randomiZed ones.(02 ).R(02. minimizes r(d) anwng all randomized procedures. which is the risk point of the Bayes rule Js (see Figure 1.13) defines a family of parallel lines with slope i/(1 . All points of S that are on the tangent are Bayes. .14) .J. Ai >0. i=l (1. By (1. (1.R(O..13) intersects S. then all rules having Bayes risk c correspond to points in S that lie on the line irl + (1 . We will then indicate how much of what we learn carries over to the general case. AR(02. . i = 1. For instance. Ei==l Al = 1. given a prior 7r on q e.i)r2 = c.r).1). .3.A)R(O"Jj ).J. A point (Tl..3 The Decision Theoretic Framework 29 procedures that select among a finite set (h 1 • • • .12) r(J) = LAiEIR(IJ... TWo cases arise: (I) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule. Jj )..(0 d = i = 1 .Ji)' r2 = tAiR(02. . 9 (Figure 2 1. R(O . A randomized minimax procedure minimizes maxa R(8.i/ (1 . As in Example 1. T2) on this line can be written AR(O"Ji ) + (1.d) anwng all randomized procedures. If the randomized procedure l5 selects l5i with probability Ai. J)) and consider the risk sel 2 S = {(RrO"J).'Y) that is tangent to S al the lower boundary of S. That is.3.3.). R(O . Ji ) + (1 . i = 1 . the Bayes risk of l5 (1. S is the convex hull of the risk points (R(0" Ji ).. Ji )).R(OI. 0 < i < 1.13) As c varies. (1. This is thai line with slope .J) ~ L.3. Finding the Bayes rule corresponds to finding the smallest c for which the line (1. l5q of nOn randomized procedures.A)R(02.r2):r.11) Similarly we can define.3.
(\)). thus. To locate the risk point of the nilhimax rule consider the family of squares./(1 . < 1.3. See Figure 1.3.3. The convex hull S of the risk poiots (R(B!. 0 < >. Let c' be the srr!~llest c for which Q(c) n S i 0 (i. OJ with probability (1 .11) corresponds to the values Oi with probability>. Goals..30 Statistical Models. i = 1.3. as >. (1.e.3. the set B of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of S (Le.13).3. namely Oi (take'\ = 1) and OJ (take>. the first point of contact between the squares and S is the .3. = r2. . Then Q( c*) n S is either a point or a horizontal or vertical line segment. is Bayes against 7I". In our example. ranges from 0 to 1. < 1. where 0 < >. by (1. 2 The point where the square Q( c') defined by (1. and Performance Criteria Chapter 1 r2 I 10 5 Q(c') o o 5 10 Figure 1.3. It is the set of risk points of minimax rules because any point with smaller maximum risk would belong to Q(c) n S with c < c* contradicting the choice of c*.9. . oil. R(B .3. (1. Because changing the prior 1r corresponds to changing the slope .16) whose diagonal is the line r.)') of the line given by (1.>').. We can choose two nonrandomized Bayes rules from this class. = 0).16) touches S is the risk point of the minimax rule.• all points on the lowet boundary of S that have as tangents the y axis or lines with nonpositive slopes).. the first square that touches S). and.15) Each one of these rules.
intersection between r_1 = r_2 and the line connecting the two points corresponding to δ_4 and δ_6. Thus, the minimax rule is given by (1.3.14) with i = 4, j = 6 and λ the solution of

    λR(θ_1, δ_4) + (1 − λ)R(θ_1, δ_6) = λR(θ_2, δ_4) + (1 − λ)R(θ_2, δ_6).

From Table 1.3.4, this equation becomes

    3λ + 6.5(1 − λ) = 5.4λ + 3(1 − λ),

which yields λ = 0.59.

There is another important concept that we want to discuss in the context of the risk set. A decision rule δ is said to be inadmissible if there exists another rule δ' such that δ' improves δ. Naturally, all rules that are not inadmissible are called admissible. A rule δ with risk point (r_1, r_2) is admissible, if and only if, there is no (x, y) in S with x ≤ r_1 and y ≤ r_2 other than (r_1, r_2) itself; equivalently, the set {(x, y): x ≤ r_1, y ≤ r_2} has only (r_1, r_2) in common with S. Using Table 1.3.4 we can see, for instance, that δ_2 is inadmissible because δ_4 improves it (i.e., R(θ_1, δ_4) = 3 < 7 = R(θ_1, δ_2) and R(θ_2, δ_4) = 5.4 < 7.6 = R(θ_2, δ_2)).

To gain some insight into the class of all admissible procedures (randomized and nonrandomized) we again use the risk set. From the figure it is clear that such points must be on the lower left boundary. In fact, under some conditions, the set of all lower left boundary points of S corresponds to the class of admissible rules and, thus, agrees with the set of risk points of Bayes procedures.

If Θ is finite, Θ = {θ_1, ..., θ_k}, we can define the risk set in general as

    S = {(R(θ_1, δ), ..., R(θ_k, δ)) : δ ∈ D*}

where D* is the set of all randomized decision procedures. The following features exhibited by the risk set of Example 1.3.5 can be shown to hold generally (see Ferguson, 1967, for instance).

(a) For any prior there is always a nonrandomized Bayes procedure, if there is a randomized one.

(b) The set B of risk points of Bayes procedures consists of risk points on the lower boundary of S whose tangent hyperplanes have normals pointing into the positive quadrant.

(c) If Θ is finite(4) and minimax procedures exist, they are Bayes procedures.

(d) All admissible procedures are Bayes procedures.

(e) If a Bayes prior has π(θ_i) > 0 for all i, then any Bayes procedure corresponding to π is admissible.

If Θ is not finite there are typically admissible procedures that are not Bayes. However, under some conditions, all admissible procedures are either Bayes procedures or limits of
which is the squared prediction error when g(Z) is used to predict Y. A government expert wants to predict the amount of heating oil needed next winter. Summary. For more information on these topics. In terms of our preceding discussion. Z would be the College Board score of an entering freshman and Y his or her firstyear grade point average.2 presented important situations in which a vector z of 00variates can be used to predict an unseen response Y. he wants to predict the firstyear grade point averages of entering freshmen on the basis of their College Board scores. Next we must specify what close means. Statistical Models.32 . The basic biasvariance decomposition of mean square error is presented. I II . . A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio. An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson. loss function. ranking. We stress that looking at randomized procedures is essential for these conclusions. One reasonable measure of "distance" is (g(Z) . ]967. Other theorems are available characterizing larger but more manageable classes of procedures. For example. Since Y is not known. Here are some further examples of the kind of situation that prompts our study in this section. The basic global comparison criteria Bayes and minimax are presented as well as a discussion of optimality by restriction and notions of admissibility. The MSPE is the measure traditionally used in the .g(Z») = E[g(Z) ~ YI 2 or its square root yE(g(Z) .3. are due essentially to Waldo They are useful because the property of being Bayes is ea·der to analyze than admissibility. The frame we shaH fit them into is the following. Z is the information that we have and Y the quantity to be predicted. confidence bounds. A college admissions officer has available the College Board scores at entrance and firstyear grade point averages of freshman classes for a period of several years.4 PREDICTION . Goals. and prediction. These remarkable results. decision rule. We assume that we know the joint probability distribution of a random vector (or variable) Z and a random variable Y.y)2. testing.4). Section 3.Yj'. and risk through various examples including estimation. We introduce the decision theoretic foundation of statistics inclUding the notions of action space. we tum to the mean squared prediction error (MSPE) t>2(Y. at least when procedures with the same risk function are identified.I 'jlI . Similar problems abound in every field. in the college admissions situation. Using this information. A meteorologist wants to estimate the amount of rainfall in the coming spring. • 1. We want to find a function 9 defined on the range of Z such that g(Z) (the predictor) is "close" to Y.. which include the admissible rules. although it usually turns out that all admissible procedures of interest are indeed nonrandomized. and Performance Criteria Chapter 1 '. I The prediction Example 1. we referto Blackwell and Girshick (1954) and Ferguson (1967). at least in their original fonn. Bayes procedures (in various senses). The joint distribution of Z and Y can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal.
see Problem 1. Just how widely applicable the notions of this section are will become apparent in Remark 1.c)' < 00 for all c. in which Z is a constant.4. EY' < 00 Y .4.4. In this section we bjZj . EY' = J. that is.3.1. consider QNP and the class QL of linear predictors of the form a + We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information.4.Section 1. and by expanding (1.L = E(Y). 1.4. when EY' < 00. E(Y . The method that we employ to prove our elementary theorems does generalize to other measures of distance than 6. for example.5 and Section 3.4. we have E[(Y . (1.711). Because g(z) is a constant. we can find the 9 that minimizes E(Y .1) < 00 if and only if E(Y .4 Prediction 33 mathematical theory of prediction whose deeper results (see.) = 0 makes the cross product term vanish.20). In this situation all predictors are constant and the best one is that number Co that minimizes B(Y . 1957) presuppose it. exists. 0 Now we can solve the problem of finding the best MSPE predictor of Y.4. (1.4. then either E(Y g(Z))' = 00 for every function g or E(Y .25. L1=1 Lemma 1. Let (1.4. we can conclude that Theorem 1.I'(Z))' < E(Y .(Y. or equivalently. ljZ is any random vector and Y any random variable.l. Proof.c)2 is either oofor all c or is minimized uniquely by c In fact.g(Z))2.g(z))' IZ = z] = E[(Y I'(z))' I Z = z] + [g(z) I'(z)]'.3) If we now take expectations of both sides and employ the double expectation theorem (B.YI) (Problems 1. see Example 1.16).g(Z))' (1.6.c)2 as a function of c.4. We see that E(Y . g(Z)) such as the mean absolute error E(lg( Z) .c) implies that p. By the substitution theorem for conditional expectations (B.2) I'(z) = E(Y I Z = z).1 assures us that E[(Y .g(z))' I Z = z].4. E(Y .4) . given a vector Z.1.4.c = (Y 1') + (I' . Grenander and Rosenblatt.4.C)2 has a unique minimum at c = J.1) follows because E(Y p.L and the lemma follows.g(Z))' I Z = z] = E[(Y . Lemma 1.2 where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss. The class Q of possible predictors 9 may be the nonparametric class QN P of all 9 : R d Jo R or it may be to some subset of this class. See Remark 1.c)' ~ Var Y + (c _ 1')'.
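Theorem 1.4.1 can be illustrated by simulation: for any competing predictor g(Z), the empirical mean squared prediction error is at least that of μ(Z) = E(Y | Z). The sketch below uses a made-up nonlinear model (not from the text) and compares μ(Z) with the best linear predictor fit by least squares.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    Z = rng.uniform(-2, 2, n)
    Y = Z**2 + rng.normal(0, 1, n)          # here mu(z) = E(Y | Z = z) = z^2

    mspe_best = np.mean((Y - Z**2) ** 2)    # predictor mu(Z); close to Var(eps) = 1

    b, a = np.polyfit(Z, Y, 1)              # best linear competitor a + bZ
    mspe_linear = np.mean((Y - (a + b * Z)) ** 2)

    print("empirical MSPE of E(Y | Z)       :", round(mspe_best, 3))
    print("empirical MSPE of best linear fit:", round(mspe_linear, 3))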
7) IJVar Y • < 00 strict inequality ooids unless 1 Y = E(Y I Z) (1.1(c) is equivalent to (1.4. and Performance Criteria Chapter 1 for every 9 with strict inequality holding unless g(Z) best MSPE predictor.4.6) and that (1.E(v)IIU . To show (a). then we can write Proposition 1. Var(Y I z) = E([Y . Proof.4.34 Statistical Models. E{h(Z)<1 E{ E[h(Z)< I Z]} E{h(Z)E[Y I'(Z) I Z]} = 0 because E[Y I'(Z) I Z] = I'(Z) I'(Z) = O.4. Equivalently U and V are uncorrelated if either EV[U E(U)] = 0 or EUIV .4. let h(Z) be any function of Z. we can derive the following theorem.6) is linked to a notion that we now define: Two random variables U and V with EIUVI < 00 are said to be uncorrelated if EIV .4. and recall (B. then (1. Property (1.4. As a consequence of (1.4.E(U)] = O. Vat Y = E(Var(Y I Z» + Var(E(Y I Z).20).4. (1. Suppose that Var Y < 00.E(Y I z)]' I z).4. Write Var(Y I z) for the variance of the condition distribution of Y given Z = z.g(Z)' = E(Y . If E(IYI) < 00 but Z and Y are otherwise arbitrary.6).4.5) follows from (a) because (a) implies that the cross product term in the expansionof E ([Y 1'(z)J + [I'(z) g(z)IP vanishes. . then by the iterated expectation theorem.8) or equivalently unless Y is a function oJZ. Theorem 1.6) which is generally valid because if one side is infinite. Properties (b) and (c) follow from (a). Infact. I'(Z) is the unique E(Y . that is.5) An important special case of (1.1. 0 Note that Proposition 1.5) is obtained by taking g(z) ~ E(Y) for all z.Il(Z) denote the random prediction error. = I'(Z).I'(Z))' + E(g(Z) 1'(Z»2 ( 1.4. then (a) f is uncorrelated with every function ofZ (b) I'(Z) and < are uncorrelated (c) Var(Y) = Var I'(Z) + Var <. That is.4. = O. 1.2.5) becomes. (1. Goals. which will prove of importance in estimation theory. when E(Y') < 00.E(V)J Let € = Y . so is the other. then Var(E(Y I Z)) < Var Y.
E(Y I Z))' =L x 3 ~)y y=o .y) pz(z) 0. E(Y . i=l 3 E (Y I Z = ~) ~ 2.7) follows immediately from (1.15 I 0. Each day there can be 0.10 0. These fractional figures are not too meaningful as predictors of the natural number values of Y.15 3 0.lI. (1.05 0. E(Var(Y I Z)) ~ E(Y .4.45 ~ 1 The MSPE of the best predictor can be calculated in two ways.8) is true.45. 1. whereas the column sums py (y) yield the frequency of 0.25 1 0.25 2 0. Equality in (1. An assembly line operates either at full.05 0.05 0.025 0.1. if we are trying to guess. o Example 1. or 3 shutdowns due to mechanical failure. p(z. E (Y I Z = ±) = 1. 2. y) = 0.1 2. The assertion (1. We want to predict the number of failures for a given day knowing the state of the assembly line for the month.4.20. and only if.'" 1 Y.25 0.25 0.7) can hold if.10 0.30 I 0.30 py(y) I 0. the best predictor is E (30. y) = p (Z = Z 1 Y = y) of the number of shutdowns Y and the capacity state Z of the line for a randomly chosen day. and only if.4.:.9) this can hold if.10 0. half. the average number of failures per day in a given month. or 3 failures among all days.6). The row sums of the entries pz (z) (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state. also.4. I z) = E(Y I Z). In this case if Yi represents the number of failures on day i and Z the state of the assembly line.4.10. or quarter capacity. Within any given month the capacity status does not change. as we reasonably might. I.10 I 0.E(Y I Z = z))2 p (z. 2. We find E(Y I Z = 1) = L iF[Y = i I Z = 1] ~ 2. The first is direct. But this predictor is also the right one. .885.4 Prediction 35 Proof. The following table gives the frequency function p( z.Section 1.E(Y I Z))' ~ 0 By (A.025 0.50 z\y 1 0 0.
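The two ways of computing the MSPE used in this example are purely mechanical once a joint frequency table is in hand. The sketch below does the bookkeeping for a generic discrete table; the table entered here is a small invented illustration, not the one printed in the text.

    import numpy as np

    z_vals = np.array([0.25, 0.5, 1.0])           # capacity states (illustrative)
    y_vals = np.arange(4)                         # 0, 1, 2, 3 shutdowns
    p = np.array([[0.05, 0.05, 0.05, 0.05],       # joint frequencies p(z, y), rows indexed by z
                  [0.05, 0.10, 0.10, 0.05],       # (invented numbers that sum to 1)
                  [0.05, 0.10, 0.20, 0.15]])

    pz = p.sum(axis=1)                            # marginal of Z
    mu = (p * y_vals).sum(axis=1) / pz            # best predictor mu(z) = E(Y | Z = z)
    for z, m in zip(z_vals, mu):
        print(f"E(Y | Z = {z}) = {m:.3f}")

    # MSPE of the best predictor: directly, and via Var Y - Var(E(Y | Z)) as in (1.4.6)
    direct = (((y_vals - mu[:, None]) ** 2) * p).sum()
    EY = (p * y_vals).sum()
    via_variances = ((y_vals - EY) ** 2 * p).sum() - ((mu - EY) ** 2 * pz).sum()
    print("MSPE directly:", round(direct, 4), "  via (1.4.6):", round(via_variances, 4))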
The second way is to use (1.4.6), writing

    E(Y − E(Y | Z))² = Var Y − Var(E(Y | Z)) = E(Y²) − E[(E(Y | Z))²]
                     = Σ_y y² p_Y(y) − Σ_z [E(Y | Z = z)]² p_Z(z) = 0.885    (1.4.9)

as before.

Example 1.4.2. The Bivariate Normal Distribution. If (Z, Y) has a N(μ_Z, μ_Y, σ_Z², σ_Y², ρ) distribution, Theorem B.4.2 tells us that the conditional distribution of Y given Z = z is N(μ_Y + ρ(σ_Y/σ_Z)(z − μ_Z), σ_Y²(1 − ρ²)). Therefore, the best predictor of Y using Z is the linear function

    μ₀(Z) = E(Y | Z) = μ_Y + ρ(σ_Y/σ_Z)(Z − μ_Z).    (1.4.10)

Because

    E((Y − E(Y | Z = z))² | Z = z) = σ_Y²(1 − ρ²)

is independent of z, the MSPE of our predictor is given by

    E(Y − E(Y | Z))² = σ_Y²(1 − ρ²).    (1.4.11)

If ρ = 0, the best predictor is just the constant μ_Y, as we would expect in the case of independence. If ρ > 0, the predictor is a monotone increasing function of Z, indicating that large (small) values of Y tend to be associated with large (small) values of Z. Similarly, ρ < 0 indicates that large values of Z tend to go with small values of Y and we have negative dependence. Thus, for this family of distributions the sign of the correlation coefficient gives the type of dependence between Z and Y, whereas its magnitude measures the degree of such dependence.

One minus the ratio of the MSPE of the best predictor of Y given Z to Var Y, which is the MSPE of the best constant predictor, can reasonably be thought of as a measure of dependence. In the bivariate normal case this quantity is just ρ². The larger this quantity, the more dependent Z and Y are. The line y = μ_Y + ρ(σ_Y/σ_Z)(z − μ_Z), which corresponds to the best predictor of Y given Z in the bivariate normal model, is usually called the regression (line) of Y on Z. The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution.

Regression toward the mean. The term regression was coined by Francis Galton and is based on the following observation. Suppose Y and Z are bivariate normal random variables with the same mean μ, variance σ², and positive correlation ρ.
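A quick Monte Carlo illustration of Example 1.4.2 (an aside, with made-up parameter values): the simulated MSPE of the regression predictor (1.4.10) is close to σ_Y²(1 − ρ²) of (1.4.11), while the best constant predictor μ_Y has MSPE close to σ_Y².

```python
import numpy as np

rng = np.random.default_rng(0)
mu_z, mu_y, sig_z, sig_y, rho = 2.0, 1.0, 1.5, 2.0, 0.6      # illustrative values
cov = [[sig_z**2, rho * sig_z * sig_y],
       [rho * sig_z * sig_y, sig_y**2]]
z, y = rng.multivariate_normal([mu_z, mu_y], cov, size=200_000).T

best = mu_y + rho * (sig_y / sig_z) * (z - mu_z)             # regression of Y on Z, (1.4.10)
print(np.mean((y - best)**2), sig_y**2 * (1 - rho**2))       # MSPE approx sigma_Y^2 (1 - rho^2)
print(np.mean((y - mu_y)**2), sig_y**2)                      # constant predictor loses the factor (1 - rho^2)
```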
.6) in which I' (I"~. . (Zn. Y» T T ~ E yZ and Uyy = Var(Y). so the MSPE of !lo(Z) is smaller than the MSPE of the constant predictor J1y.8 = EziEzy anduYYlz ~ uyyEyzEziEzy. coefficient ofdetennination or population Rsquared.4. We shall see how to do this in Chapter 2.12) with MSPE ElY l"o(Z)]' ~ E{EIY l"o(zlI' I Z} = E(uYYlz) ~ Uyy  EyzEziEzy. YI ). Y) is unavailable and the regression line is estimated on the basis of a sample (Zl. Let Z = (Zl. these were the heights of a randomly selected father (Z) and his son (Y) from a large human population. where the last identity follows from (1. or the average height of sbns whose fathers are the height Z. is closer to the population mean of heights tt than is the height of the father. the distribution of (Z. ttd)T and suppose that (ZT. The quadratic fonn EyzEZ"iEzy is positive except when the joint nonnal distribution is degenerate. should. The variability of the predicted value about 11. I"Y = E(Y). E). be less than that of the actual heights and indeed Var((! .. uYYlz) where. This quantity is called the multiple correlation coefficient (MCC).COV(Zd. Thus.. We write Mee = ' ~! _ PZy ElY /lO(Z))' ~ Var l"o(Z) Var Y Var Y . Ezy O"yy E zz is the d x d variancecovariance matrix Var(Z) .. .4.4 Prediction 37 tt. E zy = (COV(Z" Y).p)tt + pZ. usual correlation coefficient p = UZy / crfyCT lz when d = 1.11).6).4." This is compensated for by "progression" toward the mean among the sons of shorter fathers and there is no paradox. Yn ) from the population. .. and positive correlation p. . there is "regression toward the mean. (1 . N d +1 (I'. Then the predicted height of the son. consequently.. Note that in practice. the best predictor E(Y I Z) ofY is the linear function (1. in particular in Galton's studies. 0 Example 1. y)T has a (d + !) multivariate normal. . By (1. variance cr2. " Zd)T be a d x 1 covariate vector with mean JLz = (ttl. Thus. the MCC equals the square of the . One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with Y. Theorem B. . I"Y) T.8.3.. The Multivariate Normal Distribution. In Galton's case.4.p)1" + pZ) = p'(T'. tall fathers tend to have shorter sons. .Section 1.5 states that the conditional distribution of Y given Z = z is N(I"Y +(zl'zf.6. distribution (Section B. .
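The formulas of Example 1.4.3 are easy to evaluate numerically. The sketch below (not from the text) uses made-up covariance blocks Σ_ZZ, Σ_ZY and variance σ_YY, not the height data quoted above, and computes β = Σ_ZZ⁻¹Σ_ZY, the MSPE σ_YY − Σ_YZΣ_ZZ⁻¹Σ_ZY, and the multiple correlation coefficient.

```python
import numpy as np

# Made-up covariance structure for d = 2 covariates (illustrative only).
Sigma_ZZ = np.array([[2.0, 0.5],
                     [0.5, 1.5]])            # Var(Z)
Sigma_ZY = np.array([0.8, 0.6])              # Cov(Z_j, Y)
sigma_YY = 3.0                               # Var(Y)
mu_Z, mu_Y = np.array([64.0, 69.0]), 65.0    # illustrative means

beta = np.linalg.solve(Sigma_ZZ, Sigma_ZY)   # beta = Sigma_ZZ^{-1} Sigma_ZY
mspe = sigma_YY - Sigma_ZY @ beta            # sigma_YY - Sigma_YZ Sigma_ZZ^{-1} Sigma_ZY
mcc = (Sigma_ZY @ beta) / sigma_YY           # multiple correlation coefficient

def best_predictor(z):
    # mu_0(z) = mu_Y + (z - mu_Z)^T beta, the conditional mean E(Y | Z = z)
    return mu_Y + (np.asarray(z) - mu_Z) @ beta

print(beta, mspe, mcc, best_predictor([66.0, 70.0]))
```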
bZ}' = E(Y)  b E(Z) 1· = (Y  [Z(b .38 Statistical Models. and parents..y = .zy ~ (407. What is the best (zero intercept) linear predictor of Y in the sense of minimizing MSPE? The answer is given by: Theorem 1.. In words. We expand {Y .335.1. Suppose(l) that (ZT l Y) T is trivariate nonnal with Var(Y) = 6.2. when the distribution of (ZT. We first do the onedimensional case.4. Z2)T be the heights in inches of a IOyearold girl and her parents (Zl = mother's height. we can avoid both objections by looking for a predictor that is best within a class of simple predictors. Y) a I VarZ. are P~. The natural class to begin with is that of linear combinations of components of Z.~ ~:~~).l Proof. The problem of finding the best MSPE predictor is solved by Theorem 1. . Let us call any random variable of the form a + bZ a linear predictor and any such variable with a = 0 a zero intercept linear predictor. respectively. Then the unique best zero intercept linear predictor i~ obtained by taking E(ZY) b=ba = E(Z') . 29W· Then the strength of association between a girl's height and those of her mother and father.5%. E(Y . Y. Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of Z and Y in order to calcujate E(Y I Z) and that the best predictor may be a complicated function of Z.y = . Goals. and E(Y _ boZ)' = E(Y') _ [E(ZY)]' E(Z') (1. In practice.13) . Suppose that E(Z') and E(Y') are finite and Z and Y are not constant.bZ)' is uniquely minintized by b = ba.3%. y)T is unknown.ba) + Zba]}' to get ba )' . yn)T See Sections 2. whereas the unique best linear predictor is ILL (Z) = al + bl Z where b = Cov(Z.. l.bZ)' = E(Y') + E(Z')(b  Therefore. (Z~.1 and 2.zz ~ (. the hnear predictor l'a(Z) and its MSPE will be estimated using a 0 sample (Zi.)T.393. let Y and Z = (Zl. Z2 = father's height).39 l. and Performance Criteria Chapter 1 For example. If we are willing to sacrifice absolute excellence.9% and 39.E(Z')b~.4. E(Y .. respectively. p~y = . knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33. . The best linear predictor.4. P~.3.:.209. The percentage reductions knowing the father's and both parent's heights are 20.
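Because the optimal coefficients above are simple moment functionals, they can be estimated by plugging in sample moments. Here is a hedged sketch (simulated data, with an assumed model Y = 1 + 2Z + noise) computing the best zero-intercept slope E(ZY)/E(Z²), the best linear predictor coefficients b₁ = Cov(Z,Y)/Var Z and a₁ = E(Y) − b₁E(Z), and checking the MSPE identity (1.4.13) empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(1.0, 1.0, size=100_000)
y = 1.0 + 2.0 * z + rng.normal(0.0, 1.0, size=z.size)        # assumed linear relationship

b0 = np.mean(z * y) / np.mean(z**2)                          # best zero-intercept slope
b1 = (np.mean(z * y) - np.mean(z) * np.mean(y)) / np.var(z)  # best slope Cov(Z,Y)/Var Z
a1 = np.mean(y) - b1 * np.mean(z)                            # best intercept E(Y) - b1 E(Z)

mspe_zero = np.mean((y - b0 * z)**2)                         # MSPE of zero-intercept predictor
rhs_1413 = np.mean(y**2) - np.mean(z * y)**2 / np.mean(z**2) # right side of (1.4.13)
print(b0, (a1, b1), mspe_zero, rhs_1413)                     # mspe_zero matches (1.4.13)
```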
0 Remark 1.E(ZWW ' E([Z . This is in accordance with our evaluation of E(Y I Z) in Example 1.l7) in the appendix.E(Y)) . Note that R(a.bZ)2 we see that the b we seek minimizes E[(Y .4 Prediction 39 To prove the second assertion of the theorem note that by (lA. Best Multivariate Linear Predictor.4. E[Y . If EY' and (E([Z .1). by (1.bZ)' ~ Var(Y ..a .1 the best linear predictor and best predictor differ (see Figure 1. I 1.Z)'.a .E(Y) to conclude that b. and only if.4. This is because E(Y . if the best predictor is linear.1. Therefore. Substituting this value of a in E(Y .a . it must coincide with the best linear predictor.bE(Z) .I'd Z )]' = 1 05 ElY I'(Z)j' .boZ)' > 0 is equivalent to the CauchySchwarz inequality with equality holding if.E(Z)IIZ .4.4.4. We can now apply the result on zero intercept linear predictors to the variables Z .t and covariance E of X.bZ) + (E(Y) . is the unique minimizing value.bE(Z). = I'Y + (Z  J.I1. On the other hand. .E(Z) and Y . Let Po denote = . E(Y . 0 Note that if E(Y I Z) is of the form a + bZ. (1.boZ)' = 0.a.tzf [3. Theorem 1. whatever be b. and b = b1 • because.E(Z))]'.E(Z)J))1 exist.b. b) X ~ (ZT.bZ)2 is uniquely minimized by taking a ~ E(Y) .4.2.E(Z)J[Y . in Example 1.E(Z)f[Z .Section 1.E(Y)]) = EziEzy. E(Y .4. In that example nothing is lost by using linear prediction.4.16) directly by calculating E(Y .I'l(Z)1' depends on the joint distribution P of only through the expectation J. then the Unique best MSPE predictor is I'dZ) Proof.13) we obtain tlte proof of the CauchySchwarz inequality (A. That is. E(Y .4.14) Yl T Ep[Y .b(Z .l).5). A loss of about 5% is incurred by using the best linear predictor.a)'. whiclt corresponds to Y = boZo We could similarly obtain (A. From (1. then a = a. OUf linear predictor is of the fonn d I'I(Z) = a + LbjZj j=l = a+ ZTb [3 = (E([Z .
In the general.1(a). b) is miuimized by (1. 0 Remark 1.4 by extending the proof of Theorem 1. We could also have established Theorem 1. and R(a. We want to express Q: and f3 in terms of moments of (Z. The three dots give the best predictor. Y). Goals. . See Problem 1.4.17 for an overall measure of the strength of this relationship.1.2. .14).4. The line represents the best linear predictor y = 1. N(/L. Set Zo = 1. case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the correlation between Y and the best linear predictor of Y. . thus.4.00 z Figure 1. the multivariate uormal.. b) = Ro(a. that is.19. (1. A third approach using calculus is given in Problem 1. E(Zj[Y . . Suppose the mooel for I'(Z) is linear. 0 Remark 1.. b) = Epo IY .25 0. .4. ~ Y . . b).4. E).I'LCZ)).4.I'I(Z)]'. 2 • • 1 o o 0.50 0.. distribution and let Ro(a.40 Statistical Models. j = 0.4.14). the MCC gives the strength of the linear relationship between Z and Y. b) is minimized by (1. that is. Zd are uncorrelated. not necessarily nonnal. Thus..4.4.4. . and Performance Criteria Chapter 1 y .3.I'(Z) and each of Zo.3.4.4. . p~y = Corr2 (y.45z. However.(a + ZT 13)]) = 0..05 + 1.3 to d > 1. our new proof shows how secondmoment results sometimes can be established by "connecting" them to the nannal distribution. Ro(a. R(a. .4. I"(Z) = E(Y I Z) = a + ZT 13 for unknown Q: E R and f3 E Rd.d.15) .75 1. 0 Remark 1.4. By Example 1. Because P and Po have the same /L and E. By Proposition 1.
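The system (1.4.15) has an exact empirical analogue: replacing expectations by sample averages gives the usual least squares equations. The following sketch (an aside, with simulated data and an assumed linear μ(Z)) solves them and verifies that the fitted residuals are orthogonal to Z₀ = 1 and to each covariate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50_000, 3
Z = rng.normal(size=(n, d))
Y = 0.5 + Z @ np.array([1.0, -2.0, 0.3]) + rng.normal(size=n)   # assumed linear mu(Z)

Z0 = np.column_stack([np.ones(n), Z])            # prepend the constant Z_0 = 1
coef = np.linalg.solve(Z0.T @ Z0, Z0.T @ Y)      # (a, beta_1, ..., beta_d)
resid = Y - Z0 @ coef

print(coef)
print(Z0.T @ resid / n)                          # all near zero: the orthogonality in (1.4.15)
```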
5)'. Remark 1. we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation. this gives a new derivation of (1. Moreover.I 0 and there is a 90 E 9 such that 0 < oc form a Hilbert go = arg inf{".10.g(Z»): 9 E g).14) (Problem 1.16) is the Pythagorean identity. Remark 1. I'dZ» = ".4. If we identify II with Y and X with Z. Note that (1.I'(Z) is orthogonal to I'(Z) and to I'dZ). Thus. Because the multivariate normal model is a linear model.(Y.23).4.3.5) = (9 . then by recording or taking into account only the value of T(X) we have a reduction of the data.Thus.6.(Y 19NP). I'dZ) = ..17) ""'(Y. T(X) = X loses information about the Xi as soon as n > 1.4.1.4. The range of T is any space of objects T. I'(Z» Y . The notion of mean squared prediction error (MSPE) is introduced. usually R or Rk. We return to this in Section 3.5 SUFFICIENCY Once we have postulated a statistical model. We consider situations in which the goal is to predict the (perhaps in the future) value of a random variable Y. but as we have seen in Sec~ tion 1.. Consider the Bayesian model of Section 1. Using the distance D. It is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear.12). .15) for a and f3 gives (1. Recall that a statistic is any function of the observations generically denoted by T(X) or T. the optimal MSPE predictor is the conditional expected value of Y given Z. We begin by fonnalizing what we mean by "a reduction of the data" X EX..(Y 1ge) ~ "(I'(Z) 19d (1. we see that r(5) ~ MSPE for squared error loss 1(9.(Y I 9). I'(Z» + ""'(Y..' (l'e(Z).Section 1.8) defined by r(5) = E[I(II. then 90(Z) is called the projection of Y on the space 9 of functions of Z and we write go(Z) = .4. can also be a set of functions.3.5.4. When the class 9 of possible predictors 9 with Elg(Z)1 space as defined in Section B. and it is shown that if we want to predict Y on the basis of information contained in a random vector Z. 1. the optimal MSPE predictor E(6 I X) is the Bayes procedure for squared error loss.2.. we can conclude that I'(Z) = .4. The optimal MSPE predictor in the multivariate normal distribution is presented. With these concepts the results of this section are linked to the general Hilbert space results of Section 8.16) (1.2 and the Bayes risk (1. and projection 1r notation.5(X»].4. If T assigns the same value to different sample points. o Summary..5 Sufficiency 41 Solving (1.4. g(Z) and h(Z) are said to be orthogonal if at least one has expected value zero and E[g(Z)h(Z)] ~ O.
T is a sufficient statistic for O. Therefore. Suppose there is no dependence between the quality of the items produced and let Xi = 1 if the ith item is good and 0 otherwise. .l.l. By Example B." it is important to notice that there are many others e.l we see that given Xl + X2 = t.5.l. We could then represent the data by a vector X = (Xl.Xn ) where Xi = 1 if the ith item sampled is defective and Xi = 0 otherwise. It follows that. = t. t) distribution. it is intuitively clear that if we are interested in the proportion 0 of defective items nothing is lost in this situation by recording and using only T. .' • . Xl has aU(O. = tis U(O. ~ t. X. recording at each stage whether the examined item was defective or not.5. 16. .Xn ) does not contain any further infonnation about 0 or equivalently P.'" .4). o Example 1. For instance. we need only record one.2. Each item produced is good with probability 0 and defective with probability 1. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information.) and that of Xlt/(X l + X. However. The most trivial example of a sufficient statistic is T(X) = X because by any interpretation the conditional distribution of X given T(X) = X is point mass at X. Thus. X. X n ) into the same number. is a statistic that maps many different values of (Xl. .9.) are the same and we can conclude that given Xl + X...1. T = L~~l Xi. Then X = (Xl. when Xl + X.'" . the conditional distribution of Xl = [XI/(X l + X.1) where Xi is 0 or 1 and t = L:~ I Xi. • X n ) (X(I) . P[X I = XI. the sample X = (Xl. the conditional distribution of XI/(X l + X.) given Xl + X. Instead of keeping track of several numbers.) is conditionally distributed as (X. are independent and the first of these statistics has a uniform distribution on (0.X. Goals. is sufficient. whatever be 8. . . 1) whatever be t. 1).Xn ) is the record of n Bernoulli trials with probability 8.5).3.5.'" . X 2 the time between the arrival of the first and second customers. . Thus. t) and Y = t .) and Xl +X. Using our discussion in Section B.0. " ..)](X I + X. + X. Although the sufficient statistics we have obtained are "natural. and Performance Criteria Chapter 1 Even T(X J. The total number of defective items observed.. once the value of a sufficient statistic T is known. XI/(X l +X. Xl and X 2 are independent and identically distributed exponential random variables with parameter O.42 Statistical Models.. given that P is valid.8)n' (1. Begin by noting that according to TheoremB. Y) where X is uniform on (0. We prove that T = X I + X 2 is sufficient for O. Example 1. X(n))' loses information about the labels of the Xi. . One way of making the notion "a statistic whose use involves no loss of infonnation" precise is the following. .. Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) Let Xl be the time of arrival of the first customer.1.Xn = xnl = 8'(1.2. 0 In both of the foregoing examples considerable reduction has been achieved. suppose that in Example 1. where 0 is unknown. By (A. . By (A. in the context of a model P = {Pe : (J E e}.1 we had sampled the manufactured items in order. whatever be 8. A statistic T(X) is called sufficient for PEP or the parameter if the conditional distribution of X given T(X) = t does not involve O. (Xl.. A machine produces n items in succession. Thus. the conditional distribution of X 0 given T = L:~ I Xi = t does not involve O. We give a decision theory interpretation that follows.
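The Bernoulli computation in Example 1.5.1 can be verified by brute force: for any θ, the conditional distribution of X given T = Σ Xᵢ = t is uniform over the (n choose t) arrangements of t ones. A small enumeration sketch (not in the text):

```python
import itertools
import numpy as np

# For Bernoulli(theta) trials, the conditional distribution of X = (X_1, ..., X_n)
# given T = sum X_i = t does not depend on theta.
def conditional_given_T(theta, n=4, t=2):
    pts = [x for x in itertools.product([0, 1], repeat=n) if sum(x) == t]
    probs = np.array([theta**t * (1 - theta)**(n - t) for _ in pts])
    return probs / probs.sum()                 # uniform over the C(n, t) arrangements

print(conditional_given_T(0.2))
print(conditional_given_T(0.7))                # identical: 1/C(4,2) = 1/6 for every arrangement
```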
e porT = til = I: {x:T(x)=t.e) = g(ti. if (152) holds.In a regular model. e) definedfor tin T and e in on X such that p(X.]/porT = Ii] p(x. e)h(Xj) Po[T (15.5.. In general. • only if. if Tl and T2 are any two statistics such that 7 1 (x) = T 1 (y) if and only if T2(x) = T 2(y). forO E Si. O)h(x) for all X E X.j is independent ofe on cach of the sets Si = {O: porT ~ til> OJ.. Such statistics are called equivalent. it is enough to show that Po[X = XjlT ~ l. Let (Xl. i = 1. The complete result is established for instance by Lehmann (1997. Section 2.4) if T(xj) ~ Ii 41 o if T(xj) oF t i . and e and a/unction h defined (152) = g(T(x). and Halmos and Savage. 0) porT ~ Ii] ~~CC+ g(li. Fortunately.5 Sufficiency 43 that will do the same job.O) I: {x:T(x)=t. More generally. Neyman. . T = t..6). Being told that the numbers of successeS in five trials is three is the same as knowing that the difference between the numbers of successes and the number of failures is one. (153) By (B. 0 E 8.S. Po[X = XjlT = t.] o if T(xj) oF ti if T(xj) = Ii.. there exists afunction g(t.") be the set of possible realizations of X and let t i = T(Xi)' Then T is discrete and 2::~ 1 porT = Ii] = 1 for every e.Ll) and (152). This result was proved in various forms by Fisher..} p(x. We shall give the proof in the discrete case.I. we need only show that Pl/[X = xjlT = til is independent of for every i and j.} h(x). a statistic T(X) with range T is sufficient for e if. a simple necessary and sufficient criterion for a statistic to be sufficient is available. X2. It is often referred to as the factorization theorem for sufficient statistics.O) Theorem I. Proof.] Po[X = Xj. checking sufficiency directly is difficult because we need to compute the conditional distribution.2. Applying (153) we arrive at. Now. h(xj) (1. Po [X ~ XjlT = t.Section 1. then T 1 and T2 provide the same information and achieve the same reduction of the data. To prove the sufficiency of (152). By our definition of conditional probability in the discrete case.5) .
... . . Common sense indicates that to get information about B. 0 o Example 1.S. and h(xl •.xn }. n P(Xl...5. we can show that X(n) is sufficient. . T is sufficient. we need only keeep track of X(n) = max(X ..Xn ) is given by (see (A.5.Xn ) = L~ IXi is sufficient.5. O} = 0 otherwise.44 Statistical Models. •. O)h(x) (1. ~ Po[X = x. [27".. Then the density of (X" .1. . and both functions = 0 otherwise.. which admits simple sufficient statistics and to which this example belongs.Xn.5..[27f.3)..JL)'} ~=l I n I! 2 .. By Theorem 1. If X 1..'I. 1. .16. . I x n ) = 1 if all the Xi are > 0..10) P(Xl.O)=onexP[OLXil i=l (1. .X. ' .' L(Xi . l X n are the interarrival times for n customers. then the joint density of (X" . and P(XI' . .7) o Example 1. T = T(x)] = g(T(x). 0) by (B. are introduced in the next section. let g(ti' 0) = porT = tiL h(x) = P[X = xIT(X) = til Then (1.9) if every Xi is an integer between 1 and B and p( X 11 • (1.~.. . Let 0 = (JL.X n JI) = 0 otherwise. 1 X n be independent and identically distributed random variables each having a normal distribution with mean {l and variance (j2. _ _ _1 .8) = eneStift > 0. Conversely.. X n ).. > 0. The probability distribution of X "is given by (1.' (L x~ 1=1 1 n n 2JL LXi))]. Consider a population with () members labeled consecutively from I to B.Xn ) is given by I. and Performance Criteria Chapter 1 Therefore. ..4.5. 0 Example 1.6) p(x. .5. Estimating the Size of a Population.5. The population is sampled with replacement and n members of the population are observed and their labels Xl.4)). A whole class of distributions.5. Expression (1. if T is sufficient.1 to conclude that T(X 1 . We may apply Theorem 1.n /' exp{ .'t n /'[exp{ .'). Let Xl. . = max(xl....5. 1=1 (!.'" . 10 fact.O) where x(n) = onl{x(n) < O}.5.Il) .Xn are recorded.2. . both of which are unknown. . }][exp{ .3.2. X(n) is a sufficient statistic for O.8) if all the Xi are > 0. Goals.5.9) can be rewritten as " 1 Xn . Takeg(t.2 (continued).
6.2a I3. a 2)T is identifiable (Problem 1. we can.) i=l i=l is sufficient for B. An equivalent sufficient statistic in this situation that is frequently used IS S(X" .' + 213 2a + 213. i= 1 = (lin) L~ 1 Xi. Yi N(Jli. Then 6 = {{31.12) t ofT(X) and a Example 1.9) and _ (y.Section 1. By randomized we mean that o'(T(X») can be generated from the value random mechanism not depending on B.5. a<. EYi 2 . (1. Suppose X" .Zi)'} exp {Er. L~l x~) and () only and T(X 1 ..5. I)) . Let o(X) = Xl' Using only X..4 with d = 2.1.xnJ)) is itself a function of (L~ upon applying Theorem 1. if T(X) is sufficient.1.X)']. Specifically. Here is an example. . X n ) = (I: Xi. where we assume diat the given constants {Zi} are not all identical.~0)}(2"r~n exp{ .5 Sufficiency 45 I Evidently P{Xl 1 • •• .··· . in Example 1. 0 X Example 1. Ezi 1'i) is sufficient for 6. for any decision procedure o(x). 1 2 2 . .EZiY. .~ I: xn By the factorization theorem. 2:>. that is.5. X n ) = [(lin) where I: Xi.• Yn are independent. respectively. . Suppose. find a randomized decision rule o'(T(X)) depending only on T(X) that does as well as O(X) in the sense of having the same risk function.5. T = (EYi . with JLi following the linear regresssion model ("'J . Thus.o) = R(O. The first and second components of this vector are called the sample mean and the sample variance.. we construct a rule O'(X) with the sarne risk = mean squared error as o(X) as follows: Conditionally. X n are independent identically N(I). [1/(n i=l n n 1)1 I:(Xi . that Y I . a 2 ). 0') for allO.} + EY. Then pix. o Sufficiency and decision theory Sufficiency can be given a clear operational interpretation in the decision theoretic setting.5.1 we can conclude that n n Xi. {32.l) distributed. . X is sufficient..( 2?Ta ')~ exp p {E(fJ. I)) = exp{nI)(x . R(O.
Then by the factorization theorem we can write p(x.1 (continued). .6). Goals. R(O. Let SeX) be any other sufficient statistic. This 6' (T(X)) will have the same risk as 6' (X) because. X n ) are hoth sufficient. We define the statistic T(X) to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic S(X). if Xl. 0 and X are independent given T(X). Theorem }.12) follows along the lines of the preceding example: Given T(X) = t. Equivalently.46 Statistical Models. L~ I xl) and S(X) = (X" . In this Bernoulli trials case.6). where k In this situation we call T Bayes sufficient.14. . 6'(X) and 6(X) have the same mean squared error. . (T2) sample n > 2.. Now draw 6' randomly from this conditional distribution. the same as the posterior distribution given T(X) = L~l Xi = k.Xn is a N(p. it is Bayes sufficient for every 11.5. o Sufficiency aud Bayes models There is a natural notion of sufficiency of a statistic T in the Bayesian context where in addition to the model P = {Po : 0 E e} we postulate a prior distribution 11 for e. n~l) distribution. T = I Xi was shown to be sufficient. 0 The proof of (1. choose T" = 15"' (X) from the normal N(t. we can find a transformation r such that T(X) = r(S(X)).2. = 2:7 Definition. Minimal sufficiency For any model there are many sufficient statistics: Thus. O)h(x) E:  .5. and Performance Criteria Chapter 1 given X = t.2. Example 1. But T(X) provides a greater reduction of the data. Thus.4.l and (1.5. Using Section B.6(X»)IT]} = R(O.6'(T))IT]} ~ E{E[R(O...5. the distribution of 6(X) does not depend on O.6') ~ E{E[R(O. ). 0) as p(x. This result and a partial converse is the subject of Problem 1.. we find E(T') ~ E[E(T'IX)] ~ E(X) ~ I" ~ E(X . by the double expectation theorem.O) = g(S(x). IfT(X) is sufficient for 0.. then T(X) = (L~ I Xi. In Example 1. ) Var(T') = E[Var(T'IX)] + VarIE(T'IX)]  ~ nl n + 1 n ~ 1 = Var(X .1 (Bernoulli trials) we saw that the posterior distribution given X = x is 1 Xi. (Kolmogorov). T(X) is Bayes sufficient for IT if the posterior distribution of 8 given X = x is the same as the posterior (conditional) distribution of () given T(X) = T(x) for all x. in that.
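The randomized rule δ′(T) constructed above is easy to simulate. In the sketch below (an aside; θ, n and the number of replications are arbitrary choices), X₁, ..., Xₙ are N(θ, 1), δ(X) = X₁, and δ′ is drawn from N(X̄, (n − 1)/n) given X̄. Both rules have Monte Carlo mean squared error close to 1, as claimed.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.0, 5, 200_000
X = rng.normal(theta, 1.0, size=(reps, n))

delta = X[:, 0]                                          # uses the first observation only
xbar = X.mean(axis=1)                                    # sufficient statistic T = Xbar
delta_prime = rng.normal(xbar, np.sqrt((n - 1) / n))     # drawn given T, without knowing theta

print(np.mean((delta - theta)**2),
      np.mean((delta_prime - theta)**2))                 # both risks approx 1
```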
for a given 8.(t. ()) x E X}.O)h(x) forall O. ~ 1/3. t. when we think of Lx(B) as a function of (). we find OT(1 . for a given observed x. It is a statistic whose values are functions. T is minimally sufficient. the statistic L takes on the value Lx. 2/3)/g(S(x). }exp ( I .4 (continued).o)nT ~g(S(x).1).11) is determined by the twodimensional sufficient statistic T Set 0 = (TI.O). In this N(/". as a function of (). the likelihood function (1. for example.O E e. L x (') determines (iI. Now. Thus.) n/' exp {nO. = 2 log Lx(O.) ~ (L Xi. then Lx(O) = (27rO. L x (()) gives the probability of observing the point x. the "likelihood" or "plausibility" of various 8. Example 1. In the discrete case. (T'). The formula (1. Lx is a map from the sample space X to the class T of functions {() t p(x.)}.5. 20. it gives.5.1) nlog27r .. In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around x. Thus. 20' 20 I t. if X = x. ~ 2/3 and 0. if we set 0. We define the likelihood function L for a given observed data vector x as Lx(O) = p(x. However.) = (/". L Xf) i=l i=l n n = (0 10 0. ()) for different values of () and the factorization theorem to establish that a sufficient statistic is minimally sufficient.8) for the posterior distribution can then be remembered as Posterior ex: (Prior) x (Likelihood) where the sign ex denotes proportionality as functions of 8. t2) because. 1/3)]}/21og 2. we find T = r(S(x)) ~ (log[2ng(s(x).Section 1.T.5. take the log of both sides of this equation and solve for T. The likelihood function o The preceding example shows how we can use p(x.5 Sufficiency 47 Combining this with (1. the ratio of both sides of the foregoing gives In particular. For any two fixed (h and fh. (T') example.5.
We define a statistic T(X) to be Bayes sufficient for a prior 7r if the posterior distribution of f:} given X = x is the same as the posterior distribution of 0 given T(X) = T(x) for all X. Consider an experiment with observation vector X = (Xl •. 0 the likelihood ratio of () to Bo. Suppose there exists 00 such that (x: p(x. For instance. S(X) = (R" . The likelihood function is defined for a given data vector of . or if T(X) ~ (X~l)' . l:~ I (Xi .17).5. < X. X is sufficient. OJ denote the frequency function or density of X. L is minimal sufficient. • X( n» is sufficient. itself sufficient. it is Bayes sufficient for O. 1 X n ). but if in fact a 2 I 1 all information about a 2 is contained in the residuals. and Scheffe. By arguing as in Example 1. B ul the ranks are needed if we want to look for possible dependencies in the observations as in Example 1.. . if the conditional distribution of X given T(X) = t does not involve O.5. SeX) becomes irrelevant (ancillary) for inference if T(X) is known but only if P is valid. (X( I)' . .5. Suppose that X has distribution in the class P ~ {Po : 0 E e}. . Goals..Xn . _' . Then Ax is minimal sufficient.. or for the parameter O. 1) (Problem 1. SeX)~ where SeX) is a statistic needed to uniquely detennine x once we know the sufficient statistic T(x). Let Ax OJ c (x: p(x. Thus.X. . L is a statistic that is equivalent to ((1' i 2) and.1. . If.5. and Performance Criteria Chapter 1 with a similar expression for t 1 in terms of Lx (O.Rn ). a statistic closely related to L solves the minimal sufficiency problem in general. where R i ~ Lj~t I(X. The "irrelevant" part of the data We can always rewrite the original X as (T(X).48 Statistical Models. as in the Example 1.5. We say that a statistic T (X) is sufficient for PEP. Let p(X. O)h(X).1 (continued) we can show that T and. a 2 is assumed unknown. (X.5.X). If P specifies that X 11 •. we can find a randomized decision rule J'(T(X» depending only on the value of t = T(X) and not on 0 such that J and J' have identical risk functions.5. The factorization theorem states that T(X) is sufficient for () if and only if there exist functions g( t. hence. .4. See Problem ((x:). Lehmann.13. Thus.. 0 In fact.O) > for all B.. 1 X n are a random sample. We show the following result: If T(X) is sufficient for 0. if a 2 = 1 is postulated.). Ax is the function valued statistic that at (J takes on the value p x.X)2) is sufficient. Summary. .. the ranks. 1) and Lx (1. ifT(X) = X we can take SeX) ~ (Xl .12 for a proof of this theorem of Dynkin. hence.Oo) > OJ = Lx~6o)' Thus. A sufficient statistic T(X) is minimally sufficient for () if for any other sufficient statistic SeX) we can find a ttansformation r such that T(X) = r(S(X). the residuals.5. in Example 1. but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding S(X)see Problem 1.X Cn )' the order statistics.. If T(X) is sufficient for 0. 0) = g(T(X). 1. then for any decision procedure J(X). 0) and heX) such that p(X...
observations X to be the function of θ defined by Lx(θ) = p(X, θ), θ ∈ Θ. If T(X) is sufficient for θ, then, by the factorization theorem, the likelihood ratio Λx(θ) = Lx(θ)/Lx(θ₀) depends on X through T(X) only; and if there is a value θ₀ ∈ Θ such that {x : p(x, θ) > 0} ⊂ {x : p(x, θ₀) > 0} for all θ, then Λx is a minimally sufficient statistic.

1.6 EXPONENTIAL FAMILIES

The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman, Pitman, and Darmois through investigations of this property(1). Subsequently, many other common features of these families were discovered and they have become important in much of the modern theory of statistics.

Probability models with these common features include normal, binomial, Poisson, gamma, beta, and multinomial regression models used to relate a response variable Y to a set of predictor variables. More generally, these families form the basis for an important class of models called generalized linear models. We return to these models in Chapter 2. They will reappear in several connections in this book.

1.6.1 The One-Parameter Case

The family of distributions of a model {Pθ : θ ∈ Θ}, Θ ⊂ R, is said to be a one-parameter exponential family if there exist real-valued functions η(θ), B(θ) on Θ, and real-valued functions T and h on R^q, such that the density (frequency) functions p(x, θ) of the Pθ may be written

    p(x, θ) = h(x) exp{η(θ)T(x) − B(θ)},    (1.6.1)

where x ∈ X ⊂ R^q. Note that the functions η, B, and T are not unique. In a one-parameter exponential family the random variable T(X) is sufficient for θ. This is clear because we need only identify exp{η(θ)T(x) − B(θ)} with g(T(x), θ) and h(x) with itself in the factorization theorem. We shall refer to T as a natural sufficient statistic of the family. Here are some examples.

Example 1.6.1. The Poisson Distribution. Let Pθ be the Poisson distribution with unknown mean θ. Then, for x ∈ {0, 1, 2, ...},

    p(x, θ) = (θ^x/x!) e^{−θ} = (1/x!) exp{x log θ − θ},  θ > 0.    (1.6.2)
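As a quick numerical check of the representation (1.6.2) (an aside, assuming SciPy is available), the exponential-family form with h(x) = 1/x!, T(x) = x, η(θ) = log θ and B(θ) = θ reproduces the Poisson frequency function:

```python
import numpy as np
from math import factorial
from scipy.stats import poisson

theta = 2.5
for x in range(6):
    expfam = (1.0 / factorial(x)) * np.exp(x * np.log(theta) - theta)   # h(x) exp{eta T(x) - B}
    print(x, expfam, poisson.pmf(x, theta))                             # the two columns agree
```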
~O'(y  z)' logO}.. (1. .) exp[~(O)T(x.(y I z) = <p(zW1<p((y .6) .6. T(x) ~ x. . 1). Suppose X has a B(n. . the Pe form a oneparameter exponential family with ~~I . 1. + OW.1](O)=log(I~0).3. is the family of distributions of X = (Xl' .6.4) Therefore. h(x) = (2rr)1 exp { . B(O) = 0..B(O)=nlog(I0).2. p(x.3) I' .Xm are independent and identically distributed with common distribution Pe. B(O) = logO. for x E {O. .6. Statistical Models. y)T where Y = Z independent N(O. The Binomial Family. n) < 0 < 1.0)].6.y.) i=1 m B(8)] ( 1. we have p(x.~z. and Performance Criteria Chapter 1 Therefore. I I! . 1 (1. This is a oneparameter exponential family distribution with q = 2. q = 1. Z and W are f(x. 1 X m ) considered as a random vector in Rmq and p(x.Z)OI) Z)'O'J} (2rrO)1 exp { (2rr)1 exp {  ~ [z' + (y  ~z' } exp { .~O'. Specifically..O) IIh(x.} .h(x) = .. B) are the corresponding density (frequency) functions.1](0) = logO. fonn a oneparameter exponential family as in (1. . Goals.O) = f(z)f. Example 1. .6. .' x. suppose Xl. Then > 0. o The families of distributions obtained by sampling from oneparameter exponential families are themselves oneparameter exponential families. the family of distributions of X is a oneparameter exponential family with q=I.1](0) = .O) = ( : ) 0'(1_ Or" ( : ) ex p [xlog(1 ~ 0) + nlog(l.h(X)=( : ) .6. 0) distribution.T(x) = (y  z)'.. 0 Then. . Suppose X = (Z. .50 . Here is an example where q (1. If {pJ=». where the p.O) f(z.T(x)=x.6.1). 0 E e. 0 Example 1.5) o = 2.
then q^(m) = mq,

    T^(m)(x) = Σ_{i=1}^m T(x_i),   B^(m)(θ) = mB(θ),   h^(m)(x) = Π_{i=1}^m h(x_i),    (1.6.7)

where x = (x₁, ..., x_m). Note that the natural sufficient statistic T^(m) is one-dimensional whatever be m. For example, if X = (X₁, ..., X_m) is a vector of independent and identically distributed P(θ) random variables and P^(m) is the family of distributions of X, then the P^(m) form a one-parameter exponential family with natural sufficient statistic T^(m)(x) = Σ_{i=1}^m x_i. In our first example (Example 1.6.1) the sufficient statistic T^(m)(X₁, ..., X_m) = Σ_{i=1}^m X_i is distributed as P(mθ). This family of Poisson distributions is one-parameter exponential whatever be m.

Some other important examples are summarized in the following table.

TABLE 1.6.1
Family of distributions        η(θ)          T(x)
N(μ, σ²), σ² fixed            μ/σ²          x
N(μ, σ²), μ fixed             −1/(2σ²)      (x − μ)²
Γ(p, λ), p fixed              −λ            x
Γ(p, λ), λ fixed              p − 1         log x
β(r, s), r fixed              s − 1         log(1 − x)
β(r, s), s fixed              r − 1         log x

The statistic T^(m)(X₁, ..., X_m) corresponding to the one-parameter exponential family of distributions of a sample from any of the foregoing is just Σ_{i=1}^m T(X_i). We leave the proof of these assertions to the reader.

In the discrete case we can establish the following general result.

Theorem 1.6.1. Let {Pθ} be a one-parameter exponential family of discrete distributions with corresponding functions T, η, B, and h. Then the family of distributions of the statistic T(X) is a one-parameter exponential family of discrete distributions whose frequency functions may be written

    h*(t) exp{η(θ)t − B(θ)}    (1.6.8)

for suitable h*.
Proof. By definition,

    P_θ[T(X) = t] = Σ_{x: T(x)=t} p(x, θ) = Σ_{x: T(x)=t} h(x) exp[η(θ)T(x) − B(θ)]
                  = exp[η(θ)t − B(θ)] Σ_{x: T(x)=t} h(x).

If we let h*(t) = Σ_{x: T(x)=t} h(x), the result follows. □

A similar theorem holds in the continuous case if the distributions of T(X) are themselves continuous.

Canonical exponential families. We obtain an important and useful reparametrization of the exponential family (1.6.1) by letting the model be indexed by η rather than θ. The exponential family then has the form

    q(x, η) = h(x) exp[ηT(x) − A(η)],  x ∈ X ⊂ R^q,    (1.6.9)

where A(η) = log ∫ h(x) exp[ηT(x)] dx in the continuous case and the integral is replaced by a sum in the discrete case. If θ ∈ Θ and q is definable, then A(η(θ)) must be finite. Let E be the collection of all η such that A(η) is finite. Then, as we show in Section 1.6.4, E is either an interval or all of R, and the class of models (1.6.9) with η ∈ E contains the class of models with θ ∈ Θ. The model given by (1.6.9) with η ranging over E is called the canonical one-parameter exponential family generated by T and h. E is called the natural parameter space and T is called the natural sufficient statistic.

Example 1.6.1. (continued). The Poisson family in canonical form is

    q(x, η) = (1/x!) exp{ηx − exp[η]},  x ∈ {0, 1, 2, ...},

where η = log θ. Here, by definition,

    exp{A(η)} = Σ_{x=0}^∞ e^{ηx}/x! = Σ_{x=0}^∞ (e^η)^x/x! = exp(e^η),

so that A(η) = e^η and E = R. □

Here is a useful result.

Theorem 1.6.2. If X is distributed according to (1.6.9) and η is an interior point of E, the moment-generating function of T(X) exists and is given by

    M(s) = exp[A(s + η) − A(η)]

for s in some neighborhood of 0.
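Theorem 1.6.2 can be checked by simulation. For the canonical Poisson family, A(η) = e^η, so exp[A(s + η) − A(η)] = exp{θ(e^s − 1)}, and differentiating A recovers the familiar fact that the Poisson mean and variance both equal θ. A short sketch (an aside; the value of θ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 3.0
eta = np.log(theta)
A = np.exp                                   # A(eta) = e^eta for the canonical Poisson family

T = rng.poisson(theta, size=500_000)         # T(X) = X for the Poisson family
for s in [0.1, 0.3]:
    print(np.mean(np.exp(s * T)), np.exp(A(s + eta) - A(eta)))   # empirical vs exact MGF
print(T.mean(), T.var(), A(eta))             # mean and variance both approx A'(eta) = A''(eta) = theta
```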
0 Here is a typical application of this result.O) = h(x) exp[L 1]j(O)Tj (x) .6 Exponential Families 53 Moreover. nlog0 J. Pitman.8 > O. ' e k p(x. 0 = n log 82 and A(1]) 1 xl has mean nit} = = nlog(21]). A family of distrib~tioos {PO: 0 E 8}. Now p(x. c R k . We give the proof in the continuous case.6. It is used to model the density of "time until failure" for certain types of equipment. which is naturally indexed by a kdimensional parameter and admit a kdimensional sufficient statistic..2 The Multiparameter Case Our discussion of the "natural form" suggests that oneparameter exponential families are naturally indexed by a onedimensional real parameter fJ and admit a onedimensional sufficient statistic T(x).6. (1. Direct computation of these moments is more complicated.j8 2))exp(i=l n n I>U282) ~=1 = (il xi)exp[202 LX.4 Suppose X 1> ••• . x E X j=1 c Rq. . if there exist realvalued functions 171. h on Rq such that the density (frequency) functions of the Po may be written as. is one. and realvalued functions T 1. X n is a sample from a population with density p(x. 2 i=1 i=l n 1 n Here 1] = 1/20 2.8(0)]. 02 ~ 1/21]. 1. Therefore. exp[A(s + 1]) . Proof. B(O) the natural sufficient statistic E~ 2n(j2 and variance nlt}2 4n04 . _ .10) . Var(T(X)) ~ A"(1]).8) = (il(x. This is known as the Rayleigh distribution. Koopman.A(1])] because the last factor. is said to be a kparameter exponential family.Section 1.. and Dannois were led in their investigations to the following family of distributions. Example 1. being the integral of a density.A(1])]dx h(x)exp[(s + 1])T(x)  A(s + 1])Jdx = . E(T(X)) ~ A'(1]). The rest of the theorem follows from the momentenerating property of M(s) (see Section A. 1 Tk. More generally. ..12).6.8) ~ (x/8 2)exp(_x 2/28 2). x> 0.<·or . We compute M(s) = = E(exp(sT(X))) {exp[A(s ~ + 1])  A(1])]} J'" J J'" J h(x)exp[(s + 1])T(x) .17k and B of 6.
(x) = x'. and Performance Criteria Chapter 1 By Theorem 1. 8 2 = and I" 1 2' Tl(x) = x.LTk(Xi )) t=1 Example 1. h(x) = I.Xm ) where the Xi are independent and identically distributed and their common distribution ranges over a k~parameter exponential family given by (1.10). . . . .LX. If we observe a sample X = (X" . .O) =exp[".Ot = fl. J1 < 00. A(71) is defined in the same way except integrals over Rq are replaced by sums.') : 00 < f1 x2 1 J12 2 p(x. 2 (".6.5.'). we define the natural parameter space as t: = ('l E R k : 00 < A('l) < oo}.6..11) (]"2. The density of Po may be written as e ~ {(I".5. . the vector T(X) = (T. i=l i=1 which we obtained in the previous section (Example 1. which corresponds to a twOparameter exponential family with q = I. . Goals. i=l ..'l) = h(x)exp{TT(x)'l. It will be referred to as a natural sufficient statistic of the family.6. The Normal Family.') population. (]"2 > O}.2. .. .. letting the model be Thus. = N(I". . suppose X = (Xl. Then the distributions of X form a kparameter exponential family with natural sufficient statistic m m TCm)(x) = (LTl(Xi). ..fJk)T rather than family generated by T and h is e. 0 Again it will be convenient to consider the "biggest" families.(X) •. (1. then the preceding discussion leads us to the natural sufficient statistic m m (LXi.'" . Suppose lhat P. I ! In the discrete case. . q(x.Tk(x)f and. in the conlinuous case. 1)2(0) = 2 " B(O) " 1"' 1 " T. +log(27r" »)].A('l)}.. X m ) from a N(I". Again.I' 54 Statistical Models.x .. the canonical kparameter exponential indexed by 71 = (fJI.. .TdX))T is sufficient...2(".' . + log(27r"'».x E X c Rq where T(x) = (T1 (x).')...1. In either case...3.4).
.5.)T. . Now we can write the likelihood as k qo(x. a 2 ). 0. 2.. .6 Exponential Families 55 (x.~. and t: = Rk .~l= (3. h(x) = I andt: ~ R x R. = n'Ezr Example 1. in Examples 1.ti = f31 + (32Zi.. 0) for 1 = (1. Let T. I <j < k 1. TT(x) = T(Y) ~ (EY"EY.) + log(rr/. .. A) = I1:~1 AJ'(X).6. as X and the sample space of each Xi is the k categories {I.~2): ~l E R.T.~·(x) . with J. (N(/" (7') continued}./(7'. 0 + el) ~ qo(x.7. where A is the simplex {A E R k : 0 < A.5 that Y I .. . ~l ~ /.. We observe the outcomes of n independent trials where each trial can end up in one of k possible categories. . A E A.5. A(1/) ~ ~[(~U2~. In this example. We write the outcome vector as X = (Xl. = 1/2(7'.6... .~2)].= {(~1. E R k ..~. the density of Y = (Yb . Then p(x.~3): ~l E R._l)(x)1/nlog(l where + Le"')) j=1 ..)j./(7'.k. It will often be more convenient to work with unrestricted parameters. . 1/) =exp{T'f. . k}] with canonical parameter 0. n. j=1 k This is a kparameter canonical exponential family generated by TIl"" Tk and h(x) = II~ I l[Xi E {I.. .1.kj. E R. . x') = (T...~3 ~ 1/2(7'."k.n log j=l L exp(. < OJ.id.j = 1... From Exam~ pIe 1. Suppose a<. . Y n are independent.k. is not and all c.EziY. Yi rv N(Jli.6. remedied by considering IV ~j = log(A..6. L71 Aj = I}.'12 ~ (3.Xn)'r where the Xi are i..5. 0) ~ exp{L . Linear Regression. we can achieve this by the reparametrization k Aj = eO] /2:~ e Oj . A(1/) = 4n[~:+m'~5+z~1~'+2Iog(rr/~3)].4 and 1. .. ~3 and t: = {(~1. and AJ ~ P(X i ~ j).j j=l = 1.. This can be identifiable because qo(x.~3 < OJ. . However 0. where m.. i = 1.. Example 1./Ak) = ".(x). I yn)T can be put in canonical form with k = 3.'. Example 1.. In this example./(7'. . k ~ 2. Multinomial Trials. . . .(x»). ~. < l. and rewriting kl q(x.'I2.Section 1.(x) = L:~ I l[X i = j].5.
then all models for X are exponential families because they are submodels of the multinomial trials model. Let Y. X n ) T where the Xi are i.·..56 Note that q(x..6.6. then the resulting submodel of P above is a submodel of the exponential family generated by BTT(X) and h. 0) ~ q(x. if X is discrete Affine transformations from Rk to RI defined by UP is the canonical family generated by T kx 1 and hand M is the affine transformation I' . Here is an example of affine transformations of 6 and T.d. .Y n ) ~ Y. 0 1. h(y) = 07 I ( ~. 17).12) taking on k values as in Example 1.6.. and 1] is a map from e to a subset of R k • Thus. 11 E £ C R k } is an exponential family defined by p(x. Similarly. .3 Building Exponential Families Submodels A submodel of a kparameter canonical exponential family {q(x.2. < X n be specified levels and (1.13) .). are identifiable. k}] with canonical parameter TJ and £ = Rk~ 1.6. Ai). Goals.i. . " .6. and Performance Criteria Chapter 1 1 parameter canonical exponential family generated by T (k 1) {I. I < k. be independent binomial. If the Ai are unrestricted. 1 <j < k . Logistic Regression..1. See Problem 1.7 and X = (X t. More10g(P'I[X = jI!P'IIX ~ k]).8. < .. is unchanged.6. Here 'Ii = log 1\. " Example 1. 1/(0») (1. as X.. . 0 < Ai < 1. . from Example 1. M(T) sponding to and ~ MexkT + hex" it is easy to see that the family generated by M(T(X» and h is the subfamily of P corre 1/(0) = MTO. 1 < i < n.' A(1/) = L:7 1 ni 10g(l + e'·). this. 1 < i < n.17 e for details. 1]) is a k and h(x) = rr~ 1 l[xi E over. where (J E e c R1.6. is an npararneter canonical exponential family with Yi by T(Yt . let Xt =integers from 0 to ni generated Vi < n. if c Re and 1/( 0) = BkxeO C R k . B(n. the parameters 'Ii = Note that the model for X Statistical Models. ) 1(0 < However.
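For concreteness, here is a sketch of the log-likelihood of the logistic regression model of Example 1.6.8 written in exponential-family form, Σᵢ[yᵢηᵢ − nᵢ log(1 + e^{ηᵢ})] with ηᵢ = θ₁ + θ₂xᵢ. The design points, group sizes, and parameter values below are made up, and the result is compared with the binomial log-likelihood computed directly (after dropping the log h(y) term).

```python
import numpy as np
from math import comb
from scipy.stats import binom

def exp_family_loglik(theta1, theta2, x, n, y):
    eta = theta1 + theta2 * x                           # canonical parameters eta_i = log(lambda_i/(1-lambda_i))
    return np.sum(y * eta - n * np.log1p(np.exp(eta)))  # sum_i [y_i eta_i - n_i log(1 + e^{eta_i})]

x = np.array([0.0, 1.0, 2.0, 3.0])                      # hypothetical dose levels
n = np.array([10, 10, 10, 10])                          # hypothetical group sizes
theta1, theta2 = -1.0, 0.8
lam = 1.0 / (1.0 + np.exp(-(theta1 + theta2 * x)))      # success probabilities lambda_i
y = np.random.default_rng(5).binomial(n, lam)

# Same quantity from the binomial pmf, after removing log h(y) = sum log C(n_i, y_i).
direct = sum(binom.logpmf(yi, ni, li) - np.log(comb(int(ni), int(yi)))
             for yi, ni, li in zip(y, n, lam))
print(exp_family_loglik(theta1, theta2, x, n, y), direct)   # agree up to rounding
```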
'.Xi)). > O.. Then (and only then).5 a 2nparamctercanonical exponential family model with fJi = P.i. Example 1.. ..)... p(x.. ranges over (0. generated by . .00). I ii. the 8 parametrization has dimension 2. = '\501 and 'fJ2(0) = ~'\5B2..a 2 ). . Assume also: (a) No interaction between animals (independence) in relation to drug effects (b) The distribution of X in the animal population is logistic.xn)T.. .. ~ (1.. i=I This model is sometimes applied in experiments to determine the toxicity of a substance. with B = p" we can write where T 1 = ~~ I Xi. so it is not a curved family.. + 8.9. . The Yi represent the number of animals dying out of ni when exposed to level Xi of the substance. If each Iii ranges over R and each a. log(P[X < xl!(1 . E R. T 2 = L:~ 1 X. and l1n+i = 1/2a.x and (1..6. However. is a known constant '\0 > O. LocationScale Regression.13) holds. .i/a.8. then this is the twoparameter canonical exponential family generated by lYIY = (L~l.'V T(Y) = (Y" .Section 1. y. SetM = B T . This is a 0 In Example 1. which is called the coefficient of variation or signaltonoise ratio. n. Yn. Gaussian with Fixed SignaltoNoise Ratio.. Y. .6. x).'. where 1 is (1. N(Jl. .12) with the range of 1J( 8) restricted to a subset of dimension l with I < k . i = 1.. . + 8.10. Then. this is by Example 1. suppose the ratio IJlI/a..6. In the nonnal case with Xl. a.14) 8. are called curved exponential families provided they do not form a canonical exponential family in the 8 parametrization.1)T. Suppose that YI ~ . which is less than k = n when n > 3.x = (Xl. 8) in the 8 parametrization is a canonical exponential family.8.. Y n are independent.8.i.6.PIX < x])) 8. }Ii N(J1.6. L:~ 1 Xl yi)T and h with A(8.. + 8. 'fJ1(B) curved exponential family with l = 1. that is. .!t is assumed that each animal has a random toxicity threshold X such that death results if and only if a substance level on or above X is applied.)T .6 Exponential Families 57 This is a linear transformation TJ(8) = B nxz 8 corresponding to B nx2 = (1.6.Xn i.1. o Curved exponential families Exponential families (1.d.) = L ni log(1 + exp(8. Example 1.6. . PIX < xl ~ [1 + exp{ (8.x)W'.
6.10 and 1. and = Ee sTT .8 note that (1.4 Properties of Exponential Families Theorem 1. but a curved exponential family model with 0 1=3. say. 8. 1989.5 that for any random vector Tkxl.. 1. the map 1/(0) is Because L:~ 11]i(6)Yi + L:~ i17n+i(6)Y? cannot be written in the fonn ~.12. Sections 2. 1 Bj(O).6. Let Yj .6.(0). M(s) as the momentgenerating function.12) is not an exponential family model.xjYj ) and we can apply the supennode! approach to reach the same conclusion as before. then p(y. . E R.6. we define I I .1/(O)) as defined in (6. . ~. Thus. n. and B(O) = ~.8. Section 15.i.1.6.(O)Tj'(Y) for some 11. yn)T is modeled by the exponential family generated by T(Y) ~. Models in which the variance Var(}'i) depends on i are called heteroscedastic whereas models in which Var(Yi ) does not depend on i are called homoscedastic. for unknown parameters 8 1 E R. Bickel.). For 8 = (8 1 . a. 83 > 0 (e. respectively.) and 1 h'(YJ)' with pararneter1/(O). Goals.. sampling. and Snedecor and Cochran. be independent.1 generalizes directly to kparameter families as does its continuous analogue. 1978.) = (Yj.10).3. as being distributed according to a twoparameter family generated by Tj(Y. Examples 1.13) exhibits Y..E Yj c Rq.6.6. Even more is true. Recall from Section B. We extend the statement of Theorem 1. and Performance Criteria Chapter 1 and h(Y) = 1.g. We return to curved exponential family models in Section 2.58 Statistical Models. 1 Tj(Y. Carroll and Ruppert. Supermode1s We have already noted that the exponential family structure is preserved under i.d._I11.) depend on the value Zi of some covariate. In Example 1. 1 < j < n. with an exponential family density Then Y = (Y1.8.6 are heteroscedastic and homoscedastic models.5. 0) ~ q(y. 1988. Next suppose that (JLi. Ii. TJ'(Y).2..
a)'72) < aA('7l) + (1 .3(c).4). ~ = 1 .6. We prove (b) first.a)A('72) Because (1.15) is finite.1. T.6. Since 110 is an interior point this set of5 includes a baLL about O. Corollary 1.15) that a'7. ('70)) . Fiually (c) 0 The formulae of Corollary 1.9. If '7" '72 E + (1 .6.1 and Theorem 1. 8". Proof of Theorem 1. J u(x)v(x)h(x)dx < (J ur(x)h(x)dx)~(Jv'(x)h(x)dx)~.6. Under the conditions ofTheorem 1.A('7o)} vaLidfor aLL s such that 110 + 5 E E. + (1 . Theorem 1.r(T) = IICov(7~.A('70) = II 8".a)'72 E is proved in exactly the same way as Theorem 1.15) t: the righthand side of (1.6 Expbnential Families 59 V.5.a. .3.6. ('70) II· The corollary follows immediately from Theorem B.T(x)). A(U'71 Which is (b). h) with corresponding natural parameter space E and function A(11).1 give a classical result in Example 1. (with 00 pennitted on either side).6. h(x) > 0. Then (a) E is convex (b) A: E ) R is convex (c) If E has nonempty interior in R k generating function M given by and'TJo E E. Suppose '70' '71 E t: and 0 < a < 1.6.3. u(x) ~ exp(a'7. t: and (a) follows.8". A where A( '70) = (8". vex) = exp«1 . Substitute ~ ~ a.6.6. By the Holder inequality (B. then T(X) has under 110 a moment M(s) ~ exp{A('7o + s) .a)'7rT(x)) and take logs of both sides to obtain. S > 0 with ~ + ~ = 1. ('70).6.Section 1. .6. .3 V.r'7o T(X) .. v(x). 8A 8A T" = A('7o) f/J. for any u(x). Let P be a canonicaL kparameter exponential famiLy generated by (T. 1O)llkxk.6. J exp('7 TT (x))h(x)dx > 0 for all '7 we conclude from (1.2..
(X)".~.8. in Example 1. 9" 9.7.4. for all 9 because 0 < p«X. However. However. But the rank of the family is 1 and 8 1 and 82 are not identifiable. (i) P is a/rank k.6. and Performance Criteria Chapter 1 Example 1.7). Then the following are equivalent. Formally. Here. using the a: parametrization.lx ~ j] = AJ ~ e"'ILe a .x. P7)[L. + 9.(Tj(X» = P>. Xl Y1 ).4. k A(a) ~ nlog(Le"') j=l and k E>. Suppose P = (q(x.6. An exponential family is of rank k iff the generating statistic T is kdimensional and 1. Theorem 1.60 Statistical Models.1.. (ii) 7) is a parameter (identifiable).1 is in fact its rank and this is seen in Theorem 1. Our discussion suggests a link between rank and identifiability of the 'TJ parameterization. II ! I Ii . (iii) Var7)(T) is positive definite. ifn ~ 1. 7) E f} is a canonical exponential/amily generated by (TkXI ' h) with natural parameter space e such that E is open. ."I 11 J Evidently every kparameter exponential family is also k'dimensional with k' > k. .~~) < 00 for all x. o The rank of an exponential family I . We establish the connection and other fundamental relationships in Theorem 1. .7 we can see that the multinomial family is of rank at most k .6. ajTj(X) = ak+d < 1 unless all aj are O. It is intuitively clear that k . we are writing the oneparameter binomial family corresponding to Yl as a twoparameter family with generating statistic (Y1 . Note that PO(A) = 0 or Po (A) < 1 for some 0 iff the corresponding statement holds i.6. there is a minimal dimension. Tk(X) are linearly independent with positive probability.4 that follows. f=l . and ry. T.6. . (continued). such that h(x) > O. 2' Going back to Example 1.6. p:x.(9) = 9. Goads. Sintilarly. if we consider Y with n'> 2 and Xl < X n the family as we have seen remains of rank < 2 and is in fact of rank 2.
A(TJ. thus..4 hold and P is of rank k. III..6 Exponential Families 61 (iv) 1) ~ A(1)) is 11 onto (v) A is strictly convex on E. because E is open. all 1) = (~i) II. all 1) ~ (iii) = a T Var1)(T)a = Var1)(aT T) = 0 for some a of 0. 0 . We.. = Proof ofthe general case sketched I.6.) is false. :. Taking logs we obtain (TJI . This is equivalent to Var"(T) ~ 0 '*~ (iii) {=} f"V(ii) There exist T}I =I= 1J2 such that F Tll = Pm' Equivalently exp{r/.. (b) logq(x. . ~ (i) =~ (iii) ~ (i) = P1)[aT T = cJ ~ 1 for some a of 0. A"(1]O) = 0 for some 1]0 implies c. hence. . '" '. ranges over A(I').(v). Suppose tlult the conditions of Theorem 1. of 1)0· Let Q = (P1)o+o(1).T(x) . The proof for k details left to a problem. A' is constant. A is defined on all ofE. ..A(TJ2))h(x). 0 Corollary 1.) with probability 1 =~(i).6. for all T}. This is just a restatement of (iv) and (v) of the theorem.' . ".1)o)TT. Note that. Now (iii) =} A"(TJ) > 0 by Theorem 1. have (i) . by our remarks in the discussion of rank." Then ~(i) > 1 is then sketched with '* P"[a.'t·· . Proof.~> . 1)0) E n· Q is the exponential family (oneparameter) generated by (1)1 .3. ~ (ii) ~ _~ (i) (ii) = P1). hence.(ii) = (iii). We give a detailed proof for k = 1.TJ2)T(X) = A(TJ2) . 1)) is a strictly concavefunction of1) On 1'.. Conversely. Thus. F!~ . Apply the case k = 1 to Q to get ~ (ii) =~ (i). with probability 1. (iii) = (iv) and the same discussion shows that (iii) . = Proof. by Theorem 1.T ~ a2] = 1 for al of O.1)o) : 1)0 + c(1).A(TJI)}h(x) ~ exp{TJ2T(x) . Then (a) P may be uniquely parametrized by "'(1)) E1)T(X) where". (iv) '" (v) = (tii) Properties (iv) and (v) are equivalent to the statements holding for every Q defined as previously for arbitrary 1)0' 1).2 and.2.6.6.. which that T implies that A"(TJ) = 0 for all TJ and. A'(TJ) is strictly monotone increasing and 11. Let ~ () denote "(. ~ P1)o some 1)..Section 1.
The relation in (a) is sometimes evident and the μ parametrization is close to the initial parametrization of classical statistics. Thus, the B(n, θ) family is parametrized by E(X), where X is the Bernoulli trial, and the N(μ, σ²) family by (E(X), E(X²)) = (μ, σ² + μ²), which is obviously a 1-1 function of (μ, σ²). However, the relation in (a) may be far from obvious (see Problem 1.6.21).

We close the present discussion of exponential families with the following example.

Example 1.6.11. The p-Variate Gaussian Family. An important exponential family is based on the multivariate Gaussian distributions of Section B.6. Recall that Y_{p×1} has a p-variate Gaussian distribution, N_p(μ, Σ), with mean μ_{p×1} and positive definite variance-covariance matrix Σ_{p×p}, iff its density is

f(Y, μ, Σ) = |det(Σ)|^{-1/2} (2π)^{-p/2} exp{ -(1/2)(Y − μ)^T Σ^{-1} (Y − μ) }.   (1.6.16)

Rewriting the exponent we obtain

log f(Y, μ, Σ) = -(1/2) Y^T Σ^{-1} Y + (Σ^{-1} μ)^T Y − (1/2)( log|det(Σ)| + μ^T Σ^{-1} μ + p log 2π ).   (1.6.17)

The first two terms on the right in (1.6.17) can be rewritten

−( Σ_{1≤i<j≤p} a_{ij} Y_i Y_j + (1/2) Σ_{i=1}^p a_{ii} Y_i² ) + Σ_{i=1}^p ( Σ_{j=1}^p a_{ij} μ_j ) Y_i

where Σ^{-1} = ‖a_{ij}‖, revealing that this is a k = p(p + 3)/2 parameter exponential family with statistics (Y_1, ..., Y_p, (Y_i Y_j)_{1≤i≤j≤p}), h(Y) ≡ 1, where we identify the second element of T, which is a p × p symmetric matrix, with its distinct p(p + 1)/2 entries. It may be shown (Problem 1.6.29) that the η's generate this family and that the rank of the family is indeed p(p + 3)/2.

Thus, if Y_1, ..., Y_n are iid N_p(μ, Σ), then X ≡ (Y_1, ..., Y_n)^T follows the k = p(p + 3)/2 parameter exponential family with T = (Σ_i Y_i, Σ_i Y_i Y_i^T) and θ = (μ, Σ), with T and h generalizing the one-dimensional normal case. By our supermodel discussion, the natural parameter space is open, so that Theorem 1.6.4 applies. The corollary will prove very important in estimation theory. See Section 2.3. □

1.6.5 Conjugate Families of Prior Distributions

In Section 1.2 we considered beta prior distributions for the probability of success in n Bernoulli trials. This is a special case of conjugate families of priors, families to which the posterior after sampling also belongs. Suppose X_1, ..., X_n is a sample from the k-parameter exponential family (1.6.10), θ ∈ Θ, and, as we always do in the Bayesian context, write p(x | θ) for p(x, θ). Then

p(x | θ) = [ Π_{i=1}^n h(x_i) ] exp{ Σ_{j=1}^k η_j(θ) Σ_{i=1}^n T_j(x_i) − nB(θ) }   (1.6.18)

where θ ∈ Θ. A conjugate exponential family is obtained from (1.6.18) by letting t_1, ..., t_{k+1} be "parameters" and treating θ as the variable of interest. Let t = (t_1, ..., t_{k+1})^T and

ω(t) = ∫ ··· ∫ exp{ Σ_{j=1}^k η_j(θ) t_j − t_{k+1} B(θ) } dθ_1 ··· dθ_k,   (1.6.19)

Ω = { (t_1, ..., t_{k+1}) ∈ R^{k+1} : 0 < ω(t) < ∞ },

with integrals replaced by sums in the discrete case. We assume that Ω is nonempty (see Problem 1.6.36).

Proposition 1.6.1. The (k + 1)-parameter exponential family given by

π_t(θ) = exp{ Σ_{j=1}^k η_j(θ) t_j − t_{k+1} B(θ) − log ω(t) }   (1.6.20)

where t = (t_1, ..., t_{k+1}) ∈ Ω, is a conjugate prior to p(x | θ) given by (1.6.18).

Proof. π(θ | x) ∝ p(x | θ) π_t(θ). If p(x | θ) is given by (1.6.18) and π by (1.6.20), then

π(θ | x) ∝ exp{ Σ_{j=1}^k η_j(θ)( Σ_{i=1}^n T_j(x_i) + t_j ) − (t_{k+1} + n) B(θ) } ∝ π_s(θ)   (1.6.21)

where s = (s_1, ..., s_{k+1})^T = ( t_1 + Σ_i T_1(x_i), ..., t_k + Σ_i T_k(x_i), t_{k+1} + n )^T and ∝ indicates that the two sides are proportional functions of θ. Because two probability densities that are proportional must be equal, π(θ | x) is the member of the exponential family (1.6.20) given by the last expression in (1.6.21) and our assertion follows. □

Remark 1.6.1. Note that (1.6.21) is an updating formula in the sense that as data x_1, ..., x_n become available, the parameter t of the prior distribution is updated to s = t + a, where a = ( Σ_i T_1(x_i), ..., Σ_i T_k(x_i), n )^T. □

It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way.

Example 1.6.12. Suppose X_1, ..., X_n is a N(θ, σ_0²) sample, where σ_0² is known and θ is unknown. To choose a prior distribution for θ, we consider the conjugate family of the model defined by (1.6.18). For n = 1,

p(x | θ) ∝ exp{ −θ²/2σ_0² + θx/σ_0² }.   (1.6.22)

This is a one-parameter exponential family with η(θ) = θ/σ_0², B(θ) = θ²/2σ_0², T(x) = x. The conjugate two-parameter exponential family given by (1.6.20) has density

π_t(θ) ∝ exp{ ( t_1 θ − (1/2) t_2 θ² ) / σ_0² },   (1.6.23)

which is defined only for t_2 > 0 and all t_1 and is the N(t_1/t_2, σ_0²/t_2) density. Our conjugate family, therefore, consists of all N(η_0, τ_0²) distributions where η_0 varies freely and τ_0² is positive. If we start with a N(η_0, τ_0²) prior density, we must have in the (t_1, t_2) parametrization

t_1 = η_0 σ_0² / τ_0²,  t_2 = σ_0² / τ_0².   (1.6.24)

By (1.6.21), if we observe Σ X_i = s, these are updated to

t_1(s) = η_0 σ_0² / τ_0² + s,  t_2(n) = σ_0² / τ_0² + n.   (1.6.25)

Upon completing the square, we find that π(θ | x) is a normal density with mean μ(s, n) and variance τ²(n), where

μ(s, n) = ( σ_0²/τ_0² + n )^{-1} [ s + η_0 σ_0²/τ_0² ],  τ²(n) = σ_0² ( σ_0²/τ_0² + n )^{-1}.   (1.6.26)

Note that we can rewrite (1.6.26) intuitively as

μ(s, n) = w_1 x̄ + w_2 η_0   (1.6.27)

where w_1 = n τ²(n)/σ_0² and w_2 = 1 − w_1 = τ²(n)/τ_0², so that the posterior mean is a weighted average of the sample mean and the prior mean. Thus, if we observe Σ X_i = s, the posterior has the N(μ(s, n), τ²(n)) density.   (1.6.28)

These formulae can be generalized to the case X_i i.i.d. N_p(θ, Σ_0), 1 ≤ i ≤ n, with Σ_0 known and θ ~ N_p(η_0, τ_0² I), where η_0 varies over R^p, τ_0² is scalar with τ_0² > 0, and I is the p × p identity matrix (Problem 1.6.37). Moreover, it can be shown (Problem 1.6.30) that the N_p(λ, Γ) family with λ ∈ R^p and Γ symmetric positive definite is a conjugate family to N_p(θ, Σ_0), but a richer one than we've defined in (1.6.20) except for p = 1, because N_p(λ, Γ) is a p(p + 3)/2 rather than a p + 1 parameter family. □
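The updating formula (1.6.21) and the closed forms (1.6.25) through (1.6.27) are easy to compute with. The following minimal Python sketch (not part of the text) carries out the posterior calculation of Example 1.6.12; the function name and the illustrative values of η_0, τ_0², σ_0², and the simulated data are our own choices.

```python
import numpy as np

def normal_posterior(x, eta0, tau0_sq, sigma0_sq):
    """Posterior N(mu, tau_sq) for theta from an i.i.d. N(theta, sigma0_sq)
    sample x and a N(eta0, tau0_sq) prior; implements (1.6.25)-(1.6.27)."""
    n, s = len(x), float(np.sum(x))
    t2 = sigma0_sq / tau0_sq + n                 # t_2(n) of (1.6.25)
    mu = (s + eta0 * sigma0_sq / tau0_sq) / t2   # posterior mean mu(s, n)
    tau_sq = sigma0_sq / t2                      # posterior variance tau^2(n)
    # equivalent weighted-average form (1.6.27): w1*xbar + (1 - w1)*eta0
    w1 = n * tau_sq / sigma0_sq
    assert np.isclose(mu, w1 * np.mean(x) + (1.0 - w1) * eta0)
    return mu, tau_sq

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)      # sigma_0 = 1 known, theta = 2
print(normal_posterior(x, eta0=0.0, tau0_sq=4.0, sigma0_sq=1.0))
```

With 50 observations the weight w_1 is close to 1, so the posterior mean essentially equals the sample mean, in line with the weighted-average interpretation of (1.6.27).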
Discussion. Note that the uniform U(0, θ) model of Example 1.5.3 is not covered by this theory. The natural sufficient statistic max(X_1, ..., X_n), which is one-dimensional whatever be the sample size, is not of the form Σ_{i=1}^n T(X_i). In fact, the family of distributions in this example and the family U(θ_1, θ_2) are not exponential. Despite the existence of classes of examples such as these, a theory has been built up, starting with Koopman, Pitman, and Darmois, that indicates that under suitable regularity conditions families of distributions which admit k-dimensional sufficient statistics for all sample sizes must be k-parameter exponential families. Problem 1.6.10 is a special result of this type.

In the one-dimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. It is easy to see that one can construct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1.6.20). See Problems 1.6.31 and 1.6.32. Some interesting results and a survey of the literature may be found in Brown (1986).

Summary. {P_θ : θ ∈ Θ}, Θ ⊂ R^k, is a k-parameter exponential family of distributions if there are real-valued functions η_1, ..., η_k and B on Θ, and real-valued functions T_1, ..., T_k, h on R^q such that the density (frequency) function of P_θ can be written as

p(x, θ) = h(x) exp[ Σ_{j=1}^k η_j(θ) T_j(x) − B(θ) ],  x ∈ X ⊂ R^q.   (1.6.29)

T(X) = (T_1(X), ..., T_k(X)) is called the natural sufficient statistic of the family. The canonical k-parameter exponential family generated by T and h is

q(x, η) = h(x) exp{ T^T(x) η − A(η) }

where A(η) = log ∫ ··· ∫ h(x) exp{ T^T(x) η } dx in the continuous case, with integrals replaced by sums in the discrete case. The set E = { η ∈ R^k : A(η) < ∞ } is called the natural parameter space. The set E is convex, and the map A : E → R is convex. If E has a nonempty interior in R^k and η_0 ∈ E, then T(X) has for X ~ P_{η_0} the moment-generating function

M(s) = exp{ A(η_0 + s) − A(η_0) }

for all s such that η_0 + s is in E. Moreover, E_{η_0}[T(X)] = Ȧ(η_0) and Var_{η_0}[T(X)] = Ä(η_0), where Ȧ and Ä denote the gradient and Hessian of A.
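As a numerical aside (not part of the text), the last two identities are easy to check for a concrete family. For the Poisson(λ) distribution written in canonical form, T(x) = x and A(η) = e^η, so the gradient and Hessian of A both equal e^η = λ, and hence the mean and variance of T(X) should both equal λ. A minimal sketch, with an arbitrary choice of η, sample size, and seed:

```python
import numpy as np

# Poisson(lam) in canonical form: eta = log(lam), T(x) = x, A(eta) = exp(eta),
# so A'(eta) = A''(eta) = exp(eta) = lam = E[T(X)] = Var[T(X)].
eta = 0.7
lam = float(np.exp(eta))

rng = np.random.default_rng(1)
x = rng.poisson(lam, size=200_000)
print(lam, x.mean(), x.var())   # the three numbers should nearly agree
```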
An exponential family is said to be of rank k if T is k-dimensional and 1, T_1, ..., T_k are linearly independent with positive P_θ probability for some θ ∈ Θ. If P is a canonical exponential family with E open, then the following are equivalent: (i) P is of rank k; (ii) η is identifiable; (iii) Var_η(T) is positive definite; (iv) the map η → Ȧ(η) is 1-1 on E; (v) A is strictly convex on E.

A family F of prior distributions for a parameter vector θ is called a conjugate family of priors to p(x | θ) if the posterior distribution of θ given x is a member of F. The (k + 1)-parameter exponential family

π_t(θ) = exp{ Σ_{j=1}^k η_j(θ) t_j − B(θ) t_{k+1} − log ω }

where

ω = ∫ ··· ∫ exp{ Σ_{j=1}^k η_j(θ) t_j − B(θ) t_{k+1} } dθ

and t = (t_1, ..., t_{k+1}) ∈ Ω = { (t_1, ..., t_{k+1}) ∈ R^{k+1} : 0 < ω < ∞ }, is conjugate to the exponential family p(x | θ) defined in (1.6.29).

1.7 PROBLEMS AND COMPLEMENTS

Problems for Section 1.1

1. Give a formal statement of the following models identifying the probability laws of the data and the parameter space. State whether the model in question is parametric or nonparametric.

(a) A geologist measures the diameters of a large number n of pebbles in an old stream bed. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean μ and variance σ². He wishes to use his observations to obtain some information about μ and σ² but has in advance no knowledge of the magnitudes of the two parameters.

(b) A measuring instrument is being used to obtain n independent determinations of a physical constant μ. Suppose that the measuring instrument is known to be biased to the positive side by 0.1 units. Assume that the errors are otherwise identically distributed normal random variables with known variance.
.1.1 describe formaHy the foHowing model... respectively. are sampled at random from a very large population. 1 Ab) restricted to the sets where l:f I (Xi = 0 and l:~~1 A. i = 1.7 Problems and Complements 67 (c) In part (b) suppose that the amount of bias is positive but unknown. (b) The parametrization of Problem 1..p..Section 1. each egg has an unknown chance p of hatching and the hatChing of one egg is independent of the hatching of the others. (d) Xi. ..) < Fy(t) for every t. Show that Fu+v(t) < Fu(t) for every t. .. j = 1.. .. Q p ) and (A I .2) where fLij = v + ai + Aj. 0.2). .··. = o. . . B = (1'1.. .1.2 ).. Each .. Once laid. (e) The parametrization of Problem l. Q p ) {(al... . bare independeht with Xi. Can you perceive any difficulties in making statements about f1. restricted to p = (all' . .) (a) Xl. 4. . . . .. i=l (c) X and Y are independent N (I' I .Xp ).. then X is (b) As in Problem 1.Xp are independent with X~ ruN(ai + v. Which of the following parametrizations are identifiable? (Prove or disprove.. (If F x and F y are distribution functions such that Fx(t) said to be stochastically larger than Y.for this model? (d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean A. ~ N(l'ij.2 ) and Po is the distribution of Xll. At.l(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case. Ab.1(d)..t. Are the following parametrizations identifiable? (Prove or disprove."" a p .. 3. 0.) (a) The parametrization of Problem 1. (a) Let U be any random variable and V be any other nonnegative random variable. .. and Po is the distribution of X (b) Same as (a) with Q = (X ll .. 2. An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest.1(c).1.. . 1'2) and we observe Y X. = (al.2) and N (1'2. v. .Xpb . ..ap ) : Lai = O}. e (e) Same as (d) with (Qt. Two groups of nl and n2 individuals.
1. What assumptions underlie the simplification? 6. 5..0.0)..9 and they oecur with frequencies p(0.. .. 0 < Vi < 1. I.. whenXis unifonnon {O. k. P8[N I . 8.. . (b)P..c has the same distribution as Y + c. 7. . is the distribution of X e = (0.0.1.1).Lk VI n".  . Vi = (1 11")(1 ~ v)iI V for i = 1.. ..p(0... nk·mI .··· ..L. .. e = = 1 if X < 1 and Y {J.. Suppose the effect of a treatment is to increase the control response by a fixed amount (J.t). M k = mk] I! n.2). Let Y and Po is the distribution of Y.p(0. =X if X > 1.Li < + . but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown.. + mk + r = n 1.. • . + nk + ml + . Hint: IfY .' I . Show that Y .2...c bas the same distribution as Y + c.. .. i = 1.k is proposed where 0 < 71" < I..) (a) p. (e) Suppose X ~ N(I'. (a) What are the assumptions underlying this model'? (b) (J is very difficult to estimate here if k is large.I nl . I nl.3(2) and 1.. .LI ni + .. I . .· .0'2) (d) Suppose the possible control responses in an experiment are 0. Both Y and p are said to be symmetric about c.. 0 < V < 1 are unknown. (0).. + Vk + P = 1. • "j :1 member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour.. 68 Statistical Models.. Mk.:".. the density or frequency function p of Y satisfies p( c + t) ~ p( c . J. Vk) is unknown. . + fl. .. In each of k subsequent years the number of students graduating and of students dropping out is recorded. Let N i be the number dropping out and M i the number graduating during year i... The simplification J.1. 0 < J. Goals. Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour. I ! fl.P(Y < c .. .. .t) fj>r all t.4(1). is the distribution of X when X is unifonn on (0. 1 < i < k and (J = (J. V k mk P r where J. 0 = (1'. . It is known that the drug either has no effect or lowers blood pressure. = n11MI = mI..Li = 71"(1 M)iIJ. o < J1 < 1.9). ..O}. The number n of graduate students entering a certain department is recorded.Nk = nk.2. 2. ml . 0'2). }. Which of the following models are regular? (Prove or disprove. if and only if.LI. Let Po be the distribution of a treatment response. Consider the two sample models of Examples 1. mk·r. and Performance Criteria Chapter 1 ":..t) = 1 . then P(Y < t + c) = P( Y < t . . VI.c) = PrY > c . The following model is proposed.k + VI + .
C(t} = F(tjo). Sx(t) = . log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl. . .. . Let c > 0 be a constant. o(x) ~ 21' + tl . = (Zlj. .zp. are not collinear (linearly independent).. Suppose X I. .. N}. = 9. j = 0.3 let Xl. 0 'Lf 'Lf op(j) = I.Xn are observed i.Section 1.'" .. For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1.. and (I'. i . 0 < j < N. eY and (c) Suppose a scale model holds for X. 12.. . . . r(j) = = PlY = j. I(T < C)) = j] PIC = j] PIT where T.Znj?' (a) Show that ({31.tl and o(x) ~ 21' + tl. N} are identifiable. 10. The Scale Model. . the two cases o(x) .."" Yn ). . C).tl) implies that o(x) tl... C).. Does X' Y' = yc satisfy a scale model? Does log X'.. . suppose X has a distribution F that is not necessarily normal. e(j) > 0.i.. e) : p(j) > 0. (]"2) independent. Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci. then C(·) ~ F(. Hint: Consider "hazard rates" for Y min(T. according to the distribution of X. eX o. I} and N is known.7 Problems and Complements 69 (a) Show that if Y ~ X + o(X). e(j). . o (a) Show that in this case.N = 0."'). 11. {r(j) : j ~ 0.. Let X < 1'. 1 < i < n. fJp) are not identifiable if n pardffieters is larger than the number of observations. C are independent.Xn ). . Y. .. .2x yield the same distribution for the data (Xl. log y' satisfy a shift model? = Xc.. (b) Deduce that (th. . or equivalently. then C(·) = F(. That is.. ci '" N(O. Y = T I Y > j]. N.tl) does not imply the constant treatment effect assumption.. . Sbow that {p(j) : j = 0. C(·) = F(· ..d.{3p) are identifiable iff Zl. p(j).1. . then satisfy a scale model with parameter e.. > 0. (Yl. that is. Therefore. . (b) In part (a). if the number of = (min(T.. . e) vary freely over F ~ {(I'.tl). X m and Yi. Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0. Show that if we assume that o(x) + x is strictly increasing.2x and X ~ N(J1. t > O. Yn denote the survival times of two groups of patients receiving treatments A and B.6."') and o(x) is continuous. . The Lelunann 1\voSample Model. In Example 1.
12.(t). I (3p)T of unknowns.ll) with scale parameter 5. Hint: By Problem 8.(t) if and only if Sy(t) = Sg. as T with survival function So. I (b) Assume (1.7.. Tk > t all occur. I' 70 StCitistical Models. . Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy. 1 P(X > t) =1 (. Goals. Specify the distribution of €i..(t) with C.i... Also note that P(X > t) ~ Sort). we have the Lehmann model Sy(t) = si.12. .(t). t > 0. So (T) has a U(O. 13. = exp{g(. Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t). z)) (1.1) Equivalently. and Performance CriteriCl Chapter 1 t Ii .2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 .l3.2) is equivalent to Sy (t I z) = sf" (t). Zj) + cj for some appropriate €j. t > 0...7. Moreover..7. Show that if So is continuous.I ~I .{a(t).ho(t) is called the Cox proportional hazard model.. The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(.l3. . are called the survival functions. then Yi* = g({3. hy(t) = c. Survival beyond time t is modeled to occur if the events T} > t. thus.h. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t).G(t). .1. Hint: See Problem 1. Then h.) Show that Sy(t) = S'. The most common choice of 9 is the linear form g({3. (c) Suppose that T and Y have densities fort) andg(t). (b) By extending (bjn) from the rationals to 5 E (0. show that there is an increasing function Q'(t) such that ifYt = Q'(Y. then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll. . respectively. where TI". z) zT. A proportional hazard model. " T k are unobservable and i.l3. ~ a5.d.) Show that (1. (1. F(t) and Sy(t) ~ P(Y > t) = 1 .  . log So(T) has an exponential distribution.. (c) Under the assumptions of (b) above.00).2.2) and that Fo(t) = P(T < t) is known and strictly increasing. t > 0. I) distribution. = = (.(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y. h(t I Zi) = f(t I Zi)jSy(t I Zi). Set C. 7.). k = a and b.z)}. ... Show that hy(t) = c. For treatments A and E. Sy(t) = Sg.
.8 0. Yl . '1/1' > 0.d. Consider a parameter space consisting of two points fh and ()2.2 0. .7 Problems and Complements 71 Hint: See Problems 1." . Show that both . ) = ~.5) or mean I' = xdF(x) = Fl(u)du.2 with assumptions (t){4). have a N(O. Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x. Observe that it depends only on l:~ I Xi· I 0).O) 1 . Examples are the distribution of income and the distribution of wealth.. Let Xl..Section 1. J1 and v are regarded as centers of the distribution F. Generally. Problems for Section 1. Find the . the median is preferred in this case.. Yn be i. F.1.11 and 1.12..2 1.i. In Example 1. Merging Opinions. and for this reason.p(t)). 15.1. . what parameters are? Hint: Ca) PIXI < tl = if>(. 14. . x> c x<c 0. (a) Suppose Zl' Z. J1 may be very much pulled in the direction of the longer tail of the density.. 'I/1(±oo) = ±oo. Find the median v and the mean J1 for the values of () where the mean exists. Are. have a N(O. and suppose that for given (). .1.. wbere the model {(F. G. (b) Suppose Zl and Z. When F is not symmetric.d. Show how to choose () to make J. . . 1f(O2 ) = ~.G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R. (b) Suppose X" .i.2 unknown.p and C.p and ~ are identifi· able..(x/c)e. Ca) Find the posterior frequency function 1f(0 I x).Xm be i.v arbitrarily large. and Zl and Z{ are independent random variables. the parameter of interest can be characterl ized as the median v = F.2) distribution with .6 Let 1f be the prior frequency function of 0 defined by 1f(I1.000 is the minimum monthly salary for state workers. still identifiable? If not.l (0.X n ).4 1 0. 1) distribution. where () > 0 and c = 2. an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0.L . Here is an example in which the mean is extreme and the median is not.. X n are independent with frequency fnnction p(x posterior1r(() I Xl.
. X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is . I XI . 3. . " J = [2:.2. ':. }. does it matter which prior. 'IT .. Goals. .Xn = X n when 1r(B) = 1.0 < B < 1.O)kO. 0 < x < O..[X ~ k] ~ (1. . X n be distributed as where XI. (b) Find the posterior density of 8 when 71"( OJ = 302 ..... 2. 7l" or 11"1. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O. where a> 1 and c(a) .. Assume X I'V p(x) = l:..Xn are natural numbers between 1 and Band e= {I. . (3) Find the posterior density of 8 when 71"( 0) ~ 1. For this convergence. 0 (c) Find E(81 x) for the two priors in (a) and (b). Show that the probability of this set tends to zero as n t 00. . the outcome X has density p(x OJ ~ (2x/O'). 7r (d) Give the values of P(B n = 2 and 100. Consider an experiment in which.m+I. ) ~ . what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta.25. . IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree..5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. Find the posteriordensityof8 given Xl = Xl.'" .. 71"1 (0 2 ) = .8). . c(a). 3.75. l3(r. (J. ~ = (h I L~l Xi ~ = . for given (J = B. . ... Suppose that for given 8 = 0. X n are independent with the same distribution as X.0 < 0 < 1. 4'2'4 (b) Relative to (a). 4. "X n ) = c(n 'n+a m) ' J=m. This is called the geometric distribution (9(0)). k = 0. 1r(J)= 'u . Unllormon {'13} . 1. (d) Suppose Xl. . Then P. " ••. < 0 < 1. (3) Suppose 8 has prior frequency. Let 71" denote a prior density for 8.'" 1rajI Show that J + a. Let X I. and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 .72 Statistical Models. is used in the fonnula for p(x)? 2. Compare these B's for n = 2 and 100. .2. .)=1.. .=l1r(B i )p(x I 8i ).
are independent standard exponential. Next use the central limit theorem and Slutsky's theorem.Xn is a sample with Xi '"'' p(x I (}).2. where E~ 1 Xi = k.• X n = Xn. X n = Xn is that of (Y. Show that a conjugate family of distributions for the Poisson family is the gamma family." . 6. (a) Show that the family of priors E. .. X n+ Il . a regular model and integrable as a function of O.. then !3( a.1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •. Assume that A = {x : p( x I 8) > O} does not involve O..(1..1."'tF·t'. 7. .2. is the standard normal distribution function and n _ T 2 x + n+r+s' a J. 11'0) distribution.Section 1. } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is .1. ..1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta.. .··... If a and b are integers. . s). In Example 1. .0 E Let 9 have prior density 1r. X n+ k ) be a sample from a population with density f(x I 0). n.. 11'0) distribution.• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e. 10....2. I Xn+k) given . Suppose Xl.7 Problems . f3(r. b) denote the posterior distribution.. Xn) + 5.. f(x I f).b> 1. .) = n+r+s Hint: Let!3( a.. .. Interpret this result. Show that the conditional distribution of (6. . b) is the distribution of (aV loW)[1 + (aV IbW)]t. where VI."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl.. W. . XI = XI.8) that if in Example 1. Let (XI. ZI. Show in Example 1. then the posterior distribution of D given X = k is that of k + Z where Z has a B(N . where {i E A and N E {l.. D = NO has a B(N. (b) Suppose that max(xl.n.c(b.f) = [~. Show rigorously using (1. .'" . Show that ?T( m I Xl. .1= n+r+s _ .J:n). Justify the following approximation to the posterior distribution where q.2. . 9.and Complements 73 wherern = max(xl. Xn) = Xl = m for all 1 as n + 00 whatever be a... S.. Va' W t .
Here HinJ: Given I" and ".O.. compute the predictive and n + 00. 0> 0 a otherwise. 'V . The Dirichlet distribution.9). po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { .... 1 X n . . LO. N(I".\ > 0. (b) Discuss the behavior of the two predictive distributions as 15.) pix I 0) ~ ({" . (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8.Xn. (N.~vO}...J t • I X. 11. . 8 2 ) is such that vn(p. 0 > O. a = (a1. I(x I (J). (12). j=1 '\' • = 1..... (c) Find the posterior distribution of 0". O(t + v) has a X~+n distribution. . . •• • .Xn + l arei.~X) '" tnI. <0<x and let7t(O) = 2exp{20}. The posterior predictive distribution is the conditional distribution of X n +l given Xl.6 ""' 1r. D(a). Next use Bayes rule.) u/' 0 < cx) L".. . = 1. Note that. .") ~ . then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I. . j=1 • . "5) and N(Oo. . i). given (x. Q'j > 0."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x. . Letp(x I OJ =exp{(xO)}.. . given x. the predictive distribution is the marginal distribution of X n + l .. unconditionally.. .i./If (Ito. N. x n ). (. posterior predictive distribution. Let N = (N l . . .2 is (called) the precision of the distribution of Xi. 01 )=1 j=l Uj < 1.. . Goals.O the posterior density 7t( 0 Ix). X 1. I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}. 0 > O. 13. . Find the posterior distribution 7f(() I x) and show that if>' is an integer.~l . 14. and () = a.i..d.d.74 where N' ~ N Statistical Models. Xl . (a) If f and 7t are the N(O.J u.2) and we formally put 7t(I". 52) of J1.. vB has a XX distribution. .1 < j < T. < 1.. .d. L. v > 0.. x> 0. 82 I fL.. The Dirichlet distribution is a conjugate prior for the multinomial. .i. Suppose p(x I 0) is the density of i. Find I 12. where Xi known.. 0<0. has density r("" a) IT • fa(u) ~ n' r(.)T.. X n . In a Bayesian model whereX l . X n are i. . X and 5' are independent with X ~ N(I". 9= (OJ.. .. Q'r)T.) be multinomial M(n. T6) densities. and Performance Criteria Chapter 1 + nand ({. . Show that if Xl.
. = 0. I b"9 }. . . Let the actions corresponding to deciding whether the loss function is given by (from Lehmann. a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St.1.. the possible actions are al. Find the Bayes rule for case 2.5. (d) Suppose 0 has prior 1C(01) ~ 1'.. then the posteriof7r(0 where n = (nt l · · · . (c) Find the minimax rule among the randomized rules. 1C(O.) = 0. 1C(02) l' = 0.. 159 for the preceding case (a). . (d) Suppose that 0 has prior 1C(OIl (a). Suppose that in Example 1.. . Suppose the possible states of nature are (Jl..5.7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a). (b)p=lq=.5 and (ii) l' = 0. Find the Bayes rule when (i) 3.1.5.. 0.. Problems for Section 1. I.Section 1.3. O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x.3. (J) given by O\x a (I . a3. Compute and plot the risk points (a)p=q= .3. . (b) Find the minimax rule among {b 1 . a2. 8 = 0 or 8 > 0 for some parameter 8.p) (I . .q) and let when at. and the loss function l((J.3 IN ~ n) is V( a + n). or 0 > a be penoted by I. = 11'. 1. a) is given by 0.159 be the decision rules of Table 1.1. respectively and suppose .3. 1957) °< °= a 0. 159 of Table 1. (J2. n r ). See Example 1..3. The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O. (c) Find the minimax rule among bt.1.3.
variances. geographic locations or age groups).0)). J".I L LXij. Show that the strata sample sizes that minimize MSE(!i2) are given by (1. How should nj. c<I>(y'n(sO))+b<I>(y'n(rO)).7.Xnjj. I I I. .s. < J' < s. Goals. random variables Xlj"". 1 < j < S. 1) distribution function. < 00. 1 < j < S.g.) c<f>( y'n(r . and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J".3) .0)) +b<f>( y'n(s .. where <f> = 1 <I>.. 0 be chosen to make iiI unbiased? I (b) Neyman allocation.) r=2s=1. Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0. E. 1 (l)r=s=l.(X) 0 b b+c 0 c b+c b 0 where b and c are positive. Weassurne that the s samples from different strata are independent. and MSEs of iiI and 'j1z. 0<0 0=0 0 >0 R(O. 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s. . i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4.j = 1. .. .andastratumsamplemeanXj.i. are known. iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj.j = I. Stratified sampling. I . Within the jth stratum we have a sample of i. are known (estimates will be used in a latet chapter). 11 ..d. n = 1 and .=l iii = N. Assume that 0 < a. and <I> is the N(O. b<f>(y'ns) + b<I>(y'nr).. Suppose X is aN(B. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e. We want to estimate the mean J1..76 StatistiCCII Models.8.. (b) Plot the risk function when b = c = 1. (a) Compute the biases. (.
p).25. (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i. . and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii). .9. .1.I 2:.0 + '.. X n are i.7 Problems and Complements 77 Hint: You may use a Lagrange multiplier..5. Each value has probability . Hint: By Problem 1. b = '. Suppose that X I.0 . and that n is odd.b.2.5. a = .= 2:.2'. set () = 0 without loss of generality. . Suppose that n is odd.20. 75. (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0. ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c.. show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift).3. Hint: Use Problem 1.15.2.5. Let XI.. ..0. Let Xb and X b denote the sample mean and the sample median of the sample XI b.. Hint: See Problem B. respectively.4. n ~ 1.0 ~ + 2'.bl when n = compare it to EIX .Section 1..13. ~ ~ ~ .. Let X and X denote the sample mean and median. (b) Evaluate RR when n = 1.. < ° P < 1. > k) (ii) F is uniform. .40.45.3. .. Also find EIX . N(o..) with nk given by (1. 5.5. .. l X n . P(X ~ b) = 1 . ~ ~  ~ 6. (iii) F is normal.. p = . > 0. Hint: See Problem B. plot RR for p = .=1 pp7J" aV.2p. Next note that the distribution of X involves Bernoulli and multinomial trials.5. .'. X n be a sample from a population with values 0.. I).3.b.d. where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~. 1).=1 Pj(O'j where a.bl. that X is the median of the sample. We want to estimate "the" median lJ of F. (c) Same as (b) except when n ~ 1. ~ 7. [f the parameters of interest are the population mean and median of Xi .7. ali X '"'' F.i. Use a numerical integration package. (d) Find EIX . ° = 15.3. ~ a) = P(X = c) = P. U(O..5(0 + I) and S ~ B(n.2.2. . and = 1.3) is N. . The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = .bl for the situation in (i). (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X).
. then this definition coincides with the definition of an unbiased estimate of e.0(X))) < E.(o(X)). If the true 6 is 60 . (b) Suppose Xi _ N(I'..3) that .. . and only if. .'). A decision rule 15 is said to be unbiased if E. (a) Show that if 6 is real and 1(6. 10.(1(6'..  . II. suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i. I . . Let Xl. 8. Find MSE(6) and MSE(P).3(a) with b = c = 1 and n = I.3.2)(.. for all 6' E 8 1. Show that the value of c that minimizes M S E(c/. i1 . 0 < a 2 < 00. I I 78 Statistical Models. Goals.for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100.0(X))) for all 6.X)2.0) = E.1)1 E~ 1 (Xi . I (b) Show that if we use the 0 1 loss function in testing.X)' = ([Xi .g. then a test function is unbiased in this sense if. ' .0): 6 E eo}.. defined by (J(6. 0) (J(6'. = c L~ 1 (Xi .1'] . and Performance Criteria Chapter 1 :! . (a) Show that s:? = (n . the powerfunction..Xn be a sample from a population with variance a 2 .J. a) = (6 .10) + (.1'])2. Let () denote the proportion of people working in a company who have a certain characteristic (e.X)2 is an unbiased estimator of u 2 • Hint: Write (Xi . .8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company.X)2 has a X~I distribution. ~ 2(n _ 1)1.[X .X)2 keeping the square brackets intact.L)4 = 3(12. satisfies > sup{{J(6. being lefthanded). In Problem 1. (i) Show that MSE(S2) (ii) Let 0'6 c~ . e 8 ~ (.4.a)2. .(1(6.2 ~~ I (Xi . . Which one of the rules is the better one from the Bayes point of view? 11. (a)r = 8 =1 (b)r= ~s =1.) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B. I . 9. You may use the fact that E(X i . then expand (Xi . 10% have the characteristic. . A person in charge of ordering equipment needs to estimate and uses I . 6' E e. It is known that in the state where the company is located.3.
03) = aR(B. 0. 15. Show that if J 1 and J 2 are two randomized. 0 < u < 1. 18.. Suppose that Po. Show that the procedure O(X) au is admissible.1.' and 0 = II' .~ is unbiased 13. we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise. procedures.a )R(B. If U = u. find the set of I' where MSE(ji) depend on n. The interpretation of <p is the following.. and possible actions at and a2. then. . In Problem 1.3. Define the nonrandomized test Ju . (c) Same as (b) except k = 2. Show that J agrees with rp in the sense that.Section 1. In Example 1. plot R(O.7).) for all B.Polo(X) ~ 01 = Eo(<p(X)).wo to X.ao) ~ O. b.UfO.1'0 I· 17. (h) If N ~ 10. there is a randomized procedure 03 such that R(B. 16. a) is . (b) find the minimum relative risk of P. 1) and is independent of X. s = r = I. Suppose that the set of decision procedures is finite. consider the estimator < MSE(X). 14. Ok) as a function of B. Convexity ofthe risk set.1) and let Ok be the decision rule "reject the shipment iff X > k. Your answer should Mw If n.3. Furthersuppose that l(Bo. .3. Po[o(X) ~ 11 ~ 1 .7 Problems and Complements 79 12. B = . In Example 1. (a) find the value of Wo that minimizes M SE(Mw). consider the loss function (1. o Suppose that U . show that if c ::. Compare 02 and 03' 19.3. Consider a decision problem with the possible states of nature 81 and 82 . Suppose the loss function feB.. If X ~ x and <p(x) = 0 we decide 90. but if 0 < <p(x) < 1.3.1. if <p(x) = 1. 0.3. and k o ~ 3. Consider the following randomized test J: Observe U.) + (1 ." (a) Show that the risk is given by (1. and then JT.1. wc dccide El. z > 0. by 1 if<p(X) if <p(X) >u < u. (B) = 0 for some event B implies that Po (B) = 0 for all B E El..4.4.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known. use the test J u . For Example 1. A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1. given 0 < a < 1.
al a2 0 3 2 I 0. !.l. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly. In Problem B. Give the minimax rule among the nonrandomized decision rules. Give an example in which Z can be used to predict Y perfectly. Goals.4. In Example 1. 0 0.1 calculate explicitly the best zero intercept linear predictor. (a) Find the best predictor of Y given Z. (c) Suppose 0 has the prior distribution defined by . 5.) = 0. the best linear predictor.6 (a) Compute and plot the risk points of the nonrandomized decision rules.9.4 I 0. 6. What is 1. Let U I . and the best zero intercept linear predictor.7 find the best predictors ofY given X and of X given Y and calculate their MSPEs.8 0.4 = 0. 3.(0.) the Bayes decision rule? Problems for Section 1. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z). (b) Give and plot the risk set S. its MSPE. Give the minimax rule among the randomized decision rules. U2 be independent standard normal random variables and set Z Y = U I.(0. 2. and the ratio of its MSPE to that of the best and best linear predictors. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y. An urn contains four red and four black balls. .. The midpoint of the interval of such c is called the conventionally defined median or simply just the median.1. and Performance Criteria Chapter 1 B\a 0. 7.2 0. 4... !. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. Let X be a random variable with probability function p(x ! B) O\x 0. (b) Compute the MSPEs of the predictors in (a).Is Z of any value in predicting Y? = Ur + U?. 0. Four balls are drawn at random without replacement. Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error.80 Statistical Models.
Suppose that Z. where (ZI . Y') have a N(p" J. 10. Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) . (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13.  = ElY cl + (c  co){P[Y > co] . z > 0.cl) = oQ[lc . which is symmetric about c.t. 7 2 ) variables independent of each other and of (ZI. Let Y have a N(I".1"1/0] where Q(t) = 2['I'(t) + t<I>(t)]. ° 14. p( c all z. p) distribution and Z". Y" be the corresponding genetic and environmental components Z = ZJ + Z". Suppose that Z has a density p. Show that if (Z. a 2. that is. Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error. Z".2 ) distrihution. 0. (a) Show that P[lZ  tl < s] is maximized as a function oftfor each s > ° by t = c. Y) has a bivariate normal distribution. y l ). If Y and Z are any two random variables. (b) Show directly that I" minimizes E(IY . Show that c is a median of Z. a 2. p( z) is nonincreasing for z > c. (a) Show that the relation between Z and Y is weaker than that between Z' and y J . Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» . Y'. Suppose that Z has a density p. + z) = p(c . Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables.L. Define Z = Z2 and Y = Zl + Z l Z2.z) for 11.7 Problems and Complements 81 Hint: If c ElY . Yare measurements on such a variable for a randomly selected father and son. and otherwise. Y)I < Ipl. 12.YI > s. 9. Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor. Let Zl and Zz be independent and have exponential distributions with density Ae\z. Y = Y' + Y". Let ZI.PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8.col < eo. Y" are N(v. that is. which is symmetric about c and which is unimodal. (a) Show that E(IY .el) as a function of c.Section 1. (b) Suppose (Z. exhibit a best predictor of Y given Z for mean absolute prediction error. ICor(Z.
and Performance Criteria Chapter 1 . Hint: Recall that E( Z. Let /L(z) = E(Y I Z ~ z). j3 > 0. > 0. . g(Z)) 9 where g(Z) stands for any predictor. 2 'T ' . At the same time t a diagnostic indicator Zo of the severity of the disease (e. IS. hat1s.1] . IS. Yo > O. Goals.15. Show that Var(/L(Z))/Var(Y) = Corr'(Y.3. Consider a subject who walks into a clinic today.) = E( Z. on estimation of 1]~y. 1).g(Z)) where £ is the set of 17. We are interested in the time Yo = t . Hint: See Problem 1. 16. It will be convenient to rescale the problem by introducing Z = (Zo . a blood cell or viral load measurement) is obtained. and Doksurn and Samarov.y}.exp{ >. where piy is the population multiple correlation coefficient of Remark 1. then'fJh(Z)Y =1JZY.4.. at time t. and Var( Z.) (a) Show that 1]~y > piy. . 1905. and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease.g. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z.(y) = >./L(Z)) = max Corr'(Y.4. Predicting the past from the present. . 1995. = Corr'(Y. (See Pearson. in the linear model of Remark 1. (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y.S from infection until detection.I • ! . that is./L£(Z)) ~ maxgEL Corr'(Y.) ~ 1/>''. €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18.) ~ 1/>.: (0 The best linear MSPE predictor of Y based on Z = z. 82 Statisflcal Models.ElY .) = Var( Z./LdZ) be the linear prediction error. . ./L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y). ./L)/<J and Y ~ /3Yo/<J.4. Show that p~y linear predictors. and is diagnosed with a certain disease. ~~y = 1 . Let S be the unknown date in the past when the sUbject was infected.. >. mvanant un d ersuchh. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y. y> O. . I. Show that. (c) Let '£ = Y .4. Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1.
s(Y» given Z = z. wbere Y j . 19.. s(Y)] < 00.4.7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density .(>' . 1990. Establish 1.). see Berman. Z2 and W are independent with finite variances. where Z}. (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo .g(Zo)l· Hint: See Problems 1.s(Y)] Covlr(Y).4. Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2. (c) Show that if Z is real. and Nonnand and Doksum. x = 1. (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>.>')1 2 } Y> ° where c = 4>(z . (b) Show that (a) is equivalent to (1. (c) The optimal predictor of Y given X = x. Cov[r(Y). where X is the number of spots showing when a fair die is rolled.~ [y  (z .Section 1. 6. c. + b2 Z 2 + W. all the unknowns. Y2) using (a).9. b) equal to zero.z) . This density is called the truncated (at zero) normal. (a) Show that ifCov[r(Y). . Let Y be a vector and let r(Y) and s(Y) be real valued. Write Cov[r(Y). density. and checking convexity.>. then = E{Cov[r(Y).. Find (a) The mean and variance of Y. Let Y be the number of heads showing when X fair coins are tossed. 21. b). Find Corr(Y}.4.. (In practice. 20.. (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo. 7r(Y I z) ~ (27r) .>.7 and 1.I exp { .6) when r = s.14 by setting the derivatives of R(a. (b) The MSPE of the optimal predictor of Y based on X. 1). solving for (a. s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}.. including the "prior" 71".4. . I Z]' Z}. need to be estimated from cohort studies. 2000). . s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y). Hint: Use Bayes rule. N(z .z).
e)' Hint: Whatever be Y and c. (a) p(x.84 Statistical Models. . and Performance Criteria Chapter 1 (e) In the preceding model (d). density. population where e > O. n > 2. z) = z(1 .~(1 ..z). Verify that solving (1. x  . This is known as the Weibull density.4. Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad. suppose that Zl and Z.p~y).14). z "5). Oax"l exp( Ox").2eY + e' < 2(Y' + e'). > a. 00 (b) Suppose that given Z = z.x > 0.l . 24.6(O.4.z). 25. 0) = Oa' Ix('H).. 23. we say that there is a 50% overlap between Y 1 and Yz. and suppose that Z has the beta.3. 0 < z < I. where Po(Y.g(z)]'lw(y. In Example 1. a > O. Y ~ B(n. \ (b) Establish the same result using the factorization theorem.6(r. 1 2 y' .0> O. . 0) = < X < 1.z) be a positive realvalued function. Y z )? (f) In model (d). are N(p. Assume that Eow (Y. g(Z)) < for some 9 and that Po is a density. 0 (b) p(x. and = 0 otherwise. Let Xl. optimal predictor of Yz given (Yi.density. z) ~ ep(y. Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem. 2. Then [y . Zz).e' < (Y .. . Find the 22. ..15) yields (1. Let Xi = 1 if the ith item drawn is bad.z) = 6w (y.e)' = Y' . if b1 = b2 and Zl. Find Po(Z) when (i) w(y.Xn is a sample from a population with one of the following densities. 1). 1. 0 > 0. Show that EY' < 00 if and only if E(Y . (c) p(x.g(z») is called weighted squared prediction error. In this case what is Corr(Y. p(e). show that the MSPE of the optimal predictor is .4. This is thebeta. . a > O. z) and c is the constant that makes Po a density.5 1. (a) Show directly that 1 z:::: 1 Xi is sufficient for 8. Problems for Section 1. z) = I. 3. Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z). . 0 > 0. Suppose Xl. (a) Let w(y. Goals. . s). and (ii) w(y. . < 00 for all e. Zz and ~V have the same variance (T2.z)lw(y..') and W ~ N(po. 0) = Ox'.. 1 X n be a sample from a Poisson.
This is known as the Pareto density. In each case, find a real-valued sufficient statistic for θ, a fixed.

4. (a) Show that T_1 and T_2 are equivalent statistics if, and only if, we can write T_2 = H(T_1) for some 1-1 transformation H of the range of T_1 into the range of T_2. Which of the following statistics are equivalent? (Prove or disprove.)

(b) Π_{i=1}^n X_i and Σ_{i=1}^n log X_i, X_i > 0

(c) Σ_{i=1}^n X_i and Σ_{i=1}^n log X_i, X_i > 0

(d) (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) and (Σ_{i=1}^n X_i, Σ_{i=1}^n (X_i − X̄)²)

(e) (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i³) and (Σ_{i=1}^n X_i, Σ_{i=1}^n (X_i − X̄)³)
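The identities behind parts (b) and (d) are easy to verify numerically. Below is a short Python sketch (not part of the text); the sample values are arbitrary, H = log serves as the 1-1 map in (b), and (d) rests on Σ(X_i − X̄)² = ΣX_i² − (ΣX_i)²/n.

```python
import numpy as np

x = np.array([0.5, 2.0, 3.0, 1.5])   # any sample with positive entries

# (b): T2 = sum(log x) equals H(T1) with T1 = prod(x) and H = log, 1-1 on (0, inf).
assert np.isclose(np.log(np.prod(x)), np.sum(np.log(x)))

# (d): (sum x, sum x^2) determines sum (x - xbar)^2 through a 1-1 map, and conversely.
n, s1, s2 = len(x), x.sum(), (x ** 2).sum()
assert np.isclose(s2 - s1 ** 2 / n, ((x - x.mean()) ** 2).sum())
print("both identities hold")
```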
5. Let e = (e lo e,) be a bivariate parameter. Suppose that T l (X) is sufficient for e, whenever 82 is fixed and known, whereas T2 (X) is sufficient for (h whenever 81 is fixed and known. Assume that eh ()2 vary independently, lh E 8 1 , 8 2 E 8 2 and that the set S = {x: pix, e) > O} does not depend on e. (a) Show that ifT, and T, do not depend one2 and e, respectively, then (Tl (X), T2 (X)) is sufficient for e. (b) Exhibit an example in which (T, (X), T2 (X)) is sufficient for T l (X) is sufficient for 8 1 whenever 8 2 is fixed and known, but Tz(X) is not sufficient for 82 , when el is fixed and known. 6. Let X take on the specified values VI, .•. 1 Vk with probabilities 8 1 , .•• ,8k, respectively. Suppose that Xl, ... ,Xn are independently and identically distributed as X. Suppose that IJ = (e" ... , e>l is unknown and may range over the set e = {(e" ... ,ek) : e, > 0, 1 < i < k, E~ 18i = I}, Let Nj be the number of Xi which equal Vj' (a) What is the distribution of (N" ... , N k )? (b) Sbow that N = (N" ... , N k _,) is sufficient for 7. Let Xl,'"
1
e,
e.
X n be a sample from a population with density p(x, 8) given by
pix, e)
o otherwise.
Here
e = (/1, <r) with 00 < /1 < 00, <r > 0.
(a) Show that min (Xl, ... 1 X n ) is sufficient for fl when a is fixed. (b) Find a onedimensional sufficient statistic for a when J1. is fixed. (c) Exhibit a twodimensional sufficient statistic for 8. 8. Let Xl,. " ,Xn be a sample from some continuous distribution Fwith density f, which is unknown. Treating f as a parameter, show that the order statistics X(l),"" X(n) (cf. Problem B.2.8) are sufficient for f.
,
"
I
!
86
Statistical Models, Goals, and Performance Criteria
Chapter 1
9. Let Xl, ... ,Xn be a sample from a population with density
j,(x)
a(O)h(x) if 0, < x < 0,
o othetwise
where h(x)
> 0,0= (0,,0,)
with
00
< 0, < 0, <
00,
and a(O)
=
[J:"
h(X)dXr'
is assumed to exist. Find a twodimensional sufficient statistic for this problem and apply your result to the U[()l, ()2] family of distributions. 10. Suppose Xl" .. , X n are U.d. with density I(x, 8) = ~elx61. Show that (X{I),"" X(n», the order statistics, are minimal sufficient. Hint: t,Lx(O) =  E~ ,sgn(Xi  0), 0 't {X"" . , X n }, which determines X(I),
. " , X(n)'
11. Let X 1 ,X2, ... ,Xn be a sample from the unifonn, U(O,B). distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for O.
12. Dynkin, Lehmann, Scheffe's Theorem. Let P = {Po : () E e} where Po is discrete concentrated on X = {x" x," .. }. Let p(x, 0) p.[X = xl Lx(O) > on Show that f:xx(~~) is minimial sufficient. Hint: Apply the factorization theorem.
=
=
°
x,
13. Suppose that X = (XlI" _, X n ) is a sample from a population with continuous distribution function F(x). If F(x) is N(j1., ,,'), T(X) = (X, ,,'). where,,2 = n l E(Xi 1')2, is sufficient, and S(X) ~ (XCI)"" ,Xin»' where XCi) = (X(i)  1')/'" is "irrelevant" (ancillary) for (IL, a 2 ). However, S(X) is exactly what is needed to estimate the "shape" of F(x) when F(x) is unknown. The shape of F is represented hy the equivalence class F = {F((·  a)/b) : b > 0, a E R}. Thus a distribution G has the same shape as F iff G E F. For instance, one "estimator" of this shape is the scaled empirical distribution function F,(x) jln, x(j) < x < x(i+1)' j = 1, . .. ,nl
~
0, x
< XCI)
> x(n)
1, x
~
Show that for fixed x, F,((x  x)/,,) converges in prohahility to F(x). Here we are using F to represent :F because every member of:F can be obtained from F.
I
I ,
'I i,
14. Kolmogorov's Theorem. We are given a regular model with e finite.
(a) Suppose that a statistic T(X) has the property that for any prior distribution on 9, the posterior distrihution of 9 depends on x only through T(x). Show that T(X) is sufficient.
(b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the posterior distribution depends on x only through T(x).
Section 1.7
Problems and Complements
87
Hint: Apply the factorization theorem.
15. Let X h .··, X n be a sample from f(x  0), () E R. Show that the order statistics arc minimal sufficient when / is the density Cauchy Itt) ~ I/Jr(1 + t 2 ). 16. Let Xl,"" X rn ; Y1 ,· . " ~l be independently distributed according to N(p, (72) and N(TI, 7 2 ), respectively. Find minimal sufficient statistics for the following three cases:
(i) p, TI,
0", T
are arbitrary:
00
< p, TI < 00, a <
(J,
T.
(ii)
(J
=T
= TJ
and p, TI, (7 are arbitrary. and p,
0", T
(iii) p
are arbitrary.
17. In Example 1.5.4. express tl as a function of Lx(O, 1) and Lx(l, I). Problems to Sectinn 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose X I, ... , X n is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of X fonns a oneparameter exponential family. Identify 'TI, B, T, and h. 3. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability nf success O. Then P, IX = k] = (I  0)'0, k ~ 0 0 1,2, ... This is called thc geometric distribution (9 (0». (a) Show that the family of geometric distributions is a oneparameter exponential family with T(x) ~ x. (b) Deduce from Theorem 1.6.1 that if X lo '" oXn is a sample from 9(0), then the distributions of L~ 1 Xi fonn a oneparameter exponential family. (c) Show that E~
1
Xi in part (b) has a negative binomial distribution with parameters
(noO)definedbyP,[L:71Xi = kJ =
(n+~I
)
(10)'on,k~0,1,2o'"
(The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success 0.) Hint: By Theorem 1.6.1, P,[L:7 1 Xi = kJ = c.(1  o)'on. 0 < 0 < 1. If
=
' " CkW'
I
= c(;'',)::, 0 lw n
k=O
L..J
< W < I, then
4. Which of the following families of distributions are exponential families? (Prove or
disprove.) (a) The U(O, 0) fumily
88
(b) p(.", 0)
Statistical Models, Goals, and Performance Criteria
Chapter 1
(c)p(x,O)
= {exp[2Iog0+log(2x)]}I[XE = ~,xE {O.I +0, ... ,0.9+0j = 2(x +
0)/(1 + 20), 0
(0,0)1
(d) The N(O, 02 ) family, 0 > 0
(e)p(x,O)
< x < 1,0> 0
(f) p(x,9) is the conditional frequency function of a binomial, B(n,O), variable X, given that X > O.
5. Show that the following families of distributions are twoparameter exponential families and identify the functions 1], B, T, and h. (a) The beta family. (b) The gamma family. 6. Let X have the Dirichlet distribution, D( a), of Problem 1.2.15. Show the distribution of X form an rparameter exponential family and identify fJl B, T, and h.
7. Let X = ((XI, Y I ), ... , (X no Y n » be a sample from a bIvariate nonnal population.
Show that the distributions of X form a fiveparameter exponential family and identify 'TJ, B, T, and h.
8. Show that the family of distributions of Example 1.5.3 is not a one parameter eX(Xloential family. Hint: If it were. there would be a set A such that p(x, 0) > on A for all O.
°
9. Prove the analogue of Theorem 1.6.1 for discrete kparameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (XI, X 2) is a sample of size 2 from f(·, 0), then XI + X2 is sufficient for B. Show that f(·, B) corresponds to a onearameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x" X2) such that log f(x" 0) + log f(X2, 0) = g(xI + X2, 0) + h(XI, X2). Fix 00 and let r(x, 0) = log f(x, 0)  log f(x, 00), q(x, 0) = g(x,O)  g(x,Oo). Then, q(xI + X2,0) = r(xI,O) +r(x2,0), and hence, [r(x" 0) r(O, 0)1 + [r(x2, 0)  r(O, 0») = r(xi + X2, 0)  r(O, 0). 11. Use Theorems 1.6.2 and 1.6.3 to obtain momentgenerating functions for the sufficient statistics when sampling from the following distributions. (a) normal, () ~ (ll,a 2 )
(b) gamma. r(p, >.), 0
= >., p fixed
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3)
(0 gamma. r(p, >'). ()
= (p, >.).
 
,

Section 1. 7
Problems and Complements
89
12. Show directly using the definition of the rank of an ex}X)nential family that the multinomialdistribution,M(n;OI, ... ,Ok),O < OJ < 1,1 <j < k,I:~oIOj = 1, is of rank k1. 13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any (k ~ I)dimensional hyperplane. 14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of &. Show that if k = 1 and &0 oJ 0 and A, A are defined on all of &, then Theorem 1.6.3 continues to hold. 15. Let P = {P. : 0 E e} where p. is discrete and concentrated on X = {x" X2, ... }, and let p( x, 0) = p. IX = x I. Show that if P is a (discrete) canonical ex ponential family generated bi, (T, h) and &0 oJ 0, then T is minimal sufficient. Hint: ~;j'Lx('l) = Tj(X)  E'lTj(X). Use Problem 1.5.12.
16. Life testing. Let Xl,.'" X n be independently distributed with exponential density (20)l e x/2. for x > 0, and let the ordered X's be denoted by Y, < Y2 < '" < YnIt is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r aparticles have been emitted. Show that
(i) The joint distribution of Y1 , •.. , Yr is an exponential family with density
n! [ (20), (n _ r)! exp (ii) The distribution of II:: I Y;
(iii) Let
1
I::l Yi + (n 20
r)Yr]
' 0  Y,  ...  Yr·
<
<
<
+ (n 
r)Yrl/O is X2 with 2r degrees of freedom.
denote the time required until the first, second,... event occurs in a Poisson process with parameter 1/20' (see A.I6). Then Z, = YI/O', Z2 = (Y2 Yr)/O', Z3 = (Y3  Y 2)/0', ... are independently distributed as X2 with 2 degrees of freedom, and the joint density of Y1 , ••. , Yr is an exponential family with density
Yi, Yz , ...
The distribution of Yr/B' is again XZ with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burnedout tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
I ,
90
Statistical Models, Goals, and Performance Criteria Chapter 1
1)(Y; l~~l)/e (I = 1", .. ,') are independently distributed as X2 with 2 degrees of freedom, and [L~ 1 Yi + (n  7")Yr]/B = [(ii): The random variables Zi ~ (n  i
+
L::~l Z,.l
17. Suppose that (TkXl' h) generate a canonical exponential family P with parameter k 1Jkxl and E = R . Let
(a) Show that Q is the exponential family generated by IlL T and h exp{ cTT}. where IlL is the projection matrix of Tonto L = {'I : 'I = BO + c). (b) Show that ifP has full rank k and B is of rank I, then Q has full rank l. Hint: If B is of rank I, you may assume
18. Suppose Y1, ... 1 Y n are independent with Yi '" N(131 + {32Zi, (12), where Zl,'" , Zn are covariate values not all equaL (See Example 1.6.6.) Show that the family has rank 3.
Give the mean vector and the variance matrix of T.
19. Logistic Regression. We observe (Zll Y1 ), ... , (zn, Y n ) where the Y1 , .. _ , Y n are independent, Yi "' B(TIi, Ad The success probability Ai depends on the characteristics Zi of the ith subject, for example, on the covariate vector Zi = (age, height, blood pressure)T. The function I(u) ~ log[u/(l  u)] is called the logil function. In the logistic linear re(3 where (3 = ((31, ... ,/3d ) T and Zi is d x 1. gression model it is assumed that I (Ai) = Show that Y = (Y1 , ... , yn)T follow an exponential model with rank d iff Zl, ... , Zd are
zT
not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). 20. (a) In part IT of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by ('11 'Io)TT and that ~(ii) =~(i). (b) Fill in the details of part III of the proof of Theorem 1.6.4. 21. Find JJ.('I) ~ EryT(X) for the gamma,
qa, A), distribution, where e = (a, A).
I
22. Let X I, . _ . ,Xn be a sample from the k·parameter exponential family distribution (1.6.10). Let T = (L:~ 1 1 (Xi ), ... , L:~ 1Tk(X,») and let T
I
S
~
((ryl(O), ... ,ryk(O»): e E 8).
Show that if S contains a subset of k + 1 vectors Vo, .. _, Vk+l so that Vi  Vo, 1 < i are not collinear (linearly independent), then T is minimally sufficient for 8.
< k.
I .' jl,
"
23. Using (1.6.20). find a conjugate family of distributions for the gamma and beta families. (a) With one parameter fixed. (b) With both parameters free.
:
I
Section 1.7
Problems and Complements
91
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter 0 = (O!, O ) where O! = E,(X), 0, ~ l/(Var oX) (cf. Problem 1.2.12). 2 25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with (72 known. Find a conjugate family of prior distributions for (131,132) T. 26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15. 27. Let P denote the canonical exponential family genrated by T and h. For any TJo E £, set ho(x) = q(x, '10) where q is given by (1.6.9). Show that P is also the canonical exponential family generated by T and h o.
28. Exponential/amities are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by h(f)
~ E(logf(X)) =
l:IIOgf(X)I!(X)dx.
This quantity arises naturally in infonnation in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S ~ {x: f(x) > OJ. (a) Show that the canonical kparameter exponential family density
f(x, 'I)
= exp
• ryjrj(x) 1/0 + I:
j:=1
A('I)
, XES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is
f(x)dx
~ 1,
Is
f(x)rj(x)
~ aj,
1 < j < k,
29. As in Example 1.6.11, suppose that Y_1, ..., Y_n are i.i.d. N_p(μ, Σ) where μ varies freely in R^p and Σ ranges freely over the class of all p × p symmetric positive definite matrices. Show that the distribution of Y = (Y_1, ..., Y_n) is the p(p + 3)/2 canonical exponential family generated by h = 1 and the p(p + 3)/2 statistics

T_j = Σ_{i=1}^n Y_{ij}, 1 ≤ j ≤ p;  T_{jl} = Σ_{i=1}^n Y_{ij} Y_{il}, 1 ≤ j ≤ l ≤ p,

where Y_i = (Y_{i1}, ..., Y_{ip}). Show that E is open and that this family is of rank p(p + 3)/2.

Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = p(p + 3)/2 statistics T_j(Y) = Y_j, 1 ≤ j ≤ p, and T_{jl}(Y) = Y_j Y_l, 1 ≤ j ≤ l ≤ p,
generate N_p(μ, Σ). As Σ ranges over all p × p symmetric positive definite matrices, so does Σ^{-1}. Next establish that for symmetric matrices M,

∫ exp{-u^T M u} du < ∞ iff M is positive definite,

by using the spectral decomposition (see B.10.1.2)

M = Σ_{j=1}^p λ_j e_j e_j^T for e_1, ..., e_p orthogonal, λ_j ∈ R.

To show that the family has full rank m, use induction on p to show that if Z_1, ..., Z_p are i.i.d. N(0, 1) and if B_{p×p} = (b_{jl}) is symmetric, then

P(Σ_{j=1}^p a_j Z_j + Σ_{j,l} b_{jl} Z_j Z_l = c) = P(a^T Z + Z^T B Z = c) = 0

unless a = 0, B = 0, c = 0. Next recall (Appendix B.6) that since Y ~ N_p(μ, Σ), then Y = SZ for some nonsingular p × p matrix S.
30. Show that if X_1, ..., X_n are i.i.d. N_p(θ, Σ_0) given θ where Σ_0 is known, then the N_p(λ, Γ) family is conjugate to N_p(θ, Σ_0), where λ varies freely in R^p and Γ ranges over all p × p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(μ_j, τ_j) : 1 ≤ j ≤ k} be a given collection of pairs with μ_j ∈ R, τ_j > 0. Let (μ, τ) be a random pair with λ_j = P((μ, τ) = (μ_j, τ_j)), 0 < λ_j < 1, Σ_{j=1}^k λ_j = 1. Let θ be a random variable whose conditional distribution given (μ, τ) = (μ_j, τ_j) is normal, N(μ_j, τ_j^2). Consider the model X = θ + ε, where θ and ε are independent and ε ~ N(0, σ_0^2), σ_0^2 known. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λ_j φ_{τ_j}(θ - μ_j)    (1.7.4)

where φ_τ denotes the N(0, τ^2) density. Also note that (X | θ) has the N(θ, σ_0^2) distribution.

(a) Find the posterior

π(θ | x) = Σ_{j=1}^k P((μ, τ) = (μ_j, τ_j) | x) π(θ | (μ_j, τ_j), x)

and write it in the form

Σ_{j=1}^k λ_j(x) φ_{τ_j(x)}(θ - μ_j(x))
for appropriate λ_j(x), τ_j(x) and μ_j(x). This shows that (1.7.4) defines a conjugate prior for the N(θ, σ_0^2) distribution.

(b) Let X_i = θ + ε_i, 1 ≤ i ≤ n, where θ is as previously and ε_1, ..., ε_n are i.i.d. N(0, σ_0^2). Find the posterior π(θ | x_1, ..., x_n), and show that it belongs to the class (1.7.4). Hint: Consider the sufficient statistic for p(x | θ).
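A numerical sketch of the mixture-posterior update described in part (a): a normal mixture prior remains a normal mixture after observing X = x, with reweighted components. The component values, σ_0, and x below are illustrative assumptions.

```python
# Posterior of a normal mixture prior after one observation x; values are illustrative.
import numpy as np
from scipy.stats import norm

mu = np.array([-1.0, 0.0, 2.0])        # mu_j
tau = np.array([0.5, 1.0, 0.7])        # tau_j
lam = np.array([0.3, 0.5, 0.2])        # lambda_j
sigma0, x = 1.0, 1.3

# marginally, under component j, X ~ N(mu_j, tau_j^2 + sigma0^2)
marg = norm.pdf(x, loc=mu, scale=np.sqrt(tau ** 2 + sigma0 ** 2))
lam_post = lam * marg / np.sum(lam * marg)            # lambda_j(x)

prec = 1.0 / sigma0 ** 2 + 1.0 / tau ** 2
mu_post = (x / sigma0 ** 2 + mu / tau ** 2) / prec    # mu_j(x)
tau_post = np.sqrt(1.0 / prec)                        # tau_j(x)
print(lam_post, mu_post, tau_post)
```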
32. A Hierarchical Binomial-Beta Model. Let {(r_j, s_j) : 1 ≤ j ≤ k} be a given collection of pairs with r_j > 0, s_j > 0, let (R, S) be a random pair with P(R = r_j, S = s_j) = λ_j, 0 < λ_j < 1, Σ_{j=1}^k λ_j = 1, and let θ be a random variable whose conditional density π(θ, r, s) given R = r, S = s is beta, β(r, s). Consider the model in which (X | θ) has the binomial, B(n, θ), distribution. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λ_j π(θ, r_j, s_j).    (1.7.5)

Find the posterior

π(θ | x) = Σ_{j=1}^k P(R = r_j, S = s_j | x) π(θ | (r_j, s_j), x)

and show that it can be written in the form Σ_j λ_j(x) π(θ, r_j(x), s_j(x)) for appropriate λ_j(x), r_j(x) and s_j(x). This shows that (1.7.5) defines a class of conjugate priors for the B(n, θ) distribution.
33. Let p(x, η) be a one-parameter canonical exponential family generated by T(x) = x and h(x), x ∈ X ⊂ R, and let ψ(x) be a nonconstant, nondecreasing function. Show that E_η ψ(X) is strictly increasing in η.

Hint:

Cov_η(ψ(X), X) = ½ E{(X - X′)[ψ(X) - ψ(X′)]}

where X and X′ are independent, identically distributed as X (see A.11.12).
34. Let (X_1, ..., X_n) be a stationary Markov chain with two states 0 and 1. That is,

P[X_i = ε_i | X_1 = ε_1, ..., X_{i-1} = ε_{i-1}] = P[X_i = ε_i | X_{i-1} = ε_{i-1}] = p_{ε_{i-1} ε_i}

where

( p_00  p_01 )
( p_10  p_11 )

is the matrix of transition probabilities. Suppose further that

(i) p_00 = p_11 = p, so that p_10 = p_01 = 1 - p;
(ii) P[X_1 = 0] = P[X_1 = 1] = 1/2.
(a) Show that if 0 < p < 1 is unknown this is a full rank, oneparameter exponential family with T = NOD + N ll where Nt) the number of transitions from i to j. For example, 01011 has N Ol = 2, Nil = 1, N oo = 0, N IO ~ 1.
(b) Show that E(T)
= (n 
l)p (by the method of indicators or otherwise).
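A Monte Carlo check of part (b): with p_00 = p_11 = p, the statistic T = N_00 + N_11 counts the i to i transitions and E(T) = (n - 1)p. The chain length, p, and the number of replications below are illustrative assumptions.

```python
# Simulation check that E(N_00 + N_11) = (n - 1)p for the two-state chain of Problem 34.
import numpy as np

rng = np.random.default_rng(5)
n, p, reps = 20, 0.7, 5000

def sample_T():
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)                     # P[X_1 = 0] = P[X_1 = 1] = 1/2
    for i in range(1, n):
        stay = rng.random() < p                # remain in the current state with probability p
        x[i] = x[i - 1] if stay else 1 - x[i - 1]
    return int(np.sum(x[1:] == x[:-1]))        # N_00 + N_11

print(np.mean([sample_T() for _ in range(reps)]), (n - 1) * p)
```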
35. A Conjugate Prior for the Two-Sample Problem. Suppose that X_1, ..., X_n and Y_1, ..., Y_n are independent N(μ_1, σ^2) and N(μ_2, σ^2) samples, respectively. Consider the prior π for which, for some r > 0, k > 0, rσ^{-2} has a χ²_k distribution and, given σ^2, μ_1 and μ_2 are independent with N(ξ_1, σ^2/k_1) and N(ξ_2, σ^2/k_2) distributions, respectively, where ξ_j ∈ R, k_j > 0, j = 1, 2. Show that π is a conjugate prior.
36. The inverse Gaussian density, IG(μ, λ), is

f(x, μ, λ) = [λ/2π]^{1/2} x^{-3/2} exp{-λ(x - μ)^2 / 2μ^2 x},  x > 0, μ > 0, λ > 0.

(a) Show that this is an exponential family generated by T(X) = -½(X, X^{-1})^T and h(x) = (2π)^{-1/2} x^{-3/2}.

(b) Show that the canonical parameters η_1, η_2 are given by η_1 = μ^{-2}λ, η_2 = λ, and that A(η_1, η_2) = -[½ log(η_2) + √(η_1 η_2)], E = [0, ∞) × (0, ∞).

(c) Find the moment-generating function of T and show that E(X) = μ, Var(X) = μ^3 λ^{-1}, E(X^{-1}) = μ^{-1} + λ^{-1}, Var(X^{-1}) = (λμ)^{-1} + 2λ^{-2}.

(d) Suppose μ = μ_0 is known. Show that the gamma family, Γ(α, β), is a conjugate prior.

(e) Suppose that λ = λ_0 is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to μ. That is, Ω defined in (1.6.19) is empty.

(f) Suppose that μ and λ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, Ω defined in (1.6.19) is empty.

37. Let X_1, ..., X_n be i.i.d. as X ~ N_p(θ, Σ_0) where Σ_0 is known. Show that the conjugate prior generated by (1.6.20) is the N_p(η_0, τ_0^2 I) family, where η_0 varies freely in R^p, τ_0^2 > 0 and I is the p × p identity matrix.
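A numerical check of the moment facts in Problem 36(c): the IG(μ, λ) density above integrates to 1, has mean μ and variance μ^3/λ. The values of μ and λ below are illustrative assumptions.

```python
# Numerical integration check of the inverse Gaussian density and its first two moments.
import numpy as np
from scipy.integrate import quad

mu, lam = 2.0, 3.0

def f(x):
    return np.sqrt(lam / (2 * np.pi)) * x ** (-1.5) * np.exp(-lam * (x - mu) ** 2 / (2 * mu ** 2 * x))

total = quad(f, 0, np.inf)[0]
mean = quad(lambda x: x * f(x), 0, np.inf)[0]
second = quad(lambda x: x ** 2 * f(x), 0, np.inf)[0]
print(total, mean, second - mean ** 2, mu ** 3 / lam)   # 1, mu, and matching variances
```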
38. Let X_i = (Z_i, Y_i)^T be i.i.d. as X = (Z, Y)^T, 1 ≤ i ≤ n, where X has the density of Example 1.6.3. Write the density of X_1, ..., X_n as a canonical exponential family and identify T, h, A, and E. Find the expected value and variance of the sufficient statistic.
39. Suppose that Y_1, ..., Y_n are independent, Y_i ~ N(μ_i, σ^2), n ≥ 4.

(a) Write the distribution of Y_1, ..., Y_n in canonical exponential family form. Identify T, h, η, A, and E.

(b) Next suppose that μ_i depends on the value z_i of some covariate and consider the submodel defined by the map η : (θ_1, θ_2, θ_3)^T → (μ^T, σ^2)^T where η is determined by

μ_i = exp{θ_1 + θ_2 z_i}, z_1 < z_2 < ... < z_n;  σ^2 = θ_3
where θ_1 ∈ R, θ_2 ∈ R, θ_3 > 0. This model is sometimes used when μ_i is restricted to be positive. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 3.
40. Suppose Y_1, ..., Y_n are independent exponentially, ℰ(λ_i), distributed survival times, n ≥ 3.

(a) Write the distribution of Y_1, ..., Y_n in canonical exponential family form. Identify T, h, η, A, and E.

(b) Recall that μ_i = E(Y_i) = λ_i^{-1}. Suppose μ_i depends on the value z_i of a covariate. Because μ_i > 0, μ_i is sometimes modeled as

μ_i = exp{θ_1 + θ_2 z_i}, i = 1, ..., n,

where not all the z's are equal. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 2.
1.8
NOTES
Note for Section 1.1
(1) For the measure theoretically minded we can assume more generally that the P_θ are all dominated by a σ-finite measure μ and that p(x, θ) denotes dP_θ/dμ, the Radon-Nikodym derivative.
Notes for Section 1.3

(1) More natural in the sense of measuring the Euclidean distance between the estimate θ̂ and the "truth" θ. Squared error gives much more weight to those θ̂ that are far away from θ than to those close to θ.

(2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r.
Note for Section 1.4

(1) Source: Hodges, Jr., J. L., D. Krech, and R. S. Crutchfield, Statlab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975.
Notes for Section 1.6

(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particles; see Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory; see Cover and Thomas (1991).

(2) The restriction that x ∈ R^q and that these families be discrete or continuous is artificial. In general, if μ is a σ-finite measure on the sample space X, p(x, θ) as given by (1.6.1) can be taken to be the density of X with respect to μ; see Lehmann (1997). This permits consideration of data such as images, positions, and spheres (e.g., the Earth), and so on.

Note for Section 1.7

(1) That is, u^T M u > 0 for all p × 1 vectors u ≠ 0.

1.9 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.
BERMAN, S. M., "A Stochastic Model for the Distribution of HIV Latency Time Based on T4 Counts," Biometrika, 77, 733-741 (1990).
BICKEL, P. J., "Using Residuals Robustly I: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist., 6, 266-291 (1978).
BLACKWELL, D., AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York: Wiley, 1954.
BOX, G. E. P., "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion)," J. Royal Statist. Soc., A 143, 383-430 (1979).
BROWN, L., Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, IMS Lecture Notes - Monograph Series, Hayward, CA, 1986.
CARROLL, R. J., AND D. RUPPERT, Transformation and Weighting in Regression New York: Chapman and Hall, 1988.
COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.
DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1970.
DOKSUM, K. A., AND A. SAMAROV, "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression," Ann. Statist., 23, 1443-1473 (1995).
FERGUSON, T. S., Mathematical Statistics: A Decision Theoretic Approach New York: Academic Press, 1967.
FEYNMAN, R. P., R. B. LEIGHTON, AND M. SANDS, "Statistical Mechanics of Physics," The Feynman Lectures on Physics, v. 1, Ch. 40, Reading, MA: Addison-Wesley, 1963.
GRENANDER, U., AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York: Wiley, 1957.
HODGES, JR., J. L., D. KRECH, AND R. S. CRUTCHFIELD, Statlab: An Empirical Introduction to Statistics New York: McGraw-Hill, 1975.
KENDALL, M. G., AND A. STUART, The Advanced Theory of Statistics, Vols. I and II, New York: Hafner Publishing Co., 1961, 1966.
LEHMANN, E. L., "A Theory of Some Multiple Decision Problems," Ann. Math. Statist., 547-572 (1957).
LEHMANN, E. L., "Model Specification: The Views of Fisher and Neyman, and Later Developments," Statist. Science, 5, 160-168 (1990).
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.
MANDEL, J., The Statistical Analysis of Experimental Data New York: J. Wiley & Sons, 1964.
NORMAND, S.-L., AND K. A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, Lecture Notes in Statistics, Editors: S. E. Ahmed and N. Reid, 67-79, New York: Springer, 2000.
PEARSON, K., "On the General Theory of Skew Correlation and Non-linear Regression," Proc. Roy. Soc. London, 71, 303 (1905). (Draper's Research Memoirs, Biometrics Series II, London: Dulau & Co.)
RAIFFA, H., AND R. SCHLAIFFER, Applied Statistical Decision Theory Boston: Division of Research, Graduate School of Business Administration, Harvard University, 1961.
SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.
SAVAGE, L. J., ET AL., The Foundations of Statistical Inference London: Methuen & Co., 1962.
SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989.
WETHERILL, G. B., AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
Chapter 2

METHODS OF ESTIMATION

2.1 BASIC HEURISTICS OF ESTIMATION

2.1.1 Minimum Contrast Estimates; Estimating Equations

Our basic framework is as before, X ∈ X, X ~ P ∈ P, usually parametrized as P = {P_θ : θ ∈ Θ}. In this parametric case, how do we select reasonable estimates for θ itself? That is, how do we find a function θ̂(X) of the vector observation X that in some sense "is close" to the unknown θ? The fundamental heuristic is typically the following. We consider a function that we shall call a contrast function,

ρ : X × Θ → R,

and define

D(θ_0, θ) = E_{θ_0} ρ(X, θ).

As a function of θ, D(θ_0, θ) measures the (population) discrepancy between θ and the true value θ_0 of the parameter. In order for ρ to be a contrast function we require that D(θ_0, θ) be uniquely minimized for θ = θ_0. That is, if P_{θ_0} were true and we knew D(θ_0, θ) as a function of θ, we could obtain θ_0 as the minimizer. Of course, we don't know the truth, so this is inoperable, but in a very weak sense (unbiasedness), ρ(X, θ) is an estimate of D(θ_0, θ). So it is natural to consider θ̂(X) minimizing ρ(X, θ). This is the most general form of the minimum contrast estimate we shall consider in the next section.

Now suppose Θ is Euclidean ⊂ R^d, the true θ_0 is an interior point of Θ, and θ → D(θ_0, θ) is smooth. Then we expect

∇_θ D(θ_0, θ) |_{θ = θ_0} = 0    (2.1.1)

where ∇_θ denotes the gradient. Arguing heuristically again, we are led to estimates θ̂ that solve

∇_θ ρ(X, θ̂) = 0.    (2.1.2)

The equations (2.1.2) define a special form of estimating equations.
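A minimal numerical sketch of the minimum contrast idea, using the least squares contrast of the next example with a nonlinear regression function; the particular g, parameter values, and simulated data are illustrative assumptions, not part of the text.

```python
# Minimum contrast estimation: minimize rho(X, theta) numerically.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
z = np.linspace(0.1, 5.0, 60)
beta_true = np.array([2.0, 0.8])
g = lambda beta, z: beta[0] * (1.0 - np.exp(-beta[1] * z))   # a known, nonlinear g
Y = g(beta_true, z) + rng.normal(0.0, 0.1, size=z.size)

def rho(beta):
    # least squares contrast: sum_i [Y_i - g(beta, z_i)]^2
    return np.sum((Y - g(beta, z)) ** 2)

beta_hat = minimize(rho, x0=np.array([1.0, 1.0])).x    # solves grad rho = 0 numerically
print(beta_hat)                                        # close to beta_true
```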
More generally, suppose we are given a function ψ : X × R^d → R^d, ψ = (ψ_1, ..., ψ_d)^T, and define

V(θ_0, θ) = E_{θ_0} ψ(X, θ).    (2.1.3)

Suppose V(θ_0, θ) = 0 has θ = θ_0 as its unique solution for all θ_0 ∈ Θ. Then we say θ̂ solving

ψ(X, θ̂) = 0    (2.1.4)

is an estimating equation estimate. Evidently, there is a substantial overlap between the two classes of estimates. Here is an example to be pursued later.

Example 2.1.1. Least Squares. Consider the parametric version of the regression model of Example 1.1.4 with μ(z) = g(β, z), β ∈ R^d, where the function g is known. Here the data are X = {(z_i, Y_i) : 1 ≤ i ≤ n} where Y_1, ..., Y_n are independent. A natural(1) function ρ(X, β) to consider is the squared Euclidean distance between the vector Y of observed Y_i and the vector expectation of Y, μ(z) = (g(β, z_1), ..., g(β, z_n))^T. That is, we take

ρ(X, β) = |Y - μ|^2 = Σ_{i=1}^n [Y_i - g(β, z_i)]^2.    (2.1.5)

Strictly speaking, ρ is not fully defined here and this is a point we shall explore later. But, for convenience, suppose we postulate that the ε_i of Example 1.1.4 are i.i.d. N(0, σ_0^2). Then β parametrizes the model and we can compute (see Problem 2.1.16)

D(β_0, β) = E_{β_0} ρ(X, β) = nσ_0^2 + Σ_{i=1}^n [g(β_0, z_i) - g(β, z_i)]^2,    (2.1.6)

which is indeed minimized at β = β_0 and uniquely so if and only if the parametrization is identifiable. An estimate β̂ that minimizes ρ(X, β) exists if g(β, z) is continuous and

lim{|g(β, z)| : |β| → ∞} = ∞

(Problem 2.1.10). The estimate β̂ is called the least squares estimate. If, further, g(β, z) is differentiable in β, then β̂ satisfies the equation (2.1.2) or, equivalently, the system of estimating equations

Σ_{i=1}^n (∂g/∂β_j)(β̂, z_i) Y_i = Σ_{i=1}^n (∂g/∂β_j)(β̂, z_i) g(β̂, z_i), 1 ≤ j ≤ d.    (2.1.7)

In the important linear case,

g(β, z_i) = Σ_{j=1}^d z_{ij} β_j and z_i = (z_{i1}, ..., z_{id})^T,
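In this linear case the estimating equations (2.1.7) are linear in β and reduce to the normal equations of the next page, which can be solved directly. A minimal sketch; the design matrix and responses below are made-up illustrations.

```python
# Linear least squares via the normal equations Z_D^T Z_D beta = Z_D^T Y.
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
ZD = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])   # design matrix
beta_true = np.array([0.5, -1.0, 2.0])
Y = ZD @ beta_true + rng.normal(0.0, 0.3, size=n)

beta_hat = np.linalg.solve(ZD.T @ ZD, ZD.T @ Y)   # unique when Z_D has full rank d
print(beta_hat)
```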
d > k. lid) as the estimate of q( 8).. = 1'. To apply the method of moments to the problem of estimating 9.i. if we want to estimate a Rkvalued function q(8) of 9. which can be judged on its merits whatever the true P governing X is. These equations are commonly written in matrix fonn (2. 0 Example 2.1..8) the normal equations. and then using h(p!. u5). . Method a/Moments (MOM). The method of moments prescribes that we estimate 9 by the solution of p.Section 2. 1 1j ~_l".. suppose is 1 .d. Suppose Xl.1.Xn are i. Least squares.1 from R d to Rd. Thus.1. we assume the existence of Define the jth sample moment f1j by. provides a first example of both minimum contrast and estimating equation methods. I'd). I'd( 8) are the first d moments of the population we are sampling from. This very important example is pursued further in Section 2. I'd of X.1 Basic Heuristics of Estimation 101 the system becomes (2.. Suppose that 1'1 (8). . Here is another basic estimating equation example.Xi t=l . The motivation of this simplest estimating equation example is the law of large numbers: For X "' Po. say q(8) = h(I'I. . N(O.  n n. Thus.2.. .d. . as X ~ P8. /1j converges in probability to flj(fJ). . we need to be able to express 9 as a continuous function 9 of the first d moments.9) where Zv IIZijllnxd is the design matrix. we obtain a MOM estimate of q( 8) by expressing q( 8) as a function of any of the first d moments 1'1. 1 1 < j < d. 1 < i < n}.i. .2 and Chapter = 6. .1 <j < d if it exists. once defined we have a method of computing a statistic fj from the data X = {(Zi 1 Xi).. . ~ . We return to the remark that this estimating method is well defined even if the Ci are not i. 8 E R d and 8 is identifiable..(8). thus. More generally. In fact. . L... ..
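A short numerical sketch of the method of moments recipe just described, applied to the two-parameter gamma model used as an illustration in the text, where μ_1 = α/λ and σ^2 = α/λ^2 give α̂ = (μ̂_1/σ̂)^2 and λ̂ = μ̂_1/σ̂^2. The simulated sample is an illustrative assumption.

```python
# Method of moments for the gamma(alpha, lambda) model; the data are simulated.
import numpy as np

rng = np.random.default_rng(2)
alpha_true, lam_true = 2.5, 0.8
x = rng.gamma(shape=alpha_true, scale=1.0 / lam_true, size=2000)

mu1 = x.mean()
sigma2 = x.var()                  # second central moment
alpha_hat = mu1 ** 2 / sigma2
lam_hat = mu1 / sigma2
print(alpha_hat, lam_hat)         # close to (2.5, 0.8)
```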
Here is some job category data (Mosteller..X 2 • In this example. l Vk of the population being sampled are known. .. for instance.. neither minimum contrast estimates nor estimating equation solutions can be obtained in closed fonn.2 The PlugIn and Extension Principles We can view the method of moments as an example of what we call the plugin (or substitution) and extension principles. in general.. . . . A = Jl1/a 2 .2 and 0=2 = n 1 EX1. This algorithm and others will be discussed more extensively in Section 2.•. express () as a function of IJ" and fl3 = E(X 3 ) and obtain a method of moment estimator based on /11 and fi3 (Problem 2. or 5. . and 1" ~ E(X') = . Vi = i. . I.>0. ~ iii ~ (X/a)'. .~ (I'l/a)'. Suppose we observe multinomial trials in which the values VI..11). .(X")1 dxd J isquickandM is nonsingular with high probability is the NewtonRaphson algorithm.1. 1'1 for () gives = E(X) ~ "/ A. Frequency Plugin(2) and Extension. A>O. Here k = 5.·)  l~t. the method of moment estimator is not unique. As an illustration consider a population of men whose occupations fall in one of five different job categories.(1 + ")/A 2 Solving . Pi is the proportion of men in the population in the ith job category and Njn is the sample proportion in this category. then setting (2. f(u. i = 1.102 Methods of Estimation Chapter 2 For instance. two other basic heuristics particularly applicable in the i. In this case f) ~ ('" A).1. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments.i. An algorithm for estimating equations frequently used when computationofM(X.·) fli =D'I'(X.1. 2. A = X/a' where a 2 = fl. 1968). There are many algorithms for optimization and root finding that can be employed.Pk are completely unknown.. . x>O. 3.)]xO1exp{Ax}. I.. 0 Algorithmic issues We note that.4 and in Chapter 6. It is defined by initializing with eo. 4.. .5.1 0) . in particular Problem 6.d. the proportion of sample values equal to Vi. then the natural estimate of Pi = P[X = Vi] suggested by the law of large numbers is Njn. We can. X n be Ltd. as X and Ni = number of indices j such that X j = Vi. If we let Xl. case..6. . 2. Example 2. but their respective probabilities PI. consider a study in which the survival time X is modeled to have a gamma distribution.10. .3. with density [A"/r(. '\)..
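A minimal sketch of the frequency plug-in principle using the five job-category counts quoted for these data (n = 708) and the parameter ν(P) = (p_4 + p_5) - (p_2 + p_3), the blue-collar minus white-collar proportion.

```python
# Frequency plug-in: replace population proportions p_i by sample proportions N_i/n.
import numpy as np

N = np.array([23, 84, 289, 217, 95])     # counts for categories 1,...,5
n = N.sum()                               # 708
p_hat = N / n                             # plug-in estimates of p_1,...,p_5
nu_hat = (p_hat[3] + p_hat[4]) - (p_hat[1] + p_hat[2])
print(p_hat.round(2), round(nu_hat, 2))   # nu_hat is about -0.09
```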
pl . We would be interested in estimating q(P"".03 2 84 0. The frequency plugin principle simply proposes to replace the unknown population frequencies PI. Next consider the marc general problem of estimating a continuous function q(Pl' .Pk) = (~l .Section 2.Ni =708 2:::o.11) to estimate q(Pll··.31 5 95 0. That is. N 3 ) has a multinomial distribution with parameters (n.1 Basic Heuristics of Estimation 103 Job Category l I Ni Pi ~ 23 0..0). and Then q(p) (p. Pk) of the population proportions. . P3) given by (2.44 . 1 Nk/n. If we use the frequency substitution principle..»i=1 for Danish men whose fathers were in category 3. 1 < think of this model as P = {all probability distributions Pan {VI. together with the estimates Pi = NiJn. . let P dennte p ~ (P". use (2.. . P3 PI = 0 = (1. . then (NIl N 2 . . we are led to suppose that there are three types of individuals whose frequencies are given by the socalled HardyWeinberg proportions 2 . can be identified with a parameter v : P ~ R. v. v(P) = (P4 + Ps) and the frequency plugin principle simply says to replace P = (PI. . Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type. Equivalently.. P2 ~ 20(1. Pk do not vary freely but are continuous functions of some ddimensional parameter 8 = ((h. that is. . the estimate is which in our case is 0.. the difference in the proportions of bluecollar and whitecollar workers.0)2. (2.Xn . in v(P) by 0 P ). If we assume the three different genotypes are identifiable.13 n2::.. Suppose .... i < k.0.12) If N i is the number of individuals of type i in the sample of size n.Pk by the observable sample frequencies Nt/n. categories 4 and 5 correspond to bluecollar jobs. . For instance. 1 ()d) and that we want to estimate a component of 8 or more generally a function q(8).. whereas categories 2 and 3 correspond to whitecollar jobs. HardyWeinberg Equilibrium. Now suppose that the proportions PI' .41 4 217 0.1.1..Pk) with Pi ~ PIX = 'Vi].53 = 0.1. . lPk). .P2.. . suppose that in the previous job category table. 1~!." .4. .1... ..12 3 289 0.0 < 0 < 1..}}. the multinomial empirical distribution of Xl.09.12). .. Example 2. .(P2 + P3). + P3). Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles.Ps) = (P4 + Ps) . .
Pk are continuous functions of 8.1. case if P is the space of all distributions of X and Xl. Now q(lJ) can be identified if IJ is identifiable by a parameter v . a natural estimate of P and v(P) is a plugin estimate of v(P) in this nonparametric context For instance..(IJ). Nk) . We can think of the extension principle alternatively as follows. 1  V In general.13) Given h we can apply the extension principle to estimate q( 8) as. I) and ! F'(a) " = inf{x. (2. as X p. consider va(P) = [Fl (a) + Fu 1(a )1.104 Methods of Estimation Chapter 2 we want to estimate fJ. .Pk(IJ»). (2.13) and estimate (2.1.Xn ) =h N.. where a E (0.h(p) and v(P) = v(P) for P E Po.1. . thus. . Because f) = ~. q(lJ) with h defined and continuous on = h(p. E A) n./P3 and.. .4) and 5 how to choose among such estimates. suppose that we want to estimate a continuous Rlvalued function q of e. we can usually express q(8) as a continuous function of PI.Pk.. . Note. however. . Let be a submodel of P. ..16) . In particular.d. Fu'(a) = sup{x.E have an estimate P of PEP such that PEP and v : P 4 T is a parameter.'" . we can use the principle we have introduced and estimate by J N l In. in the Li. "·1 is.. We shall consider in Chapters 3 (Example 3.1. Po ~ R given by v(PO) = q(O).1. X n are Ll. by the law of large numbers. If w. that we can also Nsfn is also a plausible estimate of O. . I: t=l (2. P > R where v(P) .). the frequency of one of the alleles. If PI. . T(Xl. F(x) < a}. The plugin and extension principles can be abstractly stated as follows: Plugin principle.. the representation (2.. (2. the empirical distribution P of X given by    f'V 1 n P[X E Al = I(X.14) are not unique.. .15) ~ ".d.4.1..1. ( . 0 write 0 = 1 . .13) defines an extension of v from Po to P via v. . that is. . then v(P) is the plugin estimate of v. Then (2. F(x) > a}.14) As we saw in the HardyWeinberg case.. if X is real and F(x) = P(X < x) is the distribution function (dJ. .
With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plugin estimates for multinomial trials because I'j(8) where =L i=l k vfPi(8) = h(p(8» = viP. .' Nl = h N () ~ ~ = v(P) and P is the empirical distribution.Section 2. P 'Ie Po. II to P. (P) is the ath population quantile Xa.1. and v are continuous. then the plugin estimate of the jth moment v(P) = f. P E Po. However.1. Remark 2. e e Remark 2.1. Let viP) be the mean of X and let P be the class of distributions of X = 0 + < where B E R and the distribution of E ranges over the class of distributions with mean zero.1. In this case both VI(P) = Ep(X) and V2(P) = "median of P" satisfy v(P) = v(P). the sample median V2(P) does not converge in probability to Ep(X).tj = E(Xj) in this nonparametric .1 Basic Heuristics of Estimation 105 V 1. case (Problem 2. if X is real and P is the class of distributions with EIXlj < 00.d. then v(P) is an extension (and plugin) estimate of viP).i.12) and to more general method of moment estimates (Problem 2. they are mainly applied in the i.1. As stated. ~ I ~ ~ Extension principle.17) where F is the empirical dJ.i.13).1. because when P is not symmetric.14. is called the sample median.2. A natural estimate is the ath sample quantile . For instance. Here x ~ = median. If v: P ~ T is an extension of v in the sense that v(P) = viP) on Po.4. 1/. Suppose Po is a submodel of P and P is an element of P but not necessarily Po and suppose v: Po ~ T is a p~eter. ~ ~ . but only VI(P) = X is a sensible estimate of v(P). v(P8 ) = q(8) = h(p(8)) is a continuous map from to Rand D( P) = h(p) is a continuous map from P to R. then Va. Here x!. these principles are general.1. = Lvi n i= 1 k . let Po be the class of distributions of X = B + E where B E R and the distribution of E ranges over the class of symmetric distributions with mean zero. casebut see Problem 2. i=1 k /1j ~ = ~ LXI 1=1 I n . This reasoning extends to the general i. is a continuous map from = [0.) h(p) ~ L vip. For instance. Pe as given by the HardyWeinberg p(O). The plugin and extension principles must be calibrated with the target parameter.1. (P) is called the population (2... context is the jth sample moment v(P) = xjdF(x) ~ 0.1.3 and 2. ~ For a second example.d.1 I:~ 1 Xl. The plugin and extension principles are used when Pe. = v(P). in the multinomial examples 2.
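A small sketch of plug-in estimation with the empirical distribution: sample quantiles and sample moments estimate their population counterparts. The simulated sample is an illustrative assumption.

```python
# Plug-in estimates computed from the empirical distribution of a simulated sample.
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(500)

median_hat = np.median(x)            # plug-in estimate of the population median
q90_hat = np.quantile(x, 0.90)       # plug-in estimate of the 0.90 quantile
mu2_hat = np.mean(x ** 2)            # plug-in estimate of E(X^2)
print(median_hat, q90_hat, mu2_hat)
```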
" . . and there are best extensions. The method of moments can lead to either the sample mean or the sample variance.) = ~ (0 + I). estimation of B real with quadratic loss and Bayes priors lead to procedures that are data weighted averages of (J values rather than minimizers of functions p( (J. because Po = P(X = 0) = exp{ O}. Plugin is not the optimal way to go for the Bayes. minimax. . we find I' = E. Unfortunately.5. they are often difficult to compute. . I .d.4. . However. 2. Thus. The method of moments estimates of f1 and a 2 are X and 2 0 a.1 because in this model B is always at least as large as X (n)' 0 As we have seen.1.1. Suppose X I. Suppose that Xl. for large amounts of data. )" I I • . a saving grace becomes apparent in Chapters 5 and 6. . where Po is n.3 where X" . What are the good points of the method of moments and frequency plugin? (a) They generally lead to procedures that are easy to compute and are. a frequency plugin estimate of 0 is Iogpo. C' X n are ij. • . This minimal property is discussed in Section 5. as we shall see in Chapter 3. X n are the indicators of a set of Bernoulli trials with probability of success fJ. ".LI (8) = (J the method of moments leads to the natural estimate of 8.4. Moreover.2. Example 2. For example... Estimating the Size of a Population (continued). there are often several method of moments estimates for the same q(9). For instance.I and 2X . If the model fits. 'I I[ i['l i . optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions.1. a 2 ) sample as in Example 1. (b) If the sample size is large. X n is a N(f1. [ [ ..7. X(l . then B is both the population mean and the population variance. we may arrive at different types of estimates than those discussed in this section. 0 = 21' .2 with assumptions (1)(4) holding. See Section 2. these estimates are likely to be close to the value estimated (consistency). 0 Example 2. the frequency of successes.6. We will make a selection among such procedures in Chapter 3.1. X). valuable as preliminary estimates in algorithms that search for more efficient estimates. Because we are dealing with (unrestricted) Bernoulli trials.Ll). or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. O}. •i .1. these are the frequency plugin (substitution) estimates (see Problem 2. therefore.. ..X). those obtained by the method of maximum likelihood.1 is a method of moments estimate of B. Algorithms for their computation will be introduced in Section 2. . Discussion. a special type of minimum contrast and estimating equation method. Remark 2. as we shall see in Section 2. In Example 1. This is clearly a foolish estimate if X (n) = max Xi> 2X . Example 2. .l [iX. the plugin principle is justified. It does turn out that there are "best" frequency plugin estimates.8) we are led by the first moment to the estimate. = 0]. Because f. To estimate the population variance B( 1 . U {I. (X. . if we are sampling from a Poisson population with parameter B.5.106 Methods of Estimation Chapter 2 Here are three further simple examples illustrating reasonable and unreasonable MOM estimates.4..3. When we consider optimality principles. X.
. f. In the multinomial case the frequency plugin estimators are empirical PIEs based on v(P) = (Ph . When P is the empirical probability distribution P E defined by Pe(A) ~ n.2. . a least squares estimate of {3 is a minimizer of p(X.O) = O.2 Minimum Contrast Estimates and Estimating Equations 107 Summary. Suppose X . In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6. A minimum contrast estimator is a minimizer of p(X..ld)T where f. Zi). Let Po and P be two statistical models for X with Po c P.(3) ~ L[li . The general principles are shown to be related to each other.. D(P) is called the extensionplug·in estimate of v(P). where 9 is a known function and {3 E Rd is a vector of unknown regression coefficients.2 2. .g((3.lj = E( XJ\ 1 < j < d. 1 < j < k.. P. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters. P E Po. ~ ~ ~ ~ ~ v we find a parameter v such that v(p.Zi)j2. is parametric and a vector q( 8) is to be estimated.ll. 0 E e.lJ) ~ E'o"p(X. then is called the empirical PIE. z) = ZT{3. when g({3.) = q(O) and call v(P) a plugin estimator of q(6). 0).Pk). For the model {Pe : () E e} a contrast p is a function from X x H to R such that the discrepancy D(lJo. It is of great importance in many areas of statistics such as the analysis of variance and regression theory.1 L:~ 1 I[X i E AI. where ZD = IIziJ·llnxd is called the design matrix. ~ ~ ~ ~ 2. . For data {(Zi. 0 E e c Rd is uniquely minimized at the true value 8 = 8 0 of the parameter. I < i < n. . Method of moment estimates are empirical PIEs based on v(P) = (f..1 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS Least Squares and Weighted Least Squares Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement. 6). The plugin estimate (PIE) for a vector parameter v = v(P) is obtained by setting fJ = v( P) where P is an estimate of P. li) : I < i < n} with li independent and E(Yi ) = g((3.. where Pj is the probability of the jth category. the associated estimating equations are called the normal equations and are given by Z1 Y = ZtZD{3. For this contrast. An extension ii of v from Po to P is a parameter satisfying v(P) = v(P). and the contrast estimating equations are V Op(X.Section 2.. If P is an estimate of P with PEP.. If P = PO..
(Z" Y.(z..E(Y.zt}.2. Note that the joint distribution H of (C}. • . i=l (2. Y) ~ PEP = {Alljnintdistributions of(Z. . fj) because (2. as (Z.2). . This is frequently the case for studies in the social and biological sciences.. 1_' .=.13) n n i=1  L Varp«.) = E(Y. .108_~ ~_ _~ ~ ~cMc.) = 0. . then 13 has an interpretation as a parameter on p. .g(l3.2..o=nC.z).2) Are least squares estimates still reasonable? For P in the semiparametric model P. which satisfies (but is not fully specified by) (2..) = g(l3. 1 < i < j < n 1 > 0\ . Zn»)T is 1.2. aJ) and (3 ranges over R d or an open subset. That is.2. where Ci = g(l3.2.m::a.»)2.2. (Zi. The estimates continue to be reasonable under the GaussMarkov assumptions. 13 E R d }. E«. " (2. This follows "I .C=ha:::p::.2.) + L[g(l3 o.. I) are i..3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3..) .).5) (2.d.E:::'::'. z.6) is difficult Sometimes z can be viewed as the realization of a population variable Z. 1 < j < n. we can compute still Dp(l3o.3) continues to be valid.Y) such thatE(Y I Z = z) = g(j3.zn)f is 11.1. Z could be educationallevel and Y income.) + <" I < i < n n (2. . . " .i. z.::'hc. that is. Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ. 1 < i < n.g(l3.. . and 13 ~ (g(l3. I<i<n.z.). j.1. I Zj = Zj).2. a 2 and H unknown. E«.2. The contrast p(X. NCO.. j. . that is.6) Var( €i) = u 2 COV(fi.)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3. 13) = LIY. The least squares method of estimation applies only to the parameters {31' .:1. Yi). z.::2 : In Example 2.4) (2. I' .) i.z.1 we considered the nonlinear (and linear) Gaussian model Po given by Y.2.f3d and is often applied in situations in which specification of the model beyond (2.g(l3.i. Z)f. .:. < i < n.1. as in (a) of Example 1. 13 = I3(P) is the miniml2er of E(Y .En) is any distribution satisfying the GaussMarkov assumptions.z. is modeled as a sample from a joint distribution. For instance.4 with <j simply defined as Y. (3)  E p p(X. If we consider this model. the model is semiparametric with {3. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj..g(l3.) =0.g(l3. .::..4}{2. . (2.:0::d=':Of. = 0.d. ..
I to each of the n pairs (Zil Yi)...2. see Rnppert and Wand (1994). (2.8) d f30 = («(3" .6). (2. As we noted in Example 2.2.41.2. and Volnme n. y)T has a nondegenerate multivariate Gaussian distribution Nd+1(Il. In this case we recognize the LSE {3 as simply being the usual plugin estimate (3(P). We continue our discussion for this important special case for which explicit fonnulae and theory have been derived.zo). in conjnnction with (2. {3d).7) (2. Yi). . z) = zT (3. is called the linear (multiple) regression modeL For the data {(Zi.2. E(Y I Z = z) can be written as d p(z) where = (30 + L(3jZj j=1 (2. as we have seen in Section lA. Yi) are a sample from a (d + 1)dimensional distribution and the covariates that are the coordinates of Z are continuous.4.4.2. we can approximate p(z) by p(z) ~ p(zo) +L j=1 d a: a (zo)(z . . we are in a situation in which it is plausible to assume that (Zi. n} we write this model in matrix fonn as Y = ZD(3 + €    where Zv = IIZij 11 is the design matrix. = py . In that case. a further modeling step is often taken and it is assumed that {Zll . where P is the empirical distribution assigning mass n.Section 2. We can then treat p(zo)  nnknown (30 and identify (zo) with (3j to give an approximate (d 1 and Zj as before and linear model with Zo = t!.2. J for Zo an interior point of the domain. E1=1 g~ (zo)zo as an + 1) dimensional d p(z) = L(3jZj.1).1.3 and 6.5. j=O This type of approximation is the basis for nonlinear regression analysis based on local polynomials. (2) If as we discussed earlier.. For nonlinear cases we can use numerical methods to solve the estimating equations (2.2.1. which. Fan and Gijbels (1996).. .7). j=1 (2. The linear model is often the default model for a number of reasons: (I) If the range of the z's is relatively small and p(z) is smooth. See also Problem 2.5) and (2..1 the most commonly used 9 in these models is g({3..L{3jPj. Sections 6.4).2.1. i = 1. . Zd.2 Minimum Contrast Estimates and Estimating Equations 109 from Theorem 1. and Seber and Wild (1989). 2:).9) . I)/.
Y is the amount of phosphorus found in com plants grown for 38 days in the different samples of soil. is independent of Z and has a N(O. In the measurement model in which Yi is the detennination of a constant Ih. p.) ~ 0.2.11) When the Zi'S are not all equal.1.1. {3. g(z. whose solution is .il2Zi) = 0.12) .(3zz. the least squares estimate.1 '. 139). If we run several experiments with the same z using plants and soils that are as nearly identical as possible. the solution of the normal equations can be given "explicitly" by (2. Nine samples of soil were treated with different amounts Z of phosphorus.2. necessarily.il. Example 2. il.I 'i. ( (J2 2 ) distribution where = ayy Therefore. il. we assume that for a given z.) = 1 and the normal equation is L:~ 1(Yi .2. . We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil. 0 Ii.il.25. Following are the results of an experiment to which a regression model can be applied (Sned. see Problem 2. g(z. .8). we get the solutions (2.1.no Furthennore.) = il" a~. . L 1 that. We have already argued in Examp}e 2. ~ = (lin) L:~ 1Yi = ii. < Methods of Estimation Chapter 2 =y  I'(Z) Eyz:Ez~Ezy. The nonnal equations are n i=l n ~(Yi .ill . For this reason. Y is random with a distribution P(y I z).2.) = O. I I. 1 < i < n. We want to estimate f31 and {32. we will find that the values of y will not be the same. il. if the parametrization {3 ) ZD{3 is identifiable.2. i I . the relationship between z and y can be approximated well by a linear equation y = 131 + (32Z provided z is restricted to a reasonably small interval. 1967. Zi. Example 2. given Zi = 1 < i < n. In that case. Estimation of {3 in the linear regression model. d = 1.ecor and Cochran. LZi(Yi .2. i=l (2. For certain chemicals and plants. The parametrization is identifiable if and only ifZ D is of rank d or equivalently if Z'bZD is affulI rank d.2.2. exists and is unique and satisfies the Donnal equations (2. we have a Gaussian linear regression model for Yi. 1 64 4 71 5 54 9 81 11 76 13 23 77 93 23 95 28 109 The points (Zi' Yi) and an estimate of the line 131 + 132z are plotted in Figure 2. Zi 1'.10) Here are some examples.
n} and sample regression line for the phosphorus data.. .(/31 + (32Zi)] are called the residuals of the fit. . This connection to prediction explains the use of vertical distance! in regression.. suppose we select p realvalued functions of z.13) where Z ~ (lin) I:~ 1 Zi. 0 ~ ~ ~ ~ ~ Remark 2.. that is p I'(z) ~ 2:). For instance. Yi). Zn.Yi) and a line Y ~ a + bz vertically by di = Iy. &atter plot {(Zi. j=1 Then we are still dealing with a linear regression model because we can define WpXI . if we measure the distance between a point (Zi.. . The linear regression model is considerably more general than appears at first sight.. . P > d and postulate that 1'( z) is a linear combination of 91 (z).1.42.4. . The regression line for the phosphorus data is given in Figure 2. Here {31 = 61. . n. i = 1. . ..2.. Geometrically.. .2. €6 is the residual for (Z6.Yn on ZI.2. Yn). (zn.132 z ~ ~ (2. ..3. . The line Y = /31 + (32Z is an estimate of the best linear MSPE predictor Ul + b1Z of Theorem 1. then the regression line minimizes the sum of the squared distances to the n points (ZI' Yl).1.2 Minimum Contrast Estimates and Estimating Equations 111 y 50 • o 10 20 x 30 Figure 22. .Section 2. . and 131 = fi .. 91. .9...58 and 132 = 1.9p. and fi = (lin) I:~ 1 Yi· The line y = /31 + f32z is known as the sample regression line or line of best fit of Yl.. Y6). The vertical distances €i = fYi .(a + bzi)l.(Z). 9p( z). i = 1. .1..
which for given Yi = yd.filii minimizes ~  I i i L[1Ii ...5) fails. and g({3.. • I' "0 r .2. 0 < j < 2.14) where (J2 is unknown as before. 1<i <n Wij = 9j(Zi).wi 1 < . We need to find the values {3l and /32 of (3. Thus.2.2 and many similar situations it may not be reasonable to assume that the variances of the errors Ci are the same for all levels Zi of the covariate variable. gp( z)) T as our covariate and consider the linear model Yi = where L j=l p ()jWij + Ci.wi) g((3. I and the Y i satisfy the assumption (2.8.)]2 i=l i=l ~ n (2. . Example 2.wi ..2. i = 1. 1 n. . That is. we can write (2. Note that the variables . iI L Vi[Yi i=l n ({3.3.2.z. .. For instance. Consider the case in which d = 2. then = WW2/Wi = . we may be able to characterize the dependence of Var( ci) on Zi at least up to a multiplicative constant. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic). < n..2.2.wi. if we setg«(3.2.g(. Zi) + Ei . Weighted Linear Regression..II 112 Methods of Estimation Chapter 2 (91 (z). Zi) = {3l + fhZi. . but the Wi are known weights. Zi2 = Zi. Yi = 80 + 8 1Zi + 82 + Cisee Problem 2. Zi)/. In Example 2. and (32 that minimize ~ ~ "I.. We return to this in Volume II.. However.wi. [Yi . + /3zzi)J2 (2.5). z.".= g({3. .2.. Weighted least squares.17) .. The method of least squares may not be appropriate because (2.24 for more on polynomial regression..16) as a function of {3.2. I . The weighted least squares estimate of {3 is now the value f3.....2.wi  _ g((3. if d = 1 and we take gj (z) = zj. . . Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter.. Zi) = (2. Zil = 1. _ Yi . .. 1 <i < n Yi n 1 .Zi)]2 = L ~. we arrive at quadratic regression.zd + ti.'l are sufficient for the Yi and that Var(Ed..g«(3.= y. fi = <d.15) ...
.7).B+€ can be transformed to one satisfying (2. .2.2. We can also use the results on prediction in Section 1.27) that f3 satisfy the weighted least squares normal equations ~ ~ where W = diag(wI. a__ Cov(Z"'.(L:~1 UlYi)(L:~l uizd Li~] Ui Zi . Zi) = z.~ UiYi .. Thus. using Theorem 1.4 as follows.. i ~ I.4H2.Y") ~(Zi.2.1.. When ZD has rank d and Wi > 0. we may allow for correlation between the errors {Ei}.26. when g(.2..Y.n where Ui n = vi/Lvi. Y*) denote a pair of discrete random variables with possible values (z" Yl).2. Var(Z") and  _ L:~l UiZiYi .2.1 leading to (2. the .16) for g((3. .~ UiZi· n n i"~l I" n n F"l This computation suggests.8)... .3. Then it can be shown (Problem 2. ~ .6)...1) and (2.1.2. By following the steps of Exarnple 2. wn ) and ZD = IIZijllnxd is the design matrix.2. that weighted least squares estimates are also plugin estimates.20).18) ~ ~ ~I" (31 = E(Y') . i i=l = 1. as we make precise in Problem 2.n.4.19) and (2.)] ~Ui. (3 and for general d..13 minimizing the least squares contrast in this transformed model is given by (2. . That is. we find (Problem 2. 0 Next consider finding the.2. .Section 2. .B that minimizes (2. 1 < i < n. .2.1 7) is equivalent to finding the best linear MSPE predictor of y .(3.2.B.(32E(Z') ~ . z) = zT.B. Yn) and probability distribution given by PI(Z".2. we can write Remark 2. then its MSPE is ElY" ~ 1l1(Z")f ~ :L UdYi i=l n «(3] + (32 Zi)1 2 It follows that the problem of minimizing (2. If Itl(Z*) = {31 given by + {32Z* denotes a linear predictor of y* based on Z .2.2. Let (Z*.. . (Zn. suppose Var(€) = a 2W for some invertible matrix W nxn..28) that the model Y = ZD. Moreover.(Li~] Ui Z i)2 n 2 n (2. More generally. Y*) V.2 Minimum Contrast Estimates and Estimating Equations 113 where Vi = l/Wi_ This problem may be solved by setting up analogues to the normal equations (2. .
.
.
.
.
2. where ).2.0). p(X.28) If 2n. the rate of arrival.2. the MLE does not exist if 112 + 2n3 = O. .) = ·j1 e'"p. 2 and 3. Because . is zero. 0) ~ 20(1 .7. . p(3. O)p(l. Here are two simple examples with (j real.2.. In general. and n3 denote the number of {Xl. represents the expected number of arrivals in an hour or.2. the maximum likelihood estimate exists and is given by 8(x) = 2n. Nevertheless. 2. (2. 1. and as we have seen in Example 2. Here X takes on values {O.27) that are not maxima or only local maxima. + n. which has the unique solution B = ~. X2 = 2. + n.l. respectively. let nJ. Then the same calculation shows that if 2nl + nz and n2 + 2n3 are both positive. I). Let X denote the number of customers arriving at a service counter during n hours.x=O. then I • Lx(O) ~ p(l.6.lx(0) = 5 1 0' .2. there may be solutions of (2. Example 2. equivalently.2.ny l. . A is an unknown positive constant and we wish to estimate A using X. OJ = (1  0)' where 0 < () < 1 (see Example 2. x. 1). . 0)p(2.2. the likelihood is {1 . If we observe a sample of three individuals and obtain Xl = 1.. Similarly. so the MLE does not exist because = (0. .4). Evidently.. Consider a popUlation with three kinds of individuals labeled 1. 8' 80. then X has a Poisson distribution with parameter nA.27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section.. n2. ]n practice. and 3 and occurring in the HardyWeinberg proportions p(I.2. p(2.(1. 0 e Example 2..118 Methods of Estimation Chapter 2 which again enables us to analyze the behavior of B using known properties of sums of independent random variables.0) The likelihood equation is 8 80 Ix (0) ~ = 5 1 0 10 =0. x n } equal to 1. 2n (2.>. X3 = 1. If we make the usual simplifying assumption that the arrivals form a Poisson process. . which is maximized by 0 = 0.29) '(j ..1.2. situations with f) well defined but (2. } with probabilities.27) doesn't make sense.22) and (2. 0) = 20'(1. the dual point of view of (2.OJ'".5.O) = 0'..0)' < ° for all B E (0. ~ maximizes Lx(B).
j = 1. 31=1 By (2..2. for an experiment in which we observe nj ~ 2:7 1 I[X. A sufficient condition. .. = L. 8' (2.. trials in which each trial can produce a result in one of k categories.8B " 1=13 k k =0. ~ n . Example 2. (see (2.7. = jl.o. k. . .30)).. 0 To apply the likelihood equation successfully we need to know when a solution is an MLE.2. = j) be the probability J of the jth category.ekl with kl Ok ~ 1. thus. the MLE must have all OJ > 0. 8Bk/8Bj (2.\ I 0..2.8.2. ~ ~ .6. this estimate is the MLE of ). this is well known to be equivalent to ~ e. fJE6= {fJ:Bj >O'LOj ~ I}. is that l be concave in If l is twice differentiable. As in Example 1. p(x. J = I. = x/no If x is positive. and the equation becomes (BkIB j ) n· OJ = . L. Multinomial Trials.d..2. consider an experiment with n Li. We assume that n > k ~ 1. and let N j = L:~ 1 l[Xi = j] be the number of observations in the jth category..Section 2. 8) = 0 if any of the 8j are zero.32) We first consider the case with all the nj positive. e..n. 8B.kl. let B = P(X.30) for all This is the condition we applied in Example 2.2 Minimum Contrast Estimates and Estimating Equations 119 The likelihood equation is which has the unique solution).. However. Then p(x. .6.lx(B) <0.. A similar condition applies for vector parameters.31 ) To obtain the MLE (J we consider l as a function of 8 1 . 80. Then.32). and k j=1 k Ix(fJ) = LnjlogB j . If x = 0 the MLE does not exist. Let Xi = j if the ith trial produces a result in the jth category. familiar from calculus. the maximum is approached as . j=d (2.1.LBj j=1 • (2.2. fJ) = TI7~1 B7'. and must satisfy the likelihood equa~ons ~ 8B1x(fJ) = 8B 3 8 8 " n. .2.logB.32) to find I..2. .
) = gi«(3).(Xi .  seen and shall see more in Section 6. Zi)J[l:} . version of this example will be considered in the exponential family case in Section 2. (2.34) g«(3.2.28. 0. zn) 03W. Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix.2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood.1 holds and X = (Y" . 1 Yn . i = 1.6.33) It follows that in this nj > 0. Summary.2..lV.30.. In Section 2.2 ) with J1. Using the concavity argument.). ~ g((3..2 both unknown.2. 1 < i < n.1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV. . see Problem 2. as maximum likelihood estimates for f3 when Y is distributed as lV. Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O)... As we have IIV.1.2. j = 1. gi. ~ g«(3. these estimates viewed as an algorithm applied to the set of data X make sense much more generally.120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8). where Var(Yi) does not depend on i. o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2. . . we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n. YnJT.1 L:~ . N(lll 0. .. . See Problem 2.. i=l g«(3. Thus. ((g((3.IV. z. and 0.2. where E(V. Then Ix «(3) log IT ~'P (V. OJ > ~ 0 case. n. we check the concavity of lx(O): let 1 1 <j < k .. 1 < j < k.11(a)). are known functions and {3 is a parameter to be estimated from the independent observations YI .2. wi(5). Suppose that Xl.2.202 L.d. This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • . X n are Li. g«(3. Example 2.Zi)]2.' . Then (} with OJ = njjn. we can consider minimizing L:i.. The 0 < OJ < 1. More generally.3. ... Zi)) 00 ao n log 2 I"" 2 "2 (21Too) .. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3. Suppose the model Po of Example 2. . is still the unique MLE of fJ.1).1. Next suppose that nj = 0 for some j. least squares estimates are maximum likelihood for the particular model Po.g«(3. k.X)2 (Problem 2. 'ffi f..9.gi«(3) 1 .. then <r <k I. . zi)1 . .. Zi).
bE {a..1) lim{l(li) : Ii ~ ~ De} = 00.2.6.3..1. or 8 1nk diverges with 181nk I . z=1=1 ~ 2. e e I"V e De= {(a. = RxR+ and ee e. .4 and Corollaries 1. Proof. oo]P. for a sequence {8 m } of points from open.b) . a8 is the set of points outside of 8 that can be obtained as limits of points in e.3. In the case of independent response variables Yi that are modeled to have a N(9i({3). as k .1 ).6.3. Suppose 8 c RP is an open set. Extensions to weighted least squares.2 and other exponential family properties also playa role.1. we define 8 1n . b). See Problem 2. m. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2. In Section 2. m. if X N(B I . 8). (m. Properties that derive solely fTOm concavity are given in Propositon 2.3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil. (m. (m. m). in the N(O" O ) case.Section 2. For instance. Suppose also that e t R where e c RP is open and 1 is (2.3. (h). . That is. b). Let &e = be the boundary of where denotes the closure of in [00.a= ±oo. including all points with ±oo as a coordinate.1 only. Suppose we are given a function 1: continuous.2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. (a. it is shown that the MLEs coincide with the LSEs. In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d. 2 e Lemma 2. are given. b) : aER.ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e.3.00. which are appropriate when Var(Yj) depends on i or the Y's are correlated. . (a. though the results of Theorems 1. Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}.I ) all tend to De as m ~ 00.6. These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence. We start with a useful general framework and lemma. Concavity also plays a crucial role in the analysis of algorithms in the next section.3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly. For instance.. In general. Formally. (}"2) distribution.1 and 1.oo}}.3 and 1.5. where II denotes the Euclidean norm. This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI.6.00.a < b< oo} U {(a.zid.
' L n . Applications of this theorem are given in Problems 2. From (B. [ffunher {x (II) e = e e t 88. a contradiction.3.1 hold. (ii) The family is of rank k. Suppose the cOlulitions o/Theorem 2. and 0.00.3..1.2).27). '10) for some reference '10 E [ (see Problem 1. If 0.)) 0 Themem 2. " .3. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data. thus.3. then lx(11 m ) . then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT. then the MLE8(x) exists and is unique.1.12. Suppose P is the canonical exponential/amity generated by (T. Write 11 m = Am U m .3) I I' (b) Conversely.3. if lx('1) logp(x.)) > ~ (fx(Od +lx (0. We show that if {11 m } has no subsequence converging to a point in E. then the MLE doesn't exist and (2.'1 . Let x be the observed data vector and set to = T(x). then Ix lx(O.3. which implies existence of 17 by Lemma 2.1.3.3) has i 1 !. U m = Jl~::rr' Am = .3. is unique. Without loss of generality we can suppose h(x) = pix. with corresponding logp(x. Furthennore.1. i .8 and 2. II). II) is strictly concave and 1. ! " .  (hO! + 0. .1. .122 Methods of Estimation Chapter 2 Proposition 2.).to. and is a solution to the equation (2. We can now prove the following. ~ Proof. We give the proof for the continuous case. Then. Existence and Uniqueness o/the MLE ij. Proofof Theorem 2. lI(x) =  exists. Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1.3. " We. . I I.2) then the MLE Tj exists. II E j.1. Corollary 2.. we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) .3.. /fCT is the convex suppon ofthe distribution ofT (X). = .3. By Lemma 2.(11) ~ 00 as densities p(x.9) we know that II lx(lI) is continuous on e. h) and that (i) The natural parameter space. no solution. are distinct maximizers. if to doesn't satisfy (2.'1) with T(x) = 0.6. open C RP. (a) If to E R k satisfies!!) Ifc ! 'I" 0 (2. Suppose X ~ (PII . E.3. is open.3.
0 ° '* '* PrOD/a/Corollary 2.5. Write Eo for E110 and Po for P11o . both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets. Suppose X"".6.2) fails.3. The equivalence of (2.3. If ij exists then E'I T ~ 0 E'I(cTT) ~ 0.3.3.6.3. Evidently. . Then because for some 6 > 0. Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists.9. I' E R.se density is a general phenomenon. that is. um~.3. Suppose the conditions ofTheorem 2. Theorem 2. (1"2 > O.4. .Xn are i. fOT all 1]. So we have Case 2: Amk t A. It is unique and satisfies x (2.3.1. thus. So In either case limm. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist.3. contradicting the assumption that the family is of rank: k.i.3) by TheoTem 1.1 follow. Case 1: Amk t 111]=11.2. existence of MLEs when T has a continuous ca. lIu m ll 00.* P'IlcTT = 0] = I. T(X) has a density and. CT = C~ and the MLE always exists. So.J = 00.1.Section 2. Por n = 1. iff. 0 Example 2.3. Then AU ¢ £ by assumption. this is the exponential family generated by T(X) = 1 Xi.k Lx( 11m. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0. Po[uTT(X) > 61 > O.d. In fact.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it. Nonexistence: if (2. I Xl) and 1.2) and COTollary 2. if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows.3).3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1. u mk t u. The Gaussian Model..1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2. By (B. N(p" ( 2 ). CT = R )( R+ FOT n > 2. which is impossible. As we observed in Example 1. Then. 0 (L:7 L:7 CJ. t u. This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0. for every d i= 0.
. o Remark 2.2) holds..4) and (2.1. Thus.3.)e>"XxPI. with density 9p.4 we will see that 83 is. How to find such nonexplicit solutions is discussed in Section 2.T(X) = A(.3 we know that E.3.3.2 that (2. From Theorem 1.= r(jJ) A log X (2.5) ~=X A where log X ~ L:~ t1ogXi. liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2.3.7. with i'._t)T. The likelihood equations are equivalent to (problem 2. Here is an example. When method of moments and frequency substitution estimates are not unique. exist iff all Tj > O. Thus. The TwoParameter Gamma Family.(X) = r~. For instance.124 Methods of Estimation Chapter 2 Proof. and one of the two sums is nonempty because c oF 0.I and verify using Theorem 2..2 follows.3.2(b» r' log . It is easy to see that ifn > 2.. if T has a continuous case density PT(t).3.3.6.3.3).. We conclude from Theorem 2.13 and the next example). > O.3..5) have a unique solution with probability 1. Example 2.3. .1.3. lit ~ . the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2. LXi).. in a certain sense.3. thus.ln.d.1. h(x) = XI. .4) (2.3.6. To see this note that Tj > 0..4 and 2. I <t< k .. Thasadensity.2. I < j < k.>.2(a).1). I < j < k.. the best estimate of 8.6.3. where Tj(X) L~ I I(Xi = j). in the HardyWeinberg examples 2.I. The boundary of a convex set necessarily has volume 0 (Problem 2. I < j < k iff 0 < Tj < n. Multinomial Trials. using (2.1. then and the result follows from Corollary 2. We follow the notation of Example 1.1 that in this caseMLEs of"'j = 10g(AjIAk).i.I which generates the family is T(k_l) = (Tl>"" T.9). In Example 3. We assume n > k . >.Xn are i.4. I < j < k.3.. where 0 < Aj PIX = j] < I.3. This is a rank 2 canonical exponential family generated by T   = (L log Xi.n. I <j < k. Example 2. A nontrivial application of Theorem 2.). the maximum likelihood principle in many cases selects the «best" estimate among them.1. we see that (2. They are determined by Aj = Tj In. by Problem 2. = = L L i i bJ _ I . 0 = If T is discrete MLEs need not exist. Because the resulting value of t is possible if 0 < tjO < n.2. x > 0. P > 0. The statistic of rank k . but only Os is a MLE.4. Suppose Xl. if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO .3.1 in the second.
lI) ~ exp{cT (II)T(x) .1 we can obtain a contradiction to (2. 1)T.1.3 Maximum likelihood in Multiparameter Exponential Families 125 On the other hand.. Unfortunately strict concavity of Ix is not inherited by curved exponential families.B(B) j=l . n > kl.. be a cUlVed exponential family p(x. and unicity can be losttake c not onetoone for instance.3. 0 < j < k . mxk e.. the following corollary to Theorem 2.exist and are unique.2) so that the MLE ij in P exists.13).B) ~ h(x)exp LCj(B)Tj(x) .. Corollary 2. (2.. if 201 + 0.3.3.3.I. The following result can be useful. note that in the HardyWeinberg Example 2.1 is useful.~l Aj = I}.3. Alternatively we can appeal to Corollary 2.8see Problem 2. Similarly.1] it does exist ~nd is unique. the MLEs ofA"j = 1. . 0 The argument of Example 2. whereas if B = [0.3.3. .3. In some applications. when we put the multinomial in canonical exponential family form.3. .1 directly D (Problem 2. if any TJ = 0 or n.3. h). .8 we saw that in the multinomial case with the clQsed parameter set {Aj : Aj > 0. e open C R=. Here [ is the natural paramerer space of the exponential fantily P generated by (T.3.1. If the equations ~ have a solution B(x) E Co.3. The remaining case T k = a gives a contradiction if c = (1.k. Then Theorem 2. lfP above satisfies the condition of Theorem 2.2) by taking Cj = 1(i = j).7) Note that c(lI) E c(8) and is in general not ij. then it is the unique MLE ofB. Let Q = {PII : II E e). When P is not an exponential family both existence and unicity of MLEs become more problematic.1.3 can be applied to determine existence in cases for which (2.3. . m < k .1 and Haberman (1974).10). In Example 2. Ck (B)) T and let x be the observed data. Let CO denote the interior of the range of (c. x E X.1). BEe. c(8) is closed in [ and T(x) = to satisfies (2. However.Section 2.3.3. our parameter set is open. the bivariate normal case (Problem 2.2.3) does not have a closedform solution as in Example 1.2. for example. the MLE does not exist if 8 ~ (0. then so does the MLE II in Q and it satisfies the likelihood equation ~ cT(ii) (to . = 0. Remark 2.2. Consider the exponential family k p(x. (II) . . (B).A(c(II))}h(x) Suppose c : 8 ~ [ C R k has a differential (2. L.6. 1 < i < k .1.A(c(ii)) ~ = O..6. .3.6) on c( II) = ~~.
11.\()'.3.6) with " • .d.?' m)T " ".. We find CI Example 2.. As in Example 1.9. 9) is a curved exponential family of the form (2... . i Example 2.. . LocationScale Regression. 'TJn+i = 1/2a..I LX..10. As a consequence of Theorems 2. tLi = (}1 + 82 z i .2.".2 and 2. . .2~~2 .. Ii± ~ ~[)..3.oJ).. L: Xi and I.1/'1') T . i = 1.3..3)T.6.6. . l = 1.ry. j = 1.~Y'1. E R. Zl < . simplifies to Jl2 + A6XIl.7) if n > 2. = 1 " = 2 n( ryd'1' . we see that the distribution of {01 : j = 1. l}m. as in Example 1. < O}. I .it. < O}. .!0b''. n.6. m} is a 2nparameter canonical exponential family with 'TJi = /tila.< O.ryl > O. ).Zn are given constants. Using Examples 1. suppose Xl.5 and 1. .n(it' + )"5it'))T which with Ji2 = n..5.5 = .g( it' .gx±). Evidently c(8) = {(lh. m m m f?. the o q. 1 n. = L: xl. .nit.) : ry... Equation (2.10.n.. where 0l N(tLj 1 0'1). which implies i"i+ > 0.3. with I..ry.3 )(t.126 The proof is sketched in Problem 2.A6iL2 = 0 Ii. . .6. that . we can conclude that an MLE Ii always exists and satisfies (2.11.. generated by h(Y) = 1 and "'> I T(Y) '. < Zn where Zt.'. '1.3. Now p(y. which is closed in E = {( ryt> ry. .3. Suppose that }jI •.'' . af = f)3(8 1 + 82 z i )2. N'(fl.6. are n independent random samples.it.3../2'1' . Because It > 0. .3.3. = ( ~Yll'''' . Next suppose. Jl > 0. .2 . = ~ryf.. Methods of Estimation Chapter 2 family with (It) = C2 (It) = .i. ..: Note that 11+11_ = '6M2 solution we seek is ii+..4.0'2) with Jl/ a = AO > a known. .Xn are i.. '() A 1/ Thus. t. corresponding to 1]1 = //x' 1]2 = . This is a curved exponential ¥.5X' +41i2]' < 0. ry.~Yn..7) becomes = 0. Gaussian with Fixed Signal to Noise.) : ry. C(O) and from Example 1.
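Returning to Example 2.3.5 (Gaussian with fixed signal-to-noise ratio μ/σ = λ₀ > 0 known), the likelihood equation reduces to the quadratic μ² + λ₀² X̄ μ − λ₀² μ̂₂ = 0, whose positive root μ̂₊ is the estimate sought. The following sketch computes μ̂₊; the simulated data and function name are illustrative only.

```python
import numpy as np

def mle_fixed_snr(x, lam0):
    """MLE of mu in N(mu, sigma^2) with mu / sigma = lam0 > 0 known.
    Solves mu^2 + lam0^2 * xbar * mu - lam0^2 * m2 = 0 and keeps the
    positive root mu_plus (the model requires mu > 0)."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    m2 = np.mean(x ** 2)                      # second sample moment
    mu_plus = 0.5 * lam0 ** 2 * (-xbar + np.sqrt(xbar ** 2 + 4.0 * m2 / lam0 ** 2))
    return mu_plus, mu_plus / lam0            # (mu-hat, sigma-hat)

rng = np.random.default_rng(0)
lam0 = 2.0                                    # known signal-to-noise ratio
x = rng.normal(loc=3.0, scale=3.0 / lam0, size=50)
print(mle_fixed_snr(x, lam0))
```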
x*l: Find Xo < x" f(xo) < 0 < f(x') by taking Ixol.3. b). b) such that f(x*) = O. order d3 operations to invert.3. 2. which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2. MLEs may not be given explicitly by fonnulae but only implicitly as the solutions of systems of nonlinear equations. in pseudocode. Ixd large enough. L 10). The packages that produce least squares estimates do not in fact use fonnula (2. Given f continuous on (a.7).1 work.3. Here. ~ 2.1 The Method of Bisection The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in kparameter exponential families. However. > 2.02 E R.Section 2. the fonnula (2.4 ALGORITHMIC ISSUES As we have seen. if implemented as usual.3. an MLE (J of (J exists and () 0 ~ ~ Summary. such as the twoparameter gamma. Let £ be the canonical parameter set for this full model and let e ~ (II: II. the basic property making Theorem 2.3. E R. there exists unique x*€(a.4 Algorithmic Issues 127 If m > 2. is isolated and shown to apply to a broader class of models.1 and Corollary 2. we will discuss three algorithms of a type used in different statistical contexts both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all. by the intermediate value theorem. in this section. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with £ open (Theorem 2. These results lead to a necessary condition for existence of the MLE in curved exponential families but without a guarantee of unicity or sufficiency. then.O. We begin with the bisection and coordinate ascent methods. L 10) for {3 is easy to write down symbolically but not easy to evaluate if d is at all large because inversion of Z'bZD requires on the order of nd 2 operations to evaluate each of d(d + 1)/2 tenns with n operations to get Z'bZD and then. at various times. x o1d = xo. even in the context of canonical multiparameter exponential families. f( a+) < 0 < f (b. Initialize x61d ~ XI. even in the classical regression model with design matrix ZD of full rank. Given tolerance € > a for IXfinal .). f i strictly.1). In fact.1. > OJ. entrust ourselves. strict concavity.4. .1. It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. is the bisection algOrithm to find x*. Then c(8) is closed in £ and we can conclude that for m satisfies (2. Finally.3. then the full 2nparameter model satisfies the conditions of Theorem 2. d.
exists. (2. • .i. h). By Theorem 1.4. Moreover. in addition.xol/€).128 (I) If IX~ld .xol· 1 (2) Therefore.4. lx. f(a+) < 0 < f(b). Xm < x" < X m +l for all m. Let X" .to· I' Proaf. Theorem 2.. !'(.4... xfinal = !(x~ld + x old ) and return xfinal' (2) Else.. o ! i If desired one could evidently also arrange it so that. .1. the MLE Tj. x~ld (5) If f(xnew) > 0. Then.1). by the intermediate value theorem.1.1.3.3. may befound (to tolerance €) by the method afbisection applied to f(. I (3) and X m + x* i as m j. for m = log2(lxt .. which exists and is unique by Theorem 2. 0 Example 2.1) . so that f is strictly increasing and continuous and necessarily because i. . x~ld = Xnew· Go to (I).1.4. xfinal = Xnew· (4) If f(xnew) < 0. Xnew = !(x~ld + x~ld)' ~ Xnew· (3) If f(xnew) = 0. I. X n be i.) = VarryT(X) > 0 for all . b) of the convex support a/PT... satisfying the conditions of Theorem 2. .d.xoldl Methods of Estimation Chapter 2 < 2E. Let p(x I 1]) be a oneparameter canonical exponentialfamily generated by (T.IXm+l . i I i. From this lemma we can deduce the following.. [(8.) = EryT(X) . The bisection algorithm stops at a solution xfinal such that i Proot If X m is the mth iterate of Xnew i (I) Moreover.x*1 < €.6.. End Lemma 2. 00...4. The Shape Parameter Gamma Family. the interior (a. If(xfinal)1 < E.1 and T = to E C~.
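To make the bisection recipe concrete, here is a minimal sketch for the shape-parameter gamma family of Example 2.4.1: with the scale fixed at 1, the MLE solves Γ'(θ)/Γ(θ) = n⁻¹Σ log Xᵢ, and ψ = Γ'/Γ (the digamma function, available in standard packages) is strictly increasing, so bisection applies directly. The bracketing step, tolerance, and data below are illustrative assumptions, not taken from the text.

```python
import numpy as np
from scipy.special import digamma            # digamma = Gamma'/Gamma

def bisect(f, lo, hi, tol=1e-8):
    """Bisection for a continuous increasing f with f(lo) < 0 < f(hi)."""
    while hi - lo > 2 * tol:
        mid = 0.5 * (lo + hi)
        if f(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gamma_shape_mle(x, tol=1e-8):
    """MLE of the shape theta for i.i.d. Gamma(theta, 1) data:
    solves digamma(theta) = mean(log x), the equation of Example 2.4.1."""
    target = np.mean(np.log(x))
    f = lambda theta: digamma(theta) - target
    lo, hi = 1e-8, 1.0
    while f(lo) > 0:                         # widen the bracket until f(lo) < 0 < f(hi)
        lo /= 2.0
    while f(hi) < 0:
        hi *= 2.0
    return bisect(f, lo, hi, tol)

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.5, scale=1.0, size=200)   # hypothetical data
print(gamma_shape_mle(x))                        # should be close to 3.5
```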
it is in fact available to high precision in standard packages such as NAG or MATLAB.. we would again set a tolerenceto be.2 Coordinate Ascent The problem we consider is to solve numerically.1]2.4. 1] ~Ok = (~1 ~1 {} "") 1]I.'fJk' ~1) an soon. In fact.'TJk =tk' ) . which is slow. Here is the algorithm. but as we shall see.'TJ3. in cycle j and stop possibly in midcycle as soon as <I< . for a canonical kparameter exponential family. say c. E1/(T(X» = A(1/) = to when the MLE Tj = Tj(to) exists.4.1. r > 1.'fJk· Repeat.1]2.'TJk.4 Algorithmic Issues 129 Because T(X) = L:~' equation I log Xi has a density for all n the MLE always exists. However. .Section 2.1 can be evaluated by bisection. Notes: (1) in practice. 1 k. bisection itself is a defined function in some packages. The function r(B) = x'Iexdx needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method. getting 1j 1r). eventually.···.oJ _ = (~1 "" {}) ~02 _ 1]1. 0 J:: 2.· . for each of the if'. This example points to another hidden difficulty. always converges to Ti. d and finally 1] _ ~Il) _ =1] (~1 1]1"". The case k = 1: see Theorem 2. It solves the r'(B) r(B) T(X) n which by Theorem 2.···. The general case: Initialize ~o 1] = ("" 'TJll···' "") • 'TJk Solve ~1 f or1]k: Set 1] 8'T]k [) A(~l ~1 1]ll1'J2.4.
'tn... For n > 2 we know the MLE exists. I(W ) = 00 for some j. We note some important generalizations. refuses to converge (in fI space!)see Problem 2.) where flj has dimension d j and 2::. ..2.'. A(71') ~ to. . . The case we have C¥. 0 Example 2.j l(i/i) = A (say) exists and is > 00. Else lim.'11 11 t t (I .5).. W = iji = ij. We give a series of steps. Theorem 2. in fact. (6) By (4) and (5)... .. fI. ilL" iii.. j (5) Because 1]1. ijik) has a convergent subsequence in t x . (fij(e) are as above.2: For some coordinates l. = 7Jk. 1J ! But 71j E [. (I) l(ij'j) Tin j for i fixed and in. Suppose that this is true for each l. Then each step of the iteration both within cycles and from cycle to cycle is quick. that is.2. Consider a point we noted in Example 2.··. To complete the proof notice that if 1j(rk ) is any subsequence of 1j(r) that converges to ij' (say) then. Hence. ij as r t 00. . (2) The sequence (iji!. ff 1 < j < k. (3) and (4) => 1]1 . 71J. This twodimensional problem is essentially no harder than the onedimensional problem of Example 2. x ink) ( '~ini .. .I ~(~ (1) 1 to get V. We pursue this discussion next. Here we use the strict concavity of t.I) solving r0'... Bya standard argument it follows that. (Wn') ~ O.1 because the equa~ tion leading to Anew given bold' (2. they result in substantial savings of time.1j{ can be explicit.4. Let 1(71) = tif '1.2. The TwoParameter Gamma Family (continued).A(71) + log hex).. ij = (V'I) .3. I • . 'II.···) C!J. 71 1 is the unique MLE.pO) = We now use bisection ' r' ..2'  It is natural to ask what happens if. _A (1»).4. as it should. 1j(r) ~ 1j. Suppose that we can write fiT = (fir. 0 ~ (4) :~i (77') = 0 because :~..=1 d j = k and the problem of obtaining ijl(tO. We use the notation of Example 2. Therefore.\(0) = ~. Continuing in this way we can get arbitrarily close to 1j. . the MLE 1j doesn't exist.2. to ¢ Fortunately in these cases the algorithm.4. (i).. Because l(ijI) ~ Aand the MLE is unique. limi. .I)) = log X + log A 0) and then A(1) = pY.4. . Proof. j ¥ I) can be solved in closed form. the algorithm may be viewed as successive fitting of oneparameter families. ij'j and ij'(j+I) differ in only one coordinate for which iji(j+1) maximizes l.? Continuing.4.1. (ii) a/Theorem 2.1 hold and to E 1j(r) t g:. 1 < j < k. Thus... . TIl = .l(ij') = A. the log likelihood. We can initialize with the method 2 of moments estimate from Example 2. (3) I (71j) = A for all j because the sequence oflikelihoods is monotone. rp differ only in the second coordinate. is the expectation of TI(X) in the oneparameter exponential family model with all parameters save T1I assumed known. is computationally explicit and simple.1J k) .130 Methods of Estimation Chapter 2 (2) Notice that (iii. Whenever we can obtain such steps in algorithms.2... by (I).3.3.
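A sketch of the coordinate ascent cycle for the two-parameter gamma family of Example 2.4.2, assuming the Γ(p, λ) parametrization with density λᵖxᵖ⁻¹e^{−λx}/Γ(p): with λ held fixed, the likelihood equation for p is ψ(p) = log λ + n⁻¹Σ log Xᵢ and is solved by bisection; with p held fixed, λ = p/X̄ in closed form. As suggested in the text, the method-of-moments estimate is used as the starting point. The cycle count, tolerance, and data are illustrative.

```python
import numpy as np
from scipy.special import digamma

def solve_p(target, tol=1e-10):
    """Bisection for digamma(p) = target (digamma is strictly increasing)."""
    lo, hi = 1e-8, 1.0
    while digamma(hi) < target:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if digamma(mid) <= target else (lo, mid)
    return 0.5 * (lo + hi)

def gamma_mle_coordinate_ascent(x, n_cycles=100, tol=1e-9):
    """Coordinate ascent for Gamma(p, lambda):
       p      <- solution of digamma(p) = log(lambda) + mean(log x)   (bisection)
       lambda <- p / xbar                                             (closed form)
    starting from the method-of-moments estimate."""
    x = np.asarray(x, dtype=float)
    xbar, logbar = x.mean(), np.mean(np.log(x))
    p = xbar ** 2 / x.var()                  # method-of-moments start
    lam = p / xbar
    for _ in range(n_cycles):
        p_new = solve_p(np.log(lam) + logbar)
        lam_new = p_new / xbar
        done = abs(p_new - p) < tol and abs(lam_new - lam) < tol
        p, lam = p_new, lam_new
        if done:
            break
    return p, lam

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.5, scale=1.0 / 1.5, size=500)   # true (p, lambda) = (2.5, 1.5)
print(gamma_mle_coordinate_ascent(x))
```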
. Figure 2.. The graph shows log likelihood contours.p.3. iterate and proceed. The coordinate ascent algorithm. which we now sketch. The coordinate ascent algorithm can be slow if the contours in Figure 2.1 illustrates the j process. 1' B .4. . . 1 B. each of whose members can be evaluated easily.. the log likelihood for 8 E open C RP. BJ+l' . Feinberg.4. the method extends straightforwardly. and Problems 2.B2 )T where the log likelihood is constant... find that member of the family of contours to which the vertical (or horizontal) line is tangent. See also Problem 2.4 Algorithmic Issues 131 just discussed has d 1 = .2 has a generalization with cycles of length r.B~) = 0 by the method of j bisection in B to get OJ for j = 1. At each stage with one coordinate fixed.4.1 are not close to sphericaL It can be speeded up at the cost of further computation by Newton's method. r = k..4. Solve g~: (Bt. values of (B 1 . . for instance.1. If 8(x) exists and Ix is differentiable.7..Section 2. Then it is easy to see that Theorem 2.. . e ~ 3 2 I o o 1 2 3 Figure 2.4.10.4. Next consider the setting of Proposition 2. is strictly concave. Change other coordinates accordingly.92. A special case of this is the famous DemingStephan proportional fitting of contingency tables algorithmsee Bishop.4. = dr = 1. and Holland (1975). .1 in which Ix(O). that is.
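The Deming–Stephan proportional fitting algorithm mentioned above is itself a coordinate-ascent-type scheme for log-linear models. The following is a minimal sketch of one common form of the iteration, under the independence model for a two-way table: the fitted table is alternately rescaled to match the observed row margins and then the observed column margins. The specific model, counts, and stopping rule are illustrative assumptions, not taken from the text.

```python
import numpy as np

def proportional_fitting(counts, n_cycles=50, tol=1e-10):
    """Deming-Stephan / iterative proportional fitting sketch for a two-way
    contingency table under the independence model: rescale the fitted cell
    means to the observed row totals, then to the observed column totals."""
    counts = np.asarray(counts, dtype=float)
    fit = np.full_like(counts, counts.mean())           # flat starting table
    for _ in range(n_cycles):
        old = fit.copy()
        fit *= (counts.sum(axis=1) / fit.sum(axis=1))[:, None]   # match row margins
        fit *= (counts.sum(axis=0) / fit.sum(axis=0))[None, :]   # match column margins
        if np.max(np.abs(fit - old)) < tol:
            break
    return fit

table = np.array([[10., 20., 30.],
                  [ 5., 25., 10.]])                      # hypothetical counts
print(proportional_fitting(table))                       # margins now match the data
```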
A hybrid of the two methods that always converges and shares the increased speed of the NewtonRaphson method is given in Problem 2. which may counterbalance its advantage in speed of convergence when it does converge.4.4.B) We find exp{ (x . (2. if l(B) denotes the log likelihood. then ii new = iiold .B) i=l 1 1 2 L i=l n I(X" B) < O. and Anderson (1974).3 The NewtonRaphson Algorithm An algorithm that.4.4.k I (iiold)(A(iiold) . when it converges. X n be a sample from the logistic distribution with d. If 110 ld is close enough to fj. Newton's method also extends to the framework of Proposition 2.2) gives (2. though there is a distinct possibility of nonconvergence or convergence to a local rather than global maximum.3. B)}F(Xi.1. coordinate ascent.7. B) The density is = [I + exp{ (x = B) WI.f.to).10.4. This method requires computation of the inverse of the Hessian. is the NewtonRaphson method. 1 l(x. and NewtonRaphson's are still employed. in general.6.3. _. In this case. methods such as bisection.3) Example 2. F(x.  I o The NewtonRaphson algorithm has the property that for large n. Let Xl. We return to this property in Problem 6. then by f1new is the solution for 1] to the approximation equation given by the right.4. If 7J o ld is close to the root 'ij of A(ij) expanding A(ij) around 11old' we obtain = to. iinew after only one step behaves approximately like the MLE. When likelihoods are noncave.. the argument that led to (2. Here is the method: If 110 ld is the current value of the algorithm. I 1 The NewtonRaphson method can be implemented by taking Bold X. Bjork.. A onedimensional problem 7 . this method is known to converge to ij at a faster rate than coordinate ascentsee Dahlquist.B)} [i+exp{(xB)}j2' n 1 l(B) l(B) n2Lexp{(X.and lefthand sides.2) The rationale here is simple. can be shown to be faster than coordinate ascent.132 Methods of Estimation Chapter 2 2.
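A sketch of the Newton–Raphson iteration for the logistic location model of the example above: with F(x, θ) = [1 + e^{−(x−θ)}]⁻¹, the score is l'(θ) = 2ΣF(Xᵢ, θ) − n, the second derivative is l''(θ) = −2Σf(Xᵢ, θ) < 0, so l is strictly concave and the iteration θ_new = θ_old − l'(θ_old)/l''(θ_old), started at θ_old = X̄, converges to the MLE. The data are simulated and the iteration cap and tolerance are illustrative.

```python
import numpy as np

def logistic_mle_newton(x, max_iter=50, tol=1e-10):
    """Newton-Raphson for the location parameter of the logistic model
    F(x, theta) = 1 / (1 + exp(-(x - theta)));  l is strictly concave."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = x.mean()                            # starting value theta_old = xbar
    for _ in range(max_iter):
        F = 1.0 / (1.0 + np.exp(-(x - theta)))
        score = 2.0 * F.sum() - n               # l'(theta)
        hess = -2.0 * np.sum(F * (1.0 - F))     # l''(theta) = -2 * sum f(X_i, theta)
        step = score / hess
        theta -= step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(3)
x = rng.logistic(loc=1.0, scale=1.0, size=300)  # simulated logistic data
print(logistic_mle_newton(x))                    # should be close to 1.0
```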
the homozygotes of one type (€il = 1) could not be distinguished from the heterozygotes (€i2 = 1). a < 0 < I. i = 1. Lumped HardyWeinberg Data. . is not X but S where = 5i 5i Xi. for some individuals.4.0)2. with an appropriate starting point. and Rubin (977). .13. Petrie. For detailed discussion we refer to Little and Rubin (1987) and MacLachlan and Krishnan (1997). Say there is a closedfonn MLE or at least Ip. Ei3). (2. What is observed. however. 0) is difficultto maximize. (0) = log q(s. though an earlier general form goes back to Baum. 0) where I".OJ] i=l n (2.4. the rest of X is "missing" and its "reconstruction" is part of the process of estimating (} by maximum likelihood. A prototypical example folIows. Po[X = (0. .4 Algorithmic Issues 133 in which such difficulties arise is given in Problem 2. in Chapter 6 of Dahlquist. where Po[X = (1.4. It does tum out that in this simplest case an explicit maximum likelihood solution is still possible.€i3). 1 8 m are not Xi but (€il. Bjork.4) Evidently.B). E'2.4). Many examples and important issues and methods are discussed. There are ideal observations. Soules. Here is another important example.. Po[X (0. As in Example 2. This could happen if. difficult to compute. be a sample from a population in HardyWeinberg equilibrium for a twoallele locus.Section 2.2. S = S(X) where S(X) is given by (2. 1. e = Example 2. 0. 1 <i< m (€i1 +€i2. X ~ Po with density p(x. and Weiss (1970).. The algorithm was fonnalized with many examples in Dempster. 0 leads us to an MLE if it exists in both cases.n.. Xi = (EiI. 2. Laird.4. we observe 5 5(X) ~ Qo with density q(s. 0). and Anderson (1974).0)] ~ 28(1 . for instance.0. m+ 1 <i< n.x(B) is "easy" to maximize. I)] (1 . and so on. but the computation is clearly not as simple as in the Original HardyWeinberg canonical exponential family example.B) + 2E'310g(1 .4.x(B) is concave in B.5) + ~ [(EiI+Ei2)log(1(10)')+2Ei310g(IB)] i=m+l a function that is of curved exponential family fonn.(0) })2EiI 10gB + Ei210g2B(1 . then explicit solution is in general not lXJssible.0 E c Rd Their log likelihood Ip.4. €i2 + €i3). We give a few examples of situations of the foregoing type in which it is used.. and its main properties.4 The EM (Expectation/Maximization) Algorithm There are many models that have the following structure. If we suppose (say) that observations 81. A fruitful way of thinking of such problems is in terms of 8 as representing part of X.6.4. let Xi. . The log likelihood of S now is 1". Unfortunately. Yet the EM algorithm. a)] = B2. the function is not concave.
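Before turning to the EM iteration, note that the observed-data log-likelihood (2.4.5) for the lumped Hardy–Weinberg data depends only on the counts of each observed pattern, so it can also be maximized directly by a one-dimensional search. A minimal sketch follows; the count-based form of the log-likelihood is my own rewriting of (2.4.5), the counts are hypothetical, and the grid search is used purely for illustration.

```python
import numpy as np

def lumped_hw_loglik(theta, n1, n2, n3, m12, m3):
    """Observed-data log-likelihood for Hardy-Weinberg counts in which the
    first two genotypes are lumped for part of the sample:
      fully classified counts (n1, n2, n3): probs theta^2, 2*theta*(1-theta), (1-theta)^2
      lumped counts (m12, m3): probs 1 - (1-theta)^2 and (1-theta)^2."""
    return (2 * n1 * np.log(theta) + n2 * np.log(2 * theta * (1 - theta))
            + 2 * n3 * np.log(1 - theta)
            + m12 * np.log(1 - (1 - theta) ** 2) + 2 * m3 * np.log(1 - theta))

def lumped_hw_mle(n1, n2, n3, m12, m3, grid_size=100000):
    """Direct one-dimensional maximization on a fine grid (illustrative only)."""
    grid = np.linspace(1e-6, 1 - 1e-6, grid_size)
    return grid[np.argmax(lumped_hw_loglik(grid, n1, n2, n3, m12, m3))]

print(lumped_hw_mle(n1=20, n2=50, n3=30, m12=60, m3=40))   # hypothetical counts
```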
B) J(B I Bo) ~ E.4. Bo) _ I S(X) = s ) (2. It is not obvious that this falls under our scheme but let (2.i = 1.\.9) follows from (2.134 Methods of Estimation Chapter 2 Example 2.B) I S(X) = s) 0=00 (2. (P(X.4. .5. • • i ! I • .). we can think of S as S(X) where X is given by (2.B) 0=00 = E.B) IS(X)=s) q(s.o log p(X. As we shall see in important situations..8) by o taking logs in (2. are independent identically distributed with p() [A. = 11 = .. B) = (1.p() [A.12) q(s. I' Hi where we suppress dependence on s. The rationale behind the algorithm lies in the following formulas. . we have given.4.8) . Then we set Bnew = arg max J(B I Bold).4. Initialize with Bold = Bo· Tbe first (E) step of the algorithm is to compute J(B I Bold) for as many values of Bas needed. J.(sI'z)where() = (. It is easy to see (Problem 2.11).B) =E. + (1  A. o (:B 10gp(X. differentiating and exchanging E oo and differentiation with respect i.<T. The EM Algorithm.. Let .(f~). if this step is difficult. If this is difficul~ the EM algorithm is probably not suitable..<T.4. I:' .\)"'u. .7) ii. reset Bold = Bnew and repeatthe process.2)andO <. .4) = L()(Si I Ai) = N(Ail'l + (1. j.4. The second (M) step is to maximize J(B I Bold) as a function of B. I . Thus.Bo) and (2. . :B 10gq(s. O't.Sn is a sample from a population P whose density is modeled as a mixture of two Gaussian densities. . A. S has the marginal distribution given previously.0'2 > 0.4. I That is. This fiveparameter model is very rich pennitting up to two modes and scales. Note that (2.9) I for all B (under suitable regularity conditions)..\"'u.\ = 1.8).I.tz E Rand 'Pu (5) = . Suppose that given ~ = (~11"" ~n). EM is not particularly appropriate.(I'. I.4.(sl'd +.af) or N(J12.ll.12). Although MLEs do not exist in these models. .4.\ < 1. Here is the algorithm..4. Suppose 8 1 . .6) where A. p(s. a local maximum close to the true 8 0 turns out to be a good "proxy" for the 0 nonexistent MLE. The EM algorithm can lead to such a local maximum.Bo) 0 p(X.Ai)I'Z. Again. ~i tells us whether to sample from N(Jil.4.p (~). the Si are independent with L()(Si I . i. that under fJ. The log likelihood similarly can have a number of local maxima and can tend to 00 as e tends to the boundary of the parameter space (Problem 2. Mixture of Gaussians.4. the M step is easy and the E step doable. which we give for 8 real and which can be justified easily in the case that X is finite (Problem 2. = 01.6). (P(X.)<T~). including the examples.
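A sketch of the E- and M-steps for the two-component Gaussian mixture of Example 2.4.5. The E-step computes the posterior probability that each observation came from the first component under the current parameters; the M-step maximizes the expected complete-data log-likelihood, which here gives weighted versions of the usual Gaussian estimates and λ equal to the average responsibility. The starting values, stopping rule, and simulated data are illustrative, and no safeguard is included against the degenerate σ² → 0 solutions the text warns about.

```python
import numpy as np

def em_gaussian_mixture(s, n_iter=200, tol=1e-8):
    """EM for the mixture lambda*N(mu1, v1) + (1 - lambda)*N(mu2, v2)."""
    s = np.asarray(s, dtype=float)
    lam, mu1, mu2 = 0.5, np.quantile(s, 0.25), np.quantile(s, 0.75)   # crude start
    v1 = v2 = s.var() / 2.0

    def dens(x, mu, v):
        return np.exp(-(x - mu) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    old = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities w_i = P(Delta_i = 1 | s_i, theta_old)
        a, b = lam * dens(s, mu1, v1), (1 - lam) * dens(s, mu2, v2)
        w = a / (a + b)
        # M-step: weighted Gaussian estimates
        lam = w.mean()
        mu1, mu2 = np.average(s, weights=w), np.average(s, weights=1 - w)
        v1 = np.average((s - mu1) ** 2, weights=w)
        v2 = np.average((s - mu2) ** 2, weights=1 - w)
        loglik = np.sum(np.log(a + b))   # observed-data log-likelihood at the E-step parameters
        if loglik - old < tol:
            break
        old = loglik
    return lam, mu1, v1, mu2, v2

rng = np.random.default_rng(4)
s = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1.5, 200)])   # simulated mixture
print(em_gaussian_mixture(s))
```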
DO The main reason the algorithm behaves well follows.4. ~ E '0 (~I ogp(X . Id log (X I .O)r(x I 8. 0 0 = 0old' = Onew.1.11)  D IOgq(8.3. uold Now.4.lfOnew." ) I S(X) ~ 8 .4. formally. Lemma 2. Suppose {Po : () E e} is a canonical exponential family generated by (T. 00ld are as defined earlier and S(X) (2.0) (2. 0) I S(X) = 8 ) DO (2.(X 18. 0old)' Equality holds in (2. q S.17) o The most important and revealing special case of this lemma follows.14) whete r(· j ·.13) q(8.Onew) {r(X I 8 .h) satisfying the conditions a/Theorem 2. J(Onew I 0old) > J(Oold I 0old) = 0 by definition of Onew. Theorem 2. On the other hand. I S(X) ~ s > 0 } (2.00Id) by Shannon's ineqnality. r(Xj8..3.o log .Onew) E'Old { log r(X I 8.O) { " ( X I 8.2. DJ(O I 00 ) [)O and. Let SeX) be any statistic. (2.0 ) I S(X) ~ 8 .15) q(s.4. (2.1. However.13) iff the conditional distribution of X given S(X) forOnew asfor 00ld and Onew maximizes J(O I 0old)' =8 is the same Proof. Because. hence.4.Section 2. (2.0) = O.4.4. Onew) > q(s. S (x) ~ 8 p(x. 0 ) +E. the result holds whenever the quantities in J(O I (0 ) can be defined in a reasonable fashion.0) } ~ log q(s.4. 0) ~ q(s.4.1.E.4 Algorithmic Issues 135 to 0 at 00 .12) = s. In the discrete case we appeal to the product rule. We give the proof in the discrete case. Then (2.16) ° q(s. Lemma 2.10) [)J(O I 00 ) DO it follows that a fixed point 0 of the algorithm satisfies the likelihood equation. then . uold 0 r s.0) is the conditional frequency function of X given S(X) J(O I 00 ) If 00 = s.4. Fat x EX. O n e w ) } log ( " ) ~ J(Onew I 0old) .4.
4.136 (a) The EM algorithm consists of the alternation Methods of Estimation Chapter 2 A(Bnew ) = Eoold(T(X) I S(X) Bold ~ = s) (2. Proof. (2. Part (b) is more difficult.Bof Eoo(T(X) I S(X) = y) .18) (2. 1 I • • Po I<ij = 1 I <it + <i' = OJ =  0.23) I .(A(O) . 0 Example 2. = 11 <it +<" = IJ. h(x) = 2 N i .I<j<2 I r r: I I r! Pol<it 11 <it + <i' = 11 0' B' B' + 20(1 . then it converges to a limit 0·.4. A proof due to Wu (I 983) is sketched in Problem 2.4..18) exists it is necessarily unique.4.4. after some simplification.4.= +Eo ( t (2<" + <i') I <it + <i'.16.4. Thus. 1'I. A(1)) = 2nlog(1 + e") and N jn = L:~ t <ij(Xi). i i. 1 <j < 3. In this case.4.25) I" .+l (2.A(Bo)) (2.21) Part (a) follows. A'(1)) = 2nB (2. • I E e (2Ntn + N'n I S) = 2Nt= + N.19) = Bnew · If a solution of(2.4.Pol<i.4 (continued). .22) • ~ 0) . I • Under the assumption that the process that causes lumping is independent of the values of the €ij.B) 1 .24) " ! I .(A(O) . that.BO)TT(X) . which is necessarily a local maximum J(B I Bo) I I = Eoo{(B .A('1)}h(x) (2. Now.0)' 1.4. X is distributed according to the exponential family I where p(x.A(Bo)) I S(X) = s} (B .(x).20) has a unique solution.B).(1 . = Ee(T(X) I S(X) = s) ~ (2. (b) lfthe sequence of iterates {Bm} so obtained is bounded and the equation A(B) o/q(s. m + 1 < i < n) . .4.B) 1) = log (1 = exp(1)(2Ntn (x) + N'n(x)) . we see. I . i=m.
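For the lumped Hardy–Weinberg data of Example 2.4.4, the EM step only requires the conditional expectation of the complete-data sufficient statistic 2N₁ + N₂ given the lumped observations. For a lumped observation known to fall in the combined class {1, 2}, the conditional probabilities displayed above give E_θ[2ε₁ + ε₂ | ε₁ + ε₂ = 1] = (2θ² + 2θ(1−θ))/(θ² + 2θ(1−θ)) = 2/(2−θ) (the final simplification is my own algebra), and the M-step is the complete-data MLE (2N₁ + N₂)/2n with this expectation plugged in. The sketch below uses hypothetical counts; its fixed point satisfies the observed-data likelihood equation, so it can be compared with the direct grid maximization sketched earlier.

```python
def em_lumped_hw(n1m, n2m, n3m, m12, m3, theta0=0.5, n_iter=100, tol=1e-12):
    """EM for lumped Hardy-Weinberg data (Example 2.4.4).
    E-step: each lumped observation in the class {1,2} contributes
            E[2*eps1 + eps2 | eps1 + eps2 = 1] = 2 / (2 - theta_old).
    M-step: complete-data MLE  theta = (2*N1 + N2) / (2n)  with the
            expectation substituted for the unobserved counts."""
    n = n1m + n2m + n3m + m12 + m3            # total sample size
    theta = theta0
    for _ in range(n_iter):
        expected = 2.0 * n1m + n2m + m12 * 2.0 / (2.0 - theta)
        theta_new = expected / (2.0 * n)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# hypothetical counts: (n1m, n2m, n3m) fully classified, (m12, m3) lumped
print(em_lumped_hw(n1m=20, n2m=50, n3m=30, m12=60, m3=40))
```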
4.26) It may be shown directly (Problem 2. r) (Problem 2. + 1 <i< n}.(y. to conclude ar. the EM iteration is L i=tn+l n ('iI + <. compute (Problem 2.6.) E. l"l)/al]Zi with the corresponding Z on Y regression equations when conditioning on Yi (Problem 2. For other cases we use the properties of the bivariate nonna] distribution (Appendix BA and Section 1.(2N3= + N 2=)B + 2 (N1= + (I n n =0 0 in (0.T2 ]}' .4.. In this case a set of sufficient statistics is T 1 = Z. 112. To compute Eo (T I S = s).Bold n ~ (2. (Zn.1) that the Mstep produces I1l. where (Z.1. .new ~ + I'i. i=I i=l n n n i=l s = {(Z" Y.T 2 . T3 = The observed data are n l L zl.Yn) be i.new T4 (Bo1d ) . I Zi) 1"2 + P'Y2(Z. which is indeed the MLE when S is observed. we note that for the cases with Zi andlor Yi observed. where B = (111.. ~ T l (801 d).4. ai ~ We take Bo~ = B MOM ' where BMOM is the method of moment estimates (11" 112. . Y) ~ N (I"" 1"" 1 O'~. This completes the Estep. a~ + ~. and for n2 + 1 < i < n. 2 Mn .(ZiY.1) A( B) = E.(Y.i. I'l)/al [1'2 + P'Y2(Z.27) hew i'tT2Ji{[T3 (8ol d) .4 Algoritnmic Issues 137 where Mn ~ Thus. I). we observe only Yi . and find (Problem 2.2 I Z..new ii~. Suppose that some of the Zi and some of the Yi are missing as follows: For 1 < i < nl we observe both Zi and Yi.4.4.4.8) of B based on the observed data. I Z. Let (Zl'yl).. Jl2. B). al a2P + 1"1/L2).: nl + 1 <i< n.d. T4 = n. at Example 2. Y). For the Mstep..= > N 3 =)) 0 and M n > 0. ii2 . converges to the unique root of ~ + N.: n.): 1 <i< nt} U {Z.p2)a~ [1'2 + P'Y2(Z. iii. as (Z. p). /L2.12) that if2Nl = Bm. unew  _ 2N1m + N 2m n + 2 . we oberve only Zi.l L ZiYi.l LY/.} U {Y.T{ (2.Section 2.1). a~.) E. b. then B' . [T5 (8ald) 2 = T2 (8old)' iii.4.Tl IlT4 (8ol d) .). T 5 = n. the conditional expected values equal their observed values. new = T 3 (8ol d) . T 2 = Y. 1"1)/ad 2 + (1 .T = (1"1.4). for nl + 1 < i < n2.
respectively.2. in general.2. X I. where 0 < B < 1. X n assumed to be independent and identically distributed with exponential. ~' "... 0) is minimized. are discussed and introduced in Section 2.138 Methods of Estimation Chapter 2 where T j (B) denotes T j with missing values replaced by the values computed in the Estep and T j = Tj(B o1d )' j = 1.0'2) distribution. Consider a population made up of three different types of individuals occurring in the HardyWeinberg proportions 02.i' I' I i. in the context of Example 2. which yields with certainty the MLEs in kparameter canonical exponential families with E open when it exists..3 and the problems.).d. j : > I) that one system 3.0'2) based on the first two moments. j. based on the first moment. based on the second moment. show that T 3 is a method of moment estimate of ().4. Also note that. ! I 2.. X n have a beta. (a) Show that T 3 = N1/n + N 2 /2n is a frequency substitution estimate of e. Important variants of and alternatives to this algorithm..4 we derive and discuss the important EM algorithm and its basic properties.1 with respective probabilities PI.P3 given by the HardyWeinberg proportions..5 PROBLEMS AND COMPLEMENTS Problems fnr Sectinn 2.4. 0 Because the Estep. (b) Find the method of moments estimate of>. then J(B i Bo) is log[p(X.. . &(>.4. E"IJ(O I 00)] is the KullbackLeiblerdivergence (2. . the EM algorithm is often called multiple imputation. .B o)]. Remark 2. Find the method of moments estimates of a = (0'1. . 2B(1 .1 1. (b) Using the estimate of (a). .0. use this algorithm as a building block for the general coordinate ascent algorithm. Consider n systems with failure times X I .P2. Suppose that Li.4. (c) Combine your answers to (a) and (b) to get a method of moment estimate of >. By considering the first moment of X. Finally in Section 2. distributions..B)2. in Section 2. involves imputing missing values.6.23). We then.B)? . which as a function of () is maximized where the contrast logp(X.. based On the first two moments. 2. . (a) Find the method of moments estimate of >. Now the process is repeated with B MOM replaced by ~ ~ ~ ~ 0new.2.0) and (I . Note that if S(X) = X.4. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all oneparameter canonical exponential families with E open (when it exists). (d) Find the method of moments estimate of the probability P(X1 will last at least a month. B)/p(X. what is a frequency substitution estimate of the odds ratio B/(l. (c) Suppose X takes the values 1. including the NewtonRaphson method. Summary.1. (3( 0'1.
5...5.. 4. . Let Xl. = Xi with probability lin. X 1l be the indicators of n Bernoulli trials with probability of success 8. which we define by ~ F ~( s. t) is the bivariate empirical distribution function FCs. .. ~ ~ ~ (a) Show that in the finite discrete case. Give the details of this correspondence. . empirical substitution estimates coincides with frequency substitution estimates. Yi) such that Zi < sand Yi < t n .. (b) Exhibit method of moments estimates for VaroX = 8(1 .. YI ). (b) Show that in the continuous case X ~ rv F means that X (c) Show that the empirical substitution estimate of the jth moment JLj is the jth sample moment JLj' Hinr: Write mj ~ f== xjdF(x) ormj = Ep(Xj) where XF. X n . of Xi < xl/n. we know the order statistics. Nk+I) where N I ~ nF(t l ). (Z2. 8. of Xi n ~ =X.2. < .. < ~ ~ Nk+1 = n(l  F(tk)). t). (d) FortI tk. . .2. Let (ZI. Show that these estimates coincide. Yn ) be a set of independent and identically distributed random vectors with common distribution function F.. . Let X I. .) There is a Onetoone correspondence between the empirical distribution function ~ F and the order statistics in the sense that.8)ln first using only the first moment and then using only the second moment of the population. Hint: Consi~r (N I . . (a) Show that X is a method of moments estimate of 8... The natural estimate of F(s.. Let X(l) < . .F(t. N 2 = n(F(t2J ... given the order statistics we may construct F and given P.5 Problems and Complements 139 Hint: See Problem 8.t ) = Number of vectors (Zi. .. 6. The empirical distribution function F is defined by F(x) = [No.. . The jth cumulant '0' of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant Cj' Give the first three sample cumulants. (See Problem B. ~  7.12. (Zn. . If q(8) can be written in the fonn q(8) ~ s(F) for sOme function s of F we define the empirical substitution principle estimate of q( 8) to be s( F).8. < X(n) be the order statistics of a sample Xl.. ~ Hint: Express F in terms of p and F in terms of P ~() X = No. . .findthejointfrequencyfunctionofF(tl). Y2 ).Xn be a sample from a population with distribution function F and frequency function or density p...)). See A.Section 2. (c) Argue that in this case all frequency substitution estimates of q(8) must agree with q(X).F(tk). .
Yn . the sampIe correlation. I .) is the distribution function of a probability P on R2 assigning mass lin to each point (Zi.j2.2 with X iiI and /is..j) is given by ~ The sample covariance is given by n l n L (Zk k=l .!'r«(J)) I .1. Vi). Suppose X = (X"". (J E R d. the sample covariance. Y are the sample means of the sample correlation coefficient is given by Z11 •.. as X ~ P(J. . Hint: See Problem B. ~ i . Zn and Y11 .2). Show that the sample product moment of order (i. In Example 2.2.17. find the method of moments estimate based on i 12. p(X. (b) Define the sample product moment of order (i. respectively. 0).. suppose that g({3..140 Methods of Estimation Chapter 2 (a) Show that F(. 9. are independent N"(O."ZkYk . (See Problem 2... ..) Note that it follows from (A. 10. L I. Show that the least squares estimate exists. . Vk and that q(9) can be written as ec q«(J) = h(!'I«(J)"" .". . X n be LLd. z) is continuous in {3 and that Ig({3. >. {3) > c. Let X". " . the result follows.' where Z. (a) Find an estimate of a 2 based on the second mOment. .81 tends to 00. 11. . . with (J identifiable. ..L. . as the corresponding characteristics of the distribution F. and so on. : J • .Y) = Z)(Yk .ZY. I) = ". z) I tends to 00 as \..). In Example 2. f(a. 1".IU9) that 1 < T < L . The All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates.• . = #. {3) is continuous on K. . . . k=l n  n .1. Since p(X. j).4. Hint: Set c = p(X. Suppose X has possible values VI. . There exists a compact set K such that for f3 in the complement of K. (b) Construct an estimate of a using the estimate of part (a) and the equation a ..Xn ) where the X.. (c) Use the empirical substitution principle to COnstruct an estimate of cr using the relation E(IX.
d. to give a method of moments estimate of (72. = h(PI(O).. r. 0 ~ (p. See Problem 1. j i=l = 1. . (a) Use E(X.5...:::nt:::' ~__ 141 for some R k valued function h.. > 0.l and (72 are fixed.6.• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments...1) (iii) Raleigh. 1'(0.6. fir) can be written as a frequency plugin estimate.l Lgj(Xi ). . General method of moment estimates(!). .(X) ~ Tj(X). (b) Suppose P ~ po and I' = b are fixed. Let 911 . Use E(U[). p fixed (v) Inverse Gaussian.Section 2. . find the method of moments estimates (i) Beta.) to give a method of moments estimate of p. where U. Show that the method of moments estimate q = h(j1!. ~ (X. Iii = n.. 1'(1.1. When the data are not i. .O) ~ (x/O')exp(x'/20'). 0). as X rv P(J. A). 1 < j < k.1.. (c) If J.:::m:::.x (iv) Gamma. in estimate. . (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1. .. with (J E R d and (J identifiable. A).0 > 0 14. Hint: Use Corollary 1.0) (ii) Beta. Let g. Vk and that q(O) for some Rkvalued function h.6. ..10).i.ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill . .i. .. .p:::'. In the fOllowing cases. Suppose that X has possible values VI. IG(p.d.5 Problems and Com".36.Pr(O)) .p(x..9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X)). Consider the Gaussian AR(I) model of Example 1. rep.  1/' Po) / .. Suppose X 1.. can you give a method of moments estimate of {3? . X n are i. 13.
. Establish (2..... j > 0... 16.S 1 .. where OJ = 1. n. and IF... The reading Yi . = n 1 L:Z.g(I3.. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S. 1 . ..q. t n on the position of the object.. ..g(l3o..• l Yn are taken at times t 1.. . I..  t=1 liB = (0 1 .4..) .. 5 SF 28.)] = [Y. Y) and (ai.. I • .. Y) and B = (ai. . ()2.... 1 6 and let Pl ~ N j / n.1 " Xir Xk J.2 1. An Object of unit mass is placed in a force field of unknown constant intensity 8. . ). Let 8" 8" and 83 denote the probabilities of S..83 8t Let N j be the number of plants of genotype j in a sample of n independent plants.."" X iq ). HardyWeinberg with six genotypes.. I I .z.. Multivariate method a/moments.. and F at one locus resulting in six genotypes labeled SS.. of observations. I. 1 I .~b. >. Q. 11P2 + "2P4 + '2P6 . 17. For a vector X = (Xl.... 1".. .. = Y .8.z.bIZ. . II.Y.)] + [g(l3o.. b...3.. I .. and F.. _ (Z)' ' a. bI) are as in Theorem 1. S = 1.. .. SF.)].1. I . respectively. FF. The HardyWeinberg model specifies that the six genotypes have probabilities L7=1 I Genotype Genotype Probability SS 2 II 8~ 3 FF 8j 4 SI 28.. Readings Y1 .. k> Q mjkrs L. ' l X q ). where (Z.. and (J3. SI.J is' _ n n. Show that method of moments estimators of the parameters b1 and al in the best linear predictor are estimate ~ 8 of B is obtained L:Z.. 83 6 IF 28. Show that <j < i ! ._ .1"... n 1  Problems for Sectinn 2. For independent identically distributed Xi = (XiI. + '2P4 + 2PS . Hint: [y. let the moments be mjkrs = E(XtX~)..142 Methods of Estimation Chapter 2 15.. .. P3 + "2PS + '2P6 PI are frequency plugin estimates of OJ.. k > 0. .ZY ~ .z. Let X = (Z.. we define the empirical or sample moment to be j ~ .  1.6). .. . the method of moments l by replacing mjkrs by mjkrs.. i = 1.z.g(I3.... .. 1 q.! ...•• Om) can be expressed as a function of the moments..
BJ . . (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1. 9. Yn).2.. B) = Bc'x('+JJ. x > 0. (a) Let Y I . and 13 ranges over R d 8. . A new observation Ynl. The regression line minimizes the sum of the squared vertical distances from the points (Zl.2. .. nl and Yi = 82 + ICi.4. B > O.y.B.<) ~ . x > c. B. Let X I.4.). . . 7. .3. .2 may be derived from Theorem 1.I is to be taken at time Zn+l. . Find the least squares estimates of ()l and B .We suppose the oand be uncorrelated with constant variance..(zn.13 E R d } is closed." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi.z. Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi. to have mean 2.5) provided that 9 is differentiable with respect to (3" 1 < i < d. Find the line that minimizes the sum of the squared perpendicular distance to the same points. . 2 10. .. .. B > O.. .)' 1+ B5 6.. in fact. Show that the fonnulae of Example 2. nl + n2. Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants.t of the best (MSPE) predictor of Yn + I ? 4. Show that the least squares estimate is always defined and satisfies the equations (2.2. . (exponential density) (b) f(x. ..Yn). Yn have been taken at times Zl. all lie on aline. 13). . _. Hint: The quantity to be minimized is I:7J (y.5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l . ~p ~(yj}) ~ . 3.. Find the least squares estimates for the model Yi = 8 1 (2.. (a) f(x.. . (zn. Y. Find the LSE of 8. where €nlln2 are independent N(O.. .Section 2. X n denote a sample from a population with one of the following densities Or frequency functions..1. . i + 1. i = 1. if we consider the distribution assigning mass l/n to each of the points (zJ. (Pareto density) . Hint: Write the lines in the fonn (z. . the range {g(zr. g(zn.4)(2. .6) under the restrictions BJ > 0. i = 1. Find the least squares estimate of 0:. Yl). (J2) variables.. . Suppose Yi ICl".. < O... B) = Be'x. " 7 5. .. Zn and that the linear regression model holds. c cOnstant> 0. Find the MLE of 8. Suppose that observations YI .. What is the least squares estimate based on YI . ..n. 13).Yi)..
ry) denote the density or frequency function of X in terms of T} (Le. . = X and 8 2 ~ n. X n . 0) = (x/0 2) exp{ _x 2/20 2 }. then {e(w) : wEn} is a partition of e. 0 > O. Show that depends on X through T(X) only provided that 0 is unique.P > I.1). J1 E R. density) (e) fix.2.Pe.. I < k < p. 0 > O.5. . 0 E e and let 0 denote the MLE of O. .0"2 (a) Find maximum likelihood estimates of J. 0 <x < I. . bl to make pia) ~ p(b) = (b .  under onetoone transformations). Let X I. (Pareto density) (d) fix.1. Hint: Let e(w) = {O E e : q(O) = w}.: I I \ . n > 2. e c Let q be a map from e onto n. say e(w). X n • n > 2.r(c+<). 12. (b) Find the maximum likelihood estimate of PelX l > tl for t > 1".t and a 2 are both known to be nonnegative but otherwise unspecified. c constant> 0. (beta. (Wei bull density) 11. then i  WMLE = arg sup sup{Lx(O) : 0 E e(w)}. Pi od maximum likelihood estimates of J1 and a 2 .. Suppose that Xl. Because q is onto n. WEn . .O) where 0 ~ (1".. be a family of models for X E X C Rd. Show that the MLE of ry is h(O) (i.X)" > 0. (We write U[a.t and a 2 . .II ".. x > 1". be independently and identically distributed with density f(x. 0> O. Hint: You may use Problem 2.0"2) 15.:r > 0.) ! ! !' ! 14.I")/O") . 0) = Ocx c . If n exists.l exp{ OX C } .. x> 0. MLEs are unaffected by reparametrization. I). .. .0"2). reparametrize the model using 1]). 0 > O. Define ry = h(8) and let f(x. Show that any T such that X(n) < T < X(1) + is a maximum likelihood estimate of O. 0) = VBXv'O'. c constant> 0. Show that if 0 is a MLE of 0. I I I (b) LetP = {PO: 0 E e}. the MLE of w is by definition ~ W. iJ( VB. (Rayleigh density) (f) f(x. > O. for each wEn there is 0 E El such that w ~ q( 0). Thus. ncR'.. x > 0. (72 Ii (a) Show that if J1 and 0.144 Methods of Estimation Chapter 2 (c) f(". ~ I in Example 2. and o belongs to only one member of this partition. I i' 13.l L:~ 1 (Xi .16(b).. is a sample from a N(/l' ( 2 ) distribution. X n be a sample from a U[0 0 + I distribution. (a) Let X .a)l rather than 0. Suppose that T(X) is sufficient for 0 and ~at O(X) is an MLE of O. they are equivariant 16.5 show that no maximum likelihood estimate of e = (1". 00 = I 0" exp{ (x . .e. < I" < 00. 0) = cO".2 are unknown. q(9) is an MLE of w = q(O). I • :: I Suppose that h is a onetoone function from e onto h(e). Hint: Use the factorization theorem (Theorem 1.  e . Let X" . then the unique MLEs are (b) Suppose J.
) Let M = number of indices i such that Y i = r + 1. in a sequence of binomial trials with probability of success fJ. Show that maximum likelihood estimates do not exist. X 2 = the number of failures between the first and second successes.. Bremner.. •• .1 (I_B).0") E = {(I'..B). 1) distribution. Show that the maximum likelihood estimate of 0 based on Y1 . ..n . Suppose that we only record the time of failure... If time is measured in discrete periods.Xn be independently distributed with Xi having a N( Oi.. Thus.[X < r] ~ 1 LB k=l k  1 (1_ B) = B'..P. . . .B). 19. We want to estimate O. 21. . k=I. < 00. 17. (b) The observations are Xl = the number of failures before the first success..~1 ' Li~1 Y. and Brunk (1972). 16(i). . Censored Geometric Waiting Times. a model that is often used for the time X to failure of an item is P. but e . We want to estimate B and Var.X I = B(I.. (a) The observations are indicators of Bernoulli trials with probability of success 8.1 (1 . and have commOn frequency function.O") : 00 < fl. Yn which are independent. (a) Find maximum likelihood estimates of the fJ i under the assumption that these quantities vary freely. 0 < a 2 < oo}. where 0 < 0 < 1. k ~ 1. A general solution of this and related problems may be found in the book by Barlow." Y: . if failure occurs on or before time r and otherwise just note that the item has lived at least (r + 1) periods. Derive maximum likelihood estimates in the following models.Section 2. B) = 100' cp 9 (x I') + 0' 10 cp(x 1') 1 where cp is the standard normal density and B = (1'.. find the MLE of B. X n ) is a sample from a population with density f(x. ..  ..2..r f(r + I. In the "life testing" problem 1. . . 1 <i < n. .[X ~ k] ~ Bk . M 18.5 Problems and Complements 145 Now show that WMLE ~ W ~ q(O). Bartholomew. f(k. 20.B) = 1. (We denote by "r + 1" survival for at least (r + 1) periods. identically distributed. (KieferWolfowitz) Suppose (Xl. Y n IS _ B(Y) = L. (b) Solve the problem of part (a) for n = 2 when it is known that B1 < B. and so on. we observe YI . Let Xl.B) =Bk .6.
Hint: Coosider the mtio L(b + 1. x)/ L(b.6). Assume that Xi =I=.0 2.5 86.6..0 11. 7 .Xm and YI . Suppose Y.8 14.z!.jp): 0 <j.5 0.0"2) if. ... J2 J2 o o o I 3.. x) as a function of b.tl.4)(2.up(X. Ii equals one of the numbers 22.0 5.2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3.1 2.9 3.2 4. Tool life data Peed 1 Speed 1 1 Life 54. Set zl ~ .Y)' i=1 j=1 /(rn + n).'. TABLE 2. where'i satisfy (2. .x n .5 66. Y. 1 1 1 o o . .8 2.I' fl. where [t] is the largest integer that is < t. .Xj for i =F j and that n > 2.ji.1.J. and assume that zt'·· . II I I :' 24.. i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution). = fl. N. distribution. Suppose X has a hypergeometric. . 1 <k< p}.2.t.a 2 ) and NU".. " Yn be two independent samples from N(J. n).8 0. Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer. Let Xl.8 3.0 I . . 1i(b. (1') is If = (X.2 3.X)' + L(Y. Weisberg. 23. (1') where e n ii' = L(Xi .p(x..< J.5 I' I I . (1') populations. the following data were obtained (from S.' wherej E J andJisasubsetof {UI .(Zi) + 'i. Polynomial Regression.a 2 ) = Sllp~.2 4. 1985).. and only if. Show that the MLE of = (fl. respectively.2. .146 Methods of Estimation Chapter 2 that snp. . Xl.4 o o o o o o o o o o o o o Life 20. and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise.0 0.
weighted least squares estimates are plugin estimates.l be a square root matrix of W.6) with g({3.(b l + b. Being larger. this has to be balanced against greater variability in the estimated coefficients.8 and let v(z. Y)Z') and E(v(Z.y)dzdy.2.1). .0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models.3. y). Let W..5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life. + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +. (b) Let P be the empirical probability . = (cutting speed .2. Consider the model (2. Two models are contemplated (a) Y ~ 130 + pIZI + p.y)/c where c = I Iv(z.(P) coincide with PI and 13.4)(2. 28. This will be discussed in Volume II.  27.1 (see (B. Y)[Y . in Problem 2. Let (Z. . Both of these models are approximations to the true mechanism generating the data. The best linear weighted mean squared prediction error predictor PI (P) + p.6)).Y = Z D{3+€ satisfy the linear regression mooel (2.13)/6. . Consider the model Y = Z Df3 + € where € has covariance matrix (12 W.(P) = Cov(Z'.Y') have density v(z. .. let v(z..4)(2.19).2. Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix.y)!(z. 26. of Example 2. Derive the weighted least squares nonnal equations (2. Show that p.1. l/Var(Y I Z = z). (2.(P) and p.900)/300. z) ~ Z D{3.6) with g({3. z. Set Y = Wly. (a) Let (Z'.Section 2..(P)Z of Y is defined as the minimizer of t = zT {3. (c) Zj. z) ing are equivalent. ZD = Wz ZD and € = WZ€.5) for (a) and (b). the second model provides a better approximation.. 25. ZI ~ (feed rate .(P)E(Z·). ~.2. Show that (3. Y)Y') are finite. (12 unknown.y) defined .ZD is of rank d. Y) have joint probability P with joint density ! (z. Y')/Var Z' and pI(P) = E(Y*) 11. (2. .2.2.z. E{v(Z. That is. Use these estimated coefficients to compute the values of the contrast function (2.1).Z)]'). However.1. Show that the follow ZDf3 is identifiable.2. (a) Show tha!.y)!(z.6. (a) The parameterization f3 (b) ZD is of rank d.2. y) > 0 be a weight funciton such that E(v(Z.
Chon) shonld be read row by row. i = 1. .229 4. 0) = II j=q+l k 0.182 0.. different from Y? I I TABLE 2. .5.jf3j for given covariate values {z..599 0.097 1.j}. 31.564 1. = nq = 0. Is ji.081 2.. 1'.114 1. suppose some of the nj are zero. ' I (f) The following data give the elapsed times YI . <J > 0. _'..393 3. in this model the optimal " MSPE predictor of the future Yi+ 1 given the past YI .ZD.nk > 0.. ~ ~ ~Show that the MLE of OJ is 0 with OJ = nj In.870 30.379 2.064 5.968 2.~=l Z.249 1.834 3. .+l I Y .397 4.716 7. .091 3.i.300 6.d.391 0.1.958 10. = (Y  29.1..093 5. then the  f3 that minimizes ZD. The data (courtesy S. Show that the MLE of . In the multinomial Example 2.6) is given by (2.6) W. (See Problem 2. j = 1.ZD.) (c) Find a matrix A such that enxl = Anx(n+l)ECn+l)xl' (d) Find the covariance matrix W of e. k.€n+l are i.666 3. Let ei = (€i + {HI )/2. . .301 2.156 5.916 6. Suppose YI .274 5.455 9. Consider the model Yi = 11 + ei.. 4 (b) Show that Y is a multivariate method of moments estimate of p. That is. I Yi is (/1 + Vi).196 2.. nq+1 > p(x.908 1.968 9. .20)..665 2.) ~ ~ (I' + 1'. .058 3.2.. /1i + a}. .019 6. n.676 5. Then = n2 = .'.457 2.ZD.582 2.020 8.046 2.a. Hint: Suppose without loss of generality that ni 0.8. where 1'.6) T 1 (Y .17. Use a weighted least squares c~mputer routine to compute the weighted least squares estimate Ii of /1.360 1. .053 4.476 3..2. k. .856 2..971 0. = L. Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay.611 4. with mean zero and variance a 2 • The ei are called moving average errors.093 1.703 4.392 4. i = 1. which vanishes if 8 j = 0 for any j = q + 1.6)T(y ..039 9... \ (e) Find the weighted least squares estimate of p.921 2.100 3.453 1.689 4. ..669 7.918 2.038 3. . n.148 ~ Methods of Estimation Chapter 2 (b) Show that if Z  D has rank d.054 1.858 3.723 1.071 0. J. (a) Show that E(1'. ...788 2...860 5. Yn are independent with Yi unifonnly distributed on [/1i .155 4.).131 5. where El" .480 5. .075 4. Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore.(Y . 2.511 3.
i. These least absolute deviation estimates (LADEs).32(b).{3p.I/a}.  Illi < 1.fin are called (b) If n is odd. Y( n) denotes YI . If n is even. .) + Y(r+l)] where r = ~n. The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj). where Iii = ~ L:j:=1 ZtjPj· 32. A > 0. ~ 33. where Jii = L:J=l Zij{3j. Let X..d. .d.. 34. . Let B = arg max Lx (8) be "the" MLE.. . and x. Find the MLE of A and give its mean and variance. the sample median yis defined as ~ [Y(.. An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass ./3p that minimizes the maximum absolute value conrrast function maxi IYi . a) is obtained by finding (31.Ili I and then setting (j = n~I L~ I IYi .1. XHL i& at each point x'. be i. = L Ix.Jitl. Yn are independent with Vi having the Laplace density 1 2"exp {[Y. ~ f3I.. Hint: See Example 1. then the MLE exists and is unique. .28Id(F.4.{:i. . = I' for each i. Give the MLE when .. as (Z..5 Problems and Complements 149 (f3I' . be i.. .... Suppose YI .. be the observations and set II = ~ (Xl . + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x . with density g(x . . " . Let g(x) ~ 1/". Show that XHL is a plugin estimate of BHL. . i < j. . . Hint: Use Problem 1.tLil and then setting a = max t IYi .>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}. .d. Z and W are independent N(O..Section 2.i.. (See (2. . Show that the sample median ii is the minimizer of L~ 1 IYi . . 1)... 35. the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l).8). ./Jr ~ ~ and iii.) Suppose 1'.J.(1 + x'j.3. (3p.I'. .:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x.x.. (a) Show Ihat if Ill[ < 1. a)T is obtained by finding 131.l1. (a) Show that the MLE of ((31.6.). Yn ordered from smallest to largest.7 with Y having the empirical distribution F. . Let x. let Xl and X.& at each Xi..17). 8 E R. be the Cauchy density. . F)(x) *F denotes convolution....(3p that minimizes the least absolute deviation contrast function L~ I IYi . x E R.O) Hint: See Problem 2. y)T where Y = Z + v"XW.2.
2. .ryt» denote the KullbackLeibler divergence between p( x. 36. Also assume that S. .B )'/ (7'. B) and pix. Let 9 be a probability density on R satisfying the following three conditions: I. Let Xi denote the number of hits at a certain Web site on day i.B)g(x. such that for every y E (a. B ) and p( X.. 'I i 39. Assume that 5 = L:~ I Xi has a Poisson. . N(B. Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x . 51.B I ) (K'(ryo. Suppose X I. ry) = p(x. 9 is twice continuously differentiable everywhere except perhaps at O. P(nA).h(y . (a. Suppose h is a II function from 8 Onto = h(8). distribution. .. . Assume :1... 1 n + m. and 8 2 are independent.d. and 5. ryl Show that o n ». BI ) (p' (x. Find the MLEs of Al and A2 hased on 5. . o Ii !n 'I . If we write h = logg. Let Xl and X2 be the observed values of Xl and X z and write j.0).x2)/2.Xn are i. symmetric about 0. where x E Rand (} E R. b).. Let K(Bo.. X. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . Show that (a) The likelihood is symmetric about ~ x. > h(y) . Ii I. ~ (d) Use (c) to show that if t> E (a. . B) = g(xB). a < b.+~l Vi and 8 2 = E.. then the MLE is not unique. ryo) and p' (x. .. Problem 35 can be generalized as follows (Dhannadhikari and JoagDev. ~ (XI + x.f)) in the likelihood equation. . 3. and positive everywhere. Define ry = h(B) and let p' (x. where Al +..b). Let Vj and W j denote the number of hits of type 1 and 2 on day j.B) g(x + t> .B). 37.t> .i.B)g(i . B) denote their joint density. .j'.) be a random samplefrom the distribution with density f(x. . Let X ~ P" B E 8. Let (XI.\2 = A. The likelihood function is given by Lx(B) g(xI . Sl. ~ Let B ~ arg max Lx (B) be "the" MLE. = I ! I!: Ii.. (b) Either () = x or () is not unique.+~l Wj have P(rnAl) and P(mA2) distributions. 38. .. then B is not unique. B) is and that the KullbackLiebler divergence between p(x.h(y) . Show that the entropy of p(x. n. then h"(y) > 0 for some nonzero y. j = n + I.150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. i = 1. 1985). On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). (7') and let p(x. that 8 1 E. 9 is continuous. h I (ry») denote the density or frequency function of X for the ry parametrization.)/2 and t> = (XI . Bo) is tn( B.
1. where 0 (a) Show that a solution to this equation is of the form y (a. 0). X n be a sample from the generalized Laplace distribution with density 1 O + 0.) Show that statistics.. . Aj) + Ei. (tn. . x > 0. 1'). . n.. i = 1. n > 4.. and 1'. a.2. . Suppose Xl. .. 1). and g(t.I[X.~t square estimating equations (2.c n are uncorrelated with mean zero and variance (72. (. 0 > O.Section 2. a get.. and fj.1. 41. . + C. j = 1.1. [3. For the ca.olog{1 + exp[[3(t.7) for a.. 1959.1'.. < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case. .O) ~ a {l+exp[[3(tJ1. /3. > OJ and T. . 1 X n satisfy the autoregressive model of Example 1.~e p = 1.)/o)} (72. = LX. Give the least squares where El" •.j ~ x <0 1. 1 En are uncorrelated with mean 0 and variance estimating equations (2. Yn). and /1. = log a . exp{x/O. 6. [3. A) = g(z. 1'. 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0.. a > 0. .}. on a population of a large number of organisms.5 Problems and Complements 151 40. h(z. n where A = (a. . . o + 0. An example of a neural net model is Vi = L j=l p h(Zij. The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards. yd 1 ••• .)/o]}" (b) Suppose we have observations (t I . j 1 1 cxp{ x/Oj}.5.  J1.0). [3.7) for estimating a. . [3 > 0. (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism.. Variation in the population is modeled on the log scale by using the model logY. [3. Let Xl.p.I[X. Seber and Wild. . I' E R. T. i = 1. = L X. give the lea. where OJ > O. 42. ..
the bound is sharp and is attained only if Yi x. Xn· Ip (x.152 Methods of Estimation Chapter 2 (aJ If /' is known.5). gamma. Yn are independent PlY.) 1 II(Z'i /1)2 (b) If j3 is known.).3. ..::. . C2' y' 1 1 for (a) Show that the density of X = (Xl.i. = L(CI + C.l.'" £n)T of autoregression errors. .l. + 8. L x.3. (One way to do this is to find a matrix A such that enxl = Anxn€nx 1.(8 .3.fi) = 1log p pry.. Hint: Let C = 1(0)..) Then find the weighted least square estimate of f.. • II ~. Let Xl..3.. t·. Under what conditions on (Xl 1 .Xi)Y' < L(C1 + c.02): 01 > 0. '12 = A and where r denotes the gamma function.0. I . 3. . There exists a compact set K c e such that 1(8) < c for all () not in K. . Hint: I .. + 0. . = OJ. X n be i. . " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4.. . a. > 0. . fi exists iff (Yl . find the covariance matrix W of the vector € = (€1. S.3 1. . This set K will have a point where the max is attained.p)... = 1] = p(x"a. write Xi = j if the ith plant has genotype j.. 0 for Xi < _£I. < . . n > 2.1. + C1 > 0).0 .)I(c.. (X. In a sample of n independent plants. Consider the HardyWeinberg model with the six genotypes given in Problem 2. show that the MLE of fi is jj =  2.4) and (2.x. < I} and let 03 = 1. = If C2 > 0. 1 Xl <i< n. Let = {(01.y.15.. Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p.. .d. Yn) is not a sequence of 1's followed by all O's or the reverse. . < Show that the MLE of a. Suppose Y I .l  L: ft)(Xi /. Is this also the MLE of /1? Problems for Section 2.. Prove Lenama 2. n i=l n i=l n i=1 n i=l Cl LYi + C. fi) = a + fix..J. Give details of the proof or Corollary 2. r(A.1 > _£l. (b) Show that the likelihood equations are equivalent to (2. I < j < 6.. _ C2' 2..x.
.1 <j < ac k1. In the heterogenous regression Example 1. _. 12.. fe(x) wherec.6.3.'~ve a unique solution (.1). 1 <j< k1.. ji(i) + ji.. But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations. . 10. C! > 0. (3) Show that. 0 < Zl < .1 to show that in the multinomial Example 2. See also Problem 1.z=7 11. . (b) Show that if a = 1. .d.A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00. if n > 2.en. > 3. Then {'Ij} has a subsequence that converges to a point '10 E f.n) are the vertices of the :Ij <n}. the MLE 8 exists but is not unique if n is even. the MLE 8 exists and is unique. Let Xl. show 7.t)... (O.10 with that the MLE exists and is unique. " (a) Show that if Cl: > ~ 1.) ~ Ail = exp{a + ilz. < Zn. Hint: If BC has positive volume. Let XI. n > 2. • It) w' ( Xi It) . .3.0 < ZI < . . .l E R.Xn E RP be i.1 (a) ~ ~ c(a) exp{ Ix .3.8) .1} 0 OJ = (b) Give an algorithm such that starting at iP = 0. [(Ai).. . Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 . Hint: The kpoints (0.ill T exists and is unique. Use Corollary 2. 0:0 = 1.3.6. Prove Theorem 2. 8.. e E RP. < Zn Zn.0. then it must contain a sphere and the center of the sphere is an interior point by (B.. convex set {(Ij. the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i .n.. Let Y I . Suppose Y has an exponential. . with density.0). MLEs of ryj exist iffallTj > 0.lkIl :Ij >0. . X n be Ll. 9. u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm.. w( ±oo) = 00. .. .3.fl. .. and assume forw wll > 0 so that w is strictly convex. J.}. Zj < .Ld. . and Zi is the income of the person whose duration time is ti.. a(i) + a.. (0.40.. Show that the boundary of a convex C set in Rk has volume 0.' . distribution where Iti = E(Y. .. ~fo (X u I. < Show that the MLE of (a.0).O.. Yn denote the duration times of n independent visits to a Web site.Section 2..9.. .5 Problems and Complements 153 n 6.
pal0'2 + J11J.4. 0'1 Problems for Section 2. (I't') IfEPD 8a2 a2 D > 0 an d > 0. . En i. (a) Show that the MLEs of a'f. O'~.1.. . N(O. b successively. E" . (b) If n > 5 and J11 and /12 are unknown. b) ~ I w( aXi . respectively. Note: You may use without proof (see Appendix B. > 0. b) = x if either ao = 0 or 00 or bo = ±oo. .. .6. = (lin) L:. show that the estimates of J11.154 Methods of Estimation Chapter 2 (c) Show that for the logistic distribution Fo(x) [1 + exp{ X}]I. Show that if T is minimal and t: is open and the MLE doesn't exist. 1985. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model i: y ~ ZDJ3 + <. provided that n > 3. EeT = (J11. (i) If a strictly convex function has a minimum. 8aob :1 13. Apply Corollary 2.n log a is strictly convex in (a.(Yi .. I (Xi .3. b) and lim(a.6. and p = [~(Xi .b) . il' i.1 and J1.)(Yi .:. See. . Golnb and Van Loan. P coincide with the method of moments estimates of Problem 2.4 1. Y. ~J ' .J1. "i .2)2. L:.2.. O'~ + J1i. complete the Estep by finding E(Zl I l'i) and E(ZiYi I Yd· (b) In Example 2.l and CT..bo) D(a. Let (X10 Yd.2)/M'''2] iI I'.'2). (Xn . .) Hint: (a) Thefunction D( a.J1.9). EM for bivariate data. 2. 7if)F 8 08 D 802 vb2 2 2 > (a'D)2 ' then D'IS strictIy convex.1)".2).. (See Example 2. rank(ZD) = k.4. it is unique. ar 1 a~. IPl < 1.d. /12. b = : and consider varying a. {t2. W is strictly convex and give the likelihood equations for f.) has a density you may assume that > 0. ag. verify the Mstep by showing that E(Zi I Yi).J1. .8.. . and p when J1.J1. Hint: (b) Because (XI.' ! (a) In the bivariate nonnal Example 2. then the coordinate ascent algorithm doesn't converge to a member of t:. J12.) I .4.i.:. (b) Reparametrize by a = . O'?.2 are assumed to be known are = (lin) L:. for example. p) population..3.. Chapter 10. Yn ) be a sample from a N(/11. (Check that you are describing the GaussSeidel iterative method for solving a system of linear equations. O'~ + J1~.b)_(ao.. 3.
(c) Give explicitly the maximum likelihood estimates of Jt and 5. the model often used for X is that it has the conditional distribution of a B( n. is distributed according to an exponential ryl < n} = i£(1 I) uzuy.[lt ~ 1] = A ~ 1 . . be independent and identically distributed according to P6. 1].(1 _ B)n]{x nB . it is desired to estimate the proportion 8 that has the genetic trait. Y (~ A 2:7 I(Yi  Y)'  O"~)/ (0". and given II = j. I' >.B)"]'[(1 .(I.[1 . 1 < i < n.B)n} _ nB'(I.. j = 0. and that the (b) Use (2. where 8 = 80l d and 8 1 estimate of 8. Do you see any problems with >. B) variable.1) x Rwhere P.5 Problems and Complements 155 4. (a) Justify the following crude estimates of Jt and A.[lt ~ OJ. I n ) 8"(10)"" (a) Show that P(X = x MLE exists and is unique.n._x(1. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it. For families in which one member has the disease.(1 . B= (A. Y1 '" N(fLl a}).I ~ + (1  B)n] . Suppose that in a family of n members in which one has the disease (and. 6.P. 2:Ji) . when they exist.2B)x + nB'] _. Because it is known that X > 1. as the first approximation to the maximum likelihood . IX > 1) = \ X ) 1 (1 6)" .{(Ii. also the trait). ~2 = log C\) + (b) Deduce that T is minimal sufficient. Hint: Use Bayes rule. 1') E (0.. X is the number of members who have the trait.. Suppose the Ii in Problem 4 are not observed.B)[I.4.. B E [0.O"~). given X>l..3) to show that the NewtonRaphson algorithm gives ~ _ 81 ~ 8 ~ _ ~ _ B(I. i ~ 1'. Let (Ii. 2 (uJ L }iIi + k 2: Y. .and Msteps of the EM algorithm for this problem. Yd : 1 < i family with T afi I a? known..? (b) Give as explicitly as possible the E.Ii). 8new . 1 and (a) Show that X . thus.B)n[n .x = 1.Section 2. Y'i).
N .c Pabc = l. 1 <c < C and La. (a) Suppose for aU a.X3 be independent observations from the Cauchy distribution about f(x.riJ(A) . X.2 noting that the sequence of iterates {rymJ is bounded and. N++ c N a +c N+ bc Pabc = . find (}l of (b) above using (} =   x/n as a preliminary estimate. X n be i.N+bc where N abc = #{i : Xi = (a. Hint: Apply the argument of the proof of Theorem 2. . (a) Deduce that depending on where bisection is started the sequence of iterates may converge to one or the other of the local maxima (b) Make a similar study of the NewtonRaphson method in this case.1 (1 + (x e.1 generated by N++c.156 (c) [f n = 5. . * maximizes = iJ(A') t. X 3 = a. X Methods of Estimation Chapter 2 = 2.9)')1 Suppose X. = 1.v ~ b I W = c] = P[U = a I W = c]P[V = b I W ~ i. . V = b.e. . PIU = a.b. Consider the following algorithm under the conditions of Theorem 2. hence.. v < 00. c]' Show that this holds iff PIU ~ a. (1) 10gPabc = /lac = Pabe.1) + C(A + B2) = C(A + B1) . iff U and V are independent given W.X 2 . 8.. the sequence (11ml ijm+l) has a convergent subse quence.4.4. Let Xl. + Vbc where 00 < /l. Let and iJnew where ). V. 9.Na+c. 1 < b < B. b1 c and then are . Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between p and a.9) = ". b1 C. W = c] 1 < a < A.d. n N++ c ++c  . (c) Show that the MLEs exist iff 0 < N a+c . N +bc given by < N ++c for all a.c)} and "+" indicates summation over the ind~x. where X = (U. 7. . Show that the sequence defined by this algorithm converges to the MLE if it exists.i. Define TjD as before. v vary freely is an exponential family of rank (C . (h) Show that the family of distributions obtained by letting 1'.A(iJ(A)).2. W). Let Xl. ~ 0.b.
I) + (B . Initialize: pd: O) = N a ++ N_jo/>+ N++ c abc nnn Pabc d2) dl) Nab+ Pabc dO) n n Pab+ d1) dl) dO) Pabc d3) N a + c Pabc Pa+c N+bc Pabc d 2) Pabc n P+bc d2)· Reinitialize with ~t~.a) 1 2 .[X = x I SeX) = s] = 13. Hint: P. c" and "a.I) ~ AB + AC + BC .Section 2. c" parameters. Hint: Note that because {p~~~} belongs to the model so do all subsequent iterates and that ~~~ is the MLE for the exponential family Pabc = ellauplO) _=.N++c/A. Let f.5 has the specified mixtnre of Gaussian distribution.I)(C . N±bc .I)(C . Suppose X is as in Problem 9. = fo(x  9) where fo(x) 3'P(x) + 3'P(x . f vary freely.and Msteps of the EM algorithm in this case.(A + B + C).N++ c / B. Justify formula (2.1)(B . Nl+co (c) The model implies Pabc = P+bcPa±c/P++c and use the likelihood equations. (a) Show that S in Example 2..b'. (b) Give explicitly the E. 10.3 + (A .:.5 Problems and Complements 157 Hint: (b) Consider N a+ c .c' obtained by fixing the "b. Show that the algorithm converges to the MLE if it exists and di· verges otheIWise.c.4.4. (a) Show that this is an exponential family of rank A + B + C .a>!b"..8).1) + (A . 12. 11.=_ L ellalblp~~~'CI a'. but now (2) logPabc = /lac + Vbc + 'Yab where J1.(x) :1:::: 1(S(x) ~ = s). v. (b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model.
2. the process determining whether X j is missing is independent of X j .6 . necessarily 0* is the global maximizer. where E) .. If /12 = 1.. .{Xj }.5. 18. Complete the E.p is the N(O} '1) density. A. This condition is called missing at random. En a an 2. Establish the last claim in part (2) of the proof of Theorem 2. 17. Bm + I)} has a subsequence converging to (B* . but not on Yi.4. 14. given X . For X . Show for 11 = 1 that bisection may lead to a local maximum of the likelihood.4.~ go.4. ~ 16.5. thus. {31 1 a~. I I' . NOTES . Bm+d} has a subsequence converging to (0" JJ*) and. Hint: Use the canonical nature of the family and openness of E. Hint: Show that {(19 m. . Verify the fonnula given in Example 2." . That is.) underpredicts Y.4. and independent of El.d. we observe only}li. the "missingness" of Vi is independent of Yi. = {(Zil Yi) Yi :i = 1.1 (1) "Natural" now was not so natural in the eighteenth century when the least squares principle was introduced by Legendre and Gauss. {32). Hint: Show that {( 8m . . Limitations a/the EM Algorithm. al = a2 = 1 and p = 0. Ii .. in Example 2. find the probability that E(Y."" En.3. this assumption may not be satisfied. . Establish part (b) of Theorem 2. Noles for Section 2. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected. suppose Yi is missing iff Vi < 2.158 Methods of Estimation Chapter 2 and r. Then using the Estep to impute values for the missing Y's would greatly unclerpredict the actual V's because all the V's in the imputation would have Y < 2. the probability that Vi is missing may depend on Zi.6.n}. For instance. In Example 2. a 2 . . Zn are ii. .d. EM and Regression. That is. ~ .and Msteps of the EM algorithm for estimating (ftl. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component X j of the data vector X is missing given the rest of the data vector is not a function of X j . Zl. For example. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986). If Vi represents the seriousness of a disease.3 for the actual MLE in that example.. I Z. given Zi.4. . consider the model I I: ! I I = J31 + J32 Z i + Ei I are i. These considerations lead essentially to maximum likelihood estimates. ! . (2) The frequency plugin estimates are sometimes called Fisher consistent. 2 ). • ... N(ftl.6.i.. N(O. Suppose that for 1 < i < m we observe both Zi and Yi and for m + 1 < i < n. if a is sufficiently large. (J*) and necessarily g. 15. suppose all subjects with Yi > 2 drop out of the study. R.
Notes for Section 2.2

(1) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964).

Notes for Section 2.3

(1) Recall that in an exponential family, for any $A$, $P[T(X) \in A] = 0$ for all or for no $P \in \mathcal{P}$.

(2) For further properties of Kullback-Leibler divergence, see Cover and Thomas (1991).

Note for Section 2.5

(1) In the econometrics literature (e.g., Campbell, Lo, and MacKinlay, 1997), a multivariate version of minimum contrast estimates are often called generalized method of moment estimates.

2.7 REFERENCES

BARLOW, R. E., D. J. BARTHOLOMEW, J. M. BREMNER, AND H. D. BRUNK, Statistical Inference Under Order Restrictions New York: Wiley, 1972.

BAUM, L. E., T. PETRIE, G. SOULES, AND N. WEISS, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Ann. Math. Statist., 41, 164-171 (1970).

BISHOP, Y., S. FEINBERG, AND P. HOLLAND, Discrete Multivariate Analysis: Theory and Practice Cambridge, MA: MIT Press, 1975.

CAMPBELL, J., A. LO, AND A. C. MACKINLAY, The Econometrics of Financial Markets Princeton, NJ: Princeton University Press, 1997.

COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.

DAHLQUIST, G., A. BJORK, AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.

DEMPSTER, A. P., N. M. LAIRD, AND D. B. RUBIN, "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm," J. Roy. Statist. Soc. B, 39, 1-38 (1977).

DHARMADHIKARI, S., AND K. JOAG-DEV, "Examples of Nonunique Maximum Likelihood Estimators," The American Statistician, 39, 199-200 (1985).

EISENHART, C., "The Meaning of Least in Least Squares," Journal Wash. Acad. Sciences, 54, 24-33 (1964).

FAN, J., AND I. GIJBELS, Local Polynomial Modelling and Its Applications London: Chapman and Hall, 1996.

FISHER, R. A., "On the Mathematical Foundations of Theoretical Statistics," reprinted in Contributions to Mathematical Statistics (by R. A. Fisher, 1950) New York: J. Wiley and Sons, 1922.

GOLUB, G. H., AND C. F. VAN LOAN, Matrix Computations Baltimore: Johns Hopkins University Press, 1985.

HABERMAN, S. J., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
" J. 63. S. C. I • I . WAND. 1989. WEISBERG. A. SEBER. 1987. RUPPERT.160 Methods of Estimation Chapter 2 KOLMOGOROV." Ann. A. New York: Wiley. WU. Amer. Nonlinear Regression New York: Wiley. 1997.. WILD. AND D. B. I 1 . AND T. i I. 102108 (1956). Ames. .623656 (1948). J. AND M. 22.. Statisr.1. 10. IA: Iowa State University Press.. SHANNON. COCHRAN. "On the Shannon Theory of Information Transmission in the Case of Continuous Signals. J. "A Flexible Growth Function for Empirical Use. 128 (1968).. SNEDECOR.. 290300 ( 1959). Theory. G. G. MA: Harvard University Press. E... D. "Association and Estimation in Contingency Tables... KR[SHNAN. 2nd ed. Statist.. A." 1.·sis with Missing Data New York: J. 1985. Statistical Methods.. F. The History ofStatistics Cambridge."/RE Trans! Inform.. Applied linear Regression. E. Botany. MACLACHLAN. Statistical Anal. P. LIlTLE. E. Journal.. Exp. Wiley. RUBIN.379243. S.. 27. "A Mathematical Theory of Communication.. J. STIGLER. 1986. R. 13461370 (1994).95103 (1983) . MOSTELLER. 11. W. N. AND W. 1%7. In. RICHARDS. C. "On the Convergence Properties of the EM Algorithm. • . 6th ed. AND c. G. "Multivariate Locally Weighted Least Squares Regression. F. J. The EM Algorithm and Extensions New York: Wiley." Ann. Statist. Assoc." Bell System Tech.
Chapter 3

MEASURES OF PERFORMANCE, NOTIONS OF OPTIMALITY, AND OPTIMAL PROCEDURES

3.1 INTRODUCTION

Here we develop the theme of Section 1.3, which is how to appraise and select among decision procedures. In Sections 3.2 and 3.3 we show how the important Bayes and minimax criteria can in principle be implemented. However, actual implementation is limited. Our examples are primarily estimation of a real parameter. We also discuss other desiderata that strongly compete with decision theoretic optimality, in particular computational simplicity and robustness. In Section 3.4 we study, in the context of estimation, the relation of the two major decision theoretic principles to the non-decision theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness. We return to these themes in Chapter 6, after similarly discussing testing and confidence bounds in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case.

3.2 BAYES PROCEDURES

Recall from Section 1.3 that if we specify a parametric model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, action space $\mathcal{A}$, and loss function $l(\theta, a)$, then for data $X \sim P_\theta$ and any decision procedure $\delta$, randomized or not, we can define its risk function $R(\cdot, \delta) : \Theta \to R^+$ by

$R(\theta, \delta) = E_\theta\, l(\theta, \delta(X)).$

We think of $R(\cdot, \delta)$ as measuring a priori the performance of $\delta$ for this model. Strict comparison of $\delta_1$ and $\delta_2$ on the basis of the risks alone is not well defined unless $R(\theta, \delta_1) \le R(\theta, \delta_2)$ for all $\theta$ or vice versa. However, by introducing a Bayes prior density (say) $\pi$ for $\theta$, comparison becomes unambiguous by considering the scalar Bayes risk,

$r(\pi, \delta) = E\, R(\theta, \delta) = E\, l(\theta, \delta(X))$    (3.2.1)
2.5. such that (3.2. oj = ".3 we showed how in an example.(O)dfI p(x I O).6) In the discrete case. Thus. for instance. In view of formulae (1. Our problem is to find the function 6 of X that minimizes r(".) ~ inf{r(".2. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). j I 1': II . and 1r may express that we care more about the values of the risk in some rather than other regions of e. we find that either r(1I". . 6) ~ E(q(O) . We don't pursue them further except in Problem 3. 0) 2 and 1T2(0) = 1.3) In this section we shall show systematically how to construct Bayes rules.5).2. In the continuous case with real valued and prior density 1T. 'f " . 1(0. 1 1  . We first consider the problem of estimating q(O) with quadratic loss.r= . 6) Bayes rule 6* is given by = a? 6'(X) = E[q(O) I XJ.(O)dO (3.4) the parametrization plays a crucial role here. e "().. B denotes mean height of people in meters. it is plausible that 6..5) This procedure is called the Bayes estimate for squared error loss. (0. This exercise is interesting and important even if we do not view 1r as reflecting an implicitly believed in prior distribution on ().. Suppose () is a random variable (or vector) with (prior) frequency function or density 11"(0). After all. as usual. Clearly. (0.. we can give the Bayes estimate a more explicit form. x _ !oo= q(O)p(x I 0).2.6(X)j2.6).2. we just need to replace the integrals by sums.8) for the posterior density and frequency functions. if 1[' is a density and e c R r(". we could identify the Bayes rules 0. considering I. It is in fact clear that prior and loss function cannot be separated out clearly either. We may have vague prior notions such as "IBI > 5 is physically implausible" if. and that in Section 1. if computable will c plays behave reasonably even if our knowledge is only roughly right. If 1r is then thought of as a weight function roughly reflecting our knowledge.(O)dfI (3. This is just the prohlem of finding the best mean squared prediction error (MSPE) predictor of q(O) given X (see Remark 1. 00 for all 6 or the Using our results on MSPE prediction.2. Here is an example.2) the Bayes risk of the problem. (3. For testing problems the hypothesis is often treated as more important than the alternative.4) !.6) = J R(0. and instead tum to construction of Bayes procedure.(0) a special role ("equal weight") though (Problem 3..4.6): 6 E VJ (3.2.(0)1 . 0) and "I (0) is equivalent to considering 1 (0. 0) = (q(O) using a nonrandomized decision rule 6.2. X) is given the joint distribution specified by (1.162 Measures of Performance Chapter 3 where (0. Recall also that we can define R(..3).
6. Applying the same idea in the general Bayes decision problem.r( 7r. E(Y I X) is the best predictor because E(Y I X ~ . Suppose that we want to estimate the mean B of a nonnal distribution with known variance a 2 on the basis of a sample Xl.00 with. 1]0 fixed).2.a)' I X ~ x) as a function of the action a. tends to a as n _ 00.E(e I X))' I X)] E[(J'/(l+~)]=nja 2 + 1/72 ' 1 2 n n/ No finite choice of '1]0 and 7 2 will lead to X as a Bayes estimate. . . X is the estimate that (3. Intuitively.7) reveals the Bayes estimate in the proper case to be a weighted average MID + (1 .1.2. take that action a = J*(x) that makes r(a I x) as small as possible. This quantity r(a I x) is what we expect to lose.2. If we choose the conjugate prior N( '1]0. " Xn.1). a 2 / n. . we obtain the posterior distribution The Bayes estimate is just the mean of the posterior distribution J'(X) ~ 1]0 l/T' ] _ [ n/(J2 ] [n/(J' + l/T' + X n/(J' + l/T' (3. 0 We now tum to the problem of finding Bayes rules for gelleral action spaces A and loss functions l.5.Section 3.w)X of the estimate to be used when there are no observations. a) I X = x). we see that the key idea is to consider what we should do given X = x. Such priors with f 7r(e) ~ 00 or 2: 7r(0) ~ 00 are called impropa The resulting Bayes procedures are also called improper. /2) as in Example 1. But X is the limit of such estimates as prior knowledge becomes "vague" (7 . However.2 Bayes Procedures 163 Example 3. Because the Bayes risk of X. If we look at the proof of Theorem 1. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior.) minimizes the conditional MSPE E«(Y . we fonn the posterior risk r(a I x) = E(l(e. r) . For more on this. see Section 5.E(e I X))' = E[E«e .2.7) Its Bayes risk (the MSPE of the predictor) is just r(7r.12. J') ~ 0 as n + 00. In fact. In fact. and X with weights inversely proportional to the Bayes risks of these two estimates.2. we should. This action need not exist nor be unique if it does exist. that is. for each x. X is approximately a Bayes estimate for anyone of these prior distributions in the sense that Ir( 7r.1.6) yields. J' )]/r( 7r. Fonnula (3.4. '1]0. if we substitute the prior "density" 7r( 0) = 1 (Prohlem 3. Thus. if X = :x and we use action a. To begin with we consider only nonrandomized rules. J') E(e . the Bayes estimate corresponding to the prior density N(1]O' 7 2 ) differs little from X for n large.
35. and aa are r(at 10) r(a. Therefore. Suppose that there exists a/unction 8*(x) such that r(o'(x) [x) = inf{r(a I x) : a E A}.. The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all corrpeting procedures. . Therefore.8) Thus. A = {ao.2. by (3.Op}.o(X))] = E[E(I(O.2.. More generally consider the following class of situations. let Wij > 0 be given constants. ?f(O.1.o(X)) I X = x] = r(o(x) Therefore.) = 0. 11) = 5.164 Measures of Performance Chapter 3 Proposition 3.1. :i " " Proof.8. a q }. Suppose we observe x = O. ~ a. and the resnlt follows from (3.89. Then the posterior distribution of o is by (1.o) = E[I(O.9) " "I " • But. r(a.3. consider the oildrilling example (Example 1. .) = 0. 6* = 05 as we found previously. I x) > r(o'(x) I x) = E[I(O. [I) = 3. o As a first illustration. r(a. Bayes Procedures Whfn and A Are Finite.o(X)) I X] > E[I(O. " . E[I(O. 0'(0) Similarly. az has the smallest posterior risk and.2. (3.8). As in the proof of Theorem 1. we obtain for any 0 r(?f.70 and we conclude that 0'(1) = a. (3.o'(X)) I X]. E[I(O.4.67 5.! " j.2.. Example 3. if 0" is the Bayes rule. Let = {Oo. !i n j.74.8) Then 0* is a Bayes rule.5) with priOf ?f(0. I 0)  + 91(0" ad r(a.· .·.o(X)) I X)]. r(al 11) = 8. 10) 8 10.2. .2. the posterior risks of the actions aI.o'(X)) IX = x].2..2.9).2. az. and let the loss incurred when (}i is true and action aj is taken be given by j e e .
Section 3.2
Bayes Procedures
165
Let 1r(0) be a prior distribution assigning mass 1ri to Oi, so that 1ri > 0, i = 0, ... ,p, and Ef 0 1ri = 1. Suppose, moreover, that X has density or frequency function p(x I 8) for each O. Then, by (1.2.8), the posterior probabilities are
and, thus,
raj (
x I) =
EiWipTiP(X I Oi) . Ei 1fiP(X I Oi)
(3.2.10)
The optimal action 6* (x) has
r(o'(x) I x)
=
O<rSq
min r(oj
I x).
Here are two interesting specializations. (a) Classification: Suppose that p
= q, we identify aj with OJ, j
1, i f j
=
0, ... ,p, and let
Wii
O.
This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,
r(Oi
Ix) = Pl8 f ei I X = xl
and minimizing r(Oi I x) is equivalent to the reasonable procedure of maximizing the posterior probability,
PI8 = e I X = i
xl
=
1fiP(X lei) Ej1fjp(x I OJ)
(b) Testing: Suppose p = q = 1, 1ro = 1r, 1rl = 1  'Jr, 0 < 1r < 1, ao corresponds to deciding 0 = 00 and al to deciding 0 = 01 • This is a special case of the testing fonnulation of Section 1.3 with 8 0 = {eo} and 8 1 = {ed. The Bayes rule is then to
decide e decide e
= 01 if (1 1f)p(x I 0, ) > 1fp(x I eo) = eo if (1  1f)p(x IeIl < 1fp(x Ieo)
and decide either ao or al if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing .between ao and al if equality occurs. As we let 'Jr vary between zero and one. we obtain what is called the class of NeymanPearson tests, which provides the solution to the problem of minimizing P (type II error) given P (type I error) < Ct. This is treated further in Chapter 4. D
166
Measures of Performance
Chapter 3
To complete our illustration of the utility of Proposition 3.2. L we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation ofthe Probability ofSuccess in n Bernoulli Trials. Suppose that we wish to estimate () using X I, ... , X n , the indicators of n Bernoulli trials with
probability of success
e, We shall consider the loss function I given by
(0  a)2 1(0, a) = 0(1 _ 0)' 0 < 0 < I, a real.
(3.2.11)
"
" " ,

d
This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for () close to zero, this l((), a) is close to the relative squared error (fJ  a)2 JO, lt makes X have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5. By sufficiency we need only consider the number of successes, S. Suppose now that we have a prior distribution. Then, if aU terms on the righthand side are finite,
rea I k)
(3.2.12)
;,
I
, " " I
,

I
Minimizing this parabola in a, we find our Bayes procedure is given by
J'(k) = E(1/(1  0) I S = k) E(I/O(I  0) IS  k)
;
(3.2.13)

i! , ,.
provided the denominator is not zero. For convenience let us now take as prior density the density br " (0) of the bela distribution i3( r, s). In Example 1. 2.1 we showed that this leads to a i3(k +r,n +s  k) posteriordistributionforOif S = k. Ifl < k < nI and n > 2, then all quantities in (3.2.12) and (3.2.13) are finite, and
.
,
";1 
J'(k)
Jo'(I/(1  O))bk+r,n_k+,(O)dO
J~(1/0(1 O))bk+r,n_k+,(O)dO
B(k+r,nk+sI) B(k + r  I, n  k + s  I) k+rl n+s+r2'
(3.2.14)
where we are using the notation B.2.11 of Appendix B. If k = 0, it is easy to see that a = is the only a that makes r(a I k) < 00. Thus, J'(O) = O. Similarly, we get J'(n) = 1. If we assume a un!form prior density, (r = s = 1), we see that the Bayes procedure is the usual estimate, X. This is not the case for quadratic loss (see Problem 3.2.2). 0
°
"Real" computation of Bayes procedures
The closed fonos of (3.2.6) and (3.2.10) make the compulation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that 0 ~ (0" . .. , Op) has a hierarchically defined prior density,
"(0,, O ,... ,Op) = 2
"1 (Od"2(02
I( 1 ) ... "p(Op I Op_ d.
(3.2.15)
I
Section 3.2
Bayes Procedures
167
Here is an example. Example 3.2.4. The random effects model we shall study in Volume II has
(3.2.16)
where the €ij are i.i.d. N(O, (J:) and Jl, and the vector ~ = (AI, .. ' ,AI) is independent of {€i; ; 1 < i < I, 1 < j < J} with LI." ... ,LI.[ i.i.d. N(O,cri), 1 < j < J, Jl, '"'" N(Jl,O l (J~). Here the Xi)" can be thought of as measurements on individual i and Ai is an "individual" effect. If we now put a prior distribution on (Jl,,(J~,(J~) making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by () = (J.L, (J~, (J~, AI,. '. ,AI) with the Xii I () independent N(f.l + Ll. i . cr~). Then p(x I 0) = lli,; 'Pu.(Xi;  f.l Ll. i ) and
11'(0)
=
11'}(f.l)11'2(cr~)11'3(cri)
II 'PU" (LI.,)
i=l
I
(3.2.17)
where <Pu denotes the N(O, (J2) density. ]n such a context a loss function frequently will single out some single coordinate Os (e.g., LI.} in 3.2.17) and to compute r(a I x) we will need the posterior distribution of ill I X. But this is obtainable from the posterior distribution of (J given X = x only by integrating out OJ, j t= s, and if p is large this is intractable. ]n recent years socalled Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable 0 and the use of Bayesian methods has spread. We return to the topic in Volume II.
Linear Bayes estimates
When the problem of computing r( 1r, 6) and 611" is daunting, an alternative is to consider  of procedures for which r( 'R', 6) is easy to compute and then to look for 011" E 'D  a class 'D that minimizes r( 1f 16) for 6 E 'D. An example is linear Bayes estimates where, in the case of squared error loss [q( 0)  a]2, the problem is equivalent to minimizing the mean squared prediction error among functions of the form a + L;=l bjXj _ If in (1.4.14) we identify q(8) with Y and X with Z, the solution is

8(X)
= Eq(8) + IX 
E(X)]T{3
where f3 is as defined in Section 1.4. For example, if in the 11l0del (3.2.16), (3.2.17) we set q(fJ) = Ll 1 , we can find the linear Bayes estimate of Ll 1 hy using 1.4.6 and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of LI.} is
(3.2.18)
where E(Ll.I) given model
= 0,
X = (XIJ, ... ,XIJ)T, P.
= E(X)
and/l
=
ExlcExA,. For the
168
Var(X I ;)
Measures of Performance
Chapter 3
=E
Var(X,)
I B) +
Var E(XIj
I BJ = E«T~) +
(T~ + (T~,
E Cov(X,j , Xlk I B)
+ Cov(E(X'j I B), E(Xlk I BJ)
0+ Cov(1' + II'!, I' + tl.,)
= (T~ +
(T~,
Cov(tl.
Xlj)
=E
Cov(X,j , tl.,
!: ,,
I',
" From these calculations we find and Norberg (1986).
I B) + Cov(E(XIj I B), E(tl. l I 0» = 0 + (T~, =
(Tt·
f3
and OL(X). We leave the details to Problem 3.2.10.
'I
Linear Bayes procedures are useful in actuarial science, for example, Biihlmann (1970)
if
ii "
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier. the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is (the usually improper) prior 71"(0) ;,::; c. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than
modes and that again accounts in part for the popularity of maximum likelihood. An important property of the MLE is equivariance: An estimating method M producing the estimate (j M is said to be equivariant with respect to reparametrization if for every onetoDne function h from e to fl = h (e). the estimate of w '" h(B) is W = h(OM); that is, M
~
I
= h(BM ). In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss. then the Bayes procedure (}B = E(8 I X) is not equivariant
(h(O»
M
~

~
~
for nonlinear transfonnations because
E(h(O) I X)
op h(E(O I X))
1 ,
,
,,
for nonlinear h (e.g., Problem 3.2.3). The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is
i
re(a I x)
= LIB  a]'7f(O I x).
'Ee
(3.2.19)
i
If we set w = h(O) for h onetoone onto fl = heel, then w has prior >.(w) and in the w parametrization, the posterior Bayes risk is
= 7f(h
1
(w))
I
,
r,,(alx)
= L[waj2>,(wlx)
wE"
L[h(B)  a)2 7f (B I x).
(3.2.20)
'Ee
Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, r,,(a I x) op Te(h 1(a) Ix).
Section 3.2
Bayes Procedures
169
Loss functions of the form l((}, a) = Q(Po, Pa) are necessarily equivariant. The KullbackLeibler divergence K«(},a), (J,a E e, is an example of such a loss function. It satisfies Ko(w, a) = Ke(O, h 1 (a)), thus, with this loss function,
ro(a I x)
~
re(h1(a) I x).
See Problem 2.2.38. In the discrete case using K means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.l 0». In the N(O, case the K L (KullbackLeibler) loss K( a) is ~n(a  0)' (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families
"J)
e,
K(1], a) = L[ryj  ajJE1]Tj
j=1
•
+ A(1]) 
A(a).
Moreover, if we can find the KL loss Bayes estimate 1]BKL of the canonical parameter 7] and if 1] = c( 9) :  t E is onetoone, then the K L loss Bayes estimate of 9 in the
e
general exponential family is 9 BKL = C 1(1]BKd. For instance, in Example 3.2.1 where J.l is the mean of a nonnal distribution and the prior is nonnal, we found the squared error Bayes estimate jiB = wf/o + (lw)X, where 1]0 is the prior mean and w is a weight. Because the K L loss is equivalent to squared error for the canonical parameter p, then if w = h(p), WBKL = h('iiUKL), where 'iiBKL =
~
wTJo + (1  w)X.
Bayes procedures based on the KullbackLeibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume 11.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1%9; Lindley, 1965; Savage, 1954) who argue on nonnative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior 1r. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally as we consider all possible priors that the class 'Do of Bayes procedures and their limits is complete in the sense that for any 6 E V there is a 60 E V o such that R(0, 60 ) < R(0, 6) for all O. Summary. We show how Bayes procedures can be obtained for certain problems by COmputing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures require sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
170
Measures of Performance
Chapter 3
3.3
MINIMAX PROCEDURES
In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worstcase analysis; the true () is one that is as "hard" as possible. That is, 6 1 is better than 82 from a minimax point of view if sUPe R(0,6 1 ) < sUPe R(B, 82 ) and is said to be minimax if
o·
supR(O, 0')
.
= infsupR(O,o).
,
.
,
I·
Here (J and 8 are taken to range over 8 and V = {all possible decision procedures (possibly randomized)} while P = {p. : 0 E e}. It is fruitful to consider proper subclasses of V and subsets of P. but we postpone this discussion. The nature of this criterion and its relation to Bayesian optimality is clarified by considering a socalled zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set V of all randomized decision procedures whereas Nature has at her disposal all prior distributions 1r on 8. For the basic game, 5 picks 0 without N's knowledge, N picks 1f without 5's knowledge and then all is revealed and S pays N
r(Jr,o)
, ,
,
=
J
R(O,o)dJr(O)
where the notation f R(O, o)dJr(0) stands for f R(0, o)Jr(O)dII in the continuous case and L, R(Oj, o)Jr(O j) in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice 0 of 5 before picking 1r and 5 knows the rules of the game. Then
Ii ,
N naturally picks 1f,; such that
r(Jr"o)
that is, 1f,; is leastfavorable against such that
~supr(Jr,o),
o. Knowing the rules of the game S naturally picks 0*
•
(3.3.1)
r( Jr,., 0·) = sup r(Jr, 0') = inf sup r(Jr, 0).
We claim that 0* is minimax. To see this we note first that,
.
,
.
(3.3.2)
for allJr, o. On the other hand, if R(0" 0) = suP. R(0, 0), then if Jr, is point mass at 0" r( Jr" 0) = R(O" 0) and we conclude that supr(Jr,o) = sup R(O, 0)
,
•
•
(3.3.3)
I
,
•
•
Section 3.3
Minimax Procedures
171
and our claim follows. II: S is told the choke 7r of N before picking 6 and N knows the rules of the game. Then S naturally picks 6 1r such that
That is, r5 rr is a Bayes procedure for
11".
Then N should pick
7r*
such that (3.3.4)
For obvious reasons, 1f* is called a least favorable (to S) prior distribution. As we shall see by example, altbough the rigbthand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be umque. The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both
e and D are finite,
?5
11"
then:
(a)
v=supinfr(1I",6), v=infsupr(1I",6)
rr
?5
are both assumed by (say)
7r*
(least favorable), 6* minimax, respectively. Further,
") =v v=r1f,u (
.
(3.3.5)
v and v are called the lower and upper values of the basic game. When v (saY), v is called the value of the game.
=
v
=
v
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification ~ {eo} and = {eIl (Example 3.2.2) but is too reSlrictive in ilS and testing when assumption for the great majority of inference problems. A generalization due to Wald and Karlinsee Karlin (1 959)states that the conclusions of the theorem remain valid if and D are compact subsets of Euclidean spaces. There are more farreaching generalizations but, as we shall see later, without some form of compactness of and/or D, although equality of v and v holds quite generally, existence of least favorable priors and/or minimax procedures may fail.
eo
e,
e
e
The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on and D and are easy to prove.
e
Proposition 3.3.1. Suppose 6**. 7r** can be found such that
U
£** =
r ()1r.',
1f **
= 1rJ••
(3.3.6)
172
Measures of Performance
Chapter 3
that is, 0** is Bayes against 11"** and 11"** is least favorable against 0**. Then v R( 11"** ,0**). That is, 11"** is least favorable and J*'" is minimax.
To utilize this result we need a characterization of 11"8. This is given by
v
=
Proposition 3.3.2.11"8 is leastfavorable against
°iff
1r,jO: R(O,b) = supR(O',b)) = 1.
.'
(3,3,7)
That is, n a assigns probability only to points () at which the function R(·, 0) is maximal.
Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that 11"8 may not be unique. In particular, if R(O, 0) = constant. the rule has constant risk, then all 11" are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
Proof of Proposition 3.3.1. Note first that we always have
v<v
because, trivially,
i~fr(1r,b)
(3.3.8)
< r(1r,b')
(3.3,9)
for aIln, 5'. Hence,
v = sup in,fr(1r,b) < supr(1r,b')
•
(3,3.10)
•
for all 0' and v
<
infa, sUP1r 1'(11", (/) =
v. On the other hand. by hypothesis,
sup1'(1r,6**) > v.
v> inf1'(11"*'",6)
,
= 1'(1I"*",0*'") =
.
(3.3.11)
Combining (3.3.8) and (3.3.11) we conclude that
:r
v
as advertised.
= i~f 1'(11"**,0) = 1'(11"**,6**) = s~p1'(1I",0*"')
= V
(3.3.12)
" 'I
,.;
~l
o
Proofof Proposition 3.3.2. 1r is least favorable for b iff E.R(8,6) =
,.,
f r(O,b)d1r(O) =s~pr( ..,6).
•
(3.3.13)
But by (3.3.3),
supr(..,b) = supR(O, 6),
(3.3,14)
•
i ,
Because E.R(8, 6)
= sUP. R(O, b), (3.3,13) is possible iff (3.3.7) holds.
o
Putting the two propositions together we have the following.
•
Section 3.3
Minimax Procedures
173
11"*
Theorem 3.3.2. Suppose 0* has sUPo R((}, 0*) = 1" < 00. If there exists a prior that 0* is Bayes for 11"* and tr" {(} : R( (}, 0") = r} = 1, then 0'" is minimax.
such
Example 3.3.1. Minimax estimation in the Binomial Case. Suppose S has a B(n,B) distribution and X = Sjn,as in Example 3.2.3. Let I(B, a) ~ (Ba)'jB(IB),O < B < 1. For this loss function,
R(B X) = E(X _B)' , B(IB)

=
B(IB) = ~ nB(IB) n'

and X does have constant risk. Moreover, we have seen in Example 3.2.3 that X is Bayes, when 8 is U(Ol 1). By Theorem 3.3.2 we conclude that X is minimax and, by Proposition 3.3.2, the uniform distribution least favorable. For the usual quadratic loss neither of these assertions holds. The minimax estimate is
o's=S+hln = .,fii X+ I 1 () n+.,fii .,fii+l .,fii+1 2
This estimate does have constant risk and is Bayes against a (J( y'ri/2, vn/2) prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as n t 00 of the ratio of the risks of 0* and X is > 1 for every () =f ~. At B = the ratio tends to 1. Details are left to Problem 3.3.4. 0
!
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity v > 0 for n seconds or, if it is not working, does not answer. Because of the general "noise" level in space the signals received on Earth vary randomly whether the satellite is sending ~r not. The mean voltage per second of the signal for each of the n seconds is recorded. Denote the mean voltage of the signal received through the ith second less expected mean voltage due to noise by Xi. We assume that the Xi are independently and identically distributed as N(p" 0'2) where p, = v, if the satellite functions, and otherwise. The variance 0'2 of the "noise" is assumed known. Our problem is to decide whether "J.l = 0" or"p, = v." We view this as a decision problem with 1 loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk? A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability 1r to and 1  1r to v, use 0  1 loss, and set L(x, 0, v) = p( x Iv) j p(x I 0), then the Bayes test decides I' = v if
°
°°
L(x,O,v)
and decides p,
=
=
exp
°
" {
2"EXi   , a 2a
nv'} >
17l'
if L(x,O,v) <:17l' 7l'
174
Measures of Performance
Chapter 3
This test is equivalent to deciding f.t
= v (Problem 3.3.1) if. and only if,
"yn
1 ;;;:EXi>t,
T=
, , ,
where,
t
If we call this test d,r.
=
" [log " + 2 nv'] ;;;: vyn a
17r
2
'
R(O,J. )
1 ~ <I>(t)
<I>
= <1>( t)
R(v,o,)
(t  vf)
To get a minimax test we must have R(O, 61r ) = R( V, 6ft), which is equivalent to
v.,fii t = t  "'
h •
or
"
','
I ,~
~.
, ,
•
v.,fii t= . 20
•
Because this value of t corresponds to 7r = the intuitive test, which decides JJ only ifT > ~[Eo(T) + Ev(T)J, is indeed minimax.
!.
= v if and
0
'I
•
If is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem
e
3.3.2.
Theorem 3,3,3, Let 0' be a rule such that sup.R(O,o') = r < 00, let {"d denote a sequence of pn'or distributions such that 'lrk;{8 : R(B,o*) = r} = 1, and let Tk = inffJ r( ?rie, J), where r( 7fkl 0) denotes the Bayes risk wrt 'lrk;. If
Tk Task 00,
(3.3.i5)
then J* is minimax. Proof Because r( "k, 0') = r
supR(B, 0') = rk + 0(1)
•
where 0(1)
~
0 as k
~ 00.
But hy (3.3.13) for any competitor 0
supR(O,o) > E.,(R(B,o)) > rk ~supR(O,o') 0(1).
•
•
(3.3.16)
,
'.
If we let k _ suP. R(B, 0').
00
the lefthand side of (3.3.16) is unchanged, whereas the right tends to
0
j
1
•
Section 3.3
M'mimax Procedures
175
Example 3.3.3. Normal Mean. We now show that X is minimax in Example 3.2.1. Identify 1fk with theN(1Jo, 7 2 ) prior where k = 7 2 . Then
whereas the Bayes risk of the Bayes rule of Example 3.2.1 is
i~frk(J) ~ (,,'/n) +7' n
Because
(0
7
2
0
2
=
00,
n  (,,'/n) +7'
a2
1
0
2
n·
2
In)1 « (T2 In) + 7 2 )
>
0 as T 2

we can conclude that
X is minimax.
0
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose XI,'" ,Xn arei.i.d, FE:F
Then X is minimax for estimating B(F) EF(Xt} with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let 1fk be a prior distribution on :F constructed as foUows:(J)
(i)
=
".{F: VarF(XIl # M}
= O.
(ii) ".{ F : F
# N(I', M) for some I'} = O.
(iii) F is chosen by first choosing I' = 6(F) from a N(O, k) distribution and then taking F = N(6(F),M).
Evidently, the Bayes risk is now the same as in Example 3.3.3 with 0"2 evidently,
= M.
Because.
) VarF(X, ) max R(F, X = max :F :F n
Theorem 3.3.3 applies and the result follows. Minimax procedures and symmetry
M
n
o
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" 0, There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and CaseUa (1998), for instance. We shall discuss this approach somewhat, by example. in Chapters 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading. Summary. We introduce the minimax principle in the contex.t of the theory of games. Using this framework we connect minimaxity and Bayes metbods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
More specifically, we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule $\delta$ and N selects a prior $\pi$. The lower (upper) value $\underline{v}$ ($\bar{v}$) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk. When $\underline{v} = \bar{v}$, the game is said to have a value $v$. Von Neumann's Theorem states that if $\Theta$ and $\mathcal{D}$ are both finite, then the game of S versus N has a value $v$, there is a least favorable prior $\pi^*$ and a minimax rule $\delta^*$ such that $\delta^*$ is the Bayes rule for $\pi^*$ and $\pi^*$ maximizes the Bayes risk of $\delta^*$ over all priors. Moreover, $v$ equals the Bayes risk of the Bayes rule $\delta^*$ for the prior $\pi^*$. A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called least favorable. We show that Bayes rules with constant risk, or more generally with constant risk over the support of some prior, are minimax. This result is extended to rules that are limits of Bayes rules with constant risk, and we use it to show that $\bar X$ is a minimax rule for squared error loss in the $N(\theta, \sigma^2)$ model.

3.4 UNBIASED ESTIMATION AND RISK INEQUALITIES

3.4.1 Unbiased Estimation, Survey Sampling

In the previous two sections we have considered two decision theoretic optimality principles, Bayes and minimaxity, for which it is possible to characterize and, in many cases, compute procedures (in particular estimates) that are best in the class of all procedures, according to these criteria. An alternative approach is to specify a proper subclass of procedures, $\mathcal{D}_0 \subset \mathcal{D}$, on other grounds, computational ease, symmetry, and so on, and then see if within $\mathcal{D}_0$ we can find $\delta^* \in \mathcal{D}_0$ that is best according to the "gold standard," $R(\theta, \delta^*) \le R(\theta, \delta)$ for all $\theta$, all $\delta \in \mathcal{D}_0$. Obviously, we can also take this point of view with humbler aims, for example, looking for the procedure $\delta^*_\pi \in \mathcal{D}_0$ that minimizes the Bayes risk with respect to a prior $\pi$ among all $\delta \in \mathcal{D}_0$. This approach has early on been applied to parametric families $\mathcal{D}_0$. When $\mathcal{D}_0$ is the class of linear procedures and $l$ is quadratic loss, the solution is given in Section 3.2. In the non-Bayesian framework, if $Y$ is postulated as following a linear regression model with $E(Y) = z^T\beta$ as in Section 2.2.1, then in estimating a linear function of the $\beta_j$ it is natural to consider the computationally simple class of linear estimates $\delta(Y) = \sum_{i=1}^n d_i Y_i$. This approach, coupled with the principle of unbiasedness we now introduce, leads to the famous Gauss-Markov theorem proved in Section 6.6.

We introduced, in Section 1.3, the notion of bias of an estimate $\hat\theta(X)$ of a parameter $q(\theta)$ in a model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ as

$\mathrm{Bias}_\theta(\hat\theta) = E_\theta\hat\theta(X) - q(\theta).$

An estimate such that $\mathrm{Bias}_\theta(\hat\theta) \equiv 0$ is called unbiased. This notion has intuitive appeal, ruling out, for instance, estimates that ignore the data, such as $\delta(X) = q(\theta_0)$, which can't be beat for $\theta = \theta_0$ but can obviously be arbitrarily terrible. The most famous unbiased estimates are the familiar estimates of $\mu$ and $\sigma^2$ when $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, given by (see Example 1.3.3)
$\hat\mu = \bar X$    (3.4.1)

$\hat\sigma^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$    (3.4.2)

Because for unbiased estimates mean square error and variance coincide, we call an unbiased estimate $\delta^*(X)$ of $q(\theta)$ that has minimum MSE among all unbiased estimates for all $\theta$, UMVU (uniformly minimum variance unbiased). As we shall see shortly for $\bar X$, and in Volume 2 for $s^2$, these are both UMVU. Unbiased estimates play a particularly important role in survey sampling.

Example 3.4.1. Unbiased Estimates in Survey Sampling. Suppose we wish to sample from a finite population, for instance, a census unit, to determine the average value of a variable (say) monthly family income during a time between two censuses, and suppose that we have available a list of families in the unit with family incomes at the last census. Write $x_1, \ldots, x_N$ for the unknown current family incomes and correspondingly $u_1, \ldots, u_N$ for the known last census incomes. We want to estimate the parameter $\bar x = \frac{1}{N}\sum_{j=1}^N x_j$. We let $X_1, \ldots, X_n$ denote the incomes of a sample of $n$ families drawn at random without replacement. We ignore difficulties such as families moving. This leads to the model with $x = (x_1, \ldots, x_N)$ as parameter,

$P[X_1 = a_1, \ldots, X_n = a_n] = \binom{N}{n}^{-1}$ if $\{a_1, \ldots, a_n\} \subset \{x_1, \ldots, x_N\}$, and $= 0$ otherwise.    (3.4.3)

It is easy to see that the natural estimate $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ is unbiased (Problem 3.4.14) and has

$\mathrm{Var}(\bar X) = \frac{\sigma_x^2}{n}\,\frac{N-n}{N-1}$    (3.4.4)

where

$\sigma_x^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \bar x)^2.$    (3.4.5)

This method of sampling does not use the information contained in $u_1, \ldots, u_N$. One way to do this, reflecting the probable correlation between $(u_1, \ldots, u_N)$ and $(x_1, \ldots, x_N)$, is to estimate by a regression estimate

$\bar X_R = \bar X - b(\bar U - \bar u)$    (3.4.6)

where $b$ is a prespecified positive constant, $U_i$ is the last census income corresponding to $X_i$, $\bar U = \frac{1}{n}\sum_{i=1}^n U_i$, and $\bar u = \frac{1}{N}\sum_{j=1}^N u_j$. Clearly, for each $b$, $\bar X_R$ is also unbiased.
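The unbiasedness claims of Example 3.4.1 are easy to check by simulation. The following is a minimal sketch, not from the text: the population values, the sample size, and the constant $b$ in (3.4.6) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population: current incomes x, known last-census incomes u.
N, n = 200, 25
u = rng.gamma(shape=2.0, scale=1500.0, size=N)        # last census incomes (known)
x = 1.1 * u + rng.normal(0.0, 300.0, size=N)          # current incomes (unknown)
x_bar_pop, u_bar_pop = x.mean(), u.mean()

b = 0.8                                               # prespecified constant in (3.4.6)
reps = 20000
xbar_vals = np.empty(reps)
xreg_vals = np.empty(reps)
for r in range(reps):
    idx = rng.choice(N, size=n, replace=False)        # random sample without replacement
    X, U = x[idx], u[idx]
    xbar_vals[r] = X.mean()
    xreg_vals[r] = X.mean() - b * (U.mean() - u_bar_pop)   # regression estimate X_R

print("population mean x-bar :", round(x_bar_pop, 2))
print("mean of X-bar over samples:", round(xbar_vals.mean(), 2))
print("mean of X_R  over samples:", round(xreg_vals.mean(), 2))
print("Var(X-bar), Var(X_R):", round(xbar_vals.var(), 1), round(xreg_vals.var(), 1))
```

Both simulated means agree with $\bar x$ up to Monte Carlo error; because $x$ and $u$ are strongly positively correlated here, $\bar X_R$ also shows the smaller variance.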
If in Example 3.4.1 the correlation of $U_i$ and $X_i$ is positive and $b < 2\,\mathrm{Cov}(U, X)/\mathrm{Var}(U)$, this will be a better estimate than $\bar X$, and the best choice of $b$ is $b_{\mathrm{opt}} = \mathrm{Cov}(U, X)/\mathrm{Var}(U)$. The value of $b_{\mathrm{opt}}$ is unknown but can be estimated by the sample analogue

$\hat b_{\mathrm{opt}} = \frac{\sum_{i=1}^n (U_i - \bar U)(X_i - \bar X)}{\sum_{i=1}^n (U_i - \bar U)^2}.$

The resulting estimate is no longer unbiased but behaves well for large samples.

An alternative approach to using the $u_j$ is to not sample all units with the same probability. Specifically, let $0 < \pi_1, \ldots, \pi_N < 1$ with $\sum_{j=1}^N \pi_j = n$. For each unit $1, \ldots, N$ toss a coin with probability $\pi_j$ of landing heads and select $x_j$ if the coin lands heads. The result is a sample $S = \{X_1, \ldots, X_M\}$ of random size $M$ such that $E(M) = n$ (Problem 3.4.15). A natural choice of $\pi_j$ is proportional to $u_j$. This makes it more likely for big incomes to be included and is intuitively desirable. If the $\pi_j$ are not all equal, $\bar X$ is not unbiased, but the following estimate, known as the Horvitz-Thompson estimate, is:

$\bar X_{HT} = \frac{1}{N}\sum_{i=1}^M \frac{X_i}{\pi_{J_i}}$    (3.4.7)

where $J_i$ is defined by $X_i = x_{J_i}$. To see this, write

$\bar X_{HT} = \frac{1}{N}\sum_{j=1}^N \frac{x_j}{\pi_j}\,1(x_j \in S).$

Because $\pi_j = P[x_j \in S]$ by construction, unbiasedness follows. It is possible to avoid the undesirable random sample size of these schemes and yet have specified $\pi_j$. The Horvitz-Thompson estimate then stays unbiased. Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems. Unbiasedness is also used in stratified sampling theory (see Problem 1.3.4). □

Discussion. However, outside of sampling, the unbiasedness principle has largely fallen out of favor for a number of reasons.

(i) Typically unbiased estimates do not exist; see Bickel and Lehmann (1969) and the problems.

(ii) Bayes estimates are necessarily biased, and minimax estimates often are.

(iii) Unbiased estimates do not obey the attractive equivariance property. If $\hat\theta$ is unbiased for $\theta$, $q(\hat\theta)$ is biased for $q(\theta)$ unless $q$ is linear. They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later.
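Point (iii) is easy to see numerically. The sketch below uses a made-up normal example, not from the text: $\bar X$ is unbiased for $\theta$, but the plug-in estimate $\bar X^2$ of $q(\theta) = \theta^2$ is biased, with bias equal to $\mathrm{Var}(\bar X) = \sigma^2/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, reps = 2.0, 1.0, 10, 200000   # illustrative values only

xbar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)

print("E[X-bar]   ~", round(xbar.mean(), 4), "  target theta   =", theta)
print("E[X-bar^2] ~", round((xbar**2).mean(), 4), "  target theta^2 =", theta**2)
print("predicted bias of X-bar^2:", sigma**2 / n)   # E[X-bar^2] - theta^2 = Var(X-bar)
```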
Nevertheless, unbiased estimates are still in favor when it comes to estimating residual variances. For instance, in the linear regression model $Y = Z_D\beta + \varepsilon$ of Section 2.2, the variance $\sigma^2 = \mathrm{Var}(\varepsilon_i)$ is estimated by the unbiased estimate $s^2 = \hat\varepsilon^T\hat\varepsilon/(n-p)$, where $\hat\varepsilon = Y - Z_D\hat\beta$, $\hat\beta$ is the least squares estimate, and $p$ is the number of coefficients in $\beta$. This preference of $s^2$ over the MLE $\hat\sigma^2 = \hat\varepsilon^T\hat\varepsilon/n$ is in accord with optimal behavior when both the number of observations and the number of parameters are large. Moreover, as we shall see in Chapters 5 and 6, good estimates in large samples are approximately unbiased. We expect that

$|\mathrm{Bias}_\theta(\hat\theta_n)|\big/\sqrt{\mathrm{Var}_\theta(\hat\theta_n)} \to 0$, or equivalently $\mathrm{Var}_\theta(\hat\theta_n)/\mathrm{MSE}_\theta(\hat\theta_n) \to 1$, as $n \to \infty$.

3.4.2 The Information Inequality

The one-parameter case

We will develop a lower bound for the variance of a statistic, which can be used to show that an estimate is UMVU. The lower bound is interesting in its own right, has some decision theoretic applications, and appears in the asymptotic optimality theory of Section 5.4. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. The arguments will be based on asymptotic versions of the important inequalities in the next subsection.

We suppose throughout that we have a regular parametric model and further that $\Theta$ is an open subset of the line. From this point on we will suppose $p(x, \theta)$ is a density. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous-case theorems given later. We make two regularity assumptions on the family $\{P_\theta : \theta \in \Theta\}$.

(I) The set $A = \{x : p(x, \theta) > 0\}$ does not depend on $\theta$. For all $x \in A$, $\theta \in \Theta$, $\frac{\partial}{\partial\theta}\log p(x, \theta)$ exists and is finite.

(II) If $T$ is any statistic such that $E_\theta(|T|) < \infty$ for all $\theta \in \Theta$, then the operations of integration and differentiation by $\theta$ can be interchanged in $\int T(x)p(x, \theta)\,dx$. That is,

$\frac{\partial}{\partial\theta}\Big[\int T(x)p(x, \theta)\,dx\Big] = \int T(x)\frac{\partial}{\partial\theta}p(x, \theta)\,dx$    (3.4.8)

whenever the right-hand side of (3.4.8) is finite. Note that (3.4.8) is assumed to hold in particular if $T(x) = 1$ for all $x$, so that we can interchange differentiation and integration in $\int p(x, \theta)\,dx$.

Assumption II is practically useless as written. What is needed are simple sufficient conditions on $p(x, \theta)$ for II to hold. Some classical conditions may be found in Apostol (1974), p. 167. Simpler assumptions can be formulated using Lebesgue integration theory; for instance, if I holds, then II holds provided that, for all $T$ such that $E_\theta(|T|) < \infty$ for all $\theta$, the integrals $\int T(x)\frac{\partial}{\partial\theta}p(x, \theta)\,dx$ are continuous functions of $\theta$.
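Assumption II can be spot-checked numerically in simple cases. The following minimal sketch, not from the text, takes the $N(\theta, 1)$ density with $T(x) = x$ and compares the two sides of (3.4.8) by a finite-difference derivative and quadrature on a grid; both sides should equal 1, since $\int x\,p(x, \theta)\,dx = \theta$.

```python
import numpy as np

# N(theta, 1) density and its partial derivative in theta
p  = lambda x, th: np.exp(-(x - th) ** 2 / 2) / np.sqrt(2 * np.pi)
dp = lambda x, th: (x - th) * p(x, th)

x = np.linspace(-15.0, 15.0, 20001)        # integration grid (illustrative)
dx = x[1] - x[0]
integral = lambda f: np.sum(f) * dx        # simple Riemann-sum quadrature

theta, h = 1.3, 1e-5
lhs = (integral(x * p(x, theta + h)) - integral(x * p(x, theta - h))) / (2 * h)
rhs = integral(x * dp(x, theta))

print(lhs, rhs)   # both close to 1, so the interchange (3.4.8) checks out here
```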
It is not hard to check (using Laplace transform theory) that a one-parameter exponential family quite generally satisfies Assumptions I and II.

Proposition 3.4.1. If $p(x, \theta) = h(x)\exp\{\eta(\theta)T(x) - B(\theta)\}$ is a one-parameter exponential family and $\eta(\theta)$ has a nonvanishing continuous derivative on $\Theta$, then I and II hold.

For instance, suppose $X_1, \ldots, X_n$ is a sample from a $N(\theta, \sigma^2)$ population, where $\sigma^2$ is known. Then (see Table 1.6.1) $\eta(\theta) = \theta/\sigma^2$ and I and II are satisfied. Similarly, I and II are satisfied for samples from gamma and beta distributions with one parameter fixed.

If I holds it is possible to define an important characteristic of the family $\{P_\theta\}$, the Fisher information number, which is denoted by $I(\theta)$ and given by

$I(\theta) = E_\theta\Big(\frac{\partial}{\partial\theta}\log p(X, \theta)\Big)^2 = \int\Big(\frac{\partial}{\partial\theta}\log p(x, \theta)\Big)^2 p(x, \theta)\,dx.$    (3.4.9)

Note that $0 \le I(\theta) \le \infty$.

Lemma 3.4.1. Suppose that I and II hold and that

$E_\theta\Big|\frac{\partial}{\partial\theta}\log p(X, \theta)\Big| < \infty.$

Then

$E_\theta\Big(\frac{\partial}{\partial\theta}\log p(X, \theta)\Big) = 0$    (3.4.10)

and, thus,

$I(\theta) = \mathrm{Var}_\theta\Big(\frac{\partial}{\partial\theta}\log p(X, \theta)\Big).$    (3.4.11)

Proof. Because $\frac{\partial}{\partial\theta}\log p(x, \theta) = \big[\frac{\partial}{\partial\theta}p(x, \theta)\big]\big/p(x, \theta)$,

$E_\theta\Big(\frac{\partial}{\partial\theta}\log p(X, \theta)\Big) = \int\frac{\partial}{\partial\theta}p(x, \theta)\,dx = \frac{\partial}{\partial\theta}\int p(x, \theta)\,dx = 0.$ □

Example 3.4.2. Suppose $X_1, \ldots, X_n$ is a sample from a Poisson $\mathcal{P}(\theta)$ population. Then

$\frac{\partial}{\partial\theta}\log p(x, \theta) = \frac{1}{\theta}\sum_{i=1}^n x_i - n$

and

$I(\theta) = \mathrm{Var}\Big(\frac{1}{\theta}\sum_{i=1}^n X_i\Big) = \frac{n\theta}{\theta^2} = \frac{n}{\theta}.$ □
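A quick Monte Carlo check of Lemma 3.4.1 and of the Poisson calculation in Example 3.4.2; this is a rough sketch with arbitrary, made-up values of $\theta$, $n$, and the number of replications.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 3.0, 20, 100000          # illustrative values only

X = rng.poisson(theta, size=(reps, n))
score = X.sum(axis=1) / theta - n         # d/dtheta log p(X, theta) for a Poisson sample

print("E[score]   ~", round(score.mean(), 4), "   (Lemma 3.4.1: should be 0)")
print("Var[score] ~", round(score.var(), 3), " vs  I(theta) = n/theta =", n / theta)
```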
Here is the main result of this section.

Theorem 3.4.1. (Information Inequality). Let $T(X)$ be any statistic such that $\mathrm{Var}_\theta(T(X)) < \infty$ for all $\theta$. Denote $E_\theta(T(X))$ by $\psi(\theta)$. Suppose that I and II hold and $0 < I(\theta) < \infty$. Then for all $\theta$, $\psi(\theta)$ is differentiable and

$\mathrm{Var}_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}.$    (3.4.12)

Proof. Using I and II we obtain

$\psi'(\theta) = \int T(x)\frac{\partial}{\partial\theta}p(x, \theta)\,dx = \int T(x)\Big(\frac{\partial}{\partial\theta}\log p(x, \theta)\Big)p(x, \theta)\,dx.$    (3.4.13)

By (3.4.13) and Lemma 3.4.1,

$\psi'(\theta) = \mathrm{Cov}\Big(\frac{\partial}{\partial\theta}\log p(X, \theta),\,T(X)\Big).$    (3.4.14)

Now let us apply the correlation (Cauchy-Schwarz) inequality (A.11.16) to the random variables $\frac{\partial}{\partial\theta}\log p(X, \theta)$ and $T(X)$. We get

$[\psi'(\theta)]^2 \le \mathrm{Var}_\theta(T(X))\,\mathrm{Var}_\theta\Big(\frac{\partial}{\partial\theta}\log p(X, \theta)\Big).$    (3.4.15)

The theorem follows because, by Lemma 3.4.1, $\mathrm{Var}_\theta\big(\frac{\partial}{\partial\theta}\log p(X, \theta)\big) = I(\theta)$. □

The lower bound given in the information inequality depends on $T(X)$ through $\psi(\theta)$. If we consider the class of unbiased estimates of $q(\theta) = \theta$, we obtain a universal lower bound given by the following.

Corollary 3.4.1. Suppose the conditions of Theorem 3.4.1 hold and $T$ is an unbiased estimate of $\theta$. Then

$\mathrm{Var}_\theta(T(X)) \ge \frac{1}{I(\theta)}.$    (3.4.16)

The number $[\psi'(\theta)]^2/I(\theta)$ is often referred to as the information or Cramer-Rao lower bound for the variance of an unbiased estimate of $\psi(\theta)$. Here is another important special case.

Proposition 3.4.2. Suppose that $X = (X_1, \ldots, X_n)$ is a sample from a population with density $f(x, \theta)$, $\theta \in \Theta$, and that the conditions of Theorem 3.4.1 hold. Then $I(\theta) = nI_1(\theta)$, where $I_1(\theta) = E\big(\frac{\partial}{\partial\theta}\log f(X_1, \theta)\big)^2$, and consequently

$\mathrm{Var}_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{nI_1(\theta)}.$    (3.4.17)

Proof. By independence,

$I(\theta) = \mathrm{Var}\Big(\frac{\partial}{\partial\theta}\log p(X, \theta)\Big) = \mathrm{Var}\Big(\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(X_i, \theta)\Big) = \sum_{i=1}^n\mathrm{Var}\Big(\frac{\partial}{\partial\theta}\log f(X_i, \theta)\Big) = nI_1(\theta).$ □
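A numerical illustration of the bound (3.4.17): for a $N(\theta, \sigma^2)$ sample with $\sigma^2$ known, $I_1(\theta) = 1/\sigma^2$ (as computed in Example 3.4.3 below), so the bound for unbiased estimates of $\theta$ is $\sigma^2/n$. It is attained by the sample mean but not, for example, by the sample median. The values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, reps = 0.0, 2.0, 15, 200000   # illustrative values only

X = rng.normal(theta, sigma, size=(reps, n))
bound = sigma**2 / n                            # 1/(n I_1(theta)) with I_1(theta) = 1/sigma^2

print("information bound        :", bound)
print("Var(sample mean)   approx:", round(X.mean(axis=1).var(), 4))     # attains the bound
print("Var(sample median) approx:", round(np.median(X, axis=1).var(), 4))  # strictly larger
```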
Proof. Because the Xᵢ are independent and identically distributed,

I(θ) = Var[(∂/∂θ) log p(X, θ)] = Var[Σᵢ₌₁ⁿ (∂/∂θ) log f(Xᵢ, θ)] = n Var[(∂/∂θ) log f(X₁, θ)] = nI₁(θ). □

I₁(θ) is often referred to as the information contained in one observation. We have just shown that the information I(θ) in a sample of size n is nI₁(θ).

Example 3.4.2. Suppose X₁, ..., Xₙ is a sample from a normal distribution with unknown mean θ and known variance σ². As we previously remarked, the conditions of the information inequality are satisfied. By Corollary 3.4.1 we see that the conclusion that X̄ is UMVU follows if

Var(X̄) = 1/(nI₁(θ)).   (3.4.18)

Now Var(X̄) = σ²/n, whereas if φ denotes the N(0, 1) density,

I₁(θ) = Var((∂/∂θ) log[σ⁻¹φ((X₁ − θ)/σ)]) = Var((X₁ − θ)/σ²) = 1/σ²,

and (3.4.18) follows; thus, X̄ is a UMVU estimate of θ. Note that because X̄ is UMVU whatever may be σ², we have in fact proved that X̄ is UMVU even if σ² is unknown. □

Example 3.4.1. (Continued). For a sample from a P(θ) distribution, the MLE is θ̂ = X̄. Because X̄ is unbiased and Var(X̄) = θ/n = 1/I(θ), X̄ achieves the lower bound of Theorem 3.4.1 for every θ; thus, X̄ is UMVU. □

Example 3.4.3. We can similarly show (Problem 3.4.1) that if X₁, ..., Xₙ are the indicators of n Bernoulli trials with probability of success θ, then X̄ is a UMVU estimate of θ. □

These are situations in which X̄ follows a one-parameter exponential family. This is no accident. Next we note how we can apply the information inequality to the problem of unbiased estimation. If the family {P_θ} satisfies I and II and if there exists an unbiased estimate T* of ψ(θ) that achieves the lower bound [ψ'(θ)]²/I(θ) for every θ, then T* is UMVU as an estimate of ψ(θ).

Theorem 3.4.2. Suppose that the family {P_θ : θ ∈ Θ} satisfies assumptions I and II and there exists an unbiased estimate T* of ψ(θ) such that Var_θ[T*(X)] = [ψ'(θ)]²/I(θ) for all θ ∈ Θ. Then {P_θ} is a one-parameter exponential family with density or frequency function of the form

p(x, θ) = h(x) exp{η(θ)T*(x) − B(θ)}.   (3.4.19)

Conversely, if {P_θ} is a one-parameter exponential family of the form (1.6.1) with natural sufficient statistic T(X), and η(θ) has a continuous nonvanishing derivative on Θ, then T(X) achieves the information inequality bound and is a UMVU estimate of E_θ(T(X)).
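Before turning to the proof, here is a quick numerical check of the normal case of Example 3.4.2 (a minimal sketch, not part of the text; it assumes NumPy, and θ, σ, n are arbitrary illustrative values): the simulated variance of X̄ should match the information bound 1/(nI₁(θ)) = σ²/n.

```python
import numpy as np

# For X_1, ..., X_n i.i.d. N(theta, sigma^2) with sigma^2 known, the bound for
# unbiased estimates of theta is sigma^2/n, and Xbar attains it (Example 3.4.2).
rng = np.random.default_rng(1)
theta, sigma, n, reps = 1.0, 2.0, 25, 100_000    # hypothetical values

xbar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)
print("Var(Xbar) by simulation    :", xbar.var())
print("Information bound sigma^2/n:", sigma**2 / n)
```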
Proof. We start with the first assertion. By (3.4.14) and the conditions for equality in the correlation inequality (A.11.16), we know that T* achieves the lower bound for all θ if, and only if, there exist functions a₁(θ) and a₂(θ) such that

(∂/∂θ) log p(X, θ) = a₁(θ)T*(X) + a₂(θ)   (3.4.20)

with P_θ probability 1 for each θ. From this equality of random variables we shall show that P_θ[X ∈ A'] = 1 for all θ, where

A' = {x : (∂/∂θ) log p(x, θ) = a₁(θ)T*(x) + a₂(θ) for all θ ∈ Θ}.   (3.4.21)

The passage from (3.4.20) to (3.4.19) is highly technical; our argument is essentially that of Wijsman (1973). Here is the argument. Let θ₁, θ₂, ... be a denumerable dense subset of Θ. If A_θ denotes the set of x for which (3.4.20) holds, then (3.4.20) guarantees P_θ(A_θ) = 1 and assumption I guarantees P_θ'(A_θ) = 1 for all θ' (Problem 3.4.6). Note that if A** = ∩_m A_{θ_m}, then P_θ'(A**) = 1 for all θ'. Suppose without loss of generality that T(x₁) ≠ T(x₂) for x₁, x₂ ∈ A**. By solving for a₁, a₂ in

(∂/∂θ) log p(x_j, θ) = a₁(θ)T*(x_j) + a₂(θ)   (3.4.22)

for j = 1, 2, we see that a₁ and a₂ are linear combinations of (∂/∂θ) log p(x_j, θ), j = 1, 2, and, hence, continuous in θ. But now if x ∈ A**, then

(∂/∂θ) log p(x, θ) = a₁(θ)T*(x) + a₂(θ)   (3.4.23)

holds for all θ₁, θ₂, ..., and both sides are continuous in θ; thus, (3.4.23) must hold for all θ. Thus, A** = A' and the result follows: upon integrating both sides of (3.4.20) with respect to θ we get (3.4.19).

Conversely, in the exponential family case (1.6.1) we assume without loss of generality (Problem 3.4.3) that we have the canonical case with η(θ) = θ and

B(θ) = A(θ) = log ∫ h(x) exp{θT(x)}dx.

Then

(∂/∂θ) log p(x, θ) = T(x) − A'(θ),

so that ψ(θ) = A'(θ) and, hence,

I(θ) = Var_θ(T(X) − A'(θ)) = Var_θ T(X) = A''(θ).   (3.4.24)

Thus, the information bound is [A''(θ)]²/A''(θ) = A''(θ) = Var_θ(T(X)), so that T(X) achieves the information bound as an estimate of E_θT(X). □

Example 3.4.4. In the Hardy–Weinberg model of Examples 2.1.4 and 2.2.6,

p(x, θ) = 2^{n₂} exp{(2n₁ + n₂) log θ + (2n₃ + n₂) log(1 − θ)}
        = 2^{n₂} exp{(2n₁ + n₂)[log θ − log(1 − θ)] + 2n log(1 − θ)},

where we have used the identity (2n₁ + n₂) + (2n₃ + n₂) = 2n.
IOgp(X. B) )' aB (3. ~ ~ This T coincides with the MLE () of Example 2. Extensions to models in which B is multidimensional are considered next.7) Var(B) ~ B(I . 0 0 . • The multiparameter case We will extend the information lower bound to the case of several parameters.4. B) aB' p(x.4.. DiscRSsion.2. in the U(O.26) Proposition 3. B) to canonical form by setting t) = 10g[B((1 .2. The variance of B can be computed directly using the moments of the multinomial distribution of (NI • N z .25).6.3.27) and integrate both sides with respect to p(x.4. 0 ~ Note that by differentiating (3. For a sample from a P( B) distribution o EB(::.4. (Continued).6. () (ell" . Suppose p('.B)(2n. Proof We need only check that I a aB' log p(x. we have iJ' aB' logp (X. Sharpenings of the information inequality are available but don't help in general. B) satisfies in addition to I and II: p(·. It often happens.4. for instance.2 implies that T = (2N 1 + N z )/2n is UMVU for estimating E(T) ~ (2n)1[2nB' + 2nB(1 . See Volume II. . We find (Problem 3. Then (3.4.184 Measures of Performance Chapter 3 where we have used the identity (2nl + HZ) + (2n3 + nz) = 211.'" . A third metho:! would be to use Var(B) = I(I(B) and formula (3.B)] and then using Theorem 1. Example 3. e).26) holds. or by transforming p(x.25) we obtain a' 10gp(X. By (3. I(B) = Eo aB' It turns out that this identity also holds outside exponential families. . .24). (3. as Theorem 3..B) ~ A "( B). B)  2 (a log p(x. . Theorem 3.2 suggests.B)] ~ B. B). Even worse. '.B) is twice differentiable and interchange between integration and differentiation is pennitted. B) 2 1 a = p(x.4. in many situations. that I and II fail to hold. .2. Because this is an exponential family. B) example.. but the variance of the best estimate is not equal to the bound [.B)) which equals I(B). although UMVU estimates exist. . we will find a lower bound on the variance of an estimator . Na). I I' f .(e) exist.4.4.. " 'oi .p' (B) I'( I (B). assumptions I and II are satisfied and UMVU estimates of 1/. =e'E(~Xi) = .4.ed)' In particular.4.
of θ₁ when the parameters θ₂, ..., θ_d are unknown.

We assume that Θ is an open subset of R^d and that {p(x, θ) : θ ∈ Θ} is a regular parametric model with conditions I and II satisfied when differentiation is with respect to θ_j, 1 ≤ j ≤ d. Let p(x, θ) denote the density or frequency function of X, where X ∈ 𝒳 ⊂ R^q. The (Fisher) information matrix is defined as

I(θ) = (I_{jk}(θ))_{d×d}   (3.4.28)

where

I_{jk}(θ) = E_θ((∂/∂θ_j) log p(X, θ) · (∂/∂θ_k) log p(X, θ)), 1 ≤ j ≤ d, 1 ≤ k ≤ d.   (3.4.29)

Proposition 3.4.4. Under the conditions in the opening paragraph,

(a) E_θ((∂/∂θ_j) log p(X, θ)) = 0, 1 ≤ j ≤ d, and

I_{jk}(θ) = Cov_θ((∂/∂θ_j) log p(X, θ), (∂/∂θ_k) log p(X, θ)).   (3.4.30)

(b) If X₁, ..., Xₙ are i.i.d. as X, then X = (X₁, ..., Xₙ)ᵀ has information matrix nI₁(θ), where I₁ is the information matrix of X₁.   (3.4.31)

(c) If, in addition, p(·, θ) is twice differentiable and double integration and differentiation under the integral sign can be interchanged,

I_{jk}(θ) = −E_θ((∂²/∂θ_j ∂θ_k) log p(X, θ)).   (3.4.32)

Proof. The arguments follow the d = 1 case and are left to the problems. □

Example 3.4.5. Suppose X ~ N(μ, σ²), θ = (μ, σ²). Then

log p(x, θ) = −(1/2) log(2π) − (1/2) log σ² − (x − μ)²/(2σ²),

I₁₁(θ) = −E[(∂²/∂μ²) log p(x, θ)] = σ⁻²,
I₁₂(θ) = −E[(∂²/∂σ²∂μ) log p(x, θ)] = σ⁻⁴ E(x − μ) = 0,
I₂₂(θ) = −E[(∂²/(∂σ²)²) log p(x, θ)] = σ⁻⁴/2.

Thus, in this case,

I(θ) = diag(σ⁻², 1/(2σ⁴)).   (3.4.33) □
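The entries of I(θ) in Example 3.4.5 can also be obtained as second moments of the score vector, by (3.4.29). The following minimal sketch (not from the text; it assumes NumPy, with illustrative values of μ and σ²) averages outer products of the score of one observation and compares with the closed form diag(1/σ², 1/(2σ⁴)).

```python
import numpy as np

# Monte Carlo estimate of the information matrix of one N(mu, sigma^2) observation,
# parametrized by theta = (mu, sigma^2), as E[score score^T] (Example 3.4.5).
rng = np.random.default_rng(2)
mu, sig2, reps = 0.5, 1.5, 500_000               # hypothetical values
x = rng.normal(mu, np.sqrt(sig2), size=reps)

score = np.column_stack([
    (x - mu) / sig2,                               # d/dmu log p
    -0.5 / sig2 + (x - mu) ** 2 / (2 * sig2 ** 2)  # d/dsigma^2 log p
])
print("Monte Carlo I(theta):\n", score.T @ score / reps)
print("Closed form diagonal:", 1 / sig2, 1 / (2 * sig2 ** 2))
```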
Example 3.4.6. Canonical k-Parameter Exponential Family. Suppose

p(x, θ) = exp{Σ_{j=1}^k T_j(x)θ_j − A(θ)} h(x),  θ ∈ Θ open.   (3.4.34)

The conditions I and II are easily checked, and because ∇_θ log p(x, θ) = T(x) − Ȧ(θ), (3.4.30) gives

I(θ) = Var_θ T(X) = Ä(θ).   (3.4.35)

That is, the information matrix of a canonical exponential family is the variance–covariance matrix of the natural sufficient statistic. □

Next suppose θ̂₁ = T is an estimate of θ₁ with θ₂, ..., θ_d assumed unknown. A lower bound on its variance follows from the multiparameter version of the information inequality.

Theorem 3.4.3. Assume the conditions of the opening paragraph hold and suppose that the matrix I(θ) is nonsingular. Let T(X) be any statistic such that Var_θ(T(X)) < ∞ for all θ, let ψ(θ) = E_θ T(X), and let ψ̇(θ) = ∇ψ(θ) be the d × 1 vector of partial derivatives. Then, for all θ, ψ̇(θ) exists and

Var_θ(T(X)) ≥ ψ̇ᵀ(θ) I⁻¹(θ) ψ̇(θ).   (3.4.36)

Proof. We will use the prediction inequality Var(Y) ≥ Var(μ_L(Z)), where μ_L(Z) denotes the optimal MSPE linear predictor of Y; that is,

μ_L(Z) = μ_Y + (Z − μ_Z)ᵀ Σ⁻¹_{ZZ} Σ_{ZY}.   (3.4.37)

Now set Y = T(X) and Z = ∇_θ log p(X, θ). Then

Var_θ(T(X)) ≥ Σᵀ_{ZY} I⁻¹(θ) Σ_{ZY},   (3.4.38)

where Σ_{ZY} = E_θ(T(X)∇_θ log p(X, θ)) = ∇_θ E_θ(T(X)) = ψ̇(θ), and the last equality follows from the argument in (3.4.13). □

Here are some consequences of this result.

Example 3.4.6. (Continued). UMVU Estimates in Canonical Exponential Families. Suppose the conditions of Example 3.4.6 hold. We claim that each of T_j(X) is a UMVU estimate of E_θT_j(X).
To see our claim, note that in our case

ψ_j(θ) = E_θT_j(X) = (∂/∂θ_j)A(θ),

so that ψ̇_j(θ) is the jth row of Ä(θ) = I(θ). Hence,

ψ̇_jᵀ(θ) I⁻¹(θ) ψ̇_j(θ) = I_{jj}(θ) = Var_θ(T_j(X)),   (3.4.39)

and T_j(X) attains the lower bound of Theorem 3.4.3; thus, T_j(X) is UMVU for E_θT_j(X). This is a different claim than T_j(X) being UMVU for E_θT_j(X) if θ_i, i ≠ j, are known. □

Example 3.4.7. Multinomial Trials. In the multinomial Example 1.6.6 with X₁, ..., Xₙ i.i.d., X = (X₁, ..., Xₙ)ᵀ, and λ_j = P(X = j), j = 1, ..., k, we transformed the multinomial model M(n, λ₁, ..., λ_k) to the canonical form

p(x, θ) = exp{Tᵀ(x)θ − A(θ)},

where Tᵀ(x) = (T₁(x), ..., T_{k−1}(x)), T_j(X) = Σᵢ 1[Xᵢ = j], θ = (θ₁, ..., θ_{k−1})ᵀ, θ_j = log(λ_j/λ_k), and

A(θ) = n log(1 + Σ_{j=1}^{k−1} e^{θ_j}).   (3.4.40)

Note that

(∂/∂θ_j)A(θ) = n e^{θ_j}/(1 + Σ_l e^{θ_l}) = nλ_j,

(∂²/∂θ_j²)A(θ) = n e^{θ_j}/(1 + Σ_l e^{θ_l}) − n e^{2θ_j}/(1 + Σ_l e^{θ_l})² = nλ_j(1 − λ_j).

Thus, if we let ψ_j(θ) = E(n⁻¹T_j(X)) = λ_j and, without loss of generality, take j = 1, the lower bound on the variance of an unbiased estimator of λ_j is

ψ̇(θ) I⁻¹(θ) ψ̇ᵀ(θ) = λ_j(1 − λ_j)/n   (3.4.41)

because ψ̇(θ) is n⁻¹ times the first row of I(θ) and, hence, ψ̇(θ)I⁻¹(θ) = n⁻¹(1, 0, ..., 0). But N_j/n = n⁻¹T_j(X) is unbiased and has variance λ_j(1 − λ_j)/n; thus, N_j/n is UMVU for λ_j. □
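As a quick illustration of Example 3.4.7 (a minimal sketch, not part of the text; it assumes NumPy, with an arbitrary choice of λ and n), the simulated variance of N_j/n can be compared with the bound λ_j(1 − λ_j)/n of (3.4.41).

```python
import numpy as np

# In the multinomial model, N_j/n is unbiased for lambda_j and its variance
# lambda_j*(1 - lambda_j)/n equals the information bound, so it is UMVU.
rng = np.random.default_rng(3)
lam = np.array([0.2, 0.3, 0.5])                   # hypothetical cell probabilities
n, reps = 50, 200_000

phat = rng.multinomial(n, lam, size=reps) / n     # reps draws of (N_1, N_2, N_3)/n
print("simulated Var(N_j/n):", phat.var(axis=0))
print("bound lam*(1-lam)/n :", lam * (1 - lam) / n)
```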
Example 3.4.8. The Normal Case. If X₁, ..., Xₙ are i.i.d. N(μ, σ²), then X̄ is UMVU for μ and n⁻¹ΣXᵢ² is UMVU for μ² + σ². But it does not follow that n⁻¹Σ(Xᵢ − X̄)² is UMVU for σ². These and other examples and the implications of Theorem 3.4.3 are explored in the problems. □

Here is an important extension of Theorem 3.4.3 whose proof is left to the problems.

Theorem 3.4.4. Suppose that the conditions of Theorem 3.4.3 hold and θ̂ is a d-dimensional statistic. Let ψ(θ) = E_θ(θ̂(X))_{d×1} and

ψ̇(θ) = ((∂/∂θ_j)ψ_i(θ))_{d×d}.

Then

Var_θ(θ̂(X)) ≥ ψ̇(θ) I⁻¹(θ) ψ̇ᵀ(θ),   (3.4.42)

where A ≥ B means aᵀ(A − B)a ≥ 0 for all a_{d×1}. Note that both sides of (3.4.42) are d × d matrices. Also note that

θ̂ unbiased ⟹ Var_θ(θ̂) ≥ I⁻¹(θ).

Summary. We study the important application of the unbiasedness principle in survey sampling. We derive the information inequality in one-parameter models and show how it can be used to establish that, in a canonical exponential family, T(X) is the UMVU estimate of its expectation. Using inequalities from prediction theory, we show how the information inequality can be extended to the multiparameter case. In Chapters 5 and 6 we show that in smoothly parametrized models, reasonable estimates are asymptotically unbiased. We establish analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal. Asymptotic analogues of these inequalities are sharp and lead to the notion and construction of efficient estimates.

3.5 NONDECISION THEORETIC CRITERIA

In practice, features other than the risk function are also of importance in selection of a procedure, even if the loss function and model are well specified. The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure, interpretability of the procedure, and robustness to model departures.

3.5.1 Computation

Speed of computation and numerical stability issues have been discussed briefly in Section 2.4. They are dealt with extensively in books on numerical analysis such as Dahlquist,
Björk, and Anderson (1974). We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory.

Closed form versus iteratively computed estimates

At one level closed form is clearly preferable. For instance, consider the Gaussian linear model of Example 2.1.1. Then least squares estimates are given in closed form by equation (2.1.10). The closed form here is deceptive because inversion of a d × d matrix takes on the order of d³ operations when done in the usual way and can be numerically unstable. It is in fact faster and better to solve equation (2.1.9) by, say, Gaussian elimination for the particular Z_DᵀY. On the other hand, a method of moments estimate of (λ, p) in Example 2.1.2 is given in closed form in terms of X̄ and the empirical variance σ̂² (Problem 2.1.1); it is clearly easier to compute than the MLE.

Faster versus slower algorithms

Consider estimation of the MLE θ̂ in a general canonical exponential family as in Section 2.3, say by the algorithm we discuss in Section 2.4.2. If we seek to take enough steps J so that |θ̂⁽ᴶ⁾ − θ̂| ≤ ε, then J is of the order of log(1/ε) (Problem 3.5.1). It is faster to use the Newton–Raphson method, in which the jth iterate is

θ̂⁽ʲ⁾ = θ̂⁽ʲ⁻¹⁾ − Ä⁻¹(θ̂⁽ʲ⁻¹⁾)(T(X) − Ȧ(θ̂⁽ʲ⁻¹⁾)).

It may be shown that, at least if started close enough to θ̂, Newton–Raphson takes on the order of log log(1/ε) steps (Problem 3.5.2). The improvement in speed may however be spurious, since Ä⁻¹ is costly to compute if d is large, though the same trick as in computing least squares estimates can be used.

The interplay between estimated variance and computation

As we have seen in special cases in Examples 3.4.3 and 3.4.4, estimates of parameters based on samples of size n have standard deviations of order n^(−1/2). It follows that striving for numerical accuracy of order smaller than n^(−1/2) is wasteful. Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the constants involved. Of course, with ever faster computers a difference at this level is irrelevant. But it reappears when the data sets are big and the number of parameters large.
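The orders just quoted, roughly log(1/ε) steps for geometrically convergent methods such as bisection or coordinate ascent versus roughly log log(1/ε) for Newton–Raphson once it is close, can be seen on a toy likelihood equation. The sketch below (not from the text; plain Python, no external libraries; the target value t and tolerance are arbitrary) solves A'(θ) = e^θ = t, the likelihood equation of the canonical Poisson family, both ways and counts iterations.

```python
import math

# Compare iteration counts: bisection vs. Newton-Raphson for psi(theta) = exp(theta) - t = 0.
t, eps = 3.7, 1e-10                                # hypothetical target and tolerance
psi = lambda th: math.exp(th) - t
dpsi = lambda th: math.exp(th)

lo, hi, n_bisect = -10.0, 10.0, 0                  # bracketing interval for bisection
while hi - lo > eps:
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if psi(mid) < 0 else (lo, mid)
    n_bisect += 1

th, n_newton = 0.0, 0                              # Newton-Raphson from a crude start
while abs(psi(th)) > eps:
    th -= psi(th) / dpsi(th)
    n_newton += 1

print("bisection steps:", n_bisect, "  Newton steps:", n_newton, "  root:", math.log(t))
```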
3.5.2 Interpretability

Suppose that in the normal N(μ, σ²) Example 2.1.5 we are interested in the parameter μ/σ, the signal-to-noise ratio. This parameter, for this population of measurements, has a clear interpretation. Its maximum likelihood estimate X̄/σ̂ continues to have the same intuitive interpretation as an estimate of μ/σ even if the data are a sample from a distribution with mean μ and variance σ² other than the normal. On the other hand, suppose we initially postulate a model in which the data are a sample from a gamma, Γ(p, λ), distribution. Then E(X)/[Var(X)]^(1/2) = (p/λ)(p/λ²)^(−1/2) = p^(1/2). We can now use the MLE p̂^(1/2), which as we shall see later (Section 5.4) is for n large a more precise estimate than X̄/σ̂ if this model is correct. However, the form of this estimate is complex and, if the model is incorrect, it no longer is an appropriate estimate of E(X)/[Var(X)]^(1/2). We return to this in Section 5.5.

3.5.3 Robustness

Finally, we turn to robustness. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied. This is an issue easy to point to in practice but remarkably difficult to formalize appropriately. To be a bit formal, what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). We consider three situations.

(a) The problem dictates the parameter. For instance, economists often work with median housing prices, that is, the parameter ν that has half of the population prices on either side (formally, ν is any value such that P(X ≤ ν) ≥ ½, P(X ≥ ν) ≥ ½). Similarly, they may be interested in total consumption of a commodity such as coffee, say θ = Nμ, where N is the population size and μ is the expected consumption of a randomly drawn individual.

(b) We imagine that the random variable X* produced by the random experiment we are interested in has a distribution that follows a "true" parametric model with an interpretable parameter θ, but we do not necessarily observe X*. The actual observation X is X* contaminated with "gross errors"; see the following discussion. But θ is still the target in which we are interested. For instance, the Hardy–Weinberg parameter θ has a clear biological interpretation and is the parameter for the experiment described in Example 2.1.4.

(c) We have a qualitative idea of what the parameter is, but there are several parameters that satisfy this qualitative notion. For instance, we may be interested in the center of a population, and both the mean μ and the median ν qualify. This idea has been developed by Bickel and Lehmann (1975a, 1975b, 1976) and Doksum (1975), among others. See Problem 3.5.13.

We will consider situations (b) and (c).

Gross error models

Most measurement and recording processes are subject to gross errors, anomalous values that arise because of human error (often in recording) or instrument malfunction. To be a bit more specific, suppose that if n measurements X* = (X₁*, ..., Xₙ*) could be taken without gross errors, then P* would be an adequate approximation to the distribution of X* (i.e., we could suppose X* ~ P* ∈ 𝒫*). However, if gross errors occur, we observe not X* but X = (X₁, ..., Xₙ), where most of the Xᵢ = Xᵢ*, but there are a few wild values.
Now suppose we want to estimate θ(P*) and use θ̂(X₁, ..., Xₙ), knowing that θ̂(X₁*, ..., Xₙ*) is a good estimate. Informally, θ̂(X₁, ..., Xₙ) will continue to be a good, or at least reasonable, estimate if its value is not greatly affected by the Xᵢ ≠ Xᵢ*, the gross errors. Again informally, we shall call such procedures robust. Formal definitions require model specification, specification of the gross error mechanism, and definitions of insensitivity to gross errors. Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6. However, two notions, the sensitivity curve and the breakdown point, make sense for fixed n. The breakdown point will be discussed in Volume II. We next define and examine the sensitivity curve in the context of the Gaussian location model, Example 1.1.2, and then more generally.

Consider the one-sample symmetric location model 𝒫 defined by

Xᵢ = μ + εᵢ, i = 1, ..., n,   (3.5.1)

where the errors are independent, identically distributed, and symmetric about 0 with common density f and d.f. F. If the error distribution is normal, X̄ is the best estimate in a variety of senses. A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the εᵢ still i.i.d., but with common distribution function F and density f of the form

f(x) = (1 − λ)(1/σ)φ(x/σ) + λh(x).   (3.5.2)

Here h is the density of the gross errors and λ is the probability of making a gross error. In our new formulation it is the Xᵢ* that obey (3.5.1). That is,

Xᵢ = Xᵢ* with probability 1 − λ, Xᵢ = Yᵢ with probability λ,

where Yᵢ has density h(y − μ) and (Xᵢ*, Yᵢ) are i.i.d. Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of X*. Further assumptions that are commonly made are that h has a particular form, for example, h(x) = (1/Kσ)φ(x/Kσ) where K ≫ 1, or, more generally, that h is an unknown density symmetric about 0. Then the gross error model is semiparametric,

𝒫 = {P_(μ,f) : f satisfies (3.5.2) for some h such that h(x) = h(−x) for all x}.

The advantage of this formulation is that μ remains identifiable; it is the center of symmetry of P_(μ,f) for all such P. However, the assumption that h is itself symmetric about 0 seems patently untenable for gross errors. Unfortunately, if we drop the symmetry assumption, we encounter one of the basic difficulties in formulating robustness in situation (b): without h symmetric, the quantity μ is not a parameter, so it is unclear what we are estimating. That is, it is possible to have P_(μ₁,f₁) = P_(μ₂,f₂) for μ₁ ≠ μ₂ (Problem 3.5.16). Is μ₁ or μ₂ our goal? On the other hand, in situation (c) we do not need the symmetry assumption. We return to these issues in Chapter 6.
The sensitivity curve

We start by defining the sensitivity curve for general plug-in estimates (which are well defined for all sample sizes n). Suppose that X ~ P and that θ = θ(P) is a parameter. The empirical plug-in estimate of θ is θ̂ = θ(P̂), where P̂ is the empirical probability distribution. See Section 2.1.2. At this point we ask: Suppose that an estimate T(X₁, ..., Xₙ) = θ(F̂), where F̂ is the empirical d.f., that is, T has the plug-in property. How sensitive is it to the presence of gross errors among X₁, ..., Xₙ? An interesting way of studying this, due to Tukey (1972) and Hampel (1974), is the sensitivity curve, defined as follows. The sensitivity curve of θ̂ is

SC(x, θ̂) = n[θ̂(x₁, ..., x_{n−1}, x) − θ̂(x₁, ..., x_{n−1})],

where x₁, ..., x_{n−1} represents an observed sample of size n − 1 from P and x represents an observation that (potentially) comes from a distribution different from P. Often this is done by fixing x₁, ..., x_{n−1} as an "ideal" sample of size n − 1 for which the estimator θ̂ gives us the right value of the parameter, and then we see what the introduction of a potentially deviant nth observation x does to the value of θ̂. In our examples we shall, in particular, shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas.

We return to the location problem with θ equal to the mean μ = E(X). Because the estimates we consider are location equivariant, we take μ = 0 without loss of generality. Now fix x₁, ..., x_{n−1} so that their mean has the ideal value zero; that is, μ = θ(x₁, ..., x_{n−1}) = 0. This is equivalent to shifting the SC vertically to make its value at x = 0 equal to zero. Then, because x̄ = (x₁ + ... + x_{n−1} + x)/n = x/n,

SC(x, x̄) = n(x₁ + ... + x_{n−1} + x)/n = x.

Thus, the sample mean is arbitrarily sensitive to gross errors; a large gross error can throw the mean off entirely. Are there estimates that are less sensitive? A classical estimate of location based on the order statistics is the sample median X̃, defined by

X̃ = X_(k+1) if n = 2k + 1
  = ½(X_(k) + X_(k+1)) if n = 2k,

where X_(1), ..., X_(n) are the order statistics, that is, X₁, ..., Xₙ ordered from smallest to largest (see Section 2.1.2). The sample median can be motivated as an estimate of location on various grounds.

(i) It is the empirical plug-in estimate of the population median ν (Problem 3.5.4), which is appropriate for the symmetric location model.
(ii) In the symmetric location model (3.5.1), ν coincides with μ, and X̃ is an empirical plug-in estimate of μ.

(iii) The sample median is the MLE when we assume the common density f(x) of the errors {εᵢ} in (3.5.1) is the Laplace (double exponential) density

f(x) = (1/2τ) exp{−|x|/τ},

a density having substantially heavier tails than the normal. See Problems 2.2.32 and 3.5.9.

The sensitivity curve of the median is as follows. If, say, n = 2k + 1 is odd and the median of x₁, ..., x_{n−1} is (x_(k) + x_(k+1))/2 = 0, we obtain

SC(x, X̃) = n x_(k) for x < x_(k)
          = n x for x_(k) ≤ x ≤ x_(k+1)
          = n x_(k+1) for x > x_(k+1),

where x_(1) ≤ ... ≤ x_(n−1) are the ordered x₁, ..., x_{n−1}.

Figure 3.5.1. The sensitivity curves of the mean and median.

Although the median behaves well when gross errors are expected, its performance at the normal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of X̄. The sensitivity curve in Figure 3.5.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when x is near μ.
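The two curves of Figure 3.5.1 can be reproduced directly from the definition of SC. The following minimal sketch (not part of the text; it assumes NumPy, and the "ideal" sample is a hypothetical recentered N(0,1) sample of size n − 1) tabulates SC(x, ·) for the mean and the median over a grid of x values; the mean's curve is the unbounded line SC = x, while the median's curve is bounded.

```python
import numpy as np

# Sensitivity curves SC(x, est) = n*[est(x_1,...,x_{n-1}, x) - est(x_1,...,x_{n-1})]
# for the mean and the median on an ideal sample recentered to have mean zero.
rng = np.random.default_rng(4)
n = 20
ideal = rng.normal(size=n - 1)
ideal -= ideal.mean()                  # ideal sample: mean exactly zero

def sc(estimator, x):
    return n * (estimator(np.append(ideal, x)) - estimator(ideal))

for x in np.linspace(-5.0, 5.0, 11):
    print(f"x={x:+5.1f}   SC_mean={sc(np.mean, x):+8.3f}   SC_median={sc(np.median, x):+8.3f}")
```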
A class of estimates providing such intermediate behavior, and including both the mean and the median, has been known since the eighteenth century. Let 0 ≤ α < ½. We define the α trimmed mean by

X̄_α = (X_([nα]+1) + ... + X_(n−[nα])) / (n − 2[nα]),   (3.5.3)

where [nα] denotes the largest integer ≤ nα and X_(1) ≤ ... ≤ X_(n) are the ordered observations. That is, we throw out the "outer" [nα] observations on either side and take the average of the rest. The estimates can be justified on plug-in grounds (see Problem 3.5.5). Note that if α = 0, X̄_α = X̄, whereas as α ↑ ½, X̄_α → X̃. If [nα] = [(n − 1)α] and the trimmed mean of x₁, ..., x_{n−1} is zero, the sensitivity curve of an α trimmed mean is sketched in Figure 3.5.2. (The middle portion is the line y = x(1 − 2[nα]/n)⁻¹.)

Figure 3.5.2. The sensitivity curve of the trimmed mean.

Intuitively we expect that if there are no gross errors, that is, f = φ, the mean is better than any trimmed mean with α > 0, including the median. If f is symmetric about 0 but has "heavier tails" (see Problem 3.5.8) than the Gaussian density, for example, the Laplace density, f(x) = ¼ exp{−|x|/2}, or, even more strikingly, the Cauchy, f(x) = 1/π(1 + x²), then the trimmed means for α > 0 and even the median can be much better than the mean, infinitely better in the case of the Cauchy; see Problem 5.4.1. However, the sensitivity curve calculation points to an equally intuitive conclusion.

Which α should we choose in the trimmed mean? There seems to be no simple answer. The range 0.10 ≤ α ≤ 0.20 seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the normal distribution. For instance, for the differences in Table 3.2.1 this corresponds approximately to α = 1/8. This can be verified in terms of asymptotic variances (MSEs); see Problem 5.4.1. There has also been some research into procedures for which α is chosen using the observations. See Andrews, Bickel, Hampel, Huber, Rogers, and Tukey (1972). For a discussion of these and other forms of "adaptation," see Jaeckel (1971), Hogg (1974), and Huber (1972). For more sophisticated arguments see Huber (1981).
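The trimming operation in (3.5.3) is easy to carry out directly. Below is a minimal sketch (not from the text; it assumes NumPy, and the small data set with a single gross error at 25.0 is hypothetical) showing how the mean, the trimmed means, and the median respond to one wild observation.

```python
import numpy as np

def trimmed_mean(x, alpha):
    """alpha-trimmed mean as in (3.5.3): drop the [n*alpha] smallest and largest
    observations and average the rest."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(len(x) * alpha)            # [n*alpha]
    return x[k:len(x) - k].mean()

# hypothetical small sample with one gross error at 25.0
data = [-0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6, 0.9, 25.0]
print("mean        :", np.mean(data))
print("10% trimmed :", trimmed_mean(data, 0.10))
print("25% trimmed :", trimmed_mean(data, 0.25))
print("median      :", np.median(data))
```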
Xo: = x(k). 0"2) model.I L:~ I (Xi .1.X. Similarly.5.75 and X. Then a~ = 11.is typically used. Xu is called a Ctth quantile and X.XnI.5.X)2 is the empirical plugin estimate of 0. Write xn = 11.25). Spread.. Because T = 2 x (.16).10).2 . ..1 L~ 1 Xi = 11. and let Xo. . &2) It is clear that a:~ is very sensitive to large outlying Ixi values. A fairly common quick and simple alternative is the IQR (interquartile range) defined as T = X. the scale measure used is 0. the nth sample quantile is Xo.1x.. SC(X. < x(nI) are the ordered Xl.Ct).742(x.25 are called the upper and lower quartiles. .. = ~ [X(k) + X(k+l)]' and at sample size n  1. We next consider two estimates of the spread in the population as well as estimates of quantiles. . where x(t) < . say k. other examples will be given in the problems. denote the o:th sample quantile (see 2.25. Quantiles and the lQR.4) where the approximation is valid for X fixed. 0 < 0: < 1. then SC(X.5..75 . Example 3.1. To simplify our expression we shift the horizontal axis so that L~/ Xi = O. P( X > xO') > 1 . If we are interested in the spread of the values in a population. X n denote a sample from that population. then the variance 0"2 or standard deviation 0. . &) (3.2.Section 3..5.5 Nondecision Theoretic Criteria 195 Gross errors or outlying data points affect estimates in a variety of situations. If no: is an integer. Let B(P) = Var(X) = 0"2 denote the variance in a population and let XI. where Xu has 100Ct percent of the values in the population on its left (fonnally.75 . Let B(P) = X o deaote a ath quantile of the distribution of X.X.674)0". The IQR is often calibrated so that it equals 0" in the N(/l. X n is any value such that P( X < x n ) > Ct. n + 00 (Problem 3. . 0 Example 3.
Example 3.5.2. Quantiles and the IQR. Let θ(P) = x_α denote an αth quantile of the distribution of X, 0 < α < 1; that is, x_α has 100α percent of the values in the population on its left (formally, x_α is any value such that P(X ≤ x_α) ≥ α, P(X ≥ x_α) ≥ 1 − α). The estimates x̂.75 and x̂.25 are called the upper and lower quartiles. If nα is an integer, say k, the αth sample quantile is x̂_α = ½[x_(k) + x_(k+1)], and at sample size n − 1, x̂_α = x_(k), where x_(1) ≤ ... ≤ x_(n−1) are the ordered x₁, ..., x_{n−1}. It is easy to see that, for 2 ≤ k ≤ n − 2,

SC(x, x̂_α) = ½[x_(k−1) − x_(k)] for x < x_(k−1)
            = ½[x − x_(k)] for x_(k−1) ≤ x ≤ x_(k+1)   (3.5.5)
            = ½[x_(k+1) − x_(k)] for x > x_(k+1).

Clearly, x̂_α is not sensitive to outlying x's. Next consider the sample IQR

τ̂ = x̂.75 − x̂.25.

Then we can write SC(x, τ̂) = SC(x, x̂.75) − SC(x, x̂.25), and the sample IQR is robust with respect to outlying gross errors x. □

Remark 3.5.1. The sensitivity of the parameter θ(F) to x can be measured by the influence function, which is defined by

IF(x; θ, F) = lim_{ε→0} [θ((1 − ε)F + εΔ_x) − θ(F)]/ε,

where Δ_x is the distribution function of point mass at x (Δ_x(t) = 1[t ≥ x]). It is easy to see (Problem 3.5.15) that

SC(x, θ̂) = IF_n(x; θ, F̂_{n−1}),

where F̂_{n−1} denotes the empirical distribution based on x₁, ..., x_{n−1} and IF_n(x; θ, F) = n[θ((1 − n⁻¹)F + n⁻¹Δ_x) − θ(F)], so that IF(x; θ, F) = lim_n IF_n(x; θ, F). The influence function plays an important role in functional expansions of estimates. We will return to the influence function in Volume II.

Discussion. Other aspects of robustness, in particular the breakdown point, have been studied extensively, and a number of procedures have been proposed and implemented. Unfortunately these procedures tend to be extremely demanding computationally, although this difficulty appears to be being overcome lately. An exposition of this point of view and some of the earlier procedures proposed is in Hampel, Ronchetti, Rousseeuw, and Stahel (1986).

Summary. We discuss briefly nondecision theoretic considerations for selecting procedures, including interpretability and computability. Most of the section focuses on robustness, discussing the difficult issues of identifiability. The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean, the median, the trimmed mean, and other procedures.
. .0 E e. lf we put a (improper) uniform prior on A. respectively. give the MLE of the Bernoulli variance q(0) = 0(1 . In the Bernoulli Problem 3. where R( 71') is the Bayes risk. the parameter A = 01(1 .. 0) the Bayes rule is where fo(x. 71') = R(7f) /r( 71'. 71') as . where OB is the Bayes estimate of O. change the loss function to 1(0. 2. Show that if Xl. Find the Bayes risk r(7f. if 71' and 1 are changed to 0(0)71'(0) and 1(0.6 Problems and Complements 197 3. = (0 o)'lw(O) for some weight function w(O) > 0. Check whether q(OB) = E(q(IJ) I x).Section 3.4.2.0)/0(0).O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite. 3. (X I 0 = 0) ~ p(x I 0). J).0)' and that the prinf7r(O) is the beta.2 preceeding. E = R. Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior.3). (c) In Example 3.2.4. . x) where c(x) =. Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin. In Problem 3.2 o e 1.O)P.2 with unifonn prior on the probabilility of success O. 0) = (0 . ~ ~ 4. Show that (h) Let 1(0. . Suppose 1(0. 0) is the quadratic loss (0 .. In some studies (see Section 6.2.0). which is called the odds ratio (for success). under what condition on S does the Bayes rule exist and what is the Bayes rule? 5.3. Hint: See Problem 1.r 7f(0)p(x I O)dO.0) and give the Bayes estimate of q(O). 6. (3(r. the Bayes rule does not change. Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule. Suppose IJ ~ 71'(0).24. That is. Let X I.2. 1 X n is a N(B. is preferred to 0. (a) Show that the joint density of X and 0 is f(x. s).1. density. Compute the limit of e( J.O) = p(x I 0)71'(0) = c(x)7f(0 .Xn be the indicators of n Bernoulli trials with success probability O. (72) sample and 71" is the improper prior 71"(0) = 1. we found that (S + 1)/(n + 2) is the Bayes rule..6 PROBLEMS AND COMPLEMENTS Problems for Section 3. Consider the relative risk e( J. then the improper Bayes rule for squared error loss is 6"* (x) = X.a)' jBQ(1. J) ofJ(x) = X in Example 3. a(O) > 0.
find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x). then the generic and brandname drugs are. where is known.Xu are Li. Bioequivalence trials are used to test whether a generic drug is. do not derive them. E) and positive when f) 1998) is 1.. Find the Bayes decision rule. 00.9j ) = aiO:'j/a5(aa + 1). For the following problems. I) = difference in loss of acceptance and rejection of bioequivalence.Xn ) we want to decide whether or not 0 E (e. equivalent to a namebrand drug.. I j l . There are two possible actions: 0'5 a a with losses I( B. E).. c2 > 0 I. Note that ).  I . to a close approximation. (Bj aj)2. (c) Problem 1.198 (a) T + 00.exp {  2~' B' } . Assume that given f). B. B = (B" ... . . Measures of Performance Chapter 3 (b) n .ft B be the difference in mean effect of the generic and namebrand drugs.d. (.. (E. N{O. . Xl. On the basis of X = (X I. . given 0 = B are multinomial M(n. a) = Lj~. then E(O)) = aj/ao.)T. . defined in Problem 1. (c) We want to estimate the vector (B" . 7.O'5).• N.. a) = [q(B)a]'. . 1). B). .15. where E(X) = O. where CI. a) = (q(B) .19(c). 76) distribution. B .E). bioequivalent.. . E).(0) should be negative when 0 E (E. (Use these results. and that 9 is random with aN(rlO.2. and that 0 has the Dirichlet distribution D( a). I ~' ..3. 0) and I( B.=l CjUj . Let q( 0) = L:. Suppose we have a sample Xl. a r )T. where ao = L:J=l Qj. (a) Problem 1.. .. One such function (Lindley.. Hint: If 0 ~ D(a). (b) Problem 1. a = (a" .3.) (b) When the loss function is I(B. find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule. • 9. .B. " X n of differences in the effect of generic and namebrand effects fora certain drug. 8.. Let 0 = ftc . .2(d)(i) and (ii).\(B) = r . (a) If I(B. . Set a{::} Bioequivalent 1 {::} Not Bioequivalent . Var(Oj) = aj(ao . and Cov{8j . by definition. i: agency specifies a number f > a such that if f) E (E.3. Cr are given constants. A regulatory " ! I i· . (e) a 2 + 00.a)' / nj~.I(d).O)  I(B.. compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O.aj)/a5(ao + 1). Suppose tbat N lo .\(B) = I(B.) with loss function I( B.
Discuss the preceding decision rule for this "prior. Con 10.1) < (T6(n) + c'){log(rg(~)+. respectively. Hint: See Example 3. v) > ?r/(l ?r) is equivalent to T > t. S x T ~ R. (Xv.. .1. Suppose 9 . representing x = (Xl..2 show that L(x.2.17).) + ~}" . . {} (Xo.6. . 2.Yo) = {} (Xo.1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3. 0." (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b). (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3. Any two functions with difference "\(8) are possible loss functions at a = 0 and I. For the model defined by (3.Xm).3 1.6 Problems and Complements 199 where 0 < r < 1... . find (a) the linear Bayes estimate of ~l..Section 3. .. Yo) is in the interior of S x T. 0) and l((). RP. Yo) to be a saddle point is that. and 9 is twice differentiable. Yo) = inf g(xo. + = 0 and it 00").16) and (3.3. In Example 3.Yo) Xi {}g {}g Yj = 0. (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3.. (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00"). (a) Show that a necessary condition for (xo.6.2. S T Suppose S and T are subsets of Rm. 1) are not constant. yo) is a saddle point of 9 if g(xo.) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l(().\(±€.2.Yp). y).. A point (xo. (b) the linear Bayes estimate of fl.y = (YI. Yo) = sup g(x. Note that .
thesimplex.. Let X" . 5.i. and show that this limit ( . B1 ) Ip(X. Suppose e = {Bo.=1 CijXiYj with XE 8 m .. the conclusion of von Neumann's theorem holds. Let Lx (Bo. 0») equals 1 when B = ~.2.)w". (b) Suppose Sm = (x: Xi > 0. Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo..a)'. prior. &* is minimax. Bd. A = {O. . I}. Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g. . . BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1. o(S) = X = Sin. y ESp.n)/(n + . (b) Show that limn_ooIR(O. 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = . o')IR(O. Hint: See Problem 3.o•• ) =R(B1 .y) ~ L~l 12. X n be i. Yc Yd foraHI < i. and that the mndel is regnlar. Let S ~ 8(n. B ) and suppose that Lx (B o.c.0). a) ~ (0 .d< p. . d) = (a) Show that if I' is known to be 0 (!.0•• )..j=O.2.j)=Wij>O.:.200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo. B ) ~ p(X.n12. the test rule 1511" given by o.n)." 1 Xi ~ l}.PIB = Bo]..(X) = 1 if Lx (B o. (b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is.. I j 3. L:. f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta. and 1 .. B1 ) >  (l.. 1 <j. N(I'. 4.= 1 . • ._ I)'. Yo .. .andg(x.Yo) > 0 8 8 8 X a 8 Xb 0. fJ( . iij.<7') and 1(<7'.. Thus.1 < i < m. • " = '!.l. i.n12).a..1(0.Xn ) . and o'(S) = (S + 1 2 . Suppose i I(Bi. I • > 1 for 0 i ~.b < m. n: I>l is minimax.i) =0.d. f. 2 J'(X1 . I(Bi.
" a. b. Show that X has a Poisson (. (c) Use (B. Show that if (N). < im. . 1 < j < k.o) for alII'. Permutations of I.» = Show that the minimax rule is to take L l. 00.k.... jl~ < . distribution.k} I«i). . .\. where (Ill.2.j. ik). . prior. h were t·.X)' are inadmissible. .\ .d) where qj ~ 1 . See Volume II...a? 1... . > jm).. that is. .. f(k. LetA = {(iI..Section 3_6 Problems and Complements 201 (b) If I' ~ 0. Let k . dk)T.29). 6. 9.\.. . ttl . 0') = (1.. Rk) where Rj is the rank of Xi.) . J.P.. 0 1. ik is an arbitrary unknown permutation of 1.I). . . .a < b" = 'lb. . . 8. d) = L(d. . .. 1 = t j=l (di . ik) is smaller than that of (i'l"'" i~). Write X  • "i)' 1(1'.. then is minimax for the loss function r: l(p. M (n. . .. .. respectively. . For instance. b = ttl. X k be independent with means f.3.l . 1 <i< k. o(X 1..?. . . X k ) = (R l . Let Xl.\) distribution and 1(. o'(X) is also minimax and R(I'.Pi. . X is no longer unique minimax.X)' and. 1). i=l then o(X) = X is minimax.). .). .. a) = (. Jlk.. = tj.j. and iI. .1. . . . Hint: (a) Consider a gamma prior on 0 = 1/<7'.1)1 L(Xi ... show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible. d = (d). N k ) has a multinomial. I' (X"". . o(X) ~ n~l L(Xi .12.X)' is best among all rules of the form oc(X) = c L(Xi . Hint: Consider the gamma.I'kf.. .. Remark: Stein (1956) has shown that if k > 3.. (jl.. See Problem 1. that both the MLE and the estimate 8' = (n . Xk)T. ftk) = (Il?l'"'' J1....tj. Then X is nurnmax.. .Pi)' Pjqj <j < k.~~n X < Pi < < R(I'..j. Let Xi beindependentN(I'i.. Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l . Show that if = (I'I''''. hence.. < /12 is a known set of values. . . j 7.tn I(i. / . an d R a < R b. . PI. . Xl· (e) Show that if I' is unknown.
Let X" ... J K(po. Suppose that given 6 ~ B.. Hint: Consider the risk function of o~ No... . ..2. ~ I I . of Xi < X • 1 + 1 o. n) be i. X has a binomial. X = (X"".. a) is the posterior mean E(6 I X). Define OiflXI v'n < d < v'n d v'n d d X .Xo)T has a multinomial. d X+ifX 11. .. B j > 0... Let Xi(i = 1.I)!. . See Problem 3. . 1 .i._.1r) = Show that the marginal density of X. Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss. 13. B). 10. ! p(x) = J po(x)1r(B)dB. . Let K(po.1'»2 of these estimates is bounded for all nand 1'.. Let the loss " function be the KullbackLeibler divergence lp(B.15. I: I ir'z .=1 Show that the Bayes estimate is (Xi . .. with unknown distribution F. Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B.d with density defined in Problem 1.. 1)..2.. Pk.BO)T.a) and let the prior be the uniform prior 1r(B" .. M(n.202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI. Suppose that given 6 = B = (B ..q)1r(B)dB.3. distribution.. B(n. BO l) ~ (k .B). + l)j(n + k)..8. ... (b) How does the risk of these estimates compare to that of X? 12. . For a given x we want to estimate the proportion F(x) of the population to the left of x.d. . .. See also Problem 3. LB= j 01 1. X n be independent N(I'. I: • ... distribution...... v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) .i f X> ... q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q. 14.4..
Equivariance.4 1. Give the Bayes rules for squared error in these three cases.d. . . R. Jeffrey's "Prior. N(I"o.k(p. ry)... then That is. (ry»).X is called the mutual information between 0 and X.4. Show that X is an UMVU estimate of O.~). Jeffrey's priors are proportional to 1.4. Show that (a) (b) cr5 = n. 0) cases.alai) < al(O.Section 3. . respectively.a )I( 0. Let A = 3." A density proportional to VJp(O) is called Jeffrey's prior.2) with I' . . Suppose X I. 7f) and that the minimum is I(J. . 0) and B(n. then R(O. 4.x =. ..1 . Show that in theN(O. K) . if I(O. p( x.'? /1 as in (3. J).O)!.. Fisher infonnation is not equivariant under increasing transformations of the parameter. Show that B q(ry) = Bp(h.1"0 known.1 E: 1 (Xi  J.. . J') < R( (J.f [E. X n be the indicators of n Bernoulli trials with success probability B. S. Problems for Section 3. p(X) IO. that is. Let X I. suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8). 15.aao + (1 . ry) = p(x. ao) + (1 . X n are Ll.. 0. . a) is convex and J'(X) = E( J(X) I I(X». a" (J. . We shall say a loss function is convex.12) for the two parametrizations p(x. a. Show that if 1(0. (b) Equivariance of the Fisher Information Bound. 0) and q(x.). Hint: k(q. and O~ (1 . 0) with 0 E e c R. 2. (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations. Reparametrize the model by setting ry = h(O) and let q(x.6 Problems and Complements 203 minimizes k( q. . {log PB(X)}] K(O)d(J. Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient. Let X . for any ao. K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality..4.tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible. a < a < 1. Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij. then E(g(X) > g(E(X». N(I". . Prove Proposition 3. h1(ry)) denote the model in the new parametrization. It is often improper. the Fisher information lower bound is equivariant. Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable.
I I . Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0.o.. X n ) is a sample drawn without replacement from an unknown finite population {Xl. Suppose (J is UMVU for estimating fJ. 7. i = 1. Show that 8 = (Y . 2 ~ ~ 10. . . Suppose Yl . Let a and b be constants.XN }. Establish the claims of Example 3. 00. P.8.1 and variance 0. . a ~ (c) Suppose that Zi = log[i/(n + 1)1.\ = a + bOo 11. ..4.. • 13.1.5(b). =f. Does X achieve the infonnation . Ct.Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate. Hint: Use the integral approximation to sums.. Show that .4. . Let X" . then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· .{x : p(x. . compute Var(O) using each ofthe three methods indicated. 1)..4. .4. a ~ i 12. Let F denote the class of densities with mean 0. . Show that assumption I implies that if A .13 E R. then = 1 forallO.2. 9. Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1)... Hint: Consider T(X) = X in Theorem 3. Show that a density that minimizes the Fisher infonnation over F is f(x. 14. 204 Hint: See Problem 3. ~ ~ ~ (a) Write the model for Yl give the sufficient statistic. Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered.' .ZDf3)T(y . Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji. .. Show that if (Xl. B(O.3. find the bias o f ao' 6. Y n in twoparameter canonical exponential form and (b) Let 0 = (a.ZDf3)/(n . In Example 3..11(9) as n ~ give the limit of n times the lower bound on the variances of and (J.p) is an unbiased estimate of (72 in the linear regression model of Section 2. {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J. .. 8. . 0) = Oc"I(x > 0).Ii . distribution. Find lim n. . n.P. (a) Find the MLE of 1/0. X n be a sample from the beta. .2 (0 > 0) that satisfy the conditions of the information inequality. . ( 2). . and .. 0) for any set E. For instance.\ = a + bB is UMVU for estimating .
UN are as in Example 3.4).X)/Var(U). XK. _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk. Let X have a binomial.l. Show that if M is the expected sample size. 17. (c) Explain how it is possible if Po is binomial. . then E( M) ~ L 1r j=1 N J ~ n.511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ ..4. More generally only polynomials of degree n in p are unbiasedly estimable. . Suppose X is distributed accordihg to {p. distribution. (b) Deduce that if p.~).4.p). .. Define ~ {Xkl.6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U. . even for sampling without replacement in each stratum. k=l K ~1rkXk. that ~ is a Bayes estimate for O. 'L~ 1 h = N. G 19.. Stratified Sampling. k = 1. .. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population. 18.3. (See also Problem 1. B(n. Let 7fk = ~ and suppose 7fk = 1 < k < K. . 1 < i < h. K. 15. Show that is not unbiasedly estimable.. 16. .) Suppose the Uj can be relabeled into strata {xkd. Show that X k given by (3. :/i.Xkl. = 1.Section 3. P[o(X) = 91 = 1.. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~. .. (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl.• .6 Problems and Complements 205 (b) The variance of X is given by (3. if and only if. ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss. 7.". B(n.4. X is not II Bayes estimate for any prior 1r. Suppose UI.} and X =K  1". = N(O.4. 8)..1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n.~ for all k. : 0 E for (J such that E((J2) < 00. 20.1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o..
Prove Theorem 3. E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) . the upper quartile X.5) to plot the sensitivity curve of the 1. If a = 0. and we can thus define moments of 8/8B log p(x. • Problems for Section 35 I is an integer. for all X!.25. 22. . B). give and plot the sensitivity curves of the lower quartile X. Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3. B) is differentiable fat anB > x.5. is an empirical plugin estimate of ~ Here case. . B). Yet show eand has finite variance. Note that logp(x. J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6./(9)a [¢ T (9)a]T II (9)[¢ T (9)a]. Show that. 1 2. B») ~ °and the information bound is infinite.25 and (n . Show that the a trimmed mean XCII..l)a is an integer.25 and net =k 3. Regularity Conditions are Needed for the Information Inequality. 1 X n1 C.4. If n IQR."" F (ij) Vat (:0 logp(X. give and plot the sensitivity curve of the median. use (3. Hint: It is equivalent to show that. (iii) 2X is unbiased for . that is.3. T . 4. for all adx 1.4. Var(a T 6) > aT (¢(9)II(9). 5.4. B) be the uniform distribution on (0. however. = 2k is even. with probability I for each B..75' and the IQR. Let X ~ U(O. If a = 0.. Show that the sample median X is an empirical plugin estimate of the population median v.206 Hint: Given E(ii(X) 19) ~ 9.9)' 21. An estimate J(X) is said to be shift or translation equivariant if.
Its properties are similar to those of the trimmed mean. ..) ~ 8. It has the advantage that there is no trimming proportion Q that needs to be subjectively specified.3.03 (these are expected values of four N(O. Xu.03. Deduce that X. (See Problem 3. For x > . X a arc translation equivariant and antisymmetric. Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite.. (7= moo 1 IX. X. <i< n Show that (a) k = 00 corresponds to X.  XI/0..(a) Show that X. (b) Suppose Xl.). and the HodgesLehmann estimate.30. median.1.. then (i.JL) where JL is unknown and Xi . F(x .5 and for (j is. I)order statistics). . . k t 0 to the median. .6. . (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1. is symmetrically distributed about O. . The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj). J is an unbiased estimate of 11.e. trimmed mean with a = 1/4.1.Section 3. 7. .. and xiflxl < k kifx > k kifx<k. X n is a sample from a population with dJ. (b) Show that   .30.. The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00. plot the sensitivity curves of the mean. '& is an estimate of scale. i < j. xH L is translation equivariant and antisymmetric..6 Problems and Complements 207 It is antisymmetric if for all Xl. . In .5.67. X are unbiased estimates of the center of symmetry of a symmetric distribution.. One reasonable choice for k is k = 1.
.. For the ideal sample of Problem 3.. we say that g(. the standard deviation does not exist.11.. thus. (b) Find thetml probabilities P(IXI > 2). plot the sensitivity curve of (a) iTn . (e) If k < 00.6). and is fixed. In what follows adjust f and 9 to have v = 0 and 'T = 1. and "" "'" (b) the Iratio of Problem 3.25. ii. Xk) is a finite constant. where S2 = (n .• 13. Location Parameters.1)1 L:~ 1(Xi . •• . This problem may be done on the computer.) has heavier tails than f() if g(x) is above f(x) for Ixllarge. P(IXI > 3) and P(IXI > 4) for the nonnal. Use a fixed known 0"0 in place of Ci.d. If f(·) and g(. 00.208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0.5. (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10.7(a). Let X be a random variable with continuous distribution function F. The (student) tratio is defined as 1= v'n(x I'o)/s.5. 00.X. In the case of the Cauchy density. 1 Xnl to have sample mean zero. . 9. with density fo((. we will use the IQR scale parameter T = X. Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00. (d) Xk is translation equivariant and antisymmetric (see Problem 3. Let JJo be a hypothesized mean for a certain population.i.75 . then limlxj>00 SC(x.) are two densities with medians v zero and identical scale parameters 7. . = O. and Cauchy distributions. . Laplace. . . 11.O)/ao) where fo(x) for Ixl for Ixl <k > k. Let 1'0 = 0 and choose the ideal sample Xl. Suppose L:~i Xi ~ .. with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O. iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00. [. thenXk is the MLEofB when Xl. The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \.~ !~ t . n is fixed.5.Xn arei. (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr..x}2." . X 12. Show that SC(x.
it is called a location parameter. SC(a + bx. x n _. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y.5. . median v. compare SC(x.. ([v(F). . then ()(F) = c. . ~ = d> 0.xn . b > 0.F). In this case we write X " Y. . (e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval. let H(x) be the distribution function whose inverse is H'(a) = ![xax. Hint: For the second part. .  vl/O. 8) = IF~ (x.. ..t )) = 0. F(t) ~ P(X < t) > P(Y < t) antisymmctric. and note thatH(xi/(F» < F(x) < H(xv(F». 8) and ~ IF(x.1: (a) Show that SC(x._. a. i/(F)] and. F) lim n _ oo SC(x.. 8. Fn _ I ).. i/(F)] is the value of some location parameter. (b) Show that the mean Ji.) That is.) 14. where T is the median of the distributioo of IX location parameter. 8. i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F). a + bx.. X is said to be stochastically smaller than Y if = G(t) for all t E R. ~ se is shift invariant and scale equivariant._. Show that J1.xn ).(k) is a (d) For 0 < " < 1..(F): 0 <" < 1/2}. 8. If 0 is scale and shift equivariant. and sample trimmed mean are shift and scale equivariant. c + dO.:. then v(F) and i/( F) are location parameters. b > 0. v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v. then for a E R. let v" = v.O.Xn_l) to show its dependence on Xnl (Xl.. and ordcr preserving. 0 < " < 1. (jx.5. Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F)..t. sample median. Also note that H(x) is symmetric about zero.a +bxn ) ~ a + b8n (XI. An estimate ()n is said to be shift and scale equivariant if for all xl.67 and tPk is defined in Problem 3.. Let Y denote a random variable with continuous distribution function G. In Remark 3. ~ (b) Write the Be as SC(x. and trimmed population mean JiOl (see Problem 3. ~ (a) Show that the sample mean. if F is also strictly increasing.).. ~ ~ 15.5. . ~ ~ 8n (a +bxI.Section 3.8. any point in [v(F). xnd.6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :.(F) = !(xa + XI"). n (b) In the following cases. the ~ = bdSC(x.8.. (a) Show that if F is symmetric about c and () is a location parameter.]..5) are location parameters. (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :. · . c E R. Show that if () is shift and location equivariant. .
I F(x.1 is commonly known as the Cramer~Rao inequality. ~ ~ 17. also true of the method of coordinate ascent. F)! ~ 0 in the cases (i). 18. We define S as the .3 (1) A technical problem is to give the class S of subsets of:F for which we can assign probability (the measurable sets).0 > 0 (depending on t/J) such that if lOCO) . Let d = 1 and suppose that 'I/J is twice continuously differentiable.7 NOTES Note for Section 3. we in general must take on the order of log ~ steps. we shall • I' . {BU)} do not converge.4. • .rdF(l:). (b) If no assumptions are made about h. show that (a) If h is a density that is symmetric about zero. C < 00. Assume that F is strictly increasing.1) Hint: (a) Try t/J(x) = Alogx with A> 1. where B is the class of Borel sets. Show that in the bisection method.t is identifiable. . In the gross error model (3. and we seek the unique solution (j of 'IjJ(8) = O. 81" < 0.B ~ {F E :F : Pp(A) E B}. Frechet.OJ then l(I(i) . B E B. . in order to be certain that the Jth iterate (j J is within e of the desired () such that 'IjJ( fJ) = 0.1/. (1) The result of Theorem 3. and (iii) preceding? ~ 16. (b) Show that there exists. (b) ~ ~ .Bilarge enough. 3.4 I . then fJ. 0.' > 0. is not identifiable. Because priority of discovery is now given to the French mathematician M. then J..BI < C1(1(i. The NewtonRaphson method in this case is .210 (i) Measures of Performance Chapter 3 O(F) ~ 1'1' ~ J . (e) Does n j [SC(x. ~ I(x  = X n . IlFfdF(x). · . (ii) O(F) = (iii) e(F) "7. ) (a) Show by example that for suitable t/J and 1(1(0) . I : ..2). . A. (ii). This is. .field generated by SA. • • . consequently.5. .8) . I I Notes for Section 3.
P. 2nd ed. F. AND A. Robust Estimates of Location: Sun>ey and Advances Princeton. BICKEL.. P. Statist. J. WALLACE. A. "Unbiased Estimation in Convex. 10381044 (l975a). TUKEY. BAXTER. J. 3.Section 3. 3. BICKEL. ... (3) The continuity of the first integral ensures that : (J [rO )00 . Jroo T(x) [~p(X' 0)] dx roo Joo 8(J 00 00 00 for all (J whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in (4) The finiteness of Var8(T(X)) and f(O) imply that 1/J'(0) is finite by the covariance interpretation given in (3. 0.(T(X)) = 00. BICKEL. BOHLMANN.. I. 2. J. 1970..cific Asian Conference on Knowledge Discovery and Data Mintng Melbourne: SpringerVerlag. Location. ofStatist. BICKEL. AND C. T. DOWE. A." Ann. W. 4. 1. J.8 REFERENCES ANDREWS.. Point Estimation Using the KullbackLeibler Loss Function and MML. AND N.8). R. AND E. M. Statist. ROGERS. BERGER. H. LEHMANN. P. Statist. M. D. F. J. BERNARDO. Jroo T(x) :>. AND E." Ann. W. S. 1972.. P. 40. BICKEL. 1974. APoSTOL. 1122 (1975).. H.. L. 1994. M. J.8 References 211 follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement. DAHLQUIST. Ma. in Proceedings of the Second Pa.thematical Methods in Risk Theory Heidelberg: Springer Verlag. P.. R. Bayesian Theory New York: Wiley. "Measures of Location and AsymmeUy. "Descriptive Statistics for Nonparametric Models. 1974. HAMPEL. Numen'cal Analysis New York: Prentice Hall. OLIVER. Statist. 11391158 (1976). ANDERSON.. p(x. III. AND E. (2) Note that this inequality is true but uninteresting if f(O) = 00 (and 1/J'(0) is finite) or if Var.. "Descriptive Statistics for Nonparametric Models. G. 3. H. 15231535 (1969).. Families.)dXd>. BJORK. NJ: Princeton University Press. DE GROOT." Scand. Statistical Decision Theory and Bayesian Analysis New York: Springer. M. Reading. SMITH.." Ann. Dispersion. Mathematical Analysis. F." Ann. AND J. LEHMANN. Introduction. Math.] = Jroo.. II. D.. DoKSUM.10451069 (1975b). 1985. K. AND E.. A.4. LEHMANN.. LEHMANN. MA: AddisonWesley. >. 1969. P. Optimal Statistical Decisions New York: McGrawHill. "Descriptive Statistics for Nonparametric Models. HUBER.. 1998.
383393 (1974). Amer." Ann. 69. L.. StatiSl. ROUSSEUW." Proc. D. 49. 69. MA: AddisonWesley. 43.212 Measures of Performance Chapter 3 HAMPEL. Wiley & Sons." J. Statist. MA: AddisonWesley.. Theory ofPoint Estimation. C. L. F. Actuarial J. H. AND W. HUBER. "Estimation and Inference by Compact Coding (With Discussions). H. "Boostrap Estimate of KullbackLeibler Information for Model Selection. • NORBERG. KARLIN. j I JEFFREYS... 538542(1973). FREEMAN." Ann. LEHMANN. (2000). Introduction to Probability and Statistics from a Bayesian Point of View. "The Influence Curve and Its Role in Robust Estimation... R. Amer.. Robust Statistics New York: Wiley. E. S. M. R." J.. 10201034 (1971). 1954. "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification. LINDLEY. Stlltist.. C. Stu/ist.. 1986. and Economics Reading. YU. A. Third Berkeley Symposium on Math. L. 1965. AND G." Statistical Science. 1998.. London.... AND B. 1959. JAECKEL. . HOGG. Royal Statist. STEIN. 1986.. STAHEL.. HUBER. ." J. "Robust Estimates of Location. B. B. J.. L. Assoc. Assoc. Assoc. SHIBATA." Scand. E. . LINDLEY. HANSEN. RoyalStatist. 1. P.V.222 (1986). R. Math. 49. Programming." Statistica Sinica. 1. W. 2nd ed. I. HAMPEL. R. • • 1 . Stall'. . Math.. 13. E. Soc.. StrJtist. Part I: Probability. TuKEY. D. A.10411067 (1972). 7. "Adaptive Robust Procedures.n. 909927 (1974). R. Y. 136141 (1998).. 1948.375394 (1997). 197206 (1956). Statist. The Foundations afStatistics New York: J. Soc. LEHMANN. P.. CASELLA. RISSANEN. Exploratory Data Analysis Reading. University of California Press. "Stochastic Complexity (With Discussions):' J. and Probability. E. Mathematical Methods and Theory in Games. New York: Springer. J. P. 1981. "Robust Statistics: A Review.. RONCHEro. WALLACE. Math... Testing Statistical Hypotheses New York: Springer. 2Q4.. AND P.. 223239 (1987). "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution. SAVAGE." J. 240251 (1987)." Ann. London: Oxford University Press.. Robust Statistics: The Approach Based on Influence Functions New York: J. "Decision Analysis and BioequivaIence Trials. Theory ofProbability. "On the Attainment of the CramerRao Lower Bound. 2nd ed. Part II: Inference. "Model Selection and the Principle of Mimimum Description Length. Wiley & Sons. WUSMAN. 1972. S. 42. Cambridge University Press. Amer..
e Example 4. and we have data providing some evidence one way or the other.Nfb the numbers of admitted male and female applicants. modeled by us as having distribution P(j. and the corresponding numbers N mo . 3. peIfonn a survey.3 we defined the testing problem abstractly.Chapter 4 TESTING AND CONFIDENCE REGIONS: BASIC THEORY 4.3. But this model is suspect because in fact we are looking at the population of all applicants here. The design of the experiment may not be under our control. where is partitioned into {80 .1 INTRODUCTION In Sections 1.3 the questions are sometimes simple and the type of data to be gathered under our control. or more generally construct an experiment that yields data X in X C Rq. Here are two examples that illustrate these issues. treating it as a decision theory problem in which we are to decide whether P E Po or P l or.Pfl. it might be tempting to model (Nm1 .2. As we have seen. 8 1 are a partition of the model P or. distribution. respectively.PmllPmO. and 3. what is an appropriate stochastic model for the data may be questionable. 8d with 8 0 and 8. Nfo) by a multinomial.Nf1 . Accepting this model provisionally. where Po. conesponding. in examples such as 1. Usually. the parameter space e. and indeed most human activities. They initially tabulated Nm1. pUblic policy. what does the 213 . not a sample. whether (j E 8 0 or 8 1 jf P j = {Pe : () E 8 j }. respectively. If n is the total number of applicants. Sex Bias in Graduate Admissions at Berkeley. the situation is less simple. M(n.1. This framework is natural if. Nmo. to answering "no" or "yes" to the preceding questions. and what 8 0 and 8 1 correspond to in tenns of the stochastic model may be unclear. we are trying to get a yes or no answer to important questions in science. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. Nfo of denied applicants. parametrically. medicine.1.PfO). Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial. as is often the case. PI or 8 0 . o E 8.1.
. . It was noted by Fisher corresponds to H : p = ~ with the alternative K : p as reported in Jeffreys (1961) that in this experiment the observed fraction ':: was much closer to 3. Mendel crossed peas heterozygous for a trait with two alleles. • I I . if there were n dominant offspring (seeds). . has a binomial (n. D). B (n. that N AA. . ~.2. .: Example 4. N mOd . ~). I . . The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis. In these tenns the hypothesis of "no bias" can now be translated into: H: Pml Pmld PmOd Pfld Pild + + PfOd ..p) distribution. NIld. and so on. where N m1d is the number of male admits to department d. .. .=7xlO 5 ' n 3 .than might be expected under the hypothesis that N AA has a binomial. either N AA cannot really be thought of as stochastic or any stochastic I . 0 I • ... OUf multinomial assumption now becomes N ""' M(pmld' PmOd. m P [ NAA . for d = 1.! . The hypothesis of dominant inheritance ~. • . the natural model is to assume. if the inheritance ratio can be arbitrary." then the data are naturally decomposed into N = (Nm1d. In a modem formulation. Pfld. pml +P/I !. In one of his famous experiments laying the foundation of the quantitative theory of genetics. .. and O'Connell (1975). . admissions are petfonned at the departmental level and rates of admission differ significantly from department to department. ! . as is discussed in a paper by Bickel. I .214 Testing and Confidence Regions Chapter 4 hypothesis of no sex bias correspond to? Again it is natural to translate this into P[Admit I Male] = Pm! Pml +PmO = P[Admit I Female] = P fI Pil +PjO But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability Pml Pi! Pml + PmO PIl +PiO [n fact. Hammel. This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate..D.. one of which was dominant. That is. • Pml + P/I + PmO + Pio • I . D). the number of homozygous dominants.1. i:. Mendel's Peas. .n 3 t I . .< . The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). . d = 1. I I 1] Fisher conjectured that rather than believing that such a very extraordinary event occurred it is more likely that the numbers were made to "agree with theory" by an overzealous assistant.. the same data can lead to opposite conclusions regarding these hypothesesa phenomenon called Simpson's paradox. distribution. If departments "use different coins. d = 1. NjOd. PfOd. • In fact.
Thus. B) distribution. In this example with 80 = {()o} it is reasonable to reject IJ if S is "much" larger than what would be expected by chance if H is true and the value of B is eo. In situations such as this one . Now 8 0 = {Oo} and H is simple. is better defined than the alternative answer 8 1 . (}o] and eo is composite. and accept H otherwise. If the theory is false.(1) As we have stated earlier. where 1 ~ E is the probability that the assistant fudged the data and 6!. where Xi is 1 if the ith patient recovers and 0 otherwise. a) = 0 if BE 8 a and 1 otherwise.€)51 + cB( nlP). . then 8 0 = [0 . see. and then base our decision on the observed sample X = (X J. and we are then led to the natural 0 . is point mass at . Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug.1. then = [00 . then S has a B(n. (2) If we let () be the probability that a patient to whom the new drug is administered recovers and the population of (present and future) patients is thought of as infinite.. The set of distributions corresponding to one answer.Xn ). = Example 4. 8 1 is the interval ((}o. our discussion of constant treatment effect in Example 1. It is convenient to distinguish between two structural possibilities for 8 0 and 8 1 : If 8 0 consists of only one point. administer the new drug. n 3' 0 What the second of these examples suggests is often the case. in the e . Moreover. 0 E 8 0 . These considerations lead to the asymmetric formulation that saying P E Po (e E 8 0 ) corresponds to acceptance of the hypothesis H : P E Po and P E PI corresponds to rejection sometimes written as K : P E PJ . recall that a decision procedure in the case of a test is described by a test function Ii: x ~ {D. the set of points for which we reject. When 8 0 contains more than one point. it's not clear what P should be as in the preceding Mendel example. for instance.11. Suppose that we know from past experience that a fixed proportion Bo = 0. It will turn out that in most cases the solution to testing problem~ with 80 simple also solves the composite 8 0 problem. I} or critical region C {x: Ii(x) = I}. we call 8 0 and H simple. we reject H if S exceeds or equals some integer.!.1 Introduction 215 model needs to pennit distributions other than B( n. 8 0 and H are called compOSite. say 8 0 . suppose we observe S = EXi . That is. for instance.1. . In science generally a theory typically closely specifies the type of distribution P of the data X as. That a treatment has no effect is easier to specify than what its effect is.3. where 00 is the probability of recovery usiog the old drug. See Remark 4. We illustrate these ideas in the following example. we shall simplify notation and write H : () = eo.1. What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug. the number of recoveries among the n randomly selected patients who have been administered the new drug. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. say k. p). If we suppose the new drug is at least as effective as the old.1 loss l(B. K : () > Bo Ifwe allow for the possibility that the new drug is less effective than the old. 11 and K is composite. P = Po.3 recover from the disease with the old drug. The same conventions apply to 8] and K. acceptance and rejection can be thought of as actions a = 0 or 1. say.3. 
To investigate this question we would have to perform a random experiment. Most simply we would sample n patients, administer the new drug, and then base our decision on the observed outcomes.
{3(8.2(b) is the one to take in all cases with 8 0 and 8 1 simple.(6) = Pe. even nominally. the probabilities of type II error as 8 ranges over 8 1 are determined.1.1 Introduction 217 There is an important class of situations in which the Neyman Pearson framework is inappropriate.3 with O(X) = I{S > k}.Ok) = P(S > k) A plot of this function for n ~ t( j=k n ) 8i (1 .(S > 6) = 0. This quantity is called the size of the test and is the maximum probability of type I error. our test has size a(c) given by a(e) ~ sup{Pe[T(X) > cJ: 8 E eo}· (4. the approach of Example 3. Ll) Nowa(e) is nonincreasing in c and typically a(c) r 1 as c 1 00 and a(e) 1 0 as c r 00. It is referred to as the level a critical value. Example 4. Finally. Both the power and the probability of type I error are contained in the power function.. The values a = 0. if we have a test statistic T and use critical value c. In Example 4. we find from binomial tables the level 0.05 critical value 6 and the test has size . Indeed. Begin by specifying a small number a > 0 such that probabilities of type I error greater than a arc undesirable. The power of a test against the alternative 0 is the probability of rejecting H when () is true. such as the quality control Example 1.2.1.1.9. Such tests are said to have level (o/significance) a. 0) is the {3(8. Thus.1. Once the level or critical value is fixed. Because a test of level a is also of level a' > a.2.05 are commonly used in practice. and we speak of rejecting H at level a. 80 = 0.1. then the probability of type I error is also a function of 8. in the Bayesian framework with a prior distribution on the parameter. In that case.0473.1. This is the critical value we shall use.j J = 10. even though there are just two actions. whereas if 8 E power against (). it is convenient to give a name to the smallest level of significance of a test. . e t .3 and n = 10.Section 4. k = 6 is given in Figure 4. The power is a function of 8 on I.1. if our test statistic is T and we want level a. there exists a unique smallest c for which a(c) < a. the power is 1 minus the probability of type II error. we can attach.3 (continued). numbers to the two losses that are not equal and/or depend on 0.0) = Pe[Rejection] = Pe[o(X) = 1] = Pe[T(X) If 8 E 8 0 • {3(8. If 8 0 is composite as well. See Problem 3. By convention 1 . 8 0 = 0. Here > cJ.8)n. That is. Definition 4.01 and 0. Here are the elements of the Neyman Pearson story.1. Specifically. it is too limited in any situation in which. Then restrict attention to tests that in fact have the probability of rejection less than or equal to a for all () E 8 0 . 0) is just the probability of type! errot.3. if 0 < a < 1. which is defined/or all 8 E 8 by e {3(8) = {3(8.P [type 11 error] is usually considered. It can be thought of as the probability that the test will "detect" that the alternative 8 holds.
[Figure 4.1.1 appears here: a plot of the power function β(θ, δk) against θ for 0 ≤ θ ≤ 1.]

Figure 4.1.1. Power function of the level 0.05 one-sided test δk of H : θ = 0.3 versus K : θ > 0.3 for the B(10, θ) family of distributions. Here k = 6 and the size is 0.0473. The power is plotted as a function of θ.

Note that in this example the power at θ = θ1 > 0.3 is the probability that the level 0.05 test will detect an improvement of the recovery rate from 0.3 to θ1 > 0.3. When θ1 is 0.5, a 67% improvement, this probability is only 0.3770. What is needed to improve on this situation is a larger sample size n. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. We return to this question in Section 4.3.

Remark 4.1.1. From Figure 4.1.1 it appears that the power function is increasing (a proof will be given in Section 4.3). It follows that the level and size of the test are unchanged if instead of Θ0 = {θ0} we used Θ0 = [0, θ0]. That is,

α(k) = sup{Pθ[T(X) ≥ k] : θ ∈ Θ0} = Pθ0[T(X) ≥ k].

Example 4.1.4. One-Sided Tests for the Mean of a Normal Distribution with Known Variance. Suppose that X = (X1, ..., Xn) is a sample from a N(μ, σ²) population, where σ² is known, and that we want to test H : μ ≤ 0 versus K : μ > 0. This problem arises when we want to compare two treatments, or a treatment and a control (nothing), and both treatments are administered to the same subject. For instance, suppose we want to see if a drug induces sleep. We might, for each of a group of n randomly selected patients, record sleeping time without the drug (or after the administration of a placebo) and then after some time administer the drug and record sleeping time again. Let Xi be the difference between the time slept after administration of the drug and the time slept without administration of the drug by the ith patient. If we assume X1, ..., Xn are normally distributed with mean μ and variance σ², then the drug effect is measured by μ; H is the hypothesis that the drug has no effect or is detrimental, whereas K is the alternative that it has some positive effect.
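Returning for a moment to the binomial test of Example 4.1.3 pictured in Figure 4.1.1, its size and power can be computed directly from the B(10, θ) distribution. The following is a minimal numerical check, assuming Python with scipy (not part of the text); it reproduces the size 0.0473 and the power 0.3770 at θ1 = 0.5 quoted above.

```python
from scipy.stats import binom

n, k, theta0 = 10, 6, 0.3

def power(theta, n=n, k=k):
    # beta(theta) = P_theta(S >= k) for S ~ B(n, theta)
    return binom.sf(k - 1, n, theta)

print(f"size  = {power(theta0):.4f}")        # about 0.0473
print(f"power at theta = 0.5: {power(0.5):.4f}")  # about 0.3770
```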
where T(l) < . which generates the same family of critical regions. has level a if £0 is continuous and (B + 1)(1 .2 and 4.a) is the (1. ~ estimates(jandd(~. However.1.y) : YES}.1. o The Heuristics of Test Construction When hypotheses are expressed in terms of an estimable parameter H : (j E eo c RP.( c) = C\' or e = z(a) where z(a) = z(1 .i. T(X(B)) from £0. critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods.P. if we generate i.V. (12) observations with both parameters unknown (the t tests of Example 4.2.d. #.1 and Example 4. . eo). That is.~I.3..Section 4. N~A is the MLE of p and d(N~A. in any case.1 Introduction 219 Because X tends to be larger under K than under H. < T(B+l) are the ordered T(X). It is convenient to replace X by the test statistic T(X) = . This occurs if 8 0 is simple as in Example 4.co .5). Rejecting for large values of this statistic is equivalent to rejecting for large values of X. it is natural to reject H for large values of X.(T(X)) doesn't depend on 0 for 0 E 8 0 . 1) distribution.1. T(X(lI).   = .3. Given a test statistic T(X) we need to determine critical values and eventually the power of the resulting tests. In all of these cases.• T(X(B)). £0.o if we have H(p. But it occurs also in more interesting situations such as testing p. The smallest c for whieh ~(c) < C\' is obtained by setting q:.5...p) because ~(z) (4.1.a) quantile oftheN(O.1.9).. where d is the Euclidean (or some equivalent) distance and d(x. the common distribution of T(X) under (j E 8 0 . The task of finding a critical value is greatly simplified if £.I') ~ ~ (c + V. Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward. In Example 4. it is clear that a reasonable test statistic is d((j.~(z).3.1.1.~ (c . S) inf{d(x. (Tn) for (j E 8 0 is usually invariance The key feature of situations in which under the action of a group of transformations_ See Lehmann (1997) and Volume II for discussions of this property.co =.(jo) = (~ (jo)+ where y+ = Y l(y > 0).a) is an integer (Problem 4. Because (J(jL) a(e) ~ is increasing. ..eo) = IN~A . p ~ P[AA]. The power function of the test with critical value c is p " [Vii (X !") > e _ Vii!"] (J (J 1.2) = 1. sup{(J(p) : p < OJ ~ (J(O) = ~(c).nX/ (J. In Example 4. then the test that rejects iff T(X) > T«B+l)(la)).. T(X(l)). has a closed form and is tabled.o versus p. and we have available a good estimate (j of (j. This minimum distance principle is essentially what underlies Examples 4..1. . . = P.
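Before turning to those examples, here is a sketch of the Monte Carlo recipe just described for setting a critical value when the null distribution L0 of T(X) does not depend on θ ∈ Θ0. The sketch assumes Python with numpy; the particular statistic and simulator in the illustration are placeholders, not from the text.

```python
import numpy as np

def monte_carlo_test(t_obs, T, simulate_null, B=999, alpha=0.05, seed=None):
    """Reject H iff t_obs exceeds the (B+1)(1-alpha)-th order statistic of
    {t_obs, T(X^(1)), ..., T(X^(B))}, the X^(b) being simulated under H.
    As in the text, this has level alpha when (B+1)(1-alpha) is an integer."""
    rng = np.random.default_rng(seed)
    sims = np.array([T(simulate_null(rng)) for _ in range(B)])
    ordered = np.sort(np.append(sims, t_obs))
    j = int(round((B + 1) * (1 - alpha)))      # e.g. 950 when B = 999, alpha = 0.05
    return t_obs > ordered[j - 1]

# Illustration (hypothetical): T(X) = sqrt(n) * Xbar with n = 10 and sigma = 1,
# so that under H : mu = 0 the data can be simulated as N(0, 1) samples.
n = 10
T = lambda x: np.sqrt(n) * x.mean()
x = np.random.default_rng(1).normal(loc=0.8, size=n)   # data drawn from an alternative
print(monte_carlo_test(T(x), T, lambda g: g.normal(size=n), B=999, alpha=0.05, seed=2))
```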
and . < Fo(x)} = U(Fo(x)) . Let F denote the empirical distribution and consider the sup distance between the hypothesis Fo and the plugin estimate of F. Proof. F o (Xi).4..6. o Note that the hypothesis here is simple so that for anyone of these hypotheses F = F o• the distribution can be simulated (or exhibited in closed fonn). P Fo (D n < d) = Pu (D n < d). the empirical distribution function F. can be wriHen as Dn =~ax max tI. This statistic has the following distributionjree property: 1 Proposition 4.   Dn = sup IF(x) . :i' D n = sup [U(u) . and h n (L224) for" = . This is again a consequence of invariance properties (Lehmann.1. thus. 1) distribution.u[ O<u<l .. which is evidently composite. 0 Example 4.220 Testing and Confidence Regions Chapter 4 1 . Suppose Xl.1. F and the hypothesis is H : F = ~ (' (7'/1:) for some M. < x(n) is the ordered observed sample..3.1 El{Fo(Xi ) < Fo(x)} n. for n > 80.5. 1). .)} n (4. In particular. as a tcst statistic ..Fa.. Set Ui ~ j . Goodness of Fit to the Gaussian Family.i. The natural estimate of the parameter F(!... n. .. X n are ij. We can proceed as in Example 4. .d.Fo(x)[. The distribution of D n under H is the same for all continuous Fo. • Example 4.Xn be i.. • . then by Problem B.. 1). which is called the Kolmogorov statistic. F. U.12 + OJI/Vri) close approximations to the size" critical values ka are h n (L628)..d.n {~ n Fo(x(i))' FO(X(i)) _ . and the result follows. 1997). where F is continuous. Consider the problem of testing H : F = Fo versus K : F i. + (Tx) is . As x ranges over R.10 respectively. . h n (L358).05. .. .U. The distribution of D n has been thoroughly smdied for finite and large n. ..(i_l. that is. Let X I. Goodness of Fit Tests..1.1 J) where x(l) < . ..01.L5 rewriting H : F(!' + (Tx) = <p(x) for all x where !' = EF(X ..1.1 El{U. where U denotes the U(O.. x It can be shown (Problem 4.1 El{Xi ~ U(O. ). . as X . In particular. What is remarkable is that it is independ~nt of which F o we consider. (T2 = VarF(Xtl. .1.7) that Do. the order statistics. . Un' where U denotes the empirical distribution function of Ul u = Fo(x) ranges over (0. and hn(t) = t/( Vri + 0. Also F(x) < x} = n.
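Formula (4.1.3) and the distribution-free property of Proposition 4.1.1 make both the Kolmogorov statistic and its critical values easy to compute numerically. The following is a minimal sketch, assuming Python with numpy (not part of the text); the simulated critical values can be compared with the approximation h_n(1.358) for α = .05 mentioned above.

```python
import numpy as np

def kolmogorov_D(x, F0):
    """D_n = sup_x |Fhat(x) - F0(x)|, computed from the order statistics as in (4.1.3)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    u = F0(x)                                  # F0(x_(1)), ..., F0(x_(n))
    i = np.arange(1, n + 1)
    return float(np.max(np.maximum(i / n - u, u - (i - 1) / n)))

def D_critical_value(n, alpha=0.05, B=5000, seed=0):
    """Simulate the null law of D_n; by Proposition 4.1.1, U(0, 1) samples suffice."""
    rng = np.random.default_rng(seed)
    sims = [kolmogorov_D(rng.uniform(size=n), lambda u: u) for _ in range(B)]
    return float(np.quantile(sims, 1 - alpha))

# For n = 80 and alpha = .05, D_critical_value(80) is close to
# 1.358 / (sqrt(80) + 0.12 + 0.11 / sqrt(80)).
```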
a) + IJth order statistic among Tn. where X and 0'2 are the MLEs of J1 and we obtain the statistic F(X + ax) Applying the sup distance again. H is rejected. this difficulty may be overcome by reporting the outcome of the experiment in tenus of the observed size or pvalue or significance probability of the test. Tn has the same distribution £0 under H.01.4) Considered as a statistic the pvalue is <l'( y"nX /u).05. . the pvalue is <l'( T(x)) = <l' (.. if X = x. I).Z) / (~ 2::7 ! (Zi  if) • . .. and the critical value may be obtained by simulating ij. and only if. But. observations Zi. we would reject H if. Example 4.. Therefore. whereas experimenter II accepts H on the basis of the same outcome x of an experiment. Tn!.X) . If the two experimenters can agree on a common test statistic T.i.X)/ii..' . .<l'(x)1 where G is the empirical distribution of (L~l"'" Lln ) with Lli (Xi .4.1 Introduction 221 (12. a > <l'( T(x)). from N(O.Section 4. {12 and is that of ~  (Zi . then computing the Tn corresponding to those Zi. 1 < i < n. and only if. (Sec Section 8. . for instance. · · . I < i < n. Consider. under H. In).d. Now the Monte Carlo critical value is the I(B + 1)(1 . (4. . Experimenter] may be satisfied to reject the hypothesis H using a test with size a = 0. where Z I..) Thus.. .. If we observe X = x = (Xl. . the joint distribution of (Dq . H is not rejected. thereby obtaining Tn1 .(3) 0 The pValue: The Test Statistic as Evidence o Different individuals faced with the same testing problem may have different criteria of size. 1. .d. T nB . That is..<l'(x)1 sup IG(x) . ~n) doesn't depend on fl.. N(O.. whatever be 11 and (12. We do this B times independently. . This quantity is a statistic that is defined as the smallest level of significance 0' at which an experimenter using T would reject on the basis ofthe observed outcome x. It is then possible that experimenter I rejects the hypothesis H.3.1. T nB . Tn sup IF'(X x x + iix) . if the experimenter's critical value corresponds to a test of size less than the p~value. otherwise. Zn arc i. a satisfies T(x) = . whereas experimenter II insists on using 0' = 0.(iix u > z(a) or upon applying <l' to both sides if. 1).2.
1. More generally. Thus. 80).. Thus.6) I • to test H. The statistic T has a chisquare distribution with 2n degrees of freedom (Problem 4. when H is well defined. aCT) has a uniform. but K is not. I.1. let X be a q dimensional random vector.2.. Then if we use critical value c. It is possible to use pvalues to combine the evidence relating to a given hypothesis H provided by several different independent experiments producing different kinds of data.g. Similarly in Example 4.1. a(T) is on the unit interval and when H is simple and T has a continuous distribution. if r experimenters use continuous test statistics T 1 . 1). .1.222 Testing and Confidence Regions Cnapter 4 In general. Suppose that we observe X = x. the smallest a for which we would reject corresponds to the largest c for which we would reject and is just a(T(x)).1. we would reject H if. In this context.• see f [' Hedges and Olkin. Thus.(S > s) '" 1.. that is.5) .1. ' I values a(T. then if H is simple Fisher (1958) proposed using l j :i. a(8) = p. .6). But the size of a test with critical value c is just a(c) and a(c) is decreasing in c.Oo)} > 5. We have proved the following.. a(Tr ).ath quantile of the X~n distribution. 1985). . Thus. n(1 . ''The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p. . I T r to produce p . For example. distribution (Problem 4. so that type II error considerations are unclear. and the pvalue is a( s) where s is the observed value of X.4)..5)..<I> ( [nOo(1 ~ 0 )1' 2 s~l~nOo) 0 . !: The pvalue is used extensively in situations of the type we described earlier. This is in agreement with (4. The normal approximation is used for the pvalue also. the largest critical value c for which we would reject is c = T(x). T(x) > c. for miu{ nOo.). U(O. these kinds of issues are currently being discussed under the rubric of datafusion and metaanalysis (e.1.3. The pvalue is a(T(X)).. • T = ~2 I: loga(Tj) j=l ~ ~ r (4. Proposition 4. • i~! <:1 im _ .1). Various melhods of combining the data from different experiments in this way are discussed by van Zwet and Osterhoff (1967). ~ H fJ. to quote Fisher (1958). •• 1. (4.. The pvalue can be thought of as a standardized version of our original statistic. H is rejected if T > Xla where Xl_a is the 1 . We will show that we can express the pvalue simply in tenns of the function a(·) defined in (4. and only if.1.
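Fisher's proposal for combining the p-values α(T1), ..., α(Tr) of r independent experiments is straightforward to carry out: under H each α(Tj) is U(0, 1), so T = −2 Σ log α(Tj) has a χ² distribution with 2r degrees of freedom and H is rejected when T is large. Here is a minimal sketch, assuming Python with scipy (not part of the text); the three p-values in the usage line are purely illustrative.

```python
import numpy as np
from scipy.stats import chi2

def fisher_combination(pvalues, alpha=0.05):
    """Combine independent p-values: T = -2 * sum(log p_j) ~ chi^2_{2r} under H."""
    p = np.asarray(pvalues, dtype=float)
    T = -2.0 * np.sum(np.log(p))
    combined_p = chi2.sf(T, df=2 * len(p))     # P(chi^2_{2r} >= T)
    return T, combined_p, combined_p <= alpha

# e.g. fisher_combination([0.11, 0.04, 0.20]) combines three experiments' p-values.
```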
In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by L(x.3). OIl = p (x. We introduce the basic concepts and terminology of testing statistical hypotheses and give the NeymanPearson framework. we try to maximize the probability (power) of rejecting H when K is true.Otl/(1 . 00 . (0. critical regions. 0) 0 p(x.Section 4.Oo)]n. subject to this restriction. test functions.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. and S tends to be large when K : () = 01 > 00 is . (I . power function. Such a test and the corresponding test statistic are called most poweiful (MP). under H.) which is large when S true.01 )/(1. (null) hypothesis H and alternative (hypothesis) K. significance level. then. we specify a small number 0: and conStruct tests that have at most probability (significance level) 0: of rejecting H (deciding K) when H is true. or equivalently. We introduce the basic concepts of simple and composite hypotheses. I) = EXi is large. The statistic L is reasonable for testing H versuS K with large values of L favoring K over H. We start with the problem of testing a simple hypothesis H : () = ()o versus a simple alternative K : 0 = 01.2. and. o:(Td has a U(O. a given test statistic T. that is. power. 4. This is an instance of testing goodness offit. 0. is measured in the NeymanPearson theory. 1) distribution. test statistics. In the NeymanPearson framework. OIl > 0. the highest possible power. OIl where p(x. In this section we will consider the problem of finding the level a test that ha<. In particular./00)5[(1. 0) is the density or frequency function of the random vector X.2 and 3. L(x.1. we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter () belongs to 80 or e 1. by convention. Summary. where eo and e 1 are disjoint subsets of the parameter space 6. p(x. and pvalue.00)ln~5 [0. size. type II error. type I error. For instance. 00. 00 ) = 0. In Sections 3. equals 0 when both numerator and denominator vanish. that is.00)/00 (1 . (4. The statistic L takes on the value 00 when p(x.2 Choosing a Test Statistic The NeymanPearson lemma 223 The preceding paragraph gives an example in which the hypothesis specifies a distribution completely. Typically a test statistic is not given but must be chosen on the basis of its perfonnance. in the binomial example (4.0.2 CHOOSING A TEST STATISTIC: THE NEYMANPEARSON LEMMA We have seen how a hypothesistesting problem is defined and how perfonnance of a given test b. we test whether the distribution of X is different from a specified Fo.)1 5 [(1.
using (4. then I > O. i' ~ E.1. where 7l' denotes the prior probability of {Bo}.'P(X)] > kEo['PdX) . 'P(x) = 1 if S > 5. likelihood ratio tests are unbeatable no matter what the size a is.'P(X)] [:~~:::l. o Because L(x.1). (a) Let E i denote E8. B .0262 if S = 5.2) forB = BoandB = B. we consider randomized tests '1'. Theorem 4. that is.('Pk(X) .'P(X)[L(X. then it must be a level a likelihood ratio test. which are tests that may take values in (0. To this end consider (4. and 'P(x) = [0. and suppose r.. If L(x. (NeymanPearson Lemma).3.) For instance.['Pk(X) ..3).'P(X)] . Proof. It follows that II > O. then <{Jk is MP in the class oflevel a tests.1].Bo.'P(X)] a.[CPk(X) . 'Pk is MP for level E'c'PdX). . lOY some x.k] + E'['Pk(X) .kEo['Pk(X) . (c) If <p is an MP level 0: test. Note that a > 0 implies k < 00 and. (See also Section 1.3.2.224 Testing and Confidence Regions Chapter 4 We call iflk a likelihood ratio or NeymanPearsoll (NP) test ifunction) if for some 0 k < OJ we can write the test function Yk as 'Pk () = X < 1 o if L( x. Note (Section 3. Bo. Finally. If 0 < 'P(x) < 1 for the observation vector x. if want size a ~ .Bd >k <k with 'Pdx) any value in (0.2.2.7l').05 in Example 4. B. (4. 0 < <p(x) < 1.1) if equality occurs. where 1= EO{CPk(X)[L(X. Eo'P(X) < a. we have shown that I j I i E.2. Such randomized tests are not used in practice. thns.P(S > 5)I!P(S = 5) = .) .'P(X)] = Eo['Pk{X) . we choose 'P(x) = 0 if S < 5. B . k] . there exists k such that (4.'P(X)] > O. 'Pk(X) = 1 if p(x. i = 0. BI )  . We show that in addition to being Bayes optimal. 1. (b) For each 0 < a < 1 there exists an MP size 0' likelihood ratio test provided that randomization is permitted. and because 0 < 'P(x) < 1.2) that 'Pk is a Bayes rule with k ~ 7l' /(1 .2.05 . the interpretation is that we toss a coin with probability of heads cp(x) and reject H iff the coin shows heads.B.3 with n = 10 and Bo = 0.kJ}.. BIl o . Because we want results valid for all possible test sizes Cl' in [0. (a) If a > 0 and I{Jk is a size a likelihood ratio test.Bo.'P(X)] 1{P(X. Bo) = O} = I + II (say).k is < 0 or > 0 according as 'Pk(X) is 0 or 1.4) 1 j 7 . EO'PdX) We want to show E. Bo) = O.1.p is a level a test. They are only used to show that with randomization.) .3) 1 : > O. then ~ .
denotes the Bayes procedure of Exarnple 3. OJ! > k] > If Po[L(X. 00 . we conc1udefrom (4. See Problem 4. Here is an example illustrating calculation of the most powerful level a test <{Jk. that is. OJ.1.OI)' Proof. . 0. 00 ..2.) > k] Po[L(X.) . Consider Example 3.1T)p(x. 0. = v. If a = 1.2. The same argument works for x E {x : p(x.5) If 0. OJ! = ooJ ~ 0.2. It follows that (4.O.O. 00 ) > and 0 = 00 . ./f 'P is an MP level a test.7. 00 . = 'Pk· Moreover.Po[L(X.2(b).v) + nv ] 2(72 x . 00 .2. or 00 according as 1T(01 Ix) is larger than or smaller than 1/2. 0. I x) = (1 1T)p(X. Let Pi denote Po" i = 0..2. 0. Therefore. 0.0. Now 'Pk is MP size a.(X.. 60 . If not. 00 .v)=exp 2LX'2 .2.) > k and have 'P(x) = 'Pk(X) ~ 0 when L(x. for 0 < a < 1.pk is MP size a.2 Choosing a Test Statistic: The NeymanPearson lemma " 225 =:cc (b) If a ~ 0..1T)L(x.Section 4. We found nv2} V n L(X. 0. Example 4. 0.) < k.u 2 ) random variables with (72 known and we test H : 11 = a versus K : 11 = v. when 1T = k/(k + 1).) = k j. 'Pk(X) = 1 and !.Oo.O.fit [ logL(X.2. (c) Let x E {x: p(x. 0.) + 1T' (4.3. then E9. then 'Pk is MP size a.) = k] = 0. { (7 i=l 2(7 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. Then the posterior probability of (). tben. 00 . 0.) (1 . also be easily argued from this Bayes property of 'Pk (Problem 4.1. define a. OJ! > k] < a and Po[T. Remark 4.L. X n ) is a sample of n N(j. OJ Corollary 4.5) thatthis 0. 2 (7 T(X) ~.) > then to have equality in (4.10). 0.2.I.2.. 'P(X) > a with equality iffp(" 60 ) = P(·. is 1T(O. 00 ) (11T)L(x.2 where X = (XI.) + 1Tp(X. decides 0. k = 00 makes 'Pk MP size a.2) holds forO = 0. 'Pk(X) =a . where v is a known signal. 00 .k] on the set {x : L(x.2. Next consider 0 < Q: < 1.Oo.4) we need tohave'P(x) ~ 'Pk(X) = I when L(x. Part (a) of the lemma can.1. k = 0 makes E.fit.) (1 . then there exists k < 00 such that Po[L(X. 00 . Let 1T denote the prior probability of 00 so that (I 1T) is the prior probability of (it. Because Po[L(X. 0 It follows from the NeymanPearson lemma that an MP test has power at least as large as its level..
.2. say. 0.6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > O. In this example we bave assnmed that (Jo and (J. (Jj = (Pj. itl = ito + )"6.8)./iil (J)). . among other things.6) has probability of type I error 0:. then.. the test that rejects if. they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population CQvanances. by (4.2. ). Particularly important is the case Eo = E 1 when "Q large" is equivalent to "F = (ttl . The following important example illustrates. itl' However if. if we want the probability of detecting a signal v to be at least a preassigned value j3 (say.' Rejecting H for L large is equivalent to rejecting for I 1 . From our discussion there we know that for any specified Ct.4. • • .. Example 4.a)[~6Eo' ~oJ! (Problem 4. The likelihood ratio test for H : (J = 6 0 versus K : (J = (h is based on . if ito. then we solve <1>( z(a) + (v.9).1. that the UMP test phenomenon is largely a feature of onedimensional parameter problems.2. then this test rule is to reject H if Xl is large. We will discuss the phenomenon further in the next section. It is used in a classification context in which 9 0 .) test exists and is given by: Reject if (4. 6. Such a test is called uniformly most powerful (UMP). I I . Thus. T>z(la) (4./ii/(J)) = 13 for n and find that we need to take n = ((J Iv j2[z(la) + z(I3)]'.1. (JI correspond to two known populations and we desire to classify a new observation X as belonging to one or the other. Note that in general the test statistic L depends intrinsically on ito..2. But T is the test statistic we proposed in Example 4.2. By the NeymanPearsoo lemma this is the largest power available with a level Ct test. if Eo #. however. large. The power of this test is." The function F is known as the Fisher discriminant function. Eo are known. this is no longer the case (Problem 4.2). 0 An interesting feature of the preceding example is that the test defined by (4. Suppose X ~ N(Pj. <I>(z(a) + (v. . > 0 and E l = Eo. We return to this in Volume II.I. .. and only if.. E j ). 0. If this is not the case. This is the smallest possible n for any size Ct test.226 Testing and Confidence Regions Chapter 4 is also optimal for this problem.Jto)E01X large. If ~o = (1.7) where c ~ z(1 . for the two popnlations are known. l. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Flmction.0.95). a UMP (for all ).2. . j = 0. O)T and Eo = I. E j )..2.90 or .
3.. Suppose that (N1 . .u k nn. (}l. 0" . Example 4. Ok)' The simple hypothesis would correspond to the theory that the expected proportion of offspring of types 1.. 'P') > (3(0. This phenomenon is not restricted to the Gaussian case as the next example illustrates. Ok = OkO.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 227 Summary.3.. .. . . which states that the size 0:' SLR test is uniquely most powerful (MP) in the class of level a tests.'P) for all 0 E 8" for any other level 0:' (4. 4. the likelihood ratio L 's L~ rr(:. 1 (}k) distribution with frequency function. nk are integers summing to n. . ..p.Section 4. 0) P n! nn. However.. where nl. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : (j = ()o versus the simple alternative K : 0 = ()1.3. k are given by Ow. .. .3 UNIFORMLY MOST POWERFUL TESTS ANO MONOTONE LIKELIHOOD RATIO MODELS We saw in the two Gaussian examples of Section 4.1.OkO' Usually the alternative to H is composite. N k ) has a multinomial M(n. ."" (}k = Oklo In this case.)N' i=l to Here is an interesting special case: Suppose OjO integer I with 1 < I < k > 0 for all}. if a match in a genetic breeding experiment can result in k types. The NeymanPearson lemma. . here is the general definition of UMP: Definition 4. is established. . We note the connection of the MP test to the Bayes procedure of Section 3.3. 0 < € < 1 and for some fixed (4. then (N" . Two examples in which the MP test does not depend on 0 1 are given.2 for deciding between 00 and Ol. and N i is the number of offspring of type i. Such tests are said to be UMP (unifonnly most powerful). . there is sometimes a simple alternative theory K : (}l = (}ll.1) test !. A level a test 'P' is wtiformly most powerful (UMP) for H : 0 E versus K : 0 E 8 1 if eo (3(0. . _ (nl..nk..2 that UMP tests for onedimensional parameter problems exist. Nk) .···.. n offspring are observed.. .M( n.1. For instance. r lUI nt···· nk· ". With such data we often want to test a simple hypothesis H : (}I = Ow. Before we give the example.2) where . ... Testing for a Multinomial Vector...
Consider the oneparameter exponential family mode! o p(x. k. Example 4.. :.3. is of the form (4. N 1 < c. 0 . Example 4. Moreover. . equals the likelihood ratio test 'Ph(t) and is MP. = h(x)exp{ry(O)T(x) ~ B(O)}. This is part of a general phenomena we now describe. Suppose {P. then L(x. 6.. then this family is MLR. Bernoulli case. 0 form with T(x) . In this i. . we get radically different best tests depending on which Oi we assume to be (ho under H.OW and the model is by (4. set s then = l:~ 1 Xi. .d.O) = 0'(1 ..: 0 E e}...2) with 0 < f < I}. this test is UMP fortesting H versus K : 0 E 8 1 = {/:/ : /:/ . Ow) under H. < 1 implies that p > E.. (1) For each t E (0. B( n. = (I .(X) = '" > 0. Theorem 4. = . . e c R. p(x. Critical values for level a are easily determined because Nl . . : 0 E e} with e c R is said to be a monotone likelihood ratio (MLR) family if for (it < O the distributions POl and P0 2 are distinct and 2 the ratio p(x. we have seen three models where. Consider the problem of testing H : 0 = 00 versus K: 0 ~ 01 with 00 < 81.2.o)n[O/(1 . is an MLRfamily in T(x). the power function (3(/:/) = E.3) with 6. Example 4. The family of models {P. Thus.. Then L = pnN1£N1 = pTl(E/p)N1. there is a statistic T such that the test with critical region {x : T(x) > c} is UMP. Oil = h(T(x)) for some increasing function h. . . where u is known.3. = P( N 1 < c).nx/u and ry(!") Define the NeymanPearson (NP) test function 6 ( ) _ 1 ifT(x) > t .i. Oil is an increasing function ofT(x). . then 6. if and only if. Note that because l can be any of the integers 1 .00). type I is less frequent than under H and the conditional probabilities of the other types given that type I has not occurred are the same under K as they are under H. e c R.. (2) If E'o6. is an MLRfamily in T(x). . in the case of a real parameter.1) if T(x) = t.. it is UMP at level a = EOodt(X) for testing H : e = eo versus K : B > Bo in fact.2.nu)!".6. Because f.3. However. under the alternative.228 Testmg and Confidence Regions Chapter 4 That is. 00 .3. If 1J(O) is strictly increasing in () E e. we conclude that the MP lest rejects H. is UMP level". 02)/p(X.3. for".3 continned).1) MLR in s.(X) is increasing in 0. 0 Typically the MP test of H : () = eo versus K : () = (}1 depends on (h and the test is not UMP. X ifT(x) < t ° (4.1. Because dt does not depend on ()l. 0) ~.(x) any value in (0.1 is of this (.1.2.3. J Definition 4.3.o)n.2 (Example 4. If {P. : 8 E e}. for testing H : 8 < /:/0 versus K:O>/:/I' .
where h(a) is the ath quantile of the hypergeometric. e c p(x.(X). the critical constant 8(0') is u5xn(a).Section 4. Suppose {Po: 0 E Example 4. then by (1). Then.. she formulates the hypothesis H as > (Jo. If the inspector making the test considers lots with bo = N(Jo defectives or more unsatisfactory. where xn(a) is the ath quantile of the X.(X) for testing H: 0 = 01 versus J( : 0 = 0..l. If the distributionfunction Fo ofT(X) under X"" POo is continuous and ift(1a) isasolution of Fo(t) = 1 .4.<» is lfMP level afor testing H: (J < (Jo versus K : (J > (Jo. and specifies an 0' such that the probability of rejecting H (keeping a bad lot) is at most 0'. we now show that the test O· with reject H if. (l.Xn is a sample from a N(I1. N. where U O represents the minimum tolerable precision. if N0 1 = b1 < bo and 0 < x < b1 .(bl1) x. N . 0 Proof (I) follows from b l = iPh(t) The following useful result follows immediately.o. where 11 is a known standard. (8 < s(a)) = a. . R.3.3.... X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives. Quality Control.l) yields e L( 0 0 )=b. Because the most serious error is to judge the precision adequate when it is not. _1')2. the alternative K as (J < (Jo. Testing Precision. Corollary 4. Suppose X!. Thus. distribution. Suppose tha~ as in Example l. . For simplicity suppose that bo > n. where b = N(J. and only if. H. and we are interested in the precision u.3. If 0 < 00 . d t is UMP for H : (J < (Jo versus K : (J > (Jo. 20' 2  This is a oneparameter exponential family and is MLR in T = So The UMP level 0' test rejects H if and only if 8 < s(a) where s(a) is such that Pa . distribution. 0 Example 4. we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard.5.U 2 ) population. is an MLRfamily in T(r).l of the measurements Xl. 0) = exp {~8 ~ IOg(27r0'2)} . and because dt maximizes the power over this larger class. 1 bo(bol) (blX+l)(Nbl) (box+1)(Nbo) (Nbln+x+l) (Nbon+x+l)' . . For instance. To show (2).1.o.<>.2. X < h(a). recall that we have seen that e5 t maximizes the power for testing II : (J = (Jo versus K : (J > (Jo among the class of tests with level <> ~ Eo. n). If a is a value taken on by the distribution of X. is UMP level a. Eoo.bo > n. (X) < <> and b t is of level 0: for H : (J :S (Jo.l. we test H : u > Uo l versus K : u < uo. Ifwe write ~= Uo t i''''l (Xi _1')2 000 we see that Sju5 has a X~ distribution. Because the class of tests with level 0' for H : (J < (Jo is contained in the class of tests with level 0' for H : (J = (Jo. then e}. then the test/hat rejects H if and only ifT(r) > t(1 . Let S = l:~ I (X.l.1 by noting that for any (it < (J2' e5 t is MP at level Eo.Xn . 0. .3 Uniformly Most Po~rful Tests and Monotone likelihood Ratio Models 229 and Corollary 4.( NOo.
in (O . I' ..OIl box (Nn+I)(blx) .ntI") for sample size n. On the other hand. Thus.4 we might be uninterested in values of p.O'"O. forO <:1' < b1 1. if all points in the alternative are of equal significance: We can find > 00 sufficiently close to 00 so that {3( 0) is arbitrarily close to {3(00) = a. ".+.Il = (b l . In our example this means that in addition to the indifference region and level a. we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small. This equation is equiValent to = {3 z(a) + .1.c:O". For such the probability of falsely accepting H is almost 1 .' . By CoroUary 4.. This is possible for arbitrary /3 < 1 only by making the sample size n large enough..1.B 1 ) =0 forb l Testing and Confidence Regions Chapter 4 < X <n.. this is a general phenomenon in MLR family models with p( x..1 and formula (4.ntI" = z({3) whose solution is il .a.(b o .. The critical values for the hypergeometric distribution are available on statistical calculators and software. In Example 4.x) <I L(x.1. This is a subset of the alternative on which we are willing to tolerate low power. 0 Power and Sample Size In the NeymanPearson framework we choose the test whose size is small. This continuity of the power shows that not too much significance can be attached to acceptance of H. L is decreasing in x and the hypergeometric model is an MLR family in T( x) = r... and the powers are continuous increasing functions with limOlO. 1 I . we want guaranteed power as well as an upper bound on the probability of type I error. we specify {3 close to I and would like to have (3(!') > (3 for aU !' > t. Off the indifference region.4 because (3(p.n + I) .3.Oo.(}o. in general. this is. Note that a small signaltonoise ratio ~/ a will require a large sample size n.L.:(x:.) is increasing.I. {3(0) ~ a.2). That is.230 NotethatL(x.. Therefore. as seen in Figure 4. not possible for all parameters in the alternative 8. I i (3(t) ~ 'I>(z(a) + . ~) for some small ~ > 0 because such improvements are negligible. (0.1.. It follows that 8* is UMP level Q. we would also like large power (3(0) when () E 8 1 . Thus. 0) continuous in 0. i. . we want the probability of correctly detecting an alternative K to be large.x) (N . . the appropriate n is obtained by solving i' . This is not serious in practice if we have an indifference region.'. ~) would be our indifference region. In our nonnal example 4. However. ° ° .1. In both these cases. H and K are of the fonn H : () < ()o and K : () > ()o. that is.
In Example 4. It is natural to associate statistical significance with practical significance so that a very low pvalue is interpreted as evidence that the alternative that holds is physically significant.1. that is.0.3. There are various ways of dealing with this problem. 0 ° Our discussion can be generalized. This problem arises particularly in goodnessoffit tests (see Example 4.1.90.00 )] 1/2 . (3 = .05.) = (3 for n and find the approximate solution no+ 1 . we solve (3(00 ) = Po" (S > s) for s using (4.4 this would mean rejecting H if.3. First.35 and n = 163 is 0.55)}2 = 162. We solve For instance.35(0. Formula (4.4. Now let .80 ) . 1. (h = 00 + Dt. and 0. we fleed n ~ (0. the size .35.4. ifn is very large and/or a is small. = 0. > O. Dt. Our discussion uses the classical normal approximation to the binomial distribution. where (3(0.3 continued). Thus.7) + 1.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 231 Dual to the problem of not having enough power is that of having too much.645 x 0. Again using the nonnal approximation.282 x 0. we next show how to find the sample size that will "approximately" achieve desired power {3 for the size 0: test in the binomial example. and only if.35. to achieve approximate size a. Suppose 8 is a vector. Example 4.Oi) [nOo(1 . The power achievable (exactly. if Oi = . using the SPLUS package) for the level .3 requires approximately 163 observations to have probability . 00 = 0.90 of detecting the 17% increase in 8 from 0. They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is 0:. Such hypotheses are often rejected even though for practical purposes "the fit is good enough.1.4) and find the approximate critical value So ~ nOo 1 + 2 + z(l.05)2{1. when we test the hypothesis that a very large sample comes from a particular distribution. As a further example and precursor to Section 5. Bd. test for = . far from the hypothesis.2) shows that.86.05 binomial test of H : 8 = 0..1.3(0.4. Often there is a function q(B) such that H and K can be formulated as H : q(O) <: qo and K : q(O) > qo.3 to 0. we can have very great power for alternatives very close to O." The reason is that n is so large that unimportant small discrepancies are picked up.5).Section 4. we find (3(0) = PotS > so) = <I> ( [nO(l ~ 0)11/ 2 Now consider the indifference region (80 .6 (Example 4.
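The binomial sample-size and exact-power calculations discussed earlier in this section (the text quotes values obtained exactly with the S-PLUS package) can be reproduced with any modern package. Here is a sketch, assuming Python with scipy and not part of the text, of how the exact critical value, exact power, and required sample size of the one-sided binomial test are computed; no specific outputs are asserted.

```python
from scipy.stats import binom

def critical_value(n, theta0, alpha=0.05):
    """Smallest integer s0 with P_{theta0}(S >= s0) <= alpha for S ~ B(n, theta0)."""
    return int(binom.ppf(1 - alpha, n, theta0)) + 1

def exact_power(n, theta0, theta1, alpha=0.05):
    """Exact power at theta1 of the level-alpha test rejecting when S >= s0."""
    s0 = critical_value(n, theta0, alpha)
    return binom.sf(s0 - 1, n, theta1)

def smallest_n(theta0, theta1, alpha=0.05, beta=0.90, n_max=5000):
    """Smallest n whose exact power at theta1 is at least beta."""
    for n in range(1, n_max + 1):
        if exact_power(n, theta0, theta1, alpha) >= beta:
            return n
    raise ValueError("no n <= n_max achieves power beta")

# e.g. exact_power(163, 0.30, 0.35) or smallest_n(0.30, 0.35, alpha=0.05, beta=0.90)
```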
Implicit in this calculation is the assumption that POl [T > col is an increasing function ofn. Testing Precision Continued.Xf as in Example 2.Bo). 0 E 9.O)<O forB < 00 forB>B o.9.2 is (}'2 = ~ E~ 1 (Xi .= (10 versus K : 0.+ 00.(0) so that 9 0 = {O : ). Thus.1) 1(0.. Then the MLE of 0. we may consider l(O. for instance. and also increases to 1 for fixed () E 6 1 as n . a E A = {O. is the a percentile of X~l' It is evident from the argument of Example 4. I . is such that q( Oil = q. 00). The theory we have developed demonstrates that if C. (4. We have seen in Example 4.= 00 is now composite.0) > ° I(O.3 that this test is UMP for H : (1 > (10 versus K : 0.232 Testing and Confidence Regions Chapter 4 q. For instance. .7.. we illustrate what can happen with a simple example. 0 E 9. when testing H : 0 < 00 versus K : B > Bo. Suppose that in the Gaussian model of Example 4. Reducing the problem to choosing among such tests comes from invariance consideration that we do not enter into until Volume II.4) j j 1 . that are not 01. = {O : )..(0) = o} aild 9. 0) = (B .£ is unknown.4. the distribution of Tn ni:T 2/ (15 is X. is detennined by a onedimensional parameter ). . the critical value for testing H : 0. Suppose that j3(O) depends on () only through q( (J) and is a continuous increasing function of q( 0). However. = (tm. a).1 by taking q( 0) equal to the noncentrality parameter governing the distribution of the statistic under the alternative. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(O.2 only.. > qo be a value such that we want to have power fJ(O) at least fJ when q(O) > q..3. a reasonable class of loss functions are those that satisfy : 1 i j 1 1(0. I}. 0 = Complete Families of Tests The NeymanPearson framework is based on using the 01 loss function. Although H : 0. then rejecting for large values of Tn is UMP among all tests based on Tn.l' independent of IJ.5 that a particular test statistic can have a fixed distribution £0 under the hypothesis.(Tn ) is an MLR family. for 9. The set {O : qo < q(O) < ql} is our indjfference region. Example 4. It may also happen that the distribution of Tn as () ranges over 9.3. . to the F test of the linear model in Section 6. To achieve level a and power at least (3.l)l(O.2.3.3.< (10 among all tests depending on <. J.(0) > O} and Co (Tn) = C'(') (Tn) for all O..< (10 and rejecting H if Tn is small. This procedure can be applied.. first let Co be the smallest number c such that Then let n be the smallest integer such that P" IT > col > fJ where 00 is such that q( 00 ) = qo and 0.1. In general. For each n suppose we have a level a test for H versus I< based on a suitable test statistic T.
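The sample-size recipe described above (choose c0 as the smallest c whose null probability of {Tn > c0} is at most α, then take the smallest n whose power at the alternative is at least β) is easy to automate when the null and alternative laws of Tn are known. The following sketch, assuming Python with scipy and not part of the text, uses for concreteness the variance statistic Tn = nσ̂²/σ0² of the testing-precision example, in the direction where large values of Tn are evidence against H (H : σ ≤ σ0 versus K : σ > σ0).

```python
from scipy.stats import chi2

def smallest_n(sigma0, sigma1, alpha=0.05, beta=0.90, n_max=10000):
    """Smallest n so that the level-alpha test based on T_n = n * sigmahat^2 / sigma0^2
    (chi^2_{n-1} under sigma = sigma0, and (sigma1/sigma0)^2 times chi^2_{n-1} under
    sigma = sigma1) has power at least beta against sigma1 > sigma0."""
    ratio = (sigma1 / sigma0) ** 2
    for n in range(2, n_max + 1):
        c0 = chi2.ppf(1 - alpha, df=n - 1)     # smallest c with P_{sigma0}(T_n > c) <= alpha
        if chi2.sf(c0 / ratio, df=n - 1) >= beta:
            return n
    raise ValueError("no n <= n_max achieves power beta")

# e.g. smallest_n(sigma0=1.0, sigma1=1.5) -- the sigma values are illustrative.
```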
In the following the decision procedures arc test functions. 1") = (1(0.5) holds for all 8.I"(X) for 0 < 00 .o) < R(O.E.(x)) . if the model is correct and loss function is appropriate. is similarly UMP for H : 0 > 8 0 versus K : 0 < 00 (Problem 4. when UMP tests do not exist.(X) be such that.o.(Io.(2) a v R(O. 0 Summary.3.12) and. hence.3. Proof. the class of NP tests is complete in the sense that for loss functions other than the 01 loss function. Thus. Suppose {Po: 0 E e}. then any procedure not in the complete class can be matched or improved at all () by one in the complete class.{1(8. O))(E.4 CONFIDENCE BOUNDS.3. 0 < " < 1.(I"(X))) < 0 for 8 > 8 0 . a) satisfies (4. 1") for all 0 E e.4). then the class of tests of the form (4.(X) = ". the test that rejects H : 8 < 8 0 for large values of T(x) is UMP for K : 8 > 8 0 . = R(O.5) That is. AND REGIONS We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter {J i" a . we show that for MLR models.2.R(O.3. We consider models {Po : () E e} for which there exist tests that are most powerful for every () in a composite alternative t (UMP tests). is complete. 1) + [1 1"(X)]I(O. is UMP for H : 0 < 00 versus K: 8 > 8 0 by Theorem 4. Finally.3.(o.1 and. 1) 1(O.6) But 1 . E.(X)) = 1. Now 0.4 Confidence Bounds.O) + [1(0.O)} E.3.I"(X) 0 for allO then O=(X) clearly satisfies (4.Section 4.0. In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from H.(X) > 1.E. locally most powerful (LMP) tests in some cases can be found.(X) = E"I"(X) > O. For () real. INTERVALS. for some 00 • E"o. We also show how. (4. Thus. hence. Intervals. (4. For MLR models. it isn't worthwhile to look outside of complete classes. e c R. e 4.E. the risk of any procedure can be matched or improved by an NP test.3) with Eo. If E.3. (4. and Regions 233 if for any decision rule 'P The class D of decision procedures is said to be there exists E such that complete(I). is an MLR family in T( x) and suppose the loss function 1(0. 1) 1(0.O)]I"(X)} Let o.3.5).I") ~ = EO{I"(X)I(O.o.) .3. The risk function of any test rule 'P is R(O. Theorem 4. a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing ()o versus 8 1 is an increasing function of a statistic T( x) for every ()o < ()1.
3. 0... Finally.) = 1 .95..1. we want to find a such that the probability that the interval [X ..a) of being correct. Suppose that JL represents the mean increase in sleep among patients administered a drug.....fri < .a) confidence interval for ". X n ) to establish a lower bound . .a.234 Testing and Confidence Regions Chapter 4 member of a specified set 8 0 ...a.a)l. That is. N (/1.a)I. X + a] contains pis 1 . is a lower bound with P(.) = Ia.. This gives . Now we consider the problem of giving confidence bounds.fri. Then we can use the experimental outcome X = (Xl. in many situations where we want an indication of the accuracy of an estimator.Q with 1 .a)l..a) for a prescribed (1. That is. In our example this is achieved by writing By solving the inequality inside the probability for tt.(X) .(X) for" with a prescribed probability (1 . . In general. We find such an interval by noting .oz(1 ..a. it may not be possible for a bound or interval to achieve exactly probability (1 . and a solution is ii(X) = X + O"z(1 .fri < .2 ) example this means finding a statistic ii(X) snch that P(ii(X) > . PEP.a) such as . .a). In the nonBayesian framework. we find P(X .. As an illustration consider Example 4. as in (1..fri.oz(1 . . ( 2 ) with (72 known. .a.d..Q equal to .(X) and solving the inequality inside the probability for p. . In the N(".95 Or some other desired level of confidence. if v ~ v(P). is a constant.) and =1 a . and we look for a statistic ..4 where X I.±(X) ~ X ± oz (1..a) confidence bound for ". we may be interested in an upper bound on a parameter. 11 I • J .i.. or sets that constrain the parameter with prescribed probability 1 . In this case.! = X . we want both lower and upper bounds.(X) is a lower confidence bound with confidence level 1 . and X ~ P. intervals.. Here ii(X) is called an upper level (1 . Similarly.(Xj < .(X).+(X)] is a level (1 .) = 1 . . we settle for a probability at least (1 .. • j I where . is a parameter.!a)/.a. We say that .8). X E Rq.. We say that [j. 1 X n are i...(X) that satisfies P(.
Definition 4.4.1. A statistic ν(X) is called a level (1 − α) lower confidence bound for ν if for every P ∈ P,

P[ν(X) ≤ ν] ≥ 1 − α.

Similarly, ν̄(X) is called a level (1 − α) upper confidence bound for ν if for every P ∈ P,

P[ν̄(X) ≥ ν] ≥ 1 − α.

The random interval [ν(X), ν̄(X)] formed by the pair of statistics ν(X), ν̄(X) is a level (1 − α) or 100(1 − α)% confidence interval for ν if, for all P ∈ P,

P[ν(X) ≤ ν ≤ ν̄(X)] ≥ 1 − α.

The quantities on the left are called the probabilities of coverage and (1 − α) is called a confidence level. For a given bound or interval the confidence level is clearly not unique, because any number (1 − α′) ≤ (1 − α) is also a confidence level. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level, that is, the minimum probability of coverage. Note that in the case of intervals this is just inf{P[ν(X) ≤ ν ≤ ν̄(X)] : P ∈ P}. For the normal measurement problem we have just discussed, the probability of coverage is independent of P and equals the confidence coefficient.

Example 4.4.1. The (Student) t Interval and Bounds. Let X₁, ..., Xₙ be a sample from a N(μ, σ²) population, and assume initially that σ² is known. In the preceding discussion we used the fact that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution to obtain a confidence interval for μ by solving −z(1 − ½α) ≤ Z(μ) ≤ z(1 − ½α) for μ. In this process Z(μ) is called a pivot. In general, finding confidence intervals (or bounds) often involves finding appropriate pivots.

Now we turn to the σ² unknown case and propose the pivot T(μ) obtained by replacing σ in Z(μ) by its estimate s, where

s² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)².

To derive the distribution of T(μ) = √n(X̄ − μ)/s, note that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution and is, by Theorem B.3.3, independent of V = (n − 1)s²/σ², which has a χ²_{n−1} distribution. We conclude from the definition of the (Student) t distribution in Section B.3.1 that

T(μ) = Z(μ)/√(V/(n − 1)) = √n(X̄ − μ)/s

has the T_{n−1} distribution, whatever be μ and σ². Let t_k(p) denote the pth quantile of the T_k distribution. Then

P[−t_{n−1}(1 − ½α) ≤ T(μ) ≤ t_{n−1}(1 − ½α)] = 1 − α.

Solving the inequality inside the probability for μ, we find

P[X̄ − s t_{n−1}(1 − ½α)/√n ≤ μ ≤ X̄ + s t_{n−1}(1 − ½α)/√n] = 1 − α.   (4.4.1)

The shortest level (1 − α) confidence interval of the type [X̄ − c₁s/√n, X̄ + c₂s/√n] is (4.4.1). Similarly,

X̄ − s t_{n−1}(1 − α)/√n  and  X̄ + s t_{n−1}(1 − α)/√n   (4.4.2)

are natural lower and upper confidence bounds with confidence coefficients (1 − α).

To calculate the coefficients t_{n−1}(1 − ½α) and t_{n−1}(1 − α), we use a calculator, computer software, or Tables I and II. For instance, if n = 9 and α = 0.01, we enter Table II to find that the probability that a T₈ variable exceeds 3.355 is .005. Thus, t₈(0.995) = 3.355 and [X̄ − 3.355 s/3, X̄ + 3.355 s/3] is the desired level 0.99 confidence interval. From the results of Section B.7, we see that as n → ∞ the T_{n−1} distribution converges in law to the standard normal distribution; for the usual values of α, we can reasonably replace t_{n−1}(p) by the standard normal quantile z(p) for n > 120.

Up to this point, we have assumed that X₁, ..., Xₙ are i.i.d. N(μ, σ²). The properties of confidence intervals such as (4.4.1) in non-Gaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5. It turns out that the distribution of the pivot T(μ) is fairly close to the T_{n−1} distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal; in this case the interval (4.4.1) has confidence coefficient close to 1 − α. On the other hand, for very skew distributions such as the χ² with few degrees of freedom, or very heavy-tailed distributions such as the Cauchy, the confidence coefficient of (4.4.1) can differ considerably from 1 − α (see Chapter 5). If we assume only σ² < ∞, the interval will have probability of coverage (1 − α) in the limit as n → ∞.

Example 4.4.2. Confidence Intervals and Bounds for the Variance of a Normal Distribution. Suppose that X₁, ..., Xₙ is a sample from a N(μ, σ²) population. By Theorem B.3.3, V(σ²) = (n − 1)s²/σ² has a χ²_{n−1} distribution and can be used as a pivot. If we let x_{n−1}(p) denote the pth quantile of the χ²_{n−1} distribution, and if α₁ + α₂ = α, then

P(x(α₁) < V(σ²) < x(1 − α₂)) = 1 − α.

By solving the inequality inside the probability for σ², we find that [(n − 1)s²/x(1 − α₂), (n − 1)s²/x(α₁)] is a confidence interval with confidence coefficient (1 − α).

The length of this interval is random. There is a unique choice of α₁ and α₂ that uniformly minimizes expected length among all intervals of this type; it may be shown that for n large, taking α₁ = α₂ = ½α is not far from optimal (Tate and Klett, 1959). The pivot V(σ²) similarly yields the respective lower and upper confidence bounds (n − 1)s²/x(1 − α) and (n − 1)s²/x(α).

In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for σ² do not have confidence coefficient 1 − α even in the limit as n → ∞. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution, which typically is unknown. In the problems we give an interval with correct limiting coverage probability.

The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in n Bernoulli Trials. If X₁, ..., Xₙ are the indicators of n Bernoulli trials with probability of success θ, then X̄ is the MLE of θ. There is no natural "exact" pivot based on X̄ and θ. However, by the De Moivre–Laplace theorem, √n(X̄ − θ)/√(θ(1 − θ)) has approximately a N(0, 1) distribution. If we use this function as an "approximate" pivot and let ≈ denote "approximate equality," we can write

P[|√n(X̄ − θ)/√(θ(1 − θ))| ≤ z(1 − ½α)] ≈ 1 − α.

Let k_α = z(1 − ½α) and observe that this is equivalent to

P[g(θ, X̄) ≤ 0] ≈ 1 − α,  where  g(θ, X̄) = (1 + k²_α/n)θ² − (2X̄ + k²_α/n)θ + X̄².

For fixed 0 < X̄ < 1, g(θ, X̄) is a quadratic polynomial with two real roots. Because the coefficient of θ² in g(θ, X̄) is greater than zero, {θ : g(θ, X̄) ≤ 0} = [θ(X), θ̄(X)]. In terms of S = nX̄, the roots are

θ(X) = {S + ½k²_α − k_α√([S(n − S)/n] + k²_α/4)} / (n + k²_α)   (4.4.3)

θ̄(X) = {S + ½k²_α + k_α√([S(n − S)/n] + k²_α/4)} / (n + k²_α)   (4.4.4)
238 Testing and Confidence Regions Chapter 4 so that [O(X). n(l .4) has length 0. calls willingness to buy success. .4. we n .975.O > 8 1 > ~.(1.02 by choosing n so that Z = n~ ( 1.02.~a = 0.a) procedure developed in Section 4. Cai.S)ln ~ to conclude that tn . .16. to bound I above by 1 0 choose .5) is used and it is only good when 8 is near 1/2. ka. See Problem 4.8) is at least 6.4.6) Thus. For instance.a) confidence interval for O. and Das Gupta (2000). [n this case.X). we choose n so that ka. Note that in this example we can detennine the sample size needed for desired accuracy.~a) ko 2 kcc i 1. of the interval is I ~ 2ko { JIS(n . Cai.96. A discussion is given in Brown. Better results can be obtained if one has upper or lower bounds on 8 such as 8 < 00 < ~.96) 2 = 9. He draws a sample of n potential customers.4. = T.02 and is a confidence interval with confidence coefficient approximately 0.: iI j . (1 .02 ) 2 . This fonnula for the sample size is very crude because (4. These inlervals and bounds are satisfactory in practice for the usual levels. and we can achieve the desired .95.2a) interval are approximate upper and lower level (1 .4. = length 0.0)1 JX(1 .96 0.600.. or n = 9. Another approximate pivot for this example is yIn(X .5) (4. note that the length. O(X)] is an approximate level (1 . i o 1 [.4.S)ln] Now use the fact that + kzJ4}(n + k~)l (S . it is better to use the exact level (1 .7) See Brown. This leads to the simple interval 1 I (4.~n)2 < S(n .601.4.a) confidence bounds. For small n. i " .. say I. and uses the preceding model. = 0. 1 .n 1 tn (4. To see this.5. He can then detennine how many customers should be sampled so that (4. if the smaller of nO. We can similarly show that the endpoints of the level (1 .4. consider the market researcher whose interest is the proportion B of a population that will buy a product.(n + k~)~ = 10 . and Das Gupta (2000) for a discussion. That is.
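The roots (4.4.3)–(4.4.4) and the crude sample-size bound just described are easy to compute. Below is a minimal sketch; the use of scipy's normal quantile and the particular numerical values are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def approx_binomial_interval(S, n, alpha=0.05):
    """Approximate level 1-alpha interval (4.4.3)-(4.4.4) for the success
    probability theta, based on S successes in n Bernoulli trials."""
    k = norm.ppf(1 - alpha / 2)
    center = S + k**2 / 2
    half = k * np.sqrt(S * (n - S) / n + k**2 / 4)
    return (center - half) / (n + k**2), (center + half) / (n + k**2)

print(approx_binomial_interval(S=30, n=100))

# Crude sample size so the interval has length at most l = 0.02:
# the length is at most k/sqrt(n + k^2), so n >= (k/l)^2 - k^2 suffices.
k, l = norm.ppf(0.975), 0.02
print(int(np.ceil((k / l) ** 2 - k**2)))   # about 9,600 (9,601 with k rounded to 1.96)
```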
. 2nX/0 has a chi~square. Then the rdimensional random rectangle I(X) = {q(O). Example 4.). (X) ~ [q. then exp{ x/O} edenote the lower < q(O) < exp{ x/i)} is a confidence interval for q( 0) with confidence coefficient (1 . ii. Suppose q (X) and ii) (X) are realJ valued. . we will later give confidence regions C(X) for pairs 8 = (0 1 . Here q(O) = 1 . . distribution.a.. X~n' distribution.4. . q(C(X)) is larger than the confidence set obtained by focusing on B1 alone. . then q(C(X)) ~ {q(O) . then the rectangle I( X) Ic(X) has level = h (X) x .(X) < q..0') confidence region for q(O).. this technique is typically wasteful. Confidence Regions of Higher Dimension We can extend the notion of a confidence interval for onedimensional functions q( B) to rdimensional vectors q( 0) = (q.~a) < 0 <2nX/x (~a) where x(j3) denotes the 13th quantile of the X~n distribution.. For instance... . j ~ I. x IT(1a.4. qc (0))..4.. . Let 0 and and upper boundaries of this interval. qc( 0)) is at least (I .(X). 0 E C(X)} is a level (1..(O) < ii. ( 2 ) T. if the probability that it covers the unknown but fixed true (ql (0). Intervals.a).1 ).F(x) = exp{ x/O}. if q( B) = 01 .a). If q is not 1 ..) confidence interval for qj (0) and if the ) pairs (T l ' 1'. . . .. That is. Suppose X I.a) confidence interval 2nX/x (I .).4.1. j==1 (4. (0).r} is said to be a level (1 ... By using 2nX /0 as a pivot we find the (1 . . and Regions 239 Confidence Regions for Functions of Parameters We can define confidence regions for a function q(O) as random subsets of the range of q that cover the true value of q(O) with probability at least (1 .8) .3. (T c' Tc ) are independent. I is a level (I .0').X n denote the number of hours a sample of internet subscribers spend per week on the Internet. . In this case.. £(8. By Problem B.Section 44 Confidence Bounds.a Note that if I. We write this as P[q(O) E I(X)I > 1. ~.a). Note that ifC(X) is a level (1 ~ 0') confidence region for 0. we can find confidence regions for q( 0) entirely contained in q( C(X)) with confidence level (1 . and suppose we want a confidence interval for the population proportion P(X ? x) of subscribers that spend at least x hours per week on the Internet. Let Xl. X n is modeled as a sample from an exponential.a) confidence region.
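The exponential-model pivot used above, 2nX̄/θ, is also simple to use numerically. The following sketch computes the interval for θ and for the tail probability q(θ) = P(X ≥ x₀); the simulated "hours online" data, the equal-tails split, and the scipy calls are assumptions of the illustration.

```python
import numpy as np
from scipy.stats import chi2

def exp_tail_prob_interval(x, x0, alpha=0.05):
    """Level 1-alpha interval for theta and for q(theta) = P(X >= x0) = exp(-x0/theta)
    in the exponential model, using the pivot 2*sum(X_i)/theta ~ chi^2_{2n}."""
    x = np.asarray(x)
    n, total = len(x), x.sum()
    th_lo = 2 * total / chi2.ppf(1 - alpha / 2, df=2 * n)
    th_hi = 2 * total / chi2.ppf(alpha / 2, df=2 * n)
    # q is increasing in theta, so the interval for q maps over monotonely
    return (th_lo, th_hi), (np.exp(-x0 / th_lo), np.exp(-x0 / th_hi))

rng = np.random.default_rng(10)
hours = rng.exponential(scale=8.0, size=25)   # hypothetical weekly hours on the Internet
print(exp_tail_prob_interval(hours, x0=10.0))
```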
4.5. if we choose 0' j = 1 . ~ ~'f Dn(F) = sup IF(t) .1) the distribution of . Example 4. According to this inequality. F(t) .a) r.240 Testing and Confidence Regions Chapter 4 Thus.. . 0 The method of pivots can also be applied to oodimensional parameters such as F. as X rv P.(1 .a) confidence rectangle for (/" .a. for each t E R. ..Laj.4.1.d.a = set of P with P( 00. Then by solving Dn(F) < do for F.2.2).~a)/rn !a). 1 X n is a N (M.6.1 h(X) = X ± stn_l (1.. An approach that works even if the I j are not independent is to use Bonferroni's inequality (A. ( 2 ) sample and we want a confidence rectangle for (Ii.2. We assume that F is continuous. See Problem 4.ia)' Xnl (lo) is a reasonable confidence interval for (72 with confidence coefficient (1. v(P) = F(·).4. ... That is. c c P[q(O) E I(X)] > 1. we find that a simultaneous in t size 1.0:. From Example 4. if we choose a J = air..~a). . Confidence Rectangle for the Parameters ofa Normal Distribution. h (X) X 12 (X) is a level (1 . is a confidence interval for J1 with confidence coefficient (1  I (X) _ [(nl)S2 (nl)S2] 2 Xnl (1 . consists of the interval i J C(x)(t) = (max{O. D n (F) is a pivot.7). Let dO' be chosen such that PF(Dn(F) < do) = 1 .i.15. Example 4. and we are interested in the distribution function F(t) = P(X < t).1. Suppose Xl> .0 confidence region C(x)(·) is the confidence band which. then leX) has confidence level 1 . then I(X) has confidence level (1 . j = 1. in which case (Proposition 4. From Example 4. (7 2).40) 2..LP[qj(O) '" Ij(X)1 > 1. an rdimensional confidence rectangle is in this case automaticalll obtained from the onedimensional intervals. F(t) + do})· We have shown ~ ~ I o P(C(X)(t) :) F(t) for all t E R) for all P E 'P =1. .4.. Thus. . min{l.do}. It is possible to show that the exact confidence coefficient is (1 . j=1 j=1 Thus. Suppose Xl. . r.a). Moreover. tl continuous in t.4. X n are i.F(t)1 tEn i " does not depend on F and is known (Example 4.5). that is.
and more generally confidence regions. Then a (1 .19.Section 4.0: confidence region for a parameter q(B) is a set C(x) depending only on the data x such that the probability under Pe that C(X) covers q( 8) is at least 1 .!(F) : F E C(X)) ~ J. Suppose that an established theory postulates the value Po for a certain physical constant. In a parametric model {Pe : BEe}. . which is zero for t < 0 and nonzero for t > O.4. subsets of the sample space with probability of accepting H at least 1 .0: for all E e.4.4.! ~ /L(F) = f o tf(t)dt = exists. o 4. 0 Summary. We define lower and upper confidence bounds (LCBs and DCBs). is /1.! ~ inf{J.6. if J.5. for a given hypothesis H. .4. A Lower Confidence Boundfor the Mean of a Nonnegative Random Variable.. 1WoSided Tests for the Mean of a Normal Distribution.18) arise in accounting practice (see Bickel.9)   because for C(X) as in Example 4. Acceptance regions of statistical tests are. we similarly require P(C(X) :J v) > 1.4. Intervals for the case F supported on an interval (see Problem 4. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses. given by (4.0: when H is true.0') lower confidence bound for /1.4 and 4. Example 4. We derive the (Student) t interval for I' in the N(Il.7.4. J.!(F+) and sUP{J. X n are i. Suppose Xl.d. confidence intervals. a level 1 .4. 1992) where such bounds are discussed and shown to be asymptotically strictly conservative. By integration by parts. For a nonparametric class P ~ {P} and parameter v ~ v(P).!(F)   ~ oosee Problem 4.0'. A scientist has reasons to believe that the theory is incorrect and measures the constant n times obtaining . as X and that X has a density f( t) = F' (t).!(F) : F E C(X)} = J.5 to give confidence regions for scalar or vector parameters in nonparametric models." for all PEP..6.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least 1 .5 The Duality Between Confidence Regions and Tests 241 We can apply the notions studied in Examples 4. then Let F(t) and F+(t) be the lower and upper simultaneous confidence boundaries of Example 4. and we derive an exact confidence interval for the binomial parameter. In a nonparametric setting we derive a simultaneous confidence interval for the distribution function F( t) and the mean of a positive variable X. a 2 ) model with a 2 unknown.1. Example 4.i.4. We begin by illustrating the duality in the following example.
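The simultaneous confidence band of Example 4.4.6 (used again in Example 4.4.7 above) requires the critical value d_α of D_n(F). As a hedged sketch, the code below uses the conservative Dvoretzky–Kiefer–Wolfowitz choice d = √(log(2/α)/(2n)), which guarantees coverage at least 1 − α but is slightly larger than the exact Kolmogorov–Smirnov value referred to in the text; the simulated data are illustrative.

```python
import numpy as np

def simultaneous_band(x, alpha=0.05):
    """Simultaneous level 1-alpha band for F based on the empirical cdf:
    [max(Fhat - d, 0), min(Fhat + d, 1)] with d from the DKW inequality."""
    x = np.sort(np.asarray(x))
    n = len(x)
    d = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    fhat = np.arange(1, n + 1) / n            # empirical cdf at the order statistics
    lower = np.clip(fhat - d, 0.0, 1.0)
    upper = np.clip(fhat + d, 0.0, 1.0)
    return x, lower, upper

rng = np.random.default_rng(1)
pts, lo, hi = simultaneous_band(rng.exponential(scale=2.0, size=50), alpha=0.05)
print(lo[:3], hi[:3])
```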
the postulated value JLo is a member of the level (1 0' ) confidence interval [X  Slnl (1 ~a) /vn. /l) ~ O. /l) to equal 1 if.5. 00) X (0.4. generated a family of level a tests {J(X. if and only if..~a).a) confidence region for v if the probability that S(X) contains v is at least (1 .5.a. Jlo) being of size a only for the hypothesis H : Jl = flo· Conversely.1 (1 . /l)} where lifvnIX.Q) confidence interval (4. (4. .2(p) takes values in N = (0..1 (1 .I n _ 1 (1.00). In contrast to the tests of Example 4.fLo)/ s. in Example 4. in Example 4.I n .5. generating a family of level a tests. if we start out with (say) the level (1 .4.. Because p"IITI = I n. .~1 >tn_l(l~a) ootherwise. Because the same interval (4.4. in fact. Let v = v{P) be a parameter that takes values in the set N. /l = /l(P) takes values in N = (00. and only if.00).~a)] = 0 the test is equivalently characterized by rejecting H when ITI > tnl (1 . and in Example 4.242 Testing and Confidence Regions Chapter 4 measurements Xl . For instance. (4.1) we constructed for Jl as follows. by starting with the test (4.2) takes values in N = (00 . X + Slnl (1  ~a) /vn]. t n . We can base a size Q' test on the level (1 .1) by finding the set of /l where J(X.2) These tests correspond to different hypotheses..1) is used for every flo we see that we have..00). Xl!_ Knowledge of his instruments leads him to assume that the Xi are independent and identically distributed normal random variables with mean {L and variance a 2 . 1. For a function space example. if and only if.a) LCB X . all PEP. then our test accepts H. If any value of J. We accept H. Evidently.~Ct). O{X.4.a).a)s/vn > /l.1 (1. Consider the general framework where the random vector X takes values in the sample space X C Rq and X has distribution PEP.a)s/ vn and define J'(X.5. where F is the distribution function of Xi' Here an example of N is the class of all continuous distribution functions. (/" . This test is called twosided because it rejects for both large and small values of the statistic T. then S is a (I .6.. i:. o These are examples of a general phenomenon.2) we obtain the confidence interval (4. it has power against parameter values on either side of flo..flO.5. that is P[v E S(X)] > 1 . consider v(P) = F.!a) < T < I n. .4.4. Let S = S{X) be a map from X to subsets of N.5'[) If we let T = vn(X . = 1.l other than flo is a possible alternative.a. • j n . X . then it is reasonable to formulate the problem as that of testing H : fl = Jlo versus K : fl i..1.2.l (1. as in Example 4. We achieve a similar effect.2 = ..
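The duality can be exploited mechanically: sweep over candidate values ν₀ and keep those whose level α test accepts. A minimal sketch for the two-sided z-test of this example follows; the grid resolution and the simulated data are assumptions of the sketch, and the algebraic inversion of course gives the same interval in closed form.

```python
import numpy as np
from scipy.stats import norm

def invert_tests(x, sigma, alpha=0.05, grid=None):
    """Confidence set S(x) = {mu0 : the level-alpha two-sided z-test accepts H: mu = mu0}."""
    n, xbar = len(x), np.mean(x)
    z = norm.ppf(1 - alpha / 2)
    if grid is None:
        grid = np.linspace(xbar - 5 * sigma, xbar + 5 * sigma, 2001)
    accept = np.abs(np.sqrt(n) * (xbar - grid) / sigma) <= z
    return grid[accept]

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.0, size=25)
S = invert_tests(x, sigma=1.0)
print(S.min(), S.max())   # matches xbar -+ sigma * z(1 - alpha/2) / sqrt(n)
```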
Next consider the testing framework where we test the hypothesis H = H_{ν₀} : ν = ν₀ for some specified value ν₀. Formally, let P_{ν₀} = {P : ν(P) = ν₀}, ν₀ ∈ N. Suppose we have a test δ(X, ν₀) with level α. Then the acceptance region

A(ν₀) = {x : δ(x, ν₀) = 0}

is a subset of X with probability at least 1 − α. Consider the set of ν₀ for which H_{ν₀} is accepted,

S(X) = {ν₀ ∈ N : X ∈ A(ν₀)}.

This is a random set contained in N with probability at least 1 − α of containing the true value of ν(P), whatever be P. Conversely, if S(X) is a level 1 − α confidence region for ν, then the test that accepts H_{ν₀} if and only if ν₀ is in S(X) is a level α test for H_{ν₀}. For some specified ν₀, H may be accepted; for other specified ν₀, H may be rejected. We have the following.

Duality Theorem. Let S(X) = {ν₀ ∈ N : X ∈ A(ν₀)}; then P[X ∈ A(ν₀)] ≥ 1 − α for all P ∈ P_{ν₀} if and only if S(X) is a 1 − α confidence region for ν.

We next apply the duality theorem to MLR families.

Theorem 4.5.1. Suppose X ~ P_θ, where {P_θ : θ ∈ Θ}, Θ ⊂ R, is MLR in T = T(X), and suppose that the distribution function F_θ(t) of T under P_θ is continuous in each of the variables t and θ when the other is fixed.

(i) If the equation F_θ(t) = 1 − α has a solution θ_α(t) in Θ, then θ_α(T) is a lower confidence bound for θ with confidence coefficient 1 − α.

(ii) Similarly, any solution θ̄_α(T) of F_θ(T) = α is an upper confidence bound for θ with confidence coefficient 1 − α.

(iii) Moreover, if α₁ + α₂ < 1, then [θ_{α₁}, θ̄_{α₂}] is a confidence interval for θ with confidence coefficient 1 − (α₁ + α₂).

Proof. By the theory of Section 4.3, the acceptance region of the UMP size α test of H : θ = θ₀ versus K : θ > θ₀ can be written

A(θ₀) = {x : T(x) ≤ t_{θ₀}(1 − α)},

where t_{θ₀}(1 − α) is the 1 − α quantile of F_{θ₀}. Moreover, the power function P_θ[T ≥ t] = 1 − F_θ(t) of a test with critical constant t is increasing in θ; that is, F_θ(t) is decreasing in θ. By the duality theorem, if S(t) = {θ ∈ Θ : t ≤ t_θ(1 − α)}, then S(T) is a 1 − α confidence region for θ. By applying F_θ to both sides of t ≤ t_θ(1 − α) we find

S(t) = {θ ∈ Θ : F_θ(t) ≤ 1 − α}.

Because F_θ(t) is decreasing in θ, it follows that F_θ(t) ≤ 1 − α if and only if θ ≥ θ_α(t), and hence S(t) = [θ_α(t), ∞) ∩ Θ. The proofs for the upper confidence bound and interval follow by the same type of argument.

We next give connections between confidence bounds, acceptance regions, and p-values for MLR families. Let t denote the observed value t = T(x) of T(X) for the datum x.
The result follows. v) <a} = {(t.a) confidence region is given by C(X t.a)] > a} = [~(t). vol denote the pvalue of a test6(T. a).v) : J(t. The corresponding level (1 ..2.5.Fo(t) is decreasing in t. . : : !.a)ifBiBo. B) pnints.Fo(t) is increasing in O. For a E (0.1).5. .~a)/ v'n}. cr 2 ) with 0. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials.a) is nondecreasing in B. In the (t.v) : a(t.3.1 that 1 . • . and A"(B) = T(A(B)) = {T(x) : x Corollary 4.oo). a confidence region S( to). 1 .1.1). v) = O}. We have seen in the proof of Theorem 4. I c= {(t. a) denote the critical constant of a level (1.. In general.d. B) ~ poeT > t) = 1 .a)~k(Bo.~) upper and lower confidence bounds and confidence intervals for B. . . B) plane.I}. We shall use some of the results derived in Example 4..3.1 shows the set C. I X n are U.Xn ) = {B : S < k(B. The pvalue is {t: a(t. We call C the set of compatible (t. N (IL. Because D Fe{t) is a distribution function. then C = {(t.to(l.80 E (0. We illustrate these ideas using the example of testing H : fL = J. Let T = X. Let k(Bo.a) test of H.. 00 ) denote the pvalue for the UMP size Q' test of H : () let = eo versus K : () > eo. Under the conditions a/Theorem 4.5.10 when X I .B) > a} {B: a(t.v) where. We claim that (i) k(B. E A(B)). v) =O} gives the pairs (t. vo) = 1 IT > cJ of H : v ~ Vo based on a statistic T = T(X) with observed value t = T(x). Let Xl. '1 Example 4. . where S = Ef I Xi_ To analyze the structure of the region we need to examine k(fJ.p): It pI < <7Z (1.1. A"(B) Set) Proof. a) .4. _1 X n be the indicators of n binomial trials with probability of success 6.B) = (oo. vertical sections of C are the confidence regions B( t) whereas borizontal sections are the acceptance regions A" (v) = {t : J( t. Then the set . .2 known.1. (ii) k(B. let a( t.1 I. Figure 4. • a(t. we seek reasonable exact level (1 . v will be accepted.Fo(t). and for the given v. t is in the acceptance region.1 244 Testing and Confidence Regions Chapter 4 1 o:(t. and an acceptance set A" (po) for this example. for the given t. . To find a lower confidence bound for B our preceding discussion leads us to consider level a tests for H : 6 < 00.
S(to) is a confidence interval for 11 for a given value to of T. and. (ii). To prove (i) note that it was shown in Theorem 4.[S > k(02.[S > j] < Q for all 0 < 00 O(S) = inf{O: k(O.1 (i) that PetS > j] is nondecreasing in () for fixed j.[S > k(02. Clearly. [S > j] = Q and j = k(Oo. a) increases by exactly 1 at its points of discontinuity. (iii). Q). Then P. if 0 > 00 .Q) I] > Po.5 The Duality Between Confidence Regions and Tests 245 Figure 4.[S > k(O"Q) I] > Q.Section 4. Po.Q) = S + I}. Q) as 0 tOo. whereas A'" (110) is the acceptance region for Hp. and (iv) we see that. [S > j] < Q. Q) would imply that Q > Po. Po. The claims (iii) and (iv) are left as exercises. e < e and 1 2 k(O" Q) > k(02. The assertion (ii) is a consequence of the following remarks. The shaded region is the compatibility set C for the twosided test of Hp. On the other hand.Q) ~ I andk(I.o' (iii) k(fJ.Q)] > Po.I] [0. Therefore. Therefore.5. From (i). if we define a contradiction.Q) = n+ 1. Q). it is also nonincreasing in j for fixed e. [S > j] > Q. I] if S ifS >0 =0 . then C(X) ={ (O(S).L = {to in the normal model. If fJo is a discontinuity point 0 k(O. hence.o : J.3. (iv) k(O. let j be the limit of k(O.1. P.
O(S) I oflevel (1. 0.Q) LCB for 0(2) Figure 4.Q) DCB for 0 and when S < n. . therefore.16) 3 I 2 11. Similarly.5. O(S) together we get the confidence interval [8(S). when S > 0.3 0.. From our discussion. Then 0(S) is a level (1 .2. j '~ I • I • .2Q).Q)=Sl} where j (0. j Figure 4.4. As might be expected. . O(S) is the unique solution of 1 ~( ~ s ) or(1 _ 8)nr = Q. O(S) = I. When S 0.1 d I J f o o 0.1 0.5. we define ~ O(S)~sup{O:j(O. 1 _ . . 8(S) = O.2 portrays the situatiou. we find O(S) as the unique solution of the equation. then k(O(S). i • . these bounds and intervals differ little from those obtained by the first approximate method in Example 4. if n is large. These intervals can be obtained from computer packages that use algorithms based on the preceding considerations.3.0.5 I i .2 0. When S = n. .Q) = S and.246 Testing and Confidence Regions Chapter 4 and O(S) is the desired level (I .4 0. 4 k(8. Plot of k(8.16) for n = 2. Q) is given by. I • i I . Putting the bounds O(S). .I.
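The equations defining θ(S) and θ̄(S) can be solved with a root finder rather than special tables. Below is a sketch; the choice of scipy's binom.cdf and brentq is an implementation assumption, not part of the text.

```python
import numpy as np
from scipy.stats import binom
from scipy.optimize import brentq

def exact_binomial_bounds(S, n, alpha=0.05):
    """Level 1-alpha lower bound theta(S) and upper bound thetabar(S), solving
    P_theta(X >= S) = alpha and P_theta(X <= S) = alpha respectively;
    together they give a level 1 - 2*alpha interval."""
    eps = 1e-12
    if S == 0:
        lower = 0.0
    else:
        lower = brentq(lambda th: 1.0 - binom.cdf(S - 1, n, th) - alpha, eps, 1 - eps)
    if S == n:
        upper = 1.0
    else:
        upper = brentq(lambda th: binom.cdf(S, n, th) - alpha, eps, 1 - eps)
    return lower, upper

print(exact_binomial_bounds(S=3, n=20, alpha=0.05))
```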
and 3. it is natural to carry the comparison of A and B further by asking whether 8 < 0 or B > O. are given to high blood pressure patients. Then the wrong decision "'1 < !to" is made when T < z(1 . Example 4. Using this interval and (4.2 ) with u 2 known.13.. suppose B is the expected difference in blood pressure when two treatments. The problem of deciding whether B = 80 . the probability of falsely claiming significance of either 8 < 80 or 0 > 80 is bounded above by ~Q. For this threedecision rule. Suppose Xl. If we decide B < 0. by using this kind of procedure in a comparison or selection problem. A and B. but if H is rejected.!a). we can control the probabilities of a wrong selection by setting the 0' of the parent test or confidence interval.2.5 The Duality Between Confidence Regions and Tests 247 Applications of Confidence Intervals to Comparisons and Selections We have seen that confidence intervals lead naturally to twosided tests. the probability of the wrong decision is at most ~Q. we make no claims of significance. and vice versa.B .3. . v = va..Section 4. N(/l.3) 3. the twosided test can be regarded as the first step in the decision procedure where if H is not rejected.8 < B • or B > 80 is an example of a threeo decision problem and is a special case of the decision problems in Section 1.!a).~Q). o (4.5. However. then we select A as the better treatment.~Q) / .4.i.~a). we test H : B = 0 versus K : 8 i. If J(x. vo) is a level 0' test of H . In Section 4. we decide whether this is because () is smaller or larger than Bo. This event has probability Similarly. Make no judgment as 1O whether 0 < 80 or 8 > B if I contains B . To see this consider first the case 8 > 00 . Here we consider the simple solution suggested by the level (1 . Decide I' > I'c ifT > z(1 . Because we do not know whether A or B is to be preferred. We can use the twosided tests and confidence intervals introduced in later chapters in similar fashions.4 we considered the level (1 . If H is rejected. we usually want to know whether H : () > B or H : B < 80 .3) we obtain the following three decision rule based on T = J1i(X  1'0)/": Do not reject H : /' ~ I'c if ITI < z(1 .. and o Decide 8 > 80 if I is entirely to the right of B . twosided tests seem incomplete in the sense that if H : B = B is rejected in favor of o H : () i. o o Decide B < 80 if! is entirely to the left of B . Summary. Decide I' < 1'0 ifT < z(l. We explore the connection between tests of statistical hypotheses and confidence regions. 2. when !t < !to.Jii for !t. then the set S(x) of Vo where .d.Q) confidence interval X ± uz(1 .O.5.5. Thus. o o For instance.0') confidence interval!: 1. I X n are i. 0. Therefore.
Note that 0* is a unifonnly most accurate level (1 . Thus.n.I (X) is /L more accurate than .1) for all competitors are called uniformly most accurate as are upper confidence bounds satisfying (4.Q)er/.6. . A level (1.Q) UCB for B. or larger than eo.0') lower confidence bounds for (J. which is connected to the power of the associated onesided tests.IB' (X) I • • < B'] < P.6.u2(X) and is.2. (X) = X . we find that a competing lower confidence bound is 1'2(X) = X(k). for any fixed B and all B' < B.. If (J and (J* are two competing level (1 . less than 80 . v = Va when lIa E 8(:1') is a level 0: test. . Using Problem 4. j i I .2) for all competitors.Q). P. A level a test of H : /L = /La vs K .6. Ii n II r . We next show that for a certain notion of accuracy of confidence bounds.2 and 4.6.3.n(X . (4. . and only if. < X(n) denotes the ordered Xl. We also give a connection between confidence intervals... a level (1 any fixed B and all B' > B.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account.[B(X) < B']. (4.1) is nothing more than a comparison of (X)wer functions. Which lower bound is more accurate? It does tum out that . 0 I i.Q for a binomial..Q) con· fidence region for v..Q) LCB B if. .1 continued).6.~'(X) < B'] < P.Ilo)/er > z(1 . and the threedecision problem of deciding whether a parameter 1 l e is eo. where 80 is a specified value. they are both very likely to fall below the true (J.z(1 . then the lest that accepts H .2) Lower confidence bounds e* satisfying (4. P. 0"2) random variables with (72 known. 4.5. e. which reveals that (4.1 (Examples 3.. If S(x) is a level (1 . But we also want the bounds to be close to (J.. Suppose X = (X" . Formally. I' > 1'0 rejects H when .248 Testing and Confidence Regions Chapter 4 0("'.a) LCB 0* of (J is said to be more accurate than a competing level (1 .0) confidence region for va.6. random variable S.1) Similarly. for X E X C Rq.a) LCB for if and only if 0* is a unifonnly most accurate level (1 .6.. unifonnly most accurate in the N(/L.1 t! Example 4.. B(n. (72) model. This is a consequence of the following theorem.[B(X) < B']. I' i' . We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution.Xn ) is a sample of a N (/L.1.Xn and k is defined by P(S > k) = 1 . The dual lower confidence bound is 1'.6. where X(1) < X(2) < . va) ~ 0 is a level (1 . optimality of the tests translates into accuracy of the bounds.if. ~). and only if. in fact. we say that the bound with the smaller probability of being far below () is more accurate. for 0') UCB e is more accurate than a competitor e . the following is true. . twosided tests. Definition 4.' .
Suppose ~'(X) is UMA level (1 . we find that j.0') LeB for (J. for any other level (1 .z(1 a)a/ JTi is uniformly most accurate. such that for each (Jo the associated test whose critical function o*(x. O(x) < 00 . Also.a) upper confidence bound q* for q( A) = 1 . Example 4.6.2.Because O'(X. Let f)* be a level (1 . Let O(X) be any other (1 .1. then for all 0 where a+ = a.a) Lea for q( 0) if.a) LeB 00. We define q* to be a uniformly most accurate level (1 .5. We want a unifonnly most accurate level (1 . Most accurate upper confidence bounds are defined similarly. Proof Let 0 be a competing level (1 .1 to Example 4. if a > 0.e>. 00 ) is a level a test for H : 0 = 00 versus K : 0 > 00 .a). Uniformly most accurate (UMA) bounds turn out to have related nice properties. . the probability of early failure of a piece of equipment.6. and only if. Then!l* is uniformly most accurate at level (1 . .7 for the proof). Then O(X. X(k) does have the advantage that we don't have to know a or even the shape of the density f of Xi to apply it. Defined 0(x. Identify 00 with {}/ and 01 with (J in the statement of Definition 4. .(o(X. 00 ) ~ 0 if. they have the smallest expected "distance" to 0: Corollary 4. Boundsforthe Probability ofEarly Failure ofEquipment. [O'(X) > 001. (O'(X.Section 4. for e 1 > (Jo we must have Ee.2. 00 ) is UMP level Q' for H : (J = eo versus K : 0 > 00 .•.Oo) 1 ifO'(x) > 00 ootherwise is UMP level a for H : (J = eo versus K : (J > 00 .6. However. We can extend the notion of accuracy to confidence bounds for realvalued functions of an arbitrary parameter. and 0 otherwise. a real parameter.6. 00 ) by o(x.a) lower confidence bound.Oo)) < Eo.2 and the result follows.t o .6.4. Let XI. For instance (see Problem 4. the robustness considerations of Section 3.5 favor X{k) (see Example 3. (Jo) is given by o'(x.1. [O(X) > 001 < Pe. 1 X n be the times to failure of n pieces of equipment where we assume that the Xi are independent £(A) variables. 00 )) or Pe. Pe[f < q(O')1 < Pe[q < q(O')] whenever q((}') < q((}).1. and only if.a) LCB q.a) lower confidence boundfor O.2).6 Uniformly Most Accurate Confidence Bounds 249 Theorem 4. 0 If we apply the result and Example 4.
the lengili t . .6. there does not exist a member of the class of level (1 .a) quantile of the X§n distribution. however. we can restrict attention to certain reasonable subclasses of level (1 . Thus. the UMP test accepts H if LXi < X2n(1.T. The situation wiili confidence intervals is more complicated. * is by Theorem 4.a) confidence intervals that have uniformly minimum expected lengili among all level (1. Of course. *) is a uniformly most accurate level (1 .a) UCB'\* for A. there exist level (1 .1 have this property. the intervals developed in Example 4. Confidence intervals obtained from twosided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes.a) is the (1.a) UCB for A and.a) intervals iliat has minimum expected length for all B. 1962. in general.5.\ *) where>. ~ q(B) ~ t] ~ Pe[T..6.6. 374376).T..a) UCB for the probability of early failure.a) lower confidence bounds to fall below any value ()' below the true B.) as a measure of precision.a). a uniformly most accurate level (1 . l .a) in the sense that. pp. subject to ilie requirement that the confidence level is (1 . By Problem 4.a) by the property that Pe[T. the confidence interval be as short as possible. 0 Discussion We have only considered confidence bounds. it follows that q( >.8.a) unbiased confidence intervals.a)/ 2Ao i=1 n (4. Summary. In particular. in the case of lower bounds. the confidence region corresponding to this test is (0. as in the estimation problem.a) intervals for which members with uniformly smallest expected length exist.3) or equivalently if A X2n(1a) o < 2"'~ 1 X 2 L_n= where X2n(1. If we turn to the expected length Ee(t . There are.a) iliat has uniformly minimum length among all such intervals.1. because q is strictly increasing in A.. ~ q(()') ~ t] for every (). the interval must be at least as likely to cover the true value of q( B) as any other value. the situation is still unsatisfactory because. Neyman defines unbiased confidence intervals of level (1 . To find'\* we invert the family of UMP level a tests of H : A ~ AO versus K : A < AO. That is. is random and it can be shown that in most situations there is no confidence interval of level (1 . B'.. Considerations of accuracy lead us to ask that.250 Testing and Confidence Regions Chapter 4 We begin by finding a uniformly most accurate level (1. they are less likely ilian oilier level (1 . However. These topics are discussed in Lehmann (1997). some large sample results in iliis direction (see Wilks. Therefore. Pratt (1961) showed that in many of the classical problems of estimation . By using the duality between onesided tests and confidence bounds we show that confidence bounds based on UMP level a tests are uniformly most accurate (UMA) level (1 .
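The uniformly most accurate bound of Example 4.6.2 is explicit and easy to compute. In the sketch below, t0 denotes a fixed time defining "early failure"; t0, the simulated failure times, and the scipy calls are assumptions of the illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
lam_true, n, alpha, t0 = 0.5, 30, 0.05, 1.0
x = rng.exponential(scale=1.0 / lam_true, size=n)   # E(lambda) failure times

# UMA level 1-alpha upper confidence bound for lambda:
# 2*lambda*sum(X_i) ~ chi^2_{2n}, so lam_bar = chi^2_{2n}(1-alpha) / (2*sum(X_i)).
lam_bar = chi2.ppf(1 - alpha, df=2 * n) / (2.0 * x.sum())

# Upper bound for the probability of failure before t0, q(lambda) = 1 - exp(-lambda*t0),
# obtained by monotonicity of q in lambda.
q_bar = 1.0 - np.exp(-lam_bar * t0)
print(lam_bar, q_bar)
```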
7.2.a)% confidence interval is that if we repeated an experiment indefinitely each time computing a 100(1 .7.7 FREQUENTIST AND BAYESIAN FORMULATIONS We have so far focused on the frequentist formulation of confidence bounds and intervals where the data X E X c Rq are random while the parameters are fixed but unknown.'" .a lower and upper credible bounds for fL are  fL = fLB  ao Zla Vn (1 + . then Ck = {a: 7r(lx) .)2 nro 1 . the interpretation of a 100(1".a)% confidence interval. it is natural to consider the collec tion of that is "most likely" under the distribution II(alx).aB = n ~ + 1 I ~ It follows that the level 1 .4.1. Suppose that. a~). (j]. Definition 4.3.a) by the posterior distribution of the parameter given the data. the posterior distribution of fL given Xl. X has distribution P e. A consequence of this approach is that once a numerical interval has been computed from experimental data.7.d. . what are called level (1 '.Section 4. 75).A)2 nro ao 1 Ji = liB + Zla Vn (1 + . In the Bayesian formulation of Sections 1. then fl. .6. Example 4.a) credible region for e if II( C k Ix) ~ 1  a . Thus. then 100(1 . a E e c R. no probability statement can be attached to this interval.2 and 1.a. with known.::: 1 . a Definition 4. and that fL rv N(fLo. Xl.1.12. :s: alx) ~ 1 ..a. then Ck will be an interval of the form [fl.a) credible bounds and intervals are subsets of the parameter space which are given probability at least (1 . II(a:s: t9lx) . and {j are level (1 . from Example 1. If 7r( alx) is unimodal.::: k} is called a level (1 .i. N(fL..Xn are i. Let II( 'Ix) denote the posterior probability distribution of given X = x. with fLo and 75 known. given a. We next give such an example.a) lower and upper credible bounds for if they respectively satisfy II(fl. a a Turning to Bayesian credible intervals and regions. Suppose that given fL. with ~ a6) a6 fLB = ~ nx+l/1 ~t""o n ~ + I ~ ~2 . Then.7 Frequentist and Bayesian Formulations 251 4.1.Xn is N(liB. Let 7r('lx) denote the density of agiven X = x. Instead.. and that a has the prior probability distribution II.a)% of the intervals would contain the true unknown parameter value.
/10)2. a doctor administering a treatment with delayed effect will give patients a time interval [1::. = x a+n (a) / (t + b) is a level (1 .252 Testing and Confidence Regions Chapter 4 while the level (1 . In the Bayesian framework we define bounds and intervals.8 PREDICTION INTERVALS In Section 1. the level (1 . . Let A = a. See Example 1. the Bayesian interval tends to the frequentist interval. we may want an interval for the . Let xa+n(a) denote the ath quantile of the X~+n distribution. N(/10. by Problem 1.4.n. it is desirable to give an interval [Y. In addition to point prediction of Y. the interpretations are different: In the frequentist confidence interval. Note that as TO > 00. then .2 ..a) credible interval is [/1.a) credible interval is similar to the frequentist interval except it is pulled in the direction /10 of the prior mean and it is a little narrower. Then.. the probability of coverage is computed with the data X random and B fixed. X n . For instance.n(1+ ~)2 nTo 1 • Compared to the frequentist interval X ± Zl~oO/.2 .. called level (1 a) credible bounds and intervals.4 for sources of such prior guesses.02) where /10 is known.• . (t + b)A has a X~+n distribution. the center liB of the Bayesian interval is pulled in the direction of /10. given Xl.7. We shall analyze Bayesian credible regions further in Chapter 5.12.2. whereas in the Bayesian credible interval. is shifted in the direction of the reciprocal b/ a of the mean of W(A). Y] that contains the unknown value Y with prescribed probability (1 . However. however. the interpretations of the intervals are different.a). /1 +1with /1 ± = /111 ± Zl'" 2 ~ 00 .6. where /10 is a prior guess of the value of /1. Suppose that given 0. 4.3. In the case of a normal prior w( B) and normal model p( X I B).d. t] in which the treatment is likely to take effect. . D Example 4.4 we discussed situations in which we want to predict the value of a random variable Y.2. the probability of coverage is computed with X = x fixed and () random with probability distribution II (B I X = x). where t = 2:(Xi ..2. Xl. Compared to the frequentist bound (n 1) 8 2 / Xnl (a) of Example 4.2 and suppose A has the gamma f( ~a.. b > are known parameters. that determine subsets of the parameter space that are assigned probability at least (1 . ~b) density where a > 0. .a) upper credible bound for 0.a) by the posterior distribution of the parameter () given the data :c.Xn are Li.a) lower credible bound for A and ° is a level (1 .. 01 Summary. Similarly.
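A sketch of the level 1 − α credible interval of Example 4.7.1, computed from the posterior N(μ̂_B, σ̂²_B); the prior parameters and the simulated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def normal_credible_interval(x, sigma0, mu_prior, tau0, alpha=0.05):
    """Level 1-alpha credible interval for mu when (X_i | mu) ~ N(mu, sigma0^2)
    and mu ~ N(mu_prior, tau0^2), with sigma0 and tau0 known."""
    n, xbar = len(x), np.mean(x)
    prec = n / sigma0**2 + 1.0 / tau0**2               # posterior precision
    mu_B = (n * xbar / sigma0**2 + mu_prior / tau0**2) / prec
    sd_B = np.sqrt(1.0 / prec)                         # posterior standard deviation
    z = norm.ppf(1 - alpha / 2)
    return mu_B - z * sd_B, mu_B + z * sd_B

rng = np.random.default_rng(4)
x = rng.normal(1.0, 2.0, size=40)
print(normal_credible_interval(x, sigma0=2.0, mu_prior=0.0, tau0=1.0))
```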
future GPA of a student or a future value of a portfolio. We define a level (1 − α) prediction interval as an interval [Y₋, Y₊] based on the data X such that P(Y₋ ≤ Y ≤ Y₊) ≥ 1 − α. The problem of finding prediction intervals is similar to finding confidence intervals using a pivot:

Example 4.8.1. The (Student) t Prediction Interval. Let X₁, ..., Xₙ be i.i.d. as X ~ N(μ, σ²). We want a prediction interval for Y = X_{n+1}, which is assumed to be also N(μ, σ²) and independent of X₁, ..., Xₙ. Let Ŷ = Ŷ(X) denote a predictor based on X = (X₁, ..., Xₙ). We define a predictor Ŷ* to be prediction unbiased for Y if E(Ŷ* − Y) = 0. In Chapter 3 we found that, in the class of unbiased estimators, X̄ is the optimal estimator of μ, and we can conclude that in the class of prediction unbiased predictors the optimal MSPE predictor is Ŷ = X̄; in fact, in this case, the optimal estimator when it exists is also the optimal predictor.

Let Ŷ = X̄. Then Ŷ and Y are independent and the mean squared prediction error (MSPE) of Ŷ is E(Ŷ − Y)². Note that Ŷ can be regarded as both a predictor of Y and as an estimate of μ, and when we do so,

MSPE(Ŷ) = MSE(Ŷ) + σ²,

where MSE denotes the estimation theory mean squared error.

We next use the prediction error Ŷ − Y to construct a pivot that can be used to give a prediction interval. Note that

Ŷ − Y = X̄ − X_{n+1} ~ N(0, [n⁻¹ + 1]σ²),

so that

Z_p(Y) = (X̄ − X_{n+1}) / (σ√(1 + n⁻¹))

has a N(0, 1) distribution. Moreover, s² = (n − 1)⁻¹ Σᵢ(Xᵢ − X̄)² is independent of X̄ by Theorem B.3.3 and independent of X_{n+1} by assumption, so Z_p(Y) is independent of V = (n − 1)s²/σ², which has a χ²_{n−1} distribution. It follows, by the definition of the (Student) t distribution in Section B.3, that

T_p(Y) = Z_p(Y)/√(V/(n − 1)) = (X̄ − X_{n+1}) / (s√(1 + n⁻¹))

has the T_{n−1} distribution. By solving −t_{n−1}(1 − ½α) ≤ T_p(Y) ≤ t_{n−1}(1 − ½α) for Y, we find the level (1 − α) prediction interval

X̄ ± s√(1 + n⁻¹) t_{n−1}(1 − ½α).   (4.8.1)

Note that T_p(Y) acts as a prediction interval pivot in the same way that T(μ) acts as a confidence interval pivot in Example 4.4.1. Also note that the prediction interval (4.8.1) is much wider than the confidence interval (4.4.1).
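The prediction interval (4.8.1) in code; a minimal sketch with simulated data (the scipy call is an implementation choice).

```python
import numpy as np
from scipy.stats import t

def t_prediction_interval(x, alpha=0.05):
    """Level 1-alpha prediction interval (4.8.1) for a future observation X_{n+1},
    assuming X_1, ..., X_n, X_{n+1} are i.i.d. N(mu, sigma^2)."""
    x = np.asarray(x)
    n, xbar, s = len(x), x.mean(), x.std(ddof=1)
    half = s * np.sqrt(1.0 + 1.0 / n) * t.ppf(1 - alpha / 2, df=n - 1)
    return xbar - half, xbar + half

rng = np.random.default_rng(5)
print(t_prediction_interval(rng.normal(10.0, 3.0, size=20)))
```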
Suppose XI.0:) in the limit as n ) 00 for samples from nonGaussian distributions.. . where X n+l is independent of the data Xl'. Here Xl"'" Xn are observable and Xn+l is to be predicted.254 Testing and Confidence Regions Chapter 4 Chapter 5 that the width of the confidence interval (4. 00 :s: a < b :s: 00. n + 1. Example 4..v) = E(U(k») . whereas the level of the prediction interval (4. By ProblemB. Let X(1) < . (4. = i/(n+ 1).2..xn ).. by Problem B. Q(. We want a prediction interval for Y = X n+l rv F. Moreover..'. " X n + l are Suppose that () is random with () rv 1'( and that given () = i. The posterior predictive distribution Q(.' X n .Un+l are i.. .d.8. b). Un ordered.. with a sum replacing the integral in the discrete case. . I x) has in the continuous case density e. 0 We next give a prediction interval that is valid from samples from any population with a continuous distribution. .8.5 for a simpler proof of (4.9. .0:) Bayesian prediction interval for Y = X n + l if .i. X(k)] with k :s: Xn+l :s: XCk») = n + l' kj = n + 1 . p(X I e)..i..' •. I x) of Xn+l is defined as the conditional distribution of X n + l given x = (XI.. whereas the width of the prediction interval tends to 2(]z (1. " X n.d. Let U(1) < ..4.i.~o:).. i = 1. This interval is a distributionfree prediction interval. .2) P(X(j) It follows that [X(j) . Bayesian Predictive Distributions Xl' . then P(U(j) j :s: Un+l :s: UCk ») P(u:S: Un+l:S: v I U(j) = U.8.E(U(j)) where H is the joint distribution of UCj) and U(k). where F is a continuous distribution function with positive density f on (a. See Problem 4.4.12. 1).. Now [Y B' YB ] is said to be a level (1 . Ul .2j)/(n + 1) prediction interval for X n +l . UCk ) = v)dH(u. .v) j(v lI)dH(u.8. the confidence level of (4.2. Set Ui = F(Xi ). < UCn) be Ul . E(UCi») thus. . as X rv F. U(O.1) is approximately correct for large n even if the sample comes from a nonnormal distribution.1) is not (1 . .1) tends to zero in probability at the rate n!. .j is a level 0: = (n + 1 . < X(n) denote the order statistics of Xl.2). uniform. that is.d.8. then..2.Xn are i.
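The order-statistic prediction interval just derived requires no assumptions on F beyond continuity. A short sketch; the indices and the simulated data are illustrative.

```python
import numpy as np

def distribution_free_pi(x, j, k):
    """[X_(j), X_(k)] contains an independent future observation X_{n+1} with
    probability (k - j)/(n + 1), for any continuous distribution F."""
    xs = np.sort(np.asarray(x))
    n = len(xs)
    level = (k - j) / (n + 1)
    return xs[j - 1], xs[k - 1], level

rng = np.random.default_rng(6)
x = rng.exponential(scale=1.0, size=19)
lo, hi, level = distribution_free_pi(x, j=1, k=19)   # level = 18/20 = 0.90
print(lo, hi, level)
```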
Thus.i. For a sample of size n + 1 from a continuous distribution we show how the order statistics can be used to give a distributionfree prediction interval. Note that E[(Xn+l . Summary.0:).I . In the case of a normal sample of size n + 1 with only n variables observable. Xn+l . (J"5).9 Likelihood Ratio Procedures 255 Example 4. T2). T2 known.0 and 0 are still uncorrelated and independent.1) .. Because (J"~ + 0.8. Xn is T = X = n 1 2:~=1 Xi . 0 The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box. 4. we construct the Student t prediction interval for the unobservable variable. .8.3) we compute its probability limit under the assumption that Xl " '" Xn are i. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1.. The Bayesian formulation is based on the posterior predictive distribution which is the conditional distribution of the unobservable variable given the observable variables.1 where (Xi I B) '" N(B . from Example 4. However. Thus.9 4.3) converges in probability to B±z (1 .3. where X and X n + l are independent.8.0:) Bayesian prediction interval for Y is [YB ' Yt] with Yf = liB ± Z (1. .0 and 0 are uncorrelated and. 1983).1. and X ~ Bas n + 00. (n(J"~ /(J"5) + 1. This is the same as the probability limit of the frequentist interval (4.  0) + 0 I X = t} = N(liB. (4.~o:) (J"o as n + 00. X n + l and 0.8.3) To consider the frequentist properties of the Bayesian prediction interval (4. .c{Xn+l I X = t} = .d. A sufficient statistic based on the observables Xl. even in .2.8.c{(Xn +1 where. (J"5 known. N(B. B = (2 / T 2) TJo + ( 2 / (J"0 X. To obtain the predictive distribution. Consider Example 3.Section 4. and 7r( B) is N( TJo .2 o + I' T2 ~ P. and it is enough to derive the marginal distribution of Y = X n +l from the joint distribution of X. (J"5).9. (J" B n(J" B 2) It follows that a level (1 . the results and examples in this chapter deal mostly with oneparameter problems in which it sometimes is possible to find optimal procedures. we find that the interval (4.0)0] = E{E(Xn+l . Xn+l .1 LIKELIHOOD RATIO PROCEDURES Introduction Up to this point.B)B I 0 = B} = O. by Theorem BA. independent.7. note that given X = t. (J"5 + a~) ~2 = (J" B n I (. The Bayesian prediction interval is derived for the normal model with a normal prior.~o:) V(J"5 + a~.
.1(c». To see that this is a plausible statistic. and conversely. p(x. . Although the calculations differ from case to case.~a). e) is a continuous function of e and eo is of smaller dimension than 8 = 8 0 U 8 1 so that the likelihood ratio equals the test statistic I e e 1 A(X) = sup{p(x.x) = p(x. where T = fo(X . a 2 ) population with a 2 known. . For instance. eo). the MP level a test 'Pa(X) rejects H if T ::::. Calculate the MLE eo of e where e may vary only over 8 0 . recall from Section 2. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6. 3. e) as a measure of how well e "explains" the given sample x = (Xl. In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. by the uniqueness of the NP test (Theorem 4.Xn ) has density or frequency function p(x. if /Ll < /Lo./LO. Find a function h that is strictly increasing on the range of A such that h(A(X)) has a simple form and a tabled distribution under H ./Lo)/a. e) : () sup{p(x. Xn).1 that if /Ll > /Lo. Calculate the MLE e of e. In particular cases. e) : e E 8 l }. . The test statistic we want to consider is the likelihood ratio given by L(x) = sup{p(x.2. Also note that L(x) coincides with the optimal test statistic p(x . Because h(A(X)) is equivalent to A(X) . . there is no UMP test for testing H : /L = /Lo vs K : /L I. likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6. z(a ). if Xl./Lo.1) whose computation is often simple. We start with a generalization of the NeymanPearson statistic p(x .e): E 8 0 }. 1). 1. B')/p(x . and for large samples.. We are going to derive likelihood ratio tests in several important testing problems.2 that we think of the likelihood function L(e. Form A(X) = p(x.2. eo). 2. Note that in general A(X) = max(L(x). ifsup{p(x.a )th quantile obtained from the table.. 8 1 = {e l }..256 Testing and Confidence Regions Chapter 4 the case in which is onedimensional. ed / p(x. .9. the basic steps are always the same. eo) when 8 0 = {eo}.. optimal procedures may not exist. sup{p(x. To see this. In the cases we shall consider. e) E 8} : eE 8 0} (4. . note that it follows from Example 4. we specify the size a likelihood ratio test through the test statistic h(A(X) and . 4. . On the other hand. e) : e E 8l}islargecomparedtosup{p(x. Because 6a (x) I. there can be no UMP test of H : /L = /Lo vs H : /L I. .its (1 . the MP level a test 6a (X) rejects H for T > z(l .2. So. Xn is a sample from a N(/L. ed/p(x. e) : e E 8 0 } e e e Tests that reject H for large values of L(x) are called likelihood ratio tests.'Pa(x). Suppose that X = (Xl . e) and we wish to test H : E 8 0 vs K : E 8 1 . then the observed sample is best explained by some E 8 1 .
We can regard twins as being matched pairs. To see how the process works we refer to the specific examples in Sections 4. After the matching.9 Likelihood Ratio Procedures 257 We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions. 8) > (8)] = eo a. we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. 8n e (4.2) where sUPe denotes sup over 8 E e and the critical constant c( 8) satisfies p.9. In order to reduce differences due to the extraneous factors. we measure the response of a subject when under treatment and when not under treatment. An example is discussed in Section 4. and soon.2 Tests for the Mean of a Normal DistributionMatched Pair Experiments Suppose Xl.a) confidence region C(x) = {8 : p(x.24.8) . the difference Xi has a distribution that is .9. the experiment proceeds as follows. which are composite because 82 can vary freely. and other factors. sales performance before and after a course in salesmanship. mileage of cars with and without a certain ingredient or adjustment. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age. with probability ~) and given the treatment. We shall obtain likelihood ratio tests for hypotheses of the form H : 81 = 8lD . In the ith pair one patient is picked at random (i. while the second patient serves as control and receives a placebo. This section includes situations in which 8 = (8 1 . Let Xi denote the difference between the treated and control responses for the ith pair. diet. Here are some examples. The family of such level a likelihood ratio tests obtained by varying 8lD can also be inverted and yield confidence regions for 8 1 . We are interested in expected differences in responses due to the treatment effect.Section 4.9. we can invert the family of size a likelihood ratio tests of the point hypothesis H : 8 = 80 and obtain the level (1 . 0'2) population in which both JL and 0'2 are unknown.c 0 0 It is often approximately true (see Chapter 6) that c( 8) is independent of 8.e. In that case. C (x) is just the set of all 8 whose likelihood is on or above some fixed value dependent on the data.. 4. and so on.9. If the treatment and placebo have the same effect. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo.Xn form a sample from a N(JL.2. bounds. Response measurements are taken on the treated and control members of each pair. [ supe p(X.82 ) where 81 is the parameter of interest and 82 is a nuisance parameter. p (X .'" .9. An important class of situations for which this model may be appropriate occurs in matched pair experiments. For instance.5. 8) ~ [c( 8)t l sup p(x. That is. Studies in which subjects serve as their own control can also be thought of as matched pair experiments.
0 )2 . = E(X 1 ) denote the mean difference between the response of the treated and control subjects. Our null hypothesis of no treatment effect is then H : fJ. e). = fJ. Let fJ.0 as an established standard for an old treatment. = O. Finding sup{p(x. a 2 ) : fJ.258 Testing and Confidence Regions Chapter 4 symmetric about zero.0.a)= ~ ~2 (1 ~ 1 ~ n i=l 2 ) . B) 1[1~ ="2 a 4 L(Xi .fJ.L Xi 1 ~( n i=l .X) . as discussed in Section 4. = fJ. we test H : fJ. This corresponds to the alternative "The treatment has some effect.0) 2  a2 n] = 0." However. B) : B E 8 0 } boils down to finding the maximum likelihood estimate a~ of a 2 when fJ. We think of fJ. the test can be modified into a threedecision rule that decides whether there is a significant positive or negative effect. n i=l is the maximum likelihood estimate of B. The problem of finding the supremum of p(x. B) at (fJ. where we think of fJ. The likelihood equation is oa 2 a logp(x. as representing the treatment effect. = fJ.B): B E 8} = p(x. . The test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the nonnality assumption is not satisfied.L ( X i . which has the immediate solution ao ~2 = . =Ie fJ.o.5. where B=(x. B) was solved in Example 3. Form of the TwoSided Tests Let B = (fJ" a 2 ).0.0}.6.L X i ' .0 is known and then evaluating p(x. TwoSided Tests We begin by considering K : fJ. good or bad. We found that sup{p(x.=1 fJ. To this we have added the nonnality assumption. 8 0 = {(fJ" Under our assumptions. a~). However.3. 'for the purpose of referring to the duality between testing and confidence procedures.
In Problem 4. the likelihood ratio tests reject for large values of ITn I.2.Section 4.  {~[(log271') + (log&5)]~} Our test rule. However. which thus equals log . the size a likelihood ratio test for H : M Z Mo versus K : M < Mo rejects H if.9. (n . to find the critical value. suppose n = 25 and we want a = 0. (Mo. ITnl Z 2.x)2 = n&2/(n .1). For instance.4. &5 gives the maximum of p(x. and only if. Therefore. Thus. A proof is sketched in Problem 4.8) logp(x. 8) for 8 E 8 0. &5/&2 (&5/&2) is monotone increasing Tn = y'n(x .\(x) logp(x. therefore. Mo .1). A and B. .Mo) . Mo versus K : M > Mo (with Mo = 0) is suggested.Mo)/a. the relevant question is whether the treatment creates an improvement. Because 8 2 function of ITn 1 where = 1 + (x .3.11 we argue that P"[Tn Z t] is increasing in 8. which can be established by expanding both sides.~a) and we can use calculators or software that gives quantiles of the t distribution. where 8 = (M .05. &0)) ~ 2 {~[(log271') + (log&2)]~} ~ log(&5/&2).MO)2/&2. OneSided Tests The twosided formulation is natural if two treatments. To simplify the rule further we use the following equation. are considered to be equal before the experiment is performed. The statistic Tn is equivalent to the likelihood ratio statistic A for this problem.1)1 I:(Xi . 8 Therefore.\(x). the testing problem H : M ::. is of size a for H : M ::.1. Because Tn has a T distribution under H (see Example 4. Therefore.064. and only if. the test that rejects H for Tn Z t nl(1 . Then we would reject H if. The test statistic . the size a critical value is t n 1 (1 . if we are comparing a treatment and control. rejects H for iarge values of (&5/&2). Similarly.\(x) is equivalent to log . or Table III.9.a).9 Likelihood Ratio Procedures 259 By Theorem 2.
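A sketch of the resulting size α tests based on Tₙ; the function signature and the simulated data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import t

def one_sample_t_test(x, mu0, alpha=0.05, alternative="two-sided"):
    """Size-alpha likelihood ratio (t) tests about mu in the N(mu, sigma^2) model."""
    x = np.asarray(x)
    n = len(x)
    Tn = np.sqrt(n) * (x.mean() - mu0) / x.std(ddof=1)
    if alternative == "two-sided":
        reject = abs(Tn) >= t.ppf(1 - alpha / 2, df=n - 1)
        pval = 2 * (1 - t.cdf(abs(Tn), df=n - 1))
    elif alternative == "greater":          # H: mu <= mu0 vs K: mu > mu0
        reject = Tn >= t.ppf(1 - alpha, df=n - 1)
        pval = 1 - t.cdf(Tn, df=n - 1)
    else:                                    # H: mu >= mu0 vs K: mu < mu0
        reject = Tn <= t.ppf(alpha, df=n - 1)
        pval = t.cdf(Tn, df=n - 1)
    return Tn, pval, reject

rng = np.random.default_rng(9)
print(one_sample_t_test(rng.normal(0.5, 1.0, size=25), mu0=0.0))
```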
Power Functions

To discuss the power of these tests, we need to introduce the noncentral $t$ distribution with $k$ degrees of freedom and noncentrality parameter $\delta$. This distribution, denoted by $\mathcal{T}_{k,\delta}$, is by definition the distribution of $Z/\sqrt{V/k}$, where $Z$ and $V$ are independent with $N(\delta,1)$ and $\chi^2_k$ distributions, respectively. The density of $Z/\sqrt{V/k}$ is given in the problems. To derive the distribution of $T_n$, note that from Section B.3 we know that $\sqrt n(\bar X - \mu)/\sigma$ and $(n-1)s^2/\sigma^2$ are independent and that $(n-1)s^2/\sigma^2$ has a $\chi^2_{n-1}$ distribution. Because $E[\sqrt n(\bar X - \mu_0)/\sigma] = \sqrt n(\mu - \mu_0)/\sigma$, the variable $\sqrt n(\bar X - \mu_0)/\sigma$ has a $N(\delta, 1)$ distribution with $\delta = \sqrt n(\mu - \mu_0)/\sigma$. Thus,

$$T_n = \frac{\sqrt n(\bar X - \mu_0)/\sigma}{\sqrt{[(n-1)s^2/\sigma^2]\,/\,(n-1)}}$$

has a $\mathcal{T}_{n-1,\delta}$ distribution. Note that the distribution of $T_n$ depends on $\theta = (\mu, \sigma^2)$ only through $\delta$, and the power can be obtained from computer software or tables of the noncentral $t$ distribution. The power functions of the one-sided tests are monotone in $\delta$, just as the power functions of the corresponding known-$\sigma$ tests of Section 4.1 are monotone in $\sqrt n\,\mu/\sigma$.

If we consider alternatives of the form $(\mu - \mu_0) \ge \Delta$, we can no longer control both probabilities of error by choosing the sample size. The reason is that, whatever be $n$, by making $\sigma$ sufficiently large we can force the noncentrality parameter $\delta = \sqrt n(\mu - \mu_0)/\sigma$ as close to 0 as we please and, thus, bring the power arbitrarily close to $\alpha$. We can control both probabilities of error by selecting the sample size $n$ large provided we consider alternatives of the form $|\delta| \ge \delta_1 > 0$ in the two-sided case and $\delta \ge \delta_1$ or $\delta \le -\delta_1$ in the one-sided cases; computer software will compute the required $n$. We have met similar difficulties before when discussing confidence intervals. A Stein solution, in which we estimate $\sigma$ from a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with $|\mu - \mu_0| \ge \Delta$, is possible (Lehmann, 1997, p. 260, Problem 17). With this solution, however, we may be required to take more observations than we can afford on the second stage.

Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

$$C(X) = \left\{\mu : \left|\frac{\sqrt n(\bar X - \mu)}{s}\right| \le t_{n-1}\left(1 - \tfrac{1}{2}\alpha\right)\right\}.$$

We recognize $C(X)$ as the confidence interval of Example 4.4.1. Similarly, the one-sided tests lead to the corresponding lower and upper confidence bounds of Section 4.4.

Data Example

As an illustration of these procedures, consider the following data due to Cushny and Peebles (see Fisher, 1958, p. 121) giving the difference B − A in hours of sleep gained using drugs A and B on 10 patients. This is a matched pair experiment with each subject serving as its own control.

Patient i      1     2     3     4     5     6     7     8     9    10
A            0.7  -1.6  -0.2  -1.2  -0.1   3.4   3.7   0.8   0.0   2.0
B            1.9   0.8   1.1   0.1  -0.1   4.4   5.5   1.6   4.6   3.4
B - A        1.2   2.4   1.3   1.3   0.0   1.0   1.8   0.8   4.6   1.4

If we denote the differences B − A as x's, then $\bar x = 1.58$, $s^2 = 1.513$, and $|T_n| = 4.06$. Because $t_9(0.995) = 3.25$, we conclude at the 1% level of significance that the two drugs are significantly different. The 0.99 confidence interval for the mean difference $\mu$ between treatments is $[0.32, 2.84]$. It suggests that not only are the drugs different but in fact B is better than A, because no hypothesis $\mu = \mu' < 0$ is accepted at this level.
4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations

We often want to compare two populations with distribution functions F and G on the basis of two independent samples $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$, one from each population. For instance, suppose we want to test the effect of a certain drug on some biological variable (e.g., blood pressure). Then $X_1, \ldots, X_{n_1}$ could be blood pressure measurements on a sample of patients given a placebo, while $Y_1, \ldots, Y_{n_2}$ are the measurements on a sample given the drug. For quantitative measurements such as blood pressure, height, weight, length, volume, temperature, and so forth, it is usually assumed that $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ are independent samples from $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ populations, respectively. The preceding assumptions were discussed in Chapter 1; a discussion of the consequences of their violation will be postponed to Chapters 5 and 6.

Tests

We first consider the problem of testing $H : \mu_1 = \mu_2$ versus $K : \mu_1 \ne \mu_2$. In the control versus treatment example, this is the problem of determining whether the treatment has any effect. Let $\theta = (\mu_1, \mu_2, \sigma^2)$. Then $\Theta_0 = \{\theta : \mu_1 = \mu_2\}$ and $\Theta_1 = \{\theta : \mu_1 \ne \mu_2\}$. The log of the likelihood of $(X, Y) = (X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2})$ is

$$-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{n_1}(x_i - \mu_1)^2 + \sum_{j=1}^{n_2}(y_j - \mu_2)^2\right],$$

where $n = n_1 + n_2$. The likelihood and its log are maximized over $\Theta$ by the maximum likelihood estimate $\hat\theta = (\bar X, \bar Y, \hat\sigma^2)$, where

$$\hat\sigma^2 = \frac{1}{n}\left[\sum_{i=1}^{n_1}(X_i - \bar X)^2 + \sum_{j=1}^{n_2}(Y_j - \bar Y)^2\right].$$

When $\mu_1 = \mu_2 = \mu$, our model reduces to the one-sample model of Section 4.9.2. Thus, the maximum of $p$ over $\Theta_0$ is obtained for $\theta = (\hat\mu, \hat\mu, \hat\sigma_0^2)$, where

$$\hat\mu = \frac{n_1\bar X + n_2\bar Y}{n} \quad\text{and}\quad \hat\sigma_0^2 = \frac{1}{n}\left[\sum_{i=1}^{n_1}(X_i - \hat\mu)^2 + \sum_{j=1}^{n_2}(Y_j - \hat\mu)^2\right].$$

If we use the identities

$$\sum_{i=1}^{n_1}(X_i - \hat\mu)^2 = \sum_{i=1}^{n_1}(X_i - \bar X)^2 + n_1(\bar X - \hat\mu)^2, \qquad \sum_{j=1}^{n_2}(Y_j - \hat\mu)^2 = \sum_{j=1}^{n_2}(Y_j - \bar Y)^2 + n_2(\bar Y - \hat\mu)^2,$$

obtained by writing $[X_i - \hat\mu]^2 = [(X_i - \bar X) + (\bar X - \hat\mu)]^2$ and expanding, we find that the log likelihood ratio statistic is equivalent to the test statistic $|T|$, where

$$T = \sqrt{\frac{n_1 n_2}{n}}\,\frac{\bar Y - \bar X}{s} \quad\text{and}\quad s^2 = \frac{1}{n-2}\left[\sum_{i=1}^{n_1}(X_i - \bar X)^2 + \sum_{j=1}^{n_2}(Y_j - \bar Y)^2\right].$$
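The equivalence can be checked numerically: the log likelihood ratio equals $\frac{n}{2}\log(\hat\sigma_0^2/\hat\sigma^2)$, and $\hat\sigma_0^2/\hat\sigma^2 = 1 + T^2/(n-2)$ is a monotone function of $|T|$. A minimal sketch in Python (function name ours; the sample values at the bottom are illustrative only):

```python
import numpy as np

def loglik_ratio_and_T(x, y):
    """Two-sample normal LR statistic, log lambda = (n/2) log(sigma0^2 / sigma^2), and T."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    mu_hat = (n1 * x.mean() + n2 * y.mean()) / n                     # MLE of mu under H
    sigma2_hat = (((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()) / n
    sigma2_0 = (((x - mu_hat) ** 2).sum() + ((y - mu_hat) ** 2).sum()) / n
    log_lambda = 0.5 * n * np.log(sigma2_0 / sigma2_hat)
    s2 = n * sigma2_hat / (n - 2)                                    # pooled s^2
    T = np.sqrt(n1 * n2 / n) * (y.mean() - x.mean()) / np.sqrt(s2)
    # Identity: sigma0^2 / sigma^2 = 1 + T^2 / (n - 2)
    assert np.isclose(sigma2_0 / sigma2_hat, 1 + T ** 2 / (n - 2))
    return log_lambda, T

print(loglik_ratio_and_T([1.0, 2.0, 3.0], [2.5, 3.5, 4.0, 5.0]))
```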
To complete the specification of the size $\alpha$ likelihood ratio test we show that $T$ has a $\mathcal{T}_{n-2}$ distribution when $\mu_1 = \mu_2$. This follows from the fact that, by the results of Section B.3, $\bar X/\sigma$, $\bar Y/\sigma$, $\sum_i(X_i - \bar X)^2/\sigma^2$, and $\sum_j(Y_j - \bar Y)^2/\sigma^2$ are independent and distributed as $N(\mu_1/\sigma, 1/n_1)$, $N(\mu_2/\sigma, 1/n_2)$, $\chi^2_{n_1-1}$, and $\chi^2_{n_2-1}$, respectively. We conclude from this remark and the additive property of the $\chi^2$ distribution that $(n-2)s^2/\sigma^2$ has a $\chi^2_{n-2}$ distribution. Moreover, $\sqrt{n_1 n_2/n}(\bar Y - \bar X)/\sigma$ has a $N(0,1)$ distribution under $H$ and is independent of $(n-2)s^2/\sigma^2$. Therefore, by definition, $T$ has a $\mathcal{T}_{n-2}$ distribution under $H$, and the resulting two-sided, two-sample $t$ test rejects if, and only if,

$$|T| \ge t_{n-2}\left(1 - \tfrac{1}{2}\alpha\right).$$

As in the one-sample case, there are two one-sided tests, with critical regions $T \ge t_{n-2}(1-\alpha)$ for $H : \mu_2 \le \mu_1$ and $T \le -t_{n-2}(1-\alpha)$ for $H : \mu_1 \le \mu_2$. We can show that these tests are likelihood ratio tests for these hypotheses, and it is also true that these procedures are of size $\alpha$ for their respective hypotheses. If $\mu_1 \ne \mu_2$, $T$ has a noncentral $t$ distribution with noncentrality parameter $\sqrt{n_1 n_2/n}(\mu_2 - \mu_1)/\sigma$.

Confidence Intervals

To obtain confidence intervals for $\mu_2 - \mu_1$ we naturally look at likelihood ratio tests for the family of testing problems $H : \mu_2 - \mu_1 = \Delta$ versus $K : \mu_2 - \mu_1 \ne \Delta$. As in the one-sample case, we find a simple equivalent statistic $|T(\Delta)|$, where

$$T(\Delta) = \sqrt{\frac{n_1 n_2}{n}}\,\frac{\bar Y - \bar X - \Delta}{s}.$$

If $\mu_2 - \mu_1 = \Delta$, then $T(\Delta)$ has a $\mathcal{T}_{n-2}$ distribution, and inversion of the tests leads to the interval

$$\bar Y - \bar X \pm s\sqrt{\frac{n}{n_1 n_2}}\; t_{n-2}\left(1 - \tfrac{1}{2}\alpha\right) \tag{4.9.3}$$

for $\mu_2 - \mu_1$. Similarly, the one-sided tests lead to the upper and lower endpoints of the interval as level $1 - \tfrac{1}{2}\alpha$ likelihood confidence bounds.

Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results in terms of logarithms were (from Hald, 1952, p. 472)

x (machine 1):  1.845  1.790  2.042
y (machine 2):  1.583  1.627  1.282

We test the hypothesis $H$ of no difference in expected log permeability. Here $\bar x - \bar y = 0.395$, $s^2 = 0.0264$, and $|T| = 2.977$. If normality holds, $H$ is rejected if $|T| \ge t_{n-2}(1 - \frac{1}{2}\alpha)$; because $t_4(0.975) = 2.776$, we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is $0.395 \pm 0.368$. On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level $(1-\alpha)$ confidence interval has probability at most $\frac{1}{2}\alpha$ of making the wrong selection.
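The arithmetic of the permeability example can be reproduced as follows (a sketch in Python with NumPy and SciPy):

```python
import numpy as np
from scipy import stats

x = np.array([1.845, 1.790, 2.042])   # machine 1, log permeability
y = np.array([1.583, 1.627, 1.282])   # machine 2, log permeability
n1, n2 = len(x), len(y)
n = n1 + n2
s2 = (((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()) / (n - 2)
se = np.sqrt(s2 * (1 / n1 + 1 / n2))
T = (y.mean() - x.mean()) / se
t_crit = stats.t.ppf(0.975, df=n - 2)
print(round(x.mean() - y.mean(), 3), round(s2, 4), round(T, 3), round(t_crit, 3))
# 0.395, 0.0264, -2.977, 2.776; the 0.95 interval for the difference is 0.395 +/- 0.368
```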
4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the X's and Y's have different variances. For instance, a treatment that increases mean response may also increase the variance of the responses. As a first step we may still want to compare mean responses for the X and Y populations. Thus, we are led to a model where $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ are two independent $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ samples with $\sigma_1^2 > 0$, $\sigma_2^2 > 0$. This is the Behrens–Fisher problem.

Suppose first that $\sigma_1^2$ and $\sigma_2^2$ are known. The log likelihood, except for an additive constant, is

$$-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n_1}(x_i - \mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{n_2}(y_j - \mu_2)^2.$$

The MLEs of $\mu_1$ and $\mu_2$ for $(\mu_1, \mu_2) \in R \times R$ are, thus, $\hat\mu_1 = \bar x$ and $\hat\mu_2 = \bar y$. When $\mu_1 = \mu_2 = \mu$, setting the derivative of the log likelihood equal to zero yields the MLE

$$\hat\mu = \left(\frac{n_1}{\sigma_1^2}\bar x + \frac{n_2}{\sigma_2^2}\bar y\right)\Big/\left(\frac{n_1}{\sigma_1^2} + \frac{n_2}{\sigma_2^2}\right).$$

By writing

$$\sum_{i=1}^{n_1}(x_i - \hat\mu)^2 = \sum_{i=1}^{n_1}(x_i - \bar x)^2 + n_1(\hat\mu - \bar x)^2, \qquad \sum_{j=1}^{n_2}(y_j - \hat\mu)^2 = \sum_{j=1}^{n_2}(y_j - \bar y)^2 + n_2(\hat\mu - \bar y)^2,$$

we obtain

$$\lambda(x,y) = \exp\left\{\frac{n_1}{2\sigma_1^2}(\hat\mu - \bar x)^2 + \frac{n_2}{2\sigma_2^2}(\hat\mu - \bar y)^2\right\}.$$

Because $\hat\mu - \bar x$ and $\hat\mu - \bar y$ are both multiples of $\bar y - \bar x$, it follows that the likelihood ratio test is equivalent to the statistic $|D|/\sigma_D$, where $D = \bar Y - \bar X$ and $\sigma_D^2$ is the variance of $D$, that is,

$$\sigma_D^2 = \mathrm{Var}(\bar X) + \mathrm{Var}(\bar Y) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$

Under $H$, $D/\sigma_D$ has a $N(0,1)$ distribution and, more generally, $(D - \Delta)/\sigma_D$ has a $N(0,1)$ distribution, where $\Delta = \mu_2 - \mu_1$.

When $\sigma_1^2$ and $\sigma_2^2$ are unknown, it is natural to try to use $D/s_D$ as a test statistic for the one-sided hypothesis $H : \mu_2 \le \mu_1$, $|D|/s_D$ for $H : \mu_1 = \mu_2$, and more generally $(D - \Delta)/s_D$ to generate confidence procedures. Here $\sigma_D^2$ must be estimated; an unbiased estimate is

$$s_D^2 = \frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}.$$

For large $n_1, n_2$, by Slutsky's theorem and the central limit theorem, $D/s_D$, and more generally $(D - \Delta)/s_D$, has approximately a standard normal distribution (see Chapter 5). Unfortunately, for fixed small or moderate $n_1, n_2$ the distribution of $(D - \Delta)/s_D$ depends on $\sigma_1^2/\sigma_2^2$. For these cases an approximation to the distribution of $(D - \Delta)/s_D$ due to Welch (1949) works well.
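For large samples, the approximate pivot $(D - \Delta)/s_D$ noted above yields the interval $D \pm z(1-\frac{1}{2}\alpha)s_D$. A minimal sketch (Python with NumPy and SciPy; the function name is ours):

```python
import numpy as np
from scipy import stats

def large_sample_mean_diff_ci(x, y, alpha=0.05):
    """Approximate level (1 - alpha) interval for mu2 - mu1 based on (D - Delta)/s_D."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    D = y.mean() - x.mean()
    s_D = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = stats.norm.ppf(1 - alpha / 2)
    return D - z * s_D, D + z * s_D
```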
Let $c = (s_X^2/n_1)/s_D^2$. Then Welch's approximation to the distribution of $(D - \Delta)/s_D$ is $\mathcal{T}_k$, where

$$k = \left[\frac{c^2}{n_1 - 1} + \frac{(1-c)^2}{n_2 - 1}\right]^{-1}.$$

When $k$ is not an integer, the critical value is obtained by linear interpolation in the $t$ tables or by using computer software. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the Behrens–Fisher problem. Wang (1971) has shown the approximation to be very good for $\alpha = 0.05$ and $\alpha = 0.01$, the maximum error in size being bounded by 0.003. Note that Welch's solution works whether the variances are equal or not. The LR procedure derived in Section 4.9.3, which works well if the variances are equal or $n_1 = n_2$, can unfortunately be very misleading if $\sigma_1^2 \ne \sigma_2^2$ and $n_1 \ne n_2$ (see Chapter 5).
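Welch's approximation is straightforward to implement; the sketch below computes $k$ and the corresponding critical value (Python with NumPy and SciPy; the function name is ours, and SciPy's `scipy.stats.ttest_ind` with `equal_var=False` uses the same type of approximation):

```python
import numpy as np
from scipy import stats

def welch_test(x, y, alpha=0.05):
    """Welch's approximate solution to the Behrens-Fisher problem for H: mu1 = mu2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2
    s_D = np.sqrt(v1 + v2)
    c = v1 / (v1 + v2)
    k = 1.0 / (c ** 2 / (n1 - 1) + (1 - c) ** 2 / (n2 - 1))  # Welch's degrees of freedom
    T = (y.mean() - x.mean()) / s_D
    crit = stats.t.ppf(1 - alpha / 2, df=k)                   # non-integer df handled directly
    return T, k, crit, abs(T) >= crit
```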
4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions

If $n$ subjects (persons, mice, fields, machines, etc.) are sampled from a population and two numerical characteristics are measured on each case, then we end up with a bivariate random sample $(X_1, Y_1), \ldots, (X_n, Y_n)$. Some familiar examples are: $X$ = test score on an English exam, $Y$ = score on a mathematics exam; $X$ = percentage of fat in diet, $Y$ = cholesterol level in blood; $X$ = average cigarette consumption per day in grams, $Y$ = age at death; $X$ = weight, $Y$ = blood pressure; and so on. Empirical data sometimes suggest that a reasonable model is one in which the two characteristics $(X, Y)$ have a joint bivariate normal distribution, $N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, with $\sigma_1^2 > 0$, $\sigma_2^2 > 0$.

Testing Independence, Confidence Intervals for $\rho$

The question "Are two random variables X and Y independent?" arises in many statistical studies. If we have a sample as before and assume the bivariate normal model for $(X, Y)$, then X and Y are independent if and only if $\rho = 0$, and our problem becomes that of testing $H : \rho = 0$.

Two-Sided Tests

The unrestricted maximum likelihood estimate $\hat\theta = (\bar x, \bar y, \hat\sigma_1^2, \hat\sigma_2^2, \hat\rho)$ was given in the problems of Chapter 2 as

$$\hat\sigma_1^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2, \qquad \hat\sigma_2^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2, \qquad \hat\rho = \left[\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)\right]\Big/\, n\hat\sigma_1\hat\sigma_2.$$

$\hat\rho$ is called the sample correlation coefficient and satisfies $-1 \le \hat\rho \le 1$. When $\rho = 0$ we have two independent samples, and $\hat\theta_0$ can be obtained by separately maximizing the likelihood of $X_1, \ldots, X_n$ and that of $Y_1, \ldots, Y_n$. We have $\hat\theta_0 = (\bar x, \bar y, \hat\sigma_1^2, \hat\sigma_2^2, 0)$, and the log of the likelihood ratio statistic becomes

$$\log\lambda(x,y) = -\frac{n}{2}\log(1 - \hat\rho^2). \tag{4.9.4}$$

Thus, $\log\lambda$ is an increasing function of $\hat\rho^2$, and the likelihood ratio tests reject $H$ for large values of $|\hat\rho|$.

To obtain critical values we need the distribution of $\hat\rho$, or of an equivalent statistic, under $H$. Because $(U_1, V_1), \ldots, (U_n, V_n)$, where $U_i = (X_i - \mu_1)/\sigma_1$ and $V_i = (Y_i - \mu_2)/\sigma_2$, is a sample from the $N(0, 0, 1, 1, \rho)$ distribution, the distribution of $\hat\rho$ depends on $\rho$ only. There is no simple form for the distribution of $\hat\rho$ (or $T_n$) when $\rho \ne 0$, but it is available in computer packages, and a normal approximation is available (see Chapter 5). If $\rho = 0$, then

$$T_n = \frac{\sqrt{n-2}\,\hat\rho}{\sqrt{1 - \hat\rho^2}} \tag{4.9.5}$$

has a $\mathcal{T}_{n-2}$ distribution (see Appendix B). Because $|T_n|$ is an increasing function of $|\hat\rho|$, the two-sided likelihood ratio tests can be based on $|T_n|$, with critical values obtained from Table II. Qualitatively, the power function of the LR test is symmetric about $\rho = 0$ and increases continuously from $\alpha$ to 1 as $|\rho|$ goes from 0 to 1. Thus, for any $\alpha$, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
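The statistic (4.9.5) and its two-sided p-value under $H : \rho = 0$ can be computed as in the following sketch (Python with NumPy and SciPy; `scipy.stats.pearsonr` returns the same two-sided p-value):

```python
import numpy as np
from scipy import stats

def correlation_t_test(x, y):
    """Sample correlation, the statistic Tn of (4.9.5), and its two-sided p-value."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    rho_hat = np.corrcoef(x, y)[0, 1]
    Tn = rho_hat * np.sqrt(n - 2) / np.sqrt(1 - rho_hat ** 2)
    p_value = 2 * stats.t.sf(abs(Tn), df=n - 2)
    return rho_hat, Tn, p_value
```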
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood, we would test $H : \rho = 0$ versus $K : \rho > 0$ or $H : \rho \le 0$ versus $K : \rho > 0$. It can be shown that $\hat\rho$ is equivalent to the likelihood ratio statistic for testing $H : \rho \le 0$ versus $K : \rho > 0$ and, similarly, that $-\hat\rho$ corresponds to the likelihood ratio statistic for $H : \rho \ge 0$ versus $K : \rho < 0$. The power functions of these tests are monotone. Because the distribution of $\hat\rho$ depends only on $\rho$, we obtain size $\alpha$ tests for each of these hypotheses by setting the critical value so that the probability of type I error is $\alpha$ when $\rho = 0$.

Data Example

As an illustration, consider the following bivariate sample: the weights $x_i$ of young rats at a certain age and the weight increases $y_i$ during the following week. We want to know whether there is a correlation between the initial weights and the weight increases and formulate the hypothesis $H : \rho = 0$. Here $\hat\rho = 0.18$ and $T_n = 0.75$. Thus, by using the two-sided test and referring to the $t$ tables, we find that there is no evidence of correlation: the p-value is bigger than 0.1.

Confidence Bounds and Intervals

Usually testing independence is not enough; we want bounds and intervals for $\rho$ giving us an indication of what departure from independence is present. To obtain lower confidence bounds, we can start by constructing size $\alpha$ likelihood ratio tests of $H : \rho = \rho_0$ versus $K : \rho > \rho_0$. These tests can be shown to be of the form "Accept if, and only if, $\hat\rho \le c(\rho_0)$," where $P_{\rho_0}[\hat\rho \le c(\rho_0)] = 1 - \alpha$. We obtain $c(\rho)$ either from computer software or by the approximation of Chapter 5. We can show that $P_\rho[\hat\rho \ge c]$ is an increasing function of $\rho$ for fixed $c$, and $c$ can be shown to be monotone increasing in $\rho$; hence, inversion of this family of tests leads to level $1 - \alpha$ lower confidence bounds. We can similarly obtain level $1 - \alpha$ upper confidence bounds and, by putting two level $1 - \frac{1}{2}\alpha$ bounds together, we obtain a commonly used confidence interval for $\rho$.

These intervals do not correspond to the inversion of the size $\alpha$ LR tests of $H : \rho = \rho_0$ versus $K : \rho \ne \rho_0$, but rather of the "equal-tailed" test that rejects if, and only if, $\hat\rho \ge d(\rho_0)$ or $\hat\rho \le c(\rho_0)$, where $P_{\rho_0}[\hat\rho \le d(\rho_0)] = P_{\rho_0}[\hat\rho \ge c(\rho_0)] = 1 - \frac{1}{2}\alpha$. However, for large $n$ the equal-tails and LR confidence intervals approximately coincide with each other.
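One widely used large-sample route to bounds and intervals for $\rho$ is Fisher's z transform. It is offered here only as a hedged stand-in for the approximation referred to above, and it need not coincide with the exact equal-tailed interval (Python with NumPy and SciPy; the function name is ours):

```python
import numpy as np
from scipy import stats

def fisher_z_interval(rho_hat, n, alpha=0.05):
    """Approximate level (1 - alpha) interval for rho via Fisher's z transform.

    A standard large-sample approximation; not necessarily the interval described
    in the text, which is based on the exact distribution of the sample correlation.
    """
    z = np.arctanh(rho_hat)                      # 0.5 * log((1 + r) / (1 - r))
    half = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)
```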
Summary. We find the likelihood ratio tests and associated confidence procedures for four classical normal models. The likelihood ratio test statistic $\lambda$ is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis.

(1) Matched pair experiments in which differences are modeled as $N(\mu, \sigma^2)$ and we test the hypothesis that the mean difference $\mu$ is zero. The likelihood ratio test is equivalent to the one-sample (Student) $t$ test.

(2) Two-sample experiments in which two independent samples are modeled as coming from $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ populations, respectively. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the two-sample (Student) $t$ test.

(3) Two-sample experiments in which two independent samples are modeled as coming from $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ populations, respectively. When $\sigma_1^2$ and $\sigma_2^2$ are known, the likelihood ratio test is equivalent to the test based on $|\bar Y - \bar X|/\sigma_D$. When $\sigma_1^2$ and $\sigma_2^2$ are unknown, we use $(\bar Y - \bar X)/s_D$, where $s_D$ is an estimate of the standard deviation of $D = \bar Y - \bar X$. Approximate critical values are obtained using Welch's $t$ distribution approximation.

(4) Bivariate sampling experiments in which we have two measurements $X$ and $Y$ on each case in a sample of $n$ cases. We test the hypothesis that $X$ and $Y$ are independent and find that the likelihood ratio test is equivalent to the test based on $|\hat\rho|$, where $\hat\rho$ is the sample correlation coefficient. We also find that the likelihood ratio statistic is equivalent to a $t$ statistic with $n - 2$ degrees of freedom.

4.10 PROBLEMS AND COMPLEMENTS

Problems for Section 4.1

1. Suppose that $X_1, \ldots, X_n$ are independently and identically distributed according to the uniform distribution $U(0, \theta)$. Let $M_n = \max(X_1, \ldots, X_n)$ and let $\delta_c = 1$ if $M_n \ge c$, $0$ otherwise.

(a) Compute the power function of $\delta_c$ and show that it is a monotone increasing function of $\theta$.

(b) In testing $H : \theta \le \frac{1}{2}$ versus $K : \theta > \frac{1}{2}$, what choice of $c$ would make $\delta_c$ have size exactly 0.05?

(c) Draw a rough graph of the power function of the $\delta_c$ specified in (b) when $n = 20$.

(d) How large should $n$ be so that the $\delta_c$ specified in (b) has power 0.98 for a given alternative value of $\theta$?

(e) If in a sample of size $n = 20$, $M_n = 0.48$, what is the p-value?

2. Let $X_1, \ldots, X_n$ denote the times in days to failure of $n$ similar pieces of equipment. Assume the model where $X = (X_1, \ldots, X_n)$ is an $\mathcal{E}(\lambda)$ sample, and consider the hypothesis $H$ that the mean life $1/\lambda = \mu \le \mu_0$.
1). Let o:(Tj ) denote the pvalue for 1j.. 1 using a I .a)th quantile of the X~n distribution. Tr are independent test statistics for the same simple H and that each Tj has a continuous distribution. . (b) Show that the power function of your test is increasing in 8. 1nz t Z i . Assume that F o and F are continuous. r. under H. X> 0. .3).) 4.1. . T ~ Hint: See Problem B. . . give a normal approximation to the significance probability. Suppose that T 1 . j=l . Let Xl.. . " i I 5. (cj Use the central limit theorem to show that <l?[ (I"oz( <» II") + VTi(1"  1"0) I 1"] is an approximation to the power of the test in part (a).2. Hint: Use the central limit theorem for the critical value. (d) The following are days until failure of air monitors at a nuclear plant. then the pvalue <>(T) has a unifonn. i I (b) Check that your test statistic has greater expected value under K than under H.<»1 VTi)J .. (Use the central limit theorem. (b) Give an expression of the power in tenns of the X~n distribution.ontinuous distribution. 7.. If 110 = 25. . Show that if H is simple and the test statistic T has a c.. .. j = 1. 6. . . .4 to show that the test with critical region IX > I"ox( 1  <» 12n]. f(x. X n be a 1'(0) sample.O) = (xIO')exp{x'/20'}. U(O. I .Xn be a sample from a population with the Rayleigh density .12..3. Hint: Approximate the critical region by IX > 1"0(1 + z(1 .05? 3.a) is the (1 . Days until failure: 315040343237342316514150274627103037 Is H rejected at level <> ~ 0.. distribution.3. is a size Q test. where x(I . . (0) Show that. Establish (4. .. 0: (a) Construct a test of H : B = 1 versus K : B > 1 with approximate size complete sufficient statistic for this model.r. (a) Use the MLE X of 0 to construct a level <> test for H : 0 < 00 versus K : 0 > 00 ..0 > 0..~llog <>(Tj ) has a X~r distribution.2 L:. . Draw a graph of the approximate power function..4. j .. Hint: See Problem B.270 Testing and Confidence Regions Chapter 4 (a) Use the result of Problem 8. Let Xl. (c) Give an approximate expression for the critical value if n is large and B not too close to 0 or 00.
00.p(Fo(x))IF(x) . .5. 80 and x = 0. T(B) ordered.12(b).o T¢..Section 4.1.10.T(B+l) denote T.5... .d.. Let X I. . 1) to (0. B. Let T(l). then T. In Example 4.) Next let T(I). j = 1.)(Tn ) = LN(O.. Here. That is.o J J x . = ~ I. Show that the test rejects H iff T > T(B+lm) has level a = m/(B + 1).o: is called the Cramervon Mises statistic. . with distribution function F and consider H : F = Fo.1... Fo(x)1 > kol x ~ Hint: D n > !F(x) .u. Suppose that the distribution £0 of the statistic T = T(X) is continuous under H and that H is rejected for large values of T.X(B) from F o on the computer and computing TU) = T(XU)).Fo• generate a U(O.p(Fo(x))lF(x) . . .. . 1.00). Define the statistics S¢. (In practice these can be obtained by drawing B independent samples X(I). if X. .. let . . ~ ll. .. Use the fact that T(X) is equally likely to be any particular order statistic.T(X(l)). 1) variable on the computer and set X ~ F O 1 (U) as in Problem B. . then the power of the Kolmogorov test tends to 1 as n . T(l). to get X with distribution . (b) When 'Ij. . (a) Show that the statistic Tn of Example 4.5.Fo(x)IOdF(x). b > 0.o V¢.) Evaluate the bound Pp(lF(x) . (This is the logistic distribution with mean zero and variance 1. 10.l) (Tn).Fo(x)IOdFo(x) .10 Problems and Complements 271 8.p(F(x)IF(x) .. Express the Cramervon Mises statistic as a sum. (b) Use part (a) to conclude that LN(".(X') = Tn(X). .Fo(x) 10 U¢. is chosen 00 so that 00 x 2 dF(x) = 1.Fo(x)1 foteach.p(u) be a function from (0. Vw. (a) Show that the power PF[D n > kal of the Kolmogorov test is bounded below by ~ sup Fpl!F(x).Fo(x)IO x ~ ~ sup.J(u) = 1 and 0: = 2.T.o sup.2. and 1.. T(B) be B independent Monte Carlo simulated values of T. (b) Suppose Fo isN(O.. (a) For each of these statistics show that the distribution under H does not depend on F o... .5 using the nonnal approximation to the binomial distribution of nF(x) and the approximate critical value in Example 4. .a)/b.6 is invariant under location and scale.1. = (Xi . .Fo(x)1 > k o ) for a ~ 0. and let a > O. Hint: If H is true T(X). 9. 1) and F(x) ~ (1 +exp( _"/7))1 where 7 = "13/. n (c) Show that if F and Fo are continuous and FiFo. X n be d.p(F(x»!F(x) . T(X(B)) is a sample of size B + 1 from La.
f(3... 2. > cJ = a.05. and only if. One has signalMtonoise ratio v/ao = 2. which system is cheaper on the basis of a year's operation? 1 I Ii I il 2.). Show that EPV(O) = P(To > T). I). Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test. For a sample Xl. X n .Fo(T).2 I i i 1.. 2. ! . 00 .a)) = inf{t: Fo(t) > u}. The first system costs $106 . I 1 (b) Show that if c > 0 and a E (0.1. Expected pvalues. which is independent ofT. Let To denote a random variable with distribution F o. Without loss of generality. > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • . = /10 versus K : J. I X n from this population. respectively..2 and 4. 12. Consider a population with three kinds of individuals labeled 1.. (a) Show that L(x. and 3. then the pvalue is U =I . 5 about 14% of the time. take 80 = O. let N I • N 2 • and N 3 denote the number of Xj equal to I.2.) ./"0' Show that EPV(O) = if! ( . j 1 i :J . Suppose that T has a continuous distribution Fe.Fe(F"l(l. Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t).. the other has v/aQ = 1.to on the basis of the N(p" ...1. . [2N1 + N.3. and 3 occuring in the HardyWeinberg proportions f(I.272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale. 3..10. 01 ) is an increasing function of 2N1 + N. (See Problem 4.. A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time. (For a recent review of expected p values see Sackrowitz and SamuelCahn. the power is (3(0) = P(U where FO 1 (u)  < a) = 1. . the UMP test is of the form I {T > c). where if! denotes the standard normal distribution. whereas the other a I ~ a .O) = 0'. You want to buy one of two systems.. the other $105 • One second of transmission on either system COsts $103 each.0) = (1. you intend to test the satellite 100 times. 1999.0). If each time you test. Consider Examples 3. Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 . i i (b) Define the expected pvalue as EPV(O) = EeU. ! J (c) Suppose that for each a E (0. i • (a) Show that if the test has level a. ! . then the test that rejects H if.) I 1 . Let T = X/"o and 0 = /" .2. you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0. where" is known. Whichever system you buy during the year. 2N1 + N. Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1. . Problems for Section 4.0) = 20(1. f(2.L > J.nO /.1) satisfy Pe.') sample Xl. Let 0 < 00 < 0 1 < 1.0)'.
then a.4 and recall (Proposition B.0. Hint: The MP test has power at least that of the test with test function J(x) 8.. find an approximation to the critical value of the MP level a test for this problem. 5. (b) Find the test that is best in this sense for Example 4. 0. derive the UMP test defined by (4. then the maximum of the two probabilities ofmisclassification porT > cJ.. and to population 0 ifT < c. [L(X.6) where all parameters are known.4. if ..<l... I.. . L .0196 test rejects if.e. B" .2. In Example 4. For 0 < a < I. Upon being asked to play.0..2. of and ~o '" I.2.imum probability of error (of either type) is as small as possible. In Examle 4. and only if. A newly discovered skull has cranial measurements (X. . .2.17). MPsized a likelihood ratio tests with 0 < a < 1 have power nondecreasing in the sample size. Bk).2. Hint: Use Problem 4. +.=  Section 4. Bd > cJ = I .2.6) or (as in population 1) according to N(I. +akNk has approx. 6. H : (J = (Jo versus K : (J = (J. Bo. 1. the gambler asks that he first be allowed to test his hypothesis by tossing the die n times. Bo.1.2. 9. Nk) ~ 4. 7. [L(X..7).1. Y) known to be distributed either (as in population 0) according to N(O. Prove Corollary 4. (a) Show that if in testing H : f} such that = f}o versus K : f) = f)l there exists a critical value c Po.N. .Pe.. A fonnulation of goodness of tests specifies that a test is best if the max. PdT < cJ is as small as possible.1..Bd > cJ then the likelihood ratio test with critical value c is best in this sense... = a. find the MP test for testing 10.0. Y) belongs to population 1 if T > c. (a) What test statistic should he use if the only alternative he considers is that the die is fair? (b) Show that if n = 2 the most powerfullevel.o = (1. Find a statistic T( X..2) that linear combinations of bivariate nonnal random variables are nonnaUy distributed.. M(n. (X. I. .10 Problems and Complements 273 four numbers are equally likely to occur (i.2. prove Theorem 4...~:":2:~~=C:="::===='.2. where 11 = I:7 1 ajf}j and a 2 = I:~ 1 f}i(ai 11)2. (c) Using the fact that if(N" . with probability .. Y) and a critical value c such that if we use the classification rule.2. two 5's are obtained.imately a N(np" na 2 ) distribution. Show that if randomization is pennitted.1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4. ..1.
A) = c AcxC1e>'x . (a) Show that L~ K: 1/>. the probability of your deciding to stay open is < 0. • • " 5.. the expected number of arrivals per day. Consider the foregoing situation of Problem 4.1. (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2.. x > O. > 1/>'0.3.01 lest to have power at least 0..(IOn x = 1•. 0) = ( : ) 0'(1. How many days must you observe to ensure that the UMP test of Problem 4. 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! . (c) Suppose 1/>'0 ~ 12. .<»/2>'0 where X2n(1. A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and..&(>').. a model often used (see Barlow and Proschan. Find the sample size needed for a level 0.0)"'/[1.4. • I i • 4. In Example 4. . i' i . Here c is a known positive constant and A > 0 is the parameter of interest. but if the arrival rate is > 15.3 Testing and Confidence Regions Chapter 4 1. hence. ]f the equipment is subject to wear. Suppose that if 8 < 8 0 it is not worth keeping the counter open.. Show that if X I. that the Xi are a sample from a Poisson distribution with parameter 8. r p(x. Let Xl. Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days.. . I i .01. . Use the normal approximation to the critical value and the probability of rejection. show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function.Xn is a sample from a Weibull distribution with density f(x.) 3.Xn be the times in months until failure of n similar pieces of equipment. Hint: Show that Xf .01.<» is the (1 ..95 at the alternative value 1/>'1 = 15. i 1 . I i .1 achieves this? (Use the normal approximation. 274 Problems for Section 4. You want to ensure that if the arrival rate is < 10.Xn is a sample from a truncated binomial distribution with • I. the probability of your deciding to close is also < 0.3. (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00.<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function. . .• n. .i .. 1965) is the one where Xl.: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 . .3. .
 ..[1. 0 < B < 1. . Y n be the Ll. Suppose that each Xi has the Pareto density f(x.6.~) = 1. X N e>.Fo(y)l".1. . ~ > O. (b) NabeyaMiura Alternative. survival times with distribution Fa. 8. P().. To test whether the new treatment is beneficial we test H : ~ < 1 versus K : . X 2 . A > O. then P(Y < y) = 1 e  . X 2. In the goodnessoffit Example 4.XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0.12.. > 1. Let Xl.10 Problems and Complements 275 then 2.•• of Li.1.8 > 80 .O) = cBBx(l+BI. (Ii) Show that if we model the distribotion of Y as C(min{X I . y > 0. X N }).tive. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo. (See Problem 4.. In Problem 1. and consider the alternative with distribution function F(x. b). imagine a sequence X I. . > po.Q)th quantile of the X~n distribution.. Assume that Fo has ~ density fo. . 7. which is independent of XI. and let YI .2 to find the mean and variance of the optimal test statistic. 00 < a < b < 00. against F(u)=u B.d. (i) Show that if we model the distribution of Y as C(max{X I . (b) Find the optimal test statistic for testing H : JL = JLo versus K . we derived the model G(y. Let N be a zerotruncated Poisson. x> c where 8 > 1 and c > O. where Xl_ a isthe (1 .O<B<1. A > O. suppose that Fo(x) has a nonzero density on some interval (a..5. . (a) Express mean income JL in terms of e.Fo(y)  }). Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a.d.. Show how to find critical values. .. Hint: Use the results of Theorem 1. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b).. then e. 6.6. . Find the UMP test.6) is UMP for testing that the pvalues are uniformly distributed.Section 4. survival times of a sample of patients receiving an experimental treatment.. (a) Lehmann Alte~p.B) = Fi!(x).1."" X n denote the incomes of n persons chosen at random from a certain population.).6. 1 ' Y > 0. random variable.) It follows that Fisher's method for cQmbining pvalues (see 4..:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ . For the purpose of modeling. .1.
F(x) = 1 . 00 p(x.0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi ..0".. 1 X n be i.. j 1 To see whether the new treatment is beneficial.3. x > 0.j "'" .. n announce as your level (1 .0) UeB for '. The numerator is an increasing function of T(x).' (cf.O)d.. .'? = 2.' " 2.2. 12. Hint: A Bayes test rejects (accepts) H if J . .X)2. . 0) e9Fo (Y) Fo(Y). = . Oo)d.. 1 • .(O) L':x.. 0 = 0. 0. > Er • . every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t..6t is UMP for testing H : 8 80 versus K : 8 < 80 . Show that the test is not UMP. Show that under the assumptions of Theorem 4. the denominator decreasing. 10. e e at l · 6. i = 1. . 0.(O) 6 .276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y. Problems for Section 4. We want to test whether F is exponential. Show that under the assumptions of Theorem 4.. 'i .  1 j . Let Xl. or Weibull.2 the class of all Bayes tests is complete.i. (a) Show how to construct level (1 .01. Oo)d.0 e6 1 0~0.. Let Xi (f)/2)t~ + €i. . What would you . B> O. I Hint: Consider the class of all Bayes tests of H : () = ..1 and 01 loss. 1 X n be a sample from a normal population with unknown mean J.). we test H : {} < 0 versus K : {} > O.{Od varies between 0 and 1.3. .52. F(x) = 1 . Assume that F o has a density !o(Y). Problem 2..(O) «) L > 1 i The lefthand side equals f6". n. J J • 9.. with distribution function F(x). 1 . Show that under the assumptions of Theorem 4.{Oo} = 1 .exp( x). where the €i are independent normal random variables with mean 0 and known variance .. eo versus K : B = fh where I i I : 11.' L(x. Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1. .d. I .L and unknown variance (T2..2.4 1. Let Xl.X)' = 16.1). /. Show that the UMP test is based on the statistic L~ I Fo(Y.. Using a pivot based on 1 (Xi . O)d.exp( _x B). L(x. x > 0.(O)/ 16' I • 00 p(x.3.
5.. 6.1. 0.. n.n is obtained by taking at = Q2 = a/2 (assume 172 known).02. 0.. (a) Justify the interval [ii.1.4. Suppose that in Example 4.. Show that if Xl. Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a. where c is chosen so that the confidence level under the assumed value of 17 2 is 1 .c.a) interval of the fonn [ X .4.) LCB and q(X) is a level (1.. with X . .Xn be as in Problem 4.) Hint: Use (A.2.(2) UCB for q(8).1) based on n = N observations has length at most l for some preassigned length l = 2d.3 we know that 8 < O. ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0.a.3).al).01 [X .q(X)] is a level (1 ..Section 4. X+SOt no l (1..] .4.~a)!.~a) /d]2 . . I")/so. Suppose we want to select a sample size N such that the interval (4.IN(X  = I:[' lX. What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4.a.4. Stein's (1945) twostage procedure is the following.(al + (2)) confidence interval for q(8).l.IN. Compare yonr result to the n needed for (4.. but we may otherwise choose the t! freely. < 0.SOt no l (1.no further observations. .4.X are ij.(2) " . with N being the smallest integer greater than no and greater than or equal to [Sot". where 8.IN] . . distribution./N..a) confidence interval for B. then the shortest level n (1 .1. minI ii. has a 1. Then take N .1. X!)/~i 1tl ofB. Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2.7).1 Show that. of the form Ji. . Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji.10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 .z(1 .3).1] if 8 > 0. i = 1.1)] if 8 given by (4. X + z(1 . . although N is random.(X) = X . t. then [q(X). Use calculus. .d. Show that if q(X) is a level (1 . find a fixed length level (b) ItO < t i < I.~a)/ . [0.. (Define the interval arbitrarily if q > q. It follows that (1 .. Let Xl.. N(Ji" 17 2) and al + a2 < a. < a. what values should we use for the t! so as to make our interval as short as possible for given a? 3.n . there is a shorter interval 7.
(12 2 9. 11. i (b) What would be the minimum sample size in part (a) if (). Hence.jn is an approximate level (1  • 12..4.') populations.: " are known. exhibit a fixed length level (1 . 4(). (a) Use (A.6. Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment.. (12.a) interval defined by (4. 1"2.l4. (c) If (12. 0. (The sticky point of this approach is that we have no control over N.. By Theorem B. T 2 I I' " I'. X has a N(J1. (a) Show that in Problem 4.) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi.a:) confidence interval of length at most 2d wheri 0'2 is known. The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook. (a) Show that X interval for 8. Let 8 ~ B(n.3) are indeed approximate level (1 .a) confidence interval for (1/1'). we may very likely be forced to take a prohibitively large number of observations. 0) and X = 81n. is _ independent of X no ' Because N depends only on sno' given N = k.fN(X ./3z (1. Show that the endpoints of the approximate level (1 . (b) Exhibit a level (1 ..a) confidence interval for sin 1 (v'e).18) to show that sin 1( v'X)±z (1 .. v. (1  ~a) / 2. 0") distribution and is independent of sn" 8. Let 8 ~ B(n..~a)]. Hint: Set up an inequality for the length and solve for n.001. d = i II: . . it is necessary to take at least Z2 (1 (12/ d 2 observations. . respectively. X n1 and Yll .O)]l < z (1. . (12) and N(1/. Show that these two quadruples are each sufficient..jn(X . use the result in part (a) to compute an approximate level 0.  10. .1') has a N(o. .1. 1947. Indicate what tables you wdtdd need to calculate the interval.!a)/ 4. . > z2 (1  is not known exactly. " 1 a) confidence . Show that ~(). and the fundamental monograph of Wald. ±.3. = 0. Y n2 be two independent samples from N(IJ.051 (c) Suppose that n (12 = 5.a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a)..95 confidence interval for (J.jn is an approximate level (b) If n = 100 and X = 0.. Suppose that it is known that 0 < ~. and.) (lr!d observations are necessary to achieve the aim of part (a). 1986..O).) I ~i . (a) If all parameters are unknown. sn.4. find ML estimates of 11. (J'2 jk) distribution. but we are sure that (12 < (It.0)/[0(1. if (J' is large.~ a) upper and lower bounds. in order to have a level (1 . Let XI. 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 . .0') for Jt of length at most 2d. I .3. I 1 j Hint: [O(X) < OJ = [.
1 and.9 confidence region for (It. 14.10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0. 15. known) confidence intervals of part (b) when k = I. (In practice. In the case where Xi is normal. If S "' 8(64.)'. Compute the (K.". find (a) A level 0. (a) Show that x( "d and x{1. Now use the law of large numbers.)/01' ~ (J1.) Hint: (n .4 and the fact that X~ = r (k. Hint: Use Problem 8.9 confidence interval for {L.4.J1.3.). K.4 = E(Xi .J1.1) + V2( n . Slutsky's theorem. (c) A level 0.ia). and the central limit theorem as given in Appendix A.5 is (1 _ ~a) 2. :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution.Section 4. and (b) (4.I). then BytheproofofTheoremB.(n . Assuming that the sample is from a /V({L. ~). 100.' = 2:.7). Show that the confidence coefficient of the rectangle of Example 4.4.I) + V2( n1) t z( "1) and x(1 . Z.D andn(X{L)2/ cr 2 '" = r (4.4.~Q).3.)4 < 00 and that'" = Var[(X. 1) random variables. and 10 000.)' . In Example 4.n} and use this distribution to find an approximate 1 .2 is known as the kurtosis coefficient.3.' 1 (Xi .2.3. cr 2 ) population.4/0. Xl = Xn_l na). give a 95% confidence interval for the true proportion of cures 8 using (al (4.4 ) . (c) Suppose Xi has a X~ distribution.n(X . See A. is replaced by its MOM estimate.4.3. (n _1)8'/0' ~ X~l = r(~(nl).IUI. . Find the limit of the distribution ofn.". K.) can be approximated by x(" 1) "" (n . (b) A level 0. Suppose that 25 measurements on the breaking strength of a certain alloy yield 11. Now the result follows from Theorem 8. and X2 = X n _ I (1 . Hint: Let t = tn~l (1 .1) t z{1 . XI 16.Q confidence interval for cr 2 .9 confidence interval for cr."') .30. it equals O.' Now use the central limit theorem. See Problem 5. Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed. (d) A level 0.8).9 confidence interval for {L x= + cr. 0). but assume that J1. 10.J1.1)8'/0') . V(o') can be written as a sum of squares of n 1 independentN(O.J1. Hint: By B.t ([en .1.2.027 13.2. 4) are independent. Compare them to the approximate interval given in part (a). = 3.1 is known. .3). .4.<.
.4.F(x)1 . find a level (I .F(x)]dx. . In Example 4. we want a level (1 . distribution function. U(O. 1'1(0) < X < 1'.4. 19.7. )F(x)[l. Assume that f(t) > 0 ifft E (a. X n are i..I Bn(F) = sup vnl1'(x) . indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4. I' .3 for deriving a confidence interval for B = F(x).3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I .1. Show that for F continuous where U denotes the uniform. define . In this case nF(l') = #[X.1 (b) V1'(x)[l. That is.. 18..4. Suppose Xl.4.l). (a) Show that I" • j j j = fa F(x)dx + f. Consider Example 4.~u) by the value U a determined by Pu(An(U) < u) = 1. • i.1. It follows that the binomial confidence intervals for B in Example 4. .1'(x)] Show that for F continuous. . !i . as X and that X has density f(t) = F' (t).6 with .u) confidence interval for 1".F(x)1 Typical choices of a and bare .11 .9) and the upper boundary I" = 00.F(x)1 is the approximate pivot given in Example 4. verify the lower bounary I" given by (4. < xl has a binomial distribotion and ~ vnl1'(x) . (a) For 0 < a < b < I. " • j .05 and .F(x)l.d.4. define An(F) = sup {vnl1'(X) .i.6.r fixed. i 1 .a) confidence interval for F(x).95.u. I (b) Using Example 4.F(x)1 )F(x)[1 . b) for some 00 < a < 0 < b < 00. pl(O) < X <Fl(b)}.I I 280 Testing and Confidence Regions Chapter 4 17. (c) For testing H o : F = Fo with F o continuous. (b)ForO<o<b< I.
2 = VCBs of Example 4..a) UCB for 0. Give a 90% confidence interval for the ratio ~ of mean life times.. M n /0. [8(X) < 0] :J [8(X) < 00 ]. Let Xl. 1/ n J is the shortest such confidence interval.Section 4. Yn2 be independent exponential E( 8) and £(.5. ..1 has size a for H : 0 (}"2 be < 00 . "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function.5 1.· (a) If 1(0.3. (15 = 1. . O ) is an increasing 2 function of + B~. = 1 rejected at level a 0.4 and 8.0.90? 4.2 that the tests of H : . ~ 01). .. is of level a for testing H : 0 > 00 . Yf (1 ..a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 ..2).5.) denotes the o.10 Problems and Complements 281 Problems for Section 4.th quantile of the . Let Xl. Show that if 8(X) is a level (1 .) samples. a (b) Show that the test with acceptance region [f(~a) for testing H : Il.1"2nl.2).1 O? 2.3. X 2 be independent N( 01.4. respectively. (d) Show that [Mn . How small must an alternative before the size 0. respectively. with confidence coefficient 1 . and consider the prob2 + 8~ > 0 when (}"2 is known. x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il.12 and B.2 are level 0. N( O . 5.05. . < XI? < f(1. X2) = 1 if and only if Xl + Xi > c. What value of c givessizea? (b) Using Problems B.(2) together.2n2 distribution.. then the test that accepts. 3. (a) Deduce from Problem 4. and let Il. if and only if 8(X) > 00 . Experience has shown that the exponential assumption is warranted.al) and (1 .~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant.a) LCB corresponding to Oc of parr (a). Hint: If 0> 00 . = 0. show that [Y f( ~a) I X. (e) Similarly derive the level (1 .~a) I Xl is a confidence interval for Il.\. = 1 versns K : Il. . (a) Find c such that Oc of Problem 4.13 show that the power (3(0 1 .Xn1 and YI . Or .1. .3. (b) Derive the level (1..a. test given in part (a) has power 0. for H : (12 > (15. lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 . of l. (e) Suppose that n = 16. Hint: Use the results of Problems B..3.
we have a location parameter family.2). (b) The sign test of H versus K is given by. I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~. . Let C(X) = (ry : o(X. og are independentN (0. . 'TJo) of the composite hypothesis H : ry = ryo. (a) Shnw that C(X) is a level (l . l 7. . T) where 1] is a parameter of interest and T is a nuisance parameter.a) confidence region for the parameter ry and con versely that any level (1 . . . Suppose () = ('T}. of Example 4.~n + ~z(l . N( 020g. 1 Ct ~r (b) Find the family aftests corresponding to the level (1.[XU) (f) Suppose that a = 2(n') < OJ and P.a)y'n.0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses.1 when (72 is unknown. 6. ry) = O}.2)..00 . We are given for each possible value 1]0 of1] a level 0: test o(X.IX(j) < 0 < X(k)] do not depend on 1 nr I. O?. ). L:J ~ ( . >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large. (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing . k . O. Hint: (c) X. . Show that the conclusions of (aHf) still hold.  j 1 j . Thus. (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?. (e) Show directly that P.0:) LeB for 8 whatever be f satisfying our conditions. but I(t) = I( t) for all t.8) where () and fare unknown. Let X l1 .0:) confidence interval for 11.. (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F. . 0. X.a.4.282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2).Xn be a sample from a population with density f (t . (c) Show that Ok(o) (X. respectively. 0. X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 . lif (tllXi > OJ) ootherwise. Show that PIX(k) < 0 < X(nk+l)] ~ . I . and 1 is continuous and positive.
Y.O ~ J(X.a) = (b) Show that 0 if X < z(1 . if 00 < O(S) (S is inconsistent with H : 0 = 00 ). 8( S)] be the exact level (1 . hence. 7)). Show that the Student t interval (4. . 9.z(1 .a))2 if X > z(1 . 1). (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X .10 Problems and Complements 283 8.z(l. ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry.Section 4. Suppose X. Then the level (1 .3 and let [O(S). Let a(S. That is.5.a) . or). 00) is = 0'.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4. Thus. laifO>O <I>(z(1 .00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H. Let 1] denote a parameter of interest. the quantity ~ = 8(8) . 12.5. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'. Y ~ N(r/. ' + P2 l' z(1 1 . p)} as in Problem 4. Yare independent and X ~ N(v.[q(X) < 0 2 ] = 1.pYI < (1 1 otherwise. Y.5.a. 11. Note that the region is not necessarily an interval or ray.1) is unbiased.1.a). 1) and q(O) the ray (X . (b) Describe the confidence region obtained by inverting the family (J(X. Define = vl1J. let T denote a nuisance parameter. a( S. Po) is a size a test of H : p = Po.5.7. Let X ~ N(O. p.Y. and let () = (ry.20) if 0 < 0 and.a).p) = = 0 ifiX .2 a) (a) Show that J(X. alll/. Let P (p. 0) ranges from a to a value no smaller than 1 . Show that as 0 ranges from O(S) to 8(S).5. 10.7. Hint: You may use the result of Problem 4.2a) confidence interval for 0 of Example 4.2. This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line.2.a) confidence interval [ry(x). Establish (iii) and (iv) of Example 4. that suP. 1).
X n x*) is a level a test for testing H : x p < x* versus K : x p > x*. k(a) .b) (a) Show that this statement is equivalent to = 1. (See Section 3. Detennine the smallest value k = k(a) such that Jk(Q) has level a for H and show that for n large. (g) Let F(x) denote the empirical distribution. ! ..5. That is. . . Construct the interval using F. . 1 . (X(k)l X(n_I») is a level (1 . F(x). Thus.a) LeB for x p whatever be f satisfying our conditions. (a) Show that testing H : x p < 0 versus I< : x p > 0 is equivalent to testing H' : P(X > 0) < (I .. Show that Jk(X I x'.p) versus K' : P(X > 0) > (1 . (t) Show that k and 1in part (e) can be approximated by h (~a) and h (1  ~a) where ! heal is given in part (b). (e) Let S denote a B(n. .6 and 4.7. Confidence Regions for QlIalltiles. where ! 1 1 heal c: n(1  p) + zl_QVnp(1  pl.4.. . i j (d) Deduce that X(n_k(Q)+I) (XU) is the jth order statistic of the sample) is a level (1 . " II [I . I • (c) Let x· be a specified number with 0 < F(x') < 1.a..a = P(k < S < n I + 1) = pi(1. That is.1» . Then   P(P(x) < F(x) < P+(x)) for all x E (a. and F+(x) be as in Examples 4.h(a).. lOOx. Let x p = ~ IFI (p) + Fi! I (P)]. it is distribution free.05 could be the fifth percentile of the duration time for a certain disease. In Problem 13 preceding we gave a disstributionfree confidence interval for the pth quantile x p for p fixed.a. 0 < p < 1. il.p)"j. We can proceed as follows. (b) The quantile sign test Ok of H versus K has critical region {x : L:~ 1 1 [Xi > 0] > k}. Suppose we want a distributionfree confidence region for x p valid for all 0 < p < 1. be the pth quantile of F.p) variable and choose k and I such that 1 . Show that the interval in parts (e) and (0 can be derived from the pivot T(x ) p  . Simultaneous Confidence Regions for Quantiles.284 Testing and Confidence Regions Chapter 4 13. Let F. L..F(x p)].p). Let Xl. = ynIP(xp) .4. . P(xp < x p < xp for all p E (0.1 and Fu 1. iI . X n be a sample from a population with continuous distribution F.a) confidence interval for x p whatever be F satisfying our conditions. Show that P(X(k) < x p < X(nl)) = 1. k+ ' i .) Suppose that p is specified. vF(xp) [1 F(xp)] " Hint: Note that F(x p ) = p.95 could be the 95th percentile of the salaries in a certain profession l or lOOx . 14.
I).j < F_x(x)] See also Example 4.Section 4.. that is.z. ~ ~ (a) Consider the test statistic x . Hint: Let t. F(x) > pl. (b) Express x p and .. Note the similarity to the interval in Problem 4. . .4.r" ~ inf{x: a < x < b.5. LetFx and F x be the empirical distributions based on the i.. the desired confidence region is the band consisting of the collection of intervals {[.13(g) preceding. F f (.xp]: 0 < l' < I}. . Suppose X denotes the difference between responses after a subject has been given treatments A and B.X n .u are the empirical distributions of U and 1 . n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X. L i==I n I IFx (Xi) .) < p} and. (c) Show how the statistic An(F) of Problem 4.. 15. We will write Fx for F when we need to distinguish it from the distribution F_ x of X. Give a distributionfree level (1 . VF(p) = ![x p + xlpl. . then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U ). Express the band in terms of critical values for An(F) and the order statistics.U with ~ U(O. That is.x.1. The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R. where F u and F t . Suppose that X has the continuous distribution F.(x) = F ~(Fx(x» .. then L i==l n IIXi < X + t. TbeaitemativeisthatF_x(t) IF(t) forsomet E R..d. where p ~ F(x).10 Problems and Complements 285 where x" ~ SUp{..X I.r p in terms of the critical value of the Kolmogorov statistic and the order statistics. (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X.. .i. X n and .17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p . where A is a placebo.p.r < b. .T: a <. X I.(x)] < Fx(x)] = nF1_u(Fx(x)).Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}.. F~x) has the same distribution Show that if F x is continuous and H holds.1.
. Properties of this and other bands are given by Doksum. Also note that x = H..:~ I1IFx(X... Give a distributionfree level (1. Hint: nFx(t) = 2...F x l (l . Hint: Define H by H.) < Fx(x)) = nFv(Fx(x)). the probability is (1 . where x p and YP are the pth quan tiles of F x and Fy. To test the hypothesis H that the two treatments are equally effective. he a location parameter as defined in Problem 3.3. ..x.1.a) simultaneous confidence band for A(x) = F~l:(Fx(x)) .. Let 8(. F y ) has the same distribntion as D(Fu . F y ) . ..x (': + A(x)). As in Example 1. nFy(t) = 2. It follows that if c. vt(p)) is the band in part (b). treatment A (placebo) responses and let Y1 . We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y .) < Fx(t)] = nFu(Fx(t)).:~ II[Fx(X. .Fy ) = maxIFy(t) 'ER "" . we get a distributionfree level (1 . vt = O<p<l sup vt(p) O<p<l where [V.t::. treatment B responses.) ~ < F x (")] ~ ~ = nFu(Fx(x)).d.i.) = D(Fu . It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H. then D(Fx.U ).17.:~ 1 1 [Fy(Y... Hint: Let A(x) = FyI (Fx(x)) .Fy ): 0 < p < 1}.a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(.a) simultaneous confidence band 15. then  nFy(x + A(X)) = L: I[Y. i=I n 1 i t .x p . is the nth quantile of the distribution of D(Fu l F 1 . Fx(t)l· . I I I 1 1 D(Fx.2A(x) ~ HI(F(x)) l 1 + vF(F(x)).F1 _u). Let t (c) A Distdbution and ParameterFree Confidence Interval.F~x. . where do:.d. Show that for given F E F. Fx.) is a location parameter} of all location parameter values at F. let Xl. < F y I (Fx(x))] i=I n = L: l[Fy (Y. 1 Yn be i.1) empirical distributions. i .x = 2VF(F(x)). where F u and F v are independent U(O.l (p) ~ ~[Fxl(p) .(x) ~ R.286 Testing and Confidence Regions ~ Chapter 4 '1. we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R.) : :F VF ~ inf vF(p).Jt: 0 < p < 1) for the curve (5 p (Fx.p)] = ~[Fxl(p) + F~l:(p)]. i I f F.'" 1 X n be Li.. then D(Fx . ~ ~ ~ F~x. The result now follows from the properties of B(·).. Moreover..(F(x)) . F y ). and by solving D( Fx. nFx(x) we set ~ 2. Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ .) < Fy(x p )] = nFv (Fx (t)) under H. . Then H is symmetric about zero.5. Fenstad and Aaberge (1977). (p). = 1 i 16. "" = Yp . <aJ Show that if H holds. for ~. (b) Consider the parameter tJ p ( F x."..) < do. where :F is the class of distribution functions with finite support.
(a) Consider the model of Problem 4. Fy.0:) simultaneouS coofidence band for t. Let6.6]. then B( Fx . 6.. then Y' = Y. ~ ~ = (e) A Distribution and ParameterFree Confidence Interval.6 1. thea It a unifonnly most accurate level 1 . Now apply the axioms. Show that for given (Fx . Let do denote a size a critical value for D(Fu .(X).) < do: for Ll.Xn is a sample from a r (p.4. the probability is (1 . (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T.a) that the interval [8. t. Fx +a) = O(Fx a. (d) Show that E(Y) ..0: UCB for J.·) is a shift parameter.)I L ti . t Hint: Set Y' = X + t.(x) = Fy(x thea D(Fx. x' 7 R. Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976). ) is a shift parameter} of the values of all shift parameters at (Fx .Section 4. 0 < p < 1.a) UCB for O. ~ D(Fu •F v ). where is nnknown. F y ) > O(Fx " Fy). = 22:7 I Xf/X2n( a) is .T.= minO<p<l 6.Fy). . ~) distribution. moreover X + 6 < Y' < X + 6. Show that B = (2 L X. F y ) is in [6.6+] contains the shift parameter set {O(Fx. Fy ) . Show that if 0(·. then by solving D(F x. F y. F y ) and b _maxo<p<l 6p(Fx . B(.. 6+ maxO<p<1 6+(p). 3. O(Fx.E(X).2uynz(1 i=! i=l all L i=l n ti is also a level (1. (b) Consider the unbiased estimate of O... . Suppose X I.4. Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('. nFx ("') = nFu(Fx(x)). Fv ). Show that for the model of Problem 4.    Problems for Section 4. where :F is the class of distri butions with finite support..) ~ + t. Fy ).) 12:7 I if.(Fx . F y ) E F x F.) = F (p) .3. is called a shift parameter if O(Fx. F x ) ~ a and YI > Y..)..(.L. if I' = 1I>'.10 Problems and Complements 287 Moreover.2. and 8 are shift parameters. Exhibit the UMA level (1 ..) I i=l L tt .F y ' (p) = 6.(P). . F y ) < O(Fx . :F x :F O(Fx.). . Q.0:) confidence bound forO. 2. F y. Let 6 = minO<1'<16p(Fx.6. F y ).) . It follows that if we set F\~•.('. we find a distributionfree level (1 .(x)) . tt Hint: Both 0 and 8* are nonnally distributed. T n n = (22:7 I X.2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8. Show that n p is known and 0 O' ~ (2 2~A X.
4.3... • • • = t) is distributed as W( s. I'. . . s).d. i .(O .2. t > e. 0']. . In Example 4.C. Xl. e> 0. 1I"(t) Xl. '" J.X. density = B.4.1.6 to V = (B .. U = (8 . and that B has the = se' (t. distribution and that B has beta. .6 for c fixed. distribution with r and s positive integers.B')+. .l. Establish the following result due to Pratt (1961). Problems for Section 4.3. F1(tjdt.\. respectively.f:s F. p. 0). 2. Poisson. .1. Show that if F(x) < G(x) for all x and E(U)..) and that. 8.i. (0 . U(O.2. Hint: By Problem B.\ is distributed as V /050. J • 6.3).' with "> ~. [B. . then E.2. (b) Suppose that given B = B.a) confidence intervals such that Po[8' < B' < 0'] < p. Suppose that given. .0 upper and lower confidence bounds for J1 in the model of Problem 4.288 Testing and Confidence Regions Chapter 4 4. Prove Corollary 4. uniform. Let U. (B" 0') have joint densities.O] are two level (1 . ~. V be random variables with d.B)+. t)dsdt = J<>'= P. B). • .3.B) = J== J<>'= p( s. s). Suppose [B'.aj LCB such that I .E(V) are finite.B).B).6.\ =. Hint: Apply Problem 4. f3( T. .2 and B. where Show that if (B. P(>. /3( T. Suppose that B' is a uniformly most accurate level (I . (1: dU) pes.12 so that F.B(n. .3 and 4. Suppose that given B Pareto.B') < E. and for ().d.1 are well defined and strictly increasing. OJ. ·'"1 1 .( 0' Hint: E.6.. s > 0. E(U) = .6. Pa( c.W < BI = 7. j . Hint: Use Examples 4. X has a binomial..2" Hint: See Sections 8. establish that the UMP test has acceptance region (4.. n = 1. (a) Show that if 8 has a beta."" X n are i.W < B' < OJ for allB' l' B. g.Bj has the F distribution F 2r . where s = 80+ 2n and W ~ X.6. then )" = sB(r(1 . G corresponding to densities j. t) is the joint density of (8.a. > 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I .l2(b). satisfying the conditions of Problem 8. 3. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for.[B < u < O]du..7 1 . s) distribution with T and s integers. 1. 5. Construct unifoffilly most accurate level 1 .a) upper and lower credible bounds for A. where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t. then E(U) > E(V).Xn are Ll.
.2 degrees of freedom.. Here Xl..r 1x. and Y1.a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B.rlm) density and 7f(/12 1r. Show that (8 = S IM ~ Tn) ~ Pa(c'. r) where 7f(Ll I x . . (a) Give a level (1 . . r( TnI (c) Set s2 + n. r > O.0:) confidence interval for B. .Xm. . as X. /11 and /12 are independent in the posterior distribution p(O X. Yn are two independent N (111 \ 'T) and N(112 1 'T) samples. Suppose 8 has the improper priof?r(O) = l/r. = So I (m + n  2). r 1x.1.. Problems for Section 4.y. .. 'P) is the N(x ... . Show fonnaUy that the posteriof?r(O 1 x.1.Xn ). y) of is (Student) t with m + n . y) is proportional to p(8)p(x 1/11. . 1 r. .. y). = PI ~ /12 and'T is 7f(Ll.Xn + 1 be i.8 1.l. y) and that the joint density of. Let Xl.m} and + n. Xl.2.y) = 7f(r I sO)7f(LlI x .0"5). Suppose that given 8 = (ttl' J. (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll.y. where (75 is known.r)p(y 1/12.112. 'T)..s') withe' = max{c..x)1f(/1. y) is obtained by integrating out Tin 7f(Ll..x) is aN(x..y.d.y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2. ". (b) Show that given r. (a) Let So = E(x. T) = (111.x)' + E(Yj y)'. (c) Give a level (1 .. r In) density..y) is 1f(r 180)1f(/11 1 r. 1f(/11 1r. . y) is aN(y.1 )) distribution.0:) prediction interval for Xnt 1. Hint: p( 0 I x. In particular consider the credible bounds as n + 00. 4. Show thatthe posterior distribution 7f( t 1 x.Section 4. respectively. N(J.Xn is observable and X n + 1 is to be predicted.r).a) upper and lower credible bounds for O.10 Problems and Complements 289 (a)LetM = Sf max{Xj. . (d) Compare the level (1.. (b) Find level (1 . Hint: 7f(Ll 1x.i.
3) by doing a frequentist computation of the probability of coverage. CJ2) with (J2 unknown. Establish (4. . Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X . X is a binomial.(0 I x)dO. . n = 100. 1 X n are i.12. 0'5)' Take 0'5 ~ 7 2 = I. ! " • Suppose Xl.x . • . give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a. 00 < a < b (1 .a) lower and upper prediction bounds for X n + 1 . That is. Present the results in a table and a graph.. b). give level (I .8.) Hint: First show that ...2. and a = .5. Suppose that Y. ..a (P(Y < Y) > I ...e..d. . ~o = 10.t for M = 5. . 0 > 0. .0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl. 0) distribution given (J = 8. and that (J has a beta.. (3(r. distribution. . . Find the probability that the Bayesian interval covers the true mean j. Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 .. i! Problems for Section 4. where Xl. .nl. as X .2n distribution. distribution. s). B(n. Suppose that given (J = 0.d.. give level 3. Then the level of the frequentist interval is 95%.. suppose Xl.i..Q) distribution free lower and upper prediction bounds for X n + 1 • < 00.s+nx+my)/B(r+x.0'5) with 0'5 known. Un + 1 ordered. . 0). I x) is sometimes called the Polya q(y I x) = J p(y I 0). which is not observable...8. Suppose XI. (b) If F is N(p" bounds for X n + 1 .a).'" .i.. . q(ylx)= ( .9.05.15.8.• . Give a level (1 .Xn are observable and we want to predict Xn+l. Let Xl. N(Jl.. F. 4. let U(l) < . random variable. In Example 4. X n such that P(Y < Y) > 1. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I . x > 0..8).d. 5. A level (1 . . (a) If F is N(Jl.. (This q(y distribution. :[ = . " u(n+l).10.' . has a B(m.9 1. as X where X has the exponential distribution F(x I 0) =I . 1 2. )B(r+x+y.x ) where B(·..Xn are observable and X n + 1 is to be predicted.290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4.Xn + 1 be i.8. X n+ l are i.11.'.2) hy using the observation that Un + l is equally likely to be any of the values U(I).i.·) denotes the beta function. B( n.9. Let X have a binomial.a) prediction interval for X n + l .5.10.s+n. < u(n+l) denote Ul .
0'2) sample with both JL and 0'2 unknown.Q). (ii) CI ~ C2 = n logcl/C2. where F is the d.(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy. In testing H : It < 110 versus K . (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B. onesample t test is the likelihood ratio test (fo~ Q <S ~).(x) = . = aD nO' 1".Q. 0'2 < O'~ versus K : 0'2 > O'~.3.10 Problems and Complements 291 is an increasing function of (2x . We want to test H . · .3. XnI (1 . 0' 4. Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise. = Xnl (1 . of the X~I distribution.(x) Hint: Show that for x < !n..(x) = 0. Xn_1 (~a).( x) ~ 0.V2nz(1 . and only if. n Cj I L.Section 4.X) aD· t=l n > C. JL > Po show that the onesided. Thus.='7"nlog c l n /C2n CIn . log . >. and only if. 2 2 L. let X 1l .. OneSided Tests for Scale./(n .f.(nx).I)) for Tn > 0.C2n !' 1 as n !' 00. We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if. TwoSided Tests for Scale. ~2 .(Xi ... (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12..log (ii' / (5)1 otherwise.~Q) approximately satisfy (i) and also (ii) in the sense that the ratio .) .~Q) n + V2nz(l. (i) F(c.. I .F(CI) ~ 1. Show that (a) Likelihood ratio tests are of the form: Reject if. .~Q) also approximately satisfy (i) and (ii) of part (a). if Tn < and = (n/2) log(1 + T.n) and >. ° 3. (c) Deduce that the critical values of the commonly used equaltailed test.. Hint: log .. (b) Use the nonnal approximatioh to check that CIn C'n n .. In Problems 24. where Tn is the t statistic.Xn be a N(/L. if ii' /175 < 1 and = . 2.
. . where a' is as defined in < ~. I'· (c) Find a level 0.Xn1 and YI .90 confidence interval for a by using the pivot 8 2 ja z . 1 I 794 2012 1800 2477 576 3498 411 2092 897 1808 I Assume the twosample normal model with equal variances.3. Show that A(X. n.. Suppose X has density p(x. 7. €n are independent N(O. = 0 and €l. . .95 confidence interval for equaltailed tests of Problem 4.292 Testing and Confidence Regions Chapter 4 5. a 2 ) random variables.9. Let Xl.05 onesample t test. 0'2).95 confidence interval for p. e 1 ) depends on X only throngh T. (b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use ct = 0. (a) Find a level 0. (b) Consider the problem aftesting H : Jl. • Xi = (JXi where X o I 1 + €i. Y.lIl (7 2) and N(Jl.4. find the sample size n needed for the level 0. . Forage production is also measured in pounds per acre. .. The nonnally distributed random variables Xl. Assume the onesample normal model. 190. (7 2) is Section 4. . The following blood pressures were obtained in a sample of size n = 5 from a certain population: 124. whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre. (c) Using the normal approximation <l>(z(a)+ nl nz/n(1'1 I'z) /(7) to the power. 6.1'2. . 0'2 ~ ! corresponding to inversion of the (c) Compute a level 0.01 test to have power 0. (a) Using the size 0: = 0. respectively. 100. I Yn2 be two independent N(J. 110.. 114. .. The control measurements (x's) correspond to 0 pounds of mulch per acre.P.Xn are said to be serially correlated or to follow an autoregressive model if we can write ~ .L.2' (12) samples. and that T is sufficient for O. ..05.. x y vi I I . can we conclude that the mean blood pressure in the population is significantly larger than IDO? (b) Compute a level 0.9.. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall.1 < J.. 8.O). .tl > {t2. (X. (a) Show that the MLE of 0 ~ (1'1. 0 E e. .z .90 confidence interval for the mean bloC<! pressure J..95 when nl = nz = ~n and (1'1 1'2)/17 ~ ~.l. i = 1.L2 versus K : j. 9. Assume a Show that the likelihood ratio statistic is equivalent to the twosample t statistic T.. eo..
. P. Let e consist of the point I and the interval [0. Suppose that T has a noncentral t. Condition on V and apply the double expectation theorem. Yn2 be two independent N(P. (Yl.6.. XZ distributions respectively. / L:: i 10.Section 4. Define the frequency functions p(x. (T~).<» (1 . Xo = o. (a) P.<»1 < e < <>. and only if. . (An example due to C. . Then. Yl.and twosided t tests.O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if. 12. I). P. Y2 )dY2. respectively. The power functions of one. 7k. = 0.O) for 00 = (Z.. get the joint distribution of Yl = ZI vVlk and Y2 = V. . Hint: Let Z and V be independent and have N(o.. i = 1. 0) by the following table. II. 7k. (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl. has level a and is strictly more 11.2) In exp{ (I/Z(72) 2)Xi . Let Xl..IITI > tl is an increasing function of 161. 13.IIZI > tjvlk] is increasing in [61. Stein).'" p(X.[Z > tVV/k1 is increasing in 6.6.OXi_d 2} i= 1 < Xi < 00. X n1 . X powerful whatever be B.n. Fix 0 < a < and <>/[2(1 . (Yl) = f PY"Y. Show that. has density !k . The F Test for Equality of Scale. J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint.. .[T > tl is an increasing function of 6. ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~. for each v > 0. . Consider the following model. samples from N(J1l. . 1 {= xj(kl)ellx+(tVx/k'I'ldx. Show that the noncentral t distribution.2. (Tn. ... Then use py...10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl. From the joint distribution of Z and V. distribution. (b) P.(t) = . with all parameters assumed unknown.
. Un = Un.Y)'/E(X.). p) distribution. 0. .0 has the same dis r j .. . p) distribution. (Xn . . can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I .9.a/2) or F < f( n/2). note that . .1 .~ I .. Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model.nl1 ar is of the fonn: Reject if. ! . Because this conditional distribution does not depend on (U21 •. where f( t) is the tth quantile of the F nz .8. the continuous versiop of (B. . Let (XI. i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2.' ! .1I(a). Yn) be a sampl~ from a bivariateN(O. V.4.. (d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4. The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study. distribution and that critical values can be . i=2 i=2 n n S12 =L i=2 n U. (1~. 1. .62 284 2.7 to conclude that tribution as 8 12 /81 8 2 .96 279 2.9. Sbow. · . YI ). show that given Uz = Uz . Vn ) is a sample from a N(O.61 310 2.4. • .1. and use Probl~1ll4.10. .64 298 2. > C. . .. . vn 1 l i 16. I ' LX.4.4. F > /(1 . <. 0.1' Sf = L U.. that T is an increasing function of R. Un).7 and B.V. . !I " . i 0 in xi !:. T has a noncentral Tnz distribution with noncentrality parameter p.0 has the same distribution as R.X)' (b) Show that (aUa~)F has an F nz obtained from the :F table..'. where a.68 250 2..~ I i " 294 Testing and Confidence Regions Chapter 4 . 1. .37 384 2. a?. . and only if.24) implies that this is also the unconditional distribution.19 315 2. .4) and (4.8. F ~ [(nl . using (4. 15.1)/(n. (11. Finally. 14.71 240 2. . Let R = S12/SIS" T = 2R/V1 R'.94 Using the likelihood ratio test for the bivariate nonnal model. as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2. x y 254 2.'. (Un.. Argue as in Proplem 4. p) distribution. . then P[p > c] is an increasing function of p for fixed c. .5) that 2 log '(X) V has a distribution. .4. (a) Show that the LR test of H : af = a~ versus K : ai > and only if. and using the arguments of Problems B. r> (c) Justify the twosided F test: Reject H if.12 337 1. (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where .nt 1 distribution. L Yj'. • I " and (U" V. Consider the problem of testing H : p = 0 versus K : p i O. 1 .1)]E(Y. .9.J J . Hint: Use the transfonnations and Problem B. si = L v. where l _ .
P.895 and t = 0..[ii . T6 .0. 4.05 and 0. W. Box.0.. Consider the cases Ii. The term complete is then reserved for the class where strict inequality in (4. i] versus K : 8 't [i.3.01 + 0. and r. respectively. 1974) to the critical value is <. AND J. the class of Bayes procedures is complete.15 is used here. T.4 (I) If the continuity correction discussed in Section A.o = 0.819 for" = 0. Leonard. F.[ii) where t = 1. "Is there a sex bias in graduate admissions?" Science. 4.3) holds for some 8 if <p 't V. it also has confidence level (1 . !. R.. REFERENCES BARLOW. AND F.9. (2) The theory of complete and essentially complete families is developed in Wald (1950). and C.Section 4. Notes for Section 4. Apology for Ecumenism in Statistics and Scientific lnjerence.3).. BICKEL. (b) Compare your solution to the Bayesian solution based on a continuous loss function rro = 0. . S in 8(X) would be replaced by 5 + and 5 in 8(X) is replaced by 5 . Rejection is more definitive..(t) = tl(.12 1965. Box.3 (1) Such a class is sometimes called essentially complete.1 (1) The point of view usually taken in science is that of Karl Popper [1968]. (a) Find the level" LR test for testing H : 8 E given in Problem 3. Stephens. see also Ferguson (1967). Essentially.10. if the parameter space is compact and loss functions are bounded. (2) We ignore at this time some reallife inadequacies of this experiment such as the placebo effect (see Example 1.11 NOTES Notes for Section 4. 187. Data Analysis and Robustness. Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding. i]. Wu.9. G. 1973. PROSCHAN. 1983. HAMMEL. .. Tl . Editors New York: Academic Press. O'CONNELL. Because the region contains C(X).~. P. Consider the bioequivalence example in Problem 3. Nntes for Section 4. (2) In using 8(5) as a confidence hound we are using the region [8(5). Mathematical Theory of Reliability New York: J. 00..).1.035.2.851. E.01. (3) A good approximation (Durbin.. Wiley & Sons. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete.11 Notes 295 17. E. G.11.398404 (975).2. 00.I\ E.
I . WETHERDJ. SAMUEL<:AHN.. 54. SIEVERS. G. "Probabilities of the type I errors of the Welch tests. A. L. c. DOKSUM. STEIN. 1952. WANG. OLlaN. 326331 (1999). 13th ed." The American Statistician. 243246 (1949). W. K. 38. the Growth ofScientific Knowledge. R. 1958. CAl.. Sequential Methods in Statistics New York: Chapman and Hall." Biometrika.. New York: Springer.• Statistical Methods for Research Workers. Philadelphia. T. 53. S.. j 1 . WALD. 64.. 473487 (1977). Sequential Analysis New York: Wiley." J.." J. Assoc.. A. R. 1986. 3rd ed. "Distribution theory for tests based on the sample distribution function. .. SACKRoWITZ. W. S. Statistical Theory with Engineering Applications . "Plots and tests for symmetry. Amer. AND G. AARBERGE. L." Regional Conference Series in Applied Math. New York: I J 1 Harper and Row. 56.. Amer. D. STEPHENs. B. KLETI. DURBIN. 8. PRATI.296 Testing and Confidence Regions Chapter 4 BROWN. ! I I I FERGUSON. A. K.. Mathematical Statistics New York: J. Y. j j New York: 1. Testing Statistical Hypotheses. Assoc. 1997. 730737 (1974). "A twosample test for a linear hypothesis whose pOWer is independent of the variance. . Sratisr. 36. G. "Length of confidence intervals." 1. K. Statistical Decision Functions New York: Wiley.. R.• Conjectures and Refutations.." Ann.. ." Biometrika." J. ! . A. Statist. Assoc.. M. JEFFREYS. DAS GUPTA. A. E. . "P values as random variablesExpected P values. New York: Hafner Publishing Company. Amer. 63. "On the combination of independent test statistics. 1961. WILKS. LEHMANN. 549567 (1961)... Mathematical Statistics. 421434 (1976). Pennsylvania (1973). TATE. A Decision Theoretic Approach New York: Academic Press. H. GLAZEBROOK. VAN ZWET.." The AmencanStatistician. Amer. 1967. AND K. . 9. "EDF statistics for goodness of fit. SIAM. 66. 54 (2000). The Theory of Probability Oxford: Oxford University Press. OSTERHOFF. F. 2nd ed. 1947. Statist. 605608 (1971).. HAlO. 1950.. Statist. FISHER. 1. 1985. "Interval estimation for a binomial proportion.. j 1 . POPPER. AND 1.. 1968." An" Math. V. 1%2. AND E. 69.. . HEDGES. Aspin's tables:' Biometrika. Wiley & Sons. H. "PloUing with confidence: Graphical comparisons of two populations. AND I. Statist. . 16. FL: Academic Press. R. I . 659680 (1967). WELCH. Math. AND G. A. AND A. J. WALD.. DOKSUM. FENSTAD. Statistical Methods for MetaAnalysis Orlando.243258 (1945). D. AND R.674682 (1959). "Further notes on Mrs. T. . Wiley & Sons. "Optimal confidence intervals for the variance of a normal distribution. Statist.. L.
Chapter 5

ASYMPTOTIC APPROXIMATIONS

5.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS

Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific P by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample X1, ..., Xn from a distribution F, our setting for this section and most of this chapter. If we want to estimate mu(F) = E_F(X1) and use X̄, we can write

E_F(X̄ - mu(F))^2 = sigma^2(F)/n.   (5.1.1)

This is a highly informative formula, telling us exactly how the MSE behaves as a function of n, and calculable for any F and all n by a single one-dimensional integration.

To go one step further, consider med(X1, ..., Xn) as an estimate of the population median nu(F) = F^{-1}(1/2). If n is odd and F has density f, we can write

E_F(med(X1, ..., Xn) - nu(F))^2 = Integral (x - nu(F))^2 g_n(x) dx   (5.1.2)

where, from Problem (B.2.13), if n = 2k + 1,

g_n(x) = n (2k choose k) F^k(x) (1 - F(x))^k f(x).   (5.1.3)

Evaluation here requires only evaluation of F and a one-dimensional integration, but a different one for each n (Problem 5.1.1). Worse, the qualitative behavior of the risk as a function of n and simple parameters of F is not discernible easily from (5.1.2) and (5.1.3).

Worse, consider evaluation of the power function of the one-sided t test of Chapter 4. If X1, ..., Xn are i.i.d. N(mu, sigma^2), we have seen in Section 4.9.2 that sqrt(n) X̄/S has a noncentral t distribution with parameter mu/sigma and n - 1 degrees of freedom. This distribution may be evaluated by a two-dimensional integral using classical functions (Problem 5.1.2), and its qualitative properties are reasonably transparent.
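The one-dimensional integration in (5.1.2)-(5.1.3) is easy to carry out numerically. The following minimal sketch (assuming Python with numpy and scipy, which are not part of the text; the choice F = N(0, 1) is just for illustration, so that nu(F) = 0) evaluates the MSE of the sample median and compares it with the classical large-sample approximation 1/(4 n f^2(nu(F))) = pi/(2n).

```python
# Sketch: MSE of the sample median by one-dimensional integration, using (5.1.2)-(5.1.3).
import numpy as np
from scipy import integrate, special, stats

def median_mse(n, F=stats.norm()):
    """MSE of the sample median for odd n = 2k + 1 under the distribution F."""
    assert n % 2 == 1
    k = (n - 1) // 2
    nu = F.ppf(0.5)                          # population median nu(F)
    def g(x):                                # density (5.1.3) of med(X_1, ..., X_n)
        Fx = F.cdf(x)
        return n * special.comb(2 * k, k) * Fx**k * (1 - Fx)**k * F.pdf(x)
    val, _ = integrate.quad(lambda x: (x - nu) ** 2 * g(x), -np.inf, np.inf)
    return val

for n in [11, 51, 201]:
    # compare with the classical approximation 1/(4 n f(nu)^2) = pi/(2n) for N(0, 1)
    print(n, round(median_mse(n), 5), round(np.pi / (2 * n), 5))
```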
Xn)}n>l. where A= { ~ . Monte Carlo is described as follows.o(X1). Draw B independent "samples" of size n. The other.fiit !Xi.n (Xl.Xn):~Xi<n_l (~2 (EX. where X n = ~ :E~ of medians. of distributions of statistics. X n + EF(Xd or p £F( .1.i. The classical examples are. is to approximate the risk function under study by a qualitatively simpler to understand and easier to compute function. which we explore further in later chapters. Rn (F). .. save for the possibility of a very unlikely event. 00. It seems impossible to determine explicitly what happens to the power function because the distribution of fiX / S requires the joint distribution of (X.Xnj ) . in this context. .. but for the time being let's stick to this case as we have until now. Approximately evaluate Rn(F) by _ 1 B i (5.. we can approximate Rn(F) arbitrarily closely. which occupies us for most of this chapter.3. VarF(X 1 ». j=l • By the law of large numbers as B j.I I !' 298 Asymptotic Approximations Chapter 5 (Problem 5. . But suppose F is not Gaussian. . S) and in general this is only representable as an ndimensional integral. 10=1 f=l There are two complementary approaches to these difficulties. X n as n + 00. The first. fiB ~ R n (F). . . We shall see later that the scope of asymptotics is much greater. {X 1j l ' •• . or the sequence Asymptotic statements are always statements about the sequence.2) and its qualitative properties are reasonably transparent. is to use the Monte Carlo method. Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asyrnptotics briefly in Example 5.. more specifically. based on observing n i. I j {Tn (X" . X nj }. 1 < j < B from F using a random number generator and an explicit fonn for F. _. . for instance the sequence of means {Xn }n>l.1. observations Xl.. In its simplest fonn. . . Asymptotics. just as in numerical integration.4) i RB =B LI(F. i .. • .)2) } . or it refers to the sequence of their distributions 1 Xi.fii(Xn  EF(Xd) + N(O. . always refers to a sequence of statistics I.3.. Thus.d.
14.l1) states that if EFIXd' < 00.14../' is as above and .' m (5.1.1.f.6) and (5. (5.9) when . the celebrated BerryEsseen bound (A. if EFXf < 00.01. £ = .. say. .' 1'1 > €] < . 1') < z] ~ ij>(z) (5.1.3). for n sufficiently large..2 . PF[IX n _  . 1'1 > €] < 2exp {~n€'}..7) where 41 is the standard nonnal d.1. X n ) but in practice we act as if they do because the T n (Xl.. below .2 is unknown be. n = 400.1. the right sup PF x [r. and P F • What we in principle prefer are bounds. .Section 5.1 Introduction: The Meaning and Uses of Asymptotics 299 In theory these limits say nothing about any particular Tn (Xl.1. . X n )) (in an appropriate sense).2 hand side of (5.8) Again we are faced with the questions of how good the approximation is for given n.. . if EFIXd < 00. x. the much PFlIx n  Because IXd < 1 implies that .1.1. DOD? Similarly. which are available in the classical situations of (5..5) That is. For instance. if more delicate Hoeffding bound (B. .1. 7).1') < x] _ ij>(x) < CEFIXd' v'~ 3 1/' " "n (5.4.25 whereas (5.9) is .6) gives IX11 < 1.6) does not tell us how large n has to be for the chance of the approximation ~ot holding to this degree (the lefthand side of (5.(Xn .. Thus.' = 1 possible (Problem 5. For € = .6) to fall.. For instance. Further qualitative features of these bounds and relations to approximation (5.10) is . Is n > 100 enough or does it have to be n > 100. say.1.10) < 1 with . As an approximation.6) for all £ > O.1..8) are given in Problem 5.15. We interpret this as saying that.1.VarF(Xd. this reads (5.l) (5.9) As a bound this is typically far too conservative. ' .9.. Similarly. by Chebychev's inequality..1. the weak law of large numbers tells us that. the central limit theorem tells us that if EFIXII < 00. Xn is approximately equal to its expectation.1.1. The trouble is that for any specified degree of approximation. then (5. (see A.01.1. X n ) we consider are closely related as functions of n so that we expect the limit to approximate Tn(Xl~"· . (5.1.omes 11m'.Xn ) or £F(Tn (Xl. then P [vn:(X F n .1. (5.11) .
As we mentioned.F) > N(O 1) . for all F in the n n . good estimates On of parameters O(F) will behave like  Xn does in relation to Ji. • model. and asymptotically normal.1. Bounds for the goodness of approximations have been available for X n and its distribution to a much greater extent than for nonlinear statistics such as the median. B ~ O(F).1.1.11) is again much too consctvative generally. We now turn to specifics.8) is typically much betler than (5.di) where '(0) = 0. 1 i . The qualitative implications of results such as are very impor~ tant when we consider comparisons between competing procedures.12) where (T(O.3. "(0) > 0. (5.1. I .(l) The approximation (5. i . although the actual djstribution depends on Pp in a complicated way. quite generally.t and 0"2 in a precise way. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17  j • i. Asymptotics has another important function beyond suggesting numerical approximations for specific nand F.1.8) differs from the truth.300 Asymptotic Approximations Chapter 5 where C is a universal constant known to be < 33/4. • c F (y'n[Bn .I '1.' (0)( (1 / y'n)( v'27i') (Problem 5. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo. If they are simple. (5. Section 5.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for oneparameter exponential families.2 deals with consistency of various estimates including maximum likelihood.1. (b) Their validity for the given n and Ttl for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. even here they are not a very reliable guide. It suggests that qualitatively the risk of X n as an estimate of Ji. The estimates B will be consistent. (5. d) = '(II' . as we have seen. Although giving us some idea of how much (5.5) and quite generally that risk increases with (1 and decreases with n. As we shall see.' I I • If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good.e(F)]) (1(e.1. Consistency will be pursued in Section 5. . The arguments apply to vectorvalued estimates of Euclidean parameters. consistency is proved for the estimates of canonical parameters in exponential families. for any loss function of the form I(F.2 and asymptotic normality via the delta method in Section 5. behaves like . Yet. F) typically is the standard deviation (SD) of J1iOn or an approximation to this SD. Section 5. For instance. which is reasonable. asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. .11) suggests. . In particular. : i . Practically one proceeds as follows: (a) Asymptotic approximations are derived.7) says that the behavior of the distribution of X n is for large n governed (approximately) only by j.
If Xl.. . We also introduce Monte Carlo methods and discuss the interaction of asymptotics.7 without further discussion.1). We will recall relevant definitions from that appendix as we need them. for .2.7.Section 5. A. Means. .i. Monte Carlo. where P is the empirical distribution.1. The simplest example of consistency is that of the mean.4 deals with optimality results for likelihoodbased procedures in onedimensional parameter models. (5.i. In practice. A stronger requirement is (5.d.) However. and other statistical quantities that are not realistically computable in closed form. remains central to all asymptotic theory. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to 00. 1 X n are i. 00. if.. AlS.) forsuP8 P8 lI'in . . we talk of uniform cornistency over K. and 8. For P this large it is not unifonnly consistent. e Example 5.7.14..lS. by quantities that can be so computed.. But. and B.q(8)1 > 'I that yield (5. . Section 5.2) is called unifonn consistency. for all (5.14. If is replaced by a smaller set K. . which is called consistency of qn and can be thought of as O'th order asymptotics.Xn ) is that as e n 8 E ~ e. . but we shall use results we need from A. The least we can ask of our estimate Qn(X I.1). ' . Most aSymptotic theory we consider leads to approximations that in the i. case become increasingly valid as the sample size increases.Xn from Po where 0 E and want to estimate a real or vector q(O).2.2.2) Bounds b(n. Finally in Section 5. (See Problem 5. and probability bounds.5 we examine the asymptotic behavior of Bayes procedures.2.14. by the WLLN. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A. 5.2. is a consistent estimate of p(P). distributions.. 1denotes Euclidean distance. in accordance with (A.1.2.1) and (B. P where P is unknown but EplX11 < 00 then.2 Consistency 301 in exponential families among other results. That is. 'in 1] q(8) for all 8. X ~ p(P) ~ ~  p = E(XJ) and p(P) = X. with all the caveats of Section 5. > O.2. The stronger statement (5.1 CONSISTENCY PlugIn Estimates and MlEs in Exponential Family Models Suppose that we have a sample Xl. asymptotics are methods of approximating risks.d.1) where I .2. Summary.2) are preferable and we shall indicate some of qualitative interest when we can.2.2 5.
1.6) = sup{lq(p)  q(p')I: Ip . . By the weak law of large numbers for all p.b) say.p'l < 6}. with Xi E X 1 . · .2. 6) It easily follows (Problem 5. Binomial Variance. . in this case.. Pp [IPn  pi > 6] ~ . 0) is defined by . 0 ~ To some extent the plugin method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plugin estimates. there exists 6 «) > 0 such that p. sup{ Pp[liln pi > 61 : pES} < kl4n6 2 (Problem 5.3) Evidently. the kdimensional simplex. Ip' pi < 6«).2. it is uniformly continuous on S.14. then by Chebyshev's inequality. for every < > 0.2. by A. . xd is the range of Xl' Let N i = L~ 11(Xi = Xj) and Pj _ Njln. 6 >0 O. and p = X = N In is a uniformly consistent estimate of p. Asymptotic Approximations Chapter 5 X is uniformly consistent over P because o Example 5. i .. where Pi = PIXI = xi]' 1 < j < k. 0 In fact. implies Iq(p')q(p) I < <Then Pp[li1n ." is the following: I X n are LLd. . Letw.2.ql > <] < Pp[IPn . If q is continuous w(q. p' E S. w. Suppose thnt P = S = {(Pl.6)! Oas6! O. q(P) is consistent.p) distribution.Pk) : 0 < Pi < 1. Suppose the modulus ofcontinuity of q. where q(p) = p(1p). Suppose that q : S + RP is continuous. and (Xl.. = Proof.. which is ~ q(ji).I l 302 instance.b] < R+bedefinedastheinver~eofw.5) I A simple and important result for the case in which X!.2. for all PEP. Other moments of Xl can be consistently estimated in the same way. Theorem 5.. l iA) E S be the empirical distribution.6. we can go further.. P = {P : EpX'f < M < oo}. Evidently.l : [a.2. Because q is continuous and S is compact..l «) = inf{6 : w(q.• X n be the indicators of binomial trials with P[X I = I] = p. But further. (5.3) that > <} (5.pi > 6«)] But. Pn = (iiI.1 < j < k. Let X 1. Thus. Then N = LXi has a B(n. w( q.·) is increasing in 0 and has the range [a. consider the plugin estimate p(Iji)ln of the variance of p. w(q.4) I i (5.2. 0 < p < 1. ...2. . w(q.LJ~IPJ = I}.1) and the result follows. Then qn q(Pn) is a unifonnly consistent estimate of q(p).
V2. Vi). 'Vi) is the statistic generating this 5parameter exponential family.)1 < 1 } tions such that EUr < 00.UV) so that E~ I g(Ui . Var(U.i.ar. 1 < j < d. (i) Plj [The MLE Ti exists] ~ 1. .2. Var(V.. . Suppose P is a canonical exponentialfamily of rank d generated by T.2. = Proof.).2..v} = (u. !:.U 2. is consistent for v(P).) > 0.2.1.7. thus. Jl2.. variances.a~. ' .9d) map X onto Y C R d Eolgj(X')1 < 00. Let TI. ai. . if h is continuous.V.2 Consistency 303 Suppose Proposition 5.) > 0. o Example 5. Let g(u. U implies !'.. 1 are discussed in Problem 5.Section 5. 1 < i < n be i.d.3. Variances and Correlations. and correlation coefficient are all consistent. )1 < oo}.2. af > o.1 and Theorem 2. then vIP) h(g). More generally if v(P) = h(E p g(X .2. conclude by Proposition 5.) I < that h(D n) Foreonsisteney of h(g) apply Proposition B. where h: Y ~ RP. where P is the empirical distribution. Suppose [ is open.3.p).lpl < 1. Questions of uniform consistency and consistency when P = { DistribuCorr(Uj.2.2. If we let 8 = (JLI.6) if Ep[g(X.p).then which is well defined and continuous at all points of the range of m. X n are a sample from PrJ E P. We need only apply the general weak law of large numbers (for vectors) to conclude that (5.JL2.4. and let q(O) = h(m(O)). Then.Jor all 0. We may. Theorem 5. (ii) ij is consistent. let mj(O) '" E09j(X. EVj2 < 00.1: D n 00. Let g (91. then Ifh=m. ag. N2(JLI.6. is a consistent estimate of q(O).1 that the empirical means. )) and P = {P : Eplg(X . 11. . If Xl. 0 Here is a general consequence of Proposition 5. Then. 1 < j < d. h(D) for all continuous h. [ and A(·) correspond to P as in Section 1..1.1 . Let Xi = (Ui .
3.d. We showed in Theorem 2. . 1 i . Let Xl. T(Xdl < 6} C CT' By the law of large numbers.3. = 1 n p. is solved by 110. see. The argument of the the previous subsection in which a minimum contrast estimate.. Note that. Rudin (1987).. .E1). BEe c Rd. is a continuous function of a mean of Li.2. vectors evidently used exponential family properties.1 to Theorem 2. {t: It .2 Consistency of Minimum Contrast Estimates .2..2. I .1. A more general argument is given in the following simple theorem whose conditions are hard to check.L[p(Xi .. . (5.lI o): 111110 1 > e} > D(1I0 . j 1 Pn(X. Let 0 be a minimum contrast estimate that minimizes I . Suppose 1 n PII sup{l.lI) . exists iff the event in (5. • .. I I Proof Recall from Corollary 2. i I . I J Theorem 5. . . Po. I ! . and the result follows from Proposition 5.  inf{D(II.= 1 . if 1)0 is true. II) 0 j =EII. 7 I . n LT(Xi ) i=I 21 En.7) occurs and (i) follows..d.Xn ) exists iff ~ L~ 1 T(X i ) = Tn belongs to the interior CT of the convex support of the distribution of Tn. . . But Tj. as usual.2. . the inverse AI: A(e) ~ e is continuous on 8. '1 P1)'[n LT(Xi ) E C T) ~ 1.3. the MLE. X n be Li. D . which solves 1 n A(1)) = . n . E1).. where to = A(1)o) = E1)o T(X 1 ).LT(Xi ).1 that On CT the map 17 A(1]) is II and continuous on E. 0 Hence. z=l 1 n i..3.. By definition of the interior of the convex support there exists a ball 8.2.2.D(1I0 .1I 0 ) foreverye I .j. II) = where.1 belong to the interior of the convex support hecause the equation A(1)) = to.2.7) I .8) I > O. By a classical result.9) n and Then (} is consistent. . for example. D( 110 ..p(X" II) is uniquely minimized at 11 i=l for all 110 E 6... 5. (T(Xl)) must hy Theorem 2. i=l 1 n (5. T(Xl).1 that i)(Xj.304 Asymptotic Approximations Chapter 5 .3. i i • • ! . II) L p(X n.1I)11: II E 6} ~'O (5.
8) can often failsee Problem 5.D( IJ o. But because e is finite..[max{[ ~ L:~ 1 (p(Xi . [inf c e.2.9) follows from Shannon's lemma.L[p(Xi . /fe e =   Proof Note that for some < > 0.[IJ ¥ OJ] = PIJ. IJ) = logp(x.10) By hypothesis. ifIJ is the MLE. IJj ) .D(IJo.2.2.D(IJo. for all 0 > 0.5.IJol > E] < PIJ [inf{ .p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1. o Then (5.D(IJo.8) by (i) For ail compact K sup In { c e. [I ~ L:~ 1 (p(Xi .2. _ I) 1 n PIJ o [16 . {IJ" . IJo)) . IJ) . IJ) . Ie ¥ OJ] ~ Ojoroll j. IJ)II ' IJ n i=l E e} > . PIJ. IJ) n ~ i=l p(X"IJ o)] .2.9) hold for p(x.2.IJd ).. and (5. then.8) follows from the WLLN and PIJ. IJ) .10) tends to O. 1 n PIJo[inf{. IJj))1 > <I : 1 < j < d} ~ O. IJ). .2.IJ)1 < 00 and the parameterization is identifiable.2.2.2. EIJollogp(XI.IJor > <} < 0] because the event in (5.2. But for ( > 0 let o ~ ~ inf{D(IJ.inf{D(IJo.2 Consistency 305 proof Note that.2. (5.1 we need only check that (5. E n ~ [p(Xi .11) implies that the righthand side of (5.11) implies that 1 n ~ 0 (5.12) which has probability tending to 0 by (5. 2 0 (5.IJol > <} < 0] (5.2."'[p(X i . IJ j )) : 1 <j < d} > <] < d maxi PIJ.IJO)) ' IIJ IJol > <} n i=l . IIJ . . is finite. IIJ . An alternative condition that is readily seen to work more widely is the replacement of (5.2.lJ o): IIJ IJol > oj.Section 5.11) sup{l. IJ j ) .2. PIJ.13) By Shannon's Lemma 2. 0 Condition (5. Coronary 5. (5.1.2.8).L(p(Xi.8) and (5. IJ)I : IJ K } PIJ !l o. 0) .IJ o)  D(IJo.2.2. A simple and important special case is given by the following.D(IJ o.IJ) p(Xi.14) (ii) For some compact K PIJ.IIIJ  IJol > c] (5.
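As a simple numerical illustration of the consistency statements of this section, consider the Poisson(lambda) family, in which the MLE of lambda is X̄. The minimal sketch below (assuming Python with numpy; the values of lambda, epsilon, and the number of replications are illustrative choices, not from the text) estimates P(|X̄_n - lambda| > epsilon) by simulation and shows it shrinking as n grows.

```python
# Sketch: empirical check that the Poisson MLE (the sample mean) is consistent.
import numpy as np

rng = np.random.default_rng(0)
lam, eps, B = 2.5, 0.1, 2000                    # true parameter, tolerance, replications
for n in [10, 100, 1000, 10000]:
    xbar = rng.poisson(lam, size=(B, n)).mean(axis=1)      # B replicates of the MLE
    print(n, np.mean(np.abs(xbar - lam) > eps))             # estimate of P(|MLE - lam| > eps)
```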
We shall see examples in which this modification works in the problems. A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems. Unfortunately, checking conditions such as (5.2.8) and (5.2.14) is in general difficult; sufficient conditions are explored in the problems. When the observations are independent but not identically distributed, consistency of the MLE may fail if the number of parameters tends to infinity; see the problems.

Summary. We introduce the minimal property we require of any estimate (strictly speaking, sequence of estimates): consistency. If theta_hat_n is an estimate of theta(P), we require that theta_hat_n converge to theta(P) in probability as n tends to infinity; uniform consistency requires that sup{P[|theta_hat_n - theta(P)| > eps] : P in P} tend to 0 for all eps > 0. We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers, and derive consistency of the MLE in canonical multiparameter exponential families. We conclude by studying consistency of the MLE and, more generally, of minimum contrast estimates in the cases of Theta finite and Theta Euclidean.

5.3 FIRST- AND HIGHER-ORDER ASYMPTOTICS: THE DELTA METHOD WITH APPLICATIONS

We have argued, in Section 5.1, that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk.

5.3.1 The Delta Method for Moments

We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders. We then sketch the extension to functions of vector means. As usual, let X1, ..., Xn be i.i.d. X-valued and for the moment take X = R. Let h : R -> R, let ||g||_inf = sup{|g(t)| : t in R} denote the sup norm, and write E(X1) = mu, Var(X1) = sigma^2. We denote the jth derivative of h by h^(j) and assume

(i) (a) h is m times differentiable on R, m >= 2, and (b) ||h^(m)||_inf = sup_x |h^(m)(x)| <= M < infinity;

(ii) E|X1|^m < infinity.

We have the following.

Theorem 5.3.1. If (i) and (ii) hold, then

Eh(X̄) = h(mu) + Sum_{j=1}^{m-1} h^(j)(mu) E(X̄ - mu)^j / j! + R_m   (5.3.1)

where the remainder R_m is bounded as in (5.3.2) below.
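The two-term version of (5.3.1), using E(X̄ - mu) = 0 and E(X̄ - mu)^2 = sigma^2/n, is easy to check numerically. A minimal simulation sketch follows (assuming Python with numpy; the exponential model with mean mu and the function h(t) = 1 - exp(-2/t), in the spirit of the warranty example discussed below, are illustrative choices). For exponential data X̄ has a Gamma(n, mu/n) distribution, which is sampled directly.

```python
# Sketch: Monte Carlo E h(X̄) versus the two-term approximation h(mu) + h''(mu) sigma^2 / (2n).
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0                                                   # exponential mean, so sigma^2 = mu^2
h = lambda t: 1 - np.exp(-2 / t)
h2 = lambda t: (4 / t**3 - 4 / t**4) * np.exp(-2 / t)      # second derivative of h

for n in [10, 50, 200]:
    xbar = rng.gamma(n, mu / n, size=500_000)              # distribution of X̄ for exponential data
    print(n, round(h(xbar).mean(), 5), round(h(mu) + h2(mu) * mu**2 / (2 * n), 5))
```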
21.1 < k < r}}.and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion. . :i1 + .3. ..5[jJ2J max {l:: { . . ...1. . X ij ) least twice. . tI.i j appears at by Problem 5. EIX I'li = E(X _1')' Proof.. rl l:: . .3) and j odd is given in Problem 5.'1 +.3) (5..3.3.3.'" ..5. .1'1.) ~ 0. We give the proof of (5. so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r.3. tr i/o>2 all k where tl .3 First. (n . Let I' = E(X.. The more difficult argument needed for (5. .. . . j > 2. The expression in (c) is. .2) where IX' that 1'1 < IX .3) for j even. Moreover. + i r = j.ij } • sup IE(Xi .. (5.3.3.[jj2] + 1) where Cj = 1 <r..2. . t r 1 and [t] denotes the greatest integer n. C [~]! n(n .3. 'Zr J . +'r=j J .Section 5. bounded by (d) < t. _ . Xij)1 = Elxdj il.. ik > 2.. and the following lemma. In!..3. then there are constants C j > 0 and D j > 0 such (5. (b) a unless each integer that appears among {i I . If EjX Ilj < 00.1) . Lemma 5.t" = t1 .. then (a) But E(Xi .4) Note that for j eveo. .... for j < n/2.4) for all j and (5.
3.3. 1 iln .l) + [h(llj'(1')}':..l) + {h(21(J.4).1')2 = 0 2 In. 0 The two most important corollaries of Theorem 5.l) and its variance and MSE.3) for j even.l)}E(X . if I' = O. 00 and 11hC 41 11= < then G(n. • .6.l)h(J.3.3.ll j < 2j EIXd j and the lemma follows.l) 6 + 2h(J. If the conditions of (b) hold.6) n . By Problem 5. apply Theorem 5. . (n li/2] + 1) < nIJf'jj and (c).3.1. . and (e) applied to (a) imply (5.3.4) for j odd and (5. I I I Corollary 5.2. give approximations to the bias of h(X) as an estimate of h(J.308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) . I .1'11. then G(n. n (b) Next.+ O(n. (d).3f') in (5.5) can be replaced by Proof For (5. Corollary 5. (5. then I • I.3.5) follows.l) + 00 2n I' + G(n3f2). < 3 and EIXd 3 < 00. Proof (a) Write 00. (5.l) + [h(1)]2(J.2 ).llJ' + G(n3f2) (5. I 1 .1. .3.1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J.3.3. In general by considering Xi . Because E(X ..J. EIX1 . if 1 1 <j (a) Ilh(J)II= < 00. then h(')( ) 2 0 = h(J.3.1') + {h(2)(J. and EXt < replaced by G(n'). I (b) Ifllh(j11l= < 00. I ..1.1 with m = 4. respectively.3. Then Rm = G(n 2) and also E(X .3f2 ) in (5.l)h(1)(J.3.5) apply Theorem 5. (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00.3.3. i • r' Var h(X) = 02[h(11(J. using Corollary 5..l)E(X . 1 < j < 3.1')3 = G(n 2) by (5. 0 1 I .3f2 ).5) (b) if E(Xt) < G(n.l)h(J.3..6) can be 1 Eh 2 (X) = h'(J..1 with m = 3.JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 .
A qualitatively simple explanation of this important phenonemon will be given in Theorem 5. which is neglible compared to the standard deviation of h(X). the MLE of ' is X .2 ) 2e. for large n.3. and.8) because h(ZI (t) = 4(r· 3 .1.1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours.7) If h(t) ~ 1 .l / Z) unless h(l)(/1) = O. Bias and Variance of the MLE of the Binomial VanOance.= E"X I ~ 11 A. We will compare E(h(X)) and Var h(X) with their approximations.3. 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5.3. then heX) is the MLE of 1 .1) = 11.. (5. .l ).3.Section 5. the bias of h(X) defined by Eh(X) .d. If X [.3. which is G(n.2.Z) (5.3. .h(/1) is G(n. by Corollary 5.10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 .5). as the plugin estimate of the parameter h(Jl) then. Bias.3. T!Jus. as we nonnally would. To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5.exp( 2') c(') = h(/1). then we may be interested in the warranty failure probability (5.3. Note an important qualitative feature revealed by these approximations. X n are i.h(/1)) h(2~(II):: + O(n.t. {h(l) (/1)h(ZI U')/13 (5.3. If heX) is viewed. Thus.(h(X)) E. when hit) = t(1 . where /1.3.3.3.3. Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5.1.. by Corollary 5.2 (Problem (5.6).(h(X) .i.3(1_ ')In + G(n.9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n.4) E(X .11) Example 5./1)3 = ~~.. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +.4 ) exp( 2/t).Z. Example 5. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX)..and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5. (1Z 5.3.1 with bounds on the remainders.3 First.Z ".3.3.  .i.exp( 2It).t) and X.
.p) + .. the error of approximation is 1 1 + . R'n p(1 ~ p) [(I .pl.10) yields .10) is in a situation in which the approximation can be checked.3.3. M2) ~ 2. (5. 1 < j < { Xl . .3. • .3. " ! .2p)..2p(1 . .2p) n n +21"(1 . D 1 i I .310 Asymptotic Approximations Chapter 5 B(l. (5.p(1 .amh i.p)(I.2p)2p(1 .{2(1.e x ) : i l + .5) is exact as it should be.' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2. n n Because MII(t) = I .9d(Xi )f.[Var(X) + (E(X»2] I nI ~ p(1 . Next compute Varh(X)=p(I.6p(l.p(1 .3 ).2p)' . asSume that h has continuous partial derivatives of order up to m. D " ~ I . 0 < i. .!c[2p(1 n p) . .p)J n p(1 ~ p) [I .p)'} + R~ p(l . Let h : R d t R.P) {(1_2 P)2+ 2P(I.2t. ~ Varh(X) = (1. Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi). • ~ O(n.. a Xd d} .j .p).p) . II'.5) yields E(h(X))  ~ p(1 . Theorem 5. . and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i. .p) n . 1 and in this case (5.3.2p)p(l. I .2p)')} + R~.p) = p ( l . First calculate Eh(X) = E(X) . < m.P )} (n_l)2 n nl n Because 1'3 = p(l.2(1 .E(X') ~ p ..' .2.p)(I.p) {(I _ 2p)2 n Thus. + id = m.p) . and will illustrate how accurate (5..p)] n I " .
if EIY.) Var(Y.))) ~ N(O.3. The proof is outlined in Problem 5.2.)} + O(n (5. under appropriate conditions (Problem 5. EI Yl13 < 00 Eh(Y) h(J1. and (5. as for the case d = 1.14) + (g:.3.. 1 < j < d where Y iJ I = 9j(Xi ). Suppose thot X = R.. (7'(h)) where and (7' = VariX. and the appropriate generalization of Lemma 5. + ~ ~:l (J1.1..3. h : R ~ R.3.~ (J1..) + 2 g:'.3. by (5.3. 5.).) )' Var(Y.3 / 2 ) in (5.) Var(Y. B. The most interesting application.) + a. = EY 1 . is to m = 3..1 Then.2 The Delta Method for In law Approximations = As usual we begin with d !. then Eh(Y) This is a consequence of Taylor's expansion in d variables. 1. (J1.) gx~ (J1.~E(X.hUll)~.6).8.3.»)'var(Y'2)] +O(n') Approximations (5. and J. ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi.13) Moreover.).x. (J1.Section 5.4. for d = 2.) Cov(Y".. The results in the next subsection go much further and "explain" the fonn of the approximations we already have.h(!. Y 12 ) 3/2).3).and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00.5). We get.12) Var h(Y) ~ [(:.3.(J1.) Cov(Y". Lemma 5.14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d . (5.3. (5. Y12 ) (5.3.3.". Similarly.3.3.13). Theorem 5. E xl < 00 and h is differentiable at (5. Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00.3. then O(n.3.'.15) Then .12) can be replaced by O(n.) + ~ {~~:~ (J1.3.2 ). 0/ The result follows from the more generally usefullernma.3 First.I' < 00.11.C( v'n(h(X) . .
~ EVj (although this need not be true. • • .3. 0 .ul < Ii '* Ig(v)  g(u) .16) Proof. (e) Using (e).u)!:.• ' " and. see Problems 5. 'j But (e) implies (f) from (b). By definition of the derivative. by hypothesis. an = n l / 2. u = /}" j \ V . Formally we expect that if Vn !:. (ii) 9 : R Then > R is d(fferentiable at u with derivative 9(1) (u). I .8).3. j . for every (e) > O. ( 2 ). ~: .7. V for some constant u.3. = I 0 .17) .N(O. ..15) "explains" Lemma 5. j " But. . hence. I (g) I ~: . V and the result follows.ul Note that (i) (b) '* . V.3. . an(Un .312 Asymptotic Approximations Chapter 5 (i) an(Un . Thus. PIIUn €  ul < iii ~ 1 • "'. . we expect (5.a 2 ). from (a). (5.i. for every € > 0 there exists a 8 > 0 such that (a) Iv .u) !:. Therefore. Consider 1 Vn = v'n(X !") !:. The theorem follows from the central limit theorem letting Un X. for every Ii > 0 (d) j.3. then EV.N(O. V .g(ll(U)(v  u)1 < <Iv . _ F 7 . ·i Note that (5.1. .32 and B.
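Before turning to applications, here is a minimal simulation sketch (assuming Python with numpy; the exponential model and h(t) = t^2 are illustrative choices, not from the text) of the conclusion that sqrt(n)(h(X̄) - h(mu)) is approximately N(0, sigma^2 [h^(1)(mu)]^2).

```python
# Sketch: delta method in law, sqrt(n)(h(X̄) - h(mu)) compared with its asymptotic normal law.
import numpy as np

rng = np.random.default_rng(2)
mu, n = 1.5, 200                                   # exponential mean mu, so sigma = mu
xbar = rng.gamma(n, mu / n, size=200_000)          # X̄ ~ Gamma(n, mu/n) for exponential data
z = np.sqrt(n) * (xbar**2 - mu**2)                 # h(t) = t^2, h'(mu) = 2 mu
print(round(z.mean(), 3), round(z.std(), 3), 2 * mu**2)   # asymptotic SD is sigma |h'(mu)| = 2 mu^2
```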
Then (5. For the proof note that by the central limit theorem.TiS X where S 2 = 1 nl L (X..Section 5. and the foregoing arguments.a) for Tn from the Tn .3 First. FE :F where EF(X I ) = '".X) n 2 . where g(u. then S t:.1 (Ia) + Zla butthat the t n.+A)al + Aa~) .3.3..i.'" 1 X n1 and Y1 .d. we find (Problem 5. Let Xl. N n (0 Aal.s Tn =. £ (5.28) that if nI/n _ A. Let Xl. "t" Statistics.'" . . EZJ > 0./2 ). 1'2 = E(Y. Slutsky's theorem.3 we saw that the two sample t statistic Sn vn1n2 (Y X) ' n = = n s nl + n2 has a Tn2 distribution under H when the X's and Y's are normal withar = a~.3.3. (1 . ••• . Example 5. 1).l (1. (1  A)a~ .9. (b) The TwoSample Case. Using the central limit theorem. else EZJ ~ = O. A statistic for testing the hypothesis H : 11. Yn2 be two independent samples with 1'1 = E(X I ). VarF(Xd = a 2 < 00.17) yields O(n J.N(O. In general we claim that if F E F and H is true. But if j is even.a) critical value (or Zla) is approximately correct if H is true and F is not Gaussian. j even = o(nJ/').X" be i.t2 versus K : 112 > 111. Now Slutsky's theorem yields (5.18) because Tn = Un!(sn!a) = g(Un .) and a~ = Var(Y. i=l If:F = {Gaussian distributions}.O < A < 1.l distribution.and HigherOrder Asymptotics: The Delta Method with Applications 313 where Z ~ N(O.2.> 0 .2 and Slutsky's theorem.j odd. v) = u!v. In Example 4. we can obtain the critical value t n. 1). 1 2 a P by Theorem 5.18) In particular this implies not only that t n. (a) The OneSample Case.). then Tn .1 (1 .3. Consider testing H: /11 = j. al = Var(X.. sn!a).).3. and S2 _ n nl ( ~ X) 2) n~(X.= 0 versuS K : 11.
It follows that if mu1 = mu2, sigma1^2 = sigma2^2, and n1 = n2, then the critical value t_{n-2}(1 - alpha) (or z_{1-alpha}) for Sn is approximately correct if H is true and the X's and Y's are not normal.

Monte Carlo Simulation

As mentioned in Section 5.1, approximations based on asymptotic results should be checked by Monte Carlo simulations. We illustrate such simulations for the preceding t tests by generating data from the chi-square distribution with d degrees of freedom M times independently, each time computing the value of the t statistics and then giving the proportion of times out of M that the t statistics exceed the critical values from the t table. Here we use the chi-square distribution because for small to moderate d it is quite different from the normal distribution. The simulations are repeated for different sample sizes and the observed significance levels are plotted.

Figure 5.3.1. One sample: 10,000 simulations, chi-square data. Each plotted point represents the results of 10,000 one-sample t tests using chi-square data with d degrees of freedom, where d is either 2, 10, 20, or 50, as indicated in the plot; the observed significance level is plotted against log10 sample size.

Figure 5.3.1 shows that for the one-sample t test, when alpha = 0.05, the asymptotic result gives a good approximation when n >= 10^{1.5} = 32 and the true distribution F is chi-square with d >= 10. The chi-square distribution is extremely skew for small d, and in this case the t_{n-1}(0.95) approximation is only good for n >= 10^2. Other distributions should also be tried.

For the two-sample t tests, Figure 5.3.2 shows that when sigma1^2 = sigma2^2 and n1 = n2, the t_{n-2}(1 - alpha) critical value is a very good approximation even for small n and for X, Y distributed as chi-square.
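The Monte Carlo check just described is short to program. A minimal sketch follows (assuming Python with numpy and scipy; the particular values of d and n are illustrative): it estimates the observed level of the nominal 0.05 one-sided one-sample t test of H : mu = d when the data are chi-square with d degrees of freedom, so that H is true.

```python
# Sketch: observed level of the one-sample t test with chi-square data, M = 10,000 replications.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
M, alpha = 10_000, 0.05
for d in [2, 10, 50]:
    for n in [10, 32, 100]:
        x = rng.chisquare(d, size=(M, n))                               # chi^2_d has mean d
        tstat = np.sqrt(n) * (x.mean(axis=1) - d) / x.std(axis=1, ddof=1)
        level = np.mean(tstat > stats.t.ppf(1 - alpha, n - 1))          # one-sided rejection rate
        print(d, n, round(level, 3))
```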
Figure 5.3.2. Two sample: 10,000 simulations, chi-square data, equal variances. Each plotted point represents the results of 10,000 two-sample t tests. For each simulation the two samples are the same size (the size indicated on the x-axis), sigma1^2 = sigma2^2, and the data are chi-square with d degrees of freedom, where d is one of 2, 10, or 50.

The approximation is good here even when the X's and Y's have different chi-square distributions, scaled to have the same means. This is because, as we see from the limiting law of Sn, when n1 = n2 the limiting variance equals 1 whatever be sigma1^2 and sigma2^2, and, as in the one-sample situation, the differences Yi - Xi have a symmetric distribution under H.

However, when both n1 differs from n2 and sigma1^2 differs from sigma2^2, the two-sample t tests with critical region {Sn >= t_{n-2}(1 - alpha)} do not have approximate level alpha, as we see from the limiting law of Sn and Figure 5.3.3. Other Monte Carlo runs (not shown) with sigma1^2 not equal to sigma2^2 show that as long as n1 = n2, the t_{n-2}(1 - alpha) approximation is good, while for n1 not equal to n2 with sigma1^2 = sigma2^2 the t_{n-2}(0.95) approximation is good for n1 >= 100. In the remaining case, Monte Carlo studies have shown that the test in Section 4.9.4 based on Welch's approximation works well.
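A minimal sketch of this comparison follows (assuming Python with numpy and scipy; the sample sizes, the variance ratio, and the use of two-sided p-values are illustrative choices, not from the text). It contrasts the observed level of the pooled two-sample t test with that of Welch's approximate t test when the sample sizes and variances both differ.

```python
# Sketch: observed levels of the pooled and Welch two-sample t tests under H (equal means).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
M, alpha, n1, sd2 = 10_000, 0.05, 25, 3.0
n2 = 2 * n1                                            # second sample twice as large
x = rng.normal(0, 1, size=(M, n1))                     # first sample: N(0, 1)
y = rng.normal(0, sd2, size=(M, n2))                   # second sample: N(0, sd2^2)
pooled = stats.ttest_ind(y, x, axis=1, equal_var=True).pvalue    # two-sided p-values
welch = stats.ttest_ind(y, x, axis=1, equal_var=False).pvalue
print("pooled t level:", np.mean(pooled < alpha))
print("Welch level:   ", np.mean(welch < alpha))
```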
Next, let h(X̄) be an estimate of h(mu), where h is continuously differentiable at mu and h^(1)(mu) is not 0. By Theorem 5.3.3, sqrt(n)[h(X̄) - h(mu)] converges in law to N(0, sigma^2 [h^(1)(mu)]^2). To test the hypothesis H : h(mu) = h0 versus K : h(mu) > h0 the natural test statistic is

Tn = sqrt(n)[h(X̄) - h0] / (s |h^(1)(X̄)|).

Combining Theorem 5.3.3 and Slutsky's theorem, we see that here, too, Tn converges in law to N(0, 1) if H is true, so that z_{1-alpha} is the asymptotic critical value.

Figure 5.3.3. Two sample: 10,000 simulations, Gaussian data, unequal variances, second sample twice as big. Each plotted point represents the results of 10,000 two-sample t tests. For each simulation the two samples differ in size: the second sample is two times the size of the first, and the x-axis denotes the size of the smaller of the two samples. The data in the first sample are N(0, 1) and in the second they are N(0, sigma^2), where sigma^2 takes on values between 1 and 25, as indicated in the plot.

Variance Stabilizing Transformations

Example 5.3.4. In Appendices A and B we encounter several important families of distributions, such as the binomial, Poisson, gamma, and beta, which are indexed by one or more parameters. If we take a sample from a member of one of these families, then the sample mean X̄ will be approximately normally distributed with variance sigma^2/n depending on the parameters indexing the family considered. We have seen that smooth transformations h(X̄) are also approximately normally distributed. It turns out to be useful to know transformations h, called variance stabilizing, such that Var h(X̄) is approximately independent of the parameters indexing the family we are considering.
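A minimal simulation sketch of this idea follows (assuming Python with numpy; the values of lambda, n, and the number of replications are illustrative). For Poisson(lambda) data, Var(X̄) = lambda/n depends on lambda, while Var(sqrt(X̄)) is roughly 1/(4n) for every lambda, using the square-root transformation derived for the Poisson case in the text below.

```python
# Sketch: variance stabilization of the Poisson sample mean by the square-root transformation.
import numpy as np

rng = np.random.default_rng(5)
n, B = 50, 100_000
for lam in [0.5, 2.0, 8.0, 32.0]:
    xbar = rng.poisson(lam, size=(B, n)).mean(axis=1)
    # Var(X̄) = lam/n varies with lam; Var(sqrt(X̄)) stays near 1/(4n)
    print(lam, round(xbar.var(), 4), round(np.sqrt(xbar).var(), 4), round(1 / (4 * n), 4))
```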
. . See Problems 5. 1/4) distribution. c) (5.' . h must satisfy the differential equation [h(ll(A)j2A = C > 0 for some arbitrary c > O. this leads to h(l)(A) = VC/J>.15 and 5. is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. Thus.19) for all '/..16. To have Varh(X) approximately constant in A.3. In this case (5.5. See Example 5. which has as its solution h(A) = 2. suppose that Xl. a variance stabilizing transformation h is such that Vi'(h('Y) . yX± r5 2z(1 . Under general conditions (Bhattacharya and Rao. 0 One application of variance stabilizing transformations. Thus. Also closely related but different are socaBed normalizing transformations. Major examples are the generalized linear models of Section 6. Such a function can usually be found if (J depends only on fJ. If we require that h is increasing.3.)] 2/ n .3. Some further examples of variance stabilizing transformations are given in the problems. sample.Ct confidence interval for J>. finding a variance stabilizing transfonnation is equivalent to finding a function h such that for all Jl and (J appropriate to our family. I X n is a sample from a P(A) family.. Thus. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II.3.3..(A)') has approximately aN(O.d. As an example. h(t) = Ii is a variance stabilizing transformation of X for the Poisson family of distributions. p. In this case a'2 = A and Var(X) = A/n.\ + d.)C.3.6. Substituting in (5. where d is arbitrary. 1n (X I. .Xn are an i.Section 5..and HigherOrder Asymptotics: The Delta Method with Applications 317 (5. by their definition. Suppose further that Then again. The notion of such transformations can be extended to the following situation. which varies freely.1 0 ) Vi' ' is an approximate 1. . A > 0... A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models. .19) is an ordinary differential equation. Edgeworth Approximations The normal approximation to the distribution of X utilizes only the first two moments of X. Suppose.hb)) + N(o.3 First. 538) one can improve on .6) we find Var(X)' ~ 1/4n and Vi'((X) ... X n) is an estimate of a real parameter! indexing a family of distributions from which Xl.3.13) we see that a first approximation to the variance of h( X) is a' [h(l) (/.i. . 1976. in the preceding P( A) case.
0877 0.0254 0.9950 0.9999 1. we need only compute lIn and 1'2n. .9900 0.0397 0.8008 0.<p(x) ' . TABLE 5.86 0.1964 1.40 0.15 0.ססoo 0.9943 0. 1).1254 0. where Tn is a standardized random variable.38 0...3.ססoo 1. " I. Therefore.85 0 0.40 0.6548 4.9097 o.3.5.9684 0.3.6000 0. Table 5.2. H 3 (x) = x3  3x.7792 5.9029 0. "J 1 '~ • 'YIn = E(Vn? (2n)1 E(V .0050 0.0287 0.0100 0.2024 EA NA x 0.20) is called the Edgeworth expansion for Fn .4415 0.95 3.15 0..0001 0 0.' •• .1000 0.318 Asymptotic Approximations Chapter 5 the normal approximation by utilizing the third and fourth moments. ii.0553 0.1051 0.66 0. We can use Problem B.1.0005 0 0. 0 x Exacl 2. To improve on this approximation.6999 0. ~ 2.34 2. Example 5. Edgeworth Approximations to the X 2 Distribution. xro I • I I .9724 0.7000 0.0481 1. .3x) + _(xS .3.3000 0.1) + 2 (x 3 .8000 0.0010 ~: EA NA x Exact .n.9984 0...9506 0.n)' _ 3 (2n)2 = 12 n 1 I . Let F'1 denote the distribution of Tn = vn( X .79 0.1.0500 ! ! .0105 0.3.75 0.0208 ~1. V has the same distribution as E~ 1 where the Xi are independent and Xi N(O.3006 0. .0032 ·1.4999 0.0250 0.51 1.91 0.4000 0.n)1 V2ri has approximately aN(O.61 0.I .4000 0. It follows from the central limit theorem that Tn = (2::7 I Xi n)/V2ri = (V .2000 0.II 0.9999 1. • " .35 0." • [3 n . P(Tn < x).77 0./2 2 . Fn(x) ~ <!>(x) .ססoo 0.9500 'll .5421 0.2706 0.38 0. 1) distribution.9876 0. Exact to' EA NA {.5999 0.9997 4.1 gives this approximation together with the exact distribution and the nonnal approximation when n = 10.9996 1.0284 1.72 . Edgeworth(2) and nonna! approximations EA and NA to the X~o distribution. According to Theorem 8.5000 0.21) The expansion (5.!.9990 0.04 1.0655 O.IOx 3 + 15x)] I I y'n(X n 9n ! I i !. 0.3.4 to compute Xl. i = 1.JL) / a and let lIn and 1'211 denote the coefficient of skewness and kurtosis of Tn.ססOO I . Suppose V rv X~.9750 0. I' 0.95 :' 1.9905 0.. 4 0. (5.3513 0. Then under some conditionsY) where Tn tends to zero at a rate faster than lin and H 2 • H 3 • and H s are Hermite polynomials defined by H 2 (x) ~ x 2  I. Hs(x) = xS  IOx 3 + 15x. 0.9995 0.
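The Edgeworth correction illustrated in the table above (Edgeworth (EA) versus normal (NA) approximations to the standardized chi-square distribution) is easy to evaluate numerically. The following minimal sketch uses Python with numpy/scipy purely as a convenient vehicle; the standard two-term Edgeworth form with Hermite polynomials H2, H3, H5 is used, with skewness γ1n = √(8/n) and excess kurtosis γ2n = 12/n for Tn = (V − n)/√(2n), V ∼ χ²_n. The particular n and x values are illustrative choices, not ones fixed by the text.

```python
import numpy as np
from scipy.stats import norm, chi2

def edgeworth_cdf(x, gamma1, gamma2):
    """Two-term Edgeworth approximation to P(T_n <= x) for a standardized sum."""
    H2 = x**2 - 1
    H3 = x**3 - 3 * x
    H5 = x**5 - 10 * x**3 + 15 * x
    return norm.cdf(x) - norm.pdf(x) * (gamma1 * H2 / 6 + gamma2 * H3 / 24 + gamma1**2 * H5 / 72)

n = 10                                   # V ~ chi^2_n, T_n = (V - n)/sqrt(2n)
gamma1, gamma2 = np.sqrt(8 / n), 12 / n  # skewness and excess kurtosis of T_n
for x in [-1.64, -0.5, 0.5, 1.64, 2.5]:
    exact = chi2.cdf(n + x * np.sqrt(2 * n), df=n)
    print(f"x={x:5.2f}  exact={exact:.4f}  EA={edgeworth_cdf(x, gamma1, gamma2):.4f}  NA={norm.cdf(x):.4f}")
```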
Suppose {Un} are ddimensional random vectors and that for some sequence of constants {an} with an !' (Xl as n Jo (Xl. with  r?) is asymptotically nonnal.0 .3.ar. .3. Let central limit theorem T'f.I. Let (X" Y.3.6.2. Let p2 = Cov 2(X._p2).22) 2 g(l)(u) = (2U. and let r 2 = C2 /(j3<.. .u) ~~ V dx1 forsome d xl vector ofconstants u. Because of the location and scale invariance of p and r. we can show (Problem 5. The proof follows from the arguments of the proof of Lemma 5. xmyl).1. Then Proof.0 2 (5.0 >'1.u) ~ N(O.and HigherOrder Asymptotics: The Delta Method with Applications 319 The Multivariate Case Lemma 5.J12) / a2 to conclude that without use the transformations Xi j loss of generality we may assume J11 = 1k2 = 0. We can write r 2 = g(C. a~ = Var Y.2 T20 .0 TO .o}.m.2. .1.d. E Next we compute = T11 . Using the central limit and Slutsky's theorems. Ui/U~U3..j. Y)/(T~ai where = Var X.2.J11)/a.0.3 and (B. P = E(XY). 1.0.1.).0. >'2.0.2 >'2. vn(i7f .2.0. UillL2U~) = (2p.3.0 >'1. = Var(Xkyj) and Ak. R d !' RP has a differential g~~d(U) at u..3.0.9) that vn(C . (i) un(U n .5 that in the bivariate normal case the sample correlation coefficient r is the MLE of the population correlation coefficient p and that the likelihood ratio test of H : p = is based on Irl.2. Lemma 5. U3) = Ui/U2U3. we can .2 extends to the dvariate case. >'1. E). 0< Ey4 < 00. aiD.3.5.J 2 2 + 4 2 4 2 Tn P 120 + P T 02 +2{ _2 p3Al. 1).2 >'1. and Yj = (Y . (X n .1) jointly have the same asymptotic distribution as vn(U n .1) and vn(i7~ .1. It follows from Lemma 5.n1EY/) ° ~ ai ~ and u = (p.2. Yn ) be i. Y) where 0 < EX 4 < 00.a~) : n 3 !' R. then by the 2 vn(U .u) where Un = (n1EXiYi.2.1. .6) that vn(r 2 N(O.2rJ >"..l = Cov(XkyJ.Section 5.= (Xi .n lEX. _p2.2 + p'A2.9.ij~ where ar Recall from Section 4. = ai = 1. where g(UI > U2. (ii) g.3.3 First. Example 5. 4f. as (X.p).i./U2U3.3.
fii[h(r) .3.1) is achieved hy choosing h(P)=!log(1+ P ). and it provides the approximate 100(1 .Y) ~ N(/li.UI. Asymptotic Approximations Chapter 5 = 4p2(1_ p2)'. Then ~. per < c) '" <I>(vn . Suppose Y 1.3. .~a)/vn3} where tanh is the hyperbolic tangent.1). (1 _ p2)2). E) = .3.fii(r . 1938) that £(vn .320 When (X. .4. (5. we see (Problem 5. .p).. N(O. j f· ~.1'2. . that is. ': I = Ilgki(rn)11 pxd.5 (a) ! . . I Y n are independent identically distributed d vectors with ElY Ii 2 < 00.3. has been studied extensively and it has been shown (e.h(p)J !:. . which is called Fisher's z.8. . 2 1.fii(Y .p) !:.h(p)) is closely approximated by the NCO.23) • .l) distribution.19). c E (1.hp )andhhasatotaldifferentialh(1)(rn) f • . EY 1 = rn.P The approximation based on this transfonnation. Theorem 5.3[h(c) . Argue as before using B. it gives approximations to the power of these tests. . then u5 .3. Var Y i = E and h : 0 ~ RP where 0 is an open subset ofRd. . ! f This expression provides approximations to the critical value of tests of H : p = O.a)% confidence interval of fixed length. p=tanh{h(r)±z (1.rn) + o(ly . (5. .3.. o Here is an extension of Theorem 5.mil £ ~ hey) = h(rn) and (b) so that . and (Prohlem 5. David.u 2.h(p)]).3.g.3(h(r) .10) that in the bivariate nonnal case a variance stahilizing transformation h(r) with .h= (h i .9) Refening to (5. .rn) N(o..3. . N(O.24) I Proof.: ! + h(i)(rn)(y .
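As a complement to the variance-stabilizing transformation of Example 5.3.6, here is a minimal sketch of how the Fisher z interval (5.3.23) might be used and checked by simulation. The bivariate normal parameters, the sample size, and the number of replications are illustrative assumptions only, and the normal critical value is obtained from scipy; none of these choices are part of the text's development.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_interval(r, n, alpha=0.05):
    """Confidence interval for rho based on Fisher's z = arctanh(r), as in (5.3.23)."""
    half = norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    z = np.arctanh(r)                      # 0.5 * log((1 + r)/(1 - r))
    return np.tanh(z - half), np.tanh(z + half)

rng = np.random.default_rng(0)
rho, n, reps, cover = 0.6, 50, 2000, 0
for _ in range(reps):
    xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
    lo, hi = fisher_z_interval(r, n)
    cover += (lo <= rho <= hi)
print("empirical coverage:", cover / reps)   # should be near 0.95
```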
. D Example 5.i=k+l Xi ".0 < 2. L::+. we conclude that for fixed k. Then. Suppose that n > 60 so that Table IV cannot be used for the distribution ofTk.. where V . i=k+1 Using the (b) part of Slutsky's theorem.l) distribution. 14.7) implies that as m t 00.m. we can write. (5. This row.3.1.3.Section 5.3 First.3.and HigherOrder Asymptotics: The Delta Method with Applications 321 (c) jn(h(Y) . By Theorem B.m distribution. We write T k for Tk. Now the weak law of large numbers (A. See also Figure B.m' To show this. gives the quantiles of the distribution of V/k. I).05 quantiles are 2.d.211 = 0.i. when k = 10.1.:: . if . Suppose for simplicity that k = m and k . when the number of degrees of freedom in the denominator is large.m. > 0 is left to the problems.9) to find an approximation to the distribution of Tk. EX.37 for the :F5 . as m t 00. > 0 and EXt < 00. .3.3. To get an idea of the accuracy of this approximation.h(m)) = y'nh(l)(m)(Y  m) + op(I).). By Theorem B. Suppose that Xl. 1 Xl has a x% distribution. check the entries of Table IV against the last row. The case m = )'k for some ). Thus. 00. then P[T5 . if k = 5 and m = 60.k.= density. which is labeled m = 00.m distribution can be approximated by the distribution of Vlk.:l ~ m k+m L X.25) has an :Fk. only that they be i.. we can use Slutsky's theorem (A. with EX l = 0.26) l.1 in which the density of Vlk.. where k + m = n.k+m L::l Xl 2 (5..37] = P[(vlk) < 2. xi and Normal Approximation to the Distribution of F Statistics.3.3. the:F statistic T _ k. But E(Z') = Var(Z) = 1.x%.m  (11k) (11m) L.~1. . Then according to Corollary 8.1. the :Fk. is given as the :FIO . We do not require the Xi to be normal. Next we turn to the normal approximation to the distribution of Tk..Xn is a sample from a N(O. When k is fixed and m (or equivalently n = k + m) is large.' = Var(X.21 for the distribution of Vlk.05 and the respective 0. where Z ~ N(O.7. For instance.15.0 distribution and 2. we first note that (11m) Xl is the average ofm independent xi random variables. the mean ofaxi variable is E(Z2).
By Theorem 5. .k (t 1 (5.5.8(a» does not have robustness of level. Thus (Problem 5.3.3.29) is unknown. Equivalently Tk = hey) where Y i ~ (li" li.I. which by (5. !1 .m • < tl P[) :'. . When it can be eSlimated by the method of moments (Problem 5..)/ad 2 .4. v'n(Tk .3.28) .2Var(Yll))' In particular if X. E(Y i ) ~ (1.). v) identity. ~ N (0.a). ifVar(Xf) of 2a'. Specifically.7).m(1. 0 ! j i · 1 where K = Var[(X.o) k) K (5. the distribution of'. In general (Problem 53. I' Theorem 5. _.k(tl)] '" 'fI() :'. 4). 2) distribution. the F test for equality of variances (Problem 5. We conclude that xl jer}.1)/12 j 2(m + k) mk Z.3. Then if Xl. (5. if it exists and equal to c (some fixed value) otherwise. l (: An interesting and important point (noted by Box. h(i)(u.3.)T.a) '" 1 + is asymptotically incorrect. 1 5.m(l. 1'. . = (~. . where J is the 2 x 2 v'n(Tk 1) ""N(0. X n are a sample from PTJ E P and ij is defined as the MLE 1 .m 1) <) :'.27) where 1 = (1.3.). a').3. = E(X.8(d)). • • 1)/12) . 1)T.' !k . a'). when rnin{k.'. ~ Ck.m critical value fk. P[Tk.. 1953) is that unlike the t test. when Xi ~ N(o.3. Suppose P is a canonical exponential family of rank d generated by T with [open.3 Asymptotic Normality of the Maximum likelihood Estimate in Exponential Families Our final application of the 8method follows. :. k. the upper h.k(T>.m} t 00._o l or ~. i = 1.28) satisfies Zto '" 1 I I:.. and a~ = Var(X. I)T and h(u.»)T and ~ = Var(Yll)J.a) .3. j m"': k (/k. :f. i. 322 Asymptotic Approximations Chapter 5 where Yi1 = and Yi2 = Xf+i/a2. i..1) "" N(O. as k ~ 00. v) ~ ~.8(c» one has to use the critical value .3. 1'.m = (1+ jK(:: z. In general.m(1 . ! i l _ .k(Tk.m 1) can be ap proximated by a N(O. .
8. Thus.3.Section 5. T/2 By Theorem 5. where..23). I'.I(T/) (5. by Corollary 1. 0').A 1(71))· Proof.2 that.33) 71' 711 Here 711 = 1'/0'. Recall that A(T/) ~ VarT/(T) = 1(71) is the Fisher information. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.O. (i) follows from (5.4.30) But D A .2..5. in our case. Then T 1 = X and 1 T2 = n.3.38) on the variance matrix of y'n(ij .i.2. iiI = X/O". PT/[T E A(E)] ~ 1 and. are sufficient statistics in the canonical model. as X witb X ~ Nil'. Now Vii11. . (5.11) eqnals the lower bound (3. (I" +0')] .T.3. by Example 2.A(T/)) + oPT/ (n .).  (TIl' .4. Ih our case.and HigherOrder Asymptotics: The Delta Method with Applications 323 (i) ii = 71 + ~ I:71 A •• • I (T/)(T(X.3. We showed in the proof ofTheorem 5.3. The result is a consequence.3.6.". therefore. if T ~ I:7 I T(X. Example 5.1.d.24).ift = A(T/). = 1/20'.2. Thus. X n be i.) .'l 1 (ii) LT/(Vii(iiT/)). Note that by B. (ii) follows from (5.2 and 5.3. and 1).3 First. the asymptotic variance matrix /1(11) of .Nd(O.3.2 where 0" = T. o Remark 5. and.3. PT/[ii = AI(T)I ~ L Identify h in Theorem 5.l4.N(o.4 with AI and m with A(T/). = (5. = A by definition and. (5. = 1/2.4. thus.3.71) for any nnbiased estimator ij. of Theorems 5.1.32) Hence.EX.1.3.31) .3. hence. For (ii) simply note that..8.In(ij . Let Xl>' .
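As a numerical illustration of the preceding result on the asymptotic normality of the MLE in canonical exponential families, the following sketch checks the limiting distribution for the Poisson family written in canonical form, where η = log λ, A(η) = e^η, and η̂ = log X̄, so that the asymptotic variance is A''(η)⁻¹ = 1/λ. The values of λ, n, and the number of Monte Carlo replications are assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 3.0, 200, 5000
eta = np.log(lam)                       # canonical parameter; A(eta) = exp(eta)
vals = []
for _ in range(reps):
    x = rng.poisson(lam, size=n)
    eta_hat = np.log(x.mean())          # MLE: A'(eta_hat) = mean of T(X) = X-bar
    vals.append(np.sqrt(n) * (eta_hat - eta))
vals = np.array(vals)
print("sample variance of sqrt(n)(eta_hat - eta):", vals.var())
print("theoretical A''(eta)^{-1} = 1/lambda:      ", 1 / lam)
```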
. .6.1) i . 0. .. which lead to a result on the asymptotic nonnality of the MLE in multiparameter exponential families.33) and Theorem 5.1 by studying approximations to moments and central moments of estimates.d. variables are explained in tenns of similar stochastic approximations to h(Y) .L. Nk) where N j . We begin in Section 5. . We focus first on estimation of O. . Higherorder approximations to distributions (Edgeworth series) are discussed briefly. .)'.') N(O. The moment and in law approximations lead to the definition of variance stabilizing transfonnations for classical onedimensional exponential families.1.3. Assume A : 0 ~ pix. . Summary.4. . we can use (5.7).4. . ..1. . < 1.3.d. 5.Pk) where (5.. and confidence bounds. see Example 2. Eo) . We consider onedimensional parametric submodels of S defined by P = {(p(xo.(T. Consistency is Othorder asymptotics. I . 5. Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal.L~ I l(Xi = Xj) is sufficient. and PES. 0). Secondorder asyrnptotics provides approximations to the difference between the error and its firstorder approximation. under Li. is twice differentiable for 0 < j < k. 0 .4 and Problem 2. These "8 method" approximations based on Taylor's fonnula and elementary results about moments of means of Ll.3. the difference between a consistent estimate and the parameter it estimates.15). I i I .3.. EY = J. testing. I I I· < P. .26) vn(X /". Y n are i.g. N = (No. taking values {xo..324 Asymptotic Approximations Chapter 5 Because X = T.. stochastic approximations in the case of vector statistics and parameters are developed.. . 0). . J J . and il' = T. . Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean.4 to find (Problem 5.d. and so on. 8 open C R (e. the (k+ I)dimensional simplex (see Example 1.P(Xk' 0)) : 0 E 8}.... Following Fisher (1958)'p) we develop the theory first for the case that X""" X n are i. Thus.: CPo. sampling. and h is smooth. . when we are dealing with onedimensional smooth parametric models. Finally.".i..1 Estimation: The Multinomial Case .4 ASYMPTOTIC THEORY IN ONE DIMENSION I: I " ! . il' . Xk} only so that P is defined by p . . i 7 •• .h(JL) where Y 1. In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families.d.i. for instance. as Y.s where Eo = diag(a 2 )2a 4 ). These stochastic approximations lead to Gaussian approximations to the laws of important statistics. i In this section we define and study asymptotic optimality for estimation. Firstorder asymptotics provides approximations to the difference between a quantity tending to a limit and the limit.
1.4. Assume H : h is differentiable.4. (5. ..l (Xl.11). (0) 80(Xj. Many such h exist if k 2.p(Xk.4.2 (8. fJl E.4.7) where .4 Asymptotic Theory in One Dimension 325 Note that A implies that k [(X I .9) .4.8).. 0 <J (5. 8) is similarly bounded and well defined with (5.1.4.3) Furthermore (See'ion 3.4.4.4. Theorem 5. if A also holds. Then we have the following theorem.4. Moreover.8) ~ I)ogp(Xj. Next suppose we are given a plugin estimator h (r. Under H.Section 5. < k. Consider Example where p(8) ~ (p(xo.4) g.8)1(X I =Xj) )=00 (5. for instance. h) with eqUality if and only if. bounded random variable (5.8) logp(X I .6) 1.4.1.5) As usual we call 1(8) the Fisher infonnation..2(0.0). (5.8) .11)) of (J where h: S satisfies ~ R h(p(8» = 8 for all 8 E e > (5. > rl(O) M a(p(8)) P.2) is twice differentiable and g~ (X I) 8) is a well~defined.:) (see (2. pC') =I 1 m .4. ..Jor all 8.2). h) is given by (5. 88 (Xl> 8) and =0 (5. 8))T.
&8(X. but also its asymptotic variance is 0'(8.8) ) ~ .326 Asymptotic Approximations Chapter 5 Proof.9). 2: • j=O &l a:(p(8))(I(XI = Xj) . (5.12). By (5.4. 0) = • &h fJp (5. h) = I . using the correlation inequality (A.15) i • .p(xj. (5..8)&Pj 2: • &h a:(p(0))p(x. which implies (5. whil.h)Var. h(p(8)) ) ~ vn Note that.4. kh &h &Pj (p(8))(I(Xi = Xj) . I (x j.')  h(p(8)) } asymptotically normal with mean 0.8)). ( = 0 '(8.4. by (5./: .2 noting that N) vn (h (. for some a(8) i' 0 and some b(8) with prob~bility 1.4.p(x). ~ir common variance is a'(8)I(8) = 0' (0.4.13). using the definition of N j .4. Noting that the covariance of the right' anp lefthand sides is a(8).16) o . : h • &h &Pj (p(8)) (N . not only is vn{h (.4. we obtain 2: a:(p(8)) &0(xj. Taking expectations we get b(8) = O..I n.l6) as in the proof of the information inequality (3.8)) = a(8) &8 (X" 8) +b(O) &h Pl (5. • • "'i. we s~~ iliat equality in (5.h)I8) (5.p(xj.12) [t.6). h).10). 8).14) .8) = 1 j=o PJ or equivalently. (8. Apply Theorem 5.8) with equality iff.4.10) Thus.8) j=O 'PJ Note that by differentiating (5..4.8) gives a'(8)I'(8) = 1. (5.3.. 0)] p(Xj.13) I I · . we obtain &l 1 <0 .4.4.II.8) ) +op(I).4.4.. by noting ~(Xj. I h 2 (5. h k &h &pj(p(8)) (N p(xj.11) • (&h (p(O)) ) ' p(xj.
::7 We shall see in Section 5. Example 5. In the next two subsections we consider some more general situations. p) and HardyWeinberg models can both be put into this framework with canonical parameters such as B = log ( G) in the first case.Section 5A Asymptotic Theory in One Dim"e::n::'::io::n ~_ __'3::2:.i....xd).8) = exp{OT(x) . . 0). achieved by if = It (r..0 (5. Let p: X x ~ R where e D(O.18) and k h(p) = [A]I LT(xj)pj )".0)) ~N (0.Xn are tentatively modeled to be distributed according to Po.4. J(~)) with the asymptotic variance achieving the infonnation bound Jl(B). by ( 2. E open C R and corresponding density/frequency functions p('.8) is.4. 0) and (ii) solvesL::7~ONJgi(Xj.. 5. 0 E e.4... then.). Suppose p(x.19) The binomial (n.I L::i~1 T(Xi) = " k T(xJ )"..d. Xk}) and rem 5.A(O)}h(x) where h(x) = l(x E {xo.1.2.3) L. 0 Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena. . Suppose i.Oo) = E. Then Theo(5.3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check.0)=0.2 Asymptotic Normality of Minimum Contrast and MEstimates o e o We begin with an asymptotic normality theorem for minimum contrast estimates.17) L. Write P = {P.3 that the information bound (5. . As in Theorem 5. the MLE of B where It is defined implicitly ~ by: h(p) is the value of O...o(p(X"O)  p(X"lJo )) .4.4. . : E Ell.4.5 applies to the MLE 0 and ~ e is open. if it exists and under regularity conditions.3."j~O (5.4. OneParameter Discrete Exponential Families. . Xl. . is a canonical oneparameter exponential family (supported on {xo. Note that because n N T = n .(vn(B .3. which (i) maximizes L::7~o Nj logp(xj.
4.3.1/ 2) n '. . As we saw in Section 2.22) J i I . A4: sup.. (5. 8.21) i J A2: Ep. That is.0(P)) l. o if En ~ O.!:. That is.pix.p(X" On) n = O. rather than Pe.23) I. (5. O(P» / ( Ep ~~ (XI.p(.4. ~I AS: On £.t) . 0 E e.1. parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for. Theorem 5. O(P).4. hence.4.p(x. • is uniquely minimized at Bo. # o. Under AOA5.4.' '.1. .O(P)) +op(n.p(Xi. < co for all PEP.~(Xi. denote the distribution of Xi_ This is because. i' On where = O(P) +.p2(X. n i=l _ 1 n Suppose AO: . PEP and O(P) is the unique solution of(5. j.O). O)dP(x) = 0 (5.p(x. J is well defined on P. P) = .O(p)))I: It . f 1 .. 0) is differentiable.328 Asymptotic Approximations Chapter 5 . We need only that O(P) is a parameter as defined in Section 1. ~ (X" 0) has a finite expectation and i ! .p E p 80 (X" O(P») A3: . . O(Pe ) = O. O(P))) .O(P)I < En} £.p(x. Suppose AI: The parameter O(P) given hy the solution of .20) In what follows we let p. as pointed out later in Remark 5.4. .4.. ~i  i . On is consistent on P = {Pe : 0 E e}.p = Then Uis well defined. O)ldP(x) < co. . { ~ L~ I (~(Xi.. L.LP(Xi. Let On be the minimum contrast estimate On ~ argmin .L .=1 n ! (5.21) and.2. 1 n i=l _ . under regularity conditions the properties developed in this section are valid for P ~ {Pe : 0 E e}.
4.20). where (T2(1jJ.' " 1jJ(Xi .O(P)) 2' (E '!W(X. But by the central limit theorem and AI.27) we get..29) . and A3.4.4.4..L 1jJ(X" 8(P)) = Op(n.22) follows by a Taylor expansion of the equations (5.On)(8n .O(P)) / ( El' ~t (X" O(P))) o while E p 1jJ2(X"p) = (T2(1jJ.O(P)I· Apply AS and A4 to conclude that (5.26) and A3 and the WLLN to conclude that (5. Next we show that (5.O(P))) p (5._ .21).4. t=1 1 n (5. O(P)). (5.4.O(P)) = n.8(P)) I n n n~ t=1 n~ t=1 e (5.Section 5.28) .4.27) Combining (5.4. 8(P)) + op(l) = n ~ 1jJ(Xi . applied to (5.' " iJ (Xi .22) because . we obtain.24) follows from the central limit theorem and Slutsky's theorem.O(P)) Ep iJO (Xl. By expanding n.p) = E p 1jJ2(Xl .4.p) < 00 by AI.4.425)(5.24) proof Claim (5.4 Asymptotic Theory in One Dimension 329 Hence. A2. Let On = O(P) where P denotes the empirical probability. . en) around 8(P).lj2 and L 1jJ(X" P) + op(l) i=1 n EI'1jJ(X 1.O(P)) ~ . 1 1jJ(Xi . (iJ1jJ ) In (On . n. using (5.Ti(en .4.4..4.1j2 ).l   2::7 1 iJ1jJ.20) and (5.25) where 18~  8(P)1 < IiJn .
and that the model P is regular and letl(x.d. Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i. . 0) Covo (:~(Xt. Solutions to (5. AI.20) are called M~estimates. I o Remark 5. n I _ . Identity (5.4.(x) (5. B) where p('. 0) l' P = {Po: or more generally.O)?jJ(X I . Remark 5. (2) O(P) solves Ep?jJ(XI. that 1/J = ~ for some p).4. If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x. Remark 5. A6: Suppose P = Po so that O(P) = 0.e. . A2.i. X n are i. {xo 1 ••• 1 Xk}. O)dp.4. 0) = 6 (x) 0.12). 1 •• I ! .4.4.. I 'iiIf 1 I • I I . Z~ (Xl.4.4.Eo (XI. 0) is replaced by Covo(?jJ(X" 0). and we define h(p) = 6(xj )Pj.2 may hold even if ?jJ is not differentiable provided that .'.4. en j• .4. essentially due to Cramer (1946). 0 • .4.30) is formally obtained by differentiating the equation (5.2.29). for A4 and A6.4.8).21).28) we tinally obtain On . Conditions AO. for Mestimates.O(P) Ik = n n ?jJ(Xi . O(P)) ) and (5. P but P E a}.4.2.1.4.31) for all O. Suppose lis differen· liable and assume that ! • &1 Eo &O(X I .2 is valid with O(P) as in (I) or (2).30) Note that (5.1. e) is as usual a density or frequency function.=0 ?jJ(x. O(P)) + op (I k n n ?jJ(Xi . If further Xl takes on a finite set of values..! .22) follows from the foregoing and (5.: 1. Our arguments apply even if Xl. we see that A6 corresponds to (5.4.?jJ(XI'O)).4. O)p(x. This extension will be pursued in Volume2. 0) = logp(x. . We conclude by stating some sufficient conditions.3. This is in fact truesee Problem 5. A4 is found. • Theorem 5. and A3 are readily checkable whereas we have given conditions for AS in Section 5.4.30) suggests that ifP is regular the conclusion of Theorem 5. it is easy to see that A6 is the same as (3. B» and a suitable replacement for A3.O). =0 (5. An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po. Our arguments apply to Mestimates.4.O) = O. 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5. written as J i J 2::.. O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I .
where EeM(Xl. 0) .20) occurs when p(x. (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl. 0) (5.32) where 1(8) is t~e Fisher information introduq.4.eo (Xl. 0) and .O). 0') is defined for all x. 0) obeys AOA6. where iL(X) is the dominating measure for P(x) defined in (A.01 < J( 0) and J:+: JI'Ii:' (x. In this case is the MLE and we obtain an identity of Fisher's.4. s) diL(X )ds < 00 for some J ~ J(O) > 0.4.4. 5. < M(Xl..34) Furthermore. = en en = &21 Ee e02 (Xl. 10' . That is. 0) < A6': ~t (x.4) but A4' and A6' are not necessary (Problem 5.4.4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI. Pe) > 1(0) 1 (5.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5.4.4. AOA6.O) 10gp(x.:d in Section 3.O) = l(x.35) with equality iff.1). then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (.0'1 < J(O)} 00. 0) g~ (x. satisfies If AOA6 apply to p(x.4. B). w.3.O) = l(x.Section 5.4. We can now state the basic result on asymptotic normality and e~ciency of the MLE.p = a(0) g~ for some a 01 o. Theorem 5. "dP(x) = p(x)diL(x). We also indicate by example in the problems that some conditions are needed (Problem 5.p(x.33) so that (5. 0') ..p.O) and P ~ ~ Pe.8) is a continuous function of8 for all x." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems. then the MLE On (5.4. lO.. .
4. 442. We next compule the limiting distribution of .2. see Lehmann and Casella.39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo). We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n.. .".36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!. for some Bo E is known as superefficiency.4. and Polen = OJ .. .4. By (5. Let Z ~ N(O.4.Xn be i. Then . . . we know that all likelihood ratio tests for simple BI e e eo. for higherdimensional the phenomenon becomes more disturbing and has important practical consequences.4. 1). .(X B). 00. j 1 " Note that Theorem 5.i . . However.4.n(Bn . PoliXI < n1/'1 . if B I' 0.d. thus. 1. .B).. PoliXI < n 1/4 ] ..'(B) = I = 1(~I' B I' 0.4. cross multiplication shows that (5. The optimality part of Theorem 5.4. Then X is the MLE of B and it is trivial to calculate [( B) _ 1.33) and (5..nB).l'1 Let X" . }. For this estimate superefficiency implies poor behavior of at values close to 0. 5. .4. N(B. claim (5.35).. The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(. .1/4 I (5.ne .38) Therefore.. PIIZ + .34) follow directly by Theorem 5. (5.. and. Consider the following competitor to X: B n I 0 if IXI < n.36) Because Eo 'lj. .. I I Example 5.i.1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise. ..30) and (5.3 generalizes Example 5.4.4..A'(B)..4... p. I. (5. 0 because nIl' . .439) where .• •. 0 1 . 1. 1 . 1 I I.4.. Hodges's Example. If B = 0.4.2.35) is equivalent to (5.37) .<1>( _n '/4 .. 1 ..1/4 X if!XI > n.nBI < nl/'J <I>(n l/ ' . PolBn = Xl .'(0) = 0 < l(~)' The phenomenon (5. We discuss this further in Volume II.1 once we identify 'IjJ(x.r 332 Asymptotic Approximations Chapter 5 Proof Claims (5. B) with J T(x) . {X 1. B) = 0.. 1998. .nB) .4. B) is an MLR family in T(X). Therefore.1).3 is not valid without some conditions on the esti· mates being considered. 0 e en e.4 Testing ! I i .4.
• • where A = A(e) because A is strictly increasing.43) Property (5. ljO < 00 . B E (a.40). < >"0 versus K : >.0:c":c'=D. The proof is straightforward: PoolvnI(Oo)(Bn .42)  ~ o. If pC e) is a oneparameter exponential family in B generated by T(X)..:cO"=~ ~ 3=3:.41) follows.4. Theorem 5. 00)] = PolvnI(O)(Bn  0) > VnI(O)(cn(a.Section 5.Oo)] = a and B is the MLE n n of We will use asymptotic theory to study the behavior of this test when we observe ij. Proof.. Suppose (A4') holds as well as (A6) and 1(0) < oofor all O.:c".m='=":c'. derive an optimality property.4.40) > Ofor all O. Let en (0:'..00) > z] ~ 11>(z) by (5. .00) > z] ..4. .4.2 apply to '0 = g~ and On.(a. 0 E e} is such thot the conditions of Theorem 5.a quantile o/the N(O..0)].4.46) PolBn > cn(a. "Reject H if B > c(a.Oo) = 00 + ZIa/VnI(Oo) +0(n. (5. PoolBn > 00 + zl_a/VnI(Oo)] = POo IVnI(Oo) (Bn  00) > ZI_a] ~ a.. (5. Then c.4.(a.(1 1>(z))1 ~ 0. are of the form "Reject H for T( X) large" with the critical value specified by making the probability of type I error Q at eo. On the (5.:.3 versus simple 2 . "Reject H for large values of the MLE T(X) of >.4.Oo)] PolOn > cn(a. Then ljO > 00 . The test is then precisely.. Thus. e e e e.4.42) is sometimes called consistency of the test against a fixed alternative.4.l4.d. It seems natural in general to study the behavior of the test.h the same behavior.45) which implies that vnI(Oo)(c. Zla (5. That is. 00 )" where P8 .4 Asymptotic Theo:c'Y:c.41) where Zlct is the 1 . (5.r'(O)) where 1(0) (5. 1 < z . as well as the likelihood ratio test for H versus K. a < eo < b. Xl. b). 00 )]  ~ 1. eo) denote the critical value of the test using the MLE en based On n observations. 1) distribution.4. [o(vn(Bn 0)) ~N(O. 00) . and (5.44) But Polya's theorem (A.. the MLE.I / 2 ) (5.4. and then directly and through problems exhibit other tests wi!.(a. PolO n > c. > AO.00) other hand. Suppose the model P = {Po .[Bn > c(a. 00) ..::.22) guarantees that sup IPoolvn(Bn . . X n distributed according to Po. this test can also be interpreted as a test of H : >.. ~ 0.4.4.
4. the power of tests with asymptotic level 0' tend to 0'. That is.4.   j Proof.50). (3) = Theorem 5. If "m(8 .1 / 2 )) . Claims (5.2 and (5. X n ) is any sequence of{possibly randomized) critical (test) functions such that .40) hold uniformly for (J in a neighborhood of (Jo. Write Po[Bn > cn(O'.jnI(8)(Bn .4.80 ) tends to zero.jI(80)) ill' < O. < 1 lI(Zl_a ")'.50)) and has asymptotically smallest probability of type I error for B < Bo.8) > .4. (5.4.48) .4.4. j (5.k'Pn(X1. then (5.8) > .50) can be interpreted as saying that among all tests that are asymptotically level 0' (obey (5. 1 ~.4. the test based on 8n is still asymptotically MP. . . f.4.jnI(O)(Bn .  i.jnI(8)(Bn .1/ 2 ))]. 0. 00 if 8 < 80 .jnI(8)(80 .49) ! i . then by (5.80) ~ 8)J Po[. Suppose the conditions afTheorem 5.49)) the test based on rejecting for large values of 8n is asymptotically uniformly most powerful (obey (5.4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to Ct from the left at 00 and continues rising steeply to 1 to the right of 80 ... I .4. I. (5. j .jnI(80) + 0(n.4. Theorem 5.48). 00 if8 > 80 0 and . ~! i • Note that (5. (5. ~ " uniformly in 'Y.80) tends to infinity.50) i.4.42) and (5.···.jI(80)) ill' > 0 > llI(zl_a ")'. o if "m(8 .jnI(8)(Cn(0'.4. 80)J 1 P.41).80 1 < «80 )) I J . In either case.[. .[.(80) > O.(1 lI(z))1 : 18 .8 + zl_a/..4.jnI(80) +0(n.J 334 Asymptotic Approximations Chapter 5 By (5. LetQ) = Po. the power of the test based on 8n tends to I by (5. assume sup{IP.48) and (5.8) + 0(1) . .5. Furthennore.jnI(8)(80 . X n ) i.017".4.51) I 1 I I .4.")' ~ "m(8  80 ). then limnE'o+. .jnI(8)(Cn(0'.8) < z] .8 + zl_o/. .80 ). Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ")' "m( 8 .jnI(8)(80 . 80 )  8) . In fact.4. these statements can only be interpreted as valid in a small neighborhood of 80 because 'Y fixed means () + B .43) follow.4. iflpn(X1 . On the other hand.. .47) lor.
<P(Zla(1 + 0(1)) + .. X n . if I + 0(1)) (5. note that the Neyman Pearson LR test for H : 0 = 00 versus K : 00 + t. Llog·· (X 0) =dn (Q. . To prove (5. Finally.48) follows.j1i(8  8 0 ) is fixed.Xn) is the critical function of the LR test then.4 Asymptotic Theory in One Dimension 335 If..5..4. for all 'Y. . 1(8) = 1(80 + + fi) ~ 1(80) because our uniformity assumption implies that 0 1(0) is continuous (Problem 5.?.[logPn( Xl.50) and." + 0(1) and that It may be shown (Problem 5. . Thus.4.8) 1. . [5In (X" .OO)J .80 ) < n p (Xi..53) where p(x. hence. .8 0 ) . i=1 P 1.Section 5. P. There are two other types of test that have the same asymptotic behavior. hand side of (5. ~ .OO+7n) +EnP. is asymptotically most powerful as well. Q).4. 8 + fi) 0 P'o+:.o + 7. It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form "Reject if L log[p(X" 8n )!p(X i=1 n i .54) Assertion (5. is O.4. " Xn.53) is Q if.4. X. .jI(Oo)) + 0(1)) and (5.4.4.4 and 5.4. The test li8n > en (Q. 0 n p(Xi .50) note that by the NeymanPearson lemma.. 00)]1(8n > 80 ) > kn(Oo..53) tends to the righthand side of (5.50) for all.4... .4. for Q < ~. .Xn) is the critical function of the Wald test and oLn (Xl.8) that. 0 The asymptotic!esults we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n.. t n are uniquely chosen so that the right.<P(Zl_a . > dn (Q.4. € > 0 rejects for large values of z1. These are the likelihood ratio test and the score or Rao test.Xn ) = 5Wn (X" .54) establishes that the test 0Ln yields equality in (5.7)..52) > 0. Further Taylor expansion and probabilistic arguments of the type we have used show that the righthand side of (5. 80)] of Theorems 5. . 1..' L10g i=1 p (X 8) 1. 0) denotes the density of Xi and dn .4.)] ~ L (5.jn(I(8o) + 0(1))(80 .. .. k n (80 ' Q) ~ if OWn (Xl..5 in the future will be referred to as a Wald test.. The details are in Problem 5. 0 (5.4.4.4. 00 + E) logPn(X l E I " .O+.
where p_n(X_1, …, X_n, θ) is the joint density of X_1, …, X_n. For ε small, n fixed, this is approximately the same as rejecting for large values of ∂/∂θ_0 log p_n(X_1, …, X_n, θ_0).

The preceding argument doesn't depend on the fact that X_1, …, X_n are i.i.d. with common density or frequency function p(x, θ), and the test that rejects H for large values of ∂/∂θ_0 log p_n(X_1, …, X_n, θ_0) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming

"Reject H iff Σ_{i=1}^n ∂l/∂θ(X_i, θ_0) > T_n(α, θ_0)."   (5.4.55)
It is easy to see (Problem 5.4.15) that

T_n(α, θ_0) = z_{1−α} √(nI(θ_0)) + o(n^{1/2})

and that again, if δ_{Rn}(X_1, …, X_n) is the critical function of the Rao test, then

P_{θ_0 + γ/√n}[δ_{Rn}(X_1, …, X_n) = δ_{Wn}(X_1, …, X_n)] → 1   (5.4.56)

(Problem 5.4.8), and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(θ_0), which may require numerical integration, can be replaced by −n^{-1} d²/dθ² l_n(θ̂_n) (Problem 5.4.10).
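To make the Rao test concrete, here is a minimal sketch for one particular model, the Bernoulli(θ) case, where ∂l/∂θ(x, θ) = (x − θ)/θ(1 − θ) and I(θ) = 1/θ(1 − θ); the test rejects when the score statistic exceeds z_{1−α}√(nI(θ_0)), as in (5.4.55) and the display above. The choice of model, θ_0, and the data-generating value are assumptions of the illustration, and scipy is used only to obtain the normal quantile.

```python
import numpy as np
from scipy.stats import norm

def rao_score_test(x, theta0, alpha=0.05):
    """One-sided Rao (score) test of H: theta <= theta0 for Bernoulli(theta) data."""
    n = len(x)
    score = np.sum((x - theta0) / (theta0 * (1 - theta0)))   # sum of dl/dtheta at theta0
    info = 1.0 / (theta0 * (1 - theta0))                     # Fisher information per observation
    critical = norm.ppf(1 - alpha) * np.sqrt(n * info)       # z_{1-alpha} * sqrt(n I(theta0))
    return score, critical, score > critical

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.55, size=100)        # data actually generated with theta = 0.55
print(rao_score_test(x, theta0=0.5))
```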
5.4.5 Confidence Bounds
We define an asymptotic level 1 − α lower confidence bound (LCB) θ*_n by the requirement that

liminf_{n→∞} P_θ[θ*_n ≤ θ] ≥ 1 − α   (5.4.57)

for all θ, and similarly define asymptotic level 1 − α UCBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:

(i) By using a natural pivot.

(ii) By inverting the testing regions derived in Section 5.4.4.
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (A0)–(A6), (A4'), and I(θ) finite for all θ, it follows (Problem 5.4.9) that

L_θ(√(nI(θ̂_n))(θ̂_n − θ)) → N(0, 1)   (5.4.58)

for all θ and, hence, an asymptotic level 1 − α lower confidence bound is given by

θ*_n = θ̂_n − z_{1−α}/√(nI(θ̂_n)).   (5.4.59)
Turning to method (ii), inversion of δ_{Wn} gives formally

θ*_{n1} = inf{θ : c_n(α, θ) > θ̂_n}   (5.4.60)

or, if we use the approximation c_n(α, θ) ≈ θ + z_{1−α}/√(nI(θ)) of (5.4.41),

θ*_{n2} = inf{θ : θ + z_{1−α}/√(nI(θ)) > θ̂_n}.   (5.4.61)

In fact, neither θ*_{n1} nor θ*_{n2} properly inverts the tests unless c_n(α, θ) and its approximation are increasing in θ. The three bounds are different, as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, θ*_{n1} is preferable because this bound is not only approximately but genuinely level 1 − α. But computationally it is often hard to implement because c_n(α, θ) needs, in general, to be computed by simulation for a grid of θ values. Typically, (5.4.59) or some equivalent alternatives (Problem 5.4.10) are preferred but can be quite inadequate (Problem 5.4.11). These bounds, θ*_n, θ*_{n1}, θ*_{n2}, are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5.4.12 and 5.4.13).
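As an illustration of the pivot-based bound (5.4.59), here is a minimal sketch for the Bernoulli(θ) model, where θ̂_n = X̄ and I(θ) = 1/θ(1 − θ). The simulation settings (θ, n, number of replications) are arbitrary assumptions, meant only to show that the empirical coverage of the bound is close to the nominal 1 − α.

```python
import numpy as np
from scipy.stats import norm

def wald_lcb(x, alpha=0.05):
    """Asymptotic level 1-alpha lower confidence bound (5.4.59) for Bernoulli theta."""
    n, theta_hat = len(x), x.mean()
    info_hat = 1.0 / (theta_hat * (1 - theta_hat))   # I(theta_hat); assumes 0 < theta_hat < 1
    return theta_hat - norm.ppf(1 - alpha) / np.sqrt(n * info_hat)

rng = np.random.default_rng(3)
theta, n = 0.3, 200
cover = np.mean([wald_lcb(rng.binomial(1, theta, n)) <= theta for _ in range(4000)])
print("empirical coverage:", cover)    # should be close to 0.95
```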
Summary. We have defined asymptotic optimality for estimates in one-parameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of θ in a one-dimensional subfamily of the multinomial distributions, showed that the MLE formally achieves this bound, and made the latter result sharp in the context of one-parameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M-estimates, generalizations of the MLE, along the lines of Huber (1967). The asymptotic formulae we derived are applied to the MLE both under the model that led to it and under an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980).
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as n → ∞ in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4: i.i.d. observations from a regular model in which Θ is open ⊂ R or Θ = {θ_1, …, θ_k} is finite, and θ is identifiable. Most of the questions we address and answer are under the assumption that ϑ = θ, an arbitrary specified value, or, in frequentist terms, that θ is true.

Consistency

The first natural question is whether the Bayes posterior distribution, as n → ∞, concentrates all mass more and more tightly around θ. Intuitively this means that the data that are coming from P_θ eventually wipe out any prior belief that parameter values not close to θ are likely. Formalizing this statement about the posterior distribution, Π(· | X_1, …, X_n), which is a function-valued statistic, is somewhat subtle in general. But for Θ = {θ_1, …, θ_k} it is straightforward. Let

π(θ | X_1, …, X_n) = P[ϑ = θ | X_1, …, X_n].   (5.5.1)

Then we say that Π(· | X_1, …, X_n) is consistent iff for all θ ∈ Θ,

P_θ[|π(θ | X_1, …, X_n) − 1| > ε] → 0 for all ε > 0.   (5.5.2)

There is a slightly stronger definition: Π(· | X_1, …, X_n) is a.s. consistent iff for all θ ∈ Θ,

π(θ | X_1, …, X_n) → 1 a.s. P_θ.   (5.5.3)

General a.s. consistency is not hard to formulate:

Π(· | X_1, …, X_n) ⇒ δ_{θ} a.s. P_θ   (5.5.4)

where ⇒ denotes convergence in law and δ_{θ} is point mass at θ. There is a completely satisfactory result for Θ finite.
Theorem 5.5.1. Let π_j = P[ϑ = θ_j], j = 1, …, k, denote the prior distribution of ϑ. Then Π(· | X_1, …, X_n) is consistent (a.s. consistent) iff π_j > 0 for j = 1, …, k.
Proof. Let p(·, θ) denote the density or frequency function of X_1. The necessity of the condition is immediate because π_j = 0 for some j implies that π(θ_j | X_1, …, X_n) = 0 for all X_1, …, X_n because, by (1.2.8),

π(θ_j | X_1, …, X_n) = P[ϑ = θ_j | X_1, …, X_n] = π_j ∏_{i=1}^n p(X_i, θ_j) / Σ_{a=1}^k π_a ∏_{i=1}^n p(X_i, θ_a).   (5.5.5)
Intuitively, no amount of data can convince a Bayesian who has decided a priori that θ_j is impossible. On the other hand, suppose all π_j are positive. If the true θ is θ_j, or equivalently ϑ = θ_j, then

log [π(θ_a | X_1, …, X_n) / π(θ_j | X_1, …, X_n)] = n ( (1/n) log(π_a/π_j) + (1/n) Σ_{i=1}^n log [p(X_i, θ_a)/p(X_i, θ_j)] ).

By the weak (respectively strong) LLN, under P_{θ_j},

(1/n) Σ_{i=1}^n log [p(X_i, θ_a)/p(X_i, θ_j)] → E_{θ_j}( log [p(X_1, θ_a)/p(X_1, θ_j)] )

in probability (respectively a.s.). But E_{θ_j}( log [p(X_1, θ_a)/p(X_1, θ_j)] ) < 0, by Shannon's inequality, if θ_a ≠ θ_j. Therefore,

log [π(θ_a | X_1, …, X_n) / π(θ_j | X_1, …, X_n)] → −∞

in the appropriate sense, and the theorem follows. □
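A small simulation makes the content of Theorem 5.5.1 concrete. The sketch below uses Bernoulli(θ) observations and a three-point parameter set with a uniform prior (so all π_j > 0, as the theorem requires); these specific choices are assumptions made only for the illustration. The posterior mass concentrates on the true θ as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)
thetas = np.array([0.2, 0.5, 0.8])        # finite parameter set Theta
prior = np.array([1/3, 1/3, 1/3])         # all pi_j > 0
true_theta = 0.5
x = rng.binomial(1, true_theta, size=2000)
for n in [10, 100, 1000, 2000]:
    s = x[:n].sum()
    loglik = s * np.log(thetas) + (n - s) * np.log(1 - thetas)
    post = prior * np.exp(loglik - loglik.max())   # subtract max for numerical stability
    post /= post.sum()
    print(n, np.round(post, 4))                    # mass piles up on theta = 0.5
```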
Remark 5.5.1. We have proved more than is stated. Namely, that for each θ ∈ Θ, P_θ[ϑ ≠ θ | X_1, …, X_n] → 0 exponentially.
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:

Asymptotic normality of the posterior distribution

Under conditions A0–A6 for ρ(x, θ) = −l(x, θ) = −log p(x, θ), we showed in Section 5.4 that if θ̂ is the MLE,

L_θ(√n(θ̂ − θ)) → N(0, I^{-1}(θ)).   (5.5.6)
Consider L(√n(ϑ − θ̂) | X_1, …, X_n), the posterior probability distribution of √n(ϑ − θ̂(X_1, …, X_n)), where we emphasize that θ̂ depends only on the data and is a constant given X_1, …, X_n. For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in P_θ probability by convergence a.s. P_θ. We also add,

A7: For all θ, and all δ > 0, there exists ε(δ, θ) > 0 such that

P_θ[ sup { (1/n) Σ_{i=1}^n [l(X_i, θ') − l(X_i, θ)] : |θ' − θ| ≥ δ } ≤ −ε(δ, θ) ] → 1.

A8: The prior distribution has a density π(·) on Θ such that π(·) is continuous and positive at all θ.

Remarkably,

Theorem 5.5.2 ("Bernstein–von Mises"). If conditions A0–A3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold, then

L(√n(ϑ − θ̂) | X_1, …, X_n) → N(0, I^{-1}(θ))   (5.5.7)

a.s. under P_θ for all θ.

We can rewrite (5.5.7) more usefully as

sup_x |P[√n(ϑ − θ̂) ≤ x | X_1, …, X_n] − Φ(x√I(θ))| → 0   (5.5.8)

for all θ, a.s. P_θ and, of course, the statement holds for our usual and weaker convergence in P_θ probability also. From this restatement we obtain the important corollary.

Corollary 5.5.1. Under the conditions of Theorem 5.5.2,

sup_x |P[√n(ϑ − θ̂) ≤ x | X_1, …, X_n] − Φ(x√I(θ̂))| → 0   (5.5.9)

a.s. P_θ for all θ.
Remarks
(1) Statements (5.5.4) and (5.5.7)–(5.5.9) are, in fact, frequentist statements about the asymptotic behavior of certain function-valued statistics.

(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in P_θ probability if A4 and A5 are used rather than their strong forms; see Problem 5.5.7.

(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and identifiability guarantees consistency of θ̂ in a regular model.

Proof We compute the posterior density of .,fii(O  B) as
(5.5.10)
where en = en(X!, . .. ,Xn) is given by

Divide top and bottom of (5.5.10) by
;,'
II7
1 p(Xi ,
B) to obtain
(5.5.11)

where l(x,B)
= 10gp(x,B) and
,
,
We claim that
for all B. To establish this note that (a) sup { 11" + 1I"(B) : ItI < M} tent and 1T' is continuous. (b) Expanding, (5.5.13)
(e In) 
~ 0 a.s.
for all M because
eis a.s. consis
I
I
1
i
1 ! , ,
I
p
J
where
Ie  Bit)1 < )n.
We use I:~
1
g~ (Xi, e) ~ 0 here. By A4(a.s.), A5(a.s.),
1
n
sup { n~[}B,(Xi,B'(t))n~[}B,(Xi,B):ltl<M
In [}'I
[}'l
}
~O,
for all M, a.s. Po. Using (5.5.13). the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),
Po
[dnqn(t)~1f(B)exp{Eo:;:(Xl,B)~}
forallt] =1.
(5.5.14)
Using A6 we obtain (5.5.12).
Now consider
dn =
I:
r
+y'n
1f(e+
;")exp{~I(Xi,9+;,,) 1(Xi,e)}ds
(5.5.15)
dnqn(s)ds
J1:;I<o,fii
J
1f(t) exp
{~(l(Xi' t) 1(X
i,
9)) } l(lt 
el > o)dt
By AS and A7,
Po [sup { exp
{~(l(Xi,t) 1(Xi , e») } : It  el > 0} < e"'("O)] ~ 1
(5.5.16)
for all 0 so that the second teon in (5.5.14) is bounded by y'ne"'("O) ~ 0 a.s. Po for all 0> O. Finally note that (Problem 5.5.4) by arguing as for (5.5.14), tbere exists o(B) > 0 such that
Po [dnqn(t) < 21f(8) exp {~ Eo (:;: (Xl, B))
By (5.5.15) and (5.5.16), for all 0
~}
for all It 1 < 0(8)y'n]
~ I.
(5.5.17)
> 0,
(5.5.18)
Po [dn 
r dnqn(s)ds ~ 0] = I. J1:;I<o,fii
exp {_ 8'I(B)} ds
2
Finally, apply the dominated convergence theorem, Theorem B.7.5, to dnqn(sl(lsl < 0(8)y'n)), using (5.5.14) and (5.5.17) to conclude that, a.s. Po,
d ~ 1f(B)
n
r= L=
= 1f(8)v'21i'.
JI(B)
(5.5.19)
Hence, a.s. P_θ,
qn(t) ~ V1(e)<p(tvI(e))
where r.p is the standard Gaussian density and the theorem follows from Scheffe's Theorem B.7.6 and Proposition B.7.2. 0 Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N{ (), ( 2 ) distribution with a 2 known and we put aN ('TJ, 7 2 ) prior on 8. Then the posterior
distribution of8 isN(Wln7J!W2nX,
(~I r12)1) where
,,2
W2n
.,
I
• •
WIn
= nT 2 +U2'
= !WIn
(5.5.20)
,
'"
,
r
, .,,
I
Evidently, as n → ∞, w_{1n} → 0, X̄ → θ a.s. if ϑ = θ, and (1/τ² + n/σ²)^{-1} → 0. That is, the posterior distribution has mean approximately X̄ and variance approximately 0 for n large, or equivalently the posterior is close to point mass at θ, as we expect from Theorem 5.5.1. Because θ̂ = X̄, √n(ϑ − θ̂) has posterior distribution

N(√n w_{1n}(η − X̄), n(1/τ² + n/σ²)^{-1}).

Now, √n w_{1n} = O(n^{-1/2}) = o(1) and n(1/τ² + n/σ²)^{-1} → σ² = I^{-1}(θ), and we have directly established the conclusion of Theorem 5.5.2.
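A quick numerical check of this example: using the exact posterior N(w_{1n}η + w_{2n}X̄, (1/τ² + n/σ²)^{-1}), the sketch below verifies that the posterior mean merges with X̄ and that n times the posterior variance tends to σ² = I^{-1}(θ). The particular θ, σ², prior mean η, and prior variance τ² are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma2 = 1.0, 2.0            # data: N(theta, sigma2), sigma2 known
eta, tau2 = 0.0, 1.0                # prior: theta ~ N(eta, tau2)
for n in [10, 100, 1000, 10000]:
    xbar = rng.normal(theta, np.sqrt(sigma2 / n))     # sufficient statistic X-bar
    w1 = sigma2 / (n * tau2 + sigma2)                 # weight w_{1n} on the prior mean
    post_mean = w1 * eta + (1 - w1) * xbar
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    # first number -> 0, second -> sigma2 = 1/I(theta)
    print(n, round(post_mean - xbar, 5), round(n * post_var, 4))
```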
Example 5.5.2. Posterior Behavior in the Binomial–Beta Model. (Example 3.2.3 continued). If we observe S_n with a binomial, B(n, θ), distribution, or equivalently we observe X_1, …, X_n i.i.d. Bernoulli(1, θ), and put a beta, β(r, s), prior on θ, then, as in Example 3.2.3, ϑ has posterior β(S_n + r, n + s − S_n). We have shown in Problem 5.3.20 that if U_{a,b} has a β(a, b) distribution, then as a → ∞, b → ∞,

[(a + b)³ / ab]^{1/2} (U_{a,b} − a/(a + b)) → N(0, 1).   (5.5.21)

If 0 < θ < 1 is true, S_n/n → θ a.s., so that S_n + r → ∞ and n + s − S_n → ∞ a.s. P_θ. By identifying a with S_n + r and b with n + s − S_n, we conclude after some algebra that, because θ̂ = X̄,

√n(ϑ − X̄) → N(0, θ(1 − θ))

a.s. P_θ, as claimed by Theorem 5.5.2. □
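The beta-binomial calculation of Example 5.5.2 can be checked directly: the posterior of ϑ is β(S_n + r, n + s − S_n), so posterior quantiles of √n(ϑ − X̄) should be close to those of N(0, θ̂(1 − θ̂)). The prior parameters, θ, and n used below are illustrative assumptions, and scipy supplies the beta and normal quantiles.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(6)
theta, n, r, s = 0.3, 500, 2, 3               # Beta(r, s) prior; true theta and n are arbitrary
S = rng.binomial(n, theta)
theta_hat = S / n                             # MLE X-bar
post = beta(S + r, n - S + s)                 # exact posterior of theta
for q in [0.05, 0.25, 0.5, 0.75, 0.95]:
    exact = np.sqrt(n) * (post.ppf(q) - theta_hat)
    approx = norm.ppf(q, scale=np.sqrt(theta_hat * (1 - theta_hat)))
    print(f"q={q:.2f}  posterior quantile={exact: .3f}  N(0, 1/I) quantile={approx: .3f}")
```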
Bayesian optimality of optimal frequentist procedures and frequentist optimality of
Bayesian procedures
•
Theorem 5.5.2 has two surprising consequences. (a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.
(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions. As an example of this phenomenon consider the following.
~
Theorem 5.~.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let B be the MLE ofB and let B* be the median ofthe posterior distribution ofB. Then
(i)
(5.5.22)
a.s. Pe for all
e. Consequently,
~, _ I ~ 1 az. I (e)ae(X" e) +op,(n 1/2 ) e  e+ n L.
l=l
(5.5.23)
and LO( .,fii(rr  e)) ~ N(o, rl(e)).
(ii)
(5.5.24)
E( .,fii(111 
el11111 Xl,'"
,Xn) = mjn E(.,fii(111  dl 
1(11) 1Xl.··· ,Xn) + op(I).
(5.5.25)
Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions In (e, d) = .,fii(18 dlIell· But the Bayes estimatesforl n and forl(e, d) = 18dl must agree whenever E(11111 Xl, ... , Xn) < 00. (Note that if E(1111 I Xl, ... , X n ) = 00, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.l4.22)
sup IP[.,fii(O  e)
< x I Xl,'" ,Xu) 1>(xy'""'I(""'e))1 ~
Oa.s. Po.
(5.5.26)
But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.1 I). Thus, any median of the posterior distribution of .,fii(11  e) tends to 0, the median of N(O, II (~)), a.s. Po. But the median of the posterior of .,fii(0  (1) is .,fii(e'  e), and (5.5.22) follows. To prove (5.5.24) note that
~ ~ ~
and, hence, that
E(.,fii(IOelll1e'l) IXl, .. ·,Xn) < .,fiile
e'l ~O
(5.5.27)
a.s. Po, for all B. Because a.s. convergence Po for all B implies. a.s. convergence P (B.?). claim (5.5.24) follows and, hence,
E( .,fii(10 
01 101) I h ... , Xn)
= E( .,fii(10 
0'1  101) I X" ... , X n ) + op(I).
(5.5.28)
Because, by Problem 1.4.7 and Proposition 3.2.1, θ* is the Bayes estimate for l_n(θ, d), (5.5.25) and the theorem follow. □

Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
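As a numerical companion to Theorem 5.5.3(i), the following sketch compares the posterior median θ* with the MLE θ̂ = X̄ in the beta-binomial setting of Example 5.5.2 and shows that √n|θ* − θ̂| becomes negligible as n grows, consistent with (5.5.22). The prior parameters and the value of θ are illustrative choices only.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(7)
theta, r, s = 0.4, 2, 2
for n in [50, 500, 5000, 50000]:
    S = rng.binomial(n, theta)
    mle = S / n
    post_median = beta(S + r, n - S + s).median()   # theta*, median of the posterior
    print(n, np.sqrt(n) * abs(post_median - mle))   # -> 0 as n grows
```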
Bayes credible regions
~
There is another result illustrating that the frequentist inferential procedures based on f) agree with Bayesian procedures to first order.
Theorem 5.5.4. Suppose the conditions of Theorem 5.5.2 are satisfied. Let

C_n = {θ : π(θ | X_1, …, X_n) ≥ c_n},

where c_n is chosen so that Π(C_n | X_1, …, X_n) = 1 − α, be the Bayes credible region defined in Section 4.7. Let I_n(γ) be the asymptotically level 1 − γ optimal interval based on θ̂, given by

I_n(γ) = [θ̂ − d_n(γ), θ̂ + d_n(γ)],

where d_n(γ) = z(1 − γ/2)/√(nI(θ̂)). Then, for every ε > 0 and all θ,

P_θ[I_n(α + ε) ⊂ C_n(X_1, …, X_n) ⊂ I_n(α − ε)] → 1.   (5.5.29)
The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of √n(ϑ − θ̂) converges to the N(0, I^{-1}(θ)) density uniformly over compacts for each fixed θ, is sketched in Problem 5.5.6. The message of the theorem should be clear: Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis, both in this case and in estimation, reveals that any approximations to Bayes procedures on a scale finer than n^{-1/2} do involve the prior. A particular choice, the Jeffreys prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n^{-1} order (see Schervish, 1995).
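To see the message of Theorem 5.5.4 numerically, the sketch below compares a level 1 − α Bayes credible interval with the Wald interval θ̂ ± z(1 − α/2)√(θ̂(1 − θ̂)/n) in the beta-binomial model. For simplicity it uses the equal-tailed credible interval rather than the highest posterior density region of Section 4.7; this substitution, and the particular prior and sample size, are assumptions of the illustration, not part of the theorem.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(8)
theta, n, r, s, gamma = 0.3, 1000, 1, 1, 0.05
S = rng.binomial(n, theta)
theta_hat = S / n
post = beta(S + r, n - S + s)
bayes = post.ppf([gamma / 2, 1 - gamma / 2])          # equal-tailed credible interval
half = norm.ppf(1 - gamma / 2) * np.sqrt(theta_hat * (1 - theta_hat) / n)
wald = np.array([theta_hat - half, theta_hat + half])
print("credible:", np.round(bayes, 4), " Wald:", np.round(wald, 4))   # nearly identical endpoints
```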
Testing
Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of θ_0 given X_1, …, X_n if H is false is of a different magnitude than the p-value for the same data. For more on this so-called Lindley paradox see Berger (1985) and Schervish (1995). However, if instead of considering hypotheses specifying one point θ_0 we consider indifference regions where H specifies [θ_0 + Δ, ∞) or (θ_0 − Δ, θ_0 + Δ), then Bayes and frequentist testing procedures agree in the limit. See Problem 5.5.2.

Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a priori possible. Second, we established
the so-called Bernstein–von Mises theorem, actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the so-called Bernstein–von Mises theorem and frequentist confidence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose Xl, ... , X n are i.i.d. as X ous case density.
rv
F, where F has median F 1 (4) and a continu
(a) Show that, if n
= 2k + 1,
, Xn)
EFmed(X),
n (
_
~
)
l'
k (1  t)kdt F' (t)t
EF
med (X"
2
,Xn )
n( 2;) [1P'(t)f tk (1t)k dt
= 1, 3,
(b) Suppose F is unifonn, U(O, 1). Find the MSE of the sample median for n and 5.
2. Suppose Z ~ N(I', 1) and V is independent of Z with distribution X;'. Then T
Z/
=
(~)!
is said to have a noncentral t distribution with noncentrality J1 and m degrees
of freedom. See Section 4.9.2. (a) Show that
where fm(w) is the x~ density, and <P is the nonnal distribution function.
(b) If X" ... ,Xn are i.i.d.N(I',<T2 ) show that y'nX /
(,.~, L:(Xi
_X)2)! has a
noncentral t distribution with noncentrality parameter .fiiJ1/IT and n  1 degrees of freedom. (c) Show that T 2 in (a) has a noncentral :FI,m distribution with noncentrality parameter J12. Deduce that the density of T is
p(t)
= 2L
i=O
00
P[R = iJ . hi+,(f)[<p(t 1')1(t > 0)
+ <p(t + 1')1(t < 0)1
where R is given in Problem B.3.12.
, " I I. ,
Hint: Condition on |T|.
3. Show that if P[|X| ≤ 1] = 1, then Var(X) ≤ 1 with equality iff X = ±1 with probability ½.
Hint: Var(X) ≤ EX².
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and f. through ..jiif..
(a) Show that the ratio of the Hoeffding function h( VilE) to the Chebychev function e( Jii€) tends to as Jii€ ~ 00 so that he) is arbitrarily better than en in the tails.
°
I,.,'
i~
(b) Show that the normal approximation 24> (
V;€)  1 gives lower results than h in
00.
,
the tails if P[lXI < 1] = 1 because, if ,,2 < 1. 1  <p(t)  <p(t)lt as t ~ Note: Hoeffding (1963) exhibits better bounds for known a 2 .
R has .\(0) ~ 0, is bounded, and has a hounded second derivative .\n. Show that if Xl, ... , X n are i.i.d., EX l = f.L and Var Xl = 02 < 00, then
5. Suppose.\ : R
~
E.\(X 
1') = .\'(0)
;'/!; +
0
(~)
as n >
00.
= E>.'(O)JiiIX  1'1 + E (";' (X  I')(X I'?) where IX  1'1 < Ix  1'1· The last term is < suPx I>." (x) 1,,2 In and the first tends to
Hint: JiiE(.\(IX 1,1) .\(0))
I
",
1! "
.\'(0)" f== Izl<p(z)dz by Remark B.7. 1(2).
Problems for Section 5.2
1. Using the notation of Theorern 5.2.1, show that
"
2. Let X" ... ,Xn be ij,d. N(I',,,2), Show that for all n
Sup p(•• u) [IX
u
>
1, all €
>
°
/,1 > ,] ~ 1.
Hint: Let (J
)
00.
•
3, Establish (5.2.5). Hint: Iiln  q(p)1
> € =} IPn  pi > w (€).
l
J
4. Let (Ui , V;), 1 < i < n, be i.i.d.  PEP.
(a) Let y(P)
= PIU, > 0, V, > OJ. Show that if P = N(O, 0,1,1, p), then
p
~ sin21l' (Y(P)  ~).
2. Hint: K can be taken as [A.n 1 (XiJ. ".e)1 : e' E 5'(e.(ii) holds. inf{p(X.lJ. N (/J. Show that the maximum contrast estimate 8 is consistent. eo) : e' E S(O.\} . 7. Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent.. e') .14)(i). by the basic property of maximum contrast estimates. e)1 (ii) Eo. 6. eo) : Ie  : Ie . . (Wald) Suppose e ~ p( X.e'l < . (a) Show that condition (5. eo)} : e E K n {I : 1 IB . Hint: sup 1. e) .i.sup{lp(X.. Or of sphere centers such that K Now inf n {e: Ie .l)2 _ + tTl = o:J.o)} =0 where S( 0) is the 0 ball about Therefore. e) is continuous.\} c U s(ejJ(e j=l T J) {~ t n .p(X. inf{p(X. .p(X.p(X. .2.2 rII. . (i).eol > ..eol > . {p(X" 0) . Eo.Section 5.2.6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution. then is a consistent estimate of p.•. and < > 0 there is o{0) > 0 such that e. (c) Suppose p{P) is defined generally as Covp (U. . VarpUVarpV > O}. 5. e') . Suppose Xl. for each e eo. Prove that (5.14)(i) add (ii) suffice for consistency.8) fails even in this simplest case in which X ~ /J is clear. and the dominated convergence theorem. n Lt= ". i (e))} > <.p(Xi . By compactness there is a finite number 8 1 . A].e')p(X. 05) where ao is known. V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00. (Ii) Show that condition (5. 5_0  lim Eo.(eo)} < 00.lO)2)) . sup{lp(X. Hint: From continuity of p. t e.~ .d. where A is an arbitrary positive and finite constant. eol > A} > 0 for some A < 00.Xn are i. e E Rand (i) For some «eo) >0 Eo. (1 (J.
L. in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x. 1 + . < < 17 < 1/<..9) in the exponential model of Example 5.1. • • (i) Suppose Xf. L:~ 1(Xi .d. Establish Theorem 5..348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi.3.i. L:!Xi .3 I.O'l n i=l p(X"Oo)}: B' E S(Bj. J (iii) Condition on IXi  X. = 1.3. ~ . Establish (5. II I .X.) 2] . 8.. . Hint: See part (a) of the proof of Lemma 5. i .. 2. ..I'lj < EIX .X'i j . Hint: Taylor expand and note that if i . li .[.3.3... Show that the log likelihood tends to 00 as a + 0 and the condition fails.)2]' < t Mjn~ E [. for some constants M j . and if CI."d' k=l d < mm+1 L d ElY. Extend the result of Problem 7 to the case () E RP. . < > OJ. The condition of Problem 7(ii) can also fail. (72).xW) < I'lj· 3.. n. Establish (5. . P > 1. I 10. Establish (5. . and let X' = n1EX. i .d. en are constants. then by Jensen's inequality. and take the values ±1 with probability ~. For r fixed apply the law of large numbers.2. 4. . . Then EIX . . with the same distribution as Xl.Xn but independent of them.O(BJll}.3. Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K. . + id = m E II (Y.. Problems for Section 5. .X.1. < Mjn~ E (. 1/(J.l m < Cm k=I n.i.3.k / 2 . Let X~ be i.. . . 9..X~ are i.. .1'. Compact sets J( can be taken of the form {II'I < A.d.)I' < MjE [L:~ 1 (Xi . (ii) If ti are ij..11). .3) for j odd as follows: ..
then (sVaf)/(s~/a~) has an .i.d. . theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~. ~ iLli I IX. respectively. )=1 5. 1 < j < m..Xn1 bei..pli ~ E{IX.1 and 'In = n2 .l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck.1) +E{IX. j = 1."" X n be i.a~d < [max(al"" . X.Section 5.) I : i" . (b) Show that when F and G are nonnal as in part (a).r'j".. L~=1 i j = m then m a~l.m 00. j > 2.\k for some .a as k t 00.Tn) + 1 . 2 = Var[ ( ) /0'1 ]2 . . s~ ~ (n2 . ~ xl". G. where s1 = (n. (a) Show that if F and G are N(iL"af) and N(iL2.i. under H : Var(Xtl ~ Var(Y. . P(sll s~ < ~ 1~ a as k ~ 00. i. ~ iLl' I IXli > liLl}P(lXd > 11.andsupposetheX'sandY's are independent. km K. .ad)]m < m L aj j=1 m <mmILaj. PH(st! s~ < Ck. 6. . 1)'2:7' . I < liLl}P(IXd < liLl)· 7. Show that ~ sup{ IE( X. n} EIX.m with K. Show that if m = . R valued with EX 1 = O. replaced by its method of moments estimate.\ > 0 and = 1 + JI«k+m) ZIa..(X..aD. ~ iLli < 2 i EIX. Show that under the assumptions of part (c).d.28. .m be Ck..3.c> ..1)'2:7' .m distribution with k = nl . (d) Let Ck. if a < EXr < 00./LI = E(Xt}.I. LetX1"".d.. . Let XI. Establish 5. 0'1 = Var(Xd· Xl /LI Ck. 8.(l'i .i. FandYi""'Yn2 bei.. Show that if EIXlii < 00.).I) Hint: By the iterated expectation theorem EIX. . then EIX.Fk.m) Then.6 Problems and Complements 349 Suppose ad > 0..
N.p) ~ N(O. TN i...\ < 1.d.1 · /.a as k . i (b) Suppose Po is such that X I = bUi+t:i. "i. t N }.".p) ~ N(O. . . . 0 < .p') ~ N(O.XN} or {(uI. .2< 7'.3. Var(€.1'2)1"2)' such that PH(SI!S~ < Qk. . (UN. "1 = "2 = I.". = (1.. ." .+. = = 1. 4p' (I .A») where 7 2 = Var(XIl.l EXiYi . Without loss of generality.iL) • . i = I. Show that under the assumptions of part (e). 1). ' .3. ! .x) ~ N(O. Wnte (l 1pP .I.. n." •.m (depending on ~'l ~ Var[ (X I .. Consider as estimates e = ') (I X = x. + I+p" ' ) I I I II.4.1. jn(r . (I _ P')').. Po. H En. ._ J . N where the t:i are i. (b) Use the delta method for the multivariate case and note (b opt  b)(U . that is. = ("t . and if EX. then jn(r .p).. .t .p')') and. . (b) If (X.m) + 1 . JLl = JL2 = 0.00. from {t I.i. ..I).a: as k + 00. (iJi .~) (X .lIl) + 1 . (a) If 1'1 = 1'2 ~ 0. .d.6...XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters.I))T has the same asymptotic distribution as n~ [n. iik. In survey sampling the modelbased approach postulates that the population {Xl.. use a normal approximation to tind an approximate critical value qk. then jn(r' . • . Instead assume that 0 < Var(Y?) < x. (a) jn(X .1. . Tin' which we have sampled at random ~ E~ 1 Xi. . In particular.1 that we use Til.(I_A)a2).l EY? . then P(sUs~ < qk. i = 1. .xd. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3. n.1JT..2" ( 11. there exists T I . () E such that T i = ti where ti = (ui.m with KI and K2 replaced by their method of moment estimates..m (0 Let 9.xd.. (iJl.4.. jn((C . Under the assumptions of part (c). p).350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g.\ 00. < 00 (in the supennodel). • 10. Et:i = 0. show that i' ~ . if ~ then . • op(n').XC) wHere XC = N~n Et n+l Xi. (cj Show that if p = 0. 7 (1 . . suppose in the context of Example 3. . . if 0 < EX~ < 00 and 0 < Eyl8 < 00.6. Y) ~ N(I'I.. as N 2 t Show that. Without loss of generality. Show that jn(XRx) ~N(0. In Example 5. . be qk.I' ill "rI' and "2 = Var[( X.1 EX.Ii > 0.+xn n when T. . Hint: Use the centra] limit theorem and Slutsky's theorem. . to estimate Ii 1 < j < n.p. .) =a 2 < 00 and Var(U) Hint: (a) X .. if p i' 0.i. . . err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5. ~ ] .Il2. suppose i j = j.
i'b)(Yc . Here x q denotes the qth quantile of the distribution. It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes.. is known as Fisher's approximation. The following approximation to the distribution of Sn (due to Wilson and Hilferty. each with HardyWeinberg frequency function f given by . . X = XO. .3..6) to explain why... Hint: If U is independent of (V. n = 5. Let Sn have a X~ distribution. Suppose XI. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x .s. (h) Deduce fonnula (5.. E(WV) < 00. X n are independent.12). X. !) distribution. 1931) is found to be excellent Use (5.6 Problems and Complements 351 12. then E(UVW) = O.3. EU 13..90. 14.14 to explain the numerical results of Problem 5.3. W). (a) Show that the only transformations h that make E[h(X) .3' Justify fonnally j1" variance (72.3 < 00. Suppose X I I • • • l X n is a sample from a peA) distrihution. (a) Suppose that Ely. (b) Let Sn . .99.: .i'a)(Yb . (h) Use (a) to justify the approximation 17.14). (a) Show that if n is large..Section 5.3.i'c) I < Mn'. and Hint: Use (5. Show that IE(Y. . (a) Use this fact and Problem 5.E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d..25.. Suppose X 1l ...13(c).10.y'ri has approximately aN (0. . = 0. Normalizing Transformation for the Poisson Distribution.. X~. This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl). 15. X n is a sample from a population with mean third central moment j1. 16.3..n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO.
a) Hmt: Use Bm. 1'2)h2(1'1. Cov(X. 1'2) = h. (b) Find an approximation to P[JX' < t] in terms of 0 and t.. .. Show that the only variance stabilizing transformation h such that h(O) = 0. Variance Stabilizing Transfo111U1tion for the Binomial Distribution.Xm . .. Show that ifm and n are both tending to oc in such a way that m/(m + n) > a.1'2)]2 177 n +2h.h(l'l.. 1'2)(X .y).. h(l) = I. Yn are < < . 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t. where I' = (a) Find an approximation to P[X < E(X. and h'(t) > for all t. Justify fonnally the following expressions for the moments of h(X. v'a(l. (e) What is the approximate distribution of Vr'( X . Y) where (X" Yi).n  m/(m + n») < x] 1 > 'li(x).I'll + h2(1'1.n P . (a) (b) Var(h(X. if m/(m + n) . (Xn . where I I h. E(Y) = 1'2. Let X I. Y) .1') + X 2 . 0< a < I. which are integers. Bm. Var(X) = 171. X n be the indicators of n binomial trials with probability of success B. 20. Y" . 1'2) Hint: h(X.a tends to zero at the rate I/(m + n)2. + (mX/nY)] where Xl. I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1. [vm +  n (Bm. then I m a(l.. 1'2)pC7.(x.n tends to zero at the rate I/(m + n)2. ° . .n = (mX/nY)i1 independent standard eXJX>nentials.y) = a ayh(x.y) = a axh(x. . (1'1. i 21.y). . V)) '" . 172 + [h 2(1'l. (1'1. Var(Y) = C7~. Let then have a beta distribution with parameters m and n. h 2 (x. is given by h(t) = (2/1r) sin' (Yt).2.. Y) ~ p!7IC72. 1'2)]2C7n + O(n 2) i . Yn) is a sample from a bivariate population with E(X) ~ 1'1...a) E(Bmn ) = . 1'2)(Y  + O( n '). • j b _ . . .)? 18.5 that under the conditions of the previous problem. 19.• • • ~{[hl (1". Var Bmnm + n +Rmn = 'm+n ' ' where Rm. Show directly using Problem B.
)] is asymptotically distributed (b) Use part (a) to show that when J1. < t) = p(Vi < (b) When J1.. (a) When J1. (e) Fiud the limiting laws of y'n(X . Show that Eh(X) 3 h(J1.J1. Compare your answer to the answer in part (a).6 Problems and Complements 353 22. _.)1J1. xi 25.4 + 3o.J1.l) = o. ° while n[h(X h(J1.l = ~. = ~. . find the asymptotic distribution of y'n(T.) 23. (a) Show that y'n[h(X) .)2.2) using the delta method. .).. Let Xl>"" X n be a sample Irom a population with and let T = X2 be an estimate of 112 .J1.X2 .(1.) + ": + Rn where IRnl < M 1(J1. and that h(1)(J.l and variance (J2 < Suppose h has a second derivative h<2) continuous at IJ. XI. = 0.)] !:. X n is a sample from a population and that h is a realvalued function 01 X whose derivatives of order k are denoted by h(k).. Suppose that Xl. Let Xl. J1. k > I. Let Sn "' X~. Usc Stirling's approximation and Problem 8.2V with V "' Give an approximation to the distribution of X (1 .)] ~ as ~h(2)(J1. n[X(I .l4 is finite.4 to give a directjustification of where R n / yin 0 as in n Recall Stirling's approximation: + + 00.2. ()'2 = Var(X) < 00... (It may be shown but is not required that [y'nR" I is bounded. 24.Section 5. .) + ~h(2)(J1. find the asymptotic distrintion of nT nsing P(nT y'nX < Vi)..J1.2)/24n2 Hint: Therefore.J1.h(J1.X) in tenns of the distribution function when J. . 1 X n be a sample from a population with mean J.)2 and n(X .3 1/6n2 + M(J1. xi. Suppose IM41 (x)1 < M for all x and some constant M and suppose that J. = E(X) "f 0.2V where V ~ 00.
I i t I .o. (a) Show that P[T(t. = = az. Yd. " Yn as separate samples and use the twosample t intervals (4. . if n}.t. 1 X n1 and Y1 . . Hint: Use (5. 1 Xn.9. Assume that the observations have a common N(jtl.1/ S D where D and S D are as in Section 4.JTi times the length of the onesample t interval based on the differences.\)a'4)/(1 . Show that if Xl" . Let n _ 00 and (a) Show that P[T(t. .9.a. 27. the interval (4. . 0 < . and a~ = Var(Y1 ). (c) Apply Slutsky's theorem.I I (a) Show that T has asymptotically a standard Donnal distribution as and I I .3) and the intervals based on the pivot ID . . Suppose Xl.. Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X ..9. Yn2 are as in Section 4.33) and Theorem 5.. Hint: (a). 'I :I ! t.\a'4)I').4. n2 + 00..Jli = ~. that E(Xt) < 00 and E(Y. . .(~':~fJ').9. .') < 00. .\ul + (1 ~ or al .. E In] > 1 .t. . Let T = (D .3) has asymptotic probability of coverage < 1 .9.9. Suppose that instead of using the onesample t intervals based on the differences Vi .(j ' a ) N(O.. XII are ij. the intervals (4. 28. then limn PIt. Suppose nl + 00 . n2 + 00. 2pawz)z(1  > 0 and In is given by (4. 29. al = Var(XIl.9. cyi .3. . and SD are as defined in Section 4.3). . (c) Show that if IInl is the length of the interval In. Viillnl ~ 2vul + a~z(l . N(Pl (}2).•.9. p) distribution.\ < 1. We want to obtain confidence intervals on J1. We want to study the behavior of the twosample pivot T(Li. . .2a 4 ). I . (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment.3. .. .Xi we treat Xl.4. so that ntln ~ . Suppose (Xl.3).. . (d) Make a comparison of the asymptotic length of (4.) (b) Deduce that if.\.}.354 26. whereas the situation is reversed if the sample size inequalities and variance inequalities agree.\ probability of coverage.3 independent samples with III = E(XIl.2 . Y1 . What happens? Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution.) < (b) Deduce that if p tl ~ <P (t [1.)/SD where D.4.3./"..\ > 1 .3) have correct asymptotic (c) Show that if a~ > al and . a~.t2. /"' = E(YIl.~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of .d.) of Example 4. b l .\.a. . Eo) where Eo = diag(a'. < tJ ~ <P(t[(.\)al + .9.
.a).3. Show that if 0 < EX 8 < 00. Tn . Hint: If Ixll ~ L. .9.3.. Y nERd are Li. X. Generalize Lemma 5. then lim infn Var(Tn ) > Var(T).xdf and Ixl is Euclidean distance. 33.I)sz/().1)1 L. Plot your results. .I k and k only. and let Va ~ (n .p. (c) Let R be the method of moment estimaie of K. Show that k ~ ex:: as 111 ~ 00 and 112 t 00.8Z = (n . X n be i. EIY.I)z(a).d. I is the Euclidean norm. .16. Show that as n t 00.z = Var(X).z are Iii = k. (d) Find or write a computer program that carries out the Welch test. x = (XI. T and where X is unifonn. then for all integers k: where C depends on d. < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X. but Var(Tn ) + 00.d.. k) be independent with Xi} ~ N(l'i.. . . (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l . .~ t (Xi .9 and 4. then P(Vn < va<) t a as 11 t 00.. . where tk(1.4.a) is the critical value using the Welch approximation. vectors and EIY 1 1k < 00. .a)) and evaluate the approximations when F is 7.£.6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4..I) + y'i«n . has asymptotic level a. ().i.3 using the Welch test based on T rather than the twosample t test based on Sn._I' Find approximations to P(Yn and P{Vn < XnI (I . Let .4.. as X ~ F and let I' = E(X). then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31. Vn = (n . (a) SupposeE(X 4 ) < 00.. It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. (a) Show that the MLEs of I'i and (). Let XI.1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" . 30." . 32.3.1). j = I. Hint: See Problems B.1.3 by showing that if Y 1.:~l IXjl. Let X ij (i = I.3.Z).Section 5. (). I< = Var[(X 1')/()'jZ. U( 1.£.z has a X~I distribution when F is theN(fJl (72) distribution.. where I. . Carry out a Monte Carlo study such as the one that led to Figure 5.X)" Then by Theorem B.
Set (. I .) Hint: Show that Ep1/J(X1  8) is nonincreasing in B. 00 (iii) I!/!I(x) < M < for all x.f. _ .0) !.(0) Varp!/!(X... (Use . n 1 F' (x) exists. (a) Show that (i). F(x) of X.i.n for t E R. .. .n(On .0) and 7'(0) .. Show lbat if I(x) ~ (d) Assume part (c) and A6.. Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique..0). .nCOn .)). I ! 1 i .. . .. Show that On is consistent for B(P) over P.1)cr'/k.p(X.p(X.p(X.Xn be i.0))  7'(0)) 0. [N(O))' . (c) Assume the conditions in (a) and (b). .. then < t) = P(O < On) = P (. . 1 X n. random variables distributed according to PEP.. . . .. (k . (c) Give a consistent estimate of (]'2.d. . Show that. .. .4 1. I i . O(P) is lbe unique solution of Ep. N(O. . aliI/ > O(P) is finite. ..\(On)] !:. 1/4/'(0)).p(Xi .O(P)) > 0 > Ep!/!(X. is continuous and lbat 1(0) = F'(O) exists.p(X...n7(0) ~[. That is the MLE jl' is not consistent (Neyman and Scott. Let i !' X denote the sample median. .n(X .p(X. . then 8' !.p( 00) as 0 ~ 00. Let Xl. Assmne lbat '\'(0) < 0 exists and lbat .p(x) (b) Suppose lbat for all PEP. \ . Use the bounded convergence lbeorem applied to . Ep.On) .. where P is the empirical distribution of Xl. (ii). Let On = B(P). N(O.. • 1 = sgn(x). Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) .p(X.0) ~ N Hint: P( ... I(X. .0). . under the conditions of (c). Show that _ £ ( .0) !:. and (iii) imply that O(P) defined (not uniquely) by Ep.... I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 .L~. ! Problems for Section 5.On) < 0). .. 1) for every sequence {On} wilb On 1 n = 0 + t/....356 Asymptotic Approximations Chapter 5 . 1948). (b) Show that if k is fixed and p ~ 00.. 1/). (e) Suppose lbat the d. N(O) = Cov(.0) = O. ..
(x) = (I . 3. Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley. 0 > 0. Show that if (g) Suppose Xl has the gross error density f.Xn ) is the MLE.20 and note that X is more efficient than X for these gross error cases.. J = 1. 82 ) Pis N(I".. Hint: Apply A4 and the dominated convergence theorem B.(x.y. 1979) which you may assume.c)'P. the asymptotic 2 .(x.9)dl"(x) = J te(1J.. Conclude that (b) Show that if 0 = max(XI.' .2.b)dl"(x) J 1J.. oj). then ep(X. and O WIth .B)) ~ [(I/O).~ for 0 > x and is undefined for 0 g~(X.B) < xl = 1.    .0) (see Section 3.a)dl"(x).(x) + <'P.0) ~ N(O.(x. evaluate the efficiency for € = .10. Hint: teJ1J. not O! Hint: Peln(B .O).(1:e )n ~ 1.O)p(x.. x E R.O)p(x.I / 2 . 4. denotes the N(O.. 'T = 4.(x.O) is defined with Pe probability I but < x. This is compatible with I(B) = 00.. Let XI. 2. 0» density. .'" . not only does asymptotic nonnality not hold but 8 converges B faster than at rate n.d.(x).Section 5. Show that assumption A4' of this section coupled with AGA3 implies assumption A4. then 'ce(n(B . (a) Show that g~ (x. lJ :o(1J.(x . ( 2 ).b)p(x. ~"'. 0) = .5) where f. X) as defined in (f).4. (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2).a)p(x.15 and 0. .0)p(x. Show that A6' implies A6.7. Find the efficiency ep(X. Show that   ep(X.i."' c . 0 «< 0.05. U(O.6 Problems and Complements 357 (0 For two estImates 0.0.O)dl"(x)) = J 1J. If (] = I. X) = 1r/2.(x.jii(Oj .X) =0..0. .  = a?/o. .exp(x/O). X n be i.O))dl"(x) ifforalloo < a < b< 00.5  and '1'. Thus. " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 .
Suppose A4'.i'''i~l p(Xi.5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.Oo) p(Xi .5 continue to hold if ."('1(8) .4.358 5. Show that the conclusions of PrQple~ 5. ) ! .4. Show that 0 ~ 1(0) is continuous.4. A2.18 . j. I I j 6.80 p(X" 80) +.80 + p(Xi . 1(80 ).p ~ 1(0) < 00. . .4) to g. under F oo ' and conclude that ... I (c) Prove (5.in. 0 ~N ~ (.~ (X."log .0'1 < En}'!' 0 .·_t1og vn n j = p( x"'o+. . I .O) 1(0) and Hint: () _ g.00) . 0 for any sequence {en} by using (b) and Polyii's theorem (A. ~ L..< ~log n p(X.14.22).i(X..50).i (x...(X. I 1 . 0). 8 ) i .i. Apply the dominated convergence theorem (B. 8 ) 0 . 8) is continuous and I if En sup {:. O.in) ~ . ) n en] .[ L:.7.In ~ "( n & &8 10gp(Xi.in) . i.O') . I .. and A6 hold for. &l/&O so that E. (8) Show that in Theorem 5. i I £"+7. .g. 80 +. Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+". ~log (b) Show that n p(Xi .In) + . 8)p(x.in) 1 is replaced by the likelihood ratio statistic 7.) P"o (X .80 + p(X .
gives B Compare with the exact bound of Example 4. (Ii) Compare the bounds in Ca) with the bound (5. Let B be two asymptotic l.14.6 and Slutsky's theorem.2. (a) Show that under assumptions (AO)(A6) for all Band (A4'). = X.6.:vell . (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ). (Ii) Suppose the conditions of Theorem 5.57) can be strengthened to: For each B E e. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 .59).4.4. B' _n + op (n 1/') 13.4. setting a lower confidence bound for binomial fJ.4. which is just (4. Let B~ be as in (5.4.a.Q" is asymptotically at least as good as B if.54) and (5.2. (a) Show that. B' nj for j = 1.4.4.6 Problems and Complements 359 8. hence.5 and 5.4. 9. We say that B >0 nI Show that 8~I and. Hint: Use Problem 5. for all f n2 nI n2 .4.Section 5. and give the behavior of (5. all the 8~i are at least as good as any competitors. if X = 0 or 1. ~ 1 L i=I n 8'[ ~ 8B' (Xi.2. Establish (5. 10.4.7.5 hold.4. Compare Theorem 4. A. Hint.61).61) for X ~ 0 and L 12.4. 11.59).4.B lower confidence bounds. at all e and (A4'). (a) Establish (5. B).3). Consider Example 4. which agrees with (4. Then (5. the bound (5.5.4. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5. (Ii) Deduce that g.56).59) coincide.4.58). ~.7). Hint: Use Problems 5.3.4. . f is a is an asymptotic lower confidence bound for fJ. Let [~11.4.
1. " " I I . Consider the Bayes test when J1. where Jl is known.L and A..=2[1  <1>( v'nIXI)] has a I U(O. 0 given Xl.A 7" 0. (a) Show that the test that rejects H for large values of v'n(X .d.2. each Xi has density ( 27rx3 A ) 1/2 {AX + J. . I ... Jl > OJ .55)..] versus K . . .)l/2 Hint: Use Examples 3. 1). (b) Suppose that I' ).21J2 2x A} . where .2. . Let X 1> . 2. the pvalue fJ . (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation. I' has aN(O. Show that (J/(J l'.t = 0 versus K : J.\ < Ao.)) and that when I' = LJ. Problems for Section 5. ! I • = '\0 versus K : . 00.\ l . By Problem 4. ! (a) Show that the posterior probability of (OJ is where m n (. > ~.11).. (d) Find the Wald test for testing H : A = AO versus K : A < AO. Hint: By (3.LJ. inverse Gaussian with parameters J. Consider testing H : j. Show that (J l'.i. 7f(1' 7" 0) = 1 .d.2.1.. LJ. N(Jt.360 1 Asymptotic Approximations Chapter 5 14. (e) Find the Rao score test for testing H : . the test statistic is a sum of i.6.4. .) has pvalue p = <1>(v'n(X .LJ. That is. 1 . phas a U(O. is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0. X > 0. variables with mean zero and variance 1(0). 1) distribution. X n i. Suppose that Xl.1 and 3. J1. the evidence I . is a given number. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : . Consider the problem of testing H : I' E [0..I)..5 ! 1. X n be i.10) and (3.•.'" XI! are ij.5.1.. N(I'..d.4.01 . (c) Suppose that I' = 8 > O. 1) distribution.\ > o.. T') distribution. 1.\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO. if H is false.i.4.d. 1. = O.i. Now use the central limit theorem. Establish (5. 1 15.riX) = T(l + nr 2)1/2rp (I+~~. That is. A exp . =f:.I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox").
P. [0.01 < 0 } continuThen 00.050 logdnqn(t) = ~ {I(O). By Theorem 5.8) ~ 0 a. yn(E(9 [ X) .034 . all .5.. Show that the posterior probability of H is where an = n/(n + I). 10 .052 100 .5.5.:. oyn.} dt < .)}..5.645 and p = 0.5.5. ~ L N(O.2 it is equivalent to show that J 02 7f (0)dO < a.1 I'> = 1.~~(:.13) and the SLLN.i It Iexp {iI(O)'.Section 5... i ~.s.029 .. L M M tqn(t)dt ~ 0 a. Hint: By (5..' " .sup{ :.17).054 50 . for all M < 00. 1) and p ~ L U(O.O') : 100'1 < o} 5..for M(.1 is not in effect.058 20 . (Lindley's "paradox" of Problem 5. (e) Verify the following table giving posterior probabilities of 1.0'): iO' .0 3. Extablish (5. 0' (i» n L 802 i=l < In n ~sup {82 8021(X i .14).046 . Apply the argnment used for Theorem 5.O'(t)). ~l when ynX = n 1'>=0. (e) Show that when Jl ~ ~.S. 1) prior.2. By (5.l has a N(O.2. 4.1.2 and the continuity of 7f( 0). Establish (5. Suppose that in addition to the conditions of Theorem 5.I). Hint: 1 n [PI .17).(Xi.i Itlqn(t)dt < J:J').5. J:J')...) sufficiently large.( Xt.042 . Fe.(Xi. Hint: In view of Theorem 5.. ifltl < ApplytheSLLN and 0 ous at 0 = O. ~ E.Oo»)+log7f(O+ J. yn(anX   ~)/ va.4.) (d) Compute plimn~oo p/pfor fJ.6 Problems and Complements 361 (b) Suppose that J.I(Xi.s.05.
I Notes for Section 5.1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 . (a) Show that sup{lqn (t) . The sets en (c) {t .". Suppose that in Theorem 5. Finally. jo I I . . . . For n large these do not correspond to distributions one typically faces. (I) If the rightband side is negative for some x.Pn where Pn a or 1. • Notes for Section 5.f. i=l }()+O(fJ) I Apply (5. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5.s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) .29). Show tbat (5.5 " I i .. A proof was given by Cramer (1946). 362 f Asymptotic Approximations Chapter 5 > O. 5.5. (0» (b) Deduce (5. I i.) by A4 and A5.7 NOTES Notes for Section 5. L . dn Finally. Fisher (1925).3 ). . .s. ! ' (2) Computed by Winston Cbow. ~ 0 a. (O)<p(tI. Fn(x) is taken to be O. for all 6. 'i roc J. all d and c(d) / in d. • .) ~ 0 and I jtl7r(t)dt < co.s. .16) noting that vlnen.!.9) hold with a.I(Xi}»} 7r(t)dt.s.) and A5(a.5.1 . (1) This famous result appears in Laplace's work and was rediscovered by S.d) for some c(d). . .. .1.S.4 (1) This result was first stated by R. A.5. von Misessee Stigler (1986) and Le Cam and Yang (1990).2 we replace the assumptions A4(a. Bernstein and R. " .8) and (5.5. . . Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d. Notes for Section 5. 7. ~ II• '. convergence replaced by convergence in Po probability.5. . I : ItI < M} O. I: .5. See Bhattacharya and Ranga Rao (1976) for further discussion. qn (t) > c} are monotone increasing in c. J I . .
" Proc. R. Proc. 700725 (1925). C. 1986... CRAMER. FERGUSON. Statistical Decision Theory and Bayesian Analysis New York: SpringerVerlag. p. FISHER. 17. I. L. J. LEHMANN. B.. Mathematical Analysis. HANSCOMB. Wiley & Sons. AND D. Monte Carlo Methods London: Methuen & Co.. RANGA RAO. HAMMERSLEY. 1985.A. HOEFFDING. F. Vol. Statisticallnjerence and Scientific Method." Proc. Cambridge University Press. W. Symp. H. 1967. 58. 1973. Mathematical Methods of Statistics Princeton.. E.. 3rd ed. SERFLING. G. C. Statist. J.. reprinted in Biometrika Tables/or Statisticians (1966). E. 1380 (1963). 1979. I Berkeley. P. J. 16. E.. "Consistent estimates based on partially consistent observations. Sci. Asymptotics in Statistics. Soc. S. R. NEYMAN. J. A. 1987. Elements of LargeSample Theory New York: SpringerVerlag. M. O. HfLPERTY.. H. AND R. 40. WILSON. Amer." Econometrica. New York: J. 1980. 1946.S. The Behavior of the Maximum Likelihood Estimator Under NonStandard Conditions. "Theory of statistical estimation. The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge. 1976.. 1958. Prob. S. 684 (1931). STIGLER. Vol. RUDIN. DAVID. Math. Camb.8 REFERENCES BERGER. Approximation Theorems of Mathematical Statistics New York: J. Tables of the Correlation Coefficient. 3rd ed. BILLINGSLEY. L. New York: McGraw Hill.Section 5.. RAo. Vth Berkeley Symposium. 1999. Editors Cambridge: Cambridge University Press. M. Wiley & Sons. M. U.. P." J. . H. Phil. LE CAM. Assoc.. R.. CA: University of California Press. Statist.. L. AND G. Theory ofPoint EstimatiOn New York SpringerVerlag. AND E. J. HUBER. W.8 References 363 5. Linear Statistical Inference and Its Applications. Pearson. 1996. R. 1938. Normal Approximation and Asymptotic Expansions New York: Wiley. SCOTT. NJ: Princeton University Press. YANG. Vth Berk. FISHER. AND G. 2nd ed. 'The distribution of chi square. T. L. SCHERVISCH.. 132 (1948). A. A CoUrse in Large Sample Theory New York: Chapman and Hall.. "Nonnormality and tests on variances:' Biometrika. N. P. CASELLA.. 1998.. Box. 1995. BHATTACHARYA.. S. LEHMANN. 1990. Theory o/Statistics New York: Springer. L. MA: Harvard University press. "Probability inequalities for sums of bounded random variables.. R. 22. 1964. Acad. E. Some Basic Concepts New York: Springer. Nat.. Probability and Measure New York: Wiley. AND M. Hartley and E. 318324 (1953).
real parameters and frequently even more semi. the fact that d.or nonpararnetric models.3). often many.3. we have not considered asymptotic inference.7.4. We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible.3). for instance.and semiparametric models. testing.1 INFERENCE FOR GAUSSIAN LINEAR MODElS • Most modern statistical questions iovol ve large data sets. confidence regions.2 and 5. and efficiency in semiparametric models. however. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for functionvalued statistics.2. tests. There is. the number of observations. [n this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates.3. with the exception of Theorems 5. multiple regression models (Examples 1.4. curve estimates. This chapter is a leadin to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non. However.6.5. in which we looked at asymptotic theory for the MLE in multiparameter exponential families.1. 365 . and prediction in such situations.8 C R d We have presented several such models already. and confidence regions in regular onedimensional parametric models for ddirnensional models {PO: 0 E 8}. 1.1) and more generally have studied the theory of multiparameter exponential families (Sections 1. and n. the modeling of whose stochastic structure involves complex models governed by several. We shall show how the exact behavior of likelihood procedures in this model correspond to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular ddimensional parametric models and shall illustrate our results with a number of important examples.1.2. the multinomial (Examples 1. an important aspect of practical situations that is not touched by the approximation. 2. the number of parameters. the properties of nonparametric MLEs. 2. 2. 2. The inequalities ofVapnikChervonenkis.Chapter 6 INFERENCE IN THE MUlTIPARAMETER CASE 6. Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II.6.2. the bootstrap. are often both large and commensurate or nearly so.3.
Example 6.3) whereZi = (Zil. n (6. say the ith case. We consider experiments in which n cases are sampled from a population. Regression. Notational Convention: In this chapter we will. . It turn~ out that these techniques are sensible and useful outside the narrow framework of model (6. .5) .. . • .1.1.. ..€nareij. .. the Zij are called the design values.1.. These are among the most commonly used statistical techniques...1 is also of the fonn (6.2. The model is Yi j = /31 + €i. 1 Yn from a population with mean /31 = E(Y). The OneSample Location Problem." . .. i = 1. .l p.1 The Classical Gaussian linear Model l' f Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Yi among n independent observations has a distribution that depends on known constants Zil. 6. In vector and matrix notation.4 and 2. let expressions such as (J refer to both column and row vectors.1. The normal linear regression model is }i = /31 + L j=2 p Zij/3j + €i. 0"2).j . .. I I.'" . N(O. " n (6.. .1.1..1.N(O..1..1.3). '.Herep= 1andZnxl = (l.a2).1. In this section we will derive exact statistical procedures under the assumptions of the model (6. Here Yi is called the response variable. .Zip.4) 0 where€I.d. . Z = (Zij)nxp.3).2J) (6. .. .. e ~N(O.1.3): l I • Example 6. are U. andJ is then x nidentity matrix. • i . i = l. .Zip' We are interested in relating the mean of the response to the covariate values.zipf. we have a response Yi and a set of p .2(4) in this framework.1. We have n independent measurements Y1 . " 'I and Y = Z(3 + e. 366 Inference in the Multiparameter Case Chapter 6 ! ..1) j . In the classical Gaussian (nannal) linear model this dependence takes the fonn p 1 Yi where EI.d.1.1 covariate measurements denoted by Zi2 • .. . .1.n (6. and Z is called the design matrix. 1 en = LZij{3j j=1 +Ei. and for each case. I The regression framewor~ of Examples 1. we write (6. Here is Example 1. i .1.6 we will investigate the sensitivity of these procedures to the assumptions of the model. .2) Ii . when there is no ambiguity.l)T. i = 1. In Section 6.
.3 we considered experiments involving the comparisons of two population means when we had available two independent samples.1. To see that this is a linear model we relabel the observations as Y1 .2) and (6. Yn1 correspond to the group receiving the first treatment.3 and Section 4.9. + n p = n.1fwe set Zil = 1. In Example I. Generally. and the €kl are independent N(O. . The random design Gaussian linear regression model is given in Example 1./.. To fix ideas suppose we are interested in comparing the performance of p > 2 treatments on a population and that we administer only one treatment to each subject and a sample of nk subjects get treatment k. If we are comparing pollution levels.6) is often reparametrized by introducing ( l ~ pl I:~~1 (3k and Ok = 13k . and so on. We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional distribution of Y given a set of observed covariate values. we are interested in qualitative factors taking on several values. where YI.. (6.5) is called the fixed design norrnallinear regression model. we arrive at the one~way layout or psample model. . 1 < k < p.3. In this case. one from each population..•. (6. .1. 0 Example 6.Yn . and so on.. /3p are called the regression coefficients. Then for 1 < j < p..1.Section 6. . we want to do so for a variety of locations. we often have more than two competing drugs to compare.1. Yn1 + 1 . The model (6.. Frequently. We treat the covariate values Zij as fixed (nonrandom). ."i = 1. n.1. then the notation (6. .6) where Y kl is the response of the lth subject in the group obtaining the kth treatment.~) random variables.6) is an example of what is often called analysis ojvariance models. If the control and treatment responses are independent and nonnally distributed with the same variance a 2 .3.1..4.1. TWosample models apply when the design values represent a qualitative factor taking on only two values. the design matrix has elements: 1 if L jl nk + 1<i < L nk k=1 j k=l ootherwise and z= o 0 ••• o o Ip where I j is a column vector of nj ones and the 0 in the "row" whose jth member is I j is a column vector of nj zeros. .Way Layout. nl + . The model (6.1 Inference for Gaussian Linear Models 367 where (31 is called the regression intercept and /32..0: because then Ok represents the difference between the kth and average treatment . · · . 13k is the mean response to the kth treatment.. Ynt +n2 to that getting the second. The pSample Problem Or One.3) applies. this terminology is commonly used when the design values are qualitative. if no = 0.
and with the parameters identifiable only once d . Even if {3 is not a parameter (is unidentifiable). We now introduce the canonical variables and means . then r is the rank of Z and w has dimension r. • I t Ew ¢:> t = L:(v[t)Vi i=l ¢:> vTt = 0. Ui = v. .. i = 1. the vector of means p of Y always is. i=I (6.2). •. Cj = (Zlj.. Note that any t E Jl:l can be written .. . Because dimw = r.3. I5 h .. We assume that n > r. . . {3* is identifiable in the pdimensional linear subspace {(3' E RP+l : L:~~l dk = O} of RP+ 1 obtained by adding the linear restriction E~=l 15k = 0 forced by the definition of the 15k '5.• znj)T. " .. . However. . the linear Y = Z'(3' + E. i . an orthononnal basis VI... When VrVj = O.17). V n for Rn such that VI. Z) and Inx I is the vector with n ones. The parameter set for f3 is RP and the parameter set for It is W = {I' = Z(3.. . TJi = E(Ui) = V'[ J1.. there exists (e. E ~N(O. j = I •.(T2J) where Z. Recall that orthononnal means Vj = 0 fori f j andvTvi = 1. . . (3 E RP}. n.! x (p+l) = (1. b I _ .. 1 JLn)T I' where the Cj = Z(3 = L j=l P .. we call Vi and Vj orthogonal. . i = T + 1. i5p )T.1. .a 2 ) is identifiable if and only ifr = p(Problem6...1..po In tenns of the new parameter {3* = (0'.. • .P. The Canonical Fonn of the Gaussian Linear Model The linear model can be analyzed easily using some geometry.. by the GramSchmidt process) (see Section B. j = 1. •• Note that w is the linear space spanned by the columns Cj. . This type oflinear model with the number of columns d of the design matrix larger than its rank r. .6jCj are the columns of the design matrix. It follows that the parametrization ({3. It is given by 0 = (itl. Let T denote the number of linearly independent Cj....p.Y. of the design matrix. . V r span w. k model is Inference in the Multiparameter Case Chapter 6 1.r additional linear restrictions have been specified.368 effects. Note that Z· is of rank p and that {3* is not identifiable for {3* E RP+l.7) !. . n. vT n i I t and that T = L(vTt)v...g. j = I.. . r i' . n. . is common in analysis of variance models. ..
1]r)T varies/reely over Rr. and then translate them to procedures for .1.Section 6. n. (T2)T. . . 0.. 1 If Cl.3. n because p.u) 1 ~ 2 n 2 . Let A nxn be the orthogonal matrix with rows vi. . . and .2 L i=l + _ '" ryiui _ 02~ i=l 1 r r 2 (6. which are sufficient for (IJ.. . . which is the guide to asymptotic inference in general.'Ii) .. Then we can write U = AY.2 Estimation 0... Un are independent nonnal with 0 variance 0..' based on Y using (6. 20. and by Theorem B.11) _ n log(21r"'). ais (v) The MLE of.10). 2 0.Cr are constants. Proof. Theorem 6.)T is sufJicientfor'l. (6.2 known (i) T = (UI •. then the MLE of 0: Ci1]i is a = E~ I CiUi. i it and U equivalent.L = 0 for i = r + 1.2. Note that Y~AIU.9 N(rli.. ..1. u) based on U £('I.2 ) using the parametrization (11. '" ~ A 1'1... (6.10) It will be convenient to obtain our statistical procedures for the canonical variables U. 2 ' " 'Ii i=l ~2(T2 6..1.2. . . IJ. = L.•• (ii) U 1 . Theorem 6.1.8)(6.1]r. where = r + 1.. r.1. 1 ViUi and Mi is UMVU for Mi..2<7' L.Ur istheMLEof1]I. We start by considering the log likelihood £('1. also UMVU for CY.2"log(21r<7 ) t=1 n 1 n _ _ ' " u. U. i i = 1. v~. . (iii) U i is the UMVU estimate of1]i.. ."...2 We first consider the known case. In the canonical/orm ofthe Gaussian linear model with 0. observing U and Y is the same thing.. .).. n...1.9) Var(Y) ~ Var(U) = . Moreover. .(Ui .1 Inference for G::'::"::"::"::"::L::'::"e::'::'::M::o::d::e'::' '" '3::6::::. . E w. . U1 .1. . is Ui = making vTit.• . 7J = AIJ.'J nxn ..2 and E(Ui ) = vi J..1.8) So. .1. i (iv) = 1. .. (3. . n.1. whereas (6. it = E. = 1.. ... •.and 7J are equivalently related. The U1 are independent and Ui 11i = 0. . while (1]1. • .
.2). there exists an orthonormal basis VI) ..2. = 1..6 to the canonical exponential family obtained from (6. {3. Assume that at least one c is different from zero. n.. . Q is UMVU. (6.3 and Example (iii) is clear because 3.11).) = '7" i = 1.. wecan assume without loss of generality that 2:::'=1 c..4. (i) By observation.. In the canonical Gaussian linear model with a 2 unknown. define the norm It I of a vector tERn by  ! It!' = I:~ .Ur'L~ 1 Ul)T is sufficient. . (6. ..Ui · If all thee's are zero. (We could also apply Theorem 2.1. . " V n of R n with VI = C = (c" . I i I .. = 0. But because L~ 1 Ui2 L:~ I Ui2 + r+l Ul.1 L~ r+l U.• EUl U? Ul Projections We next express j1 in terms of Y. 2(J i=r+l ~ U.4.1.4.. 2 2 ()j = 1'0/0. (U I .11) by setting T j = Uj .. To this end. (iv) The conclusions a/Theorem 6. . and is UMVU for its expectation E(W.3.11) is a function of 1}1.. then W . r. 1lr only through L~' 1 CUi .1. The maximizer is easily seen to be n 1 L~ r+l (Problem 6.'7..O)T E R n .10.. i = 1. Proof.1).. and .11) is an exponential family with sufficient statistic T.. .(v) are still valid.6.. . n " . .(J2J) by Theorem B. 370 Inference in the Multiparameter Case Chapter 6 I iI . i > r+ 1. (ii) The MLE ofa 2 is n. (ii) U1. apply Theorem 3. the MLE of q(9) = I:~ I c.1. . recall that the maximum of (6...1.r) 2:7 1 L~ r+l ul is an unbiased estimator of (J2.11) has fJi 1 2 = Ui . .3..4. (iv) By the invariance of the MLE (Section 2. . .) = a.) (iii) By Theorem 3.1.2. . . WI = Q is sufficient for 6 = Ct. 2:7 r+l un Tis sufficientfor (1]1. obtain the MLE fJ of fl. r.. Let Wi = vTu. By Problem 3.'  .ti· 1 . By (6. 1 U1" are the MLEs of 1}1. is q(8) = L~ 1 <. ~i = vi'TJ. (v) Follows from (iv). 1 Ur . To show (ii). Ui is UMVU for E(U. l Cr .N(~. . by observation. .. T r + 1 = L~ I and ()r+l = 1/20. 0  i Next we consider the case in which a 2 is unknown and assume n >r+ 1. By GramSchmidt orthogonalization..2 .1. "7r because. j = 1.4 and Example 3.. . I . I i > . where J is the n x n identity matrix. That is.. we need to maximize n (log 27f(J2) 2 II ". 0. ..1."7i)2 and is minimized by setting '(Ii = Ui. . " as a function of 0. (i) T = (Ul1" .1]r. (iii) 8 2 =(n . . Proof. I.2. this statistic is equivalent to T and (i) follows.1. Theorem 6. The distribution of W is an exponential family.3. To show (iv).4. and give a geometric interpretation of ii.2 .. ( 2 )T.2 (ii).
ji) = 0 and the second equality in (6. .2...14) follows.1.. ) 2 or.4.Z/3I' : /3 E W}.1. = Z/3 and (6. any linear combination ofY's is also a linear combination of U's..1.2(iv) and 6.14) (v) f3j is the UMVU estimate of (3j. IY . _. . then /3 is identifiable.1. (ii) and (iii) are also clear from Theorem r+l VjUj . and  Jii is the UMVU estimate of J.. spans w. i=l..12) ii.1.L of vectors s orthogonal to w can be written as ~ 2:7 w..n log(21f<T .12) implies ZT.' and by Theorems 6.. because Z has full rank. That is. Proof. fj = arg min{Iy .3 because Ii = l:~ 1 ViUi and Y . by (6. j = 1. The projection Yo = 7l"(Y I L. any linear combination of U's is a UMVU estimate of its expectation.. It follows that /3T (ZT s) = 0 for all/3 E RP.. Thus.1.L.13) (iv) lfp = r. . /3. note that 6. We have Theorem 6. note that the space w. ZT (Y . the MLE = LSE of /3 is unique and given by (6. To show (iv).ti.3 maximizes 1 log p(y. fj = (ZTZ)lZTji..2<T' IY .3 of. ZTZ is nonsingular.J) of a point y Yo =argmin{lyW: tEw}.1. In the Gaussian linear model (i) jl is the unique projection oiY on L<.Ii = . = Z T Z/3 and ZTji = ZTZfj and. 0 ~ .3.1. /3 = (ZTZ)l ZT.9).L = {s ERn: ST(Z/3) = Oforall/3 E RP}.3(iv).1.. . To show f3 = (ZTZ)lZTy. equivalently.Z/3 I' .1. and /3 = (ZTZll ZT 1".1. which implies ZT s = 0 for all s E w.p.1.1 Inference for Gaussian linear Models 371 E Rn on w is the point Definition 6. f3j and Jii are UMVU because.Section 6.1 and Section 2. (i) is clear because Z.. ~ The maximum likelihood estimate .jil' /(n r) (6.n.' and is given by ji (ii) jl is orthogonal to Y (iii) 8' ~ = zfj (6.1.1. the MLE of {3 equal~ the least squares estimate (LSE) of {3 defined in Example 2. <T) = ..3 E RP.
Y = il. see also Section RIO.1. w~ obtain Pi = Zi(3. The residuals are the projection of Y on the orthocomplement of wand ·i .. the ith component of the fitted value iJ. In the Gaussian linear model (i) the fitted values Y = iJ.1. . i . and I. (ii) (iii) y ~ N(J. (12(J . then (6.4. In this method of "prediction" of Yi. . n}.1. =H .14).16) J I I CoroUary 6.H). i = 1.3) is to be taken.15) Next note that the residuals can be written as ~ I .'. . L7 1 .372 Inference in the Multiparameter Case Chapter 6 Note that in Example 2." As a projection matrix H is necessarily symmetric and idempotent.1. The estimate fi = Z{3 of It is called thefitted value and € = y . The goodness of the fit is measured by the residual sum of squares (RSS) IY .Y = (J .1. the best MSPE predictor of Y if f3 is known as well as z is E(Y) = ZT{3 and its best (UMVU) estimate not knowing f3 is Y = ZT {j. (12 (ZTz) 1 ). (j ~ N(f3. .H)Y. and the residual € are independent. In statistics it is also called the hat matrix because it "puts the hat on Y..I" (12H). I. By Theorem 1. it is commOn to write Yi for 'j1i.1 we give an alternative derivation of (6. = [y..' € = Y . : H = Z(Z T Z)IZT The matrix H is the projection matrix mapping R n into w.1. 1 < i < n.1. i = 1. (iv) ifp = r. € ~ N(o. Note that by (6.2 illustrates this tenninology in the context of Example 6. (6.H)).1. Var(€) = (12(J . .14) and the normal equations (Z T Z)f3 = Z~Y.3) that if J = J nxn is the identity matrix. 1 · . There the points Pi = {31 + fhzi. n lie on the regression line fitted to the data {('" y.). 1 ~ ~ ! I ~ ~ ~ i . We can now conclude the following. moreover..12) and (6.)I are the vertical distances from the points to the fitted line.1. .ii is called the residual from this fit. That is. whenp = r. . ~ ~ ·.1. Example 2.. Taking z = Zi.1.2. the residuals €.. • . H T 2 =H.2 with P = 2.5.ill 2 = 1 q. . I . Suppose we are given a value of the covariate z at which a value Y following the linear model (6. • ~ ~ Y=HY where i• 1 .({3t + {3. 1 < i < n. H I I It follows from this and (B.
2. .1 ~ Inference for Gaussian Linear Models 373 Proof (Y.8 and (3.p. In this example the nonnal equations (Z T Z)(3 = ZY become n.3. .. The OneWay Layout (continued). We now see that the MLE of It is il = Z. . If the design matrix Z has rank P. If {Cijk _.ii = Y. Example 6.2. .1.Ill'.Section 6.j = 1. k = 1..2).Y are given in Corollary 6. Var(.8) = (J2(Z T Z)1 follows from (B.3).3.1).5. 0 ~ ~ ~ ~ ~ ~ ~ Example 6. Regression (continued).. The independence follows from the identification of j1 and € in tenns of the Ui in the theorem. joint Gaussian. .3. (not Y.1.p. . 0 Example 6. . which we have seen before in the unbiased estimator 8 2 of (72 is L:~ I (Yi ~  Yf/ Problem 1. then replacement of a subscript by a dot indicates that we are considering the average over that subscript.. 'E) is a linear transformation of U and. k = 1.ii.4...1 and Section 2. and€"= Y . and € are nonnally distributed with Y and € independent.a of the kth treatment is ~ Pk = Yk. 1=1 At this point we introduce an important notational convention in statistics. the treatments..1.8. . In the Gaussian case. in general) p k=l L and the UMVU estimate of the incremental effect Ok = {3k . One Sample (continued). i = 1...1. where n = nl + . then the MLE = LSE estimate is j3 = (ZTZr1 ZTy as seen before in Example 2.i respectively.1.1.8..1. k=I. is th~ UMVU estimate of the average effect of all a 1 p = Yk .p. Here Ji = . 0 ~ We now return to our examples. (n . a = {3.81 and i1 = {31 Y. } is a multipleindexed sequence of numbers or variables. hence.1. nk(3k = LYkl. Moreover. The error variance (72 = Var(El) can be unbiasedly estimated by 8 2 = (n _p)lIY .1. Thus.. . + n p and we can write the least squares estimates as {3k=Yk .no The variances of .8 and that {3j and f1i are UMVU for {3j and J1.p.... . By Theorem 6.. o . in the Gaussian model. . Y.
.p}. which together with a 2 specifies the model. in the context of Example 6..17) I. the LADEs are obtained fairly quickly by modem computing methods.. . . For instance.7 and 2. We first consider the a 2 known case and consider the likelihood ratio statistic . • .i 1 6. In general.1.1. whereas for the full model JL is in a pdimensional subspace of Rn. under H.. . . The first inferential question is typically "Are the means equal or notT' Thus we test H : 131 = .En in (6.2. 1 and the estimates of f3 and J. a regression equation of the form mean response ~ .17). Next consider the psample model of Example 1.4.JL): JL E w} sup{p(Y. q < r.374 Inference in the Multiparameter Case Chapter 6 Remark 6." Now.3 with 13k representing the mean resJXlnse for the kth population.zT . j where Zi2 is the dose level of the drug given the ith patient.L are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance L~ 1 IYi .1. the mean vector is an element of the space {JL : J. . For more on LADEs. i = 1. see Problems 1.. i = 1.. . The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEssee Stigler (1986). under H. The most important hypothesistesting questions in the context of a linear model correspond to restriction of the vector of means JL to a linear subspace of the space w.1. (6. . Now we would test H : 132 = 0 versus K : 132 =I O. and the matrix IIziillnx3 with Zil = 1 has rank 3.. However. Zi3 is the age of the ith patient. = {3p = 13 for some 13 E R versus K: "the f3's are not all equal.JL): JL E wo} ~ ! .31. see Koenker and D'Orey (1987) and Portnoy and Koenker (1997).Li = 13 E R.. {JL : Jti = 131 + f33zi3. j.\(y) = sup{p(Y. we let w correspond to the full model with dimension r and let Wo be a qdimensional linear subspace over which JL can range under the null hypothesis H. n} is a twodimensional linear subspace of the full model's threedimensional linear subspace of Rn given by (6. An alternative approach 10 the MLEs for the nonnal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors El.2. The LSEs are preferred because of ease of computation and their geometric properties. • = 131 + f32Zi2 + f33Zi3.3 Tests and Confidence Intervals .1.. ! j ! I I . .1) have the Laplace distribution with density . in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider.1.1. . which is a onedimensional subspace of Rn. I. Thus. 1 < .al. j 1 • 1 . . " .6.
l2). 2 log .. .l9) then. o .iio 2 1 } where i1 and flo are the projections of Yon wand wo. . In this case the distribution of L:~q+1 (Uda? is called a chisquare distribution with r .19).. But if we let A nx n be an orthogonal matrix with rows vi.\(Y) ~ exp {  2t21Y . We have shown the following.1.1. 1) distribution with OJ = 'Ida.1.q {(}2) for this distribution.1. V r span wand set . We write X.2(v). 2 log ..1.. Write 'Ii is as defined in (6. .q( (}2) distribution with 0 2 =a 2 ' \ '1]i L.1.\(Y) It follows that = exp 1 2172 L '\' Ui' (6. by Theorem 6. when H holds..20) i=q+I 210g.. . then r.1. . = IJL i=q+l r JLol'.. respectively.\(Y) has a X. . i=q+l r = a 21 fL  Ito [2 (6.'" (}r)T (see Problem B. span Wo and VI.J X.iii' . v~' such that VI.q degrees offreedom and noncentrality parameter Ej2 = 181 2 = L:=q+ I where 8 = «(}q+l.1. by Theorem 6.21 ) where fLo is the projection of fL on woo In particular. (}r.1. Proposition 6.21).._q' = AJL where A L 'I.IY .. In the Gaussian linear model with 17 2 known.1. .1. V q (6. Note that (uda) has a N(Oi.3.\(Y) = L i=q+l r (ud a )'.\(Y) Proof We only need to establish the second equality in (6.Section 6_1 Inference for Gaussian Linear Models 375 for testing H : fL E Wo versus j{ : JL E W  woo Because (6.I8) then..
1.'I/L. which is equivalent to the likelihood ratio statistic for H : Ii.• j . when H I lwlds. has the noncentral F distribution F r _q. E In this case.1 that the MLEs of a 2 for It E wand It E wQ are . ]t consists of rejecting H when the fit. we obtain o A(Y) = P Y._q(02) distribution with 0' = . I .5) that if we introduce the variance equal likelihood ratio w'" .liol' (n _ r) 'IY _ iii' .. .1.iT'): /L E w} . statistic :\(y) _ max{p{y.1../to I' ' n J respectively.JLo.q)'{[A{Y)J'/n . IIY . Remark 6.1.E Wo for K : Jt E W .itol 2 = L~ q+ 1 (Ui /0) 2. For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >.23) .r){r . Thus.18). as measured by the residual sum of squares under the model specified by H.r IY .2 1JL .q variable)/df (central X~T variahle)/df with the numerator and denominator independent.1.1 suppose the assumption "0.19). T is an increasing function of A{Y) and the two test statistics are equivalent. In the Gaussian linear model the F statistic defined by (6.1.l) {Ir .max{p{y. By the canonicaL representation (6. T = n ./L.(Y)..nr distribution. Substituting j).JtoI 2 . it can be shown (Problem 6. In particular.21ii. We write :h.2. We have seen in PmIXlsition 6. which has a X~r distribution and is independent of u.2.L n where p{y.2 is known" is replaced by '.I. The distribution of such a variable is called the noncentral F distribution with noncentrality parameter 0 2 and r .1.3. {io. is poor compared to the fit under the general model. . .1. T has the representation . (6. .q and m = n ./Lol'.liol' IY r _q IY _ iii' iii' (r ./L.r. T has the (central) Frq.L a =  n I' and .I}.22) Because T = {n . and 86 into the likelihood ratio statistic.m{O') for this distrihution where k = r .wo.0. J • T = (noncentral X:..1.1 that 021it . The resulting test is intuitive.') denotes the righthand side of (6.. We have shown the following. In Proposition 6. we can write .1.2 is the same under Hand K and estimated by the MLE 0:2 for /. 8 2 . /L.~I.r degrees affreedam (see Prohlem B.'IY _ iii' = L~ r+l (Ui /0)2.a:. We know from Problem 6.q and n . T is called the F statistic for the general linear hypothesis.iT') : /L E wo} (6.J.14).'}' YJ.a = ~(Y'E. Proposition 6.22). .itol 2 have a X.. = aD IIY ..376 Inference in the Multiparameter Case Chapter 6 Next consider the case in which a 2 is unknown.q)'Ili .n_r(02) where 0 2 = u.1..
19) made it possible to recognize the identity IY . We test H : (31 case wo = {I'o}. where t is the onesample Student t statistic of Section 4.1 Inference for Gaussian Linear Models ~ 377 then >. It follows that ~ (J2 known case with rT 2 replaced by r. Y3 y Iy . r = 1 and T = 1'0 versus K : (3 'f 1'0.Y)2' which we recognize as t 2 / n.1. We next return to our examples.1.iLol'.(Y) equals the likelihood ratio statistic for the 0'2. 0 . q = 0.1.q noncentral X? q 2Iog.iLl' ~++Yl I' I' 1 . In this = (Y 1'0)2 (nl) lE(Y. (6.1.1.iLol' = IY . See Pigure 6.~T = (n .\(Y) = .22).1.1.1. Yl = Y2 Figure 6.3.24) Remark 6.1 and Section B. and the Pythagorean identity.1. The projections il and ilo of Yon w and Wo. The canonical representation (6.Section 6.1.1'0 I' y.IO. Example 6.25) which we exploited in the preceding derivations.2.r)ln central X~_cln where T is the F statistic (6. (6.iLl' + liL . One Sample (continued).9. This is the Pythagorean identity.
1. 7) iW l.. In the special case that ZrZ2 = 0 so the variables in Zl are orthogonal to the variables in Z2.  I 1.1.1.q) x 1 vector of main (e.3. P nk (Y. Without loss of generality we ask whether the last p . . P . anddfF = np and dfH = nq are the corresponding degrees of freedom.g. The F test rejects H if F is large when compared to the ath quantile of the Fpq.1.dfp) RSSF/dh I 2 where RSSF = IY and RSSH = IY . The One.}f32. I • ! ..' • ito = Thus.. = {3p.. ..Way Layout (continued). In this case f3 (ZrZ)lZry and f3 0 = (Z[Ztl1Z[Y are the ML& under the full model (6. . I • I b .ZfZl(ZiZl)'ziz.. in general 02 depends on the sample correlations between the variables in Zl and those in Z2' This issue is discussed further in Example 62. L~"(Yk' .: I :I T _ n ~ p L~l nk(Y . However. and we partition {3 as f3T = (f3f..n_p(fP) distribution with noncentrality parameter (Problem 6.22) we obtain the F statistic for the hypothesis H in the oneway layout .q covariates does not affect the mean response.. • 1 F = (RSSlf .. '1 Yp.1.)2. To formulate this question. we want to test H : {31 = . 02 simplifies to (J2(p .1. Under H all the observations have the same mean so that. 0 I j Example 6.2. which only depends on the second set of variables and coefficients.vy.{3p are Y1.=11=1 k=1 •• Substituting in (6. . Z2) where Z1 is n x q and Z2 is 11 X (p . (6.27) 1 ..22) we can write the F statistic version of the likelihood ratio test in the intuitive fonn =   i .. Regression (continued). treatment) effect coefficients and f3 1 is a q x 1 vector of "nuisance" (e.q covariates in multiple regression have an effect after fitting the first q.Yk.)2' .g. . economic status) coefficients.y)2 k . We consider the possibility that a subset of p . I3I) where {32 is a (p .. Recall that the least squares estimates of {31.ito 1 are the residual sums of squares under the full model and H.Q)f3f(ZfZ2)f32. respectivelY.np distribution.)2= Lnk(Yk.1..1 L~=. 1 02 = (J2(p .26) and H. . Now the linear model can be written as i i I I ! We test H : f32 (6.26) 0 versus K : f3 2 i' O.1. As we indicated earlier.RSSF)/(dflf . Under the alternative F has a noncentral Fp_q. age. j ! j • litPol2 = LL(Yk _y..q).1. •..n 378 Inference in the Multiparameter Case Chapter 6 iI II " • Example 6.P . Using (6. _y.q)1f3nZfZ2 . . we partition the design matrix Z by writing it as Z = (ZI. respectively.
25) 88T = 888 + 88w . (3p. we see that the decomposition (6.Section 6. 888 =L k=I P nk(Yk..1.1. SSB.29) Thus.1.np distribution with noncentrality parameter (6... If the fJ.p) of the (n 1) degrees offreedom of SST /0" as "cooling" from S8w/a'.fLo 12 for the vector Jt (rh. respectively.. compute a. the total sum of squares. . The sum of squares in the numerator.. Because 88B/0" and SSw / a' are independent X' variables with (p . the within groups (or residual) sum of squares. T has a noncentral Fp1. are not all equal. we have a decomposition of the variability of the whole set of data. . T has a Fpl. Note that this implies the possibly unrealistic . identifying 0' and (p . . .. are often summarized in what is known as an analysis a/variance (ANOVA) table.. (3" .fJ. k=l 1=1 measures variation within the samples.)' .2 1M .1) and 8Sw /(n . k=1 1=1 which measures the variability of the pooled samples.. sum of squares in the denominator.p). . SST/a 2 is a (noncentral) X2 variable with (n .Y. Ypnp' The 88w ~ L L(Y Y p "' k'  k )'.p) degrees of freedom.1 Inference for Gaussian Linear Models 379 When H holds. As an illustration. ijjT = There is an interesting way of looking at the pieces of infonnation summarized by the F statistic.1 and 6. . To derive IF..1. We assume the oneway layout is valid. (6. .1. SST.. n. and the F statistic.3.)'. into two constituent components. (3p)T and its projection 1'0 = (ij.28) where j3 = n I :2:=~"" 1 nifJi..1) degrees offreedom as "coming" from SS8/a' and the remaining (n . See Tables 6.1. . .30) can also be viewed stochastically. . consider the following data(I) giving blood cholesterol levels of men in three different socioeconomic groups labeled I. Ypl .1) and (n . the unbiased estimates of 02 and a 2 . . (31. and III with I being the "high" end... is a measure of variation between the p samples YII .. then by the Pythagorean identity (6. . which is their ratio. This information as well as S8B/(P . Yi nl .1) degrees of freedom and noncentrality parameter 0'.. [f we define the total sum of squares as 88T =L l' L(Y nk k'  Y.np distribution. the between groups (or treatment) sum of squares and SSw.···. .
TABLE 6.1.1. ANOVA table for the one-way layout

Source            Sum of squares                                                          d.f.    Mean squares           F value
Between samples   $SS_B = \sum_{k=1}^p n_k(\bar Y_{k\cdot} - \bar Y_{\cdot\cdot})^2$      $p-1$   $MS_B = SS_B/(p-1)$    $MS_B/MS_W$
Within samples    $SS_W = \sum_{k=1}^p \sum_{l=1}^{n_k}(Y_{kl} - \bar Y_{k\cdot})^2$      $n-p$   $MS_W = SS_W/(n-p)$
Total             $SS_T = \sum_{k=1}^p \sum_{l=1}^{n_k}(Y_{kl} - \bar Y_{\cdot\cdot})^2$  $n-1$
TABLE 6.1.2. Blood cholesterol levels

I     403  311  269  336  259
II    312  222  302  420  420  386  353  210  286  290
III   403  244  353  235  319  260
assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol levels of the three groups. Here $p = 3$, $n_1 = 5$, $n_2 = 10$, $n_3 = 6$, $n = 21$, and we compute
TABLE 6.1.3. ANOVA table for the cholesterol data

Source            SS          d.f.   MS       F-value
Between groups    1202.5      2      601.2    0.126
Within groups     85,750.5    18     4763.9
Total             86,953.0    20
From F tables, we find that the p-value corresponding to the F-value 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups. □
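The F test in this example is easy to reproduce numerically. The following Python sketch (the use of numpy and scipy is ours, not the text's) computes the sums of squares of Table 6.1.1 for the cholesterol data; the assignment of the 21 observations to groups I, II, and III is our reading of Table 6.1.2 and should be treated as an assumption.

```python
import numpy as np
from scipy.stats import f as f_dist

# Cholesterol levels by socioeconomic group (assumed grouping from Table 6.1.2).
groups = [
    np.array([403, 311, 269, 336, 259]),                           # I,   n1 = 5
    np.array([312, 222, 302, 420, 420, 386, 353, 210, 286, 290]),  # II,  n2 = 10
    np.array([403, 244, 353, 235, 319, 260]),                      # III, n3 = 6
]
p = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between- and within-group sums of squares as in Table 6.1.1.
ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_b = ss_b / (p - 1)
ms_w = ss_w / (n - p)
f_value = ms_b / ms_w
p_value = f_dist.sf(f_value, p - 1, n - p)  # P(F_{p-1, n-p} > observed value)

print(f"SS_B = {ss_b:.1f}, SS_W = {ss_w:.1f}, F = {f_value:.3f}, p-value = {p_value:.2f}")
```

Up to rounding in the data entry, this reproduces the F-value of about 0.13 and the p-value of about 0.88 reported above.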
Section 6.1
Inference for Gaussian linear Models
~
381
Confidence Intervals and Regions We next use our distributional results and the method of pivots to find confidence intervals for J.li, 1 < i < n, !3j, 1 <j < p, and in general, any linear combination
n
,p
= ,p(/l) = Lai/Li
i::: 1
~ aT /l
of the J.l's. If we set;j;
= 1:7
1 ai/Ii
= aT fl and
~
~
where H is the hat matrix, then (,p  ,p)ja(,p) has a N(O, I) distribution. Moreover,
(n  r)8 2ja 2 ~
IY  iii 2 ja 2 =
L
i=r+l
n
(Uda 2 )'
has a X~r distribution and is independent of ;;;. Let
~
~
be an estimate of the standard deviation a('IjJ) of
~
'0. This estimated standard deviation is
called the standard error of 'IjJ. By referring to the definition of the t distribution, we find that the pivot
has a TnT. distribution. Let t n  r (1  40) denote the 1 ~o: quantile of the bution, then by solving IT(,p)1 < t n _, (I  ~",) for,p, we find that
Tn r
distri
is, in the Gaussian linear model, a 100(1  a)% confidence interval for 1/J. Example 6.1.1. One Sample (continued). Consider'IjJ = p. We obtain the interval
i' ~ Y ± t n 1
(1
~q) sj.,fii,
which is the same as the interval of Example 4.4.1 and Section 4.9.2. Example 6.1.2. Regression (continued). Assume that p = T. First consider 1/J = f3j for some specified ~gression coefficient (3j. The 100(1  a)% confidence interval for (3j is
(3j
}" = (3j ± t n p (','" s{ [ (ZT Z) 11 j j ' 1 )
382
Inference in the Multiparameter Case
Chapter 6
where [(ZTZ)~l]j) is the jthdiagonal element of (ZTZ)~ '. Computersoftware computes (ZTZ)~I and labels S{[(ZTZ)~lI]j); as the standard errOr of the (estimated) jth regression coefficient. Next consider ljJ = j.ti = mean response for the ith case, 1 < i < n. The level (1  0:) confidence interval is
J.Li =
Jii ± t n _ p (1 
!a) sJh:
where hii is the ith diagonal element of the hat matrix H. Here sjh;; is called the standard error of the (estimated) mean of the ith case. Next consider the special case in which p = 2 and
Yi = 131 + 132Zi2 + Eil i = 1 ", n.
"
If we use the identity
n
~)Zi2  Z2)(l'i  Y)
i=1
~)Zi2  Z.2)l'i,
We obtain from Example 2.2.2 that
ih ~
Because Var(Yi) = a 2 , we obtain
~ Var(.6,)
L~l (Zi2  z.,)l'i
L~ 1 (Zi2  Z.2)2 .
(6.1.30)
= (J I L)Zi' i=l
''''
n
Z.2) ,
,
and the 100(1  a)% confidence interval for .6, has the form
732 ± t n p (I  ~a) sl J2:(Zi2  Z.2)'. The confidence interval for 131 is given in Problem 6.1.10. .62
=
Similarly, in the p = 2 case, it is straightforward (Problem 6.1.10) to compute

h_ii = 1/n + (z_{i2} - z_bar_2)^2 / sum_{i=1}^n (z_{i2} - z_bar_2)^2,

and the confidence interval for the mean response mu_i of the ith case has a simple explicit form.
Example 6.1.3. One-Way Layout (continued). We consider psi = beta_k, 1 <= k <= p. Because beta_hat_k = Y_bar_k ~ N(beta_k, sigma^2/n_k), we find the 100(1 - alpha)% confidence interval

beta_k = Y_bar_k +/- t_{n-p}(1 - alpha/2) s/sqrt(n_k),

where s^2 = SS_W/(n - p). The intervals for mu = beta_bar and the incremental effect delta_k = beta_k - mu are given in Problem 6.1.11. 0
Joint Confidence Regions
We have seen how to find confidence intervals for each individual beta_j, 1 <= j <= p. We next consider the problem of finding a confidence region C in R^p that covers the vector beta with prescribed probability (1 - alpha). This can be done by inverting the likelihood ratio test or, equivalently, the F test. That is, we let C be the collection of beta_0 that are accepted when the level (1 - alpha) F test is used to test H : beta = beta_0. Under H, mu = mu_0 = Z beta_0, and the numerator of the F statistic (6.1.22) is based on

|mu_hat - mu_0|^2 = |Z beta_hat - Z beta_0|^2 = (beta_hat - beta_0)^T (Z^T Z)(beta_hat - beta_0).

Thus, using (6.1.22), the simultaneous confidence region for beta is the ellipsoid

C = {beta_0 : (beta_hat - beta_0)^T (Z^T Z)(beta_hat - beta_0) / (r s^2) <= f_{r,n-r}(1 - alpha)}    (6.1.31)

where f_{r,n-r}(1 - alpha) is the 1 - alpha quantile of the F_{r,n-r} distribution.
Example 6.1.2. Regression (continued). We consider the case p = r and, as in (6.1.26), write Z = (Z_1, Z_2) and beta^T = (beta_1^T, beta_2^T), where beta_2 is a vector of main effect coefficients and beta_1 is a vector of "nuisance" coefficients. Similarly, we partition beta_hat as beta_hat = (beta_hat_1^T, beta_hat_2^T)^T where beta_hat_1 is q x 1 and beta_hat_2 is (p - q) x 1. By Corollary 6.1.1, sigma^2 (Z^T Z)^{-1} is the variance-covariance matrix of beta_hat. It follows that if we let S denote the lower right (p - q) x (p - q) corner of (Z^T Z)^{-1}, then sigma^2 S is the variance-covariance matrix of beta_hat_2. Thus, a joint 100(1 - alpha)% confidence region for beta_2 is the (p - q)-dimensional ellipsoid

C = {beta_02 : (beta_hat_2 - beta_02)^T S^{-1} (beta_hat_2 - beta_02) / ((p - q) s^2) <= f_{p-q,n-p}(1 - alpha)}.    0
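Checking whether a candidate beta_0 lies in the region (6.1.31) only requires the quadratic form and an F quantile. A minimal Python sketch (the function name and the use of simulated inputs are our own, not the text's):

```python
import numpy as np
from scipy.stats import f

def in_confidence_ellipsoid(Z, Y, beta0, alpha=0.05):
    """Return True if beta0 lies in the 100(1 - alpha)% region (6.1.31)."""
    n, r = Z.shape
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    s2 = np.sum((Y - Z @ beta_hat) ** 2) / (n - r)
    d = beta_hat - beta0
    stat = d @ (Z.T @ Z) @ d / (r * s2)
    return stat <= f.ppf(1 - alpha, r, n - r)
```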
Summary. We consider the classical Gaussian linear model in which the response Y_i of the ith case in an experiment is expressed as a linear combination mu_i = sum_{j=1}^p beta_j z_{ij} of covariates plus an error epsilon_i, where epsilon_1, ..., epsilon_n are i.i.d. N(0, sigma^2). By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients {beta_j}, the response means {mu_i}, and linear combinations of these.
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for onedimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
6.2.1
Estimating Equations
Our assumptions are as before save that everything is made a vector: X_1, ..., X_n are i.i.d. P where P is in Q, a model containing P = {P_theta : theta in Theta} such that

(i) Theta is open, Theta subset of R^p.

(ii) Densities of P_theta are p(., theta), theta in Theta.
The following result gives the general asymptotic behavior of the solution of estimating equations.
A0. Psi = (psi_1, ..., psi_p)^T, where psi_j = partial rho / partial theta_j, is well defined and

(1/n) sum_{i=1}^n Psi(X_i, theta_hat_n) = 0.    (6.2.1)

A solution to (6.2.1) is called an estimating equation estimate or an M-estimate.
A1. The parameter theta(P), given by the solution of the nonlinear system of p equations in p unknowns

integral Psi(x, theta) dP(x) = 0,    (6.2.2)

is well defined on Q, so that theta(P) is the unique solution of (6.2.2). Necessarily theta(P_theta) = theta because Q contains P.

A2. E_P |Psi(X_1, theta(P))|^2 < infinity, where |.| is the Euclidean norm.
A3. psi_i(., theta), 1 <= i <= p, have first-order partials with respect to all coordinates and, using the notation of Section B.8,

D Psi(x, theta) = || (partial psi_i / partial theta_j)(x, theta) ||_{p x p},

where E_P D Psi(X_1, theta(P)) is nonsingular.
A4. sup{ | (1/n) sum_{i=1}^n (D Psi(X_i, t) - D Psi(X_i, theta(P))) | : |t - theta(P)| <= epsilon_n } converges in probability to 0 if epsilon_n -> 0.

A5. theta_hat_n converges in probability to theta(P) for all P in Q.
Theorem 6.2.1. Under A0-A5 of this section,

theta_hat_n = theta(P) + (1/n) sum_{i=1}^n Psi_bar(X_i, theta(P)) + o_P(n^{-1/2}),    (6.2.3)

where

Psi_bar(x, theta(P)) = -[E_P D Psi(X_1, theta(P))]^{-1} Psi(x, theta(P)).    (6.2.4)
Hence,

L(sqrt(n)(theta_hat_n - theta(P))) -> N(0, Sigma(Psi, P)),    (6.2.5)

where

Sigma(Psi, P) = J(theta, P) E_P[Psi Psi^T(X_1, theta(P))] J^T(theta, P)

and

J^{-1}(theta, P) = E_P D Psi(X_1, theta(P)) = E_P || (partial psi_i / partial theta_j)(X_1, theta(P)) ||.    (6.2.6)

The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
-(1/n) sum_{i=1}^n Psi(X_i, theta(P)) = (1/n) sum_{i=1}^n D Psi(X_i, theta*_n)(theta_hat_n - theta(P)).    (6.2.7)
Note that the left-hand side of (6.2.7) is a p x 1 vector; the right is the product of a p x p matrix and a p x 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2, save that we need the observation that the set of nonsingular p x p matrices, when viewed as vectors, is an open subset of R^{p^2}, representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that, with probability tending to 1, (1/n) sum_{i=1}^n D Psi(X_i, theta*_n) is nonsingular.

Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of theta_hat_n is motivated by P, the behavior in (6.2.3) is guaranteed for P in Q, which can include P not in P. In fact, typically Q is essentially the set of P's for which theta(P) can be defined uniquely by (6.2.2).

We can again extend the assumptions of Section 5.4.2 to:

A6. If l(., theta) is differentiable,

E_theta D Psi(X_1, theta) = -E_theta[Psi(X_1, theta) Dl(X_1, theta)] = -Cov_theta(Psi(X_1, theta), Dl(X_1, theta)),    (6.2.8)

defined as in B.5.2.

The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3 and Assumptions A4' and A6' extend to the multivariate case readily. Note that consistency of theta_hat_n is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate theta_tilde_n will find a solution theta_hat_n of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
6.2.2 Asymptotic Normality and Efficiency of the MLE

If we take rho(x, theta) = -l(x, theta) = -log p(x, theta), and Psi(x, theta) = Dl^T(x, theta) obeys A0-A6, then (6.2.8) becomes

-E_theta D^2 l(X_1, theta) = E_theta[Dl^T(X_1, theta) Dl(X_1, theta)] = Var_theta Dl(X_1, theta),    (6.2.9)

where Var_theta Dl(X_1, theta) is the Fisher information matrix I(theta) introduced in Section 3.4. If rho : Theta -> R, Theta subset of R^d, is a scalar function, the matrix || partial^2 rho / partial theta_i partial theta_j (theta) || is known as the Hessian or curvature matrix of the surface rho. Thus, (6.2.9) states that the expected value of the Hessian of l is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
Theorem 6.2.2. If A0-A6 hold for rho(x, theta) = -log p(x, theta), then the MLE theta_hat_n satisfies

theta_hat_n = theta + (1/n) sum_{i=1}^n I^{-1}(theta) Dl^T(X_i, theta) + o_P(n^{-1/2}),    (6.2.10)

so that

L(sqrt(n)(theta_hat_n - theta)) -> N(0, I^{-1}(theta)).    (6.2.11)

If theta_tilde_n is a minimum contrast estimate with rho and Psi satisfying A0-A6 and corresponding asymptotic variance matrix Sigma(Psi, P_theta), then

Sigma(Psi, P_theta) >= I^{-1}(theta)    (6.2.12)

in the sense of Theorem 3.4.4, with equality in (6.2.12) for theta = theta_0 iff, under theta_0,

theta_tilde_n = theta_hat_n + o_P(n^{-1/2}).    (6.2.13)
Proof. The proofs of (6.2.10) and (6.2.11) parallel those of (5.4.33) and (5.4.34) exactly. The proof of (6.2.12) parallels that of Theorem 3.4.4; for completeness we give it. Note that by (6.2.6) and (6.2.8),

Sigma(Psi, P_theta) = Cov_theta^{-1}(U, V) Var_theta(U) Cov_theta^{-1}(V, U),    (6.2.14)

where U = Psi(X_1, theta), V = Dl^T(X_1, theta). But by (B.10.8), for any U, V with Var((U^T, V^T)^T) nonsingular,

Var(V) >= Cov(V, U) Var^{-1}(U) Cov(U, V).    (6.2.15)

Taking inverses of both sides yields

I^{-1}(theta) = Var_theta^{-1}(V) <= Sigma(Psi, theta).    (6.2.16)

Equality holds in (6.2.15), by (B.10.2.3), iff for some b = b(theta),

U = b + Cov(U, V) Var^{-1}(V) V with probability 1.    (6.2.17)

This means, in view of E_theta Psi = E_theta Dl = 0, that Psi(X_1, theta) = a(theta) Dl^T(X_1, theta) for the nonrandom matrix a(theta) = Cov_theta(U, V) Var_theta^{-1}(V). In the case of identity in (6.2.16) we must have

[E_theta D Psi(X_1, theta)]^{-1} Psi(X_1, theta) = I^{-1}(theta) Dl^T(X_1, theta).    (6.2.18)

Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds. 0
We see that, by the theorem, the MLE is efficient in the sense that for any a (p x 1), a^T theta_hat_n has asymptotic bias o(n^{-1/2}) and asymptotic variance n^{-1} a^T I^{-1}(theta) a, which is no larger than that of any competing minimum contrast estimate. Further, any competitor theta_tilde_n such that a^T theta_tilde_n has the same asymptotic behavior as a^T theta_hat_n for all a in fact agrees with theta_hat_n to order n^{-1/2}.

A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6
on the asymptotic normality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example.

Example 6.2.1. The Linear Model with Stochastic Covariates. Let X_i = (Z_i^T, Y_i)^T, 1 <= i <= n, be i.i.d. as X = (Z^T, Y)^T where Z is a p x 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:
(i) Y = alpha + Z^T beta + epsilon,    (6.2.19)

where epsilon is distributed as N(0, sigma^2) independent of Z and E(Z) = 0. That is, given Z, Y has a N(alpha + Z^T beta, sigma^2) distribution.

(ii) The distribution H_0 of Z is known with density h_0 and E(ZZ^T) is nonsingular.

The second assumption is unreasonable but easily dispensed with. It readily follows (Problem 6.2.6) that the MLE of beta is given by (with probability 1)

beta_hat = [Z_(n)^T Z_(n)]^{-1} Z_(n)^T Y.    (6.2.20)
Here Z_(n) is the n x p matrix ||Z_{ij} - Z_bar_j|| where Z_bar_j = n^{-1} sum_{i=1}^n Z_{ij}. We used subscripts (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Z_(n) = (Z_1, ..., Z_n)^T is referred to as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of alpha and sigma^2 are

alpha_hat = Y_bar - sum_{j=1}^p Z_bar_j beta_hat_j,   sigma_hat^2 = (1/n) |Y - (alpha_hat + Z_(n) beta_hat)|^2.    (6.2.21)
Note that although given Z_1, ..., Z_n, beta_hat is Gaussian, this is not true of the marginal distribution of beta_hat. It is not hard to show that A0-A6 hold in this case because, if H_0 has density h_0 and theta denotes (alpha, beta^T, sigma^2)^T, then

l(X, theta) = -(1/(2 sigma^2))[Y - (alpha + Z^T beta)]^2 - (1/2)(log sigma^2 + log 2 pi) + log h_0(Z),    (6.2.22)

Dl(X, theta) = ( epsilon/sigma^2,  epsilon Z^T/sigma^2,  (epsilon^2 - sigma^2)/(2 sigma^4) ),  epsilon = Y - (alpha + Z^T beta),

and

I(theta) = diag( 1/sigma^2,  sigma^{-2} E(ZZ^T),  1/(2 sigma^4) ),    (6.2.23)
so that, by Theorem 6.2.2,

L(sqrt(n)(alpha_hat - alpha, beta_hat - beta, sigma_hat^2 - sigma^2)) -> N(0, diag(sigma^2, sigma^2 [E(ZZ^T)]^{-1}, 2 sigma^4)).    (6.2.24)
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of H_0 known plays no role in the limiting result for alpha_hat, beta_hat, sigma_hat^2. Of course, these will only be the MLEs if H_0 depends only on parameters other than (alpha, beta, sigma^2). In this case we can estimate E(ZZ^T) by n^{-1} sum_{i=1}^n Z_i Z_i^T and give approximate confidence intervals for beta_j, j = 1, ..., p.

An interesting feature of (6.2.23) is that because I(theta) is a block diagonal matrix so is I^{-1}(theta) and, consequently, beta_hat and sigma_hat^2 are asymptotically independent. In the classical linear model of Section 6.1, where we perform inference conditionally given Z_i = z_i, 1 <= i <= n, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew sigma^2, the MLE would still be beta_hat and its asymptotic variance optimal for this model. If we knew alpha and beta, sigma_hat^2 would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, sigma_hat^2 would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model P = {P_(theta,eta) : theta in Theta, eta in E}
we say we can estimate theta adaptively at eta_0 if the asymptotic variance of the MLE theta_hat (or more generally, an efficient estimate of theta) in the pair (theta_hat, eta_hat) is the same as that of theta_hat(eta_0), the efficient estimate for P_{eta_0} = {P_(theta,eta_0) : theta in Theta}. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular, consider estimating beta_1 in the presence of alpha, (beta_2, ..., beta_p) with

(i) alpha, beta_2, ..., beta_p known;

(ii) alpha, beta_2, ..., beta_p arbitrary.

In case (i), we take, without loss of generality, alpha = beta_2 = ... = beta_p = 0. Let Z_i = (Z_{i1}, ..., Z_{ip})^T; then the efficient estimate in case (i) is
,
•• "
I
;;n _ L~l Zit Yi Pi n 2
Li=l Zil
(6,2.25)
, , ,
, I
.
I
Section 6.2
Asymptotic Estimation Theory in p Dimensions
389
with asymptotic variance sigma^2 [E(Z_1^2)]^{-1}. On the other hand, beta_hat_1 is the first coordinate of beta_hat given by (6.2.20). Its asymptotic variance is the (1,1) element of sigma^2 [E(ZZ^T)]^{-1}, which is strictly bigger than sigma^2 [E(Z_1^2)]^{-1} unless [E(ZZ^T)]^{-1} is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate beta_1 adaptively if beta_2, ..., beta_p are regarded as nuisance parameters. What is happening can be seen by a representation of [Z_(n)^T Z_(n)]^{-1} Z_(n)^T Y and I^{-1}(theta), where I^{-1}(theta) = ||I^{ij}(theta)||. We claim that

beta_hat_1 = sum_{i=1}^n (Z_{i1} - Z_hat_i^(1)) Y_i / sum_{i=1}^n (Z_{i1} - Z_hat_i^(1))^2,    (6.2.26)
where Z_hat^(1) = (Z_hat_1^(1), ..., Z_hat_n^(1))^T is the regression of the vector of first-covariate values (Z_{11}, ..., Z_{n1})^T on the linear space spanned by the vectors (Z_{1j}, ..., Z_{nj})^T, 2 <= j <= p. Similarly,
I^{11}(theta) = sigma^2 / E(Z_1 - Pi(Z_1 | Z_2, ..., Z_p))^2,    (6.2.27)

where Pi(Z_1 | Z_2, ..., Z_p) is the projection of Z_1 on the linear span of Z_2, ..., Z_p (Problem 6.2.11). Thus, Pi(Z_1 | Z_2, ..., Z_p) = sum_{j=2}^p a_j* Z_j, where (a_2*, ..., a_p*) minimizes E(Z_1 - sum_{j=2}^p a_j Z_j)^2 over (a_2, ..., a_p) in R^{p-1} (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing beta_2, ..., beta_p when the variables Z_2, ..., Z_p are in any way correlated with Z_1, and the price is measured by

[E(Z_1 - Pi(Z_1 | Z_2, ..., Z_p))^2]^{-1} = ( 1 - E(Pi(Z_1 | Z_2, ..., Z_p))^2 / E(Z_1^2) )^{-1} [E(Z_1^2)]^{-1}.    (6.2.28)

In the extreme case of perfect collinearity the price is infinite, as it should be, because beta_1 then becomes unidentifiable. Thus, adaptation corresponds to the case where (Z_2, ..., Z_p) have no value in predicting Z_1 linearly (see Section 1.4). Correspondingly, in the Gaussian linear model (6.1.3) conditional on the Z_i, i = 1, ..., n, beta_hat_1 is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if E(Z_1 - Pi(Z_1 | Z_2, ..., Z_p))^2 = 0. 0
Example 6.2.2. M-Estimates Generated by Linear Models with General Error Structure. Suppose that the epsilon_i in (6.2.19) are i.i.d. but not necessarily Gaussian, with density (1/sigma) f_0(./sigma); for instance,

f_0(x) = e^{-x} / (1 + e^{-x})^2,

the logistic density. Such error densities have the often more realistic, heavier tails(1) than the Gaussian density. The estimates beta_hat_0, sigma_hat_0 now solve

sum_{i=1}^n psi((Y_i - Z_i^T beta_hat_0)/sigma_hat_0) Z_{ij} = 0, 1 <= j <= p,

and

sum_{i=1}^n chi((Y_i - Z_i^T beta_hat_0)/sigma_hat_0) = 0,
where psi = -f_0'/f_0, chi(y) = -(y f_0'(y)/f_0(y) + 1), and beta_0 = (beta_{10}, ..., beta_{p0})^T. The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if

(i) log f_0 is strictly concave, i.e., f_0'/f_0 is strictly decreasing;

(ii) (log f_0)'' exists and is bounded.

Then, if further f_0 is symmetric about 0,
I(theta) = sigma^{-2} diag( c_1 E(ZZ^T),  c_2 ),    (6.2.29)

where

c_1 = integral (f_0'(x)/f_0(x))^2 f_0(x) dx,   c_2 = integral (x f_0'(x)/f_0(x) + 1)^2 f_0(x) dx.
Thus, beta_hat_0, sigma_hat_0 are optimal estimates of beta and sigma in the sense of Theorem 6.2.2 if f_0 is true. Now suppose f_0 generating the estimates beta_hat_0 and sigma_hat_0 is symmetric and satisfies (i) and (ii), but the true error distribution has density f possibly different from f_0. Under suitable conditions we can apply Theorem 6.2.1 with
psi_j(z, y, beta, sigma) = psi((y - sum_{k=1}^p z_k beta_k)/sigma) z_j,  1 <= j <= p,

psi_{p+1}(z, y, beta, sigma) = chi((y - sum_{k=1}^p z_k beta_k)/sigma),    (6.2.30)

to conclude that

L(sqrt(n)(beta_hat_0 - beta_0)) -> N(0, Sigma(Psi, P)),   L(sqrt(n)(sigma_hat_0 - sigma_0)) -> N(0, sigma^2(P)),

where (beta_0, sigma_0) solve

integral Psi(z, y, beta_0, sigma_0) dP = 0    (6.2.31)
and Sigma(Psi, P) is as in (6.2.6). What is the relation between beta_0, sigma_0 and the beta, sigma given in the Gaussian model (6.2.19)? If f_0 is symmetric about 0 and the solution of (6.2.31) is unique, then beta_0 = beta. But sigma_0 = c(f_0) sigma for some c(f_0) typically different from one. Thus, beta_hat_0 can be used for estimating beta, although if the true distribution of the epsilon_i is N(0, sigma^2) it should perform less well than beta_hat. On the other hand, sigma_hat_0 is an estimate of sigma only if normalized by a constant depending on f_0. (See Problem 6.2.5.) These are issues of robustness; that is, to have a bounded sensitivity curve (Section 3.5, Problem 3.5.8), we may well wish to use a nonlinear bounded Psi = (psi_1, ..., psi_p)^T to estimate beta even though it is suboptimal when epsilon ~ N(0, sigma^2), and to use a suitably normalized version of sigma_hat_0 for the same purpose. One effective choice of psi_j is the Huber function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. 0
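A bounded psi such as Huber's is commonly implemented by iteratively reweighted least squares. A minimal Python sketch, assuming the scale is fixed at a MAD-based estimate rather than solved jointly, and with simulated heavy-tailed data; the function and constants are illustrative, not the text's.

```python
import numpy as np

def huber_regression(Z, Y, c=1.345, n_iter=50):
    """M-estimate of beta with Huber psi, scale fixed at a MAD-based estimate."""
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]                   # least squares start
    for _ in range(n_iter):
        r = Y - Z @ beta
        sigma = 1.4826 * np.median(np.abs(r - np.median(r)))      # MAD scale estimate
        u = r / sigma
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))          # Huber weights psi(u)/u
        WZ = Z * w[:, None]
        beta = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)                # weighted least squares step
    return beta

rng = np.random.default_rng(2)
n = 100
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = Z @ np.array([1.0, 2.0]) + rng.standard_t(df=2, size=n)       # heavy-tailed errors
print(huber_regression(Z, Y))
```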
Testing and Confidence Bounds

There are three principal approaches to testing hypotheses in multiparameter models: the likelihood ratio principle, Wald tests (a generalization of pivots), and Rao's tests. All of these will be developed in Section 6.3. The three approaches coincide asymptotically but differ substantially, in performance and computationally, for fixed n. Optimality criteria are not easily stated even in the fixed sample case and are not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters, such as H : theta_1 <= 0 versus K : theta_1 > 0 where theta_2, ..., theta_p vary freely.

6.2.3 The Posterior Distribution in the Multiparameter Case

The asymptotic theory of the posterior distribution parallels that in the one-dimensional case exactly. We simply make theta a vector and interpret |.| as the Euclidean norm in conditions A7 and A8. Using multivariate expansions as in B.8 we obtain

Theorem 6.2.3. If the multivariate versions of A0-A3, A4(a.s.), A5(a.s.), and A6-A8 hold then, if theta_hat denotes the MLE, the posterior distribution of sqrt(n)(theta - theta_hat) converges a.s. under P_theta, for all theta, to the N(0, I^{-1}(theta)) distribution.    (6.2.32)

The consequences of Theorem 6.2.3 are the same as those in the one-dimensional case: the equivalence of Bayesian and frequentist optimality asymptotically. A new major issue that arises is computation. Although it is easy to write down the posterior density of theta, namely pi(theta) prod_{i=1}^n p(X_i, theta) up to the proportionality constant integral pi(t) prod_{i=1}^n p(X_i, t) dt, computing the latter can pose a formidable problem if p > 2. The problem arises also when, as is usually the case, we are interested in the posterior distribution of some of the parameters, say (theta_1, theta_2), because we then need to integrate out (theta_3, ..., theta_p). The asymptotic theory we have developed permits approximation to these constants by the procedure used in deriving (5.5.19) (Laplace's method); we have implicitly done this in the calculations leading up to (5.5.19). Again the two approaches differ at the second order when the prior begins to make a difference. This approach is refined in Kass, Kadane, and Tierney (1989); see Schervish (1995) for some of the relevant calculations. A class of Monte Carlo based methods derived from statistical physics, loosely called Markov chain Monte Carlo, has been developed in recent years to help with these problems. These methods are beyond the scope of this volume but will be discussed briefly in Volume II.

Summary. We defined minimum contrast (MC) and M-estimates in the case of p-dimensional parameters and established their convergence in law to a normal distribution. When the estimating equations defining the M-estimates coincide with the likelihood equations, this result gives the asymptotic distribution of the MLE. We found that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MC or M-estimate if we know the correct model P = {P_theta : theta in Theta} and use the MLE for this model. We used an example to introduce the concept of adaptation, in which an estimate theta_hat is called adaptive for a model {P_(theta,eta) : theta in Theta, eta in E} if sqrt(n)(theta_hat - theta) has asymptotic mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of theta in the family of models {P_(theta,eta_0) : theta in Theta}, eta_0 specified. In the linear model with stochastic covariates, adaptive estimation of beta_1 is possible iff Z_1 is uncorrelated with every linear function of Z_2, ..., Z_p. Another example deals with M-estimates based on estimating equations generated by linear models with non-Gaussian error distributions. Finally we showed that in the Bayesian framework where, given theta, X_1, ..., X_n are i.i.d. p(x, theta), if theta_hat denotes the MLE, then the posterior distribution of sqrt(n)(theta - theta_hat) converges a.s. under P_theta to the N(0, I^{-1}(theta)) distribution.

6.3 LARGE SAMPLE TESTS AND CONFIDENCE REGIONS

In Sections 4.9 and 6.1 we considered the likelihood ratio test statistic

lambda(x) = sup{p(x, theta) : theta in Theta} / sup{p(x, theta) : theta in Theta_0}

for testing H : theta in Theta_0 versus K : theta in Theta_1, Theta_1 = Theta - Theta_0, and showed that in several statistical models involving normal distributions, lambda(X) simplified and produced intuitive tests whose critical values can be obtained from the Student t and F distributions. In Section 6.1 we developed exact tests and confidence regions that are appropriate in regression and analysis of variance (ANOVA) situations when the responses are normally distributed. We shall show (see Section 6.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. However, in many experimental situations in which the likelihood ratio test can be used to address important questions, the exact critical value is not available analytically. Moreover, we need methods for situations in which, as in the log linear and generalized linear models, covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to be appropriate approximations. In these cases exact methods are typically not available, and we turn to asymptotic approximations to construct tests, confidence regions, and other methods of inference. In this section we will use the results of Section 6.2 to extend some of the results of Section 5.4 to vector-valued parameters. We present three procedures that are used frequently: likelihood ratio, Wald and Rao large sample tests, and confidence procedures.

6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic

These questions were treated for theta real in Section 5.4.4. The approximation we shall give is based on the result "2 log lambda(X) converges in law to chi^2_d" for degrees of freedom d to be specified later, which is usually referred to as Wilks's theorem or approximation. Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations. Here is an example in which Wilks's approximation to L(lambda(X)) is useful:

Example 6.3.1. Suppose X_1, X_2, ..., X_n are i.i.d. as X, where X has the gamma, Gamma(alpha, beta), distribution with density

p(x, theta) = beta^alpha x^{alpha-1} exp{-beta x} / Gamma(alpha),  x > 0, alpha > 0, beta > 0,  theta = (alpha, beta).

Suppose we want to test H : alpha = 1 (exponential distribution) versus K : alpha not equal to 1. In Example 2.3.2 we showed that the MLE theta_hat = (alpha_hat, beta_hat) exists, and in Section 2.4 we showed how to find theta_hat as a nonexplicit solution of the likelihood equations. Thus, the numerator of lambda(x) is available as p(x, theta_hat) = prod_{i=1}^n p(X_i, theta_hat). The MLE of beta under H is readily seen from (2.3.5) to be beta_hat_0 = 1/X_bar, and prod_{i=1}^n p(X_i, 1, beta_hat_0) is the denominator of the likelihood ratio statistic. It remains to find the critical value, which is not available analytically. Using Wilks's theorem, we conclude that under H the chi^2_1 distribution is an approximation to L(2 log lambda(X)).

We next give an example that can be viewed as the limiting situation for which the approximation is exact:

Example 6.3.2. The Gaussian Linear Model with Known Variance. Let Y_1, ..., Y_n be independent with Y_i ~ N(mu_i, sigma_0^2), where sigma_0 is known. As in Section 6.1 we test whether mu = (mu_1, ..., mu_n)^T is a member of a q-dimensional linear subspace omega_0 of R^n versus the alternative that mu is in omega - omega_0, where omega is an r-dimensional linear subspace of R^n, omega contains omega_0, and q < r. We transform to canonical form by setting U = AY, where A (n x n) is an orthogonal matrix with rows v_1^T, ..., v_n^T such that v_1, ..., v_q span omega_0 and v_1, ..., v_r span omega. Set theta_i = eta_i/sigma_0 and X_i = U_i/sigma_0, i = 1, ..., n. Then X_i ~ N(theta_i, 1), i = 1, ..., r, X_i ~ N(0, 1), i = r+1, ..., n, and H is equivalent to H : theta_{q+1} = ... = theta_r = 0. Using Section 6.1 we conclude that, under H,

2 log lambda(Y) = sum_{i=q+1}^r X_i^2 ~ chi^2_{r-q}.

In this sigma^2 known example, Wilks's approximation is exact.

We illustrate the remarkable fact that the chi^2_{r-q} approximation to the null distribution of 2 log lambda holds quite generally when the hypothesis is a nice q-dimensional submanifold of an r-dimensional parameter space. Consider the general i.i.d. case with X_1, ..., X_n a sample from p(x, theta), where x is in X and theta is in Theta, an open subset of R^r. Write the log likelihood as

l_n(theta) = sum_{i=1}^n log p(X_i, theta).

We first consider the simple hypothesis H : theta = theta_0.

Theorem 6.3.1. Suppose the assumptions of Theorem 6.2.2 are satisfied. Then, under H : theta = theta_0,

2 log lambda(X) = 2[l_n(theta_hat_n) - l_n(theta_0)] converges in law to chi^2_r.    (6.3.1)

Proof. Because theta_hat_n solves the likelihood equation D_theta l_n(theta) = 0, where D_theta is the derivative with respect to theta, an expansion of l_n(theta) about theta_hat_n evaluated at theta = theta_0 gives

2[l_n(theta_hat_n) - l_n(theta_0)] = n(theta_hat_n - theta_0)^T I_n_bar(theta*_n)(theta_hat_n - theta_0)    (6.3.2)

for some theta*_n with |theta*_n - theta_0| <= |theta_hat_n - theta_0|, where I_n_bar(theta) = -n^{-1} sum_{i=1}^n (partial^2/partial theta_j partial theta_k) log p(X_i, theta). By Theorem 6.2.2, sqrt(n)(theta_hat_n - theta_0) converges in law to V ~ N(0, I^{-1}(theta_0)). Because |theta*_n - theta_0| <= |theta_hat_n - theta_0| converges in probability to 0, we can conclude, arguing from A3 and A4, that I_n_bar(theta*_n) converges in probability to I(theta_0), where I (r x r) is the Fisher information matrix. Hence, by Slutsky's theorem,

2[l_n(theta_hat_n) - l_n(theta_0)] converges in law to V^T I(theta_0) V ~ chi^2_r.    0

As a consequence of the theorem, the test that rejects H : theta = theta_0 when

2 log lambda(X) >= x_r(1 - alpha),

where x_r(1 - alpha) is the 1 - alpha quantile of the chi^2_r distribution, has approximately level alpha, and

{theta : 2[l_n(theta_hat_n) - l_n(theta)] <= x_r(1 - alpha)}    (6.3.3)

is a confidence region for theta with approximate coverage probability 1 - alpha.

Example 6.3.3. The Gaussian Linear Model with Unknown Variance. If the Y_i are as in Example 6.3.2 but sigma^2 is unknown, then theta = (mu, sigma^2) ranges over an (r+1)-dimensional manifold, whereas under H it ranges over a (q+1)-dimensional manifold. In Section 6.1 we derived

2 log lambda(Y) = n log( 1 + sum_{i=q+1}^r X_i^2 / sum_{i=r+1}^n X_i^2 ).

Applying the law of large numbers to V_n = sum_{i=q+1}^r X_i^2 / sum_{i=r+1}^n X_i^2 and expanding log(1 + t), we conclude that n V_n converges in law to chi^2_{r-q} and hence that 2 log lambda(Y) converges in law to chi^2_{r-q} also in the sigma^2 unknown case. Note that for the statistic lambda*(Y) defined in Remark 6.1.2, 2 log lambda*(Y) converges in law to chi^2_{r-q} as well.    0
.1I0 + M. under the conditions of Theorem 6.3. • 1 y{X) = sup{P{x. His {fJ E applying (6. = 0.. J) distribution by (6.+I> .0. 18q acting as r nuisance parameters. .(X) > x r ./ L i=I n D'1I{X" liD + M.3.9) .8 0 : () E 8}.(l .3..I '1): 11 0 + M. with ell _ .2.00 .. .. '1 E Me}. ~ 'I.1I 0 + M.2. (O) r • + op(l) (6.3..9). This asymptotic level 1 ..8) that if T('1) then = 1 2 n.1/ 2 p = J.10) LTl(o) . IIr ) : 2[ln(lI n ) .1'1 '1 E eo} and from (B. if A o _ {(} .all (6.0: confidence region is i .3.+1 ~ .1 '1ll/sup{p(x.a) is an asymptotically level a test of H : 8 E Of equal importance is that we obtain an asymptotic confidence region for (e q+ 1.. ••. I I eo~ ~ ~ {(O.5) to "(X) we obtain..l.00 : e E 8 0} MAD = {'1: '/.LTl(o) + op(l) i=l i=l L i=q+l r Ti2 (0) + op(l).In( 00 .(X) = y{X) where ! 1 1 .. .13) D'1I{x. . is invariant under reparametrization . rejecting if .1> . = .11) .6) and (6.1 '1). a piece of 8. Now write De for differentiation with respect to (J and Dry for differentiation with respect to TJ.7). because in tenus of 7]. • . Me : 1]q+l = TJr = a}. .• I .8.Tf(O)T .II).3. then by 2 log "(X) TT(O)T(O) .2..3.1 '1) = [M. Tbe result follows from (6.. by definition. : • Var T(O) = pT r ' /2 II.l.1ITD III(x.+ 1 . Thus.. We deduce from (6. Moreover.3.1I0 + M.(1 .1 because /1/2 Ao is the intersection of a q dirnensionallinear subspace of R r with JI/2{8 . 1e ). 0 Note that this argument is simply an asymptotic version of the one given in Example 6. Note that.3.396 Inference in the Multiparameter Case Chapter 6 and P is an orthogonal matrix such that.3. ). Such P exists by the argument given in Example 6. which has a limiting X~q distribution by Slutsky's theorem because T(O) has a limiting Nr(O. (6. Or)J < xr_...
'! are the MLEs.3. R.13) then the set {(8" . .13).3. 9j . if 0 0 is true.2 to this situation is easy and given in Problem 6. "'0 ~ {O : OT Vj = 0.13).6.3. .. t. 8q)T : 0 E 8} is opeu iu Rq.3.2 falls under this schema with 9. Theorem 6. l B.5 and 6.3.3.cTe:::.'':::':::"::d_C:::o:::..2. 2' 2 + X 2 )) 2 and 210g A(X) ~ xf but if 81 + 82 < 1 clearly 2 log A(X) ~ op(I). 8."de"'::'"e:::R::e"g.( 0 ) ~ o} (6.3. Examples such as testing for independence in contingency tables.3.. The formulation of Theorem 6.80.2 and the previously conditions on g.14) a hypothesis of the fonn (6. which require the following general theorem.o:::"'' 397 CC where 00. Theorem 6.t. The proof is sketched iu Problems (6.3. . . Vl' are orthogonal towo and v" .e:::S:::. q + 1 < j < r written as a vector g.3... 8q o assuming that 8q + 1 . If 81 + B2 = 1.::.3. of f)l" . A(X) behaves asymptotically like a test for H : 0 E 8 00 where 800 = {O E 8: Dg(Oo)(O .. let (XiI. Here the dimension of 8 0 and 8 is the same but the boundary of 8 0 has lower dimension. Then.3. Suppose H is specified by: There exist d functions. (6.2 is still inadequate for most applications.) be i.3.3.. q + 1 < j < r. .p:::'e:.3.. where J is the 2 x 2 identity matrix aud 8 0 = {O : B1 + O < I}. J).2)(6.d. Suppose the assumptions of Theorem 6. More sophisticated examples are given in Problems 6... " ' J v q and Vq+l " . More complicated linear hypotheses such as H : 6 .3. As an example ofwhatcau go wrong. themselves depending on 8 q + 1 . X.. q + 1 <j < r}.. v r span " R T then. Wilks's theorem depends critically On the fact that nOt only is open but that if giveu in (6.:::f.i. B2 ..13) Evideutly...8r are known.3. The esseutial idea is that. .(0) ~ 8j . will appear in the next section.... .8 0 E Wo where Wo is a linear space of dimension q are also covered. such that Dg(O) exists and is of rauk r . 2 e eo Ii o ~ (Xl + X 2) 2 + ~ 1 _ (X..n under H is consistent for all () E 8 0.o''C6::. It can be extended as follows. if A(X) is ~ the likelihood ratio statistic for H : 0 E 8 0 given in (6. (6... N(B 1 ...3. Defiue H : 0 E 8 0 with e 80 = {O E 8 : g(O) = o}.... Suppose the MLE hold (JO. We need both properties because we need to analyze both the numerator and denominator of A(X)..:::m.::Se::.g. We only need note that if WQ is a linear space spanned by an orthogonal basis v\.... . 210g A(X) ". X~q under H.12) The extension of Theorem 6.l:::.3)..q at all 0 E 8.
16) n(9 n _IJ)TI(9 n )(9 n IJ) !:.15) Because I( IJ) is continuous in IJ (Problem 6. [22 is continuous.7.18) i 1 Proof..0.Nr(O.q' ~(2) (6. for instance.9). .3. (6.3.398 Inference in the Multiparameter Case Chapter 6 6.Or) and define the Wald (6. theorem completes the proof.. it follows from Proposition B.lJ n ) where IJ n statistic as ~(1) ~(2) ~(1) = (0" . j Wn(IJ~2)) !:. 0 ""11 F .~ D 2l n (1J 0 ) or I (IJ n ) or .2.3.a)} is an ellipsoid in W easily interpretable and computablesee (6. (6.2 Wald's and Rao's Large Sample Tests The Wald Test Suppose that the assumptions of Theorem 6.16).. i favored because it is usually computed automatically with the MLE. Under the conditions afTheorem 6.2. I.15) and (6. Y .2. the lower diagonal block of the inverse of . . .7.. It follows that the Wald test that rejects H : (J = 6 0 in favor of K : () • i= 00 when .3..3.1.2 hold. Then . respectively.3. I 22 (9 n ) is replaceable by any consistent estimate of J22( 8).10).X.fii(lJn (2) L 1J 0 ) ~ Nd(O. . ~ Theorem 6. (6. I22(lJo)) if 1J 0 E eo holds. according to Corollary B. . If H is true. y T I(IJ)Y ..~ D 2ln (IJ n ). For the more general hypothesis H : (} E 8 0 we write the MLE for 8 E as = ~ ! I: (IJ n . X.2. y T I(IJ)Y..: b. in particular . . hence.l(a) that I(lJ n ) ~ I(IJ) asn By Slutsky's theorem B. r 1 (1J)) where. 1(8) continuous implies that [1(8) is continuous and.2.) and IJ n ~ ~ ~(2) ~ (Oq+" . has asymptotic level Q.fii(1J IJ) ~N(O. i .6.2.3. More generally /(9 0 ) can be replaced by any consistent estimate of I( 1J0 ). The last Hessian choice is ~ ~ . It and I(8 n ) also have the advantage that the confidence region One generates {6 : Wn (6) < xp(l . ~ ~ 00.2.31).3.17) ~ ~ e ~ en 1 • I I Wn(IJ~2)) ~ n(9~2) 1J~2)l[I"(9nJrl(9~) _1J~2») where I 22 (IJ) is the lower diagonal block of II (IJ) written as I2 I 1 _ (Ill(lJ) (IJ) I21(IJ) I (1J)) I22(IJ) with diagonal blocks of dimension q x q and d x d. More generally. Slutsky's it I' I' . But by Theorem 6.3.~ D 2 1n ( IJ n).4. the Hessian (problem 6. I .3.rl(IJ» asn ~ p ~ L 00.
as (6.19) : (J c: 80.Q). (J = 8 0 ..0:) is called the Rao score test.3 Large Sample Tests and Confidence Regions 399 The Wold test. the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n.1/ 2 D 2 In (0) where D 1 l n represents the q x 1 gradient with respect to the first q coordinates and D 2 l n the d x 1 gradient with respect to the last d. This test has the advantage that it can be carried out without computing the MLE. Let eo '1'n(O) = n.\(X) is the LR statistic for H Thus.ln(Oo.. N(O. X. which rejects iff HIn (9b ») > x 1_ q (1 .21) where III is the upper left q x q block of the r x r infonnation matrix I (80 ). a consistent estimate of . Rao's SCOre test is based on the observation (6.. 2 Wn(Oo ) ~(2) where . under H.3. What is not as evident is that.nlE ~ (2) _ T  . therefore.n W (80 . the two tests are equivalent asymptotically.. and so on.3.n under H.. = 2 log A(X) + op(l) (6.3. .nl]  2  1 . un.8) E(Oo) = 1.. by the central limit theorem.3.19) indicates. and the convergence Rn((}o) ~ X.ln(8 0 . (6.I Dln ((Jo) is the likelihood score vector. is. Rn(Oo) = nt/J?:(Oo)r' (OO)t/Jn(OO) !. The extension of the Rao test to H : (J E runs as follows.3.(00 )  121 (0 0 )/.. as n _ CXJ. The argument is sketched in Problem 6. The test that rejects H when R n ( (}o) > x r (1.9. in practice they can be very different.3..n) under n H. It can be shown that (Problem 6.n] D"ln(Oo.6. 112 is the upper right q x d block. the asymptotic variance of .' (0 0 )112 (00) (6. ~ B ) T given by {8(2) : r W n (0(2)) < x r _ q (1 These regions are ellipsoids in R d Although. (Problem 6.Section 6.9) under AOA6 and consistency of (JO. ~1 '1'n(OO. The Rao Score Test For the simple hypothesis H that. It follows from this and Corollary B.3..n) 2  + D21 ln (00.22) .2 that under H.:El ((}o) is ~ n 1 [D. asymptotically level Q.n)[D.. The Wald test leads to the Wald confidence regions for (B q +] . requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests. The Rao test is based on the statistic Rn(Oo ) ~n'1'n(Oo..n) ~  where :E is a consistent estimate of E( (}o).3.. 1(0 0 )) where 1/J n = n .20) vnt/Jn(OO) !. Furthermore.
and showed that this quadratic fonn has limiting distribution q .400 Inference in the Multiparameter Case Chapter 6 where D~ is the d x d matrix of second partials of ill with respect to matrix of mixed second partials with respect to e(l). 2 log A(X) has an asymptotic X. For instance.. e(l).4 LARGE SAMPLE METHODS FOR DISCRETE DATA In this section we give a number of important applications of the general methods we have developed to inference for discrete data. called the Wald statistic. Rao. On the other hand. . which stales that if A(X) is the LR statistic.5. Under H : (J E A6 required only for Po eo and the conditions ADAS a/Theorem 6.) . . .8 0 where is an open subset of R: and 8 0 is the collection of 8 E 8 with the last r ~ q coordinates 0(2) specified. D 21 the d x d Theorem 6. Summary. it shares the disadvantage of the Wald test that matrices need to be computed and inverted. for the Wald test neOn .3. . The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under H. We also considered a quadratic fonn. which is based on a quadratic fonn in the gradient of the log likelihood. under regularity conditions. We considered the problem of testing H : 8 E 80 versus K : 8 E 8 . Finally..q distribution under H.19) holds under On and that the power behavior is unaffected and applies to all three tests. where X~ (')'2) is the noncentral chi square distribution with m degrees of freedom and noncentrality parameter "'? It may he shown that the equivalence (6. Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score testssee Rao (1973) for more on this.2 but with Rn(lJb 2 1 1 » !:. and Wald Tests It is possible as in the onedimensional case to derive the asymptotic power for these tests for alternatives of the form On = ~ eo + ~ where 8 0 E 8 0 . X. . I I . X~' 2 j > xd(1 . 0(2).)I(On)( vn(On  On) + A) !:. b . I . which measures the distance between the hypothesized value of 8(2) and its MLE. The analysis for 8 0 = {Oo} is relatively easy.3.a)}.0 0 ) I(On)(On . Power Behavior of the LR. . . .2..0 0 ) = T'" "" (vn(On .On) + t. then. we introduced the Rao score test. In particular we shall discuss problems of . The asymptotic distribution of this quadratic fonn is also q• I e X: X: 6._q(A T I(Oo)t. and so on. We established Wilks's theorem.a)} and • The Rao large sample critical and confidence regions are {R n (Ob » {0(2) : Rn (0(2) < xd(l.
6.4 LARGE SAMPLE METHODS FOR DISCRETE DATA

In this section we give a number of important applications of the general methods we have developed to inference for discrete data. In particular we shall discuss problems of goodness-of-fit and special cases of log linear and generalized linear models (GLM), treated in more detail in Section 6.5.

6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's chi^2 Test

As in Examples 1.6.7, 2.2.8, and 2.3.3, consider i.i.d. trials in which X_i = j if the ith trial produces a result in the jth category, and let theta_j = P(X_i = j) be the probability of the jth category, j = 1, ..., k. We consider the parameter theta = (theta_1, ..., theta_{k-1})^T and test the hypothesis H : theta_j = theta_{0j} for specified theta_{0j}, j = 1, ..., k-1, with theta_{0k} = 1 - sum_{j=1}^{k-1} theta_{0j}. In Example 2.2.8 we found the MLE theta_hat_j = N_j/n, where N_j = sum_{i=1}^n 1{X_i = j}. It follows that the large sample LR rejection region is

2 log lambda(X) = 2 sum_{j=1}^k N_j log(N_j / n theta_{0j}) >= x_{k-1}(1 - alpha).

To find the Wald test, we need the information matrix I = ||I_{ij}||. Using (2.2.33) and (3.4.32) we find

I_{ij} = 1/theta_k if i not equal to j,   I_{ij} = 1/theta_i + 1/theta_k if i = j,   i, j = 1, ..., k-1.

Thus, the Wald statistic is

W_n(theta_0) = n sum_{j=1}^{k-1} (theta_hat_j - theta_{0j})^2 / theta_{0j} + n sum_{i=1}^{k-1} sum_{j=1}^{k-1} (theta_hat_i - theta_{0i})(theta_hat_j - theta_{0j}) / theta_{0k}.

Because theta_hat_k = 1 - sum_{j=1}^{k-1} theta_hat_j, the second term on the right is n(theta_hat_k - theta_{0k})^2/theta_{0k}. Thus,

W_n(theta_0) = sum_{j=1}^k (N_j - n theta_{0j})^2 / n theta_{0j}.

The term on the right is called Pearson's chi-square (chi^2) statistic and is the statistic that is typically used for this multinomial testing problem. It is easily remembered as

chi^2 = SUM (Observed - Expected)^2 / Expected,    (6.4.1)

where the sum is over categories and "expected" refers to the expected frequency E_H(N_j). To derive the Rao test, note from Example 2.3.3 that I^{-1}(theta) = (1/n) Var(N), where N = (N_1, ..., N_{k-1})^T, so I^{-1}(theta) = ||a_{ij}|| (k-1) x (k-1) with a_{ij} = theta_i(1 - theta_i) if i = j and a_{ij} = -theta_i theta_j if i not equal to j. Writing

(theta_hat_j - theta_{0j})/theta_{0j} - (theta_hat_k - theta_{0k})/theta_{0k} = [theta_{0k}(theta_hat_j - theta_{0j}) - theta_{0j}(theta_hat_k - theta_{0k})] / theta_{0j} theta_{0k}

and expanding the square keeping the square brackets intact, one finds after some algebra that the Rao statistic equals Pearson's chi^2. The general form (6.4.1) of Pearson's chi^2 will reappear in other multinomial applications in this section.

Example 6.4.1. Testing a Genetic Theory. In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. Possible types of progeny were: (1) round yellow, (2) wrinkled yellow, (3) round green, and (4) wrinkled green. If we assume the seeds are produced independently, we can think of each seed as the outcome of a multinomial trial with possible outcomes numbered 1, 2, 3, 4 as above and associated probabilities of occurrence theta_1, theta_2, theta_3, theta_4. Mendel's theory predicted that theta_1 = 9/16, theta_2 = theta_3 = 3/16, theta_4 = 1/16. Mendel observed n_1 = 315, n_2 = 101, n_3 = 108, n_4 = 32, and we want to test whether the distribution of types in the n = 556 trials he performed (seeds he observed) is consistent with his theory. Here k = 4, n theta_{10} = 312.75, n theta_{20} = n theta_{30} = 104.25, n theta_{40} = 34.75, and

chi^2 = (2.25)^2/312.75 + (3.25)^2/104.25 + (3.75)^2/104.25 + (2.75)^2/34.75 = 0.47,

which has a p-value of 0.9 when referred to a chi^2_3 table. For comparison, 2 log lambda = 0.48 in this case. There is insufficient evidence to reject Mendel's hypothesis. However, this value may be too small! See Note 1.    0

6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables

Suppose N = (N_1, ..., N_k)^T has a multinomial, M(n, theta), distribution, where theta ranges over the (k-1)-dimensional parameter space

Theta = {theta : theta_i >= 0, sum_{i=1}^k theta_i = 1},

and Theta_0 is a composite "smooth" subset of Theta. We will investigate how to test H : theta in Theta_0 versus K : theta not in Theta_0. For instance, in the Hardy-Weinberg model (Example 2.1.4), Theta_0 = {(eta^2, 2 eta(1 - eta), (1 - eta)^2) : 0 <= eta <= 1}, which is a one-dimensional curve in the two-dimensional parameter space Theta; testing the adequacy of the Hardy-Weinberg model means testing H : theta in Theta_0 versus K : theta in Theta_1 = Theta - Theta_0. Other examples, which will be pursued further later in this section, involve restrictions on the theta_i obtained by specifying independence assumptions on classifications of cases into different categories.

We suppose that we can describe Theta_0 parametrically as Theta_0 = {theta(eta) : eta in E}, where eta = (eta_1, ..., eta_q)^T, E is a subset of q-dimensional space, and the map eta -> (theta_1(eta), ..., theta_k(eta))^T takes E into Theta_0. To avoid trivialities we assume q < k - 1. Let p(n_1, ..., n_k, theta) denote the frequency function of N. Maximizing p(n_1, ..., n_k, theta) for theta in Theta_0 is the same as maximizing p(n_1, ..., n_k, theta(eta)) for eta in E, and the log likelihood ratio is given by

log lambda(n_1, ..., n_k) = sum_{i=1}^k n_i [log(n_i/n) - log theta_i(eta_hat)],

where eta_hat is the MLE of eta. If eta -> theta(eta) is differentiable in each coordinate, E is open, and eta_hat exists, then eta_hat satisfies the likelihood equation

sum_{i=1}^k (n_i / theta_i(eta)) (partial/partial eta_j) theta_i(eta) = 0,  1 <= j <= q.    (6.4.3)

If E is not open, sometimes the closure of E will contain a solution of (6.4.3); if a maximizing value eta_hat exists, it must solve the likelihood equation for the model. To apply the results of Section 6.3, we define omega_j = g_j(theta), where g_j is chosen so that H becomes equivalent to "omega_j = 0, j = q+1, ..., k-1"; for instance, to test the Hardy-Weinberg model we set omega_1 = theta_1, omega_2 = theta_2 - 2 sqrt(theta_1)(1 - sqrt(theta_1)) and test H : omega_2 = 0. Then we can conclude from Theorem 6.3.3 that, under H, 2 log lambda approximately has a chi^2_{k-1-q} distribution for large n. Moreover, by the algebra of Section 6.4.1, the Rao statistic for the composite multinomial hypothesis is obtained by replacing theta_{0j} in the Pearson statistic by theta_j(eta_hat):

R_n(theta(eta_hat)) = sum_{j=1}^k [N_j - n theta_j(eta_hat)]^2 / n theta_j(eta_hat) = chi^2,

where the right-hand side is Pearson's chi^2 as defined in general by (6.4.1). The Rao statistic is invariant under reparametrization and is, thus, also equal to Pearson's chi^2. The Wald statistic is only asymptotically invariant under reparametrization; the Wald statistic based on the parametrization theta(eta), obtained by replacing theta by theta(eta_hat), is under H also approximately chi^2_{k-1-q}.

Example 6.4.2. The Fisher Linkage Model. A self-crossing of maize heterozygous on two characteristics (starchy versus sugary, green base leaf versus white base leaf) leads to four possible offspring types: (1) sugary-white, (2) sugary-green, (3) starchy-white, (4) starchy-green. If N_i is the number of offspring of type i among a total of n offspring, then (N_1, ..., N_4) has an M(n, theta_1, ..., theta_4) distribution. A linkage model (Fisher, 1958, p. 301) specifies that

Theta_0 = {( (2 + eta)/4, (1 - eta)/4, (1 - eta)/4, eta/4 ) : 0 <= eta <= 1},

where eta is an unknown number between 0 and 1, a "one-dimensional curve" of the three-dimensional parameter space Theta. To test the validity of the linkage model we would take Theta_0 as the hypothesis. The likelihood equation (6.4.3) becomes

n_1/(2 + eta) - (n_2 + n_3)/(1 - eta) + n_4/eta = 0,    (6.4.4)

which reduces to a quadratic equation in eta_hat; the only root in [0, 1] is the desired estimate (see the problems). Because q = 1, k = 4, we obtain critical values from the chi^2_2 tables.    0

Example 6.4.3. Hardy-Weinberg. Here k = 3, q = 1. We found in Example 2.2.6 that eta_hat = (2 n_1 + n_2)/2n, so that

theta(eta_hat) = ( ((2n_1 + n_2)/2n)^2,  2 (2n_1 + n_2)(2n_3 + n_2)/(2n)^2,  ((2n_3 + n_2)/2n)^2 ),

and H is rejected if chi^2 >= x_1(1 - alpha) with the expected frequencies computed from theta(eta_hat).    0

Testing Independence of Classifications in Contingency Tables

Many important characteristics have only two categories. An individual either is or is not inoculated against a disease, is male or female, is or is not a smoker, and so on. We often want to know whether such characteristics are linked or are independent. For instance, do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic A and A-bar and of the second B and B-bar. Then a randomly selected individual from the population can be one of four types AB, AB-bar, A-bar B, A-bar B-bar; denote the probabilities of these types by theta_11, theta_12, theta_21, theta_22, respectively. Independent classification then means that the events [being an A] and [being a B] are independent or, in terms of the theta_ij,

theta_ij = (theta_i1 + theta_i2)(theta_1j + theta_2j).

To study the relation between the two characteristics we take a random sample of size n from the population. The results are assembled in what is called a 2 x 2 contingency table whose entries N_ij, i, j = 1, 2, indicate the number of individuals in the sample who belong to the categories of the appropriate row and column; for example, N_12 is the number of sampled individuals who fall in category A of the first characteristic and category B-bar of the second. Thus, if N = (N_11, N_12, N_21, N_22)^T, then N ~ M(n, theta_11, theta_12, theta_21, theta_22). We test the hypothesis H : theta in Theta_0 versus K : theta not in Theta_0, where Theta_0 is the two-dimensional subset of Theta given by

Theta_0 = {( eta_1 eta_2, eta_1(1 - eta_2), (1 - eta_1) eta_2, (1 - eta_1)(1 - eta_2) ) : 0 <= eta_1 <= 1, 0 <= eta_2 <= 1}.

Here we have relabeled theta_11 + theta_12 and theta_11 + theta_21 as eta_1 and eta_2 to indicate that these are parameters, which vary freely: eta_1 and eta_2 are the proportions of individuals of type A and of type B, respectively. For theta in Theta_0, the likelihood equations (6.4.3) become

(n_11 + n_12)/eta_1 - (n_21 + n_22)/(1 - eta_1) = 0,   (n_11 + n_21)/eta_2 - (n_12 + n_22)/(1 - eta_2) = 0,    (6.4.5)

whose solutions

eta_hat_1 = (n_11 + n_12)/n,   eta_hat_2 = (n_11 + n_21)/n

are the maximum likelihood estimates. Pearson's statistic is then easily seen to be

chi^2 = sum_{i=1}^2 sum_{j=1}^2 (N_ij - R_i C_j/n)^2 / (R_i C_j/n),    (6.4.7)

where R_i = N_i1 + N_i2 is the ith row sum and C_j = N_1j + N_2j is the jth column sum. By our theory, if H is true, then because k = 4, q = 2, chi^2 has approximately a chi^2_1 distribution. In fact (Problem 6.4.3) the (N_ij - R_i C_j/n) are all the same in absolute value, which suggests that chi^2 may be written as the square of a single (approximately) standard normal variable Z. An important form for Z is given by

Z = sqrt(n) [P_hat(A | B) - P_hat(A | B-bar)] [P_hat(B) P_hat(B-bar) / (P_hat(A) P_hat(A-bar))]^{1/2},    (6.4.8)

where P_hat is the empirical distribution and we use A, B to denote the event that a randomly selected individual has characteristic A, B. Thus, if chi^2 measures deviations from independence, Z indicates what directions these deviations take. Positive values of Z indicate that A and B are positively associated, that is, that A is more likely to occur in the presence of B than it would in the presence of B-bar. It may be shown (Problem 6.4.3) that if A and B are independent, that is, P(A | B) = P(A | B-bar), then Z is approximately distributed as N(0, 1). Therefore, it is reasonable to use the test that rejects if, and only if, Z >= z(1 - alpha) as a level alpha one-sided test of H : P(A | B) = P(A | B-bar) (or P(A | B) <= P(A | B-bar)) versus K : P(A | B) > P(A | B-bar). The chi^2 test is equivalent to rejecting (two-sidedly) if, and only if, |Z| >= z(1 - alpha/2).

Next we consider contingency tables for two nonnumerical characteristics having a and b states, a, b >= 2 (e.g., hair color, eye color). If we take a sample of size n from a population and classify the individuals according to each characteristic, we obtain a vector N_ij, i = 1, ..., a, j = 1, ..., b, where N_ij is the number of individuals of type i for characteristic 1 and j for characteristic 2. Then {N_ij : 1 <= i <= a, 1 <= j <= b} ~ M(n, theta_ij : 1 <= i <= a, 1 <= j <= b), where theta_ij = P[a randomly selected individual is of type i for 1 and j for 2]. The N_ij can be arranged in an a x b contingency table with row sums R_i and column sums C_j. The hypothesis that the characteristics are assigned independently becomes H : theta_ij = eta_i1 eta_j2 for 1 <= i <= a, 1 <= j <= b, where the eta_i1, eta_j2 are nonnegative and sum_{i=1}^a eta_i1 = sum_{j=1}^b eta_j2 = 1. Maximum likelihood and dimensionality calculations similar to those for the 2 x 2 table show that Pearson's chi^2 for the hypothesis of independence is given by

chi^2 = sum_{i=1}^a sum_{j=1}^b (N_ij - R_i C_j/n)^2 / (R_i C_j/n),    (6.4.9)

which has approximately a chi^2_{(a-1)(b-1)} distribution under H. The argument is left to the problems, as are some numerical applications.

6.4.3 Logistic Regression for Binary Responses

In Section 6.1 we considered linear models that are appropriate for analyzing continuous responses {Y_i} that are, perhaps after a transformation, approximately normally distributed and whose means are modeled as mu_i = sum_{j=1}^p z_ij beta_j for known constants {z_ij} and unknown parameters beta_1, ..., beta_p. In this section we will consider Bernoulli responses Y that can only take on the values 0 and 1; we call Y = 1 a "success" and Y = 0 a "failure." Examples are (1) medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0); (2) election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0); or (3) market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0). We assume that the distribution of the response Y depends on the known covariate vector z^T. As is typical, we assume the data are grouped or replicated so that, for each fixed i, we observe the number of successes X_i = sum_{j=1}^{m_i} Y_ij, where Y_ij is the response on the jth of the m_i trials in block i; thus, we observe independent X_1, ..., X_k with X_i binomial, B(m_i, pi_i), where pi_i = pi(z_i) is the probability of success for a case with covariate vector z_i. The log likelihood of pi = (pi_1, ..., pi_k)^T based on X = (X_1, ..., X_k)^T is

sum_{i=1}^k [ X_i log(pi_i/(1 - pi_i)) + m_i log(1 - pi_i) ] + sum_{i=1}^k log binom(m_i, X_i).

Next we choose a parametric model for pi(z) that will generate useful procedures for analyzing experiments with binary responses. Because pi(z) varies between 0 and 1, a simple linear representation z^T beta for pi(.) over the whole range of z is impossible. Instead we turn to the logistic transform g(pi), usually called the logit, which we introduced in Example 1.6.8 as the canonical parameter

eta = g(pi) = log[pi/(1 - pi)].

Other transforms, such as the probit g_1(pi) = Phi^{-1}(pi), where Phi is the N(0, 1) d.f., and the log-log transform g_2(pi) = log[-log(1 - pi)], are also used in practice. When we use the logit transform, we obtain what is called the logistic linear regression model, where

eta_i = log[pi_i/(1 - pi_i)] = z_i^T beta = sum_{j=1}^p z_ij beta_j.    (6.4.11)

The special case p = 2, z_i^T = (1, z_i), is the logistic regression model of Problem 2.3.1. The log likelihood l_N(beta) of beta = (beta_1, ..., beta_p)^T is, if N = sum_{i=1}^k m_i,

l_N(beta) = sum_{j=1}^p beta_j T_j - sum_{i=1}^k m_i log(1 + exp{z_i^T beta}) + sum_{i=1}^k log binom(m_i, X_i),    (6.4.12)

where T_j = sum_{i=1}^k z_ij X_i and we make the dependence on N explicit. Note that l_N(beta) is the log likelihood of a p-parameter canonical exponential model with parameter vector beta and sufficient statistic T = (T_1, ..., T_p)^T. We let mu_i = E(X_i) = m_i pi_i; then E(T_j) = sum_{i=1}^k z_ij mu_i, or E_beta(Z^T X) = Z^T mu, where Z = ||z_ij|| (k x p) is the design matrix. It follows that the MLE of beta solves E_beta(T) = T; that is, the likelihood equations are just

Z^T (X - mu) = 0.    (6.4.13)

By Theorem 2.3.1, if 0 < X_i < m_i for 1 <= i <= k and Z has rank p, the solution to this equation exists and gives the unique MLE beta_hat of beta. The condition is sufficient but not necessary for existence; see Problem 2.3.1. By Theorem 2.3.1 and Proposition 3.4.4, the Fisher information matrix is

I(beta) = Z^T W Z,    (6.4.14)

where W = diag{m_i pi_i(1 - pi_i)} (k x k).

The coordinate ascent iterative procedure of Section 2.4.2 can be used to compute the MLE of beta. Alternatively, with a good initial value, the Newton-Raphson algorithm can be employed; although, unlike coordinate ascent, Newton-Raphson need not converge, we can guarantee convergence with probability tending to 1 as N -> infinity as follows. As the initial estimate use

beta_hat_0 = (Z^T Z)^{-1} Z^T V,    (6.4.15)

where

V_i = log[(X_i + 1/2) / (m_i - X_i + 1/2)],    (6.4.16)

the empirical logistic transform. Here the adjustment 1/2 is used to avoid log 0 and log infinity. Because beta = (Z^T Z)^{-1} Z^T eta and eta_i = log[pi_i/(1 - pi_i)] has been replaced by V_i, beta_hat_0 is a plug-in estimate of beta in which pi_i and 1 - pi_i have been replaced by pi_i* = (X_i + 1/2)/(m_i + 1) and (1 - pi_i)* = 1 - pi_i*. Using the delta method, it follows that if m_i pi_i(1 - pi_i) stays bounded away from 0 for 1 <= i <= k, then beta_hat_0 is consistent as the m_i -> infinity (Problem 6.4.14). In W, pi_i(1 - pi_i) is estimated using W_0 = diag{m_i pi_i*(1 - pi_i)*}.

Testing

In analogy with Section 6.1, we let omega = {eta : eta_i = sum_{j=1}^p z_ij beta_j, beta in R^p} and let r be the dimension of omega. We want to contrast omega with the case where there are no restrictions on eta; that is, we set Omega = R^k and consider eta in Omega. In this case the likelihood is a product of independent binomial densities, and the MLEs of pi_i and mu_i under Omega are X_i/m_i and X_i. The LR statistic 2 log lambda for testing H : eta in omega versus K : eta in Omega - omega is denoted by D(X, mu_hat), where mu_hat is the MLE of mu for eta in omega. From (6.4.11) and (6.4.12),

D(X, mu_hat) = 2 sum_{i=1}^k [ X_i log(X_i/mu_hat_i) + X_i' log(X_i'/mu_hat_i') ],    (6.4.17)

where X_i' = m_i - X_i and mu_hat_i' = m_i - mu_hat_i. D(X, mu_hat), called the deviance, measures the distance between the fit mu_hat based on the model omega and the data X. By Theorem 6.3.3, D(X, mu_hat) has an asymptotic chi^2_{k-r} distribution for eta in omega as m_i -> infinity, i = 1, ..., k, k < infinity (see Problem 6.4.13). If omega_0 is a q-dimensional linear subspace of omega with q < r, then we can form the LR statistic for H : eta in omega_0 versus K : eta in omega - omega_0,

2 log lambda = D(X, mu_hat_0) - D(X, mu_hat) = 2 sum_{i=1}^k [ X_i log(mu_hat_i/mu_hat_{0i}) + X_i' log(mu_hat_i'/mu_hat_{0i}') ],    (6.4.18)

where mu_hat_0 is the MLE of mu under H and mu_hat_{0i}' = m_i - mu_hat_{0i}; by Theorem 6.3.3, 2 log lambda has an asymptotic chi^2_{r-q} distribution as the m_i -> infinity. Here is a special case.

Example 6.4.4. The Binomial One-Way Layout. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of m_i patients and recording the number X_i of patients that recover. For a second example, suppose we want to compare k different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. We obtain k independent samples, one from each location, and for the ith location count the number X_i among m_i that have the given attribute. The samples are collected independently and we observe X_1, ..., X_k independent with X_i ~ B(m_i, pi_i). This model corresponds to the one-way layout of Section 6.1 and, as in that section, an important hypothesis is that the populations are homogeneous. Thus, we test H : pi_1 = pi_2 = ... = pi_k = pi, pi in (0, 1), versus the alternative that the pi's are not all equal. Under H the log likelihood in canonical exponential form is

beta T - N log(1 + exp{beta}) + sum_{i=1}^k log binom(m_i, X_i),

where T = sum_{i=1}^k X_i, N = sum_{i=1}^k m_i, and beta = log[pi/(1 - pi)]. It follows from Theorem 2.3.1 that if 0 < T < N the MLE exists and, using Z as given for the one-way layout, we find that the MLE of pi under H is pi_hat = T/N, so that mu_hat_{0i} = m_i pi_hat. The LR statistic is given by (6.4.18) with mu_hat_{0i} = m_i pi_hat. The Pearson statistic

chi^2 = sum_{i=1}^k (X_i - m_i pi_hat)^2 / (m_i pi_hat(1 - pi_hat))

is a Wald statistic, and the chi^2 test is equivalent asymptotically to the LR test (Problem 6.4.15).    0

Summary. In Sections 6.4.1 and 6.4.2 we used the large sample testing results of Section 6.3 to find tests for important statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's chi^2," which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H. When the hypothesis is that the multinomial parameter is in a q-dimensional subset of the (k-1)-dimensional parameter space Theta, the Rao statistic is again of the Pearson chi^2 form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, the Pearson statistic is shown to have a simple intuitive form. In Section 6.4.3 we considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates. We derive the likelihood equations, discuss algorithms for computing MLEs, and give the LR test. In the special case of testing equality of k binomial parameters, we give explicitly the MLEs and the chi^2 test.

6.5 GENERALIZED LINEAR MODELS

In Sections 6.1 and 6.4.3 we considered experiments in which the mean mu_i of a response Y_i is expressed as a function of a linear combination

Delta_i = sum_{j=1}^p z_ij beta_j

of covariate values. In particular, in the case of a Gaussian response, mu_i = Delta_i, whereas for the grouped binary responses of Section 6.4.3, mu_i = m_i pi_i with pi_i = g^{-1}(Delta_i), where g^{-1}(y) is the logistic distribution
3.\ for testing H : TJ E w versus K : TJ n .Wo 210g.410 Inference in the Mllltin::lr.1 linear subhypotheses are important. recall from Example 1. Theorem 5. The LR statistic 2 log .. .4.. fl) has asymptotically a X%r. If Wo is a qdimensional linear subspace of w with q < r. i = 1.)l {LOt (6.4. by Problem 6.13. i 1. .q distribution as mi + 00.. The Binomial OneWay Layout. As in Section 6.13.:IITlPt'pr Case 6 Because Z has rankp.4. then we can form the LR statistic for H : TJ E Wo versus K: TJ E w . it foHows (Problem 6.4.14) that {3o is consistent..12) k zT D(X.~.4.4. ji) measures the distance between the fit ji based on the model wand the data X. k. that is.\ has an asymptotic X.. In the present case.4. suppose we want to compare k different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. . The samples are collected indeB(1ri' mi). D(X. We want to contrast w to the case where there are no restrictions on TJ. distribution for TJ E w as mi + 00. Here is a special case. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of mi patients and recording the number Xi of patients that recover. To get expressions for the MLEs of 7r and JL.k. D(X.17) where X: = mi Xi and M~ mi Mi. and the MLEs of 1ri and {Li are Xdmi and Xi. Testing In analogy with Section 6.11) and (6.w is denoted by D(y.. We obtain k independent samples. {3 E RP} and let r be the dimension of w. .1. . one from each location. By the multivariate delta method. f"V . In this case the likelihood is a product of independent binomial densities.\=2t [Xi log i=1 (Ei. i = 1.1 (L:~=l Xij(3j).. Thus. where ji is the MLE of JL for TJ E w.4..flm.4. we set n = Rk and consider TJ E n. and for the ith location count the number Xi among mi that has the given attribute.IOg(E.1.) +X. we let w = {TJ : 'f]i = {3.6. For a second pendently and we observe Xl.8 that the inverse of the logit transform 9 is the logistic distribution function Thus.. k < oosee Problem 6. 2 log . the MLE of 1ri is 1ri = 9. Example 6. ji) 2 I)Xi 10g(Xdfli) + XIlog(XI!flDJ i=1 (6.18) {Lo~ where jio is the MLE of JL under H and fl~i = mi ."" Xk independent with Xi example. from (6. ji). .
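The homogeneity test for the binomial one-way layout is equally easy to compute. The following sketch (Python with NumPy; the group sizes and success counts are hypothetical) evaluates both the likelihood ratio statistic 2 log lambda and the Pearson statistic for H : pi_1 = ... = pi_k, each referred to the chi-square distribution with k - 1 degrees of freedom.

    import numpy as np

    # Hypothetical one-way binomial layout: X_i successes out of m_i at k = 4 locations.
    m = np.array([50., 60., 55., 45.])
    X = np.array([20., 31., 24., 15.])
    Xp = m - X                              # failures X_i'

    # Under H the common success probability is estimated by pi_hat = T/N.
    pi0 = X.sum() / m.sum()
    mu0, mu0p = m * pi0, m * (1.0 - pi0)

    # Likelihood ratio statistic 2 log lambda; the unrestricted fitted means are X_i and X_i'.
    lr = 2.0 * (np.sum(X * np.log(X / mu0)) + np.sum(Xp * np.log(Xp / mu0p)))

    # Pearson chi-square statistic.
    pearson = np.sum((X - m * pi0) ** 2 / (m * pi0 * (1.0 - pi0)))

    print("2 log lambda =", lr, "  Pearson X^2 =", pearson)
    # Both are compared with chi-square critical values on k - 1 = 3 degrees of freedom.

With counts of this size the two statistics are typically close, reflecting their asymptotic equivalence.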
1).1.Idimensional parameter space e. 7r E (0. we considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates. It follows from Theorem 2. In Section 6. . (3 = L Zij{3j j=l p of covariate values. the Wald and Rao statistics take a form called "Pearson's X2.Section 6. and (3 = 10g[7r1(1 .4.L (ml7r. and as in that section. In particular. and give the LR test.1 that if 0 < T < N the MLE exists and is the solution of (6.Li = ~i. When the hypothesis is that the multinomial parameter is in a qdimensional subset of the k . we give explicitly the MLEs and X2 test. Finally. versus the alternative that the 7r'S are not all equal. then J.3 we considered experiments in which the mean J. an important hypothesis is that the popUlations are homogenous.Li of a response is expressed as a function of a linear combination ~i =. z. The Pearson statistic 2 _ X  L k1 i=l (X "'(1) m'7r . Using Z as given in the oneway layout in Example 6. J. The LR statistic is given by (6. 6.4. We derive the likelihood equations.4.15). We used the large sample testing results of Section 6. We found that for testing the hypothesis that a multinomial parameter equals a specified value.4. where J.." which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H.3. Thus.3.3 to find tests for important statistical problems involving discrete data. Under H the log likelihood in canonical exponential form is {3T .7r)]. .4.mk7r)T.13).18) with JiOi mi7r. In the special case of testing equality of k binomial parameters.1. we find that the MLE of 7r under His 7r = TIN.7r ~ i "')2 mi 7r is a Wald statistic and the X2 test is equivalent asymptotically to the LR test (Problem 6. In the special case of testing independence of multinomial frequencies representing classifications in a twoway contingency table.. the Rao statistic is again of the Pearson X2 form.N log(1+ exp{{3}) + ~ log ( k ~: ) where T = 2::=1 Xi. in the case of a Gaussian response.3. the Pearson statistic is shown to have a simple intuitive form. if J. discuss algorithms for computing MLEs.5 Generalized Linear Models 411 This model corresponds to the oneway layout of Section 6. Summary.Li 7ri = gl(~i)' where gl(y) is the logistic distribution .5 GENERALIZED LINEAR MODELS Yi In Sections 6. we test H : 7rl = 7r2 7rk = 7r.1 and 6.Li = EYij. N 2::=1 mi.
412 Inference in the 1\III1IITin'~r:>rn"'T. the GLM is the canonical subfamily of the original exponential family generated by ZTy. such that g(fL) = L J'jZ(j) j=1 p Z{3 where Z(j) (Zlj.. . 1989) synthesized a number of previous generalizations of the linear model.r Case 6 function. in which case 9 is also called the link function. We assume that there is a onetoone transform g(fL) of fL. Znj)T is the jth column vector of Z. = L:~=1 Zij{3.5. These Ii are independent Gaussian with known variance 1 and means JLi The model is GLM with canonical g(fL) fL. g(JLn)) T. A (11) = Ao (TJi) for some A o.6. 11) given by (6. McCullagh and NeIder (1983.5. the mean fL of Y is related to 11 via fL = A(11).. .. Yn)T is n x 1 and ZJxn = (ZI.. most importantly the log linear model developed by Goodman and Haberman.. fL determines 11 and thereby Var(Y) A( 11). As we know from Corollary 1. See Haberman (1974). Y) where Y = (Yb .. j=1 p In this case. in which case JLi A~ (TJi).1.1) where 11 is not in E. The generalized linear model with dispersion depending only on the mean The data consist of an observation (Z. the identity. the natural parameter space of the nparameter canonical exponential family (6. Canonical links The most important case corresponds to the link being canonical. which is p x 1. More generally. (ii) Log linear models. but in a subset of E obtained by restricting TJi to be of the form where h is a known function. Special cases are: (i) The linear model with known variance..1). ."" zn) with Zi = (Zib"" Zip)T nonrandom and Y has density p(y. ••• . g = or 11 L J'j Z(j) = Z{3. that is. g(fL) is of the form (g(JL1) . . called the link junction. Typically. Typically. Note that if A is oneone.
NewtonRaphson coincides with Fisher's method of scoring described in Problem 6. 0 < (}i < 1 withcanonicallinkg(B) = 10g[B(1B)]. log J. "Ij = 10gBj. . .B 1.5. The coordinate ascent algorithm can be used to solve (6 . With a good starting point f3 0 one can achieve faster convergence with the NewtonRaphson algorithm of Section 2. as seen in Example 1. Then. . that procedure is just (6. 1 ::. .4. 1 ::. This isjust the logistic linear model of Section 6. 2:~=1 B = 1. Algorithms If the link is canonical.2) It's interesting to note that (6. j ::.2) can be interpreted geometrically in somewhat the same way as in the Gausian linear modelthe "residual" vector Y .4.1.8) (6. i ::. 1 ::..5 Generalized linear Models 413 Suppose (Y1. The log linear label is also attached to models obtained by taking the Yi independent Bernoulli ((}i). b. the .6. In this case. the link is canonical. ZTy or = ZT E13 y = ZT A(Z.3.1::. B+ j = 2:~=1 Bij .8) is orthogonal to the column space of Z. ... for example. 1f. j ::. Then () = IIBij II.1.7.t1. j::..Yp)T is M(n. a. is that of independence Bij = Bi+B+j where Bi+ = 2:~=1 Bij . so that Yij is the indicator of. If we take g(/1) = (log J.3) where In this situation and more generally even for noncanonical links. and the log linear model corresponding to log Bij = f3i + f3j. b. /1(.5.tp) T . that Y = IIYij 111:Si:Sa. j ::.8) is not a member of that space.. say.2) (or ascertain that no solution exists). Bj > 0. See Haberman (1974) for a further discussion.5./1(.Section 6. where f3i.80 ~ f3 0 . f3j are free (unidentifiable parameters). if maximum likelihood estimates they necessarily uniquely satisfy the equation f3 exist. classification i on characteristic 1 and j on characteristic 2. by Theorem 2. p. in general.5. But. 1 ::. p are canonical parameters. Suppose.3.. .Bp).5. The models we obtain are called log linearsee Haberman (1974) for an extensive treatment.
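As a second canonical-link example, the sketch below (Python with NumPy; the counts and the single covariate are made up) fits a Poisson log-linear regression by Newton-Raphson. At the maximum likelihood estimate the likelihood equations Z'Y = Z'mu_hat must hold, and the last lines check this numerically.

    import numpy as np

    # Hypothetical Poisson counts with the canonical (log) link: mu_i = exp(z_i' beta).
    Z = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.], [1., 5.]])
    Y = np.array([2., 3., 6., 8., 14., 21.])

    # Start from least squares on the log counts, then iterate Newton-Raphson.
    beta = np.linalg.lstsq(Z, np.log(Y), rcond=None)[0]
    for _ in range(50):
        mu = np.exp(Z @ beta)                   # A-dot(Z beta) for the Poisson family
        score = Z.T @ (Y - mu)
        info = Z.T @ (Z * mu[:, None])          # Z' diag(mu) Z, the Fisher information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break

    print("MLE:", beta)
    print("Z'Y      =", Z.T @ Y)
    print("Z'mu_hat =", Z.T @ np.exp(Z @ beta))  # the two agree at the MLE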
. iLl) where iLl is the MLE under WI. the deviance of Y to iLl and tl(iL o.5. iLo) == inf{D(Y.1](Y)) l(Y . Yn ) T (assume that Y is in the interior of the convex support of {y : p(y. We can think of the test statistic .5) is always ~ O.20) when the data are the residuals from the fit at stage m.13m' which satisfies the equation (6. 1]) > o}).1](lLa)] for the hypothesis that IL = lLa within M as a "measure" of (squared) distance between Y and lLo.1) for which p = n. Write 1](') for AI. This name stems from the following interpretation. iLo) . iLl) == D(Y. iLo) .5. If Wo C WI we can write (6. In that case the MLE of J1 is iL M = (Y1 . In this context. (6. the algorithm is also called iterated weighted least squares.4) That is. IL) : IL E wo} where iLo is the MLE of IL in Woo The LR statistic for H : IL E Wo versus K : IL E WI . Let ~m+l == 13m+1 .D(Y . This quantity.5.D(Y. the variance covariance matrix is W m and the regression is on the columns ofW mZProblem 6. The LR statistic for H : IL E Wo is just D(Y. Unfortunately tl =f. . ••• . the correction Am+ l is given by the weighted least squares formula (2. Testing in GLM Testing hypotheses in GLM is done via the LR statistic.414 Inference in the Multiparameter Case Chapter 6 true value of {3.6) a decomposition of the deviance between Y and iLo as the sum of two nonnegative components. 210gA = 2[l(Y.5. For the Gaussian linear model with known variance 0'5 (Problem 6.5. called the deviance between Y and lLo. D generally except in the Gaussian case. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6.2.Wo with WI ~ Wo is D(Y.5. iLl)' each of which can be thought of as a squared distance between their arguments. then with probability tending to 1 the algorithm converges to the MLE if it exists.2. as n ~ 00. As in the linear model we can define the biggest possible GLM M of the form (6.1.4).
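In the Gaussian case with known variance sigma_0^2 = 1 the deviance is just the residual sum of squares, so the analysis of deviance reduces to the analysis of variance. The short sketch below (Python with NumPy; the sample size, design, and coefficients are arbitrary, and the data are simulated) verifies the decomposition D(Y, mu_hat_0) = Delta(mu_hat_0, mu_hat_1) + D(Y, mu_hat_1) when omega_0 is the intercept-only submodel of a simple linear regression omega_1.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated Gaussian linear model with known variance 1; all settings are illustrative.
    n = 20
    z = np.linspace(-1.0, 1.0, n)
    Z1 = np.column_stack([np.ones(n), z])     # design for omega_1 (intercept + slope)
    Z0 = np.ones((n, 1))                      # design for omega_0 (intercept only)
    Y = 1.0 + 0.5 * z + rng.standard_normal(n)

    def fitted(Zm, Y):
        beta = np.linalg.lstsq(Zm, Y, rcond=None)[0]
        return Zm @ beta

    mu1, mu0 = fitted(Z1, Y), fitted(Z0, Y)
    D1 = np.sum((Y - mu1) ** 2)               # deviance between Y and mu_hat_1
    D0 = np.sum((Y - mu0) ** 2)               # deviance between Y and mu_hat_0
    delta = D0 - D1                           # LR statistic for H : mu in omega_0

    print("D(Y, mu0) =", D0, "  D(Y, mu1) + Delta =", D1 + delta)
    print("LR statistic Delta =", delta, " (chi-square, 1 df, under H)")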
Section 6 .. = f3d = 0. Thus. can be estimated by f = :EA(Z{3) where :E is the sample variance matrix of the covariates. Zn in the sample (ZI' Y 1 ). . and can conclude that the statistic of (6.1 . For instance. . y . if we assume the covariates in logistic regression with canonical link to be stochastic.5. asymptotically exists. which.5.q.2.. .5.2. However.3. . and 6.. This can be made precise for stochastic GLMs obtained by conditioning on ZI.0.. which we temporarily assume known. (3))Z.. . . Similar conclusions r '''''''. fil) is thought of as being asymptotically X.7) More details are discussed in what follows.5. . Asymptotic theory for estimates and tests If (ZI' Y 1 ).3). then (ZI' Y1 ). Ao(Z. (Zn' Y n ) can be viewed as a sample from a population and the link is canonical.3 hold (Problem 6. 1.2 and 6. is consistent with probability 1.9) What is ]1 ((3)? The efficient score function a~i logp(z . 6.2.3..10) is asymptotically X~ under H.8) This is not unconditionally an exponential family in view of the Ao (z. ..5. (3) term. f3p ). ••• . we can calculate i=l where {3 H is the (p xl) MLE for the GLM with {3~xl = (0.. (Zn' Y n ) has density (6. we obtain If we wish to test hypotheses such as H : 131 = . (3) is (Yi and so. then ll(fio. d < p.. there are easy conditions under which conditions of Theorems 6. if we take Zi as having marginal density qo. the theory of Sections 6.. . . f3d+l. so that the MLE {3 is unique. (Zn.5 Generalized Linear Models 415 Formally if Wo is a GLM of dimension p and WI of dimension q with canonical links. and (6.3 applies straightforwardly in view of the general smoothness properties of canonical exponential families. in order to obtain approximate confidence procedures.. . Y n ) from the family with density (6.
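To turn the estimated information into an approximate confidence procedure, one can invert Z'WZ at the maximum likelihood estimate. The sketch below (Python with NumPy) simulates Bernoulli responses with stochastic covariates from a logistic model; the sample size, the "true" beta used to generate the data, and the 1.96 normal quantile for a 95% interval are all illustrative choices.

    import numpy as np

    rng = np.random.default_rng(3)

    # Simulated i.i.d. (Z_i, Y_i) with Y_i | Z_i = z Bernoulli(1/(1 + exp(-z'beta))).
    n = 200
    beta_true = np.array([0.5, 1.0])
    Z = np.column_stack([np.ones(n), rng.standard_normal(n)])
    Y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(Z @ beta_true)))).astype(float)

    # Newton-Raphson for the MLE (canonical link, so observed = expected information).
    beta = np.zeros(2)
    for _ in range(50):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        score = Z.T @ (Y - p)
        info = Z.T @ (Z * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break

    # Approximate 95% Wald interval for the slope based on the inverse information.
    se = np.sqrt(np.linalg.inv(info)[1, 1])
    print("beta_hat =", beta)
    print("95% interval for the slope: (%.3f, %.3f)" % (beta[1] - 1.96 * se, beta[1] + 1.96 * se))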
for c( r) > 0. r). p(y. For further discussion of this generalization see McCullagp. and NeIder (1983.5. r) = exp{c. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6.13) (6. when it can. . As he points out.3. General link functions Links other than the canonical one can be of interest. From the point of analysis for fixed n. if in the binary data regression model of Section 6. Note that these tests can be carried out without knowing the density qo of Z1. which we postpone to Volume II. Because (6. we take g(JL) = <1>1 (JL) so that 7ri = <1> (zT!3) we obtain the socalled probit model.416 Inference in the Multiparameter Case Chapter 6 follow for the Wald and Rao statistics.A(11))}h(y.5.5.4.5.12) The lefthand side of (6.14) so that the variance can be written as the product of a function of the mean and a general dispersion parameter. JL ::. . Cox (1970) considers the variance stabilizing transformation which makes asymptotic analysis equivalent to that in the standard Gaussian linear model.1 ::. r)dy = 1. 11.9 are rather similar. It is customary to write the model as. 1989). (J2) and gamma (p.1) depend on an additional scalar parameter r. noncanonicallinks can cause numerical problems because the models are now curved rather than canonical exponential families.11) Jp(y.1 (r)(11T y . Important special cases are the N (JL. These conclusions remain valid for the usual situation in which the Zi are not random but their proof depends on asymptotic theory for independent nonidentically distributed variables. A) families. (6. The generalized linear model The GLMs considered so far force the variance of the response to be a function of its mean. the results of analyses with these various transformations over the range .10) is of product form A( 11) [1/ c( r)] whereas the righthand side cannot always be put in this form. then it is easy to see that E(Y) = A(11) Var(Y) = c(r)A(11) (6. 11. then A(11)/c(r) = log! exp{c1 (r)11T y}h(y. For instance. Existence of MLEs and convergence of algorithm questions all become more difficult and so canonical links tend to be preferred.5. However.5.r)dy.
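To compare a noncanonical link with the canonical one numerically, the short sketch below (Python; the grid of linear-predictor values is arbitrary) tabulates the success probabilities that the logit and probit links assign to the same linear predictor eta. The two links differ mainly in scale, which is one way to see why analyses based on different transforms tend to give similar answers over the middle of the probability range.

    import math
    import numpy as np

    # pi = logit^{-1}(eta) versus pi = Phi(eta), with Phi computed from the error function.
    def probit_inverse(eta):
        return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

    def logit_inverse(eta):
        return 1.0 / (1.0 + math.exp(-eta))

    for eta in np.linspace(-2.0, 2.0, 9):
        print("eta = %5.2f   logit^-1 = %.3f   probit^-1 = %.3f"
              % (eta, logit_inverse(eta), probit_inverse(eta)))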
. ____ . of a linear predictor of the form Ef3j Z(j). we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian. the GaussMarkov theorem. In Example 6.31) with fa symmetric about O.2. where the Z(j) are observable covariate vectors and (3 is a vector of regression coefficients. Furthermore.i.6. under further mild conditions on P. a 2 ) or.2. whose further discussion we postpone to Volume II.2.6. we use the asymptotic results of the previous sections to develop large sample estimation results. for estimating /lL(Z) in a submodel of P 3 with /3 (6. We considered the canonical link function that corresponds to the model in which the canonical exponential model parameter equals the linear predictor. y)T rv P that satisfies Ep(Y I Z) = ZT(3.Section 6. are discussed below. ".6 Robustness Pr. if we consider the semiparametric model. Another even more important set of questions having to do with selection between nested models of different dimension are touched on in Problem 6. then the LSE of (3 is. called the link function.2. There is another semiparametric model P 2 = {P : (ZT. the distributional and implicit structural assumptions of parametric models are often suspect. These issues. /3.i. the LSE is not the best estimate of (3. Ii 6.19) with Ei i.5). so is any estimate solving the equations based on (6.d.. EpZTZ nonsingular. is "act as ifthe model were the one given in Example 6.d. We found that if we assume the error distribution fa.19) even if the true errors are N(O.1. For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II. Y) has a joint distribution and if we are interested in estimating the best linear predictor /lL(Z) of Y given Z."n"..2. the resulting MLEs for (3 optimal under fa continue to estimate (3 as defined by (6. PI {P : (zT. in fact. have a fixed covariate exact counterpart.. We discussed algorithms for computing MLEs of In the random design case.2. roughly speaking. We considered generalized linear models defined as a canonical exponential model where the mean vector of the vector Y of responses can be written as a function.2. which is symmetric about 0. Y) rv P given by (6.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS As we most recently indicated in Example 6.8 but otherwise also postponed to Volume II. in fact.J . the right thing to do if one assumes the (Zi' Yi) are i. still a consistent asymptotically normal estimate of (3 and. confidence procedures.2. and tests. with density f for some f symmetric about O}. and Models 417 Summary. had any distribution symmetric around 0 (Problem 6. E p y2 < oo}.1) where E is independent of Z but ao(Z) is not constant and ao is assumed known.r+i". That is. if P3 is the nonparametric model where we assume only that (Z.2." Of course.
Let Ci stand for any estimate linear in Y1 . Var(Ei) + 2 Laiaj COV(Ei.1. Then.2) holds.1. Ifwe replace the Gaussian assumptions on the errors E1.2) where Z is an n x p matrix of constants. E(a) = E~l aiJLi Cov(Yi. Var(e) = a 2 (I .. the estimate = E~l aiJii has uniformly minimum variance among all unbiased estimates linear in Y 1 . moreover.H). Theorem 6. Example 6..d. .4 where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case.6.6. .an.14).1.p . .4...6). is UMVU in the class of linear estimates for all models with EY/ < 00. Var(jj) = a 2 (ZTZ)1. The preceding computation shows that VarcM(Ci) = Varc(Ci).3) are normal. . Because Varc = VarCM for all linear estimators.. Ej). € are n x 1. .6.. the result follows. See Problem 6. and by (B. Yj) = Cov( Ei. Instead we assume the GaussMarkov linear model where (6. and Y.1. .1. and ifp = r.'. .1. for any parameter of the form 0: = E~l aiJLi for some constants a1.1.i. One Sample (continued). in Example 6.2.1. Suppose the GaussMarkov linear model (6.Ej) = a La.Y . By Theorems 6. Moreover. and a is unbiased.3. . in addition to being UMVU in the normal case. Many of the properties stated for the Gaussian case carry over to the GaussMarkov case: Proposition 6. N(O. the conclusions (1) and (2) of Theorem 6.6.1 and the LSE in general still holds when they are compared to other linear estimates. n 2 VarcM(a) = La.En in the linear model (6. jj and il are still unbiased and Var(Ji) = a 2 H. 0 Note that the preceding result and proof are similar to Theorem 1.. {3 is p x 1.En are i.1.S. our current Ji coincides with the empirical plugin estimate of the optimal linear predictor (1..1. .' En with the GaussMarkov assumptions.3(iv). for . . The GaussMarkov theorem shows that Y.2(iv) and 6. 13i of Section 6. 418 Inference in the Multiparameter Case Chapter 6 Robustness in Estimation We drop the assumption that the errors E1. where Varc(Ci) stands for the variance computed under the Gaussian assumption that E1.4. n a Proof. Varc(Ci) for all unbiased Ci.Yn . n = 0:. The optimality of the estimates Jii. and Ji = 131 = Y.1. However. ( 2 ).1.4 are still valid. .. Because E( Ei) = 0. In fact.6. In Example 6. Varc(a) ::.. JL = f31. i=l i<j i=l where VarCM refers to the variance computed under the GaussMarkov assumptions.
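A quick way to see the Gauss-Markov conclusions at work without normality is simulation. In the sketch below (Python with NumPy; the design, the coefficients, the centered-exponential error law, and the number of replications are all arbitrary choices), the errors are markedly non-Gaussian but have mean zero, common variance, and are uncorrelated, and the Monte Carlo mean and covariance of the least squares estimate agree with beta and sigma^2 (Z'Z)^{-1}.

    import numpy as np

    rng = np.random.default_rng(1)

    # Gauss-Markov check with non-Gaussian errors (centered exponential: mean 0, variance 1).
    n = 30
    Z = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
    beta = np.array([2.0, -1.0])
    sigma2 = 1.0

    B = 20000
    estimates = np.empty((B, 2))
    for b in range(B):
        eps = rng.exponential(1.0, size=n) - 1.0
        Y = Z @ beta + eps
        estimates[b] = np.linalg.lstsq(Z, Y, rcond=None)[0]

    print("Monte Carlo mean of the LSE:", estimates.mean(axis=0))
    print("Monte Carlo covariance:\n", np.cov(estimates.T))
    print("sigma^2 (Z'Z)^{-1}:\n", sigma2 * np.linalg.inv(Z.T @ Z))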
2 we know that if '11(" 8) = Die.P) i= II (8 0 ) in general. consider H : 8 tics Tw = n(e ( 0 )T I(8 0 )(e ( 0 ).. the asymptotic behavior of the LR. For this density and all symmetric densities. 8) and Pis true we expect that 8n 8(P).4) and the Wald test statis8 0 evidently we have i= Tw But if 8(P) = 8 0 . 0..Ill}.ii(8 . Suppose we have a heteroscedastic 0.. y E R. From the theory developed in Section 6.4.6. P)).and twosample problems using asymptotic and Monte Carlo methods.6. There is an important special case where all is well: the linear model we have discussed in Example 6. Example 6.Section 6. 8)dP(x) ° = 80 (6..8(P)) ~ N(O.2 are UMVU. Y has a larger variance (Problem 6..ennip.2. 1 (Zn. which does not belong to the model.6. which is the unique solution of ! and w(x.6 Robustness "" .6. as (Z.5) .3) .5) than the nonlinear estimate Y median when the density of Y is the Laplace density 1 2A exp{ Aly . when the not optimal even in the class of linear estimates.. the asymptotic distribution of Tw is not X.6. Wald and Rao tests depends critically on the asymptotic behavior of the underlying MLEs 8 and 80 .. ~('11.6.. and . Suppose (ZI' Yd..ti".d.2. The Linear Model with Stochastic Covariates. a major weakness of the GaussMarkov theorem is that it only applies to linear estimates. 00. aT Robustness of Tests In Section 5. then Tw ~ VT I(8 0 )V where V '" N(O.:. Another weakness is that it only applies to the homoscedastic case where Var(1'i) is the same for all i.1..2. 0. Y) where Z is a (p x 1) vector of random covariates and we model the relationship between Z and Y as (6. Now we will use asymptotic methods to investigate the robustness of levels more generally.6. but Var( Ei) depends on i. If 8(P) (6.1. However. we can use the GaussMarkov theorem to conclude that the weighted least squares are unknown. Y n ) are d.6). for instance. The 6method implies that if the observations Xi come from a distribution P.2. As seen in Example 6. ~(w. This observation holds for the LR and Rao tests as wellbut see Problem 6. P)) with :E given by (6. Because ~ ('lI . ."n". A > O. . Il E R.:lrarnetlric Models 419 sample n large. Y is unbiased (Problem 3. Remark 6. our estimates are estimates of Section 2.12).6. Thus.4.3 we investigated the robustness of the significance levels of t tests for the one. If the are version of the linear model where E( €) known.
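The limitation of restricting attention to linear estimates is easy to exhibit numerically. The sketch below (Python with NumPy; the scale, the sample size, and the number of replications are arbitrary) simulates samples from a Laplace (double-exponential) distribution with scale 1 and compares the Monte Carlo variances of the sample mean and the sample median; the mean's variance is about 2/n while the median's is about 1/n.

    import numpy as np

    rng = np.random.default_rng(2)

    # Laplace samples with scale 1: Var(Y) = 2, so Var(mean) is about 2/n,
    # while the sample median has asymptotic variance about 1/n.
    n, B = 100, 20000
    means = np.empty(B)
    medians = np.empty(B)
    for b in range(B):
        y = rng.laplace(loc=0.0, scale=1.0, size=n)
        means[b] = y.mean()
        medians[b] = np.median(y)

    print("variance of the mean   :", means.var())    # about 0.02
    print("variance of the median :", medians.var())  # about 0.01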
Thus.6.. under H : (3 = 0.q+ 1. (see Examples 6. by the law of large numbers.r Case 6 with the distribution P of (Z.. It is intimately linked to the fact that even though the parametric model on which the test was based is false. procedures based on approximating Var(.2) the LR. Then. and we consider. !ltin:::.6.r::nn". (12).. Zen) S (Zb . n lZT Z (n) (n) so that the confidence ~ ~ ZiZ'f n ~ 1 (6. X.5) has asymptotic level a even if the errors are not Gaussian.Yi) = IY n .7. For instance.2...np(1 . even if E is not Gaussian.5) and (12(p) = Varp(E). (Jp (Jo. This kind of robustness holds for H : (Jq+ 1 (Jo.1 that (6.np(l a) . say.. EpE 0. .6. . hence. the test (6. (i) the set of P satisfying the hypothesis remains the same and (ii) the (asymptotic) variance of (3 is the same as under the Gaussian model. E p E2 < 00.2. . .30) then (3(P) specified by (6. it is still true by Theorem 6.  2 Now.3. Wald.6) (3 is the LSE.P i=l n p  Z(n)(31 .6.t xp(l a) by Example 5.420 Inference in the M .. when E is N(O.9) i=l n1Zfn)Z(n)S2 / Vii are still asymptotically of correct level.6.6.7) and (6.Zn)T and  2 1 ~ " 2 1 ~(Yi ..B) by where W pxp is Gaussian with mean 0 and. It follows by Slutsky's theorem that.. it is still true that if qi is given by (6. the limiting distribution of is Because !p.8) Moreover.2.p or more generally (3 E £0 + (30' a qdimensional affine subspace of RP.6. H : (3 = O. and Rao tests all are equivalent to the F test: Reject if Tn = (3 Z(n)Z(n)(3/ S 2 !p.t.1 and 6.2. Y) such that Eand Z are independent.3) equals (3 in (6.a) where _ ""T T  2 (6.
the lifetimes A2 the standard Wald test (6.6..6.6.2... the theory goes wrong..3.5) are still meaningful but (and Z are dependent.3. . Suppose without loss of generality that Var( (') = L Then (6.3. However suppose now that X(I) 101 and X(2) 102 are identically distributed but not exponential. under H.. Suppose E(E I Z) = 0 so that the parameters f3 of (6. XU) and X(2) are identically distributed.6. But the test (6. That is.15) . i=1 (6.6..6. let the distribution of ( given Z = z be that of a(z)(' where (' is independent of Z. To see this. If we take H : Al .10) does not have asymptotic level a in general. note that.2 clearly hold. V2 VarpX{I) (~2) Xz in general.6. If it holds but the second fails. The TwoSample Scale Problem. X(2) are independent E of paired pieces of equipment. then it's not clear what H : (J = (Jo means anymore.1 Vn(~ where (3) + N(O.6.. (6.14) o Example 6.7) fails and in fact by Theorem 6.10) nL and ~ ~ X(j) 2 (6. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general. Q) (6.17) is Suppose our model is that X "Reject iff n (0 1 where &2 2 O) > x 1 (1  .) respectively. (X(I).6 Robustness Properties and Semiparametric Models 421 If the first of these conditions fails. Simply replace (12 by (J ~2 = ~ ~ { ( ~1) n~ 2=1 Xz _ (X(I) + X(2»))2 2 + _ (X(1) + X(2»))2} 2 .Section 6. (6. For simplicity.13) but & V2Ep(X(I») ::f. 2  (th).12) The conditions of Theorem 6. Then H is still meaningful..6.4. The Linear Model with Stochastic Covariates with E and Z Dependent. Example 6. We illustrate with two final examples.11) i=1 &= v 2n ! t(xP) + X?»). X(2») where X(I).6. we assume variances are heteroscedastic.a)" (6. E (i.6.
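For the heteroscedastic linear model just described, a correction of the "sandwich" type discussed below is available; the specific formula used in the sketch is the standard heteroscedasticity-consistent form for least squares, (Z'Z)^{-1} (sum_i e_i^2 z_i z_i') (Z'Z)^{-1} with residuals e_i, spelled out here for illustration rather than quoted from this section. The code (Python with NumPy; all settings are invented) compares it with the classical estimate s^2 (Z'Z)^{-1} when the error scale depends on the covariate.

    import numpy as np

    rng = np.random.default_rng(5)

    # Heteroscedastic linear model: the error scale depends on the covariate.
    n = 500
    z = rng.uniform(-1.0, 1.0, size=n)
    Z = np.column_stack([np.ones(n), z])
    beta = np.array([1.0, 2.0])
    eps = (0.5 + 2.0 * np.abs(z)) * rng.standard_normal(n)
    Y = Z @ beta + eps

    betahat = np.linalg.lstsq(Z, Y, rcond=None)[0]
    e = Y - Z @ betahat
    ZtZinv = np.linalg.inv(Z.T @ Z)

    # Classical estimate s^2 (Z'Z)^{-1} versus the sandwich-type estimate.
    classical = (e @ e) / (n - 2) * ZtZinv
    sandwich = ZtZinv @ (Z.T @ (Z * (e ** 2)[:, None])) @ ZtZinv

    print("classical variance estimate:\n", classical)
    print("sandwich-type variance estimate:\n", sandwich)

Only the second estimate remains consistent for the actual variance of the least squares estimate when the error scale varies with z.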
the test (6. .. In this case. (52) are still reasonable when the true error distribution is not Gaussian.11 .422 Inference in the Multiparameter Case Chapter 6 (Problem 6. and are uncorrelated.Id .6.6.A1.6) and. use Theorem 3.. the confidence procedures derived from the normal model of Section 6.6.6) by Q1.Ad)T where (I1.11) with both 11 and (52 unknown. d.1. in general. Show that in the canonical exponential model (6. LR. ..1 are still approximately valid as are the LR. then the MLE and LR procedures for a specific model will fail asymptotically in the wider model. (5 . N(O. Wald.JL • ""n 2. in general. For the canonical linear Gaussian model with (52 unknown. Specialize to the case Z = (1.9. ""2 _ ""12 (TJ1. .. This is the stochastic version of the dsample model of Example 6.4.. We also demonstrated that when either the hypothesis H or the variance of the MLE is not preserved when going to the wider model. and Rao tests need to be modified to continue to be valid asymptotically. .6) does not have the correct level.16) Summary. = {3d = 0 fail to have correct asymptotic levels in general unless (52 (Z) is constant or A1 = . . 0 To summarize: If hypotheses remain meaningful when the model is false then.. (i) the MLE does not exist if n = r.fiT) o:2)T of U 2)T . .d.6. then the MLE (iil. In the linear model with a random design matrix. n 1 L. Wald..3 to compute the information lower bound on the variance of an unbiased estimator of (52.. 1967). Compare this bound to the variance of 8 2 • .. 1 ::. A solution is to replace Z(n)Z(n)/ 8 2 in (6. (52) assumption on the errors is replaced by the assumption that the errors have mean zero.6. For d = 2 above this is just the asymptotic solution of the BehrensFisher problem discussed in Section 4. The simplest solution at least for Wald tests is to use as an estimate of [Var y'n8]1 not 1(8) or ~D2ln(8) but the socalled "sandwich estimate" (Huber. Ur.Ad) distribution.7 PROBLEMS AND COMPLEMENTS Problems for Section 6. In particular. the twosample problem with unequal sample sizes and variances. N(O. 0 < Aj < 1. TJr.4.. 6.d. the methods need adjustment.6). . we showed that the MLEs and tests generated by the model where the errors are U. . (6.i. j ::. . . (ii) if n 2: r + 1. and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.. identical variances. . The GaussMarkov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i. We considered the behavior of estimates and tests when the model that generated them does not hold. where Qis a consistent estimate of Q.. In partlCU Iar.. provided we restrict the class of estimates to linear functions of Y1 .. and Rao tests.1 1.3. It is easy to see that our tests of H : {32 = . . . (5 2)T·IS (U1. . = Ad = 1/ d (Problem 6. Yn .1.Id) has a multinomial (A1.. ••• .n l1Y ."i=r+ I i .
i = 1. 1 n f1.29). see Problem 2.1.4. . . . .. Here the empirical plugin estimate is based on U. Yl). {Ly = /31. Var(Uf) = 20. 0. fO 0...(_C)i)2 l+c L i=l (a) Show that Bis the weighted least squares estimate of (). (d) Show that Var(O) ~ Var(Y). .. fn where ei = ceil + fi.2 known case with 0.Section 6."" (Z~. Yn ) where Z.4 • 3. Derive the formula (6...1. zi = (Zi21"" Zip)T. = (Zi2. Consider the model (see Example 1.22. 9. n.28) for the noncentrality parameter ()2 in the regression example.d. J =~ i=O L)) c i (1. and the 1. then B the MLE of ().2 coincides with the likelihood ratio statistic A(Y) for the 0. 1 {LLn)T.. of 7. .2 ). t'V (b) Show that if ei N(O..i. Suppose that Yi satisfies the following model Yi ei = () + fi.Li = {Ly+(zi {Lz)f3. 4. .2 ). where a.a)% confidence interval for /31 in the Gaussian linear model is Z2)2 ] . i = 1. (Zi. is (c) Show that Y and Bare unbiased..1.. .1] is a known constant and are i. Derive the formula (6.. and (3 and {Lz are as defined in (1. Find the MLE B ().. (e) Show that Var(B) < Var(Y) unless c = O. c E [0.29) for the noncentrality parameter 82 in the oneway layout.2 . . Let Yi denote the response of a subject at time i.7 Problems and Complements 423 Hint: By A. n.'''' Zip)T. i eo = 0 (the €i are called moving average errors. Show that Ii of Example 6. 5. . .n ~ where fi can be written as fi = ceil + ei for given constant c satisfying 0 ~ c are independent identically distributed with mean zero and variance 0. i = 1.n. i = 1. the 100(1 .2. Let 1.l+c (_c)j+l)/~ (1. 8... where IJ. . .5) Yi () + ei. Show that in the regression example with p = r = 2. N(O.2 with p = r coincides with the empirical plugin estimate of JLL = ({LL1. Show that >:(Y) defined in Remark 6.2 replaced by 0: 2 • 6.1.1.13.d.14). 0.
. ••. We want to predict the value of a 14. Consider a covariate x.p (1 . . n2 = . Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6. .a) confidence intervals for linear functions of the form {3j .p (~a) (n . a 2 ::. ::.{3i are given by = ~(f. Y ::. (b) Find a level (1 a) prediction interval for Y (i. (n .2)2 a and 8k in the oneway layout Var(8k ) . (c) If n is fixed and divisible by 2(p . .Z...a)% confidence intervals for a and 6k • 12. Often a treatment that is beneficial in small doses is harmful in large doses. In the oneway layout.1.2)(Zj2 . Yn ) such that P[t.. . Let Y 1 .• statistics HYl. then Var(8k ) is minimized by choosing ni = n/2.p)S2 /x n . T2 1 (1/ 2n) L.":=::...1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::.:1 Yi + (3/ 2n) L~= ~ n+ 1 }i. 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13..2. (b) Find confidence intervals for 'lj. ~ 1 .Z.e . (a) Find a level (1 . = np n/2(p 1). which is the amount ..p)S2 /x n . Note that Y is independent of Yl. then the hat matrix H = (h ij ) is given by 1 n 11.1 and variance a 2 . l(Yh .~ L~=1 nk = (p~:)2 + Lk#i . Yn . (a) Show that level (1 . 15.k)' C (b) If n is fixed and divisible by p. 'lj. .• 424 10. Assume the linear regression model with p future observation Y to be taken at the pont z.I). Yn ).2) L(Zi2 . = 2(Y . Consider the three estimates T1 = and T3 Y. where n is even. Show that for the estimates (a) Var(a) = +==_=:. then Var( a) is minimized by choosing ni = n / p... The following model is useful in such situations. . (Zi2 Z.a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z.. Yn be a sample from a population with mean f..2). (d) Give the 100(1 . ~(~ + t33) t31' r = 2.a).
which is yield or production. Hint: Because C' is of rank r. xC' = 0 => x = 0 for any rvector x.r2) (x).. 7 2 ) does not exist so that A6 doesn't hold.emEmts 425 . Let 8. and Q = P. then the r x r matrix C'C is of rank r and.0289. n are independent N(O. .1) + 2<t'CJL. fi3 and level 0. ( 2 ).0'2) is the N(Il'. fli) where fi1.2. X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi . r :::.1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL. fi2.A6 when 8 = ({L.77 167. a 2 > O}. Show that if C is an n x r matrix of rank r. ( 2 ) logp(x.. P. 16. and a response variable Y. 1 2 where <t'CJL. (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL.2 1.I::llogxi' You may use X = 0. (b) Show that the MLE of ({L.0'2)(x ) + E<t'(JL.7 Probtems and Lornol. Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133.• . . E :::. 17. But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O. In the Gaussian linear model show that the parametrization ({3. p(x. ( 2 ). 8).95 Yi = e131 e132xi Xf3. and p be as in Problem 6. En Xi.. Yi) and (Xi. Check AO. p.r2 )l 1 2 7 > O. i = 1. 653). Problems for Section 6.. Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1. •. P = {N({L. (32. ( 2 ) density.5 Zi2 * + (33zi2 where Zil = Xi  X. 8) = 2. : {L E R. (b) Plot (Xi. Y (yield) 3. ( 2 )T is identifiable if and only if r = p. (33. hence. . 1952. compute confidence intervals for (31.or dose of a treatment. . n. a 2 <7 2 . . Assume the model = (31 + (32 X i + (33 log Xi + Ei. (a) For the following data (from Hald. nonsingular.Section 6.
n[z(~)~en)jl)' X. show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic..2 hold if (i) and (ii) hold.)Z(n)(fj . given Z(n). 5.2.2.  (3)T)T has a (". e Euclidean.2.24). Hint: !x(x) = !YIZ(Y)!z(z). and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ).20).'. iT 2 ) are conditional MLEs.2.2. (I) Combine (aHe) to establish (6. '".24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then. T(X) = Zen). H E H}.(b) The MLE minimizes .1. Z i ~O. H abstract.2.21). /1.i > 1. Y). (c) Apply Slutsky's theorem to conclude that and.2.2.fii(ji .  I .". . (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4. b . In Example 6.1. Hint: (a). Establish (6. The MLE of based on (X. then ({3.2. show that the assumptions of Theorem 6.1/ 2 ).426 Inference in the Multiparameter Case Chapter 6 I ._p distribution.1 show thatMLEs of (3. (fj multivariate nannal distribution with mean 0 and variance...I3). 6. . X .Pe"H). 8. (6.10 that the estimate Raphson estimates from On is efficient.2 are as given in (6. . In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H. In Example 6.2. i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric. that (d) (fj  (3)TZ'f... hence. Show that if we identify X = (z(n). I I . (P("H) : e E e. (e) Show that 0:2 is unconditionally independent of (ji.2. ell of (} ~ = (Ji. In Example 6. show that ([EZZ T jl )(1. .2. (a) In Example 6. (e) Construct a method of moment estimate moments which are ~ consistent. 3.. T 2 ) based on the first two I (d) Deduce from Problem 6.. t) is then called a conditional MLE.IY .(3) = op(n.1) EZ . I .2.Zen) (31 2 e ~ 7. and . Fill in the details of the proof of Theorem 6. .
(ii) gn(Oo) ~ g(Oo).l exists and uniquely ~ .'2:7 1 'l'(Xi. and f(x) = e..(XiI') _O.. the <ball about 0 0 .1/ 2 ). (iv) Dg(O) is continuous at 0 0 . Let Y1 •. p is strictly convex. . ao (}2 = (b) Write 8 1 uniquely solves 1 8.p(Xi'O~) n. Him: You may use a uniform version of the inverse function theorem: If gn : Rd j. . that is.7 Problems and Complements 427 9. a > 0 are unknown and ( has known density f > 0 such that if p(x) log f(x) then p" > 0 and.2. (b) Show that under AGA4 there exists E > 0 such that with probability tending to 1. (iii) Dg(0 0 ) is nonsingular. p i=l J.•. = J. .0 0 ).L'l'(Xi'O~) n.l + aCi where fl. R d are such that: (i) sup{jDgn(O) .0 0 1 < <} ~ 0.O) has auniqueOin S(Oo. Hint: n .l real be independent identically distributed Y. Y. (al Let ii n be the first iterate of the NewtonRapbson algorithm for solving (6.LD. ] t=l 1 n .' .Section 6. (logistic).LD.Dg(O)j .1) starting at 0.3). Examples are f Gaussian. hence. 8~ = 1 80 + Op(n.Oo) i=cl (1 .x (1 + C X ) . 10 .2.L 'l'(Xi'O~) n i=l 1 1 =n n L 'l'(Xi.p(Xi'O~) + op(1) n i=l n ) (O~ .. <). (a) Show that if solves (Y ao is assumed known a unique MLE for L. Suppose AGA4 hold and 8~ is vn consistent. Show that if BB a unique MLE for B1 exists and 10. a' . E R.1' _ On = O~  [ . t=l 1 tt Show that On satisfies (6..
.2. f.27).. 0 < ZI < . Zi. q(. = versus K : O oF 0..3.(Z" .i. •.Zi )13.3. t• where {31.d. 'I) : 'I E 3 0 } and. log Ai = ElI + (hZi. Inference in the Multiparameter Case Chapter 6 > O. Show that if 3 0 = {'I E 3 : '1j = 0.. iteration of the NewtonRaphson algorithm starting at 0. Thus. Zip)) + L j=2 p "YjZij + €i J ./3)' = L i=1 1 CjZiU) n p .ip range freely and Ei are i. Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution.~_. and Rao tests for testing H : Ih.12). I '" LJ i=l ~(1) Y. i=l • Z. 2 ° 2. . < Zn 1 for given covariate values ZI. 'I) = p(. Establish (6.O E e. Problems for Section 6. .. hence. Hint: Write n J • • • i I • DY.2 holds for eo as given in (6.. Y.. Similarly compute the infonnation matrix when the model is written as Y. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X. 1 Yn are independent Poisson variables with Yi . . .. P(Ai). a 2 ). . {Vj} are orthonormal. and ~ . Suppose that Wo is given by (6. converges to the unique root On described in (b) and that On satisfies 1 1 • j. N(O... that Theorem 6. . minimizing I:~ 1 (Ii 2 i.. then it converges to that solution.3. i . .(j) l.2. )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3. .. CjIJtlZ.' " 13. Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj. there exists a j and their image contains a ball S(g(Oo). Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n . II.428 then.3 i 1. .2. 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}.3). •.1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1. 1 Zw Find the asymptotic likelihood ratio. i2. Suppose responses YI . • (6.l hold for p(x.12) and the assumptions of Theorem (6. > 0 such that gil are 1 .26) and (6. P I . for "II sufficiently large.6).O). ~. Wald.II(Zil I Zi2.... .3.(Z"  ~(') z. = 131 (Zil . LJ j=2 Differentiate with respect to {31.
. > 0 or IJz > O. pXd . B). .. Assume that Pel of Pea' and that for some b > 0.o log (X 0) PI. . p .'1) and Tin '1(lJ n) and. = ~ 4. . S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo). q + 1 <j< rl. Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo.'1 8 0 = and. {IJ E S(lJo): ~j(lJ) = 0. .d. (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0).Section 6. .. Deduce that Theorem 6. OJ}.n 7J(Bo.l. aTB.Xn ) be the likelihood ratio statistic.7 Problems and C:com"p"":c'm"'. N(IJ" 1). . IJ. 0 ») is nK(Oo. 1 < i < n. N(Oz.n:c"' 4c:2=9 3..d.2. respectively.3. 'I) S(lJo)} then... Suppose that Bo E (:)0 and the conditions of Theorem 6. Oil and p(X"Oo) = E.(X" . 1). Show that under H.gr.IJ) where q(.'" . be ii. . even if ~ b = 0.. Oil + Jria(00 .2.'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio.. hence. =  p(·.(I(X i . IJ. X n Li.) 1(Xi .. with density p(. .aq are orthogonal to the linear span of ~(1J0) . Consider testing H : (j = 00 versus K : B = B1 . (b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ .X. \arO whereal.. Suppose 8)" > 0... 210g.. Testing Simple versus Simple. Consider testing H : 8 1 = 82 = 0 versus K : 0.3. independent. (Adjoin to 9q+l. There exists an open ball about 1J0. Xi. q + 1 <j < r S(lJo) ..n) satisfy the conditions of Problem 6. j = 1.(X1 . Let e = {B o .) O. Let (Xi. Vi).) where k(Bo.3 is valid. . < 00.3 hold.3.) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('. q('. 1 5. (ii) E" (a) Let ). with Xi and Y.
. < .li : 1 <i< n) has a null distribution.\(X).. 2 e( y'n(ii. XI and X~ but with probabilities ~ .) under the model and show that (a) If 0. > 0. > 0. Po) distribution and (Xi. Hint: By sufficiency reduce to n = 1. /:1.\(Xi. for instance. .. = v'1~c2' 0 <. 0. < cIJ"O.2 for 210g. O > 0.0. Z 2 = poX1Y v' .0"10' O"~o. Exhibit the null distribution of 2 log . and Dykstra. > OJ..1. Yi : l<i<n).1. which is a mixture of point mass at O. 1) with probability ~ and U with the same distribution. 7. Yi : 1 and X~ with probabilities ~.430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n. Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson. Then xi t. (b) If 0.\( Xi. ±' < i < n) is distributed as a respectively. are as above with the same hypothesis but = {(£It. 1 < i < n. Wright.~. j ~ ~ 6.6. ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0. (d) Relate the result of (b) to the result of Problem 4(a).6.3. = a + BB where . ' · H In!: Consl'defa1O . = 0. 2 log >'( Xi. . Let Bi l 82 > 0 and H be as above. (h) : 0 < 0.0.19) holds. be i. =0. 0.) if 0.3. Sucb restrictions are natural if..0. (b) Suppose Xil Yi e 4 (e) Let (X" Y. I . Show that (6..) have an N. Hint: ~ (i) Show that liOn) can be replaced by 1(0). ii.i. under H. (0.. Show that 2log. and ~ where sin.  OJ. (iii) Reparametrize as in Theorem 6. Po .0.2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6. = (T20 = 1 andZ 1= X 1.=0 where U ~ N(O.3.0). we test the efficacy of a treatment on the basis of two correlated responses per individual. • i . li).)) ~ N(O.d. mixture of point mass at 0. In the model of Problem 5(a) compute the MLE (0. (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular.  0. 1988).
Show that under A2.8) for Z. Show that under AOA5 and A6 for 8 11 where ~(lIo) is given by (6.10. (a) Show that the correlation of Xl and YI is p = peA n B) . (a) Show that for any 2 x 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero.2 to lin . 3. hence. A3.P(B))· (b) Show that the sample correlation coefficient r studied in Example 5. 0 < P( A) < 1. is of the fonn (b) Deduce that X' ~ Z' where Z is given by (6. .0 < PCB) < 1.nf.4.Section 6.8) by Z ~ .21).4) explicitly and find the one that corresponds to the maximizer of the likelihood. then Z has a limitingN(011) distribution. 1.4 ~ 1(11) is continuous. 2. (e) Conclude that if A and B are independent. 9..1(0 0 ).4.3.8) (e) Derive the alternative form (6. Exhibit the two solutions of (6.2. (b) (6.P(A)P(B) JP(A)(1 .6 is related to Z of (6.7 Problems and Complements ~(1 ) 431 8.3. In the 2 x 2 contingency table model let Xi = 1 or 0 according as the ith individual sampled is an A or A and Yi = 1 or 0 according as the ith individual sampled is a Born.4.3.22) is a consistent estimate of ~l(lIo).4. 10. Hint: Argue as in Problem 5. Under conditions AOA6 for (a) and AOA6 with A6 for (a) ~(1 ) i!~1) for (b) establish that [~D2ln(en)]1 is a consistent estimate of 1.3. A611 Problems for Section 6. Hint: Write and apply Theorem 6.P(A))P(B)(1 .
1 .D.6 that H is true. 6.4.2 is 1t(Ci.. . . a .u. B!c1b!. R 2 = T2 = n .~l.4 deduce that jf j(o:) (depending on chosen so that Tl... . = Lj N'j. 8l! / (8 1l + 812 )).)). Fisher's Exact Test From the result of Problem 6.. N ll and N 21 are independent 8( r. Consider the hypothesis H : Oij = TJil TJj2 for all i.9) and has approximately a X1al)(bl) distribution under H. 8 21 . 8(r2' 82 I/ (8" + 8..54). TJj2 are given by TJil ~ = .1. Suppose in Problem 6. . (a) Show that the maximum likelihood estimates of TJil. . (a) Show that then P[N'j niji i = 1.2. (b) Sbow that 812 /(8l! ° + 812 ) ~ 821 /(8 21 + 822 ) iff R 1 and C 1 are independent. 1 : X 2 (b) How would you. (22 ) as in the contingency table.ra) nab . . in principle. nab) A ) = B. (a) Let (NIl. This is known as Fisher's exact test.TI.. j = 1. 7. It may be shown (see Volume IT) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5. n R.. 811 \ 12 .~. . TJj2 = 2::: a x b contingency table with associated probabilities Bij and 1 Oij. C j = L' N'j. n a2 ) . TJj2..4... use this result to construct a test of H similar to the test with probability of type I error independent of TJil' TJj2? 1 . are the multinomial coefficients. . ( where ( . Ti) (the hypergeometric distribution).. CI.C. j. . . I (c) Show that under independence the conditional distribution of N ii given R. n...~~.4. i = 1. N 22 ) rv M (u.. . S.. C i = Ci. Cj = Cj] : ( nll. n) can be then the test that rejects (conditionally on R I = TI' C 1 = GI) if N ll > j(a) is exact level o.b1only. j = 1.~~. ( rl. (b) Deduce that Pearson's X2 is given by (6.1 II . i = 1. N 12 • N 21 . . b I Ri ( = Ti.. 432 Inference in the Multiparameter Case Chapter 6 i 4... nal ) n12. Let N ij be the entries of an let 1Jil = E~=l (}ij. Hint: (a) Consider the likelihood as a function of TJil. Let R i = Nil + N i2 • Ci = Nii + N 2i · Show that given R 1 = TI. .. TJj2 ~ = Cj n where R. =T 1.
The following table gives the number of applicants to the graduate program of a small department of the University of California.5? Admit Deny Men Women 1 19 . C2). if and only if. B INDEPENDENT) (iii) PeA (C is the complement of C. Give pvalues for the three cases. = 0 in the logistic model.05 level (a) using the X2 test with approximate critical value? (b) using Fisher's exact test of Problem 6. and petfonn the same test on the resulting table.ziNi > k. Zi not all equal.12 I .0.4. for suitable a. 2:f . where Pp~ [2:f .ZiNi > k] = a. Would you accept Or reject the hypothesis of independence at the 0.93'1 215 103 69 172 Deny 225 162 n=387 (d) Relate your results to the phenomenon discussed in (a).14). (i) p(AnB (ii) I C) = PeA I C)P(B I C) (A.) Show that (i) and (ii) imply (iii). classified by sex and admission status. Suppose that we know that {3. (e) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. Test in both cases whether the events [being a man] and [being admitted] are independent. (h) Construct an experiment and three events for which (i) and (ii) hold. . but (iii) does not. 10. ~i = {3.. Show that. and that under H.Section 6. B. N 22 is conditionally distributed 1t(r2. consider the assertions.7 Problems and Complements 433 8. + IhZi. Admit Men Women 1 235 1~35' 38 7 273 42 n = 315 Deny Admit 270 45 Men Women I 122 1'. there is a UMP level a test.BINDEPENDENTGIVENC) n B) = P(A)P(B) (A.BINDEPENDENTGNENC) p(AnB I C) ~ peA I C)P(B I C) (A. n. 5 Hint: (b) It is easier to work with N 22 • Argue that the Fisher test is equivalent to rejecting H if N 22 > q2 + n .4.(rl + cI). Then combine the two tables into one.(rl + cd or N 22 < ql + n . (b). (a) If A. 9. and that we wish to test H : Ih < f3E versus K : Ih > f3E. Establish (6.. 11. which rejects. C are three events.1""".. if A and C are independent or B and C are independent.
Ld.20) for the regression described after (6. < f3g in this (c) By conditioning on L~ 1 Xi and using the approach of Problem 6. Given an initial value ()o define iterates  Om+l . Tn.. or Oi = Bio (known) i = 1. but Var.. " lOk vary freely.4. in the logistic regression model. 1 .. .4. (Zn. mk ~ 00 and H : fJ E Wo is true then the law of the statistic of (6. i Problems for Section 6.4. 1 i Show that for GLM this method coincides with the NewtonRaphson method of Section 2. = rn I < /3g and show that it agrees with the test of (b) Suppose that 131 is unknown. ~ 8 m + [1(8 m )Dl(8 m ).2. Verify that (6.) ~ B(m. .3) l:::~ I (Xi . .OkO) under H.18) tends 2 to Xrq' Hint: (Xi . a 2 = 0"5 is of the form: Reject if (1/0". This is an approximation (for large k.. .5.. 2..4). which is valid for the i.4. n) and simplification of a model under which (N1. Let Xl.00 or have Eo(Nd : .. .. 3. for example. case. Nk) ~ M(n.. a under H. Compute the Rao test statistic for H : (32 case. a5 j J j 1 J I .Oio)("Cooked data"). . but under K may be either multinomial with 0 #.1 14.d. . with (Xi I Z. 13.i. . then 130 as defined by (6. . .OiO)2 > k 2 or < k}. . Suppose that (Z" Yj)..d. . i 16. . 434 • Inference in the Multiparameter Case Chapter 6 f • 12..LJ2 Z i )).5. I. asymptotically N(O..(Ni ) < nOiO(1 . Use this to imitate the argument of Theorem 6. X k be independent Xi '" N (Oi...8) and.. k and a 2 is unknown.. Zi and so that (Zi..l 1 .i.z(kJl) ~I 1 . . a 2 ) where either a 2 = (known) and 01 .15) is consistent. " Ok = 8kO.: nOiO.4. In the binomial oneway layout show that the LR test is asymptotically equivalent to Pearson's  X2 test in the sense that 2log'\  X2 . 15. if the design matrix has rank p.4) is as claimed formula (2.)llmi~i(1 'lri). Xi) are i. Suppose the ::i in Problem 6.5. .3.. E {z(lJ.11..5 construct an exact test (level independent of (31)..3. .. Show that the likelihO<Xt ratio test of H : O} = 010 . . I). .. Y n ) have density as in (6.f. Fisher's Method ofScoring The following algorithm for solving likelihood equations was proosed by Fishersee Rao (1973). • I 1 • • (a)P[Z.11 are obtained as realization of i.'Ir(.. · .". . Show that.. . I < i < k are independent.4. Show that if Wo C WI are nested logistic regression models of dimension q < r < k and mI. (a) Compute the Rao test for H : (32 Problem 6.5 1. 010.
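A minimal numerical sketch of the Fisher-scoring iteration theta_{m+1} = theta_m + I(theta_m)^{-1} Dl(theta_m) may be helpful here (Python with NumPy). The Cauchy location parameter used below, with expected information n/2, is an illustrative choice rather than part of the problem; it is a case in which expected and observed information differ, so scoring and Newton-Raphson are genuinely different iterations.

    import numpy as np

    rng = np.random.default_rng(4)

    # Fisher scoring for a Cauchy location parameter; the sample is simulated and
    # the sample size and true location are arbitrary.
    n, theta_true = 200, 1.0
    x = theta_true + rng.standard_cauchy(n)

    theta = np.median(x)                            # consistent starting value
    for _ in range(50):
        r = x - theta
        score = np.sum(2.0 * r / (1.0 + r ** 2))    # Dl(theta)
        step = score / (n / 2.0)                    # I(theta)^{-1} Dl(theta), I = n/2
        theta = theta + step
        if abs(step) < 1e-10:
            break

    print("Fisher-scoring estimate of the location:", theta)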
(3). and b' and 9 are monotone.Section 6. as (Z.p.5. Show that. Let YI. (d) Gaussian GLM. contains an open L. . give the asymptotic distribution of y'n({3 ..WZ v where Zv = Ilz'jll is the design matrix and W = diag( WI. ~ .9). 1'0) = jy 1'01 2 /<7.. . Hint: Show that if the convex support of the conditional distribution of YI given ZI = zU) contains an open interval about p'j for j = 1..5).= 1 . .) and v(.. Set ~ = g(. .5..)z'j dC. Wn). T. T. P(I"). 00 d" ~ a(3j (b) Show that the Fisher information is Z]. Give 0. In the random design case.). J JrJ 4. your result coincides with (6.. Show that the conditions AQA6 hold for P = P{3o E P (where qo is assumed known).. L.).. 05. h(y. g("i) = zT {3. .) .b(Oi)} p(y.(. Give 0. then the convex support of the conditional distribution of = 1 Aj Yj zU) given Z j = Z (j) .Oi)~h(y. k. Yn be independent responses and suppose the distribution of Yi depends on a covariate vector Zi.5. . <. Yn ) are i.. has the Poisson. ball about "k=l A" zlil in RP . T). . distribution. C(T).. . J .ztkl} is RP (c) P[ZI ~ z(jl] > 0 for all j.7 Problems and Complements 435 (b) The linear span of {ZII). and v(.) ~ 1/11("i)( d~i/ d".T)exp { C(T) where T is known. d". . (a) Show that the likelihood equations are ~ i=I L. . Find the canonical link function and show that when g is the canonical link.. T). 9 = (b')l. ~ N("" <7. C(T). (c) Suppose (Z" Y I ).9).. under appropriate conditions. (Zn... b(9).) and C(T) such that the model for Yi can be written as O. h(y. (y. the deviance is 5. and v(. (e) Suppose that Y.O(z)) where O(z) solves 1/(0) = gI(zT {3). the resuit of (c) coincides with (6.. Suppose Y. Y) and that given Z = z. Wi = w(".) = ~ Var(Y)/c(T) b"(O). . Y follow the model p(y.T).. Show that for the Gaussian linear model with known variance D(y. . Hint: By the chain rule a l(y a(3j' 0) = i3l dO d" a~ . j = 1. Assume that there exist functions h(y.. b(B). b(9).F. Show that when 9 is the canonical link..i.y ..1 = 0 V fJ~ ... k. g(p.d. .
O' 2(p)/0'2 = 2/1r. and R n are the corresponding test statistics.4. Consider the Rao test for H : f} = f}o for the model "P = {P/I : /I E e} and ADA6 hold.p under the sole assumption that E€ = 0.6. Consider the linear model of Example 6. that is.3. Suppose that the ttue P does not belong to"P but if f}(P) is defined by (6..d. ! I I ! I . Note: 2 log An.1.7. the infonnation bound and asymptotic variance of Vri(X 1'). n . then under H.j 436 Inference in the Multiparameter Case Chapter 6 i Problems for Section 6. j. (3) where s(t) is the continuous distribution function of a random variable symmetric about 0. . 2. .1 under this model and verifying the fonnula given.10). (6.15) by verifying the condition of Theorem 6. /10) is estimated by 1(80 ). set) = 1. and Rao tests are still asymptotically equivalent in the sense that if 2 log An. W n . then v( P) . but that if the estimate ~ L:~ dDIllDljT (Xi. then it is. f}o) is used. • I 3. Show that.O' 2 (p») = 1/4f(v(p». Show that the standard Wald test forthe problem of Example 6. . in fact. t E R. W n and R n are computed under the assumption of the Gaussian linear model with a 2 known.s(t). I "j 5. In the hinary data regression model of Section 6. By Problem 5.10) creates a valid level u test.3. Show that the LR. P.. I 4. then the Rao test does not in general have the correct asymptotic level. = 1" (b) Show that if f is N(I'. I . 7.6.6. then 0'2(P) < 0'2.6.6 1.2. Apply Theorem 6.6.14) is a consistent estimate of2 VarpX(l) in Example 6. 0'2).6. . (a) Show that if f is symmetric about 1'. . .2. Show that 0: 2 given in (6. Suppose Xl. I: . i 6.6. Wn Wn + op(l) + op(I). " .Q+1>'" . replacing &2 by 172 in (6. Establish (6.i. hence. l 1 .3) then f}(P) = f}o. the unique median of p.(3p ~ (3o. . then the sample median X satisfies ~ f at I I Vri(X where O' (p) 2 yep») ~ N(0. but if f~(x) = ~ exp Ix 1'1. . if VarpDl(X.3 is as given in (6. Suppose ADA6 are valid. Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation. then O' 2(p) > 0'2 = Varp(X. Wald.2 and the hypothesis (3q+l = (3o. let 1r = s(z.1.6.3 and.).. if P has a positive density v(P). 0 < Vae f < 00.1) ~ .Xn are i.
O)T and deduce that (c) EPE(p) = RSS(p) + ~.1) Yi = J1i + ti. .Zn and evaluate the performance of Yjep) . 1 n. (Model Selection) Consider the classical Gaussian linear model (6.1.Section 6.p don't matter. . where ti are i.lpI)2 be the residual sum of squares. A natural goal to entertain is to obtain new values Yi". Show that if the correct model has Jri given by s as above and {3 = {3o.. i .2 is known. then 13L is not a consistent estimate of f3 0 unless s(t) is the logistic distribution.... Hint: Apply Theorem 6.2 (b) Show that (1 + ~D + .I EL(Y.' i=l . But if 13L is defined as the solution of EZ1s(Zr {30) = Q(/3) where Q({3) = E(Zr A(Zr/3) is p x 1. (b) Suppose that Zi are realizations of U. hence. Gaussian with mean zero J1i = Z T {3 and variance 0'2. 1 n.. i = 1.. = (3p = 0 by the (average) expected prediction error ~ ~ n EPE(p) = n. are indepeqdent of}] 1 ' · ' 1 Yn and ~* is distributed as }'i.(p) the corresponding fitted value..7 Problems and Complements 437 (a) Show that ~ Jr can be written in this form for both the probit and logit models. i = 1. 0.. Zi are ddimensional vectors for covariate (factor) values. the model with 13d+l = . . 1 V. 1973. ..jP)2 where ILl ) = z.d. y~p) and. that Zj is bounded with probability 1 and let ih(X ln )). Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d . Let RSS(p) = 2JY. _..i .1. 8.(b) continue to hold if we assume the GaussMarkov model. /3P+I = '" = /3d ~ O. . ~ ~ (d) Show that (a).. Model selection consists in selecting p to minimize EPE(p) and then using Y(P) as a predictor (Mallows. Zi... for instance).. (a) Show that EPE(p) ~ .. Suppose that . then ~ Vri(rJL QI ({3) Var(ZI (Y1   {3 Ll has a limiting normal distribution with mean 0 and variance A(Zr {3)) )[QI ({3)J where Q({3) = E(Zr A(Zr{30)ZI) is p x p and necessarily nonsingular.{3lp) p and {3(P) ~ (/31.d... be the MLE for the logit model.2 is an unbiased estimate of EPE(P). Zi) : 1 <i< n}. .p. f3p.2.9.lp)2 Here Yi" l ' •• 1 Y. at Zll'" . L:~ 1 (P..9. . Let f3(p) be the LSE under this assumption and Y. where Xln) ~ {(Y. . . .
and (ii) 'T}i = /31 Zil + f32zi2. (b) The result depends only on the mean and covariance structure of the i=l"". 3rd 00.4 (1) R.438 (e) Suppose p ~ Inference in the Multiparameter Case Chapter 6 2 and 11(Z) ~ Ii. 1 "'( i=1 . Y. )I'i). Use 0'2 = 1 and n = 10. 1969.~I'.' .  J .9 REFERENCES Cox. ti..". Note for Section 6. . Hint: (a) Note that ".1 (1) From the L A.n. The moral of the story is that the practicing statisticians should be on their guard! For more on this theme see Section 6. Heart Study after Dixon and Massey (1969).z. i 6.. ~(p) n . if we consider alternatives to H.8 NOTES Note for Section 6. . 1 n n L (I'. . 1 I . D. but it is reasonable.i2} such that the EPE in case (i) is smaller than in case (d) and vice versa. EPE(p) n R SS( p) = . .16). AND F. t12 and {Z/I. i=1 Derive the result for the canonical model. R..4.2 (1) See Problem 3.5.L . which are not multinomial. . W. + Evaluate EPE for (i) ~i ~ "IZ.) . 6. To guard against such situations he argued that the test should be used in a twotailed fashion and that we should reject H both for large and for small values of X2 • Of course. A. The Analysis of Binary Data London: Methuen. DIXON. LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of X2 as well as large ones (Problem 6. 1970. Introduction to Statistical Analysis. . MASSEY. New York: McGrawHill.6. this makes no sense for the model we discussed in this section. Y/" j1~. we might envision the possibility that an overzealous assistant of Mendel "cooked" the data.9 for a discussion of densities with heavy tails. Give values of /31. For instance. Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good. Note for Section 6. ! .
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.

GRAYBILL, F. A., An Introduction to Linear Statistical Models, Vol. I New York: McGraw-Hill, 1961.

HABERMAN, S., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.

HALD, A., Statistical Theory with Engineering Applications New York: Wiley, 1952.

HUBER, P. J., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob., 221–233, Univ. of California Press (1967).

KOENKER, R., AND V. D'OREY, "Computing regression quantiles," J. Roy. Statist. Soc. Ser. C, 36, 383–393 (1987).

LAPLACE, P. S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (Reprinted in Oeuvres Complètes, 11, 475–558, Gauthier-Villars, Paris) (1789).

MALLOWS, C., "Some comments on Cp," Technometrics, 15, 661–675 (1973).

McCULLAGH, P., AND J. NELDER, Generalized Linear Models, 2nd ed. London: Chapman and Hall, 1989.

PORTNOY, S., AND R. KOENKER, "The Gaussian Hare and the Laplacian Tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279–300 (1997).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROBERTSON, T., F. WRIGHT AND R. DYKSTRA, Order Restricted Statistical Inference New York: Wiley, 1988.

SCHEFFÉ, H., The Analysis of Variance New York: Wiley, 1959.

SCHERVISH, M., Theory of Statistics New York: Springer, 1995.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.

TIERNEY, L., R. KASS AND J. KADANE, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425–433 (1989).

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need. The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics. Therefore, we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply. The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that, to be called an experiment, such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize. This
. We shall use the symbols U. The relation between the experiment and the model is given by the correspondence "A occurs if and only if the actual outcome of the experiment is a member of A.1. is to many statistician::. it is called a composite event. we preSUnle the reader to be familiar with elementary set theory and its notation at the level of Chapter I of Feller (1968) or Chapter 1 of Parzen (1960)." Another school of statisticians finds this formulation too restrictive. complementation. the probability model.. A2 . A. c. However. We denote events by A. . de Groot (1970). and Loeve. {w} is called an elementary event..~ n and is typically denoted by w. B. 1992. In this sense. " are pairwise disjoint sets in I • I Recall that Ui I Ai is just the collection of points that are in anyone of the sets Ai and that two sets are disjoint if they have no points in common. is denoted by A.1/ /1. Raiffa and Schlaiffer (\96\). A. A probabiliry distribution or measure is a nonnegative function P on A having the following properties: (i) P(Q) n 1 n . where 11. = 1. If A contains more than one point. and inclusion as is usual in elementary set theory. 1974. then (ii) If AI.l The sample space is the set of all possible outcomes of a random experiment. . • . induding the authors. In this section and throughout the book. For a discussion of this approach and further references the reader may wish to consult Savage (1954).3 Subsets of are called events.. from horse races to genetic experiments.l. Savage (1962). 1 1  . almost any kind of activity involving uncertainty. If wEn. Chung.l. Ii I I I j . 1977). .4 We will let A denote a class of subsets of to which we an assign probabilities. intersections. intersection. and so on or by a description of their members. the relation A C B between sets considered as events means that the occurrence of A implies the occurrence of B. as we shall see subsequently. By interpreting probability as a subjective measure. Lindley (1965). . Its complement. We denote it by n. For example." The set operations we have mentioned have interpretations also. which by definition is a nonempty class of events closed under countable unions. We now turn to the mathematical abstraction of a random experiment. I l 1. whether it is conceptually repeatable or not. For technical mathematical reasons it may not be possible to assign a probability P to every subset of n. and Berger (1985). i ! A. .I I 442 A Review of Basic Probability Theory Appendix A 1 longtefm relative frequency 11 .\ is the number of times the possible outcome A occurs in n repetitions. falls under the vague heading of "random experiment. Grimmett and Stirzaker. and complementation (cf. set theoretic difference. C for union. n. the operational interpretation of the mathematical concept of probability.2 A sample point is any member of 0. A. A random experiment is described mathematically in tenns of the following quantities. they are willing to assign probabilities in any situation involving uncertainty.1. the null set or impossible event. A is always taken to be a sigma field.
. Section 8 Grimmett and Stirzaker (1992) Section 1.P(A). A.2. Sections 13.. In this case.3. c .40 < P(A) < 1. P(B) A. and Stone (1971) Sections 1. Port.) (Bonferroni's inequality).7 P 1 An) = limn~= P(A n ). A. Sections 15 Pitman (1993) Sections 1. c A.Section A. For convenience. (1967) Chapter l.l1. C An .6 If A .3 DISCRETE PROBABILITY MODELS A.1.. A..2.2) n . we can write f! = {WI. A. then P (U::' A.). and P together describe a random experiment mathematically. (A.3 Hoel.4).t A probability model is called discrete if is finite or countably infinite and every subset of f! is assigned a probability.2 Elementary Properties of Probability Models 443 A.A) = PCB) .1 If A c B. 1.S P A.PeA).l. References Gnedenko (1967) Chapter I.2 Parzen (1960) Chapter 1.S The three objects n. when we refer to events we shall automatically exclude those that are not members of A.L~~l peA. P) either as a probability model or identify the model with what it represents as a (random) experiment. References Gnedenko. . (n~~l Ai) > 1.3 A. (U.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS The following are consequences of the definition of P.3 Hoe!.2 and 1.2.3 Panen (1960) Chapter I. J. and Stone (1992) Section 1. That is. ~ A.. A.2.' 1 peA. .68 Grimmett and Stirzaker (1992) Sections 1. Port. by axiom (ii) of (A. W2.3 If A C B.. > PeA).2.. } and A is the collection of subsets of n. Sections 45 Pitman (1993) Section 1. we have for any event A.2.' 1 An) < L. then PCB .3.2 PeN) 1 .3 A.... P(0) ~ O. We shall refer to the triple (0.2.
.3.~ 1 1 ~• 1 .:c Number of elements in A N .:.. . For large N.4. A.3.' . I B).4.=:. which we write PtA I B). guinea pigs. If A" A 2 . is an experiment leading to the model of (A.• P (Q A. (A. . A) which is referred to as the conditional probability measure given B. flowers. .l n' " I"I ". j=l (A. and drawing.. Sections 67 Pitman (1993) Section l.4) . n PiA) = LP(A I Bj)P(Bj ). the identity A = . say N. Transposition of the denominator in (AA. then P( A I B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur. . '< = ~=~:. (A.4)(ii) and (A. selecting at random. Given an event B such that P( B) > 0 and any other event A. i • i: " i:. the function P(.. a random number table or computer can be used. Sections 45 Parzen (1960) Chapter I..3) yield U. (A..2) In fact. I B) = ~ PiA.). etc. and P( A ) ~ j . shaking well. i References Gnedenko (1967) Chapter I.. . From a heuristic point of view P(A I B) is the chance we would assign to the event A if we were told that B has occurred. then . we define the conditional probability of A given B. by l P(A I B) ~ PtA n B) P(B) .3.4 Suppose that WI.(A n B j ).I B) is a probability measure on (fl.3) If B l l B 2 .4.. Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another.=. .444 A Review of Basic Probability Theory Appendix A An important special case arises when n has a finite number of elements. for fixed B as before. Such selection can be carried out if N is small by putting the "names" of the Wi in a hopper.4 CONDITIONAL PROBABILITY AND INDEPENDENCE I j 1 1 ! .3) 1 I A. Then P( {w}) = 1/ N for every wEn. WN are the members of some population (humans.4. ~f 1 .1.:c:.1) If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment. (A. 1 1 PiA n B) = P(B)P(A I B). machines. all of which are equally likely.l) gives the multiplication rule.3). are (pairwise) disjoint events and P(B) > 0.1 • 1 . (A.• 1 B n are (pairwise) disjoint events of positive probability whose union is fl.4.
... ..S) The conditional probability of A given B I defined by .) A) ~ ""_ PIA I B )p(BT L. Section 4.i.. and (A. P(il.3). .···.8) may be written P(A I B) ~ P(A) (AA..8) > 0.i k } of the integers {l..1 ).En such that P(B I n . n··· nA i .• ..) j=l k (AA.. B n ) and for any events A.}.. n Bnd > O. Two events A and B are said to be independent if P(A n B) If P( B) ~ P(A)P(B). (AA... An are said to be independent if P(A i ...A. Ifall theP(A i ) are positive.4.. I. . .) ~ P(A J ) (AA. . B I . Chapter 3. References Gnedenko (1967) Chapter I.)P(B.9) In other words. relation (A. .lO) is equivalent to requiring that P(A J I A.4. A and B are independent if knowledge of B does not affect the probability of A. P(B" I Bl.Section AA Conditional Probability and Independence 445 If P( A) is positive. n B n ) > O. . Simple algebra leads to the multiplication rule.) = II P(A.4..I J (AA.II) for any j and {i" .n}. the relation (AA.lO) for any subset {iI.. . .. Port. . (AA... P(B 1 n·· n B n ) ~ P(B 1 )P(B2 I BJlP(B3 I ill. B ll is written P(A B I •. BnJl (AA.. The events All .. .. B 2 ) ." .S parzen (1960) Chapter 2.. . . Sections IA Pittnan (1993) Section lA ... and Stone (1971) Sections lA.4) and obtain Bayes rule . we can combine (A..i.7) whenever P(B I n .. I PeA I B.I_1 .. Sections 9 Grimmett and Stirzaker (1992) Section IA Hoel.} such thatj ct {i" .
.3) for events Al x . we should have . I n  I . . Loeve. On the other hand. x " . Pn({W n }) foral! Wi E fli. (A.. I P n on (fln> An). x An by 1 j P(A. then Ai corresponds to . it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip..' . wn ) is a sample point in . If P is the probability measure defined on the sigma field A of the compound experiment. 446 A Review of Basic Probability Theory Appendix A A. . x fl 2 x '" x fln)P(fl.. . ) = P(A. x flnl n Ifl) x A 2 x ' . x flnl n ' . . .. This makes sense in the compound experiment. on (fl2.. P([A 1 x fl2 X ...on. . ... The interpretation of the sample space 0 is that (WI.. (A5A) i..'" £n and recording all n outcomes. I ! i w? 1 . To say that £t has had outcome pound event (in . x An of AI.on.5. (A5.6 where examples of compound experiments are given... X . .. If we are given n experiments (probability models) t\. r. An are events. if we toss a coin twice.) . .o n = {(WI..0) given by . These will be discussed in this section. We shall speak of independent experiments £1.. W2 is the outcome of £2 and E Oi corresponds to the Occurrence of the comso on. .o i . X fl n 1 X An). independent.5 COMPOUND EXPERIMENTS There is an intuitive notion of independent experiments. I <i< n. ...1 Recall that if AI. the Cartesian product Al x .5. 1995.wn )}) I: " = PI ({Wi}) . An is by definition {(WI. then the probability of a given chip in the second draw will depend on the outcome of the first dmw. . \ I I .. 1 £n if the n stage compound experiment has its probability structure specified by (A5. x . and if we do not replace the first chip drawn before the second draw. . 1 < i < n}. then intuitively we should have alI classes of events AI. . . .0 1 x··· XOi_I x {wn XOi+1 x··· x. . The (n stage) compound experiment consists in performing component experiments £1.An with Ai E Ai. To be able to talk about independence and dependence of experiments. then (A52) defines P for A) x ". Pn(A n ). A. A2). W n ) : W~ E Ai." X An) = P..0 if and only if WI is the outcome of £1. X A 2 X '" x fl n ). 1974. £n with respective sample spaces OJ. There are certain natural ways of defining sigma fields and probabilities for these experiments. the subsets A ofn to which we can assign probability(1). we introduce the notion of a compound experiment. x An) ~ P(A.53) holds provided that P({(w"". For example.."" . it can be uniquely extended to the sigma field A specified in note (l) at the end of this appendix. x'" . . P. .. 1 1 . P(fl.. (A53) It may be shown (Billingsley. InformaUy. then the sample space . x An. 1977) that if P is defined by (A. that is. . if Ai E Ai. x On in the compound experiment.' .0 ofthe n stage compound experiment is by definition 0 1 x·· . If we want to make the £i independent.0 1 X '" x ... . the sigma field corresponding to £i. the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. . More generally.. In the discrete case (A.3).1 X Ai x 0i+ I x . Chung. .w n ) E o : Wi = wf}.. a compound experiment is one made up of two or more component experiments. .2) 1 i " 1 If we are given probabilities Pi on (fl" AI). The reader not interested in the formalities may skip to Section A.1 .
J.£"_1 has outcome wnd. we refer to such an experiment as a multinomial trial with probabilities PI. any point wEn is an ndimensional vector of S's and F's and.. In the discrete case we know P once we have specified P( {(wI . If o is the sample space of the compound experiment. .6.... /I. . then (A. the following.5. Other examples will appear naturally in what follows...6.1 Suppose that we have an experiment with only two possible outcomes. References Grimmett and Stirzaker ( (992) Sections (.5 Parzen (1960) Chapter 3 A.3) where n ) ( k = n! kl(n . we shall refer to such an experiment as a Bernoulli trial with probability of success p.·.4..SP({(w] . . n The probability structure is determined by these conditional probabilities and conversely. and Stone (1971) Section 1. if an experiment has q possible outcomes WI.6. A.S) . If the experiment is perfonned n times independently.w n )}) = P(£l has outcome wd P(£2 hasoutcomew21 £1 has outcome WI)'" P(£T' has outcome W I £1 has outcome WI. we say we have performed n Bernoulli trials with success probability p.Pq' If fl is the sample space of this experiment and W E fl. . the compound experiment is called n multinomial trials with probabilities PI... ill the discrete case.6 BERNOULLI AND MULTINOMIAL TRIALS.7) we have.wq and P( {Wi}) = Pi.6.u.: n )}) for each (WI. SAMPLING WITH AND WITHOUT REPLACEMENT A. A. i = 1" . .Section A 6 Bernoulli and Multinomial Trials. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). 1.3) is known as the binomial probability.4 More generally. Sampling With and Without Replacement 447 Specifying P when the £1 are dependent is more complicated. By the multiplication rule (A.2) where k(w) is the number of S's appearing in w. then (A. (A.5.k)!' The fonnula (A. . Port.6.: n ) with Wi E 0 1. .. . which we shall denote by 5 (success) and F (failure). If Ak is the event [exactly k S's occur].'" .. Ifweassign P( {5}) = p. .. If we repeat such an experiment n times independently. .6 Hoel.6..· IPq...
6. .(N(I. and Stone (1971) Section 2. . for any outcome a = (Wil"" 1Wi. 10) is known as the hypergeometric probability..1 j . .6. __________________J i .) of the compound experiment. we are sampling with replacement.k q is the event (exactly k l WI 's are observed. n P({a}) ~ (N)n where 1 I (A. Port.448 A Review of Basic Probability Theory Appendix A where k~(w) = number of times Wi appears in the sequence w. then ~ .) A = n! k k k !. .4 Parzen (1960) Chapter 3. and P(Ak) = 0 otherwise. .6.S If we have a finite population of cases = {WI"" WN} and we select cases Wi successively at random n times without replacement. . · · • A. i . When n is finite the tenn.S) as follows. exactly k 2 wz's are observed. If Np of the members of n have a "special" characteristic S and N (1 ~ p) have the opposite characteristic F and A k = (exactly k "special" individuals are obtained in the sample). n . the component experiments are not independent and. and the component experiments are independent and P( {a}) = liNn. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space.6) where the k i are natural numbers adding up to n..1 . . P) independently n times. i .. .1O) J for max(O. . The fonnula (A. (N)n .p)) < k < min( n. .A l P).1 . we shall sometimes refer to the outcome of the compound experiment as a sample of size n from the population given by (n. I' . . exactly kqwq's are observed)..6.n)!' If the case drawn is replaced before the next drawing.0. .7 If we perform an experiment given by (. Np).6.. ·Pq t (A.. with replacement is added to distinguish this situation from that described in (A.. A.6. A.. Sections 14 Pitman (1993) Section 2.. then P( k" . I A. kq!Pt' . 1 . References Gnedeuko (1967) Chapter 2. PtA ) k =( n ) (Np).N (1 . If AkJ.7 PROBABILITIES ON EUCLIDEAN SPACE Random experiments whose outcomes are real numbers playa central role in theory and practice.P))nk k (N)n = (A6. = N! (N .9) .k. . Sectiou 11 Hoel. . .
(ak. where ( )' denotes transpose. any nonnegative function pon R k vanishing except on a sequence {Xl.7... (A7.S) for some density function P and all events A...7. We will write R for R1 and f3 for f31.7. Recall that the integral on the right of (A.. Riemann integrals are adequate. That is. A.bk ) are k open intervals.8 are usually called absolutely continuous.5) A. is defined to be the smallest sigma field having all open k rectangles as members. . A. we shall call the set (aJ. xd'. .7.. P(A) is the volume of the "cylinder" with base A and height p(x) at x. . which we denote by f3k.7.1If (al.Xk) :ai <Xi <bi.bd. . } of vectors and that satisfies L:~ I P(Xi) = 1 defines a unique discrete probability distribution by the relation P(A) = L x.Xn .7.. x A. for practical purposes.7 Probabilities on Euclidean Space 449 We shall use the notation R k of k~dimensional Euclidean space and denote members of Rk by symbols such as x or (Xl.. A..Section A. . .7 A continuous probability distn'bution on Rk is a probability P that is defined by the relation P(A) = L p(x)d x =1 (A7. It may be shown that a function P so defined satisfies (AlA).EA pix.7.9) . ..S) is by definition r JR' 1A(X)P(x)dx where 1A(x) ~ 1 if x E A. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely.). . and 0 otherwise. Integrals should be interpreted in the sense of Lebesgue. 1 <i <k}anopenkrectallgle.2 The Borelfield in R k .. . However. only an Xi can occur as an outcome of the experiment.7. Geometrically. } are equivalent. P defined by A.3 A discrete (probability) distribution on R k is a probability measure P such that L:~ I P( {Xi}) = 1 for some sequence of points {xd in R k .I) because the study of this model and that of the model that has = {Xl. Thefrequency function p of a discrete distribution is defined on Rk by n pix) = P({x»).b k ) = {(XI""...7.6 A nonnegative function p on R k • which is integrable and which has r p(x)dx = 1.3.bd x '" (ak. An important special case of (A. X n .S) is given by (A. . (A7A) Conversely. This definition is consistent with (A. JR' where dx denotes dX1 . Any subset of R k we might conceivably be interested in turns out to be a member of f3k. dx n • is called a density function..
450 A Review of Basic Probability Theory Appendix A It turns out that a continuous probability distribution determines the density that generates it "uniquely. and Stone (1971) Sections 3..xo + h]) '" 2hp(xo) and P([ h Xl 1 + h]) + Xl p(xo) hi) '" ( ).7.7. ..12) The dJ.1. if p is a continuous density on R. then P = Q.]).2.x. For instance."(l) Although in a continuous model P( {x}) = 0 for every x. When k = 1..:0 and Xl are in R. Xo P([xo .11 The distribution function (dJ.2 parzen (1960) Chapter 4.. F is a function of a real variable characterized by the following properties: > I . the density function has an operational interpretation close to thal of the frequency function.1'. be thought of as measuring approximately how much more Or less likely we are to obtain an outcome in a neighborhood of XQ then one in a neighborhood of Xl_ A.16) I It may be shown that any function F satisfying (A. F is continuous at x if and only if P( {x}) (A.h.7. • J . .f. 22 Hoel.7. Sections 14.16) defines a unique P on the real line. and h is close to 0. (A.1 and 4.13HA.15) I.4.J x .7.7.7.7.7 Pitman (1993) Sections 3. r . limx~oo F(x) limx~_oo =1 F(x) = O.13) x <y =? F(x) < F(y) (Monotone) F(x) (Continuous from the right) (A. We always have F(x)F(xO)(2) =P({x}).1.7.lO) The ratio p(xo)jp(xl) can. . Sections 21. 3.14) . POlt. Thus. 5. defines P in the sense that if P and Q are two probabilities with the same d. 4. ! I .h. (A. x. .5 i • .17) = O.7. x (00. P Xl (A. References Gnedenko (1967) Chapter 4. x n j X =? F(x n ) ~ (A. x. (A. thus. then by the mean value theorem paxo .) = P( ( 00. 5.) F is defined by F(Xl' ..
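The relation between a density and its distribution function is easy to check numerically. The sketch below (Python; it is not part of the text, and the exponential-type density used is just a convenient illustrative choice) approximates F from a density p by numerical integration and verifies the qualitative properties of a d.f.: total mass one, monotonicity, and the limit 0 on the far left.

```python
import numpy as np

# Illustrative density on R: p(x) = 2*exp(-2x) for x > 0, and 0 otherwise (an arbitrary choice).
def p(x):
    return np.where(x > 0, 2.0 * np.exp(-2.0 * x), 0.0)

# F(x) = integral of p over (-infinity, x], approximated on a fine grid.
grid = np.linspace(-2.0, 6.0, 8001)
dx = grid[1] - grid[0]
F = np.cumsum(p(grid)) * dx

print("total mass ~", F[-1])                       # close to 1: a density integrates to 1
print("nondecreasing:", np.all(np.diff(F) >= 0))   # F is monotone (A.7.13)
print("F near -infinity ~", F[0])                  # close to 0

# P((a, b]) = F(b) - F(a); compare with the exact value exp(-2a) - exp(-2b) for this density.
a, b = 0.5, 1.5
approx = np.interp(b, grid, F) - np.interp(a, grid, F)
exact = np.exp(-2 * a) - np.exp(-2 * b)
print("P((a,b]) approx:", round(approx, 4), " exact:", round(exact, 4))
```

The same recipe applies to any density: once F is tabulated, probabilities of intervals are obtained from differences of F, which is how the d.f. determines P.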
or equivalently a function from to Rk such that the set {w . dJ. Similarly.8 Random VariOlbles and Vectors: Transformations 451 A. X(w) E: B} ~ X. dj. such that(2) gl(B) = {y E: Rk : g(y) E: .. Letg be any function from Rk to Rm. the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred.8. When we are interested in particular random variables or vectors. these quantities will correspond to random variables and vectors. The study of real.1 A random variable X is a function from Oto Rsuch that the set {w: X(w) E B} X. density. we will refer to the frequency Junction. The subscript X or X will be used for densities. in fact. 13k . .(1) = A. the concentration of a certain pollutant in the atmosphere.1 (H) is in A for every B E BkJI) For k = 1 random vectors are just random variables.5) p(x)dx .a RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS Although sample spaces can be very diverse.8. The event Xl( B) will usually be written [X E B] and P([X E BJ) will be written PIX E: H].8) PIX E: A] LP(X). In the discrete case this means we need only know the frequency function and in the continuous case the density. . ifXisdiscrete xEA L (A.7. the time to breakdown and length of repair time for a randomly chosen machine. A.Section A.7. we measure the weight of pigs drawn at random from a population. Thus. by definition.3) A.8. m > 1. we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined. from (A.S. In the probability model. Forexample.Xk)T is ktuple of random variables.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete.8. the yield per acre of a field of wheat in a given year. referring to those features of its probability distribution. and so on of a random vectOr when we are. Px ) given by n Px(B) = PIX E BI· (A. The probability distribution of a random vector X is. The probability of any event that is expressible purely in tenns of X can be calculated if we know only the probability distribution of X.5) and (A. k. the probability measure Px in the model (R k .or vectorvalued functions of a random vectOr X is central in the theory of probability and of statistics. Here is the formal definition of such transformations. and so on to indicate which vector or variable they correspond to unless the reference is clear from the context in which case they will be omitted.1 (B) is in 0 fnr every BE B. if X is continuous.2 A random vector X = (Xl •. and so on.'s.
8. y).8.y)dy. then the frequency function of X.1 1 Xi = X and 92(X) = k. a # 0.y)(x. then g(X) is discrete and has frequency function Pg(X)(t) = L {x:g(x)=t} Px(x). y)T is continuous with density p(X.7) If X is discrete with frequency function Px. then 1 (I . Another common example is g(X) = (min{X.y).8. max{X.7. .X)2. This is called the change of variable formula.Y).S.8. These notions generalize to the case Z = (X. If g(X) ~ aX + 1'.12) by summing or integrating out over yin P(X. it may be shown (as a consequence of (A. and X is continuous.6) An example of a transformation often used in statistics is g (91. V).8.8.8) Suppose that X is continuous with density PX and 9 is realvalued and onetoone(3) on an open set S such that P[X E 5] = 1.)'.8. The probability distribution of g(X) is completely detennined by that of X through L:: P[g(X) E BI = PIX E gI(B)].8.11) Similarly. known as the marginal frequency function.1O) From (A. .12) 1 . (A.Y). • II ! I • .8)) that X is a marginal density function given by px(x) ~ 1: P(X.11) and (A.92/ with 91 (X) = k. assume that the derivative l of 9 exists and does not vanish on S.Y) (x. Furthennore.9) t E g(S). y (A.8) it follows that if (X. is given by(4) i I ( PX(X) = LP(X. (A.7) and (A.jPX a (A. and 0 otherwise. i .1 E~' l(X i .).: for every B E Bill. The (marginal) frequency or density of X is found as in (A. . Pg(X) (I) = j. a random vector obtained by putting two random vectors together. . .1') . . (A.y)(x. Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa.S. if (X.(5) (A. . Then the random tran~form(lti(m g( X) is defined by g(X)(w) = g(X(w)). ! for Pg(x)(I) ~ PX(gl(t)) Ig'(g 1(1))1 (A. Then g(X) is continuous with density given by . Yf is a discrete random vector with frequency function p(X.452 A Review of BClsic ProbClbility Theory Appendix A [J} E BI.8. j .
6. A.. Suppose X (Xl. (X"XIX. it is common in statistics to work with continuous distributions. ..13 A convention: We shan write X = Y if the probability of the event IX fe YI is O. .1.. References Gnedenko (1967) Chapter 4.9. and Stone (1971) Sections 3.9. .An in S.9...4 Parzen (1960) Chapter 7. A. .9 Independence of Random Variables and Vectors 453 In practice.4) (A.4 A. E BI are independent.1 Two random variables X I and X 2 are said to be independent if and only if for sets A and B in E.) and Y" and so on.. .3.3 By (A.9.1. . . To generalize these definitions to random vectors Xl.Xn (not necessarily of the same dimensionality) we need only use the events [Xi E Ail where Ai is a set in the range of Xi' A.9 INDEPENDENCE OF RANDOM VARIABLES AND VECTORS A. 5.9. For example. Nevertheless. The justification for this may be theoretical or pragmatic.. .Section A. which may be easier to deal with.9.9. Sections 2124 Grimmett and Stirzaker (1992) Section 4. A. " X n ) is continuous. . S. 6.7 The preceding equivalences are valid for random vectors XlJ .. X 2 ) and (Yt> Y2 ) are independent. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical modeL Or else.l5 and B. either ofthe (A. Theorem A.5) = following two conditions hold: A. the events [X I E All. Sections 15.[Xn E An] are independent. so are Xl + X2 and YI Y2 . all random variables are discrete because there is no instrument that can measure with perfect accuracy.2. X n with X (XI>"" Xn)' . if (Xl. then X = (Xl. .. " X n ) is either a discrete or continuous random vector. the events [XI E AI and IX.Xn are said to be (mutually) independent if and only if for any sets AI. Then the random variables Xl. whatever be g and h.. so are g(X) and h(Y).7 Hoel.7.S.9. .2 The random variables X I. Port. if X and Yare independent. the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A. ' .6IftheXi are all continuous and independent...S.7). 1 X n are independent if. and only if... 9 Pitman (1993) Section 4.
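The factorization characterizing independence can be illustrated by simulation. In the sketch below (Python; it is not part of the text, and the particular distributions and events are arbitrary illustrative choices), X and Y are generated independently, so the empirical frequency of [X ∈ A, Y ∈ B] matches the product of the marginal frequencies, while for a dependent pair such as (X, X + Y) the product rule visibly fails.

```python
import numpy as np

# Simulation check of P(X in A, Y in B) = P(X in A) P(Y in B) for independent X, Y.
# The distributions and events below are arbitrary illustrative choices.
rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=n)            # X ~ N(0, 1)
Y = rng.uniform(size=n)           # Y ~ U(0, 1), generated independently of X

A = X > 1.0                       # the event [X in A]
B = Y < 0.3                       # the event [Y in B]
print("independent pair  joint:", np.mean(A & B), " product:", np.mean(A) * np.mean(B))

# A dependent pair: Z = X + Y is a function of both coordinates, so (X, Z) are not independent.
Z = X + Y
C = Z > 1.0
print("dependent pair    joint:", np.mean(A & C), " product:", np.mean(A) * np.mean(C))
```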
W.. . X2.. I.X.EA XiPX (Xi) < .• }.I0 THE EXPECTATION OF A RANDOM VARIABLE r: r I Let X be the height of an individual sampled at random from a finite population. If XI . then the Xi form a sample from a distribution that assigns probability p to 1 and (1. } into two sets A and B where A consists of aU nonnegative Xi and B of all negative Xi· If either 'L. Take • . i==l (A. .9. If A is any event.. .2.. If X is a nonnegative.. . by 00 1 i . discrete random variable with possible values {Xl. A..454 A Review of Basic Probability Theory Appendix A A. written E(X).I) . X n are independent identically distributed kdimensional random vectors with dJ.p) to 0. it follows that this average is given by 'L. .4 Par""n (1960) Chapter 7.' (Infinity is a possible value of E(X).I0. Then a reasonable measure of the center of the distribution of X is the average height of an individual in the given population. decompose {Xl l X2. "" 1 X q are the only heights present in the population. we define